Category: Single Molecule

  • Nabsys Genome Mapping Technology Launches at ASHG 2023


    Introduction to genetic structural variation

    It is an exciting time to be involved in genetics and its application to healthcare. It was only a little over two decades ago that the first draft of the Human Genome Project was published, and just last year the first complete telomere-to-telomere human genome sequence was achieved. And the impact of next-generation sequencing is being seen in increasingly valuable clinical applications: as a companion diagnostic for targeted cancer therapy, as a method for non-invasive prenatal testing for trisomy, and for rare disease diagnostics. Yet there are still so many big problems that remain unsolved.

    One of the larger problems in genetics (and by association, in its application and impact on healthcare) is the detection and characterization of structural variation. A single gene can be damaged via a multitude of mechanisms (such as non-homologous recombination), and this is a different kind of variation from the Single Nucleotide Polymorphisms (SNPs) that genotyping microarrays measure, or the insertion / deletion mutations (called indels, where a single base to several dozen bases can be inserted or deleted) that sequencing can also detect.

    The size of these insertions and deletions, however, can exceed the resolving power of next-generation sequencing, where read lengths are limited to 150 to 300 bases. Insertions and deletions can be kilobases or even hundreds of kilobases long, and these are invisible to standard NGS analyses.

    In a given individual's whole-genome sequence, some 4 to 5 million SNPs and indels will be detected. The structural rearrangements (from 50 bases of inserted or deleted nucleotides up to several million bases or even entire chromosome arms) go largely undetected. For clinical cases, a pathology cytogenetics laboratory routinely uses techniques such as Fluorescent In-Situ Hybridization (known by its acronym FISH), karyotyping and microarrays (typically aCGH, or array Comparative Genomic Hybridization) to detect structural rearrangements and specific gene fusions for diagnosing and appropriately guiding the treatment of cancer.

    Figure 1 below (kindly provided by Nabsys) compares conventional next-generation Sequencing by Synthesis (SBS) to genome mapping.

    Figure 1: Sequencing by Synthesis (typical NGS method) compared to genome mapping. Image kindly provided by Nabsys.

    There are an estimated 20,000-plus structural variants in a single human genome, yet with current sequencing technology (including single-molecule sequencing from manufacturers such as Pacific Biosciences or Oxford Nanopore Technologies), large swaths of genome sequence can be rearranged and still go undetected.

    For example, say there is a balanced structural variant, where a large multi-megabase region is inverted. It is called balanced because there is no gain or loss of DNA sequence; however, a stretch of several megabases sits in the completely opposite orientation. Even with single-molecule reads advancing to tens or even hundreds of kilobases in length, detecting all the different kinds of variation across a wide range of sizes and complexity remains a challenge.

    Mapping versus long read sequencing

    One definite trend over the past few years has been a consistent increase in the throughput of short-read sequencing, with similar throughput increases in long-read sequencing as well. However, on a cost-per-gigabase basis, long-read sequencing remains 5-fold to 10-fold more expensive, severely limiting its clinical applicability.

    Genome mapping using an optical method has been on the market for several years from Bionano Genomics, and is accepted as a complement to whole-genome or whole-exome sequencing for understanding the nature of structural variants and disease. Nabsys now offers better resolution of variants at lower cost, detecting SVs as small as 300 base pairs while electronically mapping genome segments more than 100 kb long.

    Nabsys OhmX™ technology

    For a Nabsys run, high-molecular-weight genomic DNA (50 kb to 500 kb) is first nicked using sequence-specific nickase enzymes, which can be used alone or in combination, then labeled and coated with a protein called RecA (the RecA protein serves to stiffen the DNA for analysis). The samples are injected into the instrument, and the data is collected.

    Single DNA molecules are translocated through a silicon nanochannel, and the labeled locations are electronically detected to determine the distance between sequence-specific tags on individual molecules. Each electronic event is recorded in time as the linear DNA molecule passes through; a time-to-distance conversion places the tags along the molecule, and with enough overlap across the genome, what is effectively a restriction map of overlapping fragments can be assembled (see Figure 2).

    Figure 2: Individual molecules labeled with sequence specific labels, measured in a Nabsys OhmX Analyzer using a Nabsys OhmX-8 nanochannel device, and assembled into a Genome Map. Drawing courtesy of Nabsys.
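    To make the time-to-distance conversion and overlap idea above a bit more concrete, here is a minimal sketch in Python. This is not Nabsys' actual pipeline: the constant-velocity assumption, the matching tolerance, and every name here are illustrative only.

    ```python
    # Toy illustration (not Nabsys' algorithm): convert tag detection times on one
    # molecule into inter-tag distances, assuming a roughly constant translocation
    # velocity, then count how well two molecules' distance patterns agree.
    from typing import List

    def times_to_distances(tag_times_s: List[float], velocity_bp_per_s: float) -> List[float]:
        """Convert tag detection timestamps (seconds) into inter-tag distances (bp)."""
        return [(t2 - t1) * velocity_bp_per_s for t1, t2 in zip(tag_times_s, tag_times_s[1:])]

    def overlap_score(pattern_a: List[float], pattern_b: List[float], tol: float = 0.1) -> int:
        """Count matching inter-tag distances (within a relative tolerance) over all
        offsets of pattern_b against pattern_a -- a stand-in for overlap detection."""
        best = 0
        for offset in range(-(len(pattern_b) - 1), len(pattern_a)):
            matches = 0
            for i, d_b in enumerate(pattern_b):
                j = offset + i
                if 0 <= j < len(pattern_a) and abs(pattern_a[j] - d_b) <= tol * max(pattern_a[j], d_b):
                    matches += 1
            best = max(best, matches)
        return best

    # Two molecules that share part of the same nick-site pattern:
    mol1 = times_to_distances([0.00, 0.10, 0.25, 0.45, 0.50], velocity_bp_per_s=100_000)
    mol2 = times_to_distances([0.00, 0.15, 0.35, 0.40, 0.55], velocity_bp_per_s=100_000)
    print(mol1, mol2, overlap_score(mol1, mol2))  # 3 of mol2's 4 distances match mol1
    ```

    In practice translocation velocity is not perfectly constant, and real map assembly uses far more sophisticated alignment and error models; this toy version only shows the idea of matching inter-tag distance patterns between molecules.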

    This capability was showcased a few years ago for microbial genomes, and a few publications [1, 2, 3] demonstrate the approach of analyzing DNA maps this way at single-molecule resolution in bacterial genomes.

    With the recent commercial release of the Nabsys OhmX Analyzer system and OhmX-8 Detector consumables, a 10-fold increase in throughput has been achieved, combined with 250 electronic detectors per channel. Nabsys provides a kit for efficient high-molecular-weight DNA extraction and labeling in preparation for loading onto the system. (The sample input requirement is 5 µg of starting material, sufficient for several instrument runs if necessary; less input can be used if DNA quantities are limited.) In addition, because there are no optics (only fluidics and electronics), the Nabsys instrument is much more compact and less expensive than the equivalent optical instrument, as well as less expensive to run.

    Applications for human disease: cancer and rare disease

    Cancer has been correctly described as a 'disease of the genome', and understanding the role structural variation plays in cancer progression and treatment is an ongoing area of important research. Another important application of genome mapping is rare disease; currently it is estimated that about 70% of suspected Mendelian disorders go undiagnosed even with current short-read whole-genome sequencing [4].

    It remains to be seen whether better detection and characterization of structural variation can provide the needed insights into these two important research areas, which are currently limited by the cost of existing technology.

    Nabsys at ASHG 2023

    At the upcoming American Society of Human Genetics conference in Washington, DC (November 2 – 5, 2023), Nabsys will be present in the Hitachi High-Tech America booth (#1423). Hitachi will present their Human Chromosome Explorer bioinformatics pipeline as a low-cost, scalable structural variation validation and discovery platform.

    You can find out more about the Nabsys OhmX Analyzer here (a downloadable brochure is available on that page), and more information about the overall approach to electronic genome mapping is here. A handy whitepaper about EGM can be found here (PDF).

    1. Passera A and Casati P et al. Characterization of Lysinibacillus fusiformis strain S4C11: In vitro, in planta, and in silico analyses reveal a plant-beneficial microbe. Microbiol Res. (2021) 244:126665. doi:10.1016/j.micres.2020.126665
    2. Weigand MR and Tondella ML et al. Screening and Genomic Characterization of Filamentous Hemagglutinin-Deficient Bordetella pertussis. Infect Immun. (2018) 86(4):e00869-17.  doi:10.1128/IAI.00869-17
    3. Abrahams JS and Preston A et al. Towards comprehensive understanding of bacterial genetic diversity: large-scale amplifications in Bordetella pertussis and Mycobacterium tuberculosis. Microb Genom. (2022) 8(2):000761. doi:10.1099/mgen.0.000761
    4. Rehm HL. Evolving health care through personal genomics. Nat Rev Genet. (2017) 18(4):259-267. doi:10.1038/nrg.2016.162
  • The Unmet Needs of Next-Generation Sequencing (NGS)


    There are plenty of unmet needs in the current iteration of NGS, not the least of which is the effort involved in generating large amounts of sequence data.

    A short list

    The current NGS market is estimated to be about $7,000,000,000 (that's $7 Billion), which is remarkable for a market that started only in 2005 with the advent of the 454 / Roche GS20 (now discontinued). After the market leader Illumina, there are alternative sequencing platforms such as Ion Torrent / Thermo Fisher Scientific, newcomers Element Biosciences, Singular Genomics and Ultima Genomics, and single-molecule companies Pacific Biosciences (also known as PacBio) and Oxford Nanopore Technologies (also known as ONT).

    With major revenues coming from cancer testing (Illumina estimates NGS at $1.5 Billion of the oncology testing market), genetic disease testing ($800 Million) and reproductive health ($700 Million), NGS is well-entrenched in these routine assay fields.

    Yet through a different lens, that $1.5B oncology testing figure is out of a total $78B cancer testing market, or about 2% penetration of the cancer testing landscape. Similarly, in both genetic disease (a $10B market) and reproductive health (a $9B market), NGS has penetrated only about 8% of each market. So there is plenty of room to grow.

    Yet what is holding NGS back from broader clinical adoption? Here I propose a (relatively short) list of unmet needs, which serve as barriers to adoption. To look at it another way, this is a list that current NGS providers (as well as new NGS providers) would do well to improve upon.

    The list is:

    • PCR bias
    • Sample input amount
    • Library preparation workflow
    • High cost of instrumentation and reagents
    • Sequencing run-times

    We'll address each of these in order, and comment on how single-molecule sequencing (aka 'third-generation sequencing', though that term really isn't used much any more) has addressed each issue (or not, as the case may be).

    PCR Bias

    PCR bias is something that people doing routine NGS may not think about, but for those doing whole-genome assemblies or otherwise needing sequence data from regions that are particularly G-C or A-T rich, this is a Big Problem. Because PCR works on the basis of short oligonucleotides hybridizing under a set of temperature, salt and cation concentrations, the melting temperature (Tm) of the short DNA primers is really important. PCR also depends upon the strands denaturing and re-annealing at certain temperatures, and through all of this the G-C percentage of the sequence being amplified has a very strong influence on the efficiency of the reaction.
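    As a concrete (if simplified) illustration of why G-C content drives primer Tm, here is a short Python sketch using two standard textbook approximations. These formulas are my addition for illustration; real primer design software uses nearest-neighbor thermodynamics with salt corrections.

    ```python
    # Two rough Tm estimates: the Wallace rule (very short oligos) and a GC-content
    # formula for longer oligos. Both are standard approximations, shown here only
    # to illustrate how strongly G-C content influences melting temperature.
    def tm_wallace(primer: str) -> float:
        """Wallace rule, roughly valid for oligos under ~14 nt: 2*(A+T) + 4*(G+C)."""
        p = primer.upper()
        return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

    def tm_gc(primer: str) -> float:
        """GC-content approximation for longer oligos: 64.9 + 41*(G+C - 16.4)/N."""
        p = primer.upper()
        gc = p.count("G") + p.count("C")
        return 64.9 + 41.0 * (gc - 16.4) / len(p)

    short = "ACGTACGTACGT"  # 12-mer, Wallace-rule territory
    print(short, tm_wallace(short), "C (Wallace rule)")
    for primer in ("ATATATATATATATATATAT", "GCGCGCGCGCGCGCGCGCGC"):
        # Same length, very different Tm: ~31 C for the A-T 20-mer vs ~72 C for the G-C one.
        print(primer, round(tm_gc(primer), 1), "C (GC formula)")
    ```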

    This is a complex topic, where research papers dwell on a computer scientist's love of sequence windows, normalized read coverage and statistical equations; suffice it to say that bias can go in all kinds of directions, and there's a lot of variability in several dimensions. (See the figure below – both coverage plots are from the same species of bacteria, just different strains: S. aureus USA300 and S. aureus MRSA252. One shows falling coverage as a function of G-C content, and the other a rising coverage plot!)

    Figure 1 from Chen YC and Hwang CC et al. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS One. (2013) 8(4):e62856.

    A nagging question, though – what do these coverage plots imply for all the missing data? A given genomic region could have all kinds of G-C content variation, making it very resistant to study. Single-molecule sequencing has largely solved this G-C bias question (link to 2013 publication), but PCR bias is still something to consider.
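    For anyone curious what those 'sequence windows and normalized read coverage' analyses boil down to, here is a minimal, hypothetical sketch: bin the genome into fixed-size windows, compute each window's G-C fraction, and compare normalized coverage across GC bins. The data and function names are toy examples of mine, not the paper's code.

    ```python
    # Toy GC-bias diagnostic: group fixed-size windows by G-C fraction and report
    # mean coverage normalized to the genome-wide average (1.0 = no apparent bias).
    from collections import defaultdict

    def gc_fraction(seq: str) -> float:
        s = seq.upper()
        return (s.count("G") + s.count("C")) / max(len(s), 1)

    def coverage_by_gc(windows, bin_width=0.05):
        """windows: list of (window_sequence, read_count) tuples."""
        mean_count = sum(count for _, count in windows) / len(windows)
        bins = defaultdict(list)
        for seq, count in windows:
            gc_bin = round(gc_fraction(seq) // bin_width * bin_width, 2)
            bins[gc_bin].append(count / mean_count)  # normalize to the genome-wide mean
        return {gc: sum(vals) / len(vals) for gc, vals in sorted(bins.items())}

    # Toy data: an A-T rich window with depressed coverage relative to the mean.
    toy_windows = [("ATATATATAT", 40), ("ATGCATGCAT", 100), ("GCGCGCGCGC", 160)]
    print(coverage_by_gc(toy_windows))  # {0.0: 0.4, 0.5: 1.0, 1.0: 1.6}
    ```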

    Sample input amount

    Back in the early days of the Ion Torrent PGM (around 2011 to 2012), a key selling point for the Personal Genome Machine was the ability of the AmpliSeq technology to use only 10 ng of FFPE DNA as input material, an amount the Illumina hybridization-based enrichment methods could not touch. (Their approach at that time required inputs in the hundreds of nanograms.)

    As many of you are aware, FFPE tissues are in limited supply; the fixation and embedding process damages nucleic acids (typically fragmenting them to about 300 bp in length); and in the current Standard of Care, the FFPE process is firmly embedded in the surgeon-to-pathologist workflow in the hospital environment.

    Another instance of limited input is cell-free DNA analysis, where the amount of cfDNA from a given 10 mL blood draw ranges from 10 ng to 40 ng. (This varies by individual, healthy versus diseased, inflamed or normal, along with a host of other variables.) And from this limited amount of DNA, companies like GRAIL and Guardant are detecting down to one part in a thousand, or 0.1% minor allele fraction (MAF). They use plenty of tricks and techniques to generate usable NGS libraries.
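    A quick back-of-the-envelope calculation (mine, not from any company's documentation) shows why 0.1% MAF detection from cfDNA is so demanding: a haploid human genome weighs roughly 3.3 pg, so a 10 ng cfDNA input contains only a few thousand copies of any given locus.

    ```python
    # Rough arithmetic: genome equivalents in a cfDNA input, and how few molecules
    # actually carry a 0.1% minor-allele-fraction variant. Values are approximate.
    HAPLOID_GENOME_PG = 3.3  # approximate mass of one haploid human genome, in picograms

    def genome_equivalents(input_ng: float) -> float:
        return input_ng * 1000.0 / HAPLOID_GENOME_PG  # ng -> pg, then pg per genome copy

    for ng in (10, 40):
        copies = genome_equivalents(ng)
        mutant = copies * 0.001  # expected molecules carrying a 0.1% MAF variant
        print(f"{ng} ng cfDNA ~ {copies:,.0f} genome copies; ~{mutant:.0f} mutant molecules at 0.1% MAF")
    ```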

    Yet with a system that did not use PCR for its library preparation (Helicos, covered here), one limitation of using 1 – 9 ng of input material (and simply adding a poly-A tail to the DNA) was that barcodes could not be added. Thus a key design feature of the Helicos instrument was a flowcell with 50 individual lanes, one per sample. It makes me think they could perhaps have been a bit more imaginative with the sample preparation and somehow added a sample barcode before polyadenylation to enable sample multiplexing.

    Library preparation workflow

    Ask anyone who does routine NGS library preparation, with or without the aid of liquid-handling automation, and they will tell you it's work. You purify DNA, you do an enrichment step (whether multiplex PCR like AmpliSeq, QIAseq or Pillar SLIMamp, or a hybrid capture step from NimbleGen / Roche or Agilent SureSelect), you clean up those reactions, you ligate adapters, you purify it again, you set up a short-cycle PCR to add sample indexes, you clean it up again, you quantitate via fluorometry or qPCR.

    And then do some calculations to normalize molarity.
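    For the curious, those normalization calculations are not complicated; here is a minimal sketch with illustrative numbers (the concentration, fragment size and target molarity below are hypothetical, not any kit's official values).

    ```python
    # Library molarity from a fluorometric concentration and average fragment size,
    # plus the dilution needed to hit a target loading concentration.
    def library_molarity_nM(conc_ng_per_ul: float, avg_fragment_bp: float) -> float:
        """nM = concentration (ng/uL) / (660 g/mol per bp * fragment length) * 1e6."""
        return conc_ng_per_ul / (660.0 * avg_fragment_bp) * 1.0e6

    def dilution_volumes(current_nM: float, target_nM: float, final_ul: float):
        """Volumes of library and diluent to reach target_nM in final_ul total."""
        library_ul = target_nM / current_nM * final_ul
        return library_ul, final_ul - library_ul

    stock_nM = library_molarity_nM(conc_ng_per_ul=2.5, avg_fragment_bp=400)   # ~9.5 nM
    lib_ul, buffer_ul = dilution_volumes(stock_nM, target_nM=4.0, final_ul=20.0)
    print(f"{stock_nM:.1f} nM stock; mix {lib_ul:.1f} uL library + {buffer_ul:.1f} uL diluent")
    ```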

    After loading the NGS instrument you wait for cluster generation / the emulsion reaction with beads, and then you sequence. You are glad you are not in the early 2010s with separate instruments for these processes, but this all still takes time.

    And there is still a danger of overloading the NGS instrument. This danger is done away with in single-molecule sequencing (a single pore or Zero Mode Waveguide can only handle one molecule, so higher concentrations of molecules don't matter).

    High cost of instrumentation and reagents

    A NovaSeq 6000 is almost a cool $1,000,000. A single run at maximum capacity on the NovaSeq is $30,000. This gets the cost-per-Gigabase down to $5.

    The newest NovaSeq X from Illumina is even more – an eye-watering $1,250,000. This gets the cost/GB down to $3, with higher-throughput flowcells later in 2023 expected to lower it further to about $2/GB.

    For single-molecule sequencing, instrumentation costs are still high. A PacBio Sequel II is $525K, the new PacBio Revio about $750K. One exception is the ONT PromethION at $310K. However, for all these instruments the cost per GB is very high – the Sequel's per-GB cost is around $45 (a good 50% more than the per-GB cost on an Illumina NextSeq, though you are getting long reads with the Sequel). The Revio is a lot better at $20/GB, so better than NextSeq short reads but still 4x what a NovaSeq 6000 costs per GB.

    ONT is an attractive $10/GB, still 2x the NovaSeq though.

    In many ways single molecule sequencing is far superior to short-read NGS for clinical WGS. (See my friend Brian Krueger’s LinkedIn post – and poll – here and some great insights about the value of clinical WGS in an older post here.) The main barrier to wider clinical use of WGS is cost, and with a need for at least 15x if not 30x genome coverage, that’s 50GB to 100GB of sequence data. Put another way, currently on ONT the cost to generate WGS data is $500 to $1,000, while on the PacBio Revio it is $1,000 to $2,000, which is still too expensive (even though it has broken through the ‘$1000 Genome’ barrier).
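    Tying the per-GB prices above to those per-genome numbers is simple arithmetic; here is a small sketch using the post's rough figures (genome size and prices are approximate).

    ```python
    # Per-genome sequencing cost = $/GB * genome size (Gb) * coverage depth.
    GENOME_GB = 3.1  # approximate human genome size in gigabases

    def cost_per_genome(dollars_per_gb: float, coverage: int) -> float:
        return dollars_per_gb * GENOME_GB * coverage

    approx_price_per_gb = {"NovaSeq 6000": 5, "NovaSeq X": 3, "ONT": 10, "PacBio Revio": 20}
    for platform, price in approx_price_per_gb.items():
        print(f"{platform}: 15x ~ ${cost_per_genome(price, 15):,.0f}, 30x ~ ${cost_per_genome(price, 30):,.0f}")
    ```

    These round numbers land right on the $500 to $1,000 (ONT) and $1,000 to $2,000 (Revio) ranges quoted above.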

    Sequencing run-times

    Lastly, NGS takes a long time to run. A MiSeq running 2×150 paired-end (PE) reads takes over 48 hours; a NextSeq run takes over a day; and a NovaSeq run almost two days.

    Here single-molecule sequencing isn't much faster: a Sequel II run takes over a day, a Revio run still takes a day, and an ONT run is three days (in order to maximize throughput, the pores are allowed to run for a long time).

    For ultra-fast WGS, ONT was used to generate WGS data in 3 hours on the PromethION (GenomeWeb link paywalled) for newborn screening in the NICU, where fast turnaround time is paramount. There are some eight programs worldwide underway to utilize NGS in newborn screening, enrolling from 1,000 to over 100,000 infants. One prominent example is Stephen Kingsmore's Rady Children's Institute "BeginNGS" program, subtitled "Newborn genetic screening to end the diagnostic odyssey". (Here's a paywalled 360dx article laying out the details of these eight newborn genomic screening programs worldwide.)

    Anything else?

    Okay, that is my list of ‘unmet needs in NGS’ with some problems solved by single molecule approaches, yet other problems remain. Did I miss anything else?

  • Observations about Helicos, a single molecule sequencer from 2008


    A brief history of Helicos Biosciences

    Does anyone remember Helicos Biosciences? Way back in 2009 (per Wikipedia), Stanford professor and Helicos co-founder Stephen Quake had his genome sequenced (and published in the prestigious journal Nature Biotechnology) for a reported $50K in Helicos reagents. That year I remember hearing a talk given by Arul Chinnaiyan at the NCI with single-molecule RNA-Seq data; it was an exciting time.

    Crunchbase indicates Helicos raised $77M and went public in 2007; they shipped their first Heliscope in 2008, only to be delisted in 2010 and then declare bankruptcy in 2012. Remember, the Solexa 1G / Illumina Genome Analyzer ("GA") only started selling commercially in late 2006, so in those early days it was something of a dogfight. I was selling Illumina microarrays for GWAS to the NIH from 2005 onward, through the first GAs and GA IIx's, and then from 2010 started selling the Applied Biosystems / Life Technologies SOLiD 4s.

    A few distinguishing features

    For those familiar with library preparation for Illumina sequencing, it takes time and several rounds of PCR and PCR cleanup, along with quantification, to be ready for sequencing. Helicos instead used poly-adenylated nucleic acid bound to the flowcell, and the chemistry then sequenced the DNA directly, without any of the further amplification into emulsions, nanoballs, or clusters that other platforms require.

    As the world's "first true single molecule sequencer", the DNA sequence had no amplification bias and could read high-GC or low-GC stretches of DNA without any impairment of accuracy. The bases were accurate to about 96%; this 4% error was not sequence dependent and was basically random, simplifying analysis. Only a tiny amount of sample input was required: 3 ng of RNA or DNA. The two flowcells gave the instrument the capability to run 50 samples in one run, which took 7 or 8 days to complete. And at a 2008 price per sample of about $325 (for about 14M unique reads; this info is from an old GenomeWeb interview), the price per sample for RNA-Seq was attractive, although it would require a 48-sample experiment (two of the 50 lanes were reserved for controls), or some $15,600 for a single experiment, which naturally limited the market to high-throughput operations.

    In 2008 the Solexa 1G and Genome Analyzers were all the rage

    The Solexa acquisition occurred in November 2006, and several Solexa 1Gs had already been shipped and started producing data in customer laboratories; at only 25 basepair (bp) reads, they still produced about 800 Mb of sequencing data per 3+ day sequencing run. I had been selling microarrays to the NIH for Illumina since 2005, and in early 2007 the first Solexa 1G was installed at the laboratory of Keji Zhao at NHLBI, who ended up being the first person to publish a ChIP-Seq paper using NGS (it was in the journal Cell in May 2007; here's a PubMed link).

    By the time Helicos made their first commercial sale in February 2008, Illumina had already sold 50 Genome Analyzers by the summer of 2007, and by February 2008 had updated the instrument to do paired ends and extended the read lengths from 25 bases to 35 bases, with progress toward 50-base read lengths announced.

    Against this backdrop, Helicos designed and built their 'Heliscope' single-molecule sequencer to be highly scaled: 50 channels, about 25 Gb of sequence data per 7- or 8-day run, read lengths 25 to 55 basepairs long with an average of about 30 to 35, about a 4% error rate across bases with G/C content ranging from 20% to 80%, and a random error model (no systematic bias, which was their big selling point against Illumina, where cluster generation as well as library preparation uses a form of PCR amplification that introduces bias).

    And according to a conversation I had this week, the price of the Heliscope was also scaled: the instrument was $1.2M at the start in 2008, and steadily lowered over time to about $900K in 2011 when they ceased operations. Requiring 48 samples for an RNA-Seq experiment, taking an entire week to generate data, and costing over $15,000 was a tall order to fill; sequencing a whole genome for $50,000 was also not something many laboratories or individuals could afford in 2008.

    Important aspects of the Heliscope

    I was able to source a 2008 product sheet for the Heliscope (PDF); the on-board data storage capacity (remember, this is 2008) was a whopping 28 terabytes. This was to store the enormous imaging data for the flowcells, of which there were a pair, and to do all the image registration and base calling. Any way you look at it, a single run producing 25 gigabases of sequencing data in 2008 was going to pose some challenges.

    And this instrument was big: the spec sheet says the main Heliscope sequencer was four feet by three feet by six feet tall, and a hefty 1,890 lbs. An 800 lb block of Vermont granite was included at the bottom of the instrument to stabilize it against vibration. However, it's clear from a photograph of the instrument that it was fitted with wheels, so you could say it was portable, as much as a 2,000 lb instrument is portable.

    As the world's first single-molecule sequencing technology (they trademarked the name, calling it True Single Molecule Sequencing (tSMS)™), the chemistry was not 'real-time' like the latest PacBio Revio™ or Oxford Nanopore PromethION™; it was sequencing-by-synthesis of a single base followed by imaging of the entire flowcell surface. With two flowcells (each with 25 lanes), one would be imaged while the other had its flowcell biochemistry performed. Impressively (or perhaps not that realistically?), they claimed that improvements in flowcell density and tSMS reagent efficiency would eventually produce 1 Gb of sequence per hour (about 7x the above numbers in terms of density, and thus overall throughput).

    One source told me that in those days the flowcell had uneven densities of poly-T molecules, so there were areas where bases could not usefully be called. If an area was too sparse, it was not worth the effort of scanning and analyzing; if it was too dense, the signals would collide and no usable sequence could be obtained. The original design, however, scanned the entire surface of all 50 channels; usable data or not, all the images were captured and analyzed. There wasn't the luxury of time or engineering resources to optimize this.

    What was the cause of the ultimate demise of the Heliscope?

    Not only was the instrument cost an issue, there was also the problem of getting to longer reads. In 2008 Illumina was getting 35 bp reads and was on its way to 50 bp reads, along with paired-end capability that meant a large increase in throughput. (For those unaware, in 2023 these reads now go out to 300 bp.) Helicos could not catch up; likely due to restrictions in detectability and the optical system, the laser illumination used to excite the fluorescent labels on the nucleotides also had the potential to damage the DNA and render the molecule unusable. And thus Helicos could talk about extending the average read length from 35 (plus or minus 10 or 15 bases, as it was a distribution of reads) to 50 or longer, but it just did not happen in the timeframe from 2008 to 2011, when Helicos stopped selling the Heliscope systems. It is my understanding that they did not sell many of these $1M systems, fewer than a dozen or so worldwide.

    Pricing a new instrument from a new company at the $1M price point is a tall order. One life science company that sold single-cell analysis equipment and consumables, Berkeley Lights (now renamed PhenomeX after acquiring the single-cell proteomics company Isoplexis), tried for years to reduce the size and cost of its flagship Beacon system, but was unable to, and has a limited market for its analyzer.

    You can say Helicos paved the way for market reception of PacBio in 2012 and then Oxford Nanopore a few years later in 2015. The relatively high cost (some 7x to 10x on a cost-per-base relative to sequence data coming off of Illumina’s flagship NovaSeq X) remains a large barrier.

    Now that Element Biosciences, Singular Genomics, and Ultima Genomics (and let’s not forget PacBio’s Onso) are competing head-to-head with Illumina on short reads, is there room for innovation (and cost reduction) in single molecule long reads? I would certainly hope so.