The past 60 years have seen the development of various sequencing techniques, which scientists can employ to identify nucleic acid sequences in biological samples. These techniques have opened doors to progress in diagnosing and managing several diseases and disorders. To date, the most advanced type of sequencing is next-generation sequencing (NGS), which allows scientists to identify germline or somatic mutations.
NGS has also demonstrated its worth in metagenomic studies. For example, in 2020, scientists leveraged NGS to characterize the SARS-CoV-2 genome and track the spread of COVID-19 around the world.
Since the 2000s, which marked NGS’ first commercial use, its applications have grown exponentially. Those hoping to gain a better understanding of NGS will find information on its history, sequencing methods, and workflow in the peer-reviewed, open-access journal BioTechniques. This journal also explores NGS data analysis, the differences between short- and long-read sequencing, the differences between whole-genome and whole-exome sequencing, and the technology’s bottlenecks.
Here, we’ll provide an overview of these topics, explaining, overall, how next-generation sequencing works.
Next-Generation Sequencing History
Sequencing techniques have grown, over time, out of Watson and Crick's discovery of the structure of DNA, which they achieved using Rosalind Franklin's DNA crystallography and X-ray diffraction work in 1953. Building on this, Robert Holley sequenced the first nucleic acid molecule, tRNA, in 1965. Since then, several research groups have adapted the first sequencing methods to develop DNA sequencing techniques.
One of the most notable developments in DNA sequencing came in 1977 when Frederick Sanger worked with his colleagues to develop Sanger sequencing, also known as the chain-termination method. By 1986, scientists had also developed the first automated DNA sequencing method, marking the advent of a golden era for sequencing platforms like the capillary DNA sequencer.
Developments in sequencing led to researchers completing the Human Genome Project in 2003 and introducing the first second-generation (2G), commercially available NGS platform in 2005. With this platform, scientists could amplify millions of copies of a DNA fragment simultaneously, which hadn't been possible with Sanger sequencing.
2G sequencing techniques are similar to Sanger sequencing in some ways, but they have a much higher sequencing volume. As a result, scientists can employ these techniques to process millions of reactions at the same time, raising the throughput, sensitivity, and speed of the process, all at a lower cost. In theory, genome sequencing projects that took years to complete with Sanger sequencing could now be finished in hours.
While NGS once only encompassed 2G technologies, there are now third- and fourth-generation (3G and 4G) technologies on the sequencing scene.
2G sequencing methods share various features, but their underlying detection chemistries differentiate them. These detection chemistries include sequencing by ligation and sequencing by synthesis (SBS), which breaks down further into proton detection, reversible terminator, and pyrosequencing.
These sequencing methods superseded previous techniques because they enable scientists to produce sequencing reads quickly, sensitively, and cost-effectively. However, 2G technologies require scientists to complete PCR amplification before sequencing, and the short read lengths create a need for deeper sequencing coverage. 2G techniques also run the risk of poor interpretation of homopolymers and the incorporation of incorrect dNTPs by polymerases, which can lead to sequencing errors.
The launch of 3G technologies meant that scientists could sequence single molecules without completing amplification steps. With the first single molecule sequencing (SMS) technology, which Stephen Quake, currently at Stanford University, CA, USA, developed with his colleagues, scientists could obtain sequence information with the use of DNA polymerase. They achieved this by monitoring the integration of fluorescently labeled nucleotides into DNA strands with single-base resolution.
The benefits of 3G technologies vary depending on the technique and tools selected. That said, the usual benefits include real-time monitoring of nucleotide incorporation, non-biased sequencing, and longer read lengths. However, 3G technologies' high error rates, high costs, large quantities of sequencing data, and low read depth can all lead to challenges.
The latest suite of NGS techniques (4G techniques) combines nanopore technology with single-molecule sequencing, passing single molecules through nanopores. Like 3G techniques, 4G techniques don't require amplification. However, although these techniques facilitate the fastest whole-genome sequence scans to date, 2G techniques are often still seen as the gold standard in terms of accuracy and so are sometimes used to validate 4G sequencing results.
The 2G Next-Generation Sequencing Workflow
When following a 2G sequencing method, scientists must work through four stages, which they can adapt based on the target DNA/RNA and their selected sequencing system.
1. Sample Preparation
First, scientists extract nucleic acids from the samples. The nucleic acids may be DNA or RNA, and the samples can vary from blood and sputum to wastewater, environmental samples, and many more sources. Scientists check the extracted samples for quality control, employing spectrophotometric, gel electrophoretic, fluorometric, or similar methods. If the scientists are working with RNA, they will reverse transcribe the samples into cDNA, although some library preparation kits include this step.
2. Library Preparation
When choosing a library preparation kit and sequencing platform, scientists should consider a range of factors. The research question, sample type, and most appropriate extraction method are all important, as are the required read depth (coverage), read length, and sample concentration.
Scientists should also consider whether they need to sequence the whole genome or specific regions of the genome, whether short- or long-read sequencing is more appropriate, whether they need to look at the genome or transcriptome (DNA or RNA), whether they can multiplex samples, whether they need bioinformatics tools, and whether they should use mate pair, paired end, or single end reads.
When beginning this stage of the NGS workflow, scientists randomly fragment the DNA or cDNA, usually with an enzymatic treatment or through sonication. The platform they use dictates the ideal fragment length. Scientists may also need to run a small amount of the fragmented sample on an electrophoresis gel when optimizing the process. They can then end-repair the fragments and ligate them to short, generic DNA sequences known as "adapters."
Adapters have defined lengths with known oligomer sequences. This makes them compatible with the applied sequencing platform and recognizable when scientists carry out multiplex sequencing, which makes it possible to pool and sequence several libraries at once.
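To make the multiplexing idea concrete, here is a minimal demultiplexing sketch in pure Python: pooled reads are routed back to their source libraries using the index (barcode) sequence carried by the adapter. The barcodes and sample names below are hypothetical, and real demultiplexers also tolerate sequencing errors in the barcode.

```python
# Toy demultiplexing sketch: assign pooled reads to samples by the
# index (barcode) carried in the adapter. Barcodes are hypothetical.
BARCODES = {"ACGT": "sample_A", "TTAG": "sample_B"}

def demultiplex(reads, barcode_len=4):
    """Route each read to its sample using the leading barcode bases."""
    bins = {name: [] for name in BARCODES.values()}
    bins["undetermined"] = []
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        bins[BARCODES.get(tag, "undetermined")].append(insert)
    return bins

pooled = ["ACGTGGCATT", "TTAGCCGTA", "GGGGAAAA"]
bins = demultiplex(pooled)
# reads with unrecognized barcodes land in the "undetermined" bin
```

Production pipelines perform this step with platform software (e.g., during basecalling), but the core lookup is the same.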
Scientists then carry out size selection, which they can complete with magnetic beads or gel electrophoresis. The size selection process eliminates fragments that are too long or short for the chosen sequencing platform and protocol. From here, they carry out PCR steps to achieve library enrichment/amplification. They may also apply a clean-up step, possibly using magnetic beads, to eradicate unwanted fragments, thereby improving sequencing efficiency.
To finish the library preparation step, scientists may use qPCR to run a quality control check on the final libraries. This check confirms the quantity and quality of DNA, allowing scientists to prepare the optimal concentration of the sample for sequencing.
3. Sequencing
Scientists may carry out clonal amplification of the library fragments before loading them onto the sequencer (emulsion PCR) or perform this amplification on the sequencer itself (bridge PCR). Whether they perform this step at all depends on the platform, chemistry, and the quantity of target sample available. Scientists then detect and report the sequences according to the selected platform.
4. Data Analysis
The final stage of the NGS workflow involves the analysis of the generated data files. Scientists select an analysis method based on their workflow and study aim. For example, if they opt for downstream data analysis, mate pair and paired end sequencing are ideal options, particularly for de novo assemblies. These sequencing techniques link sequencing reads that are separated by an intervening DNA region (mate pair) or are read from both ends of a fragment (paired end).
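One way paired end reads support downstream analysis is insert-size estimation: because the two mates come from opposite ends of the same fragment, their mapped positions reveal the original fragment length, which assemblers use for scaffolding. The sketch below assumes mapping positions are already known (a real pipeline gets them from an aligner); all coordinates are illustrative.

```python
# Toy paired-end sketch: each mate maps to the reference at a known
# position; the distance between the outer ends estimates the original
# fragment (insert) size. Coordinates below are illustrative only.
def insert_size(fwd_start, rev_start, read_len):
    """Outer distance spanned by a forward/reverse read pair."""
    return (rev_start + read_len) - fwd_start

# (forward mate start, reverse mate start, read length) per pair
pairs = [(100, 350, 100), (4000, 4280, 100)]
sizes = [insert_size(*p) for p in pairs]
mean_insert = sum(sizes) / len(sizes)
# a consistent insert-size distribution helps order and orient contigs
```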
Analyzing Next-Generation Sequencing Data
NGS technologies produce high volumes of data, which is why data analysis processes typically involve:
- A raw read quality control step
- Pre-processing and mapping
- Post-alignment processing
- Variant calling
- Variant annotation
Scientists assess the raw sequencing data to understand the quality of the data and prepare for downstream analyses. These assessments allow scientists to gain an overview of the number and length of reads and determine any reads with low coverage or contaminating sequences.
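The raw-read assessment described above boils down to per-read statistics. As a minimal sketch, assuming the standard 4-line FASTQ layout with Phred+33 quality encoding, the following pure-Python pass reports read count, lengths, and mean base quality (tools like FastQC compute these and many more):

```python
# Minimal FASTQ quality-control pass (pure Python, no external tools):
# count reads, record lengths, and compute each read's mean Phred score
# from the Sanger-encoded (Phred+33) quality line.
def fastq_stats(lines):
    """Yield (read_id, length, mean_quality) per 4-line FASTQ record."""
    for i in range(0, len(lines), 4):
        header, seq, _, quals = lines[i:i + 4]
        phred = [ord(c) - 33 for c in quals]       # decode Phred+33
        yield header[1:], len(seq), sum(phred) / len(phred)

fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",   # 'I' decodes to Phred 40
    "@read2", "ACGT",     "+", "!!!!",       # '!' decodes to Phred 0
]
stats = list(fastq_stats(fastq))
# read2's mean quality of 0 would flag it for filtering downstream
```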
Scientists can use various applications, such as FastQC, to compute quality-control statistics on sequencing data, but they need extra tools for additional pre-processing such as read filtering and trimming. Trimming bases at the ends of reads and eliminating excess adapter sequences can improve data quality. Scientists can also employ modern tools such as fastp, which performs read filtering, base correction, and quality control in a single pass, merging the features of several traditional applications while running two to five times faster.
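The trimming operations these tools automate are conceptually simple. The sketch below shows the two steps in miniature: clip a known adapter suffix, then trim low-quality bases from the 3' end. The adapter sequence and quality cutoff are arbitrary examples, not the defaults of any real tool.

```python
# Sketch of the trimming steps tools like fastp automate: clip a
# trailing adapter, then trim low-quality 3' bases (Phred+33 encoding).
# The adapter and cutoff here are illustrative examples.
ADAPTER = "AGATCG"

def trim_read(seq, quals, min_q=20):
    """Remove a trailing adapter, then trim 3' bases below min_q."""
    idx = seq.find(ADAPTER)
    if idx != -1:                       # adapter read-through detected
        seq, quals = seq[:idx], quals[:idx]
    while quals and (ord(quals[-1]) - 33) < min_q:
        seq, quals = seq[:-1], quals[:-1]
    return seq, quals

seq, quals = trim_read("ACGTTGCAAGATCG", "IIIIIII!!!!!!!")
# the adapter and the trailing Phred-0 base ('!') are removed
```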
The availability of a reference genome informs the next stage of data analysis. If no reference genome exists, scientists perform de novo genome assembly, aligning the generated sequences into contigs using their overlapping regions. They may utilize processing pipelines to do so, and these may introduce scaffolding steps to aid contig ordering and orientation and the handling of repetitive regions, which can, by extension, increase assembly contiguity.
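The core idea of overlap-based contig building can be sketched in a few lines: merge reads at their longest exact suffix/prefix overlap. Real assemblers use far more sophisticated graph-based methods that tolerate sequencing errors; this is only a toy illustration of the overlap principle.

```python
# Toy de novo assembly step: greedily merge two reads at their longest
# exact suffix/prefix overlap, the core idea behind contig building.
def merge(a, b, min_overlap=3):
    """Join b onto a at the longest suffix(a)/prefix(b) match."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None  # no usable overlap between the reads

contig = merge("ATTAGACCTG", "CCTGCCGGAA")
# the reads share the 4-base overlap "CCTG" and merge into one contig
```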
If the generated sequences are mapped to a transcriptome or reference genome, scientists may identify variations compared to the reference sequence. They can choose from a large selection of mapping tools to handle large data volumes while mapping reads. They may take an experiment-specific approach to analyze the reads, identifying differential gene transcription in RNA sequences, haplotypes, single nucleotide polymorphisms (SNPs), indels (insertions or deletions of bases), and inversions.
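Once reads are placed on a reference, SNP identification reduces to comparing bases position by position. In this toy sketch the mapping offsets are supplied directly; a real pipeline obtains them from an aligner and applies quality and depth filters before reporting a variant.

```python
# Toy variant identification: place each read at its known reference
# offset and report positions where the read base disagrees with the
# reference (candidate SNPs). Offsets are given directly here; a real
# pipeline derives them from an aligner and filters by quality/depth.
def call_snps(reference, mapped_reads):
    """mapped_reads: list of (offset, read). Returns (pos, ref, alt)."""
    snps = []
    for offset, read in mapped_reads:
        for i, base in enumerate(read):
            ref_base = reference[offset + i]
            if base != ref_base:
                snps.append((offset + i, ref_base, base))
    return snps

ref = "ACGTACGTACGT"
variants = call_snps(ref, [(0, "ACGA"), (4, "ACGT")])
# the first read carries a mismatch against the reference at position 3
```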
Data complexity can lead to challenges when scientists complete the visualization step. The study and research question should inform the visualization tools that scientists select. While the Genome Browser and Integrated Genome Viewer (IGV) are ideal when reference genomes are available, the VISTA tool allows scientists to compare genome sequences. Meanwhile, the Variant Explorer can sieve through thousands of variants, making it possible for scientists to focus on the most useful findings; this suits the tool to whole-exome and whole-genome experiments.
Differences Between Short- and Long-Read Sequencing
NGS techniques typically fall into two categories: short-read sequencing techniques and long-read sequencing techniques. Each type offers advantages and disadvantages that influence scientists' decisions over which to use for various applications. While short-read sequencing is a cost-effective approach that allows scientists to sequence fragmented DNA and achieve higher sequence fidelity, this kind of sequencing cannot phase alleles, resolve structural variants (SVs), or distinguish highly homologous genomic regions.
Long-read sequencing allows scientists to sequence genetic regions that are challenging to characterize with short-read techniques because of repeat sequences. This type of sequencing also allows scientists to read an entire RNA transcript and determine the specific isoform, assist de novo genome assembly, and resolve structural rearrangements or homologous regions. However, long-read sequencing has a lower per-read accuracy, and bioinformatic difficulties can stem from a range of factors: scalability, limited availability of appropriate pipelines, coverage biases, and high error rates in base allocation.
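The repeat problem described above can be illustrated with a toy mapping function: a read that fits entirely inside a repeated element matches the reference at multiple positions and so maps ambiguously, while a longer read that spans the repeat and its flanking sequence maps uniquely.

```python
# Toy illustration of why repeats defeat short reads: a read contained
# within a repeated element matches the reference at several positions,
# while a longer read spanning the repeat maps uniquely.
def mapping_positions(reference, read):
    """All offsets where the read matches the reference exactly."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

ref = "TTT" + "GAGAGA" + "CCC" + "GAGAGA" + "AAA"  # repeat occurs twice
short = mapping_positions(ref, "GAGAGA")            # ambiguous placement
long_read = mapping_positions(ref, "CCCGAGAGAAAA")  # spans into flanks
# the short read hits both repeat copies; the long read maps once
```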
Differences Between Whole-Genome and Whole-Exome Sequencing
During whole-genome sequencing, a scientist analyzes a genome’s entire nucleic sequence. On the other hand, whole-exome sequencing is a form of targeted sequencing. This method only addresses protein-coding exons.
Whole-exome sequencing comes at a lower cost than whole-genome sequencing because of its lower volume, lower sequencing burden, and the lower complexity of the resulting sequencing data. However, sequencing only a small portion of a genome lowers the chances of making unique discoveries, and scientists may miss important information. As a result, whole-genome sequencing offers a wider-ranging analysis that usually leads to bigger-picture results. As its costs fall over time, this type of sequencing is becoming more common.
Overcoming Bottlenecks During Next-Generation Sequencing
While NGS is accelerating scientists’ ability to study and understand genomes, there are bottlenecks in the ways that scientists manage, analyze, and store high data loads. The assembly, annotation, and analysis of sequencing data require huge computational resources that can lead to challenges, particularly as some data centers struggle to handle the rising demand for storage capacity.
That said, possible strategies to improve sequencing efficacy and reproducibility, reduce sequencing error, and enable correct data management are in progress. As NGS’ capabilities improve and its costs lower, more and more clinicians will be able to utilize these modern technologies in their practices, offering methods such as whole-genome, whole-exome, transcriptome, metagenome, epigenome, and targeted sequencing, each of which offers its own benefits.
Research professionals from all over the world use BioTechniques, an acclaimed journal, to develop their knowledge of processes like NGS, chromatography, western blotting, polymerase chain reaction, and CRISPR gene editing. These individuals specialize in a variety of fields, from the life sciences, chemistry, physics, and computer science to plant and agricultural science. Since launching in 1983 as the first journal to publish peer-reviewed research on the efficacy of laboratory methods, BioTechniques has cultivated an ever-growing community of research professionals who are interested in the future of science and medicine.