Sources of Erroneous Sequences and Artifact Chimeric Reads in Next Generation Sequencing of Genomic DNA From Formalin-Fixed Paraffin-Embedded Samples

Read the Publication

This week we profile a recent publication in Nucleic Acids Research led by Simon Haile (pictured, left) and Richard
Corbett (pictured, right) from the laboratory of Dr. Marco Marra at BC Cancer’s Genome Sciences Centre.

Can you provide a brief overview of your lab’s current research focus?

This research was performed by the Technology Development Group at BC Cancer’s Genome Sciences Centre, which is focused on maintaining, developing and enhancing the libraries and equipment used by the GSC for its cutting-edge sequencing and bioinformatics research. One aspect of this work involves continually improving and developing Next Generation Sequencing (NGS) methods. The group, overseen by Dr. Marco Marra, aims to improve data quality, reduce the amount of starting material required, shorten turnaround times and develop new NGS protocols.

What is the significance of the findings in this publication?

Millions of clinical specimens are stored in the form of formalin-fixed paraffin-embedded (FFPE) samples. The capacity to tap into these samples is crucial for NGS analysis; however, there are two major challenges with FFPE samples. First, the extraction and sample preparation methods are tedious. Second, data quality is degraded following formalin fixation, storage and purification.

We addressed the first challenge by developing an automated magnetic bead-based extraction protocol for the simultaneous processing of 96 samples (PMID: 28570594). In this publication, we address the second challenge; in particular, we show that among the sequencing artifacts are chimeric reads that appear to be derived from non-contiguous portions that align to both the ‘Watson’ and ‘Crick’ strands of the reference genome. We refer to these as strand-split artifact reads (SSARs).
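To make the SSAR definition above concrete, here is a minimal sketch of how a read might be flagged as a candidate, assuming each read has already been aligned and split into segments. This is an illustration only, not the tool described in the publication: the `Segment` type, the `is_candidate_ssar` function and the `max_distance` threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One aligned portion of a read (hypothetical structure for illustration)."""
    chrom: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # 0-based, exclusive
    strand: str     # "+" (Watson) or "-" (Crick)


def is_candidate_ssar(segments: List[Segment], max_distance: int = 1000) -> bool:
    """Return True if two segments of the same read align to opposite strands
    of the same chromosome within max_distance bases of each other.

    max_distance is an assumed cut-off, not a value taken from the publication.
    """
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if a.chrom != b.chrom or a.strand == b.strand:
                continue
            # Gap between the two reference intervals (negative if they overlap).
            gap = max(a.ref_start, b.ref_start) - min(a.ref_end, b.ref_end)
            if gap <= max_distance:
                return True
    return False


if __name__ == "__main__":
    read = [Segment("chr1", 100, 160, "+"), Segment("chr1", 300, 350, "-")]
    print(is_candidate_ssar(read))
```

In practice the segments would come from the primary and supplementary alignments reported by an aligner; how strictly to define "nearby" is a design choice left open here.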

Here we provide the conceptual framework for the mechanistic basis of the genesis of SSARs and other chimeric artifacts, along with the tools to quantify them. We propose that single-stranded DNA (ss-DNA) from denatured DNA fragments is the probable source of SSARs and other artifacts. Short stretches of sequence complementarity in the ss-DNA regions can link fragments together. After the end-repair step of library construction, these linked fragments yield the chimeras, which then become templates for T4 DNA ligase.

We present the following lines of evidence supporting the proposed mechanism: (i) verification of the existence of short complementary regions in 100 per cent of the SSARs we studied; (ii) the relationship between reductions in nucleic acid heat exposure (which presumably also reduces denaturation and therefore ss-DNA) and increased quality of FFPE libraries; (iii) the impact of removing ss-DNA fragments via S1 nuclease treatment; and (iv) the use of a tagmentation-based library construction protocol that lacks an end-repair step. S1 nuclease treatment also reduces sequence bias, base error rates and false positive detection of copy number and single nucleotide variants.
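As an illustration of point (i), the sketch below shows one simple way to look for a short complementary region between two DNA sequences: it finds the longest stretch of one sequence whose reverse complement occurs in the other. This is an assumed approach for explanation only, not the analysis used in the publication, and the function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def longest_complementary_stretch(a: str, b: str) -> str:
    """Return the longest substring of `a` whose reverse complement occurs in `b`.

    Implemented as a longest-common-substring search between `a` and the
    reverse complement of `b`. Ambiguous bases (N) never count as a match.
    Returns an empty string if there is no complementary stretch.
    """
    a = a.upper()
    target = reverse_complement(b)
    best_len, best_end = 0, 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if a[i - 1] == target[j - 1] and a[i - 1] != "N":
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    # "GATTACA" in the first sequence pairs with "TGTAATC" in the second.
    print(longest_complementary_stretch("AAAAGATTACA", "CCTGTAATCCC"))
```

A real analysis would also need a minimum length below which a match is treated as chance; that threshold is deliberately left out here.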

What are the next steps for this research?

An important outcome of this work is that it provides a conceptual framework to design further improvements for FFPE genome data.

This research was funded by:

National Institutes of Health, BC Cancer Foundation, Genome Canada, Genome British Columbia and Canadian Institutes of Health Research
