When we sequence DNA, we're trying to read the precise order of billions of chemical letters that make up a genome. This process isn't like reading a book—it's more like trying to reconstruct a massive encyclopedia after it's been shredded into millions of tiny, overlapping fragments, some of which are smudged or torn.
The sequencing process begins with biological samples that must be prepared and amplified in the lab. During this preparation, systematic biases creep in. Some regions of DNA are easier to capture and copy than others, particularly those with a balanced chemical composition; regions with extreme GC content, unusually rich in either G-C or A-T pairs, tend to be underrepresented. The amplification itself, while necessary to generate enough material to sequence, can introduce duplicates that are hard to distinguish from sequences genuinely repeated in the genome.
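As a rough illustration, GC-related coverage bias can be screened for in a few lines. The sketch below is hypothetical: it assumes per-window GC fractions and read depths have already been computed by upstream tools, and the window boundaries and thresholds are arbitrary.

```python
# Sketch: flag whether GC-extreme genome windows are underrepresented.
# Assumes (gc_fraction, read_depth) per window from upstream coverage tools.

def gc_bias_report(windows, low_gc=0.3, high_gc=0.7):
    """Compare mean depth in GC-extreme windows to GC-balanced ones."""
    balanced = [d for gc, d in windows if low_gc <= gc <= high_gc]
    extreme = [d for gc, d in windows if gc < low_gc or gc > high_gc]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "balanced_mean_depth": mean(balanced),
        "extreme_mean_depth": mean(extreme),
        # A ratio well below 1 suggests GC-extreme regions are undercovered.
        "extreme_to_balanced_ratio": (
            mean(extreme) / mean(balanced) if balanced else float("nan")
        ),
    }

# Made-up window metrics for illustration only.
windows = [(0.45, 32), (0.50, 30), (0.55, 31), (0.80, 12), (0.20, 14)]
report = gc_bias_report(windows)
```

In this toy input, the GC-extreme windows sit at well under half the depth of the balanced ones, the kind of skew the paragraph above describes.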
Once prepared, samples go into sequencing machines that detect chemical signals and convert them into digital base calls—the A's, C's, G's, and T's of the genetic code. Each sequencing platform has its own characteristic errors. Some struggle with homopolymers, long runs of the same base; others have trouble when signals from adjacent DNA fragments bleed into each other. These aren't random errors—they're predictable patterns tied to the specific chemistry and optics of each machine.
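One such predictable pattern can be screened for with a simple scan: reads spanning long homopolymers are the likeliest to carry platform-specific errors. The sequences and the run-length cutoff below are illustrative, not drawn from any real dataset.

```python
def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in seq."""
    if not seq:
        return 0
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# Flag reads containing a long homopolymer (cutoff of 5 is illustrative).
reads = ["ACGTACGT", "AAAAAAGT", "ACGGGGGG"]
risky = [r for r in reads if longest_homopolymer(r) >= 5]
```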
After getting these raw base calls, we face the challenge of figuring out where each fragment came from in the genome. We do this by comparing fragments to a reference genome, but this process inherently favors sequences that match the reference. If someone's DNA differs substantially from the reference—which is especially true for structurally complex regions or underrepresented populations—those differences might be missed or misrepresented. Repetitive sequences, where the same pattern appears in multiple locations, create ambiguity about where fragments truly belong.
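The ambiguity caused by repeats can be seen even with toy exact matching. The sketch below is illustrative only (real aligners use indexed, mismatch-tolerant search): it lists every position where a fragment fits a reference, showing that a fragment drawn from a repeated motif has more than one equally good home.

```python
def candidate_placements(reference, fragment):
    """All exact-match start positions of fragment within reference."""
    hits, start = [], reference.find(fragment)
    while start != -1:
        hits.append(start)
        start = reference.find(fragment, start + 1)
    return hits

# Toy reference containing the motif ACGTAC twice.
reference = "ACGTACGTTTACGTACG"
placements = candidate_placements(reference, "ACGTAC")  # two equally good homes
```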
The next step is identifying variants—places where an individual's sequence differs from the reference. This involves setting thresholds and making judgment calls about what constitutes a real variant versus noise. Different choices about these thresholds affect which variants get called, and the same sample can produce different results depending on exactly which software version or parameter settings are used.
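A stripped-down illustration of how those threshold choices change the call set, assuming only a read depth and an alternate-allele read count per site (real callers use probabilistic models, but the sensitivity to cutoffs is the same in spirit):

```python
def call_variant(depth, alt_reads, min_depth=10, min_alt_frac=0.2):
    """Call a site if coverage and alternate-allele fraction clear the cutoffs."""
    if depth < min_depth:
        return False
    return alt_reads / depth >= min_alt_frac

# Illustrative site: 5 alternate reads out of 30 (fraction ~0.17).
site = {"depth": 30, "alt_reads": 5}
strict = call_variant(**site)                      # default cutoff 0.2: no call
lenient = call_variant(**site, min_alt_frac=0.15)  # relaxed cutoff: called
```

The same site flips between "variant" and "noise" as the fraction cutoff moves, which is exactly why unrecorded parameter settings make results hard to compare.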
Perhaps the most insidious problem is what happens when you sequence many samples over time. Different batches of samples, even when processed with the same nominal protocol, can show systematic differences. Maybe they were run on different instruments, used different batches of chemical reagents, or were handled by different lab technicians. These differences create patterns in the data that have nothing to do with biology but can be mistaken for real genetic signals. A variant that appears associated with disease might actually just be associated with which sequencing center processed the samples.
Even when each individual sample looks fine by standard quality checks, these batch-to-batch differences persist. They show up as subtle shifts in the frequency of genetic variants, or as clustering patterns where samples group by when or where they were sequenced rather than by their actual biology. For rare variants—the ones often most important for disease—these effects can make results unreliable.
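One simple cohort-level screen for this, sketched below under the assumption that per-batch allele counts are available, is a two-proportion z-test on a variant's frequency across batches: a large statistic at many sites points to a batch effect rather than biology.

```python
import math

def batch_frequency_z(alt_a, n_a, alt_b, n_b):
    """Two-proportion z-statistic for one variant's allele frequency
    in batch A vs. batch B; a large |z| flags a possible batch effect."""
    p_a, p_b = alt_a / n_a, alt_b / n_b
    pooled = (alt_a + alt_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Same nominal cohort, split by sequencing center (counts are made up).
z = batch_frequency_z(alt_a=60, n_a=1000, alt_b=20, n_b=1000)
```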
Compounding all of this is incomplete record-keeping. When the exact software versions, reference genomes, and processing parameters aren't carefully tracked, it becomes nearly impossible to reproduce results or to meaningfully compare data generated at different times or places.
Addressing these challenges requires three interconnected improvements. First, we need to model and remove the specific noise patterns introduced by different platforms and batches, tackling errors at their source rather than just filtering results at the end. Second, we need quality control metrics that don't just look at individual samples in isolation but understand patterns across entire cohorts and can flag batch-related problems. Third, we need rigorous version control that locks every analysis to specific, documented versions of all software, references, and parameters, making results truly comparable across time and sites.
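The third improvement can be made concrete with a small sketch: hash a manifest of everything that defines a run, and treat two results as comparable only when their fingerprints match. The tool names, versions, and parameters below are invented placeholders, not a real pipeline.

```python
import hashlib
import json

def analysis_fingerprint(manifest):
    """Stable short hash of everything that defines a pipeline run."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Placeholder manifest; tool names, versions, and parameters are invented.
manifest = {
    "aligner": {"name": "example-aligner", "version": "1.4.2"},
    "caller": {"name": "example-caller", "version": "0.9.1"},
    "reference": "GRCh38",
    "params": {"min_depth": 10, "min_alt_frac": 0.2},
}
fp = analysis_fingerprint(manifest)  # results comparable only if this matches
```

Changing any pinned version, reference, or parameter changes the fingerprint, so silently mixed provenance becomes detectable.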
High-quality genomic data is one of Meirona's key priorities.