A novel bioinformatics tool built by researchers at the Wellcome Sanger Institute is able to identify a range of structural genetic variants that are responsible for causing rare diseases using primarily standard short-read whole-genome sequencing data.
While long-read sequencing is still the “gold standard” for identifying and classifying complex structural variants in the genome, the new tool overcomes known limitations of short-read sequencing by computationally improving the filtering, classification, and validation steps, picking up structural variants that would otherwise be missed.
Many rare diseases are caused by different types of genetic variants, some of which are inherited and some of which arise during development. Structural variants include cases where a stretch of DNA is deleted, duplicated, or inserted where it shouldn’t be. The sequence can also be partly inverted or translocated to somewhere else in the genome. Repeat expansion variants, such as those that cause Huntington’s disease and Fragile X syndrome, are also classed as structural variants.
Structural variants, particularly those that are large and complex, have typically been hard to diagnose using short-read sequencing. Many of them are larger (at least 1,000 bp) than the typical read length used in short-read sequencing (150-300 bp), and piecing the variants together accurately from fragmented sequence data can be difficult.
As reported in Nature Communications, the bioinformatics pipeline the researchers created was able to identify 1,870 structural genetic variants from whole-genome sequencing data collected from 12,568 families that took part in the 100,000 Genomes Project in the U.K., including 13,698 children with rare diseases.
Of the structural variants they discovered, around eight percent were complex and contained multiple changes. The researchers characterized and classified most of them into 11 different subtypes.
The majority of the genomic data analyzed in the study came from short-read sequencing, with long-read data available for only 23 samples.
Notably, the analysis was able to offer an updated diagnosis for 145 children in the group by identifying the structural variants causing their conditions. Just under half of this group have causative variants that are hard to detect using other genetic tests.
“This new method, which allows us to identify and analyze complex structural variants, opens up new possibilities for the understanding and possibly management of health conditions,” explained first author Hyunchul Jung, PhD, a researcher at the Wellcome Sanger Institute, in a press statement.
“We’ve shown that it’s not just about finding a deletion or duplication in the genome, it’s how such changes happen together—something that was not possible to see before. Our robust pipeline allows us to look close enough at the genome to start to build a clearer picture for researchers, clinicians, and patients.”
The researchers hope their tool will help other investigators to better characterize and find out more about different structural variants and how they cause disease. It could also offer insights into progression of different conditions in the future.
		