Published by the Students of Johns Hopkins since 1896
April 16, 2024

Insights across species: Mapping the genomes of vertebrates

By SHREYA TIWARI | February 18, 2024



New bioinformatics software allows researchers to investigate vertebrate genomes in an efficient, organizable and accessible way. 

Michael Schatz, a Bloomberg Distinguished Professor in the Department of Computer Science, collaborated with the Pennsylvania State University, Rockefeller University and various other institutions to increase the efficiency of whole genome assembly. They developed a pipeline, a piece of software that automates critical steps in genome assembly. It is now publicly available on Galaxy, a hub for publicly storing large datasets and software for data analysis. 

These efforts allowed researchers to sequence the complete genomes of 51 vertebrates, contributing to the larger Vertebrate Genomes Project (VGP). Schatz’s research supports the VGP’s overall mission to generate and store the whole genomes of living vertebrate species — an effort that could have critical implications in disease and evolutionary studies. 

A genome assembly is an ordered map of an organism’s genes. Developing an organism's whole genome assembly (WGA) is computationally challenging. However, with recent advances in sequencing technology, long stretches of DNA can be read and stored, increasing the efficiency with which scientists can compile genomic data. 

Genome assembly is comparable to a massive jigsaw puzzle with missing and fragmented pieces. Researchers rely on current knowledge of genes and attempt to fit together similar sequences to create maps of an organism’s genome. This process is highly prone to errors: Pieces of genomes can be joined incorrectly, placed in the wrong orientation or discarded entirely because they appear redundant or contain errors. Consequently, publicly available whole genome assemblies are often inaccurate and require constant improvement. 
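
To make the jigsaw analogy concrete, here is a toy sketch of fitting overlapping fragments together. It is an illustration only and is not the VGP pipeline: the function names, the short example reads and the `min_len` threshold are all hypothetical, and real assemblers use far more sophisticated methods to cope with repeats, sequencing errors and orientation.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that matches a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left overlaps: the pieces stay as separate contigs
        # Join the two fragments, keeping the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical example reads that overlap end to end.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Because this approach always takes the largest overlap it sees, a repeated sequence can cause two unrelated fragments to be joined, which is one way the incorrect joins described above can arise.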

Large-scale efforts like the VGP emphasize the importance of making publicly available genomes error-free. Comparing DNA sequences across vertebrates, including humans, helps researchers explore the genetic mechanisms behind the evolution and survival of individual species. Endeavors like the VGP could also reveal evolutionary similarities and differences across vertebrate species in development, neurological function, chromosomal development, sex determination and disease resistance. 

Alex Ostrovsky is an engineer on the University’s Galaxy team who has been working on making the developed pipeline more accessible to inexperienced coders. Ostrovsky shared how increasing the number of publicly available reference genomes could open a Pandora’s box of new unanswered questions. 

“People will use specific [model] organisms because they’re well studied. You would be avoiding things that aren’t [studied] as much, but that doesn’t mean they’re not scientifically relevant. Being able to have a reference genome for all of these vertebrates opens up a lot more research and gives a great starting point for a lot of further science,” Ostrovsky said in an interview with The News-Letter.

The team’s research has implications beyond its contributions to evolutionary biology and medicine since it holds great promise for improving public accessibility of genomic datasets and data analysis software. Prior VGP work was available on DNAnexus, and the Galaxy team was instrumental in transferring the new pipelines developed by Schatz to the publicly available Galaxy ecosystem.  

“You’re getting large datasets out of this, and a lot of them, so we’ve had to create systems by which you can tag datasets and make things automated as much as possible, whereas, before, they would be a lot more manual,” Ostrovsky said, explaining the role of Galaxy in implementing these new VGP workflows. 

In addition to localizing the VGP genome datasets on its system, the Galaxy team helped integrate Schatz’s pipeline for developing new genome assemblies with its system, an effort that contributed to resolving some of the biggest challenges in bioinformatics research. 

Bioinformatics software has a very short half-life, which makes reproducibility difficult to maintain. The same software or analytical tool can produce different results on a different day or may stop working entirely. Depending on the computer where the software is run, researchers from different fields or institutions may receive different results after performing the same analysis. Additionally, genomic datasets are massive. Storing, organizing and analyzing them directly on a local computer, rather than on a larger external platform, therefore often requires powerful hardware. 

To resolve these issues, Galaxy’s platform not only maintains datasets in an easily organizable manner but also provides full workflows and software for anyone to run analysis on the same datasets. These software pipelines have been tested multiple times and generate reproducible results. This means that conclusions from the same datasets can be verified by multiple researchers, increasing the accuracy and validity of bioinformatics research. Schatz’s research supports Galaxy’s overall goal of making large datasets and software pipelines from computationally intensive projects publicly available. 
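
One simple way to picture what "reproducible" means here is to compare fingerprints of the outputs from two runs of the same analysis. The sketch below is a generic illustration and does not describe how Galaxy itself verifies results; the function names and the sample data are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that changes if any byte of the output changes."""
    return hashlib.sha256(data).hexdigest()


def same_result(run_a: bytes, run_b: bytes) -> bool:
    """Report whether two runs produced byte-for-byte identical output."""
    return fingerprint(run_a) == fingerprint(run_b)


# Hypothetical outputs from two researchers running the same workflow.
first_run = b">contig1\nATGGCGTACCTGA\n"
second_run = b">contig1\nATGGCGTACCTGA\n"
print(same_result(first_run, second_run))
```

If the two digests match, both researchers obtained identical output; if they differ, something in the software, its version or its environment changed between runs.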

“Part of Galaxy’s mission is to democratize analysis. Having a standardized system in which to run these analyses means that you can get a consistent output. If somebody were to rerun assemblies for the VGP, they would have reproducible data, which is very important in these [computationally intensive] large projects,” Ostrovsky concluded.   
