Published by the Students of Johns Hopkins since 1896
November 9, 2025

Sohini Ramachandran discusses population genetics and computational clustering algorithms

By PALAK SADANA and ADITYA SANKAR | November 7, 2025


Family has always been important to those working in population genetics. When Sohini Ramachandran was a postdoc, the issue of relatives in a dataset causing inaccurate results was considered a major problem in the field. In a Biology Department Seminar held at Mudd Hall on Oct. 9, she expanded upon two of her related research projects describing the analysis of genomic datasets.

When analyzing genetic data, one of the main principles to consider is that two individuals will always share a common ancestor. That is to say, if two individuals’ inherited alleles are traced far enough back, they will eventually coalesce at a common ancestor. The existence of such common ancestors for any sample reflects the effects of behaviors like survival, migration, mating and reproduction on the human genome.
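The coalescence idea can be illustrated with a small simulation. The sketch below is not from the seminar; it assumes an idealized haploid population of constant size in which every individual picks its parent uniformly at random (the textbook Wright-Fisher model), and the function name and parameters are illustrative only. It traces two lineages backward until they land on the same parent.

```python
import random


def generations_to_coalescence(pop_size, rng):
    """Trace two lineages backward until they share a parent.

    Assumes an idealized haploid Wright-Fisher population: constant size,
    non-overlapping generations, parents chosen uniformly at random.
    """
    generations = 0
    while True:
        generations += 1
        # Each lineage independently picks a parent in the previous generation.
        parent_a = rng.randrange(pop_size)
        parent_b = rng.randrange(pop_size)
        if parent_a == parent_b:
            # Both lineages now pass through the same ancestor.
            return generations


rng = random.Random(0)  # fixed seed so the run is repeatable
times = [generations_to_coalescence(100, rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
print(f"mean generations to a common ancestor: {mean_time:.1f}")
```

Under these assumptions the two lineages always coalesce eventually, and the average waiting time is close to the population size. Real populations depart from the model through migration, mixing and changes in size, which is the kind of departure the talk went on to discuss.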

Ramachandran highlighted that genes were more scattered in the recent past than would be anticipated by classic models — potentially the impact of increased migrations, mixing and other population changes that alter natural evolution as a result of human influences like colonialism.

Now, with large genetic biobanks available, researchers can analyze the proportion of both distant and close relatives in a population in greater depth. Ramachandran’s team aimed to distinguish relatives based on whether they were maternal or paternal in order to resolve challenges in phasing, the process of inferring the two parental haplotypes from a child’s diploid genotype. While phasing normally focuses on assigning maternal or paternal sections within individual chromosomes, more haplotypes emerge when considering the combinations across multiple chromosomes. This could carry significant implications for understanding long-range parental origin effects, or how each parent’s genetic contribution differs in their offspring.

To understand why the effectiveness of this technique varied, Ramachandran’s team examined their methodology and found relatives who shared both maternal and paternal DNA. The frequency of these relatives became concentrated in specific geographic locations as the team looked further back in time. This observation could likely be explained by geographically constrained human reproductive patterns in ancient times.

Thus, Ramachandran’s work raised questions about how population data can inform our understanding of demographic dispersal, how relatedness bias can affect our understanding of past population distribution and, most importantly, how the redaction of large amounts of data from population censuses due to genetic similarity could be mitigated.

The Ramachandran Lab also focuses on using clustering algorithms to analyze cells to determine their cell types and functions. These data allow researchers to study processes like cancer progression, tissue formation and gene expression changes that occur under different conditions.

Designing efficient algorithms to determine cell types has its challenges. Firstly, many of these algorithms, known as stochastic algorithms, tend to give multiple different series of results that need to be properly aligned before they can be interpreted. This process is time-consuming and possibly error-prone. Secondly, these algorithms will sometimes output different solutions for the same set of inputs, making it difficult to determine an accurate classification of cell type and function.
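The alignment problem can be made concrete with a toy example. The sketch below is an illustration only and is not Clumppling's method: it assumes two runs that produced the same number of clusters, and it tries every relabeling of the second run to find the one that agrees best with the first. The function name `align_labels` and the sample data are hypothetical.

```python
from itertools import permutations


def align_labels(reference, other):
    """Relabel `other` so it agrees with `reference` as much as possible.

    Brute-force search over label permutations, so it is only practical
    for a handful of clusters. Assumes both runs use the same number of
    cluster labels.
    """
    if len(reference) != len(other):
        raise ValueError("both runs must label the same cells")
    ref_labels = sorted(set(reference))
    other_labels = sorted(set(other))
    if len(ref_labels) != len(other_labels):
        raise ValueError("this sketch assumes equal numbers of clusters")

    best_map, best_matches = None, -1
    for perm in permutations(ref_labels):
        # Candidate relabeling: each label in `other` maps to one reference label.
        mapping = dict(zip(other_labels, perm))
        matches = sum(mapping[o] == r for r, o in zip(reference, other))
        if matches > best_matches:
            best_map, best_matches = mapping, matches

    aligned = [best_map[o] for o in other]
    return aligned, best_matches / len(reference)


# Two runs on the same seven cells: same grouping, different label names,
# plus one cell that the second run assigned differently.
run_a = [0, 0, 1, 1, 2, 2, 2]
run_b = [2, 2, 0, 0, 1, 1, 0]
aligned_b, agreement = align_labels(run_a, run_b)
print(aligned_b, round(agreement, 3))
```

After relabeling, the two runs disagree on a single cell, which is the kind of residual difference that is hard to see while the labels are still mismatched. An exhaustive search like this grows factorially with the number of clusters, so practical tools rely on more efficient matching strategies.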

“When I was in graduate school, and even afterwards, I had to spend a lot of time trying to align these plots. I did it by hand, and I think I wasted days of my life doing this,” Ramachandran recalled.

To solve this, Xiran Liu, a postdoc in Ramachandran’s lab, developed software called “Clumppling,” which addresses some of these issues by aligning the results of repeated runs, making them simpler and easier to interpret.

The team ran clustering outputs from Seurat and Scanpy, two widely used single-cell analysis toolkits, through Clumppling to assess its reliability and efficiency at identifying cell groups. Clumppling revealed that the classification of some cell groups, such as CD14+ monocytes, was accurate, whereas the classification of other cell groups was less clear. This initial example demonstrates how Clumppling can be used in conjunction with clustering algorithms to increase the precision of single-cell analyses.

Clumppling was also applied to breast cancer tumors. The program showed that the healthy tissue and tumor edge areas were less clearly defined than the invasive carcinoma tumors. Importantly, Clumppling enabled the team to identify previously known markers driving these cell classifications, validating their approach and indicating the effectiveness of their program.

Lastly, Ramachandran's work challenged a common practice in single-cell analysis. Researchers typically analyze only highly variable genes (HVGs) to improve computational efficiency, discarding genes that show less variation across cells. However, when her team performed clustering separately on all genes, on HVGs only and on non-HVGs only, Clumppling revealed that some non-HVGs were significant for clustering. This result suggests that excluding non-HVGs may cause researchers to overlook potentially valuable biological information.
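For readers unfamiliar with the HVG step, the sketch below shows the basic idea in simplified form. It is an assumption-laden illustration: it ranks genes by plain variance across cells, whereas real toolkits use more careful, normalized measures, and the gene names, values and the `split_by_variance` helper are all made up for the example.

```python
from statistics import pvariance


def split_by_variance(expression, n_top):
    """Split genes into the `n_top` most variable ones and the rest.

    `expression` maps a gene name to its expression values across cells.
    Plain variance is used here as a stand-in for the normalized
    variability measures that real pipelines compute.
    """
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:n_top], ranked[n_top:]


# Hypothetical expression values for four genes measured in four cells.
expression = {
    "GENE_A": [0.0, 5.0, 0.0, 6.0],
    "GENE_B": [2.0, 2.1, 1.9, 2.0],
    "GENE_C": [1.0, 3.0, 1.0, 3.0],
    "GENE_D": [4.0, 4.0, 4.0, 4.0],
}
hvgs, non_hvgs = split_by_variance(expression, n_top=2)
print("kept as HVGs:", hvgs)
print("discarded:", non_hvgs)
```

In a standard workflow only the first list would be passed on to clustering. The comparison described above amounts to clustering on each list separately, and on all genes together, and then checking how the resulting clusters line up.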

Ramachandran questioned the general practice of only analyzing HVGs. 

“The last thought I want to leave you with is whether subsetting to highly variable genes is a good practice for clustering or not,” Ramachandran said. “It's a very common thing that's done in this field. But one question is, should we actually be doing it?”

Ramachandran's clustering alignment framework addresses the variability inherent in clustering algorithms. With Clumppling, her lab has pioneered a systematic approach to evaluate result consistency, track cluster emergence and discover biologically relevant genes. As genomic datasets grow larger and more complex, such methodologies will be essential to ensure that computational convenience does not come at the cost of biological insight.

“We would like to recommend to people who work with functional genomic data to run clustering multiple times and apply clustering alignment,” Ramachandran explained. “It gives us the opportunity to think about identifying genes that are driving clusters, which I think would be an exciting thing for functional genomics.”

