More accurate genetic risk assessment for people of non-European ancestries by new machine learning model

Nilanjan Chatterjee, a professor of biostatistics and genetic epidemiology at the School of Medicine and a Bloomberg Distinguished Professor, helped develop a machine-learning model designed to improve the predictive ability of polygenic risk scores in non-European populations. The research, a collaboration with the Harvard T.H. Chan School of Public Health and Haoyu Zhang of the National Cancer Institute, was recently published in Nature Genetics.

Polygenic risk scores (PRSs), an example of precision health, estimate an individual’s risk of developing a disease based on that individual’s genetic makeup. Researchers begin by identifying genomic variants that are associated with a disease, such as type 1 diabetes or breast cancer, by comparing the genomes of individuals with and without the disease. Statistical methods are then used to combine these variants into a PRS. By evaluating the presence or absence of certain genomic variants in an individual’s DNA, researchers can assess the likelihood of that individual developing the disease.
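To make the calculation concrete, a PRS is commonly computed as a weighted sum: each variant's estimated effect size is multiplied by the number of risk alleles the person carries (0, 1 or 2), and the products are added up. The sketch below illustrates that general additive form only. The variant names and effect sizes are invented for illustration and do not come from the study.

```python
def polygenic_risk_score(genotype, effect_sizes):
    """Return the weighted sum of risk-allele counts.

    genotype: dict mapping variant ID -> risk-allele count (0, 1 or 2)
    effect_sizes: dict mapping variant ID -> estimated effect size
    """
    score = 0.0
    for variant, beta in effect_sizes.items():
        # Variants missing from the genotype are treated as 0 risk alleles.
        score += beta * genotype.get(variant, 0)
    return score


# Hypothetical effect sizes, as might be taken from GWAS summary statistics.
effect_sizes = {"variant_a": 0.12, "variant_b": -0.05, "variant_c": 0.30}

# Hypothetical individual: two copies of a, one of b, none of c.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

prs = polygenic_risk_score(person, effect_sizes)
print(f"PRS = {prs:.2f}")
```

A raw score like this is usually interpreted relative to a reference population, which is one reason the ancestry of that population matters.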

Data used to calculate PRSs are derived from the summary statistics of genome-wide association studies (GWASs). However, the generalizability of these data is limited because a significant majority of GWAS participants are of European descent. Genetic risk for disease can differ among people of different ancestries. For example, sickle cell disease is more common among people of African ancestry, and Tay-Sachs disease is particularly prevalent among Ashkenazi Jewish populations. As a result, non-European populations benefit less from the current predictive performance of PRSs, which are mainly modeled on individuals of European ancestry.

The researchers’ new model, the CT-SLEB method, combines multiple machine learning and statistical techniques, including the clumping and thresholding (CT) method, a super-learning (SL) model of machine learning and empirical Bayesian (EB) modeling. First, the CT method was used to identify single-nucleotide differences across populations that show an increased risk of disease. Next, EB modeling was used to estimate the effect size of each single-nucleotide polymorphism identified. Finally, the PRSs derived from the first two steps were used to train the SL model. The performance of the CT-SLEB model was evaluated using a separate test data set composed of data from both European and non-European populations.
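As a rough illustration of the first step, here is a minimal sketch of generic clumping and thresholding: keep only variants whose association p-value passes a threshold, then walk through them from most to least significant and drop any variant that is strongly correlated with one already kept. This is the textbook single-population form, not the multi-ancestry version used in CT-SLEB, and it omits the genomic distance window that real tools apply. All variant IDs, p-values, correlation values and cutoffs are assumptions made up for the example.

```python
def clump_and_threshold(variants, ld_r2, p_threshold=5e-8, r2_cutoff=0.1):
    """Select roughly independent, significant variants.

    variants: list of dicts with keys "id" and "p" (association p-value)
    ld_r2: dict mapping frozenset({id1, id2}) -> squared correlation (r^2)
    """
    # Thresholding: keep variants passing the p-value threshold,
    # ordered from most to least significant.
    candidates = sorted(
        (v for v in variants if v["p"] <= p_threshold),
        key=lambda v: v["p"],
    )

    kept = []
    for v in candidates:
        # Clumping: skip a variant correlated with any already-kept variant.
        # Pairs absent from ld_r2 are assumed to be uncorrelated.
        correlated = any(
            ld_r2.get(frozenset((v["id"], k["id"])), 0.0) >= r2_cutoff
            for k in kept
        )
        if not correlated:
            kept.append(v)
    return [v["id"] for v in kept]


# Hypothetical summary statistics for four variants.
variants = [
    {"id": "v1", "p": 1e-10},
    {"id": "v2", "p": 1e-9},
    {"id": "v3", "p": 1e-8},
    {"id": "v4", "p": 0.2},
]
# Hypothetical correlation: v1 and v2 are in strong linkage disequilibrium.
ld_r2 = {frozenset(("v1", "v2")): 0.8}

selected = clump_and_threshold(variants, ld_r2)
print(selected)
```

Here v4 fails the p-value threshold and v2 is dropped because it is correlated with the more significant v1.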

To evaluate the CT-SLEB model, researchers quantified the performance of its calculated PRSs using multiple metrics, such as the coefficient of determination (R²), which measures how much of the variance in a particular trait the score can predict, and the area under the curve (AUC), which measures how well the model can discriminate between cases and controls.
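The two kinds of metric described above can be sketched in a few lines. The example computes the squared correlation (R²) between scores and a continuous trait, and the area under the ROC curve (AUC) as the fraction of case-control pairs in which the case has the higher score. The numbers are toy values chosen for illustration, not results from the study, and the function names are this example's own.

```python
def r_squared(scores, trait):
    """Squared Pearson correlation between scores and a continuous trait."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_t = sum(trait) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, trait))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_t = sum((t - mean_t) ** 2 for t in trait)
    return (cov * cov) / (var_s * var_t)


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control.

    Ties count as half a win; this is equivalent to the area under
    the ROC curve.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy continuous trait that the score predicts perfectly.
r2_value = r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# Toy binary outcome: scores for people with and without the disease.
auc_value = auc([0.9, 0.6, 0.4], [0.5, 0.3, 0.1])

print(f"R^2 = {r2_value:.2f}, AUC = {auc_value:.3f}")
```

An AUC of 0.5 means the score separates cases from controls no better than chance, while 1.0 means perfect separation.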

Chatterjee explained how, during the model development process, researchers analyzed factors — such as race, sample size and genetic architecture of the disease — that had to be considered to augment the model’s predictive ability.

“We did theoretical work that showed how the performance of this high-dimensional model depends on the sample size of your training data set and the architecture of the disease... or how many genetic variants are actually associated with the disease,” Chatterjee said in an interview with The News-Letter.

The researchers compared their method to nine existing models used to calculate PRSs to assess performance in European and non-European populations. They found that CT-SLEB had significantly improved predictive accuracy in non-European populations compared to existing methods because of the model’s architecture and because it was trained on GWAS data from both European and non-European populations. The model performed especially well compared to existing models in populations of African ancestry, where performance has previously been disappointing due to underrepresentation.

However, despite these advances, CT-SLEB still shows performance gaps between European and non-European populations. One of the biggest reasons for this gap is the simple lack of GWAS training data from non-European populations, a shortage seen across the field of genomic medicine.

Chatterjee discussed current efforts to increase the data available from non-European populations, pointing to the All of Us Research Program started by the National Institutes of Health (NIH). The program aims to make scientific findings more generalizable by collecting health data from diverse populations, especially those that have historically been underrepresented in biomedical research.

“The NIH has a number of funding initiatives [that] are encouraging people to collect more data on diverse populations,” Chatterjee explained.

Zhang, the lead author of the study and an Earl Stadtman Investigator in the Biostatistics Branch at the NIH, also highlighted the importance of finding local collaborators in areas where data aren’t widely available. In an interview with The News-Letter, Zhang said that international collaboration is needed to build larger sample sizes for populations of non-European ancestry, which can then be used to train future models and improve the calculation of PRSs.

“It’s not only about the funding... you also need people who really have the background. We are trying to find local collaborators who have local community knowledge,” Zhang said.

Models such as CT-SLEB could eventually be beneficial in clinical practice. By calculating PRSs based on an individual’s specific genetic makeup and enabling personalized healthcare, clinicians may be able to advise earlier screening for certain diseases and provide timely preventative care. Existing models, however, don’t have high predictive accuracy in non-European populations. Implementing them clinically could therefore lead to higher rates of misdiagnosis, and thus to mistakes in subsequent treatment, in non-European populations, exacerbating existing health disparities.

“There are currently a lot of efforts trying to push these genetic models in a clinical setting... but if you directly apply them in non-European populations... it might cause some health disparities,” Zhang said. “One of our model’s contributions is [that] we can do a better job in non-European population prediction so the model also works... better than existing methods regarding giving more personalized recommendations.”

The researchers hope to use their work on the CT-SLEB model to eventually develop a model that can benefit people of all ancestries equally in clinical settings.

“Model power will increase as more data comes in. You will reach the point where you can give personalized recommendations given everyone’s risk factors... everyone’s genetics — and it works universally well across different ancestries. That’s the vision of this research,” Zhang concluded.
