Our Research

January 1, 2025

The Computational Genomics Research Lab is led by Dr. Sina Majidian and is based in the Data Science and AI division at Chalmers University of Technology. We use machine learning and statistical analysis to address biological questions across four main research topics in genomics.

Personalized human pangenome

Pangenomes are redefining our understanding of genetic variation across populations and of genome evolution across species. A single reference genome cannot fully represent human genetic diversity, even when it is complete enough to be considered “Telomere-to-Telomere”. To fully harness the power of pangenomes in biomedicine, there is a pressing need for efficient methods to store them, visualize them, and extract relevant information from them. Our lab aims to understand human genome variation and evolution across different genomic regions by developing interpretable and efficient methods in comparative pangenomics, using machine learning and statistical analysis. Check out the GIAB genomic stratifications and ImputeFirst publications.

Variant effect prediction via genomic language models

Alignment enables the identification of similar regions across species, encompassing both genic and intergenic regions. Unlike traditional annotation liftover approaches, which are largely restricted to coding regions, we aim to discover novel homologies and population-specific differences by leveraging large language models such as DNABERT, MSA Transformer, and GPN-MSA. This work helps identify alternative model organisms that best represent regulatory elements for further study. Additionally, leveraging cross-species conservation and genomic context enables more accurate predictions of allele effects and a deeper understanding of genotype–phenotype relationships. We have ongoing collaborations with Johns Hopkins University and the Technical University of Munich on this line of research.

Genomic Language Model 

Comparative genomics

Comparative analysis reveals biological relationships and evolution between species, and enhances understanding of gene function and structure. We have developed FastOMA, an accurate method for orthology inference at scale, published in Nature Methods. Current methods are limited to analyzing tens to hundreds of genomes, whereas FastOMA provides a paradigm shift by enabling the analysis of thousands of species. FastOMA maps input genes to reference gene families using k-mers and infers gene trees at each taxonomic level to distinguish orthologs. This approach avoids comparing genes across different families, which do not share homology, significantly reducing computational complexity. Its scalability has been demonstrated by inferring the evolutionary history of all human gene families and identifying duplicated, lost, and gained genes across all 2,000 UniProt eukaryotic reference species in one day using 300 CPU cores. Check out the Read2Tree and FastOMA publications.
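To make the k-mer mapping idea concrete, here is a minimal toy sketch of assigning a query sequence to the reference family with which it shares the most k-mers. It is an illustration under stated assumptions, not FastOMA's actual implementation: the function names, the k-mer length of 3, and the example sequences are all hypothetical.

```python
from collections import Counter, defaultdict

K = 3  # toy k-mer length (assumption, chosen only for illustration)


def kmers(seq, k=K):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families, k=K):
    """Map each k-mer to the set of family IDs whose members contain it."""
    index = defaultdict(set)
    for fam_id, seqs in families.items():
        for seq in seqs:
            for km in kmers(seq, k):
                index[km].add(fam_id)
    return index


def assign_family(query, index, k=K):
    """Return the family sharing the most k-mers with the query, or None."""
    hits = Counter()
    for km in kmers(query, k):
        for fam_id in index.get(km, ()):
            hits[fam_id] += 1
    if not hits:
        return None  # no shared k-mer with any reference family
    return hits.most_common(1)[0][0]


# Hypothetical toy data: two reference families of protein fragments.
families = {
    "famA": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "famB": ["GSHMLEDPVD", "GSHMLEDPAD"],
}
index = build_index(families)

print(assign_family("MKTAYIAKQK", index))
print(assign_family("GSHMLEDP", index))
```

Because each query is looked up in the index rather than aligned against every other gene, the cost of placement grows with the number of queries, not with the number of gene pairs. Genes placed in different families are then never compared with each other.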

Orthology inference 

Haplotype assembly using matrix completion

Haplotype phasing, which reconstructs haplotypes from sequencing reads by linking the alleles at heterozygous variant sites, is crucial in the study of Mendelian diseases, cancer genomics, and drug response. We have developed software for estimating haplotype blocks from single nucleotide variants (SNVs) called from DNA sequencing reads. In one project, we applied low-rank matrix recovery to haplotype estimation on human sequencing data. In another project, we studied hexaploid sweet potato (Ipomoea batatas) using a 10X Genomics linked-read dataset, which resulted in long and accurate haplotypes. A longstanding problem in the field was understanding the limitations of the Minimum Error Correction (MEC) approach in haplotype assembly, for which we developed a solid framework. Check out the PhaseME, Hap10, and HapManifold publications.
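As a rough illustration of the MEC criterion mentioned above, the following sketch scores a candidate diploid phasing by the number of read alleles that would have to be corrected for every read to agree with one of the two haplotypes. It is a toy under stated assumptions, not the method used in PhaseME, Hap10, or HapManifold: reads are lists of 0/1 alleles with None for uncovered SNVs, the second haplotype is taken to be the complement of the first, and the exhaustive search and example reads are hypothetical.

```python
from itertools import product


def hamming(read, hap):
    """Count mismatches over the SNVs the read covers (None = not covered)."""
    return sum(1 for r, h in zip(read, hap) if r is not None and r != h)


def mec_score(reads, hap):
    """MEC of a diploid phasing: each read is charged against its closer haplotype."""
    comp = [1 - a for a in hap]  # complementary haplotype (heterozygous sites only)
    return sum(min(hamming(r, hap), hamming(r, comp)) for r in reads)


def best_phasing(reads, n_snvs):
    """Exhaustive MEC minimization; feasible only for a handful of SNVs."""
    # Fix the first allele to 0, since a haplotype and its complement score the same.
    candidates = ([0, *rest] for rest in product((0, 1), repeat=n_snvs - 1))
    return min(candidates, key=lambda h: mec_score(reads, h))


# Hypothetical toy fragment matrix over 4 heterozygous SNVs.
reads = [
    [0, 0, None, None],
    [None, 0, 1, None],
    [1, 1, None, None],
    [None, 1, 0, 0],
    [None, None, 1, 1],
    [0, 1, None, None],  # read containing one sequencing error
]

hap = best_phasing(reads, 4)
print(hap, mec_score(reads, hap))
```

The exhaustive search doubles in cost with every additional SNV, which is why practical tools rely on heuristics or relaxations such as the low-rank formulation rather than enumerating phasings.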

Haplotype modeling