CGRLab

QfO-9

2026-01-07T00:00:00+00:00

We are organising the 9th edition of the Quest for Orthologs meeting on August 29–30, 2026 (right before ECCB), in Lausanne, Switzerland.

Hiring PhD students and PostDocs!

2026-01-01T00:00:00+00:00

The Computational Genomics Research Lab is hiring PhD students and PostDocs in the field of computational genomics. Interested candidates are welcome to send their CV and a letter of interest to Sina’s Gmail address (see adverts here).

Launch of the CGR Lab!

2026-01-01T00:00:00+00:00

We are excited to announce that the Computational Genomics Research Lab will be launched in 2026.

YouTube Videos

2025-12-31T00:00:00+00:00

Two recordings are now available on YouTube: a teaching session on DNA indexing with k-mers, and a journal club presentation on “DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome”.

FastOMA is published!

2025-01-07T00:00:00+00:00

We are delighted to announce that FastOMA is published in Nature Methods!

Genomic data is expanding at a rapid pace, driven by ambitious efforts to sequence the DNA of millions of species worldwide. Comparative genomics, essentially the science of comparing genomes across species, helps us understand the evolutionary relationships between species. A key part of this is to find homologous regions, which are regions of DNA that are shared across species due to having a common ancestor.

When it comes to homologous genes, there are two main types to know about: orthologs and paralogs. Orthologs are genes that started diverging because of speciation (evolutionary branching into new species), while paralogs diverged because of gene duplication. Orthologs often have similar functions across species, which makes them extremely useful for transferring knowledge from well-studied organisms to newly sequenced ones (Nicheperovich 2022).
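The distinction can be illustrated with a toy gene tree. The sketch below is only an illustration, not part of FastOMA: it assumes the internal nodes of the tree are already labelled as "speciation" or "duplication" (which in practice is the hard part), and the tuple-based tree encoding and gene names are made up for the example. Each pair of genes is labelled by the event at their last common ancestor.

```python
from itertools import combinations

# A leaf is a gene name (str); an internal node is (event, [children]),
# where event is "speciation" or "duplication". This encoding is hypothetical.

def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]

def classify_pairs(node, out=None):
    """Label every pair of genes by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, children = node
    label = "ortholog" if event == "speciation" else "paralog"
    # Genes sitting in different child subtrees meet for the first time here.
    for left, right in combinations(children, 2):
        for a in leaves(left):
            for b in leaves(right):
                out[frozenset((a, b))] = label
    for child in children:
        classify_pairs(child, out)
    return out

# An ancient duplication produced copies A and B, then human and mouse split.
toy_tree = ("duplication", [
    ("speciation", ["human_A", "mouse_A"]),
    ("speciation", ["human_B", "mouse_B"]),
])

for pair, label in classify_pairs(toy_tree).items():
    print(sorted(pair), label)
```

In this toy tree, human_A and mouse_A are orthologs, while human_A and human_B (and also human_A and mouse_B) are paralogs because they trace back to the duplication.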

A bit of History!

The idea of distinguishing orthologs from paralogs goes back to Walter Fitch’s seminal work at the University of Wisconsin in 1970 (Fitch 1970). Since then, several research groups have been working on algorithms to accurately estimate orthology. One of the first contributions was the Clusters of Orthologous Groups of proteins (COGs) database, launched by NCBI in 2000, covering 21 genomes of bacteria, archaea, and eukaryotes (Tatusov 2000). More recently, the OrthoFinder tool made it possible to find orthologs for a set of genomes of interest with high accuracy. This well-known software uses fast all-against-all gene comparisons with DIAMOND to group genes into orthogroups and then refines them with gene trees. Earlier this year, SonicParanoid presented its second version, which uses machine learning to avoid unnecessary all-against-all alignments, making it even faster. All these exciting advancements highlight the thriving community working in the field of orthology and comparative genomics.

The OMA (Orthologous MAtrix) project came along in 2004 as a method and database for identifying orthologs across genomes (Dessimoz et al. 2005). The original OMA algorithm uses all-against-all gene comparisons with Smith-Waterman to find homologous sequences and then infers orthology relationships from there. Since 2010, Adrian Altenhoff has been the OMA project manager and OMA is hosted at the Comparative Genomics lab, led by Christophe Dessimoz and Natasha Glover. In 2017, Clément Train, a talented PhD student in the lab, took things to the next level with OMA algorithm 2.0, which delivered high precision in orthology inference (Train et al. 2017). Fast forward to today, the OMA Browser has seen 24 major updates, and the orthology data of around 3000 genomes is now presented for easy access, with visualization innovations for phylostratigraphy, synteny and gene information (Altenhoff et al. 2024). Along the way, OMA also became a core resource supported by the SIB Swiss Institute of Bioinformatics.

In 2021, I joined the Comparative Genomics lab in Lausanne as a postdoc, took a leap of faith and started working on developing a new algorithm for orthology. The goal was to make it work for several thousands of species, basically scaling to the tree of life, something that’s really needed these days. At first, it felt quite overwhelming as there were several efficient ortholog inference tools such as PANTHER, OrthoMCL, OrthoFinder, SonicParanoid, Ensembl Compara, Domainoid, MetaPhOrs, TOGA and GETHOGS (to name only a few) that are being maintained rigorously and regularly. The developers of these tools made great contributions to the field, and the huge number of comparative genomics studies over the years wouldn’t have been possible without this software. Their intricate design and comprehensive algorithms are accurate and efficient, making it hard to imagine advancing the field even further.

On top of that, I was new to the field: my PhD was on diploid and polyploid haplotype phasing using DNA sequencing reads (Majidian et al. 2020) and my background is in engineering and signal processing. Still, I embarked on this journey and started learning concepts and methods in comparative genomics. I was lucky to have great mentors and lab mates who were always open to answering my questions, over Zoom and in person.

OMA turns young!

Let’s talk about FastOMA. With contributions from several lab members (Stefano, Yannis, Ali, Alex, David) and guidance from Christophe, Adrian and Natasha, we developed and implemented the FastOMA method. FastOMA builds on the current knowledge of orthology available in the OMA Browser. It first maps the input genes (at the amino-acid level) to reference gene families (the Hierarchical Orthologous Groups, HOGs), using OMAmer, a fast k-mer-based mapper. To learn about HOGs, see this YouTube video by Natasha. Next, FastOMA works on each family separately. In other words, FastOMA does not compare genes from one family with genes from another, since genes in different families are not homologous. This is an important step that saves a huge amount of computation. Then, FastOMA infers gene trees on (a subsample of) the genes at each taxonomic level to distinguish orthologs from paralogs within each family. This phylogeny-guided subsampling is also key to maintaining speed and accuracy at the same time.
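To give a feel for the first two steps (k-mer-based placement into families, then working family by family), here is a minimal toy sketch. It is not OMAmer’s actual algorithm or API: the function names, the choice of k = 3, the simple "most shared k-mers wins" rule, and the example sequences and family IDs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(families, k=3):
    """Map each k-mer to the set of family IDs whose members contain it."""
    index = defaultdict(set)
    for fam_id, members in families.items():
        for seq in members:
            for km in kmers(seq, k):
                index[km].add(fam_id)
    return index

def assign_family(seq, index, k=3):
    """Return the family sharing the most k-mers with seq, or None if no hit."""
    votes = Counter()
    for km in kmers(seq, k):
        for fam_id in index.get(km, ()):
            votes[fam_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

def group_by_family(queries, index, k=3):
    """Group query genes by assigned family; each group is then handled alone."""
    groups = defaultdict(list)
    for name, seq in queries.items():
        groups[assign_family(seq, index, k)].append(name)
    return dict(groups)

# Made-up reference families and query proteins.
reference = {
    "HOG_A": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "HOG_B": ["GSHMLEDPVW", "GSHMLEEPVW"],
}
queries = {"gene1": "MKTAYIAKQK", "gene2": "GSHMLEDPAW", "gene3": "CCCCCCCC"}

index = build_index(reference)
print(group_by_family(queries, index))
```

Because every query is placed into a family through an index lookup, no query is ever aligned against all other genes, which is where the savings over all-against-all comparison come from. Genes with no hit (grouped under None here) would need separate handling.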

FastOMA’s speed makes it possible to handle genomic datasets with thousands of species. FastOMA draws on “OMA’s knowledge” and is now swift as OMA turns young. FastOMA achieves high accuracy and resolution, as shown by the Quest for Orthologs benchmarks (Majidian, 2024).

To the future!

As a community, we work collaboratively to advance the field, and the lab has been contributing to the benchmarking datasets that make it possible to compare the performance of different tools. Earlier this year, in July, the Quest for Orthologs meeting (QfO 8) was held at the University of Montreal, where recent advancements in orthology inference were discussed and FastOMA was presented. QfO 9 will be in Switzerland in 2026!

There are several directions for improving FastOMA’s accuracy and speed further. One exciting direction is taking advantage of recent advancements in protein structure prediction to reconstruct structural trees (Moi et al. 2023) in the context of orthology inference. This could really help boost resolution at deeper evolutionary levels. It would also be very interesting to use gene order conservation, a.k.a. synteny (Bernard et al. 2024), as an additional layer of information to refine orthology predictions. We hope that our hierarchical approach, together with these ideas, will stimulate further developments.

So far, FastOMA has caught the attention of several labs around the world, who have incorporated it in their studies. We are excited to hear how you plan to use FastOMA in your own research. Feel free to create a GitHub issue (https://github.com/DessimozLab/FastOMA) or send us an email if any help is needed!

To learn more, see the FastOMA academy: https://omabrowser.org/oma/academy/module/fastOMA

Our Research

2025-01-01T00:00:00+00:00

The Computational Genomics Research Lab is led by Dr. Sina Majidian and is located at the Data Science and AI division at Chalmers University of Technology. We use machine learning and statistical analysis to address biological questions across four main research topics in genomics.

Personalized human pangenome

Pangenomes are now redefining our understanding of genetic variation across populations and genome evolution across species. A single reference genome cannot fully represent human genetic diversity, even when it is complete enough to be considered “Telomere-to-Telomere”. To fully harness the power of pangenomes in biomedicine, there is a pressing need for efficient methods to store, visualize, and extract relevant information. Our lab aims to understand human genome variation and evolution across different genomic regions by developing interpretable and efficient methods in comparative pangenomics, leveraging machine learning methods and statistical analysis. Check out the GIAB genomic stratifications and ImputeFirst publications.

Variant effect prediction via genomic language models

Alignment enables the identification of similar regions across species, encompassing both genic and intergenic regions. Unlike traditional annotation liftover approaches, which are largely restricted to coding regions, we aim to discover novel homologies and population-specific differences by leveraging large language models such as DNABERT, MSA Transformer, and GPN-MSA. This work helps identify alternative model organisms that best represent regulatory elements for further study. Additionally, leveraging cross-species conservation and genomic context enables more accurate predictions of allele effects and a deeper understanding of genotype–phenotype relationships. We have ongoing collaborations with Johns Hopkins University and the Technical University of Munich on this line of research.

Comparative genomics

Comparative analysis reveals biological relationships and evolution between species, and enhances understanding of gene function and structure. We have developed FastOMA, an accurate method for orthology inference at scale, published in Nature Methods. Whereas current methods are limited to analyzing tens to hundreds of genomes, FastOMA enables the analysis of thousands of species. FastOMA maps input genes to reference gene families using k-mers and infers gene trees at each taxonomic level to distinguish orthologs from paralogs. This approach avoids comparing genes across different families, which do not share homology, significantly reducing computational complexity. Its scalability has been demonstrated by inferring the evolutionary history of all human gene families and identifying duplicated, lost, and gained genes across all the 2,000 UniProt eukaryotic reference species in one day using 300 CPU cores. Check out the Read2Tree and FastOMA publications.

Haplotype assembly using matrix completion

Reconstructing haplotypes from sequencing reads by linking alleles at heterozygous variants is crucial in the study of Mendelian diseases, cancer genomics and drug response. We have developed software for estimating haplotype blocks from single nucleotide variants (SNVs) called from DNA sequencing reads. In one project, we used low-rank matrix recovery for haplotype estimation and applied it to human sequencing data. In another project, we studied hexaploid sweet potato (Ipomoea batatas) using a 10X Genomics linked-read dataset, which resulted in long and accurate haplotypes. A longstanding problem in the field was understanding the limitations of the Minimum Error Correction (MEC) approach in haplotype assembly, for which we developed a solid framework. Check out the PhaseME, Hap10, and HapManifold publications.
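For readers unfamiliar with MEC, the small sketch below computes the standard diploid MEC score: for a candidate haplotype and its complement, each read is assigned to whichever of the two it disagrees with least, and the disagreements are summed. This is a textbook illustration rather than code from PhaseME, Hap10 or HapManifold; the read matrix is made up, `None` is an assumed encoding for SNV positions a read does not cover, and the brute-force search is only feasible for a handful of SNVs.

```python
from itertools import product

def mismatches(read, hap):
    """Count covered positions where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r is not None and r != h)

def mec_score(reads, hap):
    """Diploid MEC: each read is charged its distance to the closer haplotype."""
    complement = [1 - allele for allele in hap]
    return sum(min(mismatches(r, hap), mismatches(r, complement)) for r in reads)

def brute_force_mec(reads, n_snvs):
    """Try every 0/1 haplotype and return one with the minimum MEC score."""
    return min(product((0, 1), repeat=n_snvs), key=lambda h: mec_score(reads, h))

# Rows are reads, columns are heterozygous SNVs; None marks uncovered sites.
reads = [
    [0, 0, None, 1],
    [1, 1, 0, None],
    [None, 0, 1, 1],
    [0, 1, 1, None],  # disagrees with the first read on how SNVs 1 and 2 are linked
]

best = brute_force_mec(reads, 4)
print(best, mec_score(reads, best))
```

Here the first and last reads disagree on the phase between the first two SNVs, so no haplotype can explain all reads and the minimum MEC score is 1.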