The first step is to familiarize yourself with the key concepts and literature. Here is a list of review papers and book chapters to get you started.
- Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 2019, Paper.
- Transformers and genome language models. Nature Machine Intelligence, 2025, Paper.
- Genomic language models: opportunities and challenges. Trends in Genetics, 2025, Paper.
And these are some introductory YouTube videos:
- Large Language Models in Computational Biology by Jian Ma, Link, 43 mins.
- MIA Primer: Gokcen Eraslan, A Primer on DNA Foundation Modeling, Link, 61 mins.
Once you have the background, move on to the recent methods papers:
- Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology, 2024, Paper.
- A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nature Biotechnology, 2025, Paper.
- Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models. Paper.
After the reading, the next steps are to identify knowledge gaps and areas where the existing methods could be improved. It also helps to run the tools on small datasets and review their outputs.
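As a rough illustration of the "small dataset" step, here is a minimal Python sketch that cuts a FASTA input down to a few short sequences and collects one score per sequence for manual review. All names in it (`read_fasta`, `small_subset`, `run_and_review`, `gc_fraction`) are hypothetical helpers written for this example, not part of any tool from the papers above. The GC-fraction scorer is only a stand-in: in practice you would replace it with a call to whichever model you are testing.

```python
import random


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def small_subset(records, n=3, max_len=20, seed=0):
    """Pick up to n records at random and truncate each to max_len bases."""
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    picked = rng.sample(records, k=min(n, len(records)))
    return [(name, seq[:max_len]) for name, seq in picked]


def gc_fraction(seq):
    """Placeholder scorer (NOT a language model): fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def run_and_review(records, score_fn):
    """Apply score_fn to each record; return (name, length, score) rows."""
    return [(name, len(seq), score_fn(seq)) for name, seq in records]


if __name__ == "__main__":
    # Toy input invented for this example; use a real FASTA file in practice.
    toy_fasta = [">a", "ACGT", "ACGT", ">b", "GGCC", ">c", "ATAT"]
    subset = small_subset(read_fasta(toy_fasta), n=2, max_len=6)
    for row in run_and_review(subset, gc_fraction):
        print(row)
```

Keeping the scorer as a plain function argument means the same harness can be reused across tools, so their outputs on the same small subset can be compared side by side.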