Pratik Dutta is a Senior Research Scientist in the Department of Biomedical Informatics at Stony Brook University, working in Prof. Ramana V. Davuluri's lab. His research builds interpretable genomic foundation models that learn the language of DNA and can explain the biological mechanisms behind their predictions, with applications across cancer, neurodegeneration, viral genomics, spatial biology, and population-scale variation. He is broadly interested in closing the gap between the scale of modern foundation models and the interpretability we need to act on them clinically, and his recent work spans pre-training large genomic language models, predicting the regulatory impact of non-coding variants, and orchestrating multi-modal reasoning over biological knowledge. Before Stony Brook, he completed his PhD at IIT Patna under Dr. Sriparna Saha as a Visvesvaraya Research Fellow, with bachelor's and master's degrees from IIEST Shibpur. He is currently exploring tenure-track faculty opportunities and welcomes prospective collaborators and students.
Five interlocking threads share one goal: AI that doesn't just predict, but explains. They range from building foundation models for the genome to extending them across spatial tissue context and human population variation.
Pre-training transformers on raw DNA so they learn the regulatory grammar, splicing, and chromatin context biology cares about. Co-developer of the DNABERT series.
462 fine-tuned regulatory models that score the impact of non-coding variants across pan-cancer and neurodegenerative cohorts, with attention-based motif evidence. Next: agentic orchestration over the model bank.
Regulome-R orchestrates pretrained genomic models with biological knowledge graphs and literature, moving from "this variant is significant" to "this variant disrupts this TF in this cell type."
Extending interpretable foundation models into spatial tissue context: benchmarking deep learning architectures and quantifying uncertainty for cell-cell interaction inference. Active collaboration with Oak Ridge National Laboratory.
Using genomic language model embeddings to study human population structure and ancestry-specific regulatory variation, with the goal of building interpretable variant effect predictors that work across diverse populations.
mSystems, American Society for Microbiology, 2026 · Under review
International Conference on Learning Representations (ICLR), 2024
NAR Genomics and Bioinformatics, Oxford Academic, 2026 · NARGAB-2025-333.R1
31st Annual Intelligent Systems for Molecular Biology / 22nd European Conference on Computational Biology (ISMB/ECCB), 2025 · CAMDA 2nd place
Bioinformatics, vol. 40, Oxford Academic, 2024
Bioinformatics Advances, vol. 3, vbad075, Oxford University Press, 2023
IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE Computer Society, 2021
Scientific Reports, vol. 10, Nature Publishing Group, 2020
IEEE Journal of Biomedical and Health Informatics, IEEE, 2020
20th Machine Learning in Computational Biology (MLCB), 2025
NeurIPS 2025 Workshop on Imageomics: Discovering Biological Knowledge from Images Using AI, 2025
Annual Meeting of the Association for Computational Linguistics (ACL), 2020 · Core A*
Full publication list on Google Scholar
Indian Institute of Engineering Science and Technology (IIEST), Shibpur · Formerly Bengal Engineering and Science University
Indian Institute of Engineering Science and Technology (IIEST), Shibpur · Advisor: Dr. Hafizur Rahaman
Indian Institute of Technology Patna · Visvesvaraya Research Fellow · Advisor: Dr. Sriparna Saha
Strand Life Sciences, Bangalore · BioBERT for clinical phenotype extraction · Advisors: Dr. Vamsi Veeramachaneni and Dr. Rajesh Sundaresan (IISc)
Stony Brook Cancer Center · Stony Brook University · Advisor: Prof. Ramana V. Davuluri
Department of Biomedical Informatics · Stony Brook University · Davuluri Lab
Open-source releases used by groups worldwide, peer review for top journals and conferences, invited talks, and active mentorship of graduate and undergraduate researchers.
The best way to reach me is by email. I am on the academic job market for tenure-track positions and welcome conversations with prospective collaborators, students, and search committees.