diff --git a/auxiliaries/abreviations.tex b/auxiliaries/abreviations.tex index 3097663..3792325 100644 --- a/auxiliaries/abreviations.tex +++ b/auxiliaries/abreviations.tex @@ -507,4 +507,10 @@ \newacronym{ipTM}{ipTM}{interface predicted Template Modeling score} \glsaddall -\printglossary[type=\acronymtype, title=Liste des abréviations, toctitle=Liste des abréviations] \ No newline at end of file +\printglossary[type=\acronymtype, title=Liste des abréviations, toctitle=Liste des abréviations] +\newglossaryentry{API} +{ + name={API}, + description={Application Programming Interface}, + type=\acronymtype +} diff --git a/auxiliaries/background.tex b/auxiliaries/background.tex index 2b7c070..7d2b1dc 100644 --- a/auxiliaries/background.tex +++ b/auxiliaries/background.tex @@ -1,63 +1,14 @@ \raggedbottom % Allow flexible page heights to reduce underfull vbox warnings -\chapter*{Personal Motivation} -\addcontentsline{toc}{chapter}{Personal Motivation} +\chapter*{Acknowledgements} +\addcontentsline{toc}{chapter}{Acknowledgements} -This Ph.D. started relatively late in my career. This chapter explains the personal and professional path that led to this thesis, and the objectives I set for myself when starting it. +This Ph.D.\ was undertaken at the Institut Pasteur, in the ML4IG group led by Laura Cantini and in close collaboration with Gabriel Peyré's group at the ENS Centre de Science des Données. The work presented in this thesis began in 2022, building on earlier research at the Broad Institute of MIT and Harvard and on applied work in the biotechnology industry. These varied experiences shaped the perspective and the questions that ultimately drove this thesis. -\section{Background to the thesis} +\bigskip -\subsection{The PiPle Project} -Many of the opportunities I had after school have been very exciting. 
Initially, I decided to create a company called PiPle with a friend, Paul Best, who is now a Post-doctoral Researcher at the University of Vienna in Machine Learning for bio-acoustics. +Prior to this Ph.D., I worked at the Broad Institute, where I contributed to several high-impact computational biology projects and began developing the ideas around large-scale gene regulatory network inference that underpin this thesis. A subsequent role at Whitelab Genomics in Paris provided direct exposure to the practical challenges of deploying foundation models for therapeutic applications, reinforcing the importance of rigorous benchmarking and reproducible open-source tooling. -Funnily enough, it was completely unrelated to biology. We worked on creating novel means of communication. We had---and still have---big ideas for improving utterly inadequate messaging apps, emails, and similar tools through machine learning and innovative design. Doing this, we learned a lot about managing complex projects, selling ideas, building large codebases, teamwork, and designing interfaces. +\bigskip -However, we did not gain enough traction from this, and after a year of hard work, we felt the road ahead was paved with too many sacrifices. - -\subsection{The Broad Institute} -I passed on Ph.D. opportunities a second time to work at the Broad Institute instead. Having visited the labs, Boston, and Kendall Square, I knew this was the kind of experience I wanted, and Ph.D.s seemed long and cumbersome. At Broad, I worked on many very-high-impact research projects, and I felt I was part of something bigger than myself. I published as the first author and even started my own research projects, which would inform the thesis I am presenting here. - -While I still understood that a Ph.D. was the best place to undergo such projects, I was uncertain about the specifics. I also understood the length, harshness, and sometimes arbitrary nature of U.S. Ph.D. programs. 
I also wanted to continue working on team-based projects and wanted to experience the start-up environment. - -\subsection{Whitelab Genomics} -\begin{figure}[ht] - \centering - \includegraphics[width=0.6\textwidth]{./figures/whitelabgx.jpeg} - \caption[The Whitelab Genomics team]{The Whitelab Genomics team in September 2023, in its future4care offices} - \label{fig:whitelabgx} -\end{figure} - -Along with other personal decisions, it led me to return to France and take on the role of team lead for the computational biology group at Whitelab Genomics in Paris. - -At Whitelab, I learned how to build a team and manage people. I learned a lot about what it means to grow companies from 10 to 50 people. I also learned about the biotech industry and how to build and sell such products. - -Whitelab had a good mix of expertise in computational biology, machine learning, structural biology, and business development (see Figure ~\ref{fig:whitelabgx}). While starting the first project there, I significantly enhanced the potential of foundation models for the biotech industry. - -From \gls{DNA} language models to cell foundation models and knowledge-graph-based models, it became clear that they would be the path forward for aggregating sparse, disparate information across many fields of biology and medicine. - -\subsection{Starting the Ph.D.} -I was not looking for any other positions and intended to stay at least a few years to assess how we had grown during that time. - -However, I was already in contact with Laura Cantini, with whom I had previously discussed Ph.D. projects. At some point, Laura came back to me with this Ph.D. proposal. I spent the better part of a month in a challenging position, thinking about which decision would not become a regret in the future. - -There was no perfect time to do this, but it felt like it was now or never. I was also very impressed by the level of various Ph.D. students in the labs of Laura and Gabriel. 
Seeing people 4 years younger than me already with such a high level of expertise and knowledge was very humbling. Finally, the Ph.D. topic and group were really on point with what I wanted to do. But mostly, my work/life environment was welcoming, surrounded by family, friends, and activities. I knew what I wanted to work on and what I wanted to learn. - -Therefore, I decided to start this Ph.D. journey. - -\section{Personal objectives} -\label{personal-objectives-during-the-thesis} - -\textit{This is copied from my initial objectives written in my research proposal at the start of the Ph.D.} - -I had the chance to see many friends completing their Ph.D.s before starting mine. A main mistake I saw during one's Ph.D.~is not seeing the time passing by. My goal for this Ph.D.~was to be as product-first as I was at Whitelab Genomics. Delivering results quickly \& improving until it is publishable. This mistake, thinking ``Well, I have 3 years\ldots'', is at least partly responsible for the stress, the crash, and the unpreparedness for what some students might experience after the Ph.D. Thus, I plan to give myself a short timeline, knowing I will likely go over. And I will prepare everything around this idea. I will also start to prepare for what is next from the get-go. - -To do that best, one needs to take the opportunity of the Ph.D.~to make connections with other labs (industry or academic). Moreover, a good piece of advice I have been given is to \emph{know what you want to do and what you don't want to do}. Know what you are here for. Learn to say no. And I learned to say no in the last 4 years. My goal is to work on large models \& large datasets, mostly in transcriptomics, and always to go back to first principles and biology. I also know I want to make something useful, create something that can be a stepping stone for others. Something that affects the community. 
I know that to do that, you have to go the extra mile in terms of development and be honest with yourself about any shortcomings. - -Finally, I have been fortunate to become addicted to my work. I like working hard and taking on challenges. But for this to happen, I need to keep enjoying what I am doing. I also wish to have no regrets about this decision. Thus, my final goal is to enjoy it as much as I can. - -\begin{figure}[ht] - \centering - \includegraphics[width=0.7\textwidth]{./figures/pasteur.jpg} - \caption[A view of the Pasteur Institute in Paris]{A view of the Pasteur Institute in Paris, where I did my Ph.D. Adopted from \citet{pasteur}.} - \label{fig:pasteur} -\end{figure} +The choice to pursue this Ph.D.\ was motivated by the conviction that the single-cell genomics field was at an inflection point: the data scale, the model architectures, and the benchmarking infrastructure were all maturing simultaneously, creating an opportunity to build something both scientifically rigorous and practically useful. That conviction shaped every design decision in this thesis. diff --git a/bibliography.bib b/bibliography.bib index 5e887d0..fcb44ed 100644 --- a/bibliography.bib +++ b/bibliography.bib @@ -7422,6 +7422,24 @@ @article{panzeriCrackingNeuralCode2017 file = {/Users/jeremie/Documents/science/computational neuro/Panzeri et al. 
- 2017 - Cracking the Neural Code for Sensory Perception by.pdf} } +@article{matsumotoSCODEEfficientRegulatory2017, + title = {{{SCODE}}: An Efficient Regulatory Network Inference Algorithm from Single-Cell {{RNA-Seq}} during Differentiation}, + shorttitle = {{{SCODE}}}, + author = {Matsumoto, Hirotaka and Kiryu, Hisanori and Furusawa, Chikara and Ko, Minoru S H and Ko, Shigeru B H and Gouda, Norio and Hayashi, Tetsutaro and Nikaido, Itoshi}, + year = 2017, + month = aug, + journal = {Bioinformatics}, + volume = {33}, + number = {15}, + pages = {2314--2321}, + issn = {1367-4803}, + doi = {10.1093/bioinformatics/btx194}, + urldate = {2026-03-18}, + abstract = {The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. 
We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE. Supplementary data are available at Bioinformatics online.}, + file = {/Users/jkobject/Zotero/storage/9AJA99WH/Matsumoto et al. - 2017 - SCODE an efficient regulatory network inference algorithm from single-cell RNA-Seq during different.pdf} +} + + @misc{PapersCodeDeep, title = {Deep {{Networks}} with {{Stochastic Depth}}}, author = {Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian}, @@ -7511,6 +7529,46 @@ @unpublished{pathakCuriositydrivenExplorationSelfsupervised2017 annotation = {00000} } +@misc{hauryTIGRESSTrustfulInference2012, + title = {{{TIGRESS}}: {{Trustful Inference}} of {{Gene REgulation}} Using {{Stability Selection}}}, + shorttitle = {{{TIGRESS}}}, + author = {Haury, Anne-Claire and Mordelet, Fantine and {Vera-Licona}, Paola and Vert, Jean-Philippe}, + year = 2012, + month = may, + number = {arXiv:1205.1181}, + eprint = {1205.1181}, + primaryclass = {stat}, + publisher = {arXiv}, + doi = {10.48550/arXiv.1205.1181}, + urldate = {2026-03-18}, + abstract = {Inferring the structure of gene regulatory networks (GRN) from gene expression data has many applications, from the elucidation of complex biological processes to the identification of potential drug targets. It is however a notoriously difficult problem, for which the many existing methods reach limited accuracy. In this paper, we formulate GRN inference as a sparse regression problem and investigate the performance of a popular feature selection method, least angle regression (LARS) combined with stability selection. We introduce a novel, robust and accurate scoring technique for stability selection, which improves the performance of feature selection with LARS. 
The resulting method, which we call TIGRESS (Trustful Inference of Gene REgulation using Stability Selection), was ranked among the top methods in the DREAM5 gene network reconstruction challenge. We investigate in depth the influence of the various parameters of the method and show that a fine parameter tuning can lead to significant improvements and state-of-the-art performance for GRN inference. TIGRESS reaches state-of-the-art performance on benchmark data. This study confirms the potential of feature selection techniques for GRN inference. Code and data are available on http://cbio.ensmp.fr/\textasciitilde ahaury. Running TIGRESS online is possible on GenePattern: http://www.broadinstitute.org/cancer/software/genepattern/.}, + archiveprefix = {arXiv}, + keywords = {Quantitative Biology - Quantitative Methods,Statistics - Machine Learning}, + file = {/Users/jkobject/Zotero/storage/IRDN7Z3X/Haury et al. - 2012 - TIGRESS Trustful Inference of Gene REgulation using Stability Selection.pdf;/Users/jkobject/Zotero/storage/39HZYHPU/1205.html} +} + +@article{thuDynGENIE3DynamicalGENIE32018, + title = {{{dynGENIE3}}: Dynamical {{GENIE3}} for the Inference of Gene Networks from Time Series Expression Data}, + shorttitle = {{{dynGENIE3}}}, + author = {{Huynh-Thu}, V{\^a}n Anh and Geurts, Pierre}, + year = 2018, + month = feb, + journal = {Scientific Reports}, + volume = {8}, + number = {1}, + pages = {3384}, + publisher = {Nature Publishing Group}, + issn = {2045-2322}, + doi = {10.1038/s41598-018-21715-0}, + urldate = {2026-03-18}, + abstract = {The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. 
In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.}, + copyright = {2018 The Author(s)}, + langid = {english}, + keywords = {Gene regulatory networks,Machine learning,Network topology,Regulatory networks,Time series}, + file = {/Users/jkobject/Zotero/storage/RGARW3TY/Huynh-Thu and Geurts - 2018 - dynGENIE3 dynamical GENIE3 for the inference of gene networks from time series expression data.pdf} +} + + @inproceedings{pathakCuriosityDrivenExplorationSelfSupervised2017a, title = {Curiosity-{{Driven Exploration}} by {{Self-Supervised Prediction}}}, author = {Pathak, Deepak and Agrawal, Pulkit and Efros, Alexei A. and Darrell, Trevor}, @@ -11133,3 +11191,44 @@ @misc{zuhriSoftpickNoAttention2025 file = {/Users/jkobject/Zotero/storage/WM3W46PZ/Zuhri et al. 
- 2025 - Softpick No Attention Sink, No Massive Activations with Rectified Softmax.pdf;/Users/jkobject/Zotero/storage/DZP4JTDF/2504.html} } + +@inproceedings{yuanCellOntologyGuided2024, + title = {Cell-ontology guided transcriptome foundation model}, + author = {Yuan, Xinyu and Zhan, Zhihao and Zhang, Zuobai and Zhou, Manqi and Zhao, Jianan and Han, Boyu and Li, Yue and Tang, Jian}, + booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems}, + year = {2024}, + note = {NeurIPS 2024 Spotlight}, + url = {https://arxiv.org/abs/2408.12373} +} + +@inproceedings{dosovitskiyImageWorth16x162021, + title = {An Image is Worth 16x16 Words: {{Transformers}} for Image Recognition at Scale}, + shorttitle = {An Image is Worth 16x16 Words}, + booktitle = {International {{Conference}} on {{Learning Representations}} ({{ICLR}})}, + author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil}, + year = {2021}, + month = may, + eprint = {2010.11929}, + archiveprefix = {arxiv}, + primaryclass = {cs.CV}, + url = {https://arxiv.org/abs/2010.11929}, + urldate = {2026-03-16}, + abstract = {While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. 
When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.} +} + +@article{schaffterGeneNetWeaverNetworkTopology2011, + title = {{{GeneNetWeaver}}: In Silico Benchmark Generation and Performance Profiling of Network Inference Methods}, + shorttitle = {{{GeneNetWeaver}}}, + author = {Schaffter, Thomas and Marbach, Daniel and Floreano, Dario}, + year = {2011}, + month = aug, + journal = {Bioinformatics}, + volume = {27}, + number = {16}, + pages = {2263--2270}, + issn = {1367-4803, 1367-4811}, + doi = {10.1093/bioinformatics/btr373}, + url = {https://academic.oup.com/bioinformatics/article/27/16/2263/254752}, + urldate = {2026-03-16}, + abstract = {Over the last decade, numerous methods have been developed for inference of regulatory networks from gene expression data. However, accurate and systematic evaluation of these methods is hampered by the difficulty of constructing adequate benchmarks and the lack of tools for a differentiated analysis of network predictions on such benchmarks. Here we describe a novel and comprehensive method for in silico benchmark generation and performance profiling of network inference methods available to the community as an open-source software called GeneNetWeaver (GNW). 
In addition to the generation of detailed dynamical models of gene regulatory networks to be used as benchmarks, GNW provides a network motif analysis that reveals systematic prediction errors, thereby indicating potential ways of improving inference methods.} +} diff --git a/chapters/background.tex b/chapters/background.tex index 01b5fc3..d64420a 100644 --- a/chapters/background.tex +++ b/chapters/background.tex @@ -4,60 +4,22 @@ \chapter*{Background} \label{chap:background} \addcontentsline{toc}{chapter}{Background} -This chapter provides the background needed to understand the contributions of this thesis. We begin with the motivations for modeling the cell, then introduce the molecular biology of gene regulation, the sequencing technologies that made large-scale cellular measurements possible, the computational tasks that have emerged from these data, and finally the artificial intelligence methods we build upon. +This chapter provides the biological and computational background needed to understand the contributions of this thesis. We introduce the molecular biology of gene regulation, the sequencing technologies that made large-scale cellular measurements possible, the computational tasks that have emerged from these data, and the artificial intelligence methods we build upon. -In the mid-17th century, Robert Hooke made a groundbreaking discovery while observing a piece of cork through his microscope. He observed structures that he named ``cells" and, as a result, marked the beginning of cellular biology \cite{petersCellsRobertHooke2024}. Cells have since been identified as life's fundamental structural and functional units, and biologists have endeavored to map the diverse cell types that comprise multicellular organisms. Additionally, they have sought to understand the transient cell states that occur during development, disease progression, and tissue regeneration \cite{Alberts2015CellsGenomes}. 
+\section{Molecular Biology of the Cell} -\section{The promises of cellular biology} - -Understanding cells at a fundamental level opens the door to transformative applications. - -\subsection{Drug Design} -Before developing a drug for a disease, one must understand the disease and identify a potential target gene or set of target genes. It refers to the genes in specific cell types that need to be reactivated, deactivated, or modified to address the disease's underlying mechanism. - -But drugs don't have to be small molecules. CAR-T cell therapies have revolutionized blood cancer treatment by modifying a patient's own immune cells to fight the cancer (see Figure ~\ref{fig:celltherapy}). Similar approaches could be developed for many other conditions \cite{zugastiCARTCellTherapy2025}. Here, the drug becomes a cell. - -Helping create these cellular drugs, as well as more classic ones, is one of the possible applications of the work we will present in this thesis. - -\begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/cell therapy.jpg} - \caption[CAR-T cell therapy]{CAR-T cell therapy. Illustration of the CAR-T cell therapy process. It is one example of cell therapy. Adopted from \citet{cartllc}} - \label{fig:celltherapy} -\end{figure} - -\subsection{Other Applications} -But life is everywhere, and cellular engineering has already helped us make better crops, create synthetic meat, and design fungi that remove pollution. When people think about nanorobots, they should think about engineered cells instead \cite{daviesSyntheticBiologyVery2018}. Finally, Richard Feynman famously said, "What I cannot create, I do not understand." Therefore, the modeling of the cell stands as a key milestone in cellular biology, and, indeed, one cannot fulfill the aforementioned promises without a correct cellular blueprint. 
- -This realization has driven hundreds of companies, from tech to bio, and dozens of institutes to pursue efforts to create virtual cellular models \cite{bunneHowBuildVirtual2024}. These virtual cells aim to simulate and predict cellular behavior computationally, enabling \textit{in silico} experimentation before costly wet-lab validation. - -\section{GRN and the Cell} - -To build such a cell model, we must first understand the biological machinery they aim to represent. In this section, we review the key molecular components of the cell, focusing on the regulatory mechanisms that govern gene expression. +The cell is life's fundamental structural and functional unit. Its behavior is governed by gene expression: the process by which the information encoded in DNA is used to synthesize RNA and proteins, which in turn execute essentially all cellular functions. Understanding and modeling gene expression at scale is the central challenge this thesis addresses. \begin{figure}[ht] \centering \includegraphics[width=0.9\textwidth]{./figures/aivc.jpg} - \caption[Image of a cell]{Image of a cell. Artist representation of a small part of a eukaryotic cell from cryo-ET images. Adopted from \citet{cellimage}} + \caption[Image of a cell]{Artist representation of a small part of a eukaryotic cell from cryo-ET images. Adopted from \citet{cellimage}} \label{fig:thecell} \end{figure} -The cell is the fundamental unit of life and is composed of various components, including proteins, nucleic acids, lipids, and carbohydrates (see Figure ~\ref{fig:thecell}). Each of these components plays a crucial role in the cell's structure and function. Proteins are responsible for most cellular processes, while nucleic acids (\gls{DNA} and \gls{RNA}) carry genetic information. Lipids form cell membranes, and carbohydrates serve as energy sources and structural components. 
- -It is at the Institut Pasteur in the 1950s that André Lwoff, Jacques Monod, and Agnes Ullmann made significant discoveries about the role of messenger \gls{RNA}, gene regulation, and genetic programs in cellular function. Together with François Jacob (see Figure ~\ref{fig:lwoffjacobmonod}), yet another Pasteur Institute scientist, they proposed the operon model of gene regulation in prokaryotes, which explained how genes are turned on and off in response to environmental signals. For their discoveries, François, André, and Jacques were awarded the Nobel Prize in Physiology or Medicine in 1965 \cite{FrontMatter2003}. It is again at the Institut Pasteur, near Monod's and Jacob's buildings, that this Ph.D. was undertaken, aiming to understand further mRNA's role and the cell's regulation using AI models. - -\begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/monod.jpg} - \caption[Lwoff, Jacob, Monod]{Lwoff, Jacob, Monod in their Pasteur Institute Office. Adopted from \citet{monod}} - \label{fig:lwoffjacobmonod} -\end{figure} - \subsection{RNA} Among the cell's molecular components, \gls{RNA} plays a particularly central role. \gls{RNA} biology is a critical aspect of cellular function, encompassing processes such as transcription, translation, and regulation. Transcription is the process by which \gls{DNA} is copied into \gls{RNA}, which then serves as a template for protein synthesis during translation. Regulation of these processes is essential for maintaining cellular homeostasis and responding to environmental changes. This regulation can occur at multiple levels, including transcriptional control, RNA processing, and post-translational modifications \cite{Alberts2015CellsGenomes}. -The \gls{RNA} hypothesis posits that \gls{RNA} molecules were the first self-replicating entities, leading to the evolution of life as we know it. 
This hypothesis suggests that early life forms relied on \gls{RNA} for both genetic information storage and catalytic functions, paving the way for the development of \gls{DNA} and proteins, showing how \gls{RNA} might be one of the most central components of the cell \cite{RNAWorld2024}. - Many different types of \gls{RNA} exist, each with distinct functions. Messenger \gls{RNA} (\gls{mRNA}) carries genetic information from \gls{DNA} to ribosomes for protein synthesis, while transfer \gls{RNA} (\gls{tRNA}) and ribosomal \gls{RNA} (\gls{rRNA}) allow translation of \gls{mRNA}s into proteins. Other types of \gls{RNA}, such as small interfering \gls{RNA} (\gls{siRNA}) and microRNA (\gls{miRNA}), are involved in gene regulation and silencing. Long-non-coding \gls{RNA}s (\gls{lncRNA}s) also play crucial roles in regulating gene expression and chromatin structure \cite{chenSmallLongNoncoding2024}. \subsection{Gene Expression} @@ -72,7 +34,7 @@ \subsection{Gene Expression} Transcription factors also interact with other proteins, such as cohesin, which helps maintain chromatin and the specific 3D structure of the \gls{DNA}. Chromatin is the complex of \gls{DNA} and proteins that forms chromosomes within the nucleus of eukaryotic cells. The organization of chromatin is essential for regulating gene expression, as it determines the accessibility of \gls{DNA} to transcription machinery (see Figure ~\ref{fig:grn_1}). -Understanding the transcriptional rules and grammar of \gls{TF} binding is helping us engineer bacteria and eukaryotic cells to express specific genes. +Understanding the transcriptional rules governing \gls{TF} binding is central to predicting gene expression programs from genomic sequence. 
\begin{figure}[ht] \centering @@ -82,7 +44,7 @@ \subsection{Gene Expression} \end{figure} \subsection{Gene Regulatory Networks} -Given this complexity, biologists have relied on the concept of gene regulatory networks (\gls{GRN}s) to simplify the complex interactions within the cell \cite{badia-i-mompelGeneRegulatoryNetwork2023}. GRNs are networks of molecular interactions that govern gene expression levels in a cell. They consist of genes, transcription factors, and other regulatory elements that interact to control the timing and level of gene expression (see Figure ~\ref{fig:grn_2}). Although very coarse and likely incomplete, These modeled interactions provide insights into how cells might respond to various stimuli, differentiate into specific cell types, and maintain homeostasis. They are used by researchers every day through pathway, regulon, and other ontological relationship databases. They help us understand diseases, their mechanisms, and improve crop quality and yield. +Given this complexity, biologists have relied on the concept of gene regulatory networks (\gls{GRN}s) to summarize the interactions within the cell \cite{badia-i-mompelGeneRegulatoryNetwork2023}. \gls{GRN}s are networks of molecular interactions that govern gene expression levels in a cell. They consist of genes, transcription factors, and other regulatory elements that interact to control the timing and level of gene expression (see Figure~\ref{fig:grn_2}). Although coarse and incomplete, these modeled interactions provide insights into how cells respond to stimuli, differentiate, and maintain homeostasis. \gls{GRN}s are used operationally through pathway, regulon, and ontological relationship databases, and serve as a reference for interpreting transcriptomic data. 
\begin{figure}[ht] \centering @@ -201,7 +163,7 @@ \subsection{Definitions} \label{fig:ffn} \end{figure} -Recently, Machine Learning (\gls{ML}) has made great strides in many areas, primarily due to a significant increase in data generation, along with improvements in optimization methods and neural networks. In this Ph.D., we are piggybacking on these improvements and performing data science and machine learning on single-cell data. +Recent advances in Machine Learning (\gls{ML}) across many domains have been driven by a significant increase in data availability, alongside improvements in optimization methods and neural network architectures. This thesis applies these advances to single-cell transcriptomic data. We now present an overview of these methods and provide intuition for why they work. @@ -228,7 +190,7 @@ \subsection{Why Does It Work? Architectural Innovations} \textbf{Normalization layers} (batch normalization, layer normalization) stabilize training by normalizing intermediate activations, allowing higher learning rates and faster convergence. Layer normalization is particularly important for transformers and is used in all models. -\textbf{Tokenization and attention} make the models both more parallelizable and more complex. Indeed, it allows models to work on matrix inputs, where each input value becomes a vector of numbers (also called an embedding or token). For text, this involves subword units; for single-cell data, we can let our imagination run free; it could be genes, cells, molecules, indeed, we will be using both in our models. +\textbf{Tokenization and attention} make the models both more parallelizable and more complex. Indeed, it allows models to work on matrix inputs, where each input value becomes a vector of numbers (also called an embedding or token). For text, this involves subword units; for single-cell data, tokens can represent genes, cells, or molecules---this thesis uses both. 
We then use classical neural networks for per-token processing and the attention mechanism to enable tokens to interact with one another. Taken together, these make the model even more parallelizable in both depth and width. @@ -243,7 +205,7 @@ \subsection{Optimization and Loss Landscapes} But this is through \emph{stochastic} gradient descent (\gls{SGD}) methods like Adam that we successfully train \gls{NN}s \cite{kingmaAdamMethodStochastic2017,loshchilovDecoupledWeightDecay2019}. Indeed, using only a small subset of the data at each training step is not only much faster to minimize the loss function, but it also helps escape local minima and saddle points \cite{liVisualizingLossLandscape2018}. -To understand this, we need to understand the loss landscape. Imagine a 3D landscape where the height represents the loss value, and the two other "surface" dimensions are the model's parameters (see Figure ~\ref{fig:loss_surface}). The goal of the model is to find the lowest point in this landscape, which corresponds to the best set of parameters to fit the data. However, this landscape is very complex, with many local minima and saddles, which would prevent the model from reaching a nice minimum; it wanders blindly and can only sense its immediate surroundings. +To understand this, consider the loss landscape. Imagine a 3D landscape where the height represents the loss value, and the two other ``surface'' dimensions are the model's parameters (see Figure~\ref{fig:loss_surface}). The goal of training is to find the lowest point in this landscape, which corresponds to the best set of parameters to fit the data. However, this landscape is highly complex, with many local minima and saddle points that can trap naive optimization procedures.
\begin{figure}[ht] \centering diff --git a/chapters/intro.tex b/chapters/intro.tex index efdf29e..92d7059 100644 --- a/chapters/intro.tex +++ b/chapters/intro.tex @@ -1,240 +1,953 @@ \raggedbottom % Allow flexible page heights to reduce underfull vbox warnings \chapter*{Introduction} % Main chapter title -\addcontentsline{toc}{chapter}{Introduction} +\addcontentsline{toc}{chapter}{Introduction} \label{chap:introduction} \section{Motivation and problem setting} -The cell is the fundamental unit of life, composed of various components including proteins, nucleic acids, lipids, and carbohydrates. Despite decades of research, we still lack the ability to accurately predict how a cell will respond to a given stimulus, design targeted therapies with high confidence, or engineer cellular behavior from first principles. The central problem this thesis addresses is: \emph{can we build computational models that learn meaningful representations of cellular state from large-scale transcriptomic data, and can these representations be used to infer gene regulatory relationships and generalize to unseen biological contexts?} - -The objectives of cellular biologists are to understand and control cells, with the dream of engineering life from plants to animals and even generating entirely new synthetic life \cite{daviesSyntheticBiologyVery2018}. - -Achieving this vision requires a deep understanding of cellular mechanisms. Before developing a drug for a disease, for instance, one must identify the target genes in specific cell types that need to be reactivated, deactivated, or modified to address the disease's underlying mechanism. Understanding mRNA and siRNA has already enabled potent therapies, including some of the well-known COVID-19 vaccines. Unfortunately, many \gls{RNA} types remain poorly understood, and their functions are an active area of research. In eukaryotic cells, like our own, RNAs are produced through gene expression and are actively regulated by the cell. 
- -But drugs are not the only application of cellular understanding. Life is everywhere, and engineering has already helped us make better crops, create synthetic meat, and design fungi that remove pollution. Yet cells are extraordinarily complex, and we remain limited by our understanding of their inner workings. - -Recent advances in single-cell sequencing have begun to change this. Technologies have advanced rapidly, with studies conducted on tens of thousands of cells in the 2010s now scaling to millions \cite{regevHumanCellAtlas}, generating what have been called cell atlases. This explosion of data has driven hundreds of companies and dozens of institutes to pursue virtual cellular models \cite{bunneHowBuildVirtual2024}---computational systems that aim to simulate and predict cellular behavior, enabling \textit{in silico} experimentation before costly wet-lab validation. - -\subsection{Current Challenges} - -Single-cell sequencing itself comes with a set of challenges that directly motivate the methods developed in this thesis. The main issues are: +The central computational challenge addressed in this thesis is: given large-scale single-cell +transcriptomic data, can we learn cellular representations that capture gene regulatory +relationships and generalize across cell types, tissues, and species? Traditional approaches to +modeling gene regulation, whether correlation-based statistical methods or mechanistic kinetics +simulations, have not scaled to the combinatorial complexity of the problem. But technologies have advanced rapidly, with studies conducted on tens of thousands of cells in the 2010s now scaling to millions +\cite{regevHumanCellAtlas}, generating what have been called cell atlases. 
This explosion of data +has driven hundreds of companies and dozens of institutes to pursue virtual cellular models +\cite{bunneHowBuildVirtual2024}: computational systems that aim to simulate and predict cellular +behavior, enabling \textit{in silico} experimentation before costly wet-lab validation. The emergence of +self-supervised transformer models capable of training on tens of millions of cells opens a new +avenue, but whether this paradigm genuinely captures biology or merely memorizes distributional +statistics remains an open question. + +Concretely, this thesis presents two foundation models that address this challenge: \gls{scPRINT}, +trained on more than 50 million cells for cell-specific \gls{GRN} inference, and \gls{scPRINT}-2, +trained on over 350 million cells from 16 eukaryotic organisms, which achieves state-of-the-art +performance on zero-shot cell-type classification (75\%), expression denoising, and batch correction. +A systematic additive benchmarking framework with 42 model variants establishes the relative +contribution of each architectural and training choice. Alongside these models, this thesis +introduces BenGRN, a rigorous benchmarking suite for \gls{GRN} inference, and the Xpressor +cross-scale architecture enabling cross-attention compression between biological scales. + +\subsection{Current data-level challenges} + +Single-cell \gls{RNA-seq} itself comes with a set of challenges that directly motivate the methods +developed in this thesis: \begin{enumerate} - \item \textbf{Sparsity and noise.} Most current single-cell sequencing methods capture only 10-20\% of transcripts, leading to many zeros ("dropouts") in the data. Our models address this through denoising pretraining tasks and learned expression tokenization. - \item \textbf{Batch effects.} Strong biases in data generation make cross-dataset analysis challenging. 
Our foundation models learn batch-invariant representations through large-scale pretraining across hundreds of datasets. - \item \textbf{Limited coverage.} Many tissues, rare cell types, and non-model organisms remain undersequenced. We demonstrate cross-species generalization by training on 16 organisms and testing on unseen species. - \item \textbf{Missing modalities.} Spatial context and protein levels are often unavailable. We show zero-shot generalization to spatial transcriptomics data without spatial-specific training. + \item \textbf{Sparsity and noise.} Most current single-cell sequencing methods capture only + 10--20\% of transcripts, leading to many zeros (``dropouts'') in the data. Our models address + this through denoising pretraining tasks and learned expression tokenization \cite{eraslanSinglecellRNAseqDenoising2019}. + \item \textbf{Batch effects.} Strong biases introduced during data generation make cross-dataset + analysis challenging. Our foundation models learn batch-invariant representations through + large-scale pretraining across hundreds of datasets \cite{haghverdiBatchEffectsSinglecell2018}. + \item \textbf{Limited coverage.} Many tissues, rare cell types, and non-model organisms remain + undersequenced. We demonstrate cross-species generalization by training on 16 organisms and + testing on unseen species \cite{alsabbaghFoundationModelsMeet2023}. + \item \textbf{Missing modalities.} Spatial context and protein levels are often unavailable. We + show zero-shot generalization to spatial transcriptomics data without spatial-specific training. \end{enumerate} -These challenges define the benchmarking framework we use to evaluate our models and motivate our architectural choices. - -Beyond these data-level challenges, biological complexity itself poses fundamental obstacles. 
A single human cell contains approximately 20,000 protein-coding genes, but gene regulation extends far beyond simple on/off switches: it involves combinatorial control by transcription factors and cofactors, epigenetic modifications, post-transcriptional regulation by non-coding RNAs, alternative splicing, protein-protein interactions, and metabolic feedback loops (see ~\ref{chap:background}). These processes interact across multiple spatial and temporal scales---from millisecond signaling cascades to days-long differentiation programs---creating a system whose emergent behavior cannot be easily predicted from its individual components. It is this multi-layered complexity that makes purely mechanistic modeling insufficient and motivates the data-driven approaches developed in this thesis. - -\subsection{Gene Regulatory Networks} - -Gene regulatory networks (\gls{GRN}s) provide a simplified yet powerful framework for understanding how genes interact within cells. The foundations of this field were laid at the Institut Pasteur in the 1950s, where André Lwoff, Jacques Monod, Agnes Ullmann, and François Jacob (see Figure~\ref{fig:lwoffjacobmonod2}) made seminal discoveries about messenger \gls{RNA}, gene regulation, and the operon model \cite{FrontMatter2003}. It is again at the Institut Pasteur, near Monod's and Jacob's buildings, that this Ph.D. was undertaken, aiming to understand further mRNA's role and the cell's regulation using AI models. - +Beyond these data-level challenges, biological complexity itself poses fundamental obstacles. 
A +single human cell contains approximately 20,000 protein-coding genes, but gene regulation extends +far beyond simple on/off switches: it involves combinatorial control by transcription factors and +cofactors, epigenetic modifications, post-transcriptional regulation by non-coding RNAs, alternative +splicing, protein-protein interactions, and metabolic feedback loops \cite{badia-i-mompelGeneRegulatoryNetwork2023, desaiImprovingGeneRegulatory2017, oksuzTranscriptionFactorsInteract2023}. These processes interact across +multiple spatial and temporal scales, from millisecond signaling cascades to days-long +differentiation programs, creating a system whose emergent behavior cannot be easily predicted from +its individual components. This multi-layered complexity makes purely mechanistic modeling +insufficient and motivates the data-driven approaches developed in this thesis. + +\bigskip +The remainder of this introduction is organized as follows. Section~\ref{sec:grn} formalizes the +\gls{GRN} inference problem, reviews classical methods in depth, and surveys existing benchmarks. +Section~\ref{sec:foundations} introduces the transformer and self-supervised learning machinery +relevant to our work, surveys biological foundation models across scales, and provides a detailed +critical review of existing single-cell \gls{RNA-seq} foundation models. Section~\ref{sec:scope} +describes the scope, limitations, and contributions of this thesis chapter by chapter. + +% =================================== +\section{Gene Regulatory Network Inference} +\label{sec:grn} +% =================================== + +Understanding how genes regulate one another is a central problem in systems biology. 
Gene regulatory network inference aims to reconstruct the causal wiring diagram that governs cellular behavior from high-throughput molecular measurements, enabling predictions about how perturbations propagate through the cell \cite{badia-i-mompelGeneRegulatoryNetwork2023, huynh-thuInferringRegulatoryNetworks2010}. + +\subsection{Formal problem statement} + +Let \(\mathbf{X} \in \mathbb{R}^{n \times g}\) denote a single-cell expression matrix, where \(n\) +is the number of cells and \(g\) is the number of genes. Each row \(\mathbf{x}_i \in +\mathbb{R}^{g}\) is the expression profile of cell \(i\), with entry \(x_{ij}\) proportional to +the transcript count of gene \(j\) in cell \(i\) (after normalization). The goal of \gls{GRN} +inference is to learn a mapping \(\mathcal{F}: \mathbf{X} \mapsto G\) that extracts, from the +statistical dependencies observed in \(\mathbf{X}\), a directed weighted graph +\begin{equation} + G = (V,\, E,\, W), \quad V = \{1,\ldots,g\},\quad + E \subseteq V \times V,\quad W: E \to \mathbb{R}, +\end{equation} +where \(V\) is the set of genes, \(E\) is the set of directed regulatory edges, and \(W\) assigns +a signed weight to each edge. An edge \((j \to i) \in E\), where \(i\) and \(j\) now both index genes, with weight \(w_{ji} > 0\) encodes a +putative activating regulatory influence of gene \(j\) on gene \(i\); a negative weight encodes +repression. A useful \gls{GRN} should capture causal regulatory relationships rather than mere +statistical correlations: perturbing gene \(j\) should predictably alter the expression of +gene \(i\) \cite{pearlTheoreticalImpedimentsMachine2018}. In the transcription-factor-centric formulation that dominates the literature, we +distinguish a set of \glspl{TF} \(T \subset V\) from their target genes \(V \setminus T\), and +restrict \(E \subseteq T \times (V \setminus T)\), yielding a bipartite directed structure. + +\textbf{Cell-type-specific GRNs.} In heterogeneous tissues, a single aggregate network is +insufficient.
A cell-type-specific \gls{GRN} +\begin{equation} + G_c = (V,\, E_c,\, W_c) +\end{equation} +is defined for each cell type \(c\), inferred from the subset of cells \(\{i : \text{type}(i) = c\} +\subseteq \{1,\ldots,n\}\). Obtaining reliable cell-type-specific networks is one of the primary +aims of this thesis. + +\textbf{Why this is hard.} With \(|V| \approx 20{,}000\) protein-coding genes, the space of +possible directed edges is \(|V|^2 \approx 4 \times 10^8\). Even restricting to +\gls{TF}-to-target interactions (with \(|T| \approx 1{,}600\) human \glspl{TF}), there are +\(\sim\! 3 \times 10^7\) candidate edges, vastly exceeding the information content of any +feasible experiment. Compounding this, ground-truth networks are sparse (on the order of +\(10^3\)--\(10^4\) validated edges), inherently incomplete, cell-type-specific, and subject to +experimental noise \cite{mercatelliGeneRegulatoryNetwork2020}. + +\textbf{Distinction from gene networks.} \glspl{GRN} as defined above encode directed +\emph{regulatory} interactions, primarily \gls{TF}-to-target transcriptional control. +Gene networks (\glspl{GN}), by contrast, are broader undirected graphs encompassing +co-expression relationships, protein-protein interactions, metabolic pathway membership, and +physical proximity. All \glspl{GRN} are a subgraph of \glspl{GN}, but not vice versa. This +distinction matters for evaluation: benchmarks using co-expression networks as ground truths +cannot fairly assess the directionality claims of \gls{GRN} methods +\cite{badia-i-mompelGeneRegulatoryNetwork2023}. + +\subsection{Classical and state-of-the-art methods} + +A substantial body of computational methods has been developed for \gls{GRN} inference from bulk +and single-cell expression data. We describe the most widely used approaches in detail, focusing +on their mathematical formulation and limitations. 
+ +\subsubsection{Regression-based: GENIE3 and GRNBoost2} +GENIE3 \cite{genie3} decomposes the network inference problem into \(g\) independent regression +problems. For each target gene \(i \in V\), a random forest regressor \(f_i\) is trained to predict +the expression of gene \(i\) from the expressions of all other genes: +\begin{equation} + f_i : \mathbb{R}^{g-1} \to \mathbb{R}, \qquad + f_i\!\left( \mathbf{x}_{-i} \right) \approx x_i, +\end{equation} +where \(\mathbf{x}_{-i}\) denotes the expression vector with gene \(i\) excluded. The edge weight +\(w_{ji}\) is set to the \emph{feature importance} of gene \(j\) in regressor \(f_i\), measured by +the mean decrease in node impurity (variance reduction, since the trees are regression trees) across the ensemble of trees. This is +computed for all target genes in parallel, yielding a full \(g \times g\) weight matrix. In the +\gls{TF}-centric variant, the predictor set is restricted to \(T\), reducing the regression to +\(f_i : \mathbb{R}^{|T|} \to \mathbb{R}\). + +GRNBoost2 replaces the random forest with a stochastic gradient-boosted tree ensemble, achieving +approximately 50-fold speedup through early stopping based on +out-of-bag improvement estimates. Both methods produce directed graphs by construction (since \(f_i\) only captures +\(j \to i\)), but with several limitations: + +\textbf{Symmetry bias.} Feature importance reflects predictive association, which is largely symmetric: if gene \(j\) is predictive of gene \(i\) in \(f_i\), then gene \(i\) is typically predictive of gene \(j\) in \(f_j\), whichever of the two is the true regulator. The direction of an edge is imposed by the regression setup rather than identified from the data, so causal directionality is not encoded. + +\textbf{Indirect effects.} A chain \(j \to k \to i\) inflates the feature importance of \(j\) in \(f_i\), conflating indirect with direct regulation. Post-hoc conditioning cannot fully remove this. + +\textbf{No cell-type specificity.} The method is applied to an aggregated expression matrix; cell-type-specific networks require external stratification.
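+
+\textbf{Worked illustration of indirect effects (a sketch under simplifying assumptions).} The inflation described above can be made concrete with a two-edge linear chain \(j \to k \to i\). The linear form, the coefficients \(a\) and \(b\), and the noise terms \(\varepsilon_k\) and \(\varepsilon_i\) are illustrative assumptions introduced here; they are not part of GENIE3 or GRNBoost2:
+\begin{equation}
+  x_k = a\,x_j + \varepsilon_k, \qquad x_i = b\,x_k + \varepsilon_i,
+\end{equation}
+with \(x_j\), \(\varepsilon_k\), and \(\varepsilon_i\) mutually independent and the noise terms zero-mean. Then
+\begin{equation}
+  \operatorname{Cov}(x_j, x_i) = b\,\operatorname{Cov}(x_j, x_k) = ab\,\operatorname{Var}(x_j),
+\end{equation}
+which is nonzero whenever \(a \neq 0\) and \(b \neq 0\): gene \(j\) is predictive of gene \(i\) although no edge \(j \to i\) exists. Under the same assumptions, conditioning on the mediator removes the dependence, \(\operatorname{Cov}(x_j, x_i \mid x_k) = 0\), but this only helps when \(x_k\) is measured without error and is available to the regressor, which is plausibly not the case with dropout-affected counts and randomly subsampled candidate features.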
+ +\subsubsection{Motif-based regulon approach: pySCENIC} +A key limitation of purely expression-based methods like GENIE3 is their inability to distinguish +direct transcriptional regulation from indirect co-expression. pySCENIC addresses this by +integrating orthogonal regulatory evidence, using \gls{TF} binding motifs to filter spurious edges. + +pySCENIC \cite{aibarSCENICSinglecellRegulatory2017} and its multi-omic successor SCENIC+ +\cite{bravogonzalez-blasSCENICSinglecellMultiomic2023} combine co-expression with regulatory motif +evidence in a two-step pipeline. In the first step, GENIE3 (or GRNBoost2) identifies candidate +\gls{TF}-to-target co-expression modules: for each \gls{TF} \(t \in T\), the set of target genes +with high \(w_{ti}\) forms an initial regulon candidate. In the second step, for each candidate +regulon of \gls{TF} \(t\), \emph{motif enrichment analysis} tests whether the promoter regions +of the candidate targets are enriched for the known binding motif of \(t\), using curated +\gls{TF} motif databases (JASPAR, cisTarget). Only targets passing a false discovery rate +threshold are retained in the final regulon. + +Scoring cell-level \gls{TF} activity is then performed by AUCell: for each cell \(i\) and regulon +\(r\), the genes are ranked by expression, and the area under the recovery curve of regulon members +in the ranked list defines an enrichment score \(\text{AUC}_{ir} \in [0,1]\). This yields a +cell-by-regulon activity matrix capturing cell-type-specific \gls{TF} programs without explicit +single-cell network inference. 
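+
+\textbf{Worked illustration of the recovery-curve score (a sketch).} The enrichment score described above can be written explicitly under one possible discretization. The cutoff rank \(K\), the counting function \(R_{ir}\), and the normalization below are assumptions made for illustration and may differ from the actual AUCell implementation:
+\begin{equation}
+  R_{ir}(m) = \bigl|\{\, j \in r : \operatorname{rank}_i(j) \le m \,\}\bigr|, \qquad
+  \text{AUC}_{ir} = \frac{\sum_{m=1}^{K} R_{ir}(m)}{\sum_{m=1}^{K} \min(m, |r|)},
+\end{equation}
+where \(\operatorname{rank}_i(j)\) is the rank of gene \(j\) in cell \(i\) when genes are sorted by decreasing expression. The denominator is the value reached when all regulon members occupy the top ranks, so that \(\text{AUC}_{ir} \in [0,1]\). For example, with \(K = 5\) and a two-gene regulon whose members are ranked 1st and 3rd in cell \(i\), the counts \(R_{ir} = (1, 1, 2, 2, 2)\) sum to 8, the denominator is \(1 + 2 + 2 + 2 + 2 = 9\), and \(\text{AUC}_{ir} = 8/9 \approx 0.89\).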
+ +Key limitations: (1) the motif database is primarily curated for human and mouse, limiting +cross-species applicability; (2) motif presence is evidence of \emph{potential} binding, not +observed binding, yielding false positives; (3) directionality within the regulon (activation +versus repression) is not inferred; (4) the co-expression-first step inherits GENIE3's indirect +effects problem; (5) SCENIC+ is computationally expensive, requiring joint scRNA-seq and +\gls{scATAC-seq} data. + +\subsubsection{Information-theoretic: PIDC} +PIDC \cite{pratapaBenchmarkingAlgorithmsGene2020} applies \emph{partial information decomposition} +(PID) to identify direct gene--gene relationships. Let \(x_i, x_j, x_k\) denote the expression +values of genes \(i, j, k\) treated as random variables across cells. For a triplet of genes +\((x_i, x_j, x_k)\), the mutual information \(I(x_i; x_j)\) is decomposed into unique, redundant, +and synergistic contributions, using the Williams and Beer PID framework. The edge weight between genes \(i\) and +\(j\) is set to the unique information that \(x_j\) provides about \(x_i\) not mediated by any +third gene \(x_k\): +\begin{equation} + w_{ji} = U(x_i; x_j \setminus x_k), \qquad \forall k \neq i,j. +\end{equation} +The advantage over pairwise mutual information, which inflates edge weights for highly connected +hub genes, is that PIDC penalizes information shared with a common regulator. Limitations include +(1) cubic computational cost \(\mathcal{O}(g^3)\) prohibiting genome-wide analysis beyond +\(\sim\!5{,}000\) genes; (2) assumption of a specific noise model (typically Dirichlet) for +expression count data; and (3) inability to produce directed edges without additional assumptions. + +\subsubsection{ODE-based: SCODE} +\glspl{ODE}-based approaches model gene expression dynamics directly \cite{matsumotoSCODEEfficientRegulatory2017}. 
+Related regression-based methods include TIGRESS \cite{hauryTIGRESSTrustfulInference2012} and dynGENIE3 \cite{thuDynGENIE3DynamicalGENIE32018}, +the latter extending GENIE3 to time-series data. In the ODE-based family, expression dynamics are modeled as a linear system: +\begin{equation} + \frac{d\mathbf{x}}{dt} = \mathbf{A}\,\mathbf{x}(t) + \mathbf{b}, +\end{equation} +where \(\mathbf{A} \in \mathbb{R}^{g \times g}\) is the regulatory matrix to be inferred and +\(\mathbf{b}\) is a basal expression offset. Pseudotime ordering of cells computed from a +trajectory inference method provides a proxy for the temporal variable \(t\). The system is solved +for \(\mathbf{A}\) by minimizing the residual \(\|\dot{\mathbf{x}} - \mathbf{A}\mathbf{x} - +\mathbf{b}\|_F^2\) over the pseudotime-ordered cell sequence. The resulting matrix \(\mathbf{A}\) +directly encodes the signed regulatory graph: since \(dx_i/dt\) depends on \(x_j\) through the entry \(a_{ij}\), that entry plays the role of the edge weight \(w_{ji}\) for \(j \to i\). + +This approach has a number of critical limitations: (1) the linear assumption is a strong +approximation of the fundamentally nonlinear gene regulatory dynamics; (2) performance depends +heavily on pseudotime quality; (3) the approach requires developmental or time-series data rather +than steady-state profiles; and (4) it does not scale to genome-wide analysis. + +\subsubsection{Comparative summary} +Table~\ref{tab:grn_methods} summarizes the key properties of these methods. None of the existing +approaches simultaneously achieves directed inference, cell-type specificity, and genome-wide +scalability without requiring perturbation data or pseudotime trajectories. scPRINT, introduced +in this thesis, addresses this gap by leveraging self-attention to jointly model all genes in a +single forward pass, extracting directed cell-specific networks from the attention weights without +requiring external priors or interventional data.
+ +\begin{table}[ht] +\centering +\scriptsize +\setlength{\tabcolsep}{5pt} +\begin{tabular}{lcccc} +\toprule +\textbf{Method} & \textbf{Directed?} & \textbf{Cell-type spec.?} & \textbf{Genome-wide?} & \textbf{Needs perturbations?} \\ +\midrule +GENIE3/GRNBoost2 & Partial & No & Yes & No \\ +pySCENIC/SCENIC+ & No & Via AUCell & No & No \\ +PIDC & No & No & No ($<$5k genes) & No \\ +SCODE & Yes & Limited & No & Pseudotime \\ +scPRINT (ours) & Yes & Yes & Yes & No \\ +\bottomrule +\end{tabular} +\caption[Comparison of GRN inference methods]{Key properties of GRN inference methods discussed in this section.} +\label{tab:grn_methods} +\end{table} + +\subsection{Benchmarking of GRN inference} + +The absence of standardized benchmarks has long hampered progress in \gls{GRN} inference. Methods +are evaluated on heterogeneous datasets, using different ground truths and metrics, making +comparison unreliable. We review the main benchmarking efforts and discuss their methodological +implications. + +\subsubsection{Existing benchmarks} +\textbf{BEELINE (Pratapa et al., 2020) \cite{pratapaBenchmarkingAlgorithmsGene2020}} is the +reference systematic benchmarking effort for single-cell \gls{GRN} inference. It evaluates 12 methods +on simulated data and on curated \textit{in vivo} datasets. Simulated expression data are generated +by tools such as SERGIO \cite{dibaeiniaSERGIOSingleCellExpression2020} and BoolODE +\cite{pratapaBenchmarkingAlgorithmsGene2020}, which produce realistic \gls{scRNA-seq} count matrices +from user-defined regulatory graphs via stochastic gene expression models. Gene network tools such +as GeneNetWeaver \cite{schaffterGeneNetWeaverNetworkTopology2011} similarly provide +in silico expression data from known topologies. Key finding: no method consistently +outperforms GENIE3 across all benchmarked datasets and metrics. + +\textbf{BenGRN \cite{kalfonJkobjectBenGRNAwesome2025}} is our contribution to this landscape, +introduced in Chapter~\ref{article1}. 
BenGRN provides a benchmarking suite specifically designed +for single-cell-resolution \gls{GRN} inference, with a focus on the three ground-truth types +described below and metrics calibrated to the sparsity of real biological networks. + +\subsubsection{Metrics} +Choosing appropriate metrics is non-trivial given the extreme sparsity of \gls{GRN} ground truths +(less than 0.01\% of edges are positive). + +\textbf{AUROC} (Area Under the Receiver Operating Characteristic curve) measures the probability +that a randomly chosen positive edge is ranked above a randomly chosen negative edge. While widely +reported, AUROC is misleading for sparse \glspl{GRN}: a random ranking +achieves AUROC \(\approx 0.5\) whatever the class balance, and a ranking that places all +positives first achieves AUROC \(= 1\), but in between the false-positive rate is normalized by the very large number of negatives. With more than 99.99\% negatives, a method can therefore reach a high AUROC while its top-ranked predictions are still mostly false positives, and +improvements that matter in practice barely move the ROC curve. + +\textbf{\gls{AUPRC}} (Area Under the Precision-Recall Curve) is more informative for imbalanced +problems. Precision \(= \text{TP}/(\text{TP}+\text{FP})\) and recall \(= +\text{TP}/(\text{TP}+\text{FN})\) both depend on the positive class, making AUPRC sensitive to the +true positives recovered. However, AUPRC is still affected by the choice of positive set: dense +ground truths give higher baseline AUPRC than sparse ones. + +\textbf{\gls{EPR}} (Early Precision Ratio) measures the precision among the top-\(k\) ranked +predictions relative to that of a random predictor, where \(k\) equals the number of known positive edges. Formally, +\begin{equation} + \text{EPR} = \frac{\text{Precision@}k}{\text{random baseline}} = \frac{|\hat{E}_k \cap E^*|/k}{|E^*|/(|V|^2)}, +\end{equation} +where \(\hat{E}_k\) is the set of the \(k\) highest-scored predicted edges, \(E^*\) is the +ground-truth positive set, and the denominator is the edge density, i.e., the precision expected from a random ranking.
EPR directly captures the enrichment of true positives among the top +predictions, which is the operationally most relevant regime: practitioners use the top-scored +edges for downstream validation, not the full ranked list. EPR values greater than 1 indicate +above-random performance. + +\subsubsection{Ground truths} +\textbf{Literature-curated networks: OmniPath.} OmniPath \cite{tureiOmniPathGuidelinesGateway2016} +aggregates manually curated \gls{TF}-to-target interactions from 14 species across dozens of +databases, yielding a comprehensive directed interaction network. Its main limitations are +incompleteness (only well-studied \glspl{TF} are covered), ascertainment bias (literature is biased +toward cancer and development), and lack of cell-type specificity (interactions are aggregated +across all contexts). + +\textbf{Experimental binding data: ENCODE \gls{ChIP-seq}.} The ENCODE project +\cite{theencodeprojectconsortiumIntegratedEncyclopediaDNA2012} provides genome-wide \gls{TF} +binding sites measured by chromatin immunoprecipitation followed by sequencing. ChIP-seq identifies +genomic loci occupied by a given \gls{TF} in a specific cell line, providing physical evidence of +binding. However, binding does not imply regulation: a \gls{TF} may bind without activating or +repressing the nearby gene. ENCODE data is also restricted to a small number of cell types. + +\textbf{Perturbation-based: genome-wide Perturb-seq.} Genome-wide \gls{perturb-seq} experiments +\cite{replogleMappingInformationrichGenotypephenotype2022} knock down each gene individually using +\gls{CRISPR} interference and measure the resulting transcriptomic response. The causal effect of gene \(j\) +on gene \(i\) can then be estimated from the differential expression of gene \(i\) upon knockdown +of gene \(j\), providing causal, though not necessarily direct, evidence of a regulatory relationship. This is the closest +available ground truth to a true causal edge.
Its limitation is scale: current experiments are +conducted in a small number of cell lines (primarily K562 and RPE1), and the genome-wide perturbation +library is expensive to generate and validate. + +\textbf{Simulated expression data and gene network tools.} An important complementary source of +ground truth comes from \textit{in silico} simulated datasets generated by tools such as +SERGIO \cite{dibaeiniaSERGIOSingleCellExpression2020}, BoolODE +\cite{pratapaBenchmarkingAlgorithmsGene2020}, and GeneNetWeaver +\cite{schaffterGeneNetWeaverNetworkTopology2011}. These tools generate synthetic \gls{scRNA-seq} +count matrices from a user-specified regulatory graph, providing a fully known and controllable +ground truth. Gene network databases such as STRING and OmniPath further supply curated +interaction networks that can serve as silver standards when experimental data are unavailable. +While simulated data enable controlled evaluation, they do not fully capture the complexity and +noise of real biological systems. + +\bigskip +The limitations shared by all ground truths (incompleteness, experimental noise, cell-type specificity, +and scale constraints) are central to the difficulty of \gls{GRN} benchmarking and motivate our +design choices in BenGRN. In particular, the absence of complete genome-wide causal networks means +that the evaluation problem is inherently partial, and metrics must be interpreted relative to the +coverage of the chosen ground truth. + +% =================================== +\section{Foundation Models for Single-Cell Biology} +\label{sec:foundations} +% =================================== + +The challenges outlined above (noisy, sparse, and heterogeneous single-cell data combined with +the combinatorial complexity of gene regulation) demand models that can learn from massive datasets +without requiring exhaustive manual annotation.
The emergence of modern self-supervised +transformers pretrained on large datasets has opened a new avenue: foundation models for +single-cell biology. The key question is whether the paradigm that transformed natural language +processing can be adapted to biological data, where tokens lack a natural ordering, expression +values are continuous, and the underlying generative process is fundamentally different from +language. + +\subsection{Key methodological elements for foundation models} + +\subsubsection{Transformers and self-supervised learning} +Foundation models in machine learning are trained on large unlabeled datasets using self-supervised +objectives, then adapted to downstream tasks via fine-tuning or zero-shot inference. In computer +vision, models such as Vision Transformers (ViT) \cite{dosovitskiyImageWorth16x162021} learn from +millions of images without manual labels, achieving state-of-the-art performance on classification, +segmentation, and generation tasks. In natural language processing, models such as BERT +\cite{devlinBERTPretrainingDeep2019} and GPT \cite{brownLanguageModelsAre2020} are pretrained on +vast text corpora using masked token prediction or autoregressive generation, then fine-tuned for +translation, question answering, or text generation. The success of these models rests on three +pillars: (1) the transformer architecture with its self-attention mechanism, (2) large-scale +pretraining datasets capturing diverse contexts, and (3) self-supervised objectives that enable +learning without manual annotation. + +\subsubsection{Self-attention mechanism} +The core operation of all transformer models is multi-head self-attention +\cite{devlinBERTPretrainingDeep2019}. 
Given an input matrix +\(\mathbf{X} \in \mathbb{R}^{n \times d}\) (here \(n\) is the number of input tokens and \(d\) +is the embedding dimension), the scaled dot-product attention \cite{vaswaniAttentionAllYou2023} is: +\begin{equation} + \mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad + \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad + \mathbf{V} = \mathbf{X}\mathbf{W}_V, +\end{equation} +\begin{equation} + \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}, +\end{equation} +where \(\mathbf{W}_Q, \mathbf{W}_K \in \mathbb{R}^{d \times d_k}\) and +\(\mathbf{W}_V \in \mathbb{R}^{d \times d_v}\) are learned projection matrices. In a single-cell +context, the tokens are \emph{genes}: \(n = g\) is the number of expressed genes and each token +\(\mathbf{x}_j \in \mathbb{R}^d\) encodes the identity and expression of gene \(j\). The attention +matrix \(\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top/\sqrt{d_k}) \in [0,1]^{g \times g}\) +defines a learned weighted interaction graph between all pairs of genes, making self-attention a +natural candidate for \gls{GRN} extraction. + +Multi-head attention runs \(H\) parallel attention heads with independent weight matrices, each +capturing a different type of gene-gene relationship. The outputs of all heads are concatenated and +projected: \(\text{MHA}(\mathbf{X}) = [\text{head}_1;\ldots;\text{head}_H]\mathbf{W}_O\). + +\subsubsection{Masked gene modeling} +Self-supervised pretraining in \gls{BERT}-style models uses a masked token prediction objective. 
+In the single-cell context, this becomes \emph{masked gene modeling}~\cite{yangScBERTLargescalePretrained2022a}: a random 15\% of gene tokens +are replaced by a special \texttt{[MASK]} token, and the model is trained to predict the original +expression values from the unmasked context: +\begin{equation} + \mathcal{L}_\text{mask} = \frac{1}{|\mathcal{M}|} \sum_{j \in \mathcal{M}} + \ell\!\left( f_\theta(\mathbf{X}_{\setminus \mathcal{M}})_j,\; x_j \right), +\end{equation} +where \(\mathcal{M}\) is the set of masked gene indices, \(f_\theta\) is the transformer encoder, +and \(\ell\) is a loss function (cross-entropy on expression bins, or MSE on continuous values). +This objective forces the model to capture gene co-expression dependencies, i.e., which genes tend to +co-vary, making it a proxy for learning regulatory relationships. + +In scPRINT and scPRINT-2, we combine masked gene prediction with denoising and bottleneck losses to jointly optimize for +expression reconstruction and representation quality. Detailed examples of these multi-objective +training strategies are provided below in the reviews of state-of-the-art models. + +\subsubsection{Encoding single-cell expression data} +Adapting transformers to \gls{scRNA-seq} data requires solving three non-trivial encoding +problems: + +\textbf{Sequence length.} With \(g \approx 20{,}000\) genes, standard softmax attention has +quadratic complexity \(\mathcal{O}(g^2)\), exceeding GPU memory for full-genome analysis. To solve this problem, +three families of efficient attention mechanisms are relevant to this thesis. \textbf{Flash +Attention} \cite{daoFlashAttention2FasterAttention2023} computes exact softmax attention with +IO-aware tiling, reducing the memory footprint from \(\mathcal{O}(n^2)\) to +\(\mathcal{O}(n)\) and limiting reads and writes to GPU high-bandwidth memory, while remaining mathematically equivalent to standard +attention (compute remains quadratic).
\textbf{Performer} \cite{choromanskiRethinkingAttentionPerformers2022} approximates +the attention kernel using positive random features: +\(\text{softmax}(\mathbf{Q}\mathbf{K}^\top/\sqrt{d_k}) \approx \phi(\mathbf{Q})\phi(\mathbf{K})^\top\), +reducing complexity to \(\mathcal{O}(n)\) but with approximation error. +\textbf{Criss-cross attention}, developed in this thesis, introduces a sub-quadratic +attention pattern biased toward structurally related tokens, achieving \(\mathcal{O}(n \sqrt{n})\) +complexity while preserving long-range expressivity. + +\textbf{Expression value tokenization.} Unlike text tokens (discrete integers), expression values +are continuous non-negative counts. Existing approaches include: (i) \emph{binning}, discretizing +each value into one of \(k\) expression bins (e.g., 51 bins in scGPT \cite{cuiScGPTBuildingFoundation2024}), +losing quantitative precision; (ii) \emph{rank-ordering}, sorting genes by expression, encoding +position rather than value (Geneformer \cite{theodorisTransferLearningEnables2023}), losing +expression magnitude; (iii) \emph{learned \gls{MLP} encoder}, in which a small neural network maps the +raw count to an embedding, preserving full quantitative information (our approach in scPRINT). + +\textbf{Batch effects.} Systematic measurement biases across laboratories, protocols, and +sequencing platforms corrupt the expression values. Foundation models must either model these +explicitly (e.g., through batch-conditioned layers) or learn representations invariant to them +through large-scale exposure to diverse datasets, as we present in Chapters~\ref{article1} and~\ref{article3}. + +\subsection{Foundation models in biology} + +The transformer-based foundation model paradigm has been extended across biological scales, +yielding a hierarchy of models operating from atomic resolution to tissue level.
+ +\textbf{Protein sequences: ESM2 \cite{esm2}.} ESM2 is a family of \gls{BERT}-style protein +language models trained on 250 million protein sequences from UniRef90. At 650M parameters, ESM2 +treats amino acids as tokens and learns representations that capture evolutionary constraints and +3D structural properties without explicit structural supervision. Of direct relevance to our work: +ESM2 embeddings encode protein function and interaction propensity, making them a natural prior +for gene representations in single-cell models. scPRINT uses frozen ESM2 embeddings as initial +gene tokens, importing evolutionary information into the cellular representation. + +\textbf{Protein structure: AlphaFold2 \cite{jumperHighlyAccurateProtein2021}.} AlphaFold2 +revolutionized structural biology by predicting 3D protein structures from amino acid sequence +with near-experimental accuracy, using a transformer-based architecture with multiple sequence +alignment input and an equivariant structure module. Its successor AlphaFold3 extends this to +protein complexes and nucleic acids. + +\textbf{DNA and chromatin: Nucleotide Transformer \cite{dalla-torreNucleotideTransformerBuilding2024}.} +Nucleotide Transformer applies a \gls{BERT}-style encoder, trained with masked language modeling, to genomic sequences across 850 +species, using 6-mer tokens. It demonstrates transfer learning from DNA sequence to +functional genomics tasks including regulatory element prediction and chromatin accessibility. +These models operate at the DNA level and capture sequence-level gene regulation, complementary +to, but distinct from, the expression-level modeling of this thesis. + +\textbf{Cross-scale integration.} Protein sequence models and cell-level models operate in +disjoint representation spaces: one encodes amino acid co-evolution, the other encodes +cell-state-specific co-expression. Bridging them requires explicit cross-scale architectures.
+Chapter~\ref{article2} addresses this gap through the Xpressor architecture, which enables +multi-scale fine-tuning of a protein language model using single-cell expression objectives. + +\subsection{Single-cell RNA-seq foundation models} +\label{sec:scfm} + +Since 2021, a growing number of \gls{scFM}s have been proposed, each introducing different +architectural choices, pretraining objectives, and data regimes. We provide detailed reviews of +the most significant models, followed by brief summaries of recent additions. + +\subsubsection{scBERT (Yang et al., 2022)} +\textbf{Architecture.} scBERT \cite{yangScBERTLargescalePretrained2022a} was the first +practically deployed \gls{scFM}. It adapts the \gls{BERT} encoder with \textbf{Performer} +attention \cite{choromanskiRethinkingAttentionPerformers2022} to handle approximately 16,000 human +genes within GPU memory constraints. Each gene is represented by two components: a gene identity +embedding (a learned vector indexed by gene ID) and an expression value embedding (the expression +count discretized into one of 64 bins, then embedded). The sum of these two embeddings forms the +input token for each gene. + +\textbf{Pretraining.} scBERT was pretrained on approximately 1 million human cells from publicly +available \gls{scRNA-seq} datasets using a masked gene modeling objective: 15\% of gene tokens +are masked, and the model is trained to predict the binned expression level of each masked gene +from the remaining context, using cross-entropy loss over the expression bin vocabulary. + +\textbf{Results and limitations.} scBERT demonstrated improved cell-type classification accuracy +over \gls{PCA}-based baselines across several benchmarks. However, its evaluation was restricted +to a single downstream task (cell-type classification), training was limited to human data only, +and the coarse expression binning discards quantitative information critical for tasks such as +denoising or \gls{GRN} inference. 
The model has not been demonstrated to generalize across species +or to gene network tasks. + +\subsubsection{Geneformer (Theodoris et al., 2023)} \begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/monod.jpg} - \caption[Lwoff, Jacob, Monod]{Lwoff, Jacob, Monod in their Pasteur Institute Office. Adopted from \citet{monod}} - \label{fig:lwoffjacobmonod2} + \centering + \includegraphics[width=0.9\textwidth]{./figures/geneformer.png} + \caption[The Geneformer model]{The Geneformer model, where genes are represented as words and cells as sentences, with genes ordered by their expression level. Adopted from \citet{theodorisTransferLearningEnables2023}.} + \label{fig:geneformer} \end{figure} -Traditional computational cell models based on chemical reaction parameters attempted to simulate these regulatory dynamics but failed to generate realistic predictions of cellular behavior, largely because they could not capture the full combinatorial complexity of gene interactions \cite{badia-i-mompelGeneRegulatoryNetwork2023}. Moreover, as discussed in the Background chapter, gene regulation extends well beyond transcription factors: it involves cofactor proteins, non-coding RNAs, RNA maturation, and protein translation, with interactions spanning all molecular layers. Gene regulatory networks (\gls{GRN}s) provide a simplified framework for these interactions, while gene networks (\gls{GN}s) encompass a broader set of relationships, including protein-protein interactions and metabolic pathways. - -Current GRN inference methods suffer from additional practical limitations: they typically operate on only a small subset of genes (often restricted to known transcription factors), process only a limited number of cells (failing to scale to modern atlas-sized datasets), and do not fully exploit the richness of single-cell expression profiles \cite{genie3}. 
Despite this long history, inferring accurate cell-type-specific GRNs remains a major bottleneck in computational biology. Current methods face several challenges: (1) the combinatorial explosion of possible gene-gene interactions makes exhaustive testing infeasible, (2) correlation-based methods cannot distinguish direct from indirect effects \cite{nourisaGeneRNIBLivingBenchmark2025}, (3) perturbation data is expensive and limited in scale, and (4) ground truth networks for validation are sparse and context-specific \cite{mercatelliGeneRegulatoryNetwork2020}. These limitations motivate the development of new approaches, including the foundation model-based methods presented in this thesis. - -\section{Foundation Models} - -The challenges outlined above---noisy, sparse, and heterogeneous single-cell data combined with the combinatorial complexity of gene regulation---demand models that can learn from massive datasets without requiring exhaustive manual annotation. Traditional approaches, whether statistical (correlation-based GRN inference) or mechanistic (chemical kinetics simulations), have not scaled to the complexity of the problem. The emergence of modern AI---particularly self-supervised transformers pretrained on large datasets---has opened a new avenue: foundation models for single-cell biology. The key question is whether the paradigm that transformed natural language processing can be adapted to biological data, where tokens lack a natural ordering, expression values are continuous, and the underlying generative process is fundamentally different from language. - -\subsection{LLMs} - -In Language modelling, large language models (\gls{LLM}s) have demonstrated that self-supervised pretraining on massive text corpora produces representations that generalize across a wide range of tasks. 
Models such as BERT \cite{devlinBERTPretrainingDeep2019} and GPT \cite{brownLanguageModelsAre2020} learn contextual relationships between tokens---words or subword units---through tasks like masked language modeling or next-token prediction. The success of these models relies on three pillars: (1) the transformer architecture, which captures long-range dependencies through attention mechanisms; (2) scale, both in data and parameters, which enables emergent capabilities; and (3) the self-supervised paradigm, which removes the need for labeled data during pretraining. These principles have inspired a wave of foundation models in scientific domains, from protein sequences (ESM2 \cite{esm2}) to molecular structures (AlphaFold \cite{jumperHighlyAccurateProtein2021}), and, as we discuss below, to single-cell transcriptomics. - -\subsection{Bio-Foundation Models} - -The first practical example of a single-cell (\gls{RNA-seq}) foundation model was scBERT, released in 2021. However, it was only used and benchmarked for cell type classification and pretrained on 1 million single cells \cite{yangScBERTLargescalePretrained2022a}. The first foundational model with broader claims, Geneformer, was released a year later \cite{theodorisTransferLearningEnables2023} (see Figure ~\ref{fig:geneformer}). The authors demonstrated the model's ability to perform various single-cell tasks, including cell-type classification, gene regulatory network inference, and perturbation prediction. Geneformer was trained on a much larger dataset of 33 million single cells. - +\textbf{Architecture.} Geneformer \cite{theodorisTransferLearningEnables2023} introduced a +distinctive tokenization strategy: for each cell, all expressed genes are \emph{sorted by their +normalized expression level} from highest to lowest, producing an ordered list of gene name tokens +(gene identity indices). Expression magnitude is thus encoded implicitly by position in the +sequence rather than by an explicit value. 
This rank-ordered sequence is fed to a standard +6-layer \gls{BERT} encoder with bidirectional attention, which processes it identically to a +natural language sentence. A fixed vocabulary of 25,000 human genes constrains the model to a +single species. + +\textbf{Pretraining.} The model was pretrained on 29.9 million human single-cell transcriptomes +from diverse tissues using a masked token prediction objective: 15\% of gene tokens are masked and +the model is trained to predict their \emph{gene identities} (not expression values) from context. +This is analogous to standard \gls{BERT} masked language modeling, with gene names as the +vocabulary. + +\textbf{Results and limitations.} Geneformer demonstrated competitive performance on cell-type +classification, chromatin dynamics prediction (\gls{TF} dosage sensitivity), and \textit{in +silico} perturbation response via gene silencing or activation in the latent space. A notable +application identified the NODAL pathway in cardiomyopathy as a therapeutic target. These claims +positioned Geneformer as the first \gls{scFM} with broad multi-task reach. However, rank-ordering irreversibly discards expression magnitude: two cells with +the same gene ordering but different absolute expression levels are represented identically. This +collapses the quantitative co-expression signal essential for denoising and continuous regression +tasks. The fixed 25,000-gene human vocabulary prevents cross-species use. Furthermore, independent +evaluation \cite{boiarskyDeepDiveSingleCell2023} demonstrated that Geneformer is unreliable in +zero-shot settings, often underperforming simpler baselines such as logistic regression for +cell-type annotation and highly variable gene selection for clustering, with the masked pretraining +objective failing to generalize effectively to the evaluation datasets tested. 
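+
+\textbf{Worked example (rank-order tokenization).} To make the loss of magnitude concrete, consider
+a toy cell with three genes. The gene labels \(A, B, C\) and the normalized expression values below
+are hypothetical and chosen only for illustration; ties and the exact normalization used by
+Geneformer are ignored. Writing \(\pi(\mathbf{x})\) for the list of genes sorted by decreasing
+expression:
+\begin{equation}
+  \mathbf{x} = (x_A, x_B, x_C) = (5, 1, 3) \;\Rightarrow\; \pi(\mathbf{x}) = (A, C, B),
+  \qquad
+  \mathbf{x}' = (50, 2, 30) \;\Rightarrow\; \pi(\mathbf{x}') = (A, C, B).
+\end{equation}
+Both cells yield the same token sequence, although \(x'_A / x'_B = 25\) whereas \(x_A / x_B = 5\).
+More generally, \(\pi(h(\mathbf{x})) = \pi(\mathbf{x})\) whenever the same strictly increasing
+function \(h\) is applied to every entry of \(\mathbf{x}\), so the encoding retains only the
+within-cell ordering of genes.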
+ +\subsubsection{scGPT (Cui et al., 2024)} +\textbf{Architecture.} scGPT \cite{cuiScGPTBuildingFoundation2024} was the first \gls{scFM} to +apply a \gls{GPT}-style decoder with causal (unidirectional) self-attention. Genes are treated as +tokens in a sequence of up to 3,000 expressed genes per cell. Each token is the sum of a gene +identity embedding and an expression value embedding (expression discretized into 51 bins). Causal +attention imposes a sequential ordering on genes: each gene can only attend to previously processed +genes in the sequence. This ordering is arbitrary with no biological motivation, since gene +expression is simultaneous rather than sequential. + +\textbf{Pretraining.} scGPT was pretrained on approximately 33 million human cells from the +CellxGene corpus using three objectives: (1) autoregressive gene expression generation, which predicts +the expression of the next gene in the sequence from all preceding genes; (2) masked value +prediction, which masks 15\% of expression values and predicts them from context; and (3) cell-level +generation from a \texttt{[CLS]} token that aggregates cell-level information. + +\textbf{Fine-tuning protocol.} scGPT introduced the first systematic fine-tuning \gls{API} and +was benchmarked across four tasks: cell-type annotation, \gls{GRN} inference, perturbation prediction, +and batch correction. This multi-task framing was influential in establishing a standard evaluation +suite for \glspl{scFM}. + +\textbf{Performance and limitations.} Independent evaluations by \citet{boiarskyDeepDiveSingleCell2023} +and \citet{alsabbaghFoundationModelsMeet2023} demonstrated that scGPT does not consistently +outperform dedicated \gls{SOTA} methods on any of the four benchmarked tasks, and in several cases +underperforms simpler baselines. The causal attention design also remains a liability: because gene +expression is not a sequential process, the arbitrary gene ordering can introduce spurious +dependencies.
The model is restricted to human data only, and its large parameter count renders +training computationally expensive. + +\subsubsection{UCE (Rosen et al., 2023)} \begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/geneformer.png} - \caption[The Geneformer model]{The Geneformer model, where genes are represented as words and cells as sentences, where genes are ordered by their expression level. Adopted from \citet{theodorisTransferLearningEnables2023}.} - \label{fig:geneformer} + \centering + \includegraphics[width=0.9\textwidth]{./figures/uce.png} + \caption[Low-dimensional visualization of universal cell embeddings across species]{Low-dimensional visualization of universal cell embeddings across species, with cells colored by type. Adopted from \citet{rosenUniversalCellEmbeddings2023}.} + \label{fig:UCE} \end{figure} -However, Geneformer, like scBERT, was essentially an LLM (\gls{BERT}) applied directly to single-cell data. In this context, words are gene names, listed in order of expression level in the cell to form a sentence. This design choice raises questions about whether such direct adaptations from NLP are optimal for biological data. - -\subsection{Current Single-Cell Foundation Models and Their Limitations} -In 2023, a year after Geneformer, several additional foundation models were released. scGPT \cite{cuiScGPTBuildingFoundation2024} showcased a GPT-style architecture and presented various losses for fine-tuning. It was the first example of systematic fine-tuning in single-cell and a more in-depth benchmark across four abilities: cell type prediction, gene network inference, perturbation prediction, and batch correction. However, it did not outperform state-of-the-art methods \cite{boiarskyDeepDiveSingleCell2023, alsabbaghFoundationModelsMeet2023}. 
At the same time, Universal Cell Embedding (UCE) \cite{rosenUniversalCellEmbeddings2023} demonstrated cross-species training to achieve state-of-the-art cross-species cell embeddings, introducing a contrastive loss function for cell representation learning (see Figure ~\ref{fig:UCE}). - -\begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/uce.png} - \caption[Low-dimensional visualization of universal cell embeddings across species]{Low-dimensional visualization of universal cell embeddings across species. Each point is a cell positioned near similar cells according to this foundation model. Adopted from \citet{rosenUniversalCellEmbeddings2023}.} - \label{fig:UCE} -\end{figure} - -Finally, scFoundation \cite{haoLargescaleFoundationModel2024}, despite being closed-source, showcased a truly novel architecture specifically built for single-cell data and a novel training method based on the noise-to-sequencing-depth relationship. - -\textbf{Key bottlenecks in single-cell foundation models.} At the start of this Ph.D., we identified several limitations in existing approaches that motivated our contributions: - -\begin{enumerate} - \item \textbf{Expression tokenization.} Existing models used hand-crafted binning or rank-ordering of expression values. We hypothesized that learned tokenization could better capture the biological signal. - \item \textbf{Gene representation.} Most models learned gene embeddings from scratch, ignoring rich prior knowledge from protein sequences. We introduced protein-based gene encoding using ESM2 embeddings. - \item \textbf{Genomic context.} Gene position on chromosomes affects co-regulation, but this was ignored. We added genomic positional encoding. - \item \textbf{GRN inference.} Claims about GRN inference were not rigorously benchmarked. We developed BenGRN, a comprehensive benchmarking suite. - \item \textbf{Scalability.} Transformer quadratic complexity limited genome-wide analysis. 
We developed efficient attention mechanisms for large-scale inference. - \item \textbf{Reproducibility.} Many models were not open-source or reproducible. We committed to releasing all code, models, and benchmarks. - \item \textbf{Cross-species generalization.} Training was limited to human/mouse. We scaled to 16 organisms. -\end{enumerate} - -\section{Scientific aim} - -The limitations identified above---in both GRN inference methods and existing single-cell foundation models---point to a clear need: models that not only perform well on standard benchmarks but whose representations can be mechanistically interpreted to yield biological insight. Beyond building better architectures, we must understand \emph{what} these models learn, \emph{whether} their improvements are genuine or artifacts of evaluation choices, and \emph{how} to make them practically useful for the biological community. This requires rigorous benchmarking, reproducible training, and systematic ablation of design decisions. In what follows, we describe the initial aims that guided this Ph.D. and how they evolved into the contributions presented in the subsequent chapters. - -\subsection{Initial scientific aim} -At the start of the project, we wanted to understand how single-cell foundation models worked---or whether they worked at all---and to improve gene regulatory network (\gls{GRN}) inference using single-cell RNA sequencing (\gls{scRNA-seq}) data, sensing a possible interplay between the two. Before presenting our contributions, it is important to reflect on what we initially set out to achieve. This being a Thesis by Articles, each of the three main chapters corresponds to a specific scientific publication. - -This Ph.D. project initially aimed to develop new deep learning approaches, possibly using graph neural network architectures on large \gls{scRNA-seq} datasets, to assess their predictive performance on high-quality benchmarks and package them as an open-source Python library. 
Our principal idea was to use Graph Neural Networks (\gls{GNN})s. \gls{GNN}s are a class of deep learning layers designed to operate on graph-structured data. They are specifically tailored to handle modalities where edges connect the different input elements (nodes or vertices) \cite{batznerE3equivariantGraphNeural2022,decaoMolGANImplicitGenerative2018,ganInferringGeneRegulatory2024} (see Figure ~\ref{fig:gnn}). - -Traditional neural networks are primarily designed to process grid-like data, such as images, or sequential data, such as text. However, \gls{GNN}s extend this capability to graph-structured data by incorporating a pooling operation across connected nodes. - -\textbf{Objectives}. We wished to improve \gls{GRN} predictions from \gls{scRNA-seq} data. Our approach was: - -\begin{enumerate} -\tightlist -\item -To use larger neural network models that scale linearly with the dataset size, taking advantage of the tens of millions of data points now becoming available. -\item -To use novel \gls{GNN} layers that can reduce the model's ``search space'' by constraining the set of possible topologies it learns. -\item -To improve the pretraining and fine-tuning of these models to the predictive task they have to perform, and the constraints of the system they are predicting. -\item -To formulate better layers that correspond to the sparse interactions between genes and our current knowledge about their functions. -\item -To create formal and rational benchmarks that best capture the ability of a \gls{GRN} methodology. -\item -To assess predictions and any usefulness or lack of it by having biologists test hypotheses using the model. -\end{enumerate} - -\begin{figure}[ht] - \centering - \includegraphics[width=0.9\textwidth]{./figures/gnn.png} - \caption[Illustration of the graph neural network mechanism]{Illustration of the graph neural network mechanism, update and pooling (e.g., summing) across multiple connected nodes represented as vectors. 
Adopted from \citet{corsoGraphNeuralNetworks2024}.} - \label{fig:gnn} -\end{figure} - -\subsection{Potential Impacts} -From these initial objectives, we envisioned several impacts. This Ph.D.~project will contribute to methodological breakthroughs by providing new tools and methods for applying neural networks to unstructured data such as \gls{scRNA-seq}, and to improve the state of the art in \gls{GRN} prediction. - -The proposed methodologies will impact computational (bioinformatics, machine learning) and biomedical fields. The new architectures might address challenges faced by related fields such as environmental research, industrial biotechnology, and biofuel studies. The improved \gls{GRN} predictions will enhance our understanding of cellular processes, potentially leading to new therapeutic targets and strategies for treating diseases. The open-source Python library will democratize access to these tools, enabling researchers worldwide to apply them to their data and questions. - +\textbf{Architecture.} \gls{UCE} \cite{rosenUniversalCellEmbeddings2023} addressed the +cross-species generalization limitation of prior models by training across 36 species +simultaneously. Its defining architectural choice is to represent each gene not by a learned +embedding from scratch but by the \emph{pre-computed protein sequence embedding} from a frozen +\gls{ESM}2 protein language model \cite{esm2}. This substitutes the species-specific gene ID +lookup table with a universal protein-level representation, importing evolutionary information into +the single-cell model and enabling cross-species transfer without retraining on new species. + +\textbf{Pretraining.} UCE was trained on approximately 36 million cells from 36 species using a +contrastive learning objective: cells of the same type, sampled from different datasets or species, +are pushed together in embedding space, while cells of different types are separated. 
No expression +value prediction objective is used; the model is trained solely to align cell-type identities +across species. + +\textbf{Results and limitations.} UCE achieved \gls{SOTA} cross-species cell embeddings, clustering +evolutionarily related cell types together without species-specific adaptation. Its learned +embeddings provide a universal coordinate system for multi-species single-cell atlases. However, UCE is restricted to a zero-shot embedding paradigm with no demonstrated +fine-tuning protocol, limiting its utility for supervised downstream tasks. Its evaluation is largely +confined to embedding quality (clustering metrics) rather than biologically actionable tasks such +as \gls{GRN} inference or perturbation prediction. + +\subsubsection{scFoundation (Hao et al., 2024)} +\textbf{Architecture.} scFoundation \cite{haoLargescaleFoundationModel2024} departed from direct +NLP analogies with a novel \emph{Read-Depth Aware} (RDA) asymmetric encoder-decoder architecture. +The encoder processes a low-depth (artificially downsampled, sparse) expression profile; the +decoder reconstructs the corresponding high-depth profile from the encoder's latent representation. +This design explicitly models the relationship between sequencing depth and transcript capture +probability, making the model sensitive to the quantitative properties of single-cell count data. + +\textbf{Pretraining.} scFoundation was pretrained on approximately 19 million human single-cell +profiles from the NCBI \gls{GEO} database using the downsampling-reconstruction objective as its +sole pretraining task. + +\textbf{Results and limitations.} scFoundation reported competitive performance on expression enhancement, +cell-type annotation, drug sensitivity prediction, and perturbation response across several +benchmarks. 
However, its most significant limitation is that its model weights, +training code, and data processing pipeline were never publicly released, preventing independent +validation or replication. Its corpus is restricted to human cells, precluding cross-species +generalization. + +% =================================== \section{Scope \& contributions of the Thesis} +\label{sec:scope} +% =================================== -Given the limitations of existing approaches---graph neural networks that do not scale, foundation models with unverified claims, and a lack of rigorous benchmarks---this thesis addresses the problem of building single-cell foundation models that are simultaneously scalable, biologically meaningful, and rigorously evaluated. We investigate whether modern transformer architectures can learn cellular representations that capture gene regulatory relationships from single-cell RNA sequencing data. We explore their utility for tasks such as denoising, imputation, cell type annotation, and batch correction, and through this lens we design improvements via better architectures and training strategies. - -\subsection{Limitations} - -While the initial objectives centered on graph neural networks, early investigations revealed fundamental limitations of this approach for GRN inference. First, reliable ground-truth GRNs are largely unavailable, making it impossible to initialize a model from a known graph topology. Second, \gls{GNN}s do not scale well to the tens of thousands of genes present in a cell, and systematic benchmarks consistently show them underperforming transformer-based models on comparable tasks \cite{hussainGlobalSelfAttentionReplacement2022,chenGraphAttentionNetwork2022}. 
- -These findings led us to adopt transformers instead, which can be viewed as \gls{GNN}s operating on fully-connected graphs \cite{dwivediGeneralizationTransformerNetworks2021,joshiTransformersAreGraph2025,nichaniHowTransformersLearn2024a,shawSelfAttentionRelativePosition2018}. The principal challenge with transformers is their quadratic complexity with respect to the number of input tokens. Addressing this scalability bottleneck---making transformers scale sub-quadratically with the number of input genes and cells---became one of the central contributions of this thesis. - -Notably, transformer-based models can also be used to infer putative \gls{GRN}s directly from non-graph input data \cite{cuiScGPTBuildingFoundation2024,theodorisTransferLearningEnables2023}, a capability that standard \gls{GNN}s cannot achieve. This observation further reinforced our decision to build upon the transformer paradigm. - -It is also important to note that this thesis does not specifically address perturbation response prediction, temporal dynamics, or spatial transcriptomics as primary modeling targets. While we demonstrate zero-shot generalization to spatial data and discuss perturbation prediction as a fine-tuning task, dedicated modeling of these modalities remains an important direction for future work. - -Finally, while we managed to initiate some collaborations, fully achieving cross-disciplinary validation proved difficult---an unsurprising reality in the current landscape. This experience reinforced the need to make foundation models more accessible, which became one of the contributions of this thesis, represented not only in the effort to release easy-to-use open-source models but also in various side contributions and outreach efforts. - +This thesis addresses the +problem of building single-cell foundation models that are simultaneously scalable, biologically +meaningful, and rigorously evaluated. 
We investigate whether modern transformer architectures can +learn cellular representations that capture gene regulatory relationships from single-cell +\gls{RNA-seq} data. We explore their utility for tasks such as denoising, imputation, cell type +annotation, and batch correction, and through this lens we design improvements via better +architectures and training strategies. \subsection{Scope of the Thesis} -The limitations and opportunities described above progressively shaped the scope of this Ph.D. As we benchmarked existing single-cell foundation models and their claimed abilities, we encountered numerous shortcomings---ranging from poor usability and lack of reproducibility in pretraining to questionable architectural decisions and inconsistent evaluation practices. These observations led us to create our own model and, more broadly, to pursue a research program organized around the following axes: +While specialized architectures called \glspl{GNN} have been proposed as a natural fit for network learning and modeling, +systematic benchmarks consistently show them underperforming transformer-based models on +comparable tasks \cite{hussainGlobalSelfAttentionReplacement2022,chenGraphAttentionNetwork2022}. We thus adopt transformers instead, which can be viewed as \glspl{GNN} operating on fully-connected graphs \cite{dwivediGeneralizationTransformerNetworks2021,joshiTransformersAreGraph2025,nichaniHowTransformersLearn2024a,shawSelfAttentionRelativePosition2018}. + +The principal challenge with transformers is their quadratic complexity with respect to the number +of input tokens. Addressing this scalability bottleneck, making transformers scale sub-quadratically +with the number of input genes and cells, became one of the central contributions of this thesis.
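+
+As a rough illustration of this bottleneck (a sketch in standard single-head notation; the symbols \(m\), \(d\), and \(n\) are chosen for this illustration only), scaled dot-product attention over \(m\) input tokens, with \(m \times d\) query, key, and value matrices \(Q\), \(K\), and \(V\), computes
+\[
+  \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V ,
+\]
+where the score matrix \(Q K^{\top}\) has \(m^{2}\) entries, so time grows as \(O(m^{2} d)\) and memory as \(O(m^{2})\). Assuming a gene vocabulary of the order of \(m = 2 \times 10^{4}\), this amounts to \(4 \times 10^{8}\) scores per head and per layer. One generic route to sub-quadratic cost, given here as an example rather than as the exact mechanism developed in this thesis, is to let \(n \ll m\) latent tokens act as queries: the score matrix then has \(n m\) entries and the cost becomes \(O(n m d)\), which is linear in \(m\) for fixed \(n\).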
+ +Notably, transformer-based models can also be used to infer putative \glspl{GRN} directly from +non-graph input data \cite{cuiScGPTBuildingFoundation2024,theodorisTransferLearningEnables2023}, +a capability that standard \glspl{GNN} cannot achieve. This observation further reinforced our +decision to build upon the transformer paradigm. + +It is also important to note that this thesis does not specifically address perturbation response +prediction, temporal dynamics, or spatial transcriptomics as primary modeling targets. While we +demonstrate zero-shot generalization to spatial data and discuss perturbation prediction as a +fine-tuning task, dedicated modeling of these modalities remains an important direction for future +work. + +Finally, while several cross-disciplinary collaborations were initiated, systematic biomedical +validation at scale remains challenging in the current landscape. This gap reinforced the +importance of accessibility, which became one of the contributions of this thesis: releasing +easy-to-use open-source models alongside various outreach efforts to lower the barrier for domain +scientists. + +The limitations and opportunities described above progressively shaped the scope of this Ph.D. As +we benchmarked existing single-cell foundation models and their claimed abilities, we encountered +numerous shortcomings, ranging from poor usability and lack of reproducibility in pretraining to +questionable architectural decisions and inconsistent evaluation practices. These observations led +us to create our own model and, more broadly, to pursue a research program organized around the +following axes: \begin{enumerate} - \item \textbf{Benchmarking and evaluation.} We developed standardized, biologically grounded benchmarking suites (BenGRN, GRnnData) for GRN inference and contributed benchmarks to the Open Problems platform, addressing the lack of rigorous and reproducible evaluation in the field. 
- \item \textbf{Reproducibility and accessibility.} We committed to fully open-source releases of all models, training code, datasets, and documentation, and deployed models on community platforms (CZ Virtual Cell Models, Superb.io) with tutorials and containerized benchmarks. - \item \textbf{Novel architectures.} We designed the Xpressor cross-attention compression mechanism for learning across biological scales (Chapter~\ref{article2}), criss-cross attention for sub-quadratic scaling, and GNN-based expression encoders for leveraging neighborhood information (Chapter~\ref{article3}). - \item \textbf{Improved training strategies.} We systematically evaluated pretraining tasks (denoising vs.\ masking), loss functions (MSE, ZINB, and hybrids), input representations (normalized vs.\ raw counts), and gene tokenization approaches through an additive benchmarking framework (Chapter~\ref{article3}). - \item \textbf{Zero-shot and generative capabilities.} We demonstrated zero-shot performance on denoising, cell-type classification, batch correction, and spatial transcriptomics, as well as counterfactual generation through the Xpressor architecture (Chapters~\ref{article1} and~\ref{article3}). - \item \textbf{Applications to biological discovery.} We applied our models to real biological systems, including prostate tissue atlases and cross-species macrophage analysis, recovering known biology and generating testable hypotheses (Chapters~\ref{article1} and~\ref{article3}). - \item \textbf{Understanding model limitations.} Through systematic ablations and cross-model comparisons, we characterized the conditions under which foundation models succeed or fail, informing future model development. 
+ \item \textbf{Benchmarking and evaluation.} We developed standardized, biologically grounded + benchmarking suites (BenGRN, GRnnData) for \gls{GRN} inference and contributed benchmarks to + the Open Problems platform, addressing the lack of rigorous and reproducible evaluation in the + field. + \item \textbf{Reproducibility and accessibility.} We committed to fully open-source releases of + all models, training code, datasets, and documentation, and deployed models on community + platforms (CZ Virtual Cell Models, Superb.io) with tutorials and containerized benchmarks. + \item \textbf{Novel architectures.} We designed the Xpressor cross-attention compression + mechanism for learning across biological scales (Chapter~\ref{article2}), criss-cross attention + for sub-quadratic scaling, and \gls{GNN}-based expression encoders for leveraging neighborhood + information (Chapter~\ref{article3}). + \item \textbf{Improved training strategies.} We systematically evaluated pretraining tasks + (denoising vs.\ masking), loss functions (\gls{MSE}, \gls{ZINB}, and hybrids), input + representations (normalized vs.\ raw counts), and gene tokenization approaches through an + additive benchmarking framework (Chapter~\ref{article3}). + \item \textbf{Zero-shot and generative capabilities.} We demonstrated zero-shot performance on + denoising, cell-type classification, batch correction, and spatial transcriptomics, as well as + counterfactual generation through the Xpressor architecture + (Chapters~\ref{article1} and~\ref{article3}). + \item \textbf{Applications to biological discovery.} We applied our models to real biological + systems, including prostate tissue atlases and cross-species macrophage analysis, recovering + known biology and generating testable hypotheses (Chapters~\ref{article1} and~\ref{article3}). 
+ \item \textbf{Understanding model limitations.} Through systematic ablations and cross-model + comparisons, we characterized the conditions under which foundation models succeed or fail, + informing future model development. \end{enumerate} -Improving \gls{scFM}s to generate better representations of cells, genes, and their networks thus became the central objective of this Ph.D. We also sought to create benchmarks better suited to the single-cell genomics field, driven by real-life applicability rather than artificial metrics. Indeed, current methods often relied on synthetic data and ground truths unrepresentative of real biological systems, and the known single-cell standardized benchmarks were rarely used by early \gls{scFM} papers. +Improving \glspl{scFM} to generate better representations of cells, genes, and their networks thus +became the central objective of this Ph.D. We also sought to create benchmarks better suited to +the single-cell genomics field, driven by real-life applicability rather than artificial metrics. +Indeed, current methods often relied on synthetic data and ground truths unrepresentative of real +biological systems, and the known single-cell standardized benchmarks were rarely used by early +\gls{scFM} papers. \section{Chapters Overview \& Main Contributions} -This thesis is structured around three main publications, each presented as a chapter. Below, we provide detailed summaries of our contributions and results. +This thesis is structured around three main publications, each presented as a chapter. Below, we +provide detailed summaries of our contributions and results. \subsection{Chapter \ref{article1}: scPRINT: pretraining on 50 million cells allows robust gene network predictions} -In this chapter, we present \gls{scPRINT} (single-cell PRetrained Inference of Networks with Transformers), a large cell model designed for cell-specific gene network inference at the genome scale. 
This work addresses a fundamental challenge in cellular biology: inferring the network of molecular interactions that governs cell behavior. - -\textbf{Model architecture and training innovations.} We trained \gls{scPRINT} on more than 50 million cells from the \gls{CxG} database, representing approximately 80 billion tokens across multiple species, diseases, and ethnicities. Our model introduces several architectural innovations: (1) a protein-based gene encoding using \gls{ESM}2 embeddings, which reduces parameters while enabling cross-species generalization; (2) a learned expression tokenization via \gls{MLP} rather than hand-crafted binning; and (3) positional encoding of genomic location to capture co-regulation patterns. We designed three complementary pretraining tasks: a denoising task (transcript upsampling), a bottleneck learning task (embedding compression and reconstruction), and a label-prediction task with hierarchical classification for disentangled cell embeddings that represent different phenotypic facets. - -\textbf{Gene network inference methodology.} A critical contribution is our method for extracting cell-specific gene networks from the transformer's attention matrices, inspired by similar approaches in \gls{ESM}2 for protein contact prediction. We made this approach scalable to compute genome-wide networks for thousands of cells on commodity hardware. We also introduced an attention head selection mechanism that selects a subset of heads based on their correlation with known ground-truth networks, significantly improving network quality in larger models. - -\textbf{Comprehensive benchmarking framework.} We created BenGRN and GRnnData, novel benchmarking suites for \gls{GRN} inference that address the lack of standardized evaluation in the field. 
We benchmarked \gls{scPRINT} against \gls{scGPT}, Geneformer v2, Deep\gls{SEM}, and GENIE3 using multiple ground truth types: literature-based networks (Omnipath), cell-type-specific \gls{ChIP-seq}/perturb-seq intersections (MCalla et al.), and genome-wide perturb-seq data. Our results demonstrate that \gls{scPRINT} outperforms all other methods on most benchmarks. On the Omnipath benchmark across 26 cell types, \gls{scPRINT} recovered 67\% more connections than GENIE3 and showed superior enrichment for \gls{TF}s and their \gls{ENCODE}-validated targets (20\% of \gls{TF}s with significant enrichment, compared to 0\% for \gls{scGPT}). On the MCalla et al. cell-type-specific ground truth, \gls{scPRINT} consistently outperformed all methods on both \gls{AUPRC} and \gls{EPR} metrics. - -\textbf{Zero-shot capabilities on orthogonal tasks.} Beyond gene network inference, we demonstrated that \gls{scPRINT}'s learned cell model enables competitive zero-shot performance on denoising, cell type prediction, and batch effect correction---without fine-tuning. For denoising, \gls{scPRINT} matches \gls{SOTA} methods (MAGIC, KNNsmoothing2) on bulk populations and outperforms them on rare cell types where neighborhood-based methods fail. For cell type classification, \gls{scPRINT} achieves 62\% accuracy as a zero-shot predictor across 200+ cell types, outperforming marker-based methods like CellTypist. For batch effect correction, \gls{scPRINT} achieves competitive \gls{scIB} scores without using batch labels, outperforming all methods that similarly do not require batch annotation. - -\textbf{Biological application and discovery.} We applied \gls{scPRINT} to an atlas of 83,000 cells from normal and \gls{BPH} prostate tissues. In rare switched memory B cells, we identified early \gls{TME} markers, including BAG5, a known B-cell-associated prostate cancer marker. 
In fibroblasts, our gene networks revealed differential hub genes between normal and \gls{BPH}-associated cells, recovering known biology around PAGE4 and uncovering interconnected pathways linking ion exchange, \gls{ECM} remodeling, oxidative stress, and chronic inflammation---hallmarks of premalignant states. +In this chapter, we present \gls{scPRINT} (single-cell PRetrained Inference of Networks with +Transformers), a large cell model designed for cell-specific gene network inference at the genome +scale. This work addresses a fundamental challenge in cellular biology: inferring the network of +molecular interactions that governs cell behavior. + +\textbf{Model architecture and training innovations.} We trained \gls{scPRINT} on more than 50 +million cells from the \gls{CxG} database, representing approximately 80 billion tokens across +multiple species, diseases, and ethnicities. Our model introduces several architectural +innovations: (1) a protein-based gene encoding using \gls{ESM}2 embeddings, which reduces +parameters while enabling cross-species generalization; (2) a learned expression tokenization via +\gls{MLP} rather than hand-crafted binning; and (3) positional encoding of genomic location to +capture co-regulation patterns. We designed three complementary pretraining tasks: a denoising +task (transcript upsampling), a bottleneck learning task (embedding compression and +reconstruction), and a label-prediction task with hierarchical classification for disentangled +cell embeddings that represent different phenotypic facets. + +\textbf{Gene network inference methodology.} A critical contribution is our method for extracting +cell-specific gene networks from the transformer's attention matrices, inspired by similar +approaches in \gls{ESM}2 for protein contact prediction. We made this approach scalable to compute +genome-wide networks for thousands of cells on commodity hardware. 
We also introduced an attention +head selection mechanism that selects a subset of heads based on their correlation with known +ground-truth networks, significantly improving network quality in larger models. + +\textbf{Comprehensive benchmarking framework.} We created BenGRN and GRnnData, novel benchmarking +suites for \gls{GRN} inference that address the lack of standardized evaluation in the field. We +benchmarked \gls{scPRINT} against \gls{scGPT}, Geneformer v2, DeepSEM, and GENIE3 using multiple +ground truth types: literature-based networks (Omnipath), cell-type-specific \gls{ChIP-seq}/ +perturb-seq data (MCalla et al.), and genome-wide perturb-seq data. Our results +demonstrate that \gls{scPRINT} outperforms all other methods on most benchmarks. On the Omnipath +benchmark across 26 cell types, \gls{scPRINT} recovered 67\% more connections than GENIE3 and +showed superior enrichment for \glspl{TF} and their \gls{ENCODE}-validated targets (20\% of +\glspl{TF} with significant enrichment, compared to 0\% for \gls{scGPT}). On the MCalla et al.\ +cell-type-specific ground truth, \gls{scPRINT} consistently outperformed all methods on both +\gls{AUPRC} and \gls{EPR} metrics. + +\textbf{Zero-shot capabilities on orthogonal tasks.} Beyond gene network inference, we +demonstrated that \gls{scPRINT}'s learned cell model enables competitive zero-shot performance on +denoising, cell type prediction, and batch effect correction, without fine-tuning. For denoising, +\gls{scPRINT} matches \gls{SOTA} methods (MAGIC, KNNsmoothing2) on bulk populations and +outperforms them on rare cell types where neighborhood-based methods fail. For cell type +classification, \gls{scPRINT} achieves 62\% accuracy as a zero-shot predictor across 200+ cell +types, outperforming marker-based methods like CellTypist. 
For batch effect correction, +\gls{scPRINT} achieves competitive \gls{scIB} scores without using batch labels, outperforming all +methods that similarly do not require batch annotation. + +\textbf{Biological application and discovery.} We applied \gls{scPRINT} to an atlas of 83,000 +cells from normal and \gls{BPH} prostate tissues. In rare switched memory B cells, we identified +early \gls{TME} markers, including BAG5, a known B-cell-associated prostate cancer marker. In +fibroblasts, our gene networks revealed differential hub genes between normal and +\gls{BPH}-associated cells, recovering known biology around PAGE4 and uncovering interconnected +pathways linking ion exchange, \gls{ECM} remodeling, oxidative stress, and chronic +inflammation: hallmarks of premalignant states. \subsection{Chapter \ref{article2}: Xpressor: Towards foundation models that learn across biological scales} -In this chapter, we present Xpressor, a framework and architecture enabling cross-scale learning between biological foundation models. This work addresses a fundamental challenge: while foundation models exist at multiple biological scales (molecules, sequences, cells, tissues), they operate in isolation, unable to leverage the rich interconnections between scales. - -\textbf{Motivation and conceptual framework.} We begin with a comprehensive review of foundation models across four biological scales: \glspl{mFM} for atomistic molecular representations, \glspl{nFM} for nucleotide and amino acid sequences (\gls{DNA}, \gls{RNA}, proteins), \glspl{cFM} for cellular abundance profiles, and \glspl{tFM} for tissue-level spatial organization. We argue that information flows between scales: lower-scale models (e.g., protein sequences) can improve input representations for higher-scale models (e.g., cells), while relationships learned at higher scales can inform lower-scale representations. 
Each scale's vocabulary can be seen as built from the compressed representations of the scale below---amino acids from atoms, genes from proteins, cells from genes, tissues from cells. - -\textbf{The Xpressor architecture.} Our first contribution is a cross-attention-based compression mechanism called Xpressor that transforms high-dimensional gene-level representations into lower-dimensional cell-state vectors. The architecture introduces additional transformer blocks that perform cross-attention between the output embeddings of a foundation model and a set of learned latent tokens. It creates a bottleneck that compresses $m$ gene tokens of dimension $d_c$ into $n$ cell tokens of dimension $d_t$, where $n \ll m$ and $d_t < d_c$. Critically, the same transformer can then decompress these cell representations back to gene-level predictions using cross-attention with gene ID tokens. This compression/decompression framework is grounded in the information bottleneck theory of Tishby et al., where the goal is to retain maximal information about relevant variables while achieving compression. We further regularize the latent space using contrastive losses between embedding dimensions and dimension-specific classifiers, ensuring each cell embedding dimension captures distinct biological information. - -\textbf{Multi-scale fine-tuning approach.} Our second contribution is a method for fine-tuning lower-scale models using upper-scale tasks via adapter layers. We demonstrate this using \gls{ESM}2 (a protein language model) as the lower-scale model and \gls{scPRINT} as the upper-scale model. Rather than simply using frozen \gls{ESM}2 embeddings as gene tokens, we add a trainable \gls{MLP} adapter that transforms each protein embedding during \gls{scPRINT}'s pretraining. We provide a formal proof that such an \gls{MLP} has sufficient capacity to learn any arbitrary mapping---including acting as a lookup table that assigns each of $D$ proteins to a unique learned output. 
This allows the adapter to enrich \gls{ESM}2's representations (which encode protein sequence, evolutionary constraints, and structure) with co-expression information learned from millions of single-cell profiles. - -\textbf{Empirical results on the scPRINT benchmark gymnasium.} We evaluate both contributions on three tasks from the \gls{scPRINT} benchmark: cell-type prediction, embedding quality (\gls{scIB} score for batch correction and biological consistency), and gene network inference (\gls{EPR} on genome-wide perturb-seq and Omnipath ground truths). For the Xpressor architecture versus standard class-pooling (as used in \gls{scGPT}), we observe substantial improvements: cell-type prediction accuracy increases from 0.60 to 0.72 (+20\%), and embedding quality improves from 0.48 to 0.52 (+8\%), while gene network inference remains comparable. For multi-scale fine-tuning, comparing frozen \gls{ESM}2 embeddings versus fine-tuned ones, we see cell-type prediction improve from 0.60 to 0.70 (+17\%), embedding quality from 0.48 to 0.49, and gene network inference improves on the Omnipath benchmark from 2.0 to 2.4 \gls{EPR} (+20\%). Notably, fine-tuned \gls{ESM}2 embeddings outperform both frozen \gls{ESM}2 and randomly initialized embeddings across nearly all metrics. +In this chapter, we present Xpressor, a framework and architecture enabling cross-scale learning +between biological foundation models. This work addresses a fundamental challenge: while foundation +models exist at multiple biological scales (molecules, sequences, cells, tissues), they operate in +isolation, unable to leverage the rich interconnections between scales. 
+ +\textbf{Motivation and conceptual framework.} We begin with a comprehensive review of foundation +models across four biological scales: \glspl{mFM} for atomistic molecular representations, +\glspl{nFM} for nucleotide and amino acid sequences (\gls{DNA}, \gls{RNA}, proteins), \glspl{cFM} +for cellular abundance profiles, and \glspl{tFM} for tissue-level spatial organization. We argue +that information flows between scales: lower-scale models (e.g., protein sequences) can improve +input representations for higher-scale models (e.g., cells), while relationships learned at higher +scales can inform lower-scale representations. Each scale's vocabulary can be seen as built from +the compressed representations of the scale below: amino acids from atoms, genes from proteins, +cells from genes, tissues from cells. + +\textbf{The Xpressor architecture.} Our first contribution is a cross-attention-based compression +mechanism called Xpressor that transforms high-dimensional gene-level representations into +lower-dimensional cell-state vectors. The architecture introduces additional transformer blocks +that perform cross-attention between the output embeddings of a foundation model and a set of +learned latent tokens. It creates a bottleneck that compresses \(m\) gene tokens of dimension +\(d_c\) into \(n\) cell tokens of dimension \(d_t\), where \(n \ll m\) and \(d_t < d_c\). +Critically, the same transformer can then decompress these cell representations back to gene-level +predictions using cross-attention with gene ID tokens. This compression/decompression framework is +grounded in the information bottleneck theory of Tishby et al., where the goal is to retain +maximal information about relevant variables while achieving compression. We further regularize the +latent space using contrastive losses between embedding dimensions and dimension-specific +classifiers, ensuring each cell embedding dimension captures distinct biological information. 
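+
+A minimal single-head sketch of this bottleneck, assuming standard scaled dot-product cross-attention (the matrices \(H\), \(L\), \(G\), \(Z\), and all projections \(W\) are notation introduced for this illustration, not taken from the model's implementation): let \(H\) be the \(m \times d_c\) matrix of output gene embeddings and \(L\) the \(n \times d_t\) matrix of learned latent tokens. Compression can then be written as
+\[
+  Z = \mathrm{softmax}\!\left(\frac{(L W_Q)(H W_K)^{\top}}{\sqrt{d_t}}\right) H W_V ,
+\]
+with \(W_Q\) of size \(d_t \times d_t\) and \(W_K\), \(W_V\) of size \(d_c \times d_t\), so that \(Z\) is an \(n \times d_t\) matrix of cell tokens. Decompression swaps the roles, with the \(m \times d_c\) matrix \(G\) of gene ID tokens acting as queries:
+\[
+  \hat{H} = \mathrm{softmax}\!\left(\frac{(G W'_Q)(Z W'_K)^{\top}}{\sqrt{d_c}}\right) Z W'_V ,
+\]
+with \(W'_Q\) of size \(d_c \times d_c\) and \(W'_K\), \(W'_V\) of size \(d_t \times d_c\), which returns an \(m \times d_c\) matrix of gene-level representations. Separate projections are used here only to keep the shapes explicit; how weights are shared between the two directions is left unspecified in this sketch. In the information bottleneck formulation, with input \(X\), compressed representation \(T\), and relevant variable \(Y\), the trade-off is usually stated as minimizing \(I(X; T) - \beta\, I(T; Y)\) for a trade-off parameter \(\beta > 0\).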
+ +\textbf{Multi-scale fine-tuning approach.} Our second contribution is a method for fine-tuning +lower-scale models using upper-scale tasks via adapter layers. We demonstrate this using \gls{ESM}2 +(a protein language model) as the lower-scale model and \gls{scPRINT} as the upper-scale model. +Rather than simply using frozen \gls{ESM}2 embeddings as gene tokens, we add a trainable \gls{MLP} +adapter that transforms each protein embedding during \gls{scPRINT}'s pretraining. We provide a +formal proof that such an \gls{MLP} has sufficient capacity to learn any arbitrary +mapping, including acting as a lookup table that assigns each of \(D\) proteins to a unique +learned output. This allows the adapter to enrich \gls{ESM}2's representations (which encode +protein sequence, evolutionary constraints, and structure) with co-expression information learned +from millions of single-cell profiles. + +\textbf{Empirical results on the scPRINT benchmark gymnasium.} We evaluate both contributions on +three tasks from the \gls{scPRINT} benchmark: cell-type prediction, embedding quality (\gls{scIB} +score for batch correction and biological consistency), and gene network inference (\gls{EPR} on +genome-wide perturb-seq and Omnipath ground truths). For the Xpressor architecture versus standard +class-pooling (as used in \gls{scGPT}), we observe substantial improvements: cell-type prediction +accuracy increases from 0.60 to 0.72 (+20\%), and embedding quality improves from 0.48 to 0.52 +(+8\%), while gene network inference remains comparable. For multi-scale fine-tuning, comparing +frozen \gls{ESM}2 embeddings versus fine-tuned ones, we see cell-type prediction improve from 0.60 +to 0.70 (+17\%), embedding quality from 0.48 to 0.49, and gene network inference improves on the +Omnipath benchmark from 2.0 to 2.4 \gls{EPR} (+20\%). Notably, fine-tuned \gls{ESM}2 embeddings +outperform both frozen \gls{ESM}2 and randomly initialized embeddings across nearly all metrics. 
\subsection{Chapter \ref{article3}: scPRINT-2: Towards the next-generation of cell foundation models and benchmarks} -In this chapter, we present \gls{scPRINT}-2, a next-generation single-cell foundation model whose design decisions were systematically validated through an unprecedented additive benchmarking framework. This work addresses the critical gap in the field: while many \gls{scFM}s have been proposed, the relative importance of their architectural choices, training strategies, and data modalities has never been rigorously assessed in isolation. - -\textbf{The additive benchmark: a systematic evaluation framework.} We designed a comprehensive benchmark to evaluate 42 different configurations of \gls{scFM} components, including pretraining databases, architectures, and training tasks. Each model variant was trained 6 times across multiple seeds to generate statistical error bounds, and evaluated on a gymnasium of tasks: cell-type classification, batch correction (\gls{scIB} scores), expression denoising, and gene network inference. Our benchmark revealed several key findings: (1) denoising is superior to masking as a pretraining task for classification and embedding quality; (2) un-normalized expression outperforms normalized input; (3) \gls{ESM}-based gene tokens significantly outperform learned embeddings from scratch; (4) genomic location encoding improves model convergence; (5) \gls{MSE} loss outperforms \gls{ZINB} on average, but a hybrid \gls{ZINB}+MSE loss provides the best balance between accuracy and expressivity; and (6) model size correlates with improved gene network inference and cell-type prediction. - -\textbf{The \gls{scPRINT}-2 corpus: the largest single-cell database to date.} We assembled a pretraining database of over 350 million cells from 16 eukaryotic organisms spanning more than one billion years of evolution. 
This corpus integrates data from \gls{CxG}, the Tahoe-100M dataset, and the scBasecount database (20,000 reprocessed \gls{GEO} datasets), totaling 25 TB of unique data with approximately 400,000 distinct genes and 4,764 different cell labels across 140,000 cell groups. We demonstrated that cell-state diversity and data quality are more important than sheer cell count — reducing to 200 human datasets caused only a minimal performance decrease, whereas using low-diversity datasets alone caused performance to plummet. We introduced cluster-weighted sampling and \gls{NNZ}-weighted sampling to address dataset imbalances, enabling effective training on this heterogeneous corpus. - -\textbf{Architectural innovations.} \gls{scPRINT}-2 incorporates 12 distinct contributions validated through our benchmark. Key innovations include: (1) the XPressor architecture, a cross-attention-based compression mechanism that transforms gene-level representations into cell-level tokens and back, enabling the model to be generative; (2) a \gls{GNN}-based expression encoder that leverages neighborhood information from similar cells or spatial neighbors; (3) criss-cross attention, a sub-quadratic attention mechanism inspired by Recurrent Interface Networks that dramatically improves training speed while retaining model capabilities; (4) \gls{VAE}-based compression with dissimilarity losses between cell tokens, improving batch correction; and (5) an updated hierarchical classification loss that penalizes predictions based on ontological distance rather than binary correctness. - -\textbf{State-of-the-art performance across benchmarks.} On the Open Problems benchmark (November 2025), \gls{scPRINT}-2 achieved 75\% zero-shot cell-type classification accuracy, outperforming \gls{scPRINT}-1 (47\%) and all other zero-shot \gls{scFM}s (40-60\%). With our \gls{XPEFT}, \gls{scPRINT}-2 surpassed every existing supervised and unsupervised method on the platform. 
For expression denoising, \gls{scPRINT}-2 became state-of-the-art, outperforming MAGIC across all tested contexts, with particularly strong improvements on low- and mid-quality datasets where the \gls{GNN} encoder can leverage neighbor information. For batch integration, \gls{scPRINT}-2's zero-shot performance exceeded all other methods, and fine-tuned performance achieved the best overall \gls{scIB} scores. - -\textbf{Generalization to unseen modalities and organisms.} We demonstrated \gls{scPRINT}-2's ability to generalize beyond its training distribution. On Xenium spatial transcriptomics data (a modality absent from training), \gls{scPRINT}-2 successfully denoised expression, imputed 5,000 unseen genes with correlation scores matching denoised genes, and produced biologically meaningful cell-type and disease predictions. On cat and tiger lung tissues (organisms not seen during training), \gls{scPRINT}-2 achieved 42\% cell-type classification accuracy across 500 possible labels, with differential expression analysis confirming that \gls{scPRINT}-2 sometimes corrected expert annotations. With cluster-based logits averaging and \gls{XPEFT} fine-tuning, accuracy improved to 95\%. - -\textbf{Counterfactual reasoning and generative capabilities.} The XPressor architecture enables \gls{scPRINT}-2 to perform counterfactual generation. We demonstrated this by replacing organism-specific cell embeddings from mouse cells with human embeddings to generate ``humanized'' mouse expression profiles. The Wasserstein-2 distance between these counterfactual profiles and real human cells decreased significantly, and over-representation analysis showed 58\% enrichment in correctly predicted differentially expressed genes. Pathway analysis revealed biologically meaningful differences in immune function, membrane-\gls{ECM} interactions, and tissue elasticity. 
- -\textbf{Gene embeddings and network inference.} We showed that the XPressor architecture produces output gene embeddings with meaningful biological clustering (enriched for known pathways), whereas standard transformers without XPressor produce embeddings that merely encode expression values. For gene network inference, we introduced a computationally intensive extraction method biased toward co-expressed genes. Benchmarking against six ground-truth networks (including the cellmap \gls{AP-MS} data, human interactome, and \gls{gwps}), \gls{scPRINT}-2 showed improved performance on odds-ratio metrics. We demonstrated cross-species gene network analysis in macrophages, identifying conserved hub genes involved in feroptosis, pathogen phagocytosis, and \gls{MHC} pathways. We also showed how \gls{scPRINT}-2's predictions can cross-validate \gls{PPI} ground truths, identifying connections (HLA-DRA/CD74, B2M/B2M) that RoseTTAFold2-\gls{PPI} missed but AlphaFold-Multimer confirmed. +In this chapter, we present \gls{scPRINT}-2, a next-generation single-cell foundation model whose +design decisions were systematically validated through an unprecedented additive benchmarking +framework. This work addresses the critical gap in the field: while many \glspl{scFM} have been +proposed, the relative importance of their architectural choices, training strategies, and data +modalities has never been rigorously assessed in isolation. + +\textbf{The additive benchmark: a systematic evaluation framework.} We designed a comprehensive +benchmark to evaluate 42 different configurations of \gls{scFM} components, including pretraining +databases, architectures, and training tasks. Each model variant was trained 6 times across +multiple seeds to generate statistical error bounds, and evaluated on a gymnasium of tasks: +cell-type classification, batch correction (\gls{scIB} scores), expression denoising, and gene +network inference. 
Our benchmark revealed several key findings: (1) denoising is superior to +masking as a pretraining task for classification and embedding quality; (2) un-normalized +expression outperforms normalized input; (3) \gls{ESM}-based gene tokens significantly outperform +learned embeddings from scratch; (4) genomic location encoding improves model convergence; (5) +\gls{MSE} loss outperforms \gls{ZINB} on average, but a hybrid \gls{ZINB}+MSE loss provides the +best balance between accuracy and expressivity; and (6) model size correlates with improved gene +network inference and cell-type prediction. + +\textbf{The \gls{scPRINT}-2 corpus: the largest single-cell database to date.} We assembled a +pretraining database of over 350 million cells from 16 eukaryotic organisms spanning more than one +billion years of evolution. This corpus integrates data from \gls{CxG}, the Tahoe-100M dataset, +and the scBasecount database (20,000 reprocessed \gls{GEO} datasets), totaling 25 TB of unique +data with approximately 400,000 distinct genes and 4,764 different cell labels across 140,000 cell +groups. We demonstrated that cell-state diversity and data quality are more important than sheer +cell count. Reducing to 200 human datasets caused only a minimal performance decrease, whereas +using low-diversity datasets alone caused performance to plummet. We introduced cluster-weighted +sampling and \gls{NNZ}-weighted sampling to address dataset imbalances, enabling effective +training on this heterogeneous corpus. + +\textbf{Architectural innovations.} \gls{scPRINT}-2 incorporates 12 distinct contributions +validated through our benchmark. 
Key innovations include: (1) the XPressor architecture, a +cross-attention-based compression mechanism that transforms gene-level representations into +cell-level tokens and back, enabling the model to be generative; (2) a \gls{GNN}-based expression +encoder that leverages neighborhood information from similar cells or spatial neighbors; (3) +criss-cross attention, a sub-quadratic attention mechanism inspired by Recurrent Interface +Networks that dramatically improves training speed while retaining model capabilities; (4) +\gls{VAE}-based compression with dissimilarity losses between cell tokens, improving batch +correction; and (5) an updated hierarchical classification loss that penalizes predictions based +on ontological distance rather than binary correctness. + +\textbf{State-of-the-art performance across benchmarks.} On the Open Problems benchmark (November +2025), \gls{scPRINT}-2 achieved 75\% zero-shot cell-type classification accuracy, outperforming +\gls{scPRINT}-1 (47\%) and all other zero-shot \glspl{scFM} (40--60\%). With our \gls{XPEFT}, +\gls{scPRINT}-2 surpassed every existing supervised and unsupervised method on the platform. For +expression denoising, \gls{scPRINT}-2 became state-of-the-art, outperforming MAGIC across all +tested contexts, with particularly strong improvements on low- and mid-quality datasets where the +\gls{GNN} encoder can leverage neighbor information. For batch integration, \gls{scPRINT}-2's +zero-shot performance exceeded all other methods, and fine-tuned performance achieved the best +overall \gls{scIB} scores. + +\textbf{Generalization to unseen modalities and organisms.} We demonstrated \gls{scPRINT}-2's +ability to generalize beyond its training distribution. 
On Xenium spatial transcriptomics data (a +modality absent from training), \gls{scPRINT}-2 successfully denoised expression, imputed 5,000 +unseen genes with correlation scores matching denoised genes, and produced biologically meaningful +cell-type and disease predictions. On cat and tiger lung tissues (organisms not seen during +training), \gls{scPRINT}-2 achieved 42\% cell-type classification accuracy across 500 possible +labels, with differential expression analysis confirming that \gls{scPRINT}-2 sometimes corrected +expert annotations. With cluster-based logits averaging and \gls{XPEFT} fine-tuning, accuracy +improved to 95\%. + +\textbf{Counterfactual reasoning and generative capabilities.} The XPressor architecture enables +\gls{scPRINT}-2 to perform counterfactual generation. We demonstrated this by replacing +organism-specific cell embeddings from mouse cells with human embeddings to generate +``humanized'' mouse expression profiles. The Wasserstein-2 distance between these counterfactual +profiles and real human cells decreased significantly, and over-representation analysis showed +58\% enrichment in correctly predicted differentially expressed genes. Pathway analysis revealed +biologically meaningful differences in immune function, membrane-\gls{ECM} interactions, and +tissue elasticity. + +\textbf{Gene embeddings and network inference.} We showed that the XPressor architecture produces +output gene embeddings with meaningful biological clustering (enriched for known pathways), whereas +standard transformers without XPressor produce embeddings that merely encode expression values. +For gene network inference, we introduced a computationally intensive extraction method biased +toward co-expressed genes. Benchmarking against six ground-truth networks (including the cellmap +\gls{AP-MS} data, human interactome, and \gls{gwps}), \gls{scPRINT}-2 showed improved performance +on odds-ratio metrics. 
We demonstrated cross-species gene network analysis in macrophages, +identifying conserved hub genes involved in ferroptosis, pathogen phagocytosis, and \gls{MHC} +pathways. We also showed how \gls{scPRINT}-2's predictions can cross-validate \gls{PPI} ground +truths, identifying connections (HLA-DRA/CD74, B2M/B2M) that RoseTTAFold2-\gls{PPI} missed but +AlphaFold-Multimer confirmed. \subsection{Impacts Beyond Publications} Beyond the three main publications, this thesis produced several additional contributions: -\paragraph{Open-source Python packages.} Eight open-source Python packages were released alongside this thesis, making all models, tools, and methods freely available to the community: +\paragraph{Open-source Python packages.} Eight open-source Python packages were released alongside +this thesis, making all models, tools, and methods freely available to the community: \begin{itemize} - \item \emph{\gls{scPRINT}} (\url{https://github.com/cantinilab/scPRINT}): model weights, training scripts, notebooks, and data utilities for the first model. - \item \emph{scPRINT-2} (\url{https://github.com/cantinilab/scPRINT-2}): the same for the second model. - \item \emph{scDataLoader} (\url{https://github.com/jkobject/scDataLoader}): efficient loading of thousands of large single-cell datasets with preprocessing, filtering, and a first-of-its-kind weighted random sampling over billions of elements. - \item \emph{BenGRN} (\url{https://github.com/jkobject/Bengrn}): benchmarking of \gls{GRN} inference methods on single-cell data with multiple metrics and ground-truth networks. - \item \emph{GRNNdata} (\url{https://github.com/cantinilab/GRNNdata}): joint handling of gene regulatory networks and single-cell data in the AnnData format. - \item \emph{Xpressor} (\url{https://github.com/cantinilab/XPressor}): reproducible experiments and from-scratch construction of the XPressor architecture. 
- \item \emph{Simpler Flash} (\url{https://github.com/jkobject/simpler_flash}): a library of efficient attention mechanisms including softpick-flash and our flash criss-cross attention. - \item \emph{Hierarchical Classifier} (\url{https://gist.github.com/jkobject/5b36bc4807edb440b86644952a49781e}): hierarchical classification for single-cell data. + \item \emph{\gls{scPRINT}} (\url{https://github.com/cantinilab/scPRINT}): model weights, + training scripts, notebooks, and data utilities for the first model. + \item \emph{scPRINT-2} (\url{https://github.com/cantinilab/scPRINT-2}): the same for the + second model. + \item \emph{scDataLoader} (\url{https://github.com/jkobject/scDataLoader}): efficient loading + of thousands of large single-cell datasets with preprocessing, filtering, and a + first-of-its-kind weighted random sampling over billions of elements. + \item \emph{BenGRN} (\url{https://github.com/jkobject/Bengrn}): benchmarking of \gls{GRN} + inference methods on single-cell data with multiple metrics and ground-truth networks. + \item \emph{GRNNdata} (\url{https://github.com/cantinilab/GRNNdata}): joint handling of gene + regulatory networks and single-cell data in the AnnData format. + \item \emph{Xpressor} (\url{https://github.com/cantinilab/XPressor}): reproducible experiments + and from-scratch construction of the XPressor architecture. + \item \emph{Simpler Flash} (\url{https://github.com/jkobject/simpler_flash}): a library of + efficient attention mechanisms including softpick-flash and our flash criss-cross attention. + \item \emph{Hierarchical Classifier} + (\url{https://gist.github.com/jkobject/5b36bc4807edb440b86644952a49781e}): hierarchical + classification for single-cell data. \end{itemize} -\paragraph{Accessibility and open science.} Beyond releasing model weights and inference code, I prioritised broad accessibility by providing easy-to-use inference tools, pretraining methods, curated datasets, training traces, and documentation. 
Tutorials were implemented in Google Colab for zero-setup usage, and model versions were published on the Chan–Zuckerberg model hub (\url{https://virtualcellmodels.cziscience.com/}) and the Superb.io platform (\url{https://superbio.ai}). - -\paragraph{Benchmarking infrastructure.} I implemented Docker containers for \gls{scPRINT}, \gls{scGPT}, and Geneformer to enable reproducible evaluation on the Open Problems benchmarking platform, and actively participated in creating and improving two benchmarks on that platform. - -\paragraph{Science communication.} I disseminated findings through multiple channels: a \href{https://blog.lamin.ai/arrayloader-benchmarks}{blog post} with \href{https://lamin.ai}{Lamin.ai} on training at scale; posts on X, LinkedIn, Bluesky, and Threads; and articles on my \href{https://www.jkobject.com}{personal website}. I also wrote vulgarisation articles for the \href{https://www.pasteur.fr/fr/journal-recherche/videos/langage-cellules}{Pasteur Institute} and the \href{https://www.insb.cnrs.fr/fr/cnrsinfo/scprint-le-premier-modele-dia-francais-pour-dechiffrer-les-reseaux-genetiques}{\gls{CNRS}}, and produced one of the most viewed videos on the Pasteur Institute's \href{https://www.youtube.com/watch?v=bgtcDs5EXY8}{YouTube} and Instagram accounts. I was highlighted in Whitelab's blog posts and published four additional \href{https://www.youtube.com/@jkobject}{YouTube} presentations of my work. - -\paragraph{Conference presentations and invited talks.} Outreach extended to the broader scientific community through participation in three international \gls{ML} conferences, over 25 invited talks, and five poster presentations. - -\paragraph{Industry translation and entrepreneurship.} I contributed to the European start-up ecosystem by joining Nucleate, a worldwide organisation supporting master's students, PhDs, and post-docs in translating research into companies. 
I also worked as a consultant for four start-ups — \href{https://whitelabgx.com/}{Whitelab Genomics}, \href{https://graphica.bio/}{Biographica}, Blossom, and \href{https://www.dotomics.bio/}{dot-omics} — advising on foundation model strategies for single-cell \gls{RNA-seq} and \gls{DNA} sequencing. +\paragraph{Accessibility and open science.} Beyond releasing model weights and inference code, +broad accessibility was prioritised by providing easy-to-use inference tools, pretraining methods, +curated datasets, training traces, and documentation. Tutorials were implemented in Google Colab +for zero-setup usage, and model versions were published on the Chan--Zuckerberg model hub +(\url{https://virtualcellmodels.cziscience.com/}) and the Superb.io platform +(\url{https://superbio.ai}). + +\paragraph{Benchmarking infrastructure.} Docker containers were implemented for \gls{scPRINT}, +\gls{scGPT}, and Geneformer to enable reproducible evaluation on the Open Problems benchmarking +platform, with active participation in creating and improving two benchmarks on that platform. + +\paragraph{Science communication.} Findings were disseminated through multiple channels: a +\href{https://blog.lamin.ai/arrayloader-benchmarks}{blog post} with +\href{https://lamin.ai}{Lamin.ai} on training at scale; posts on X, LinkedIn, Bluesky, and +Threads; and articles on the \href{https://www.jkobject.com}{personal website}. Vulgarisation +articles were written for the +\href{https://www.pasteur.fr/fr/journal-recherche/videos/langage-cellules}{Pasteur Institute} and +the \href{https://www.insb.cnrs.fr/fr/cnrsinfo/scprint-le-premier-modele-dia-francais-pour-dechiffrer-les-reseaux-genetiques}{\gls{CNRS}}, +and one of the most viewed videos on the Pasteur Institute's +\href{https://www.youtube.com/watch?v=bgtcDs5EXY8}{YouTube} and Instagram accounts was produced. 
+Whitelab's blog posts highlighted this work, and four additional +\href{https://www.youtube.com/@jkobject}{YouTube} presentations of the research were published. + +\paragraph{Conference presentations and invited talks.} Outreach extended to the broader scientific +community through participation in three international \gls{ML} conferences, over 25 invited talks, +and five poster presentations. + +\paragraph{Industry translation and entrepreneurship.} Contributions to the European start-up +ecosystem included joining Nucleate, a worldwide organisation supporting master's students, PhDs, +and post-docs in translating research into companies. Consulting work was performed for four +start-ups: \href{https://whitelabgx.com/}{Whitelab Genomics}, +\href{https://graphica.bio/}{Biographica}, Blossom, and +\href{https://www.dotomics.bio/}{dot-omics}, with advice on foundation model strategies for +single-cell \gls{RNA-seq} and \gls{DNA} sequencing. diff --git a/context_papers/Geneformer.pdf b/context_papers/Geneformer.pdf new file mode 100644 index 0000000..fc1454e Binary files /dev/null and b/context_papers/Geneformer.pdf differ diff --git a/context_papers/scGPT.pdf b/context_papers/scGPT.pdf new file mode 100644 index 0000000..d3455fa Binary files /dev/null and b/context_papers/scGPT.pdf differ diff --git a/main.pdf b/main.pdf index 00f5cc0..132b6d9 100644 Binary files a/main.pdf and b/main.pdf differ diff --git a/main.tex b/main.tex index c4ec07a..5293c59 100644 --- a/main.tex +++ b/main.tex @@ -302,10 +302,6 @@ %\chapter*{Introduction} \fancyhead{} % clear all header fields \input{chapters/long_abstract_fr} %partie introduction ou avant-propos -\fancyhead[RO]{\textsc{Background}} -\setcounter{section}{0} - -\input{chapters/background} \fancyhead[OL]{\textsc{Introduction}} \input{chapters/intro} %partie introduction ou avant-propos diff --git a/reply_to_rapporteurs.md b/reply_to_rapporteurs.md new file mode 100644 index 0000000..7ffb2b5 --- /dev/null +++ 
b/reply_to_rapporteurs.md @@ -0,0 +1,60 @@ +Reply to rapporteurs +Valentina +A central conceptual question that would benefit from more explicit treatment concerns the relationship between the denoising training objective and the nature of the regulatory interactions captured by attention. Since scPRINT is trained to reconstruct downsampled expression profiles, the model is optimized to exploit global co-expression structure. This creates no obvious inductive bias toward direct regulatory interactions over indirect transitive paths of the form A→B→C. The manuscript does not fully address why attention weights in this setting should preferentially reflect direct regulation rather than co-expression. Given that biological interpretability of the inferred networks is a central claim, a more explicit theoretical treatment of this issue would substantially strengthen the work. +As long as only steady-state expression data are used, nothing more than co-expression can be achieved, and we do not make a claim to the contrary. The goal is not to infer GRNs per se, but to understand the ability of foundation models to leverage an understanding of gene relationships (albeit through co-expression patterns) to achieve their tasks and how this general understanding enables them to perform many other downstream tasks. However, we do believe that foundation models can go beyond co-expression. Indeed, using ESM3 embeddings confers knowledge of protein structure and evolutionary relationships, and using gene location provides additional information on the probability of co-regulation. Working across species further provides patterns of expression not just within cells but across kingdoms. Obviously, nothing is causal yet without interventional or temporal data, and that is a point left to be worked out. +→ I have added this sentence: “Given the model’s pre-training and losses, it cannot be expected of them to contain more than co-expression patterns. 
Compared to a complex co-expression model like GENIE3, they often indeed even underperform it. However, we show that we can get closer to the best co-expression-based gene network inference tools, such as GENIE3, by using improved training. Finally, the addition of inductive biases like the gene’s sequence and evolutionary similarities can help provide information unavailable in the expression data alone.” + + +Figure 12 appears somewhat generic and does not substantially contribute to the understanding of the specific models proposed, while Figure 13 introduces skip connections and their effect on the loss landscape without fully clarifying their precise role in the proposed architectures. Minor typographical errors are present but do not affect readability. +→ I have moved figures 12 and 13, as well as their related content to another section called background for people unfamiliar with neural networks and why they indeed work +→ I have rechecked the paper for grammatical errors and updated it + + +Another potential critique: Across all three chapters, key architectural parameters, including the number of transformer layers, attention heads, and embedding dimensions for each model variant, are not consolidated in a form that allows the reader to directly connect design choices to performance outcomes. This is particularly consequential in Chapter 3, where 42 configurations are evaluated: without a summary table of specifications alongside results, it is difficult to draw clean conclusions from the ablation. Adding such a table would substantially improve reproducibility and the interpretability of the benchmarking. +For Chapter 1, the values are available in Table 5.1.2. The values are always the same for all elements and are defined in the first part of the methods. Providing it in the table would have taken too much space. 
+For Chapter 2, the values are in the method section 2.4.6. +For Chapter 3, the values are in the method section 3.5.1. +Charlotte +At the moment, the introduction requires a complete rewrite from scratch. In particular: +Remove the preamble and any personal narrative. A thesis introduction should be academic in tone and content. Personal motivation, objectives, or a “PhD journey” framing can go into acknowledgements, but not in the introduction. +Do not introduce basic textbook material. There is no need to explain the central dogma or basic principles of gene regulation, especially not with explicit figures. Likewise, avoid a generic sequencing overview. Background should be minimal and only included if it directly supports what you do in the thesis. Don’t confuse Introduction and Background, these are two different chapters. For now, in fact, you have no Background/Related Work section, it seems? +I deliberately added the “personal” section on the Ph.D. journey, knowing that it is not often included in a Ph.D. thesis, but it is something I have always felt was missing when presenting one’s work. I would like to keep the section in the acknowledgements if you agree. +Indeed, in France, for a publication-based thesis, the introduction/discussion is seen as a place where the student can present material they couldn’t include in their papers. Having never appreciated the classical academic writing style, I set out to write in a style I enjoy reading and to be more personal. I have reworded many paragraphs, moved almost all sections, and made changes based on your thoughtful feedback. +→ I fully rewrote the structure of the introduction / objective chapters into: +1. Acknowledgement + 1. remerciements + 2. Personal Motivation → moved the personal stuff here + 1. background to the thesis + 2. personal objectives + + 2. … lists … (kept unchanged) + + 3. Background: → moved all the more basic review elements and definitions here + 1. 
Biology (RNA, networks, sequencing) + 2. Machine learning (definitions, neural nets, loss landscapes…) + + 4. Introduction: (added and modified elements initially in the introduction and objectives chapters, following your proposed structure): + 1. Motivation and problem setting: Described the complexity of cellular biology and why computational modeling is needed. + 2. Why modern large models: Explained, at a high level, why the building blocks and their interactions motivate large-scale learning approaches. + 3. Scientific aim: Motivated the need for mechanistic insight. + 4. Thesis scope: + 1. Clearly stated what has been done. + 2. What we did not do. I mentioned that we did not use temporal or interventional data and why. I also added a subsection on what was initially planned but not done (initially part of the “objective” chapter). + 5. Thesis outline and contributions: Briefly described what each chapter contains, aligned with the actual results (mostly what was in the objective chapter). Summarized the core contributions and impact. + + 5. Chapter 1 … (kept unchanged) + +Since the thesis is at the crossroads of machine learning and biology, I deliberately added some background and went through some definitions so that the material is accessible to both biologists and computer scientists. In the doctoral school template, there is no background section, and the guidance is to add it to the introduction. I will add a specific subsection to clearly separate the background, the introduction, and related work. + + +Fix the structure and narrative flow. The current version is fuzzy and reads like a list of loosely connected topics. It jumps between broad promises of biology, the central dogma, gene regulatory networks, sequencing, and unrelated figures (for example, the Feynman image) without a coherent progression. +Align the introduction with what the thesis actually covers. The introduction currently lists many topics (including on page 20) that are not addressed in your work. 
The introduction should set up the specific problem setting, methods, and contributions that your thesis delivers. +I am not sure which figure in my thesis the Feynman image refers to, but I have substantially reordered sections and topics. +→ I have reordered sections to get preamble [my background for the thesis, personal objectives for the thesis], background [biology, machine learning], introduction [with your proposed structure: initial objectives, potential impacts, …]. + + +Improve language, formatting, and academic style. There are many spelling errors, missing full stops, sloppy language, and generally non-academic phrasing. Also, please correct punctuation spacing: English does not use a space before “?” or “:” (this is a common French typography habit and it stands out in English writing). +I reviewed the thesis's spelling with Grammarly and made multiple further changes based on feedback from Gabriel and Laura. +→ I have now done another pass and corrected many spelling issues, especially in some early parts of the introduction. +→ For the academic phrasing, I have made some changes, mostly in the introduction section. But you might also be referring to the more “personal” section. Its tone has stayed similar, but it was moved to the acknowledgments. +→ For the spacing: although there are no such spaces in my LaTeX document, the university's thesis styling guide added them automatically. I changed the styling guide from a French thesis to an English thesis; this removed the spaces. \ No newline at end of file