Merged
21 commits
6bef1c9
charlotte-response: expand scFM descriptions, restructure intro, add …
jkobject Mar 15, 2026
868e19d
charlotte-response: fix opening tone, remove Monod figure, fix tense,…
jkobject Mar 15, 2026
0ee03a9
Potential fix for pull request finding
jkobject Mar 15, 2026
3330382
Potential fix for pull request finding
jkobject Mar 15, 2026
0a85273
Potential fix for pull request finding
jkobject Mar 15, 2026
cabbf9a
fix: correct scCello and LangCell cite keys, add scCello to bibliography
Mar 15, 2026
ecea1b6
fix: remove GRN doublon, depersonalize limitations section
jkobject Mar 16, 2026
2cd4f94
fix: remove 'promises of cellular biology' section, Monod personal an…
jkobject Mar 16, 2026
6dc7266
fix: clean informal language in background.tex AI section, update res…
jkobject Mar 16, 2026
e47eea4
feat: restructure intro — formalize GRN, detailed scFM section, merge…
jkobject Mar 16, 2026
9408270
fix: replace personal narrative in auxiliaries/background with concis…
jkobject Mar 16, 2026
76e9ef1
built pdf
jkobject Mar 16, 2026
4362754
fix: address Charlotte's 7 intro feedback points
jkobject Mar 16, 2026
78c188f
fix: add missing citations and glossary entries
jkobject Mar 16, 2026
0d1e081
chore: build PDF after intro edits and citation fixes
jkobject Mar 16, 2026
ff30c8a
final
jkobject Mar 18, 2026
2c2036f
final
jkobject Mar 19, 2026
6f67ff3
final
jkobject Mar 19, 2026
c0d07ef
Delete CHARLOTTE_TASKS.md
jkobject Mar 19, 2026
5287569
Delete RESPONSE_TO_CHARLOTTE.md
jkobject Mar 19, 2026
f779362
Delete charlotte_feedback.txt
jkobject Mar 19, 2026
8 changes: 7 additions & 1 deletion auxiliaries/abreviations.tex
@@ -507,4 +507,10 @@
\newacronym{ipTM}{ipTM}{interface predicted Template Modeling score}

\glsaddall
\printglossary[type=\acronymtype, title=Liste des abréviations, toctitle=Liste des abréviations]
\printglossary[type=\acronymtype, title=Liste des abréviations, toctitle=Liste des abréviations]
\newglossaryentry{API}
{
name={API},
description={Application Programming Interface},
type=\acronymtype
}
63 changes: 7 additions & 56 deletions auxiliaries/background.tex
@@ -1,63 +1,14 @@
\raggedbottom % Allow flexible page heights to reduce underfull vbox warnings

\chapter*{Personal Motivation}
\addcontentsline{toc}{chapter}{Personal Motivation}
\chapter*{Acknowledgements}
\addcontentsline{toc}{chapter}{Acknowledgements}

This Ph.D. started relatively late in my career. This chapter explains the personal and professional path that led to this thesis, and the objectives I set for myself when starting it.
This Ph.D.\ was undertaken at the Institut Pasteur, in the ML4IG group led by Laura Cantini and in close collaboration with Gabriel Peyré's group at the ENS Centre de Science des Données. The work presented in this thesis began in 2022, building on earlier research at the Broad Institute of MIT and Harvard and on applied work in the biotechnology industry. These varied experiences shaped the perspective and the questions that ultimately drove this thesis.

\section{Background to the thesis}
\bigskip

\subsection{The PiPle Project}
Many of the opportunities I had after school have been very exciting. Initially, I decided to create a company called PiPle with a friend, Paul Best, who is now a Post-doctoral Researcher at the University of Vienna in Machine Learning for bio-acoustics.
Prior to this Ph.D., I worked at the Broad Institute, where I contributed to several high-impact computational biology projects and began developing the ideas around large-scale gene regulatory network inference that underpin this thesis. A subsequent role at Whitelab Genomics in Paris provided direct exposure to the practical challenges of deploying foundation models for therapeutic applications, reinforcing the importance of rigorous benchmarking and reproducible open-source tooling.

Funnily enough, it was completely unrelated to biology. We worked on creating novel means of communication. We had---and still have---big ideas for improving utterly inadequate messaging apps, emails, and similar tools through machine learning and innovative design. Doing this, we learned a lot about managing complex projects, selling ideas, building large codebases, teamwork, and designing interfaces.
\bigskip

However, we did not gain enough traction from this, and after a year of hard work, we felt the road ahead was paved with too many sacrifices.

\subsection{The Broad Institute}
I passed on Ph.D. opportunities a second time to work at the Broad Institute instead. Having visited the labs, Boston, and Kendall Square, I knew this was the kind of experience I wanted, and Ph.D.s seemed long and cumbersome. At Broad, I worked on many very-high-impact research projects, and I felt I was part of something bigger than myself. I published as the first author and even started my own research projects, which would inform the thesis I am presenting here.

While I still understood that a Ph.D. was the best place to undergo such projects, I was uncertain about the specifics. I also understood the length, harshness, and sometimes arbitrary nature of U.S. Ph.D. programs. I also wanted to continue working on team-based projects and wanted to experience the start-up environment.

\subsection{Whitelab Genomics}
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{./figures/whitelabgx.jpeg}
\caption[The Whitelab Genomics team]{The Whitelab Genomics team in September 2023, in its future4care offices}
\label{fig:whitelabgx}
\end{figure}

Along with other personal decisions, it led me to return to France and take on the role of team lead for the computational biology group at Whitelab Genomics in Paris.

At Whitelab, I learned how to build a team and manage people. I learned a lot about what it means to grow companies from 10 to 50 people. I also learned about the biotech industry and how to build and sell such products.

Whitelab had a good mix of expertise in computational biology, machine learning, structural biology, and business development (see Figure~\ref{fig:whitelabgx}). While starting the first project there, I significantly enhanced the potential of foundation models for the biotech industry.

From \gls{DNA} language models to cell foundation models and knowledge-graph-based models, it became clear that they would be the path forward for aggregating sparse, disparate information across many fields of biology and medicine.

\subsection{Starting the Ph.D.}
I was not looking for any other positions and intended to stay at least a few years to assess how we had grown during that time.

However, I was already in contact with Laura Cantini, with whom I had previously discussed Ph.D. projects. At some point, Laura came back to me with this Ph.D. proposal. I spent the better part of a month in a challenging position, thinking about which decision would not become a regret in the future.

There was no perfect time to do this, but it felt like it was now or never. I was also very impressed by the level of various Ph.D. students in the labs of Laura and Gabriel. Seeing people 4 years younger than me already with such a high level of expertise and knowledge was very humbling. Finally, the Ph.D. topic and group were really on point with what I wanted to do. But mostly, my work/life environment was welcoming, surrounded by family, friends, and activities. I knew what I wanted to work on and what I wanted to learn.

Therefore, I decided to start this Ph.D. journey.

\section{Personal objectives}
\label{personal-objectives-during-the-thesis}

\textit{This is copied from my initial objectives written in my research proposal at the start of the Ph.D.}

I had the chance to see many friends completing their Ph.D.s before starting mine. A main mistake I saw during one's Ph.D.~is not seeing the time passing by. My goal for this Ph.D.~was to be as product-first as I was at Whitelab Genomics. Delivering results quickly \& improving until it is publishable. This mistake, thinking ``Well, I have 3 years\ldots'', is at least partly responsible for the stress, the crash, and the unpreparedness for what some students might experience after the Ph.D. Thus, I plan to give myself a short timeline, knowing I will likely go over. And I will prepare everything around this idea. I will also start to prepare for what is next from the get-go.

To do that best, one needs to take the opportunity of the Ph.D.~to make connections with other labs (industry or academic). Moreover, a good piece of advice I have been given is to \emph{know what you want to do and what you don't want to do}. Know what you are here for. Learn to say no. And I learned to say no in the last 4 years. My goal is to work on large models \& large datasets, mostly in transcriptomics, and always to go back to first principles and biology. I also know I want to make something useful, create something that can be a stepping stone for others. Something that affects the community. I know that to do that, you have to go the extra mile in terms of development and be honest with yourself about any shortcomings.

Finally, I have been fortunate to become addicted to my work. I like working hard and taking on challenges. But for this to happen, I need to keep enjoying what I am doing. I also wish to have no regrets about this decision. Thus, my final goal is to enjoy it as much as I can.

\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{./figures/pasteur.jpg}
\caption[A view of the Pasteur Institute in Paris]{A view of the Pasteur Institute in Paris, where I did my Ph.D. Adapted from \citet{pasteur}.}
\label{fig:pasteur}
\end{figure}
The choice to pursue this Ph.D.\ was motivated by the conviction that the single-cell genomics field was at an inflection point: the data scale, the model architectures, and the benchmarking infrastructure were all maturing simultaneously, creating an opportunity to build something both scientifically rigorous and practically useful. That conviction shaped every design decision in this thesis.
99 changes: 99 additions & 0 deletions bibliography.bib
@@ -7422,6 +7422,24 @@ @article{panzeriCrackingNeuralCode2017
file = {/Users/jeremie/Documents/science/computational neuro/Panzeri et al. - 2017 - Cracking the Neural Code for Sensory Perception by.pdf}
}

@article{matsumotoSCODEEfficientRegulatory2017,
title = {{{SCODE}}: An Efficient Regulatory Network Inference Algorithm from Single-Cell {{RNA-Seq}} during Differentiation},
shorttitle = {{{SCODE}}},
author = {Matsumoto, Hirotaka and Kiryu, Hisanori and Furusawa, Chikara and Ko, Minoru S H and Ko, Shigeru B H and Gouda, Norio and Hayashi, Tetsutaro and Nikaido, Itoshi},
year = 2017,
month = aug,
journal = {Bioinformatics},
volume = {33},
number = {15},
pages = {2314--2321},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btx194},
urldate = {2026-03-18},
abstract = {The analysis of RNA-Seq data from individual differentiating cells enables us to reconstruct the differentiation process and the degree of differentiation (in pseudo-time) of each cell. Such analyses can reveal detailed expression dynamics and functional relationships for differentiation. To further elucidate differentiation processes, more insight into gene regulatory networks is required. The pseudo-time can be regarded as time information and, therefore, single-cell RNA-Seq data are time-course data with high time resolution. Although time-course data are useful for inferring networks, conventional inference algorithms for such data suffer from high time complexity when the number of samples and genes is large. Therefore, a novel algorithm is necessary to infer networks from single-cell RNA-Seq during differentiation. In this study, we developed the novel and efficient algorithm SCODE to infer regulatory networks, based on ordinary differential equations. We applied SCODE to three single-cell RNA-Seq datasets and confirmed that SCODE can reconstruct observed expression dynamics. We evaluated SCODE by comparing its inferred networks with use of a DNaseI-footprint based network. The performance of SCODE was best for two of the datasets and nearly best for the remaining dataset. We also compared the runtimes and showed that the runtimes for SCODE are significantly shorter than for alternatives. Thus, our algorithm provides a promising approach for further single-cell differentiation analyses. The R source code of SCODE is available at https://github.com/hmatsu1226/SCODE. Supplementary data are available at Bioinformatics online.},
file = {/Users/jkobject/Zotero/storage/9AJA99WH/Matsumoto et al. - 2017 - SCODE an efficient regulatory network inference algorithm from single-cell RNA-Seq during different.pdf}
}


@misc{PapersCodeDeep,
title = {Deep {{Networks}} with {{Stochastic Depth}}},
author = {Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian},
@@ -7511,6 +7529,46 @@ @unpublished{pathakCuriositydrivenExplorationSelfsupervised2017
annotation = {00000}
}

@misc{hauryTIGRESSTrustfulInference2012,
title = {{{TIGRESS}}: {{Trustful Inference}} of {{Gene REgulation}} Using {{Stability Selection}}},
shorttitle = {{{TIGRESS}}},
author = {Haury, Anne-Claire and Mordelet, Fantine and {Vera-Licona}, Paola and Vert, Jean-Philippe},
year = 2012,
month = may,
number = {arXiv:1205.1181},
eprint = {1205.1181},
primaryclass = {stat},
publisher = {arXiv},
doi = {10.48550/arXiv.1205.1181},
urldate = {2026-03-18},
abstract = {Inferring the structure of gene regulatory networks (GRN) from gene expression data has many applications, from the elucidation of complex biological processes to the identification of potential drug targets. It is however a notoriously difficult problem, for which the many existing methods reach limited accuracy. In this paper, we formulate GRN inference as a sparse regression problem and investigate the performance of a popular feature selection method, least angle regression (LARS) combined with stability selection. We introduce a novel, robust and accurate scoring technique for stability selection, which improves the performance of feature selection with LARS. The resulting method, which we call TIGRESS (Trustful Inference of Gene REgulation using Stability Selection), was ranked among the top methods in the DREAM5 gene network reconstruction challenge. We investigate in depth the influence of the various parameters of the method and show that a fine parameter tuning can lead to significant improvements and state-of-the-art performance for GRN inference. TIGRESS reaches state-of-the-art performance on benchmark data. This study confirms the potential of feature selection techniques for GRN inference. Code and data are available on http://cbio.ensmp.fr/\textasciitilde ahaury. Running TIGRESS online is possible on GenePattern: http://www.broadinstitute.org/cancer/software/genepattern/.},
archiveprefix = {arXiv},
keywords = {Quantitative Biology - Quantitative Methods,Statistics - Machine Learning},
file = {/Users/jkobject/Zotero/storage/IRDN7Z3X/Haury et al. - 2012 - TIGRESS Trustful Inference of Gene REgulation using Stability Selection.pdf;/Users/jkobject/Zotero/storage/39HZYHPU/1205.html}
}

@article{thuDynGENIE3DynamicalGENIE32018,
title = {{{dynGENIE3}}: Dynamical {{GENIE3}} for the Inference of Gene Networks from Time Series Expression Data},
shorttitle = {{{dynGENIE3}}},
author = {{Huynh-Thu}, V{\^a}n Anh and Geurts, Pierre},
year = 2018,
month = feb,
journal = {Scientific Reports},
volume = {8},
number = {1},
pages = {3384},
publisher = {Nature Publishing Group},
issn = {2045-2322},
doi = {10.1038/s41598-018-21715-0},
urldate = {2026-03-18},
abstract = {The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.},
copyright = {2018 The Author(s)},
langid = {english},
keywords = {Gene regulatory networks,Machine learning,Network topology,Regulatory networks,Time series},
file = {/Users/jkobject/Zotero/storage/RGARW3TY/Huynh-Thu and Geurts - 2018 - dynGENIE3 dynamical GENIE3 for the inference of gene networks from time series expression data.pdf}
}


@inproceedings{pathakCuriosityDrivenExplorationSelfSupervised2017a,
title = {Curiosity-{{Driven Exploration}} by {{Self-Supervised Prediction}}},
author = {Pathak, Deepak and Agrawal, Pulkit and Efros, Alexei A. and Darrell, Trevor},
@@ -11133,3 +11191,44 @@ @misc{zuhriSoftpickNoAttention2025
file = {/Users/jkobject/Zotero/storage/WM3W46PZ/Zuhri et al. - 2025 - Softpick No Attention Sink, No Massive Activations with Rectified Softmax.pdf;/Users/jkobject/Zotero/storage/DZP4JTDF/2504.html}
}


@inproceedings{yuanCellOntologyGuided2024,
title = {Cell-ontology guided transcriptome foundation model},
author = {Yuan, Xinyu and Zhan, Zhihao and Zhang, Zuobai and Zhou, Manqi and Zhao, Jianan and Han, Boyu and Li, Yue and Tang, Jian},
booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year = {2024},
note = {NeurIPS 2024 Spotlight},
url = {https://arxiv.org/abs/2408.12373}
}

@inproceedings{dosovitskiyImageWorth16x162021,
title = {An Image is Worth 16x16 Words: {{Transformers}} for Image Recognition at Scale},
shorttitle = {An Image is Worth 16x16 Words},
booktitle = {International {{Conference}} on {{Learning Representations}} ({{ICLR}})},
author = {Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
year = {2021},
month = may,
eprint = {2010.11929},
archiveprefix = {arXiv},
primaryclass = {cs.CV},
url = {https://arxiv.org/abs/2010.11929},
urldate = {2026-03-16},
abstract = {While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.}
}

@article{schaffterGeneNetWeaverNetworkTopology2011,
title = {{{GeneNetWeaver}}: In Silico Benchmark Generation and Performance Profiling of Network Inference Methods},
shorttitle = {{{GeneNetWeaver}}},
author = {Schaffter, Thomas and Marbach, Daniel and Floreano, Dario},
year = {2011},
month = aug,
journal = {Bioinformatics},
volume = {27},
number = {16},
pages = {2263--2270},
issn = {1367-4803, 1367-4811},
doi = {10.1093/bioinformatics/btr373},
url = {https://academic.oup.com/bioinformatics/article/27/16/2263/254752},
urldate = {2026-03-16},
abstract = {Over the last decade, numerous methods have been developed for inference of regulatory networks from gene expression data. However, accurate and systematic evaluation of these methods is hampered by the difficulty of constructing adequate benchmarks and the lack of tools for a differentiated analysis of network predictions on such benchmarks. Here we describe a novel and comprehensive method for in silico benchmark generation and performance profiling of network inference methods available to the community as an open-source software called GeneNetWeaver (GNW). In addition to the generation of detailed dynamical models of gene regulatory networks to be used as benchmarks, GNW provides a network motif analysis that reveals systematic prediction errors, thereby indicating potential ways of improving inference methods.}
}