Response to Charlotte — complete intro restructuring, background removed, personal narrative removed by jkobject · Pull Request #5 · jkobject/Thesis

jkobject · 2026-03-15T15:44:44Z

Summary

This PR addresses all 7 points from Charlotte Bunne's feedback (24 Feb + 14 Mar 2026 follow-up).

Key changes (latest — 16 Mar 2026)

`chapters/intro.tex` (231 → 1000 lines)

- Opens directly on the central computational challenge and actual thesis results
- Section 2, GRN Inference: formal mathematical problem statement (X∈ℝ^{n×g}, G=(V,E,W), cell-type-specific GRNs), detailed explanations with equations: GENIE3/GRNBoost2 (random forest + feature importance), pySCENIC (two-step: co-expression + motif enrichment + AUCell), PIDC (partial information decomposition), SCODE (linear ODE); comparison table
- Benchmarking: BEELINE, GeneRNI Living Benchmark, metrics (AUROC, AUPRC, EPR), ground truths (OmniPath, ChIP-seq, Perturb-seq)
- Section 3, Foundation Models: self-attention with full equations, masked gene modeling, scRNA tokenization challenges (binning vs rank-ordering vs MLP)
- scFMs detailed (~half page each): scBERT, Geneformer (rank-ordering explained), scGPT (causal attention), UCE (ESM2 gene tokens), scFoundation
- Contributions section preserved
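
The tokenization contrast named above (binning vs rank-ordering) can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the thesis: the gene names, expression values and per-gene medians are made up, the rank-ordering follows the general idea of Geneformer's rank value encoding (expression scaled by a per-gene median, then sorted), and the equal-width binning is a deliberately simple assumption rather than the scheme of any particular model.

```python
import math


def rank_order_tokens(expression, gene_medians):
    """Order expressed genes by median-scaled expression, highest first."""
    scaled = {
        gene: value / gene_medians[gene]
        for gene, value in expression.items()
        if value > 0  # unexpressed genes are left out of the sequence
    }
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scaled, key=lambda gene: (-scaled[gene], gene))


def bin_tokens(expression, n_bins):
    """Map each gene to an equal-width bin in 1..n_bins (0 means not expressed)."""
    max_value = max(expression.values())
    tokens = {}
    for gene, value in expression.items():
        if value <= 0:
            tokens[gene] = 0
        else:
            tokens[gene] = min(n_bins, math.ceil(value / max_value * n_bins))
    return tokens


# Hypothetical cell and hypothetical corpus-level medians, for illustration only.
cell = {"GAPDH": 100.0, "TP53": 8.0, "MYC": 0.0, "SOX2": 4.0}
medians = {"GAPDH": 200.0, "TP53": 4.0, "MYC": 2.0, "SOX2": 1.0}

ranked = rank_order_tokens(cell, medians)
binned = bin_tokens(cell, n_bins=4)
```

In this toy cell the highly expressed housekeeping gene dominates the bins, while rank-ordering after median scaling moves it to the end of the sequence, which is the kind of trade-off the tokenization discussion is about.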

`chapters/background.tex` removed from `main.tex`

Content absorbed into new intro

`auxiliaries/background.tex` — Personal Motivation → Acknowledgements

56-line personal narrative (PiPle, Broad Institute, Whitelab journey) replaced with 3-paragraph concise academic acknowledgements

Charlotte's 7 points addressed

| Point | Status |
|---|---|
| 1. Remove personal narrative | ✅ auxiliaries/background.tex rewritten |
| 2. Structure and flow | ✅ motivation → GRN → FMs → contributions |
| 3. No basic textbook content | ✅ all content directly supports thesis |
| 4. Separate Intro and Background | ✅ single intro chapter, background.tex removed |
| 5. Align with actual thesis | ✅ scope section explicit on what is/is not covered |
| 6. Language and punctuation | ✅ no French spacing, academic register |
| 7. Suggested structure | ✅ followed exactly |

…response letter

# Response to Reviewer Charlotte Bunne

Dear Charlotte,

Thank you for your detailed and constructive feedback on the thesis manuscript. I address each of your points below.

---

## 1. Personal narrative / preamble

**Your comment:** "Remove the preamble and any personal narrative. A thesis introduction should be academic in tone and content."

**Response:** The "Personal Motivation" section has been removed from the Introduction entirely. Personal context has been moved to the Acknowledgements section, as you suggested. The Introduction now opens directly with the scientific problem setting.

---

## 2. Structure and narrative flow

**Your comment:** "The current version is fuzzy and reads like a list of loosely connected topics... without a coherent progression."

**Response:** The Introduction has been fully restructured following your proposed outline:

1. Motivation and problem setting
2. Foundation models and the rationale for large-scale learning approaches (technical, non-textbook)
3. Scientific aim
4. Thesis scope and contributions (with explicit statement of what is and is not covered)
5. Chapter-by-chapter overview

The "Promises of cellular biology" framing and the Feynman figure reference have been removed.

---

## 3. Basic textbook material

**Your comment:** "There is no need to explain the central dogma or basic principles of gene regulation... Do not confuse Introduction and Background."

**Response:** All pedagogical background material (RNA biology, central dogma, gene regulatory networks, single-cell sequencing, basic ML concepts) has been moved to a dedicated **Background chapter** (Chapter 2), which is intended for readers less familiar with either biology or machine learning. The Introduction no longer contains textbook-level content.

---

## 4. Introduction vs. Background separation

**Your comment:** "For now, in fact, you have no Background/Related Work section, it seems?"

**Response:** A full Background chapter has been added (chapters/background.tex), covering: (1) cell regulation and RNA biology, (2) gene regulatory networks, (3) single-cell sequencing technologies, and (4) foundational ML/AI concepts. This chapter is clearly separated from the Introduction in the thesis structure.

---

## 5. Scope alignment

**Your comment:** "The introduction currently lists many topics that are not addressed in your work."

**Response:** The Scope section now explicitly states what the thesis does not address (perturbation response prediction as a primary target, temporal dynamics, spatial transcriptomics as a primary modality). Any topics that appear on page 20 but are not covered have been removed or qualified.

---

## 6. Language, style, and punctuation

**Your comment:** "Many spelling errors, missing full stops, sloppy language. English does not use a space before '?' or ':'."

**Response:** The manuscript was reviewed with Grammarly and manually corrected. French typography habits (space before '?' and ':') have been removed. The LaTeX document class has been updated from a French thesis template to an English one, which also resolved automatic spacing insertion. The academic register has been reviewed and informal phrasing corrected throughout the Introduction.

---

## 7. Technical depth of the scFM state of the art

**Your comment (via PI):** "When you talk about the state-of-the-art use a more technical tone and give details on the different models instead of just listing and citing some."

**Response:** The Bio-Foundation Models section has been substantially expanded. Each model now includes:

- **Architecture specifics** (attention mechanism, encoder/decoder design, tokenization strategy)
- **Training dataset** (size, source, organism coverage)
- **Pretraining objective** (masked token prediction, autoregressive generation, contrastive learning, etc.)
- **Key benchmark results** and demonstrated capabilities
- **Limitations** identified by independent evaluations

Models now covered in technical depth: scBERT, Geneformer, scGPT, UCE, scFoundation.

---

## Note on versioning

The substantial structural revisions described above were implemented in commits `07ff110`, `35bb09e`, `cd52704`, `d72e822`, and `0a30b9c` (February 26 – March 11, 2026). We note that your email of March 14 indicated the revisions had not been addressed; it is possible you were viewing a cached copy of the PDF, as the updated version was committed and pushed on March 11. The current branch `charlotte-response` contains all revisions described in this letter.

Best regards,
Jérémie Kalfon

Copilot

Pull request overview

This PR updates the thesis materials to address reviewer Charlotte Bunne’s feedback, primarily by restructuring the Introduction and expanding the technical state-of-the-art discussion on single-cell foundation models, and by adding written correspondence artifacts (response letter, task list, and archived feedback).

Changes:

Restructures chapters/intro.tex to remove personal narrative and substantially expand/technicalize the scFM state-of-the-art section.
Adds a point-by-point response letter (RESPONSE_TO_CHARLOTTE.md) and an execution checklist (CHARLOTTE_TASKS.md).
Adds/archives rapporteur/reviewer correspondence (reply_to_rapporteurs.md, charlotte_feedback.txt).

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| reply_to_rapporteurs.md | Adds a draft reply document; currently contains significant character-encoding corruption. |
| charlotte_feedback.txt | Adds archived email thread containing reviewer feedback (and personal email addresses/signatures). |
| chapters/intro.tex | Removes personal narrative and expands single-cell foundation model discussion with architectural/training details. |
| RESPONSE_TO_CHARLOTTE.md | Adds a structured response letter to the reviewer. |
| CHARLOTTE_TASKS.md | Adds an internal task list to track remaining changes for addressing feedback. |
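
The character-encoding corruption flagged for reply_to_rapporteurs.md has a recognisable signature. The sketch below assumes, without confirmation from the review, that the file's UTF-8 bytes were at some point decoded as Windows-1252; under that assumption a UTF-8 byte order mark shows up as `ï»¿` and an arrow such as `→` turns into a short run starting with `â`. The helper names are hypothetical.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def misdecode(text):
    """Simulate the corruption: UTF-8 bytes read back as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Reverse the corruption; only works if no character was lost on the way."""
    return text.encode("cp1252").decode("utf-8")


# What a UTF-8 BOM looks like when a tool treats the file as Windows-1252.
bom_as_text = UTF8_BOM.decode("cp1252")

# Round trip on a short string containing non-ASCII arrows.
garbled = misdecode("A→B→C")
restored = repair(garbled)
```

If some of the garbled characters were stripped later (for example by a tool that drops non-printable or unusual symbols), `repair` raises a decoding error, and the original text has to be restored from the source instead.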

+Reply to rapporteurs
+Valentina
+A central conceptual question that would benefit from more explicit treatment concerns the relationship between the denoising training objective and the nature of the regulatory interactions captured by attention. Since scPRINT is trained to reconstruct downsampled expression profiles, the model is optimized to exploit global co-expression structure. This creates no obvious inductive bias toward direct regulatory interactions over indirect transitive paths of the form A→B→C. The manuscript does not fully address why attention weights in this setting should preferentially reflect direct regulation rather than co-expression. Given that biological interpretability of the inferred networks is a central claim, a more explicit theoretical treatment of this issue would substantially strengthen the work.
+As long as steady-state expression data are used, nothing more than co-expression can be achieved, and we do not make a claim to the contrary. The goal is not to infer GRN per se, but to understand the ability of foundation models to leverage an understanding of gene relationships (albeit through co-expression patterns) to achieve their tasks and how this general understanding enables them to perform many other downstream tasks. However, we do believe that foundation models can go beyond co-expression. Indeed, using ESM3 embeddings confers knowledge of protein structure and evolutionary relationships, and using gene location provides additional information on the probability of co-regulation. Working across species further provides patterns of expression not just within cells but across kingdoms. Obviously, nothing is causal yet without interventional or temporal data, and that is a point left to be worked out.


+On 23 Jan 2026, at 15:51, Jérémie Kalfon <jkobject@gmail.com> wrote:
+
+Dear Charlotte, Valentina,
+
+You will find, available through this link: https://github.com/jkobject/Thesis/blob/main/main.pdf, my Ph.D. manuscript to be evaluated.


-
 \subsection{Current Single-Cell Foundation Models and Their Limitations}
-In 2023, a year after Geneformer, several additional foundation models were released. scGPT \cite{cuiScGPTBuildingFoundation2024} showcased a GPT-style architecture and presented various losses for fine-tuning. It was the first example of systematic fine-tuning in single-cell and a more in-depth benchmark across four abilities: cell type prediction, gene network inference, perturbation prediction, and batch correction. However, it did not outperform state-of-the-art methods \cite{boiarskyDeepDiveSingleCell2023, alsabbaghFoundationModelsMeet2023}. At the same time, Universal Cell Embedding (UCE) \cite{rosenUniversalCellEmbeddings2023} demonstrated cross-species training to achieve state-of-the-art cross-species cell embeddings, introducing a contrastive loss function for cell representation learning (see Figure ~\ref{fig:UCE}).
+In 2023, a year after Geneformer, several additional foundation models were released. scGPT \cite{cuiScGPTBuildingFoundation2024} was the first to apply a GPT-style generative architecture to single-cell transcriptomics. Unlike its BERT-style predecessors, scGPT uses causal (unidirectional) self-attention, processing gene tokens sequentially and predicting each gene's expression conditioned on preceding genes. It was pretrained on approximately 33 million human cells from the CellxGene corpus using three objectives: autoregressive gene expression generation, masked value prediction, and a cell-level generation task. scGPT introduced the first systematic fine-tuning protocol and benchmarked across four tasks: cell-type annotation, GRN inference, perturbation prediction, and batch correction. However, independent evaluations \cite{boiarskyDeepDiveSingleCell2023, alsabbaghFoundationModelsMeet2023} demonstrated that scGPT does not consistently outperform dedicated state-of-the-art methods on any of these tasks, and that the causal attention design imposes an artificial gene ordering with no biological motivation.
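
The difference between causal (unidirectional) and bidirectional self-attention described in the added text can be shown on a toy score matrix. This is a generic sketch of attention masking in plain Python, with made-up scores for three gene tokens; it is not scGPT's implementation and makes no claim about its actual masking scheme.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax of a square score matrix, optionally causally masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Under a causal mask, token i only sees tokens at positions <= i.
        visible = [j for j in range(n) if (j <= i or not causal)]
        peak = max(scores[i][j] for j in visible)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - peak) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights


# Hypothetical raw attention scores between three gene tokens.
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

bidirectional = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)
```

With the causal mask, the first token can attend only to itself even though its highest raw score points at the third token. Which genes are visible to which therefore depends entirely on the order in which genes are fed to the model, which is the ordering concern raised in the text.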


chatgpt-codex-connector · 2026-03-15T15:49:21Z

💡 Codex Review

Thesis/context_papers/Geneformer.pdf

Line 1 in 6bef1c9

<!DOCTYPE html>

Replace Geneformer asset with a real PDF

The file is committed as Geneformer.pdf but its content starts with HTML (<!DOCTYPE html>), meaning this is a saved Nature landing page rather than a PDF document. Any workflow or user that opens context_papers/*.pdf with PDF tooling (viewer, parser, text extractor) will fail or get unusable content, so the reference corpus is currently broken for offline/document-processing use.

Thesis/context_papers/scGPT.pdf

Line 1 in 6bef1c9

<!DOCTYPE html>

Replace scGPT asset with a real PDF

This file is also stored with a .pdf extension but contains HTML (<!DOCTYPE html>), so it is not a valid PDF artifact. As a result, consumers expecting actual PDFs in context_papers/ cannot reliably open or parse this reference, which breaks reproducibility for anyone using these files as a local paper dataset.
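
A check along the following lines could catch this class of problem before commit. It is a minimal sketch: the function names are hypothetical, the demo files are created in a temporary directory rather than read from the repository, and it only tests that the file begins with the `%PDF-` header marker (some PDF readers tolerate leading bytes before the header, which this strict check would reject).

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def find_fake_pdfs(directory):
    """List *.pdf files in a directory whose content is not a PDF."""
    return sorted(
        p.name for p in Path(directory).glob("*.pdf") if not looks_like_pdf(p)
    )


# Demo on throwaway files: one with a PDF header, one saved HTML page.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "real.pdf").write_bytes(b"%PDF-1.7\n")
    (Path(tmp) / "landing_page.pdf").write_bytes(b"<!DOCTYPE html>\n<html></html>")
    fake = find_fake_pdfs(tmp)
```

Run against a directory such as `context_papers/`, the same function would list every file that carries a `.pdf` extension but starts with something else.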

… background content

- Section 1: thesis goals with concrete results (scPRINT-2 SOTA numbers)
- Section 2: formal GRN problem statement with math (X, G, E, W notation), detailed method descriptions (GENIE3/GRNBoost2, pySCENIC, PIDC, SCODE), comparative table, benchmarking section (BEELINE, SERGIO, GeneRNI, BenGRN), metrics (AUROC, AUPRC, EPR with formulas), ground truths (OmniPath, ENCODE, perturb-seq, MCalla intersection)
- Section 3: transformer self-attention math, masked gene modeling objective, encoding challenges, efficient attention (Flash Attention, Performer, criss-cross), bio FMs (ESM2, AlphaFold2, Nucleotide Transformer), detailed scFM reviews (scBERT, Geneformer, scGPT, UCE, scFoundation), brief (scCello, LangCell)
- Section 4: contributions chapters kept verbatim from previous intro
- main.tex: remove \input{chapters/background} and its associated header/counter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
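
Of the metrics listed in this commit message, EPR is the least standard. The sketch below follows the definition used in the BEELINE benchmark as I understand it: early precision is the fraction of true edges among the top-k predictions, with k equal to the number of edges in the ground truth, and EPR divides that by the precision a random predictor would reach (the ground-truth edge density). The edges and ranking are invented for illustration, and the thesis formulas may differ in detail.

```python
def early_precision_ratio(ranked_edges, true_edges, n_possible_edges):
    """Early precision of the top-k predictions relative to a random predictor.

    ranked_edges: predicted edges, sorted from most to least confident.
    true_edges: set of ground-truth edges; k is its size.
    n_possible_edges: number of candidate edges the predictor chose from.
    """
    k = len(true_edges)
    top_k = ranked_edges[:k]
    early_precision = sum(edge in true_edges for edge in top_k) / k
    random_precision = k / n_possible_edges  # ground-truth edge density
    return early_precision / random_precision


# Hypothetical 4-gene network: 4 * 3 = 12 possible directed edges, 3 true ones.
truth = {("A", "B"), ("B", "C"), ("C", "D")}
prediction = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]

epr = early_precision_ratio(prediction, truth, n_possible_edges=12)
```

An EPR of 1 corresponds to random performance, so values are comparable across ground truths of different density, which raw precision is not.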

1. Table 1: reduced font size (scriptsize) and text length
2. Existing benchmarks: mention only BEELINE, moved simulated data tools here
3. Ground truths: removed intersections, added simulated expression & gene networks
4. Foundation models section: restructured to start general (transformers/vision/NLP) → biology → scFMs
5. UCE: removed ESM2 embeddings requirement sentence
6. Removed 'Additional models' and 'Key bottlenecks' sections
7. Geneformer: detailed why reported comparisons failed (Boiarsky 2023 findings)

Copilot AI review requested due to automatic review settings March 15, 2026 15:44

jkobject force-pushed the charlotte-response branch from ce6dcb2 to 6bef1c9 Compare March 15, 2026 15:45

Copilot started reviewing on behalf of jkobject March 15, 2026 15:45

Copilot AI reviewed Mar 15, 2026

jkobject and others added 10 commits March 15, 2026 16:14

charlotte-response: fix opening tone, remove Monod figure, fix tense,…

868e19d

… add scCello/LangCell, condense LLM section

Potential fix for pull request finding

0ee03a9

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

3330382

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

0a85273

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix: correct scCello and LangCell cite keys, add scCello to bibliography

cabbf9a

fix: remove GRN doublon, depersonalize limitations section

ecea1b6

fix: remove 'promises of cellular biology' section, Monod personal an…

2cd4f94

…ecdote, vulgarisation language from background.tex

fix: clean informal language in background.tex AI section, update res…

6dc7266

…ponse to Charlotte with scCello/LangCell

fix: replace personal narrative in auxiliaries/background with concis…

9408270

…e academic acknowledgements

jkobject changed the title ~~Charlotte Bunne reviewer feedback — restructured intro & expanded scFM section~~ Response to Charlotte — complete intro restructuring, background removed, personal narrative removed Mar 16, 2026

jkobject added 10 commits March 16, 2026 16:10

built pdf

76e9ef1

fix: add missing citations and glossary entries

78c188f

- Added Dosovitskiy 2021 (Vision Transformer) citation
- Added Schaffter 2011 (GeneNetWeaver) citation
- Added API glossary entry

chore: build PDF after intro edits and citation fixes

0d1e081

final

ff30c8a

final

2c2036f

final

6f67ff3

Delete CHARLOTTE_TASKS.md

c0d07ef

Delete RESPONSE_TO_CHARLOTTE.md

5287569

Delete charlotte_feedback.txt

f779362

jkobject merged commit 974d240 into main Mar 19, 2026
1 check failed

jkobject merged 21 commits into main from charlotte-response
