Response to Charlotte — complete intro restructuring, background removed, personal narrative removed#5

Merged
jkobject merged 21 commits into main from charlotte-response
Mar 19, 2026

Conversation

jkobject (Owner) commented Mar 15, 2026

Summary

This PR addresses all 7 points from Charlotte Bunne's feedback (24 Feb 2026, with a follow-up on 14 Mar 2026).

Key changes (latest — 16 Mar 2026)

chapters/intro.tex (231 → 1000 lines)

  • Opens directly on the central computational challenge and actual thesis results
  • Section 2 — GRN Inference: formal mathematical problem statement (X∈ℝ^{n×g}, G=(V,E,W), cell-type-specific GRNs), detailed explanations with equations: GENIE3/GRNBoost2 (random forest + feature importance), pySCENIC (two-step: co-expression + motif enrichment + AUCell), PIDC (partial information decomposition), SCODE (linear ODE); comparison table
  • Benchmarking: BEELINE, GeneRNI Living Benchmark, metrics (AUROC, AUPRC, EPR), ground truths (OmniPath, ChIP-seq, Perturb-seq)
  • Section 3 — Foundation Models: self-attention with full equations, masked gene modeling, scRNA tokenization challenges (binning vs rank-ordering vs MLP)
  • scFMs detailed (~half page each): scBERT, Geneformer (rank-ordering explained), scGPT (causal attention), UCE (ESM2 gene tokens), scFoundation
  • Contributions section preserved
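
The tokenization options named in the list above (binning vs rank-ordering) can be illustrated with a minimal sketch. This is not code from the thesis or from any of the cited models: the function names `rank_order_tokens` and `bin_tokens` are hypothetical, the binning rule is an assumed per-cell quantile-style scheme, and the rank-ordering shown here skips any corpus-wide normalization a real model may apply before ranking.

```python
from bisect import bisect_left


def rank_order_tokens(expr, gene_names):
    """Rank-ordering: list expressed genes from highest to lowest expression."""
    expressed = [(value, name) for value, name in zip(expr, gene_names) if value > 0]
    # Sort by descending expression; ties are broken by gene name for determinism.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in expressed]


def bin_tokens(expr, n_bins):
    """Binning: map each nonzero value to a bin in 1..n_bins; zeros stay 0."""
    nonzero = sorted(value for value in expr if value > 0)
    if not nonzero:
        return [0] * len(expr)
    tokens = []
    for value in expr:
        if value <= 0:
            tokens.append(0)
        else:
            # Share of this cell's nonzero values strictly below `value`,
            # scaled to the number of bins (integer arithmetic only).
            below = bisect_left(nonzero, value)
            tokens.append(1 + (below * n_bins) // len(nonzero))
    return tokens


genes = ["A", "B", "C", "D", "E", "F"]
cell = [0, 5, 1, 0, 3, 8]
ranked = rank_order_tokens(cell, genes)  # gene identities, order carries the signal
binned = bin_tokens(cell, n_bins=2)      # one discrete value token per gene
```

In this toy cell, rank-ordering keeps only the four expressed genes and encodes expression through their order, while binning keeps one token per gene and encodes expression through a coarse level. The third option mentioned above (an MLP over continuous values) is not sketched here.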

chapters/background.tex removed from main.tex

  • Content absorbed into new intro

auxiliaries/background.tex — Personal Motivation → Acknowledgements

  • 56-line personal narrative (PiPle, Broad Institute, Whitelab journey) replaced with 3-paragraph concise academic acknowledgements

Charlotte 7 points addressed

| Point | Status |
| --- | --- |
| 1. Remove personal narrative | ✅ auxiliaries/background.tex rewritten |
| 2. Structure and flow | ✅ motivation → GRN → FMs → contributions |
| 3. No basic textbook content | ✅ all content directly supports thesis |
| 4. Separate Intro and Background | ✅ single intro chapter, background.tex removed |
| 5. Align with actual thesis | ✅ scope section explicit on what is/is not covered |
| 6. Language and punctuation | ✅ no French spacing, academic register |
| 7. Suggested structure | ✅ followed exactly |

Copilot AI review requested due to automatic review settings March 15, 2026 15:44
…response letter

# Response to Reviewer Charlotte Bunne

Dear Charlotte,

Thank you for your detailed and constructive feedback on the thesis manuscript. I address each of your points below.

---

## 1. Personal narrative / preamble

**Your comment:** "Remove the preamble and any personal narrative. A thesis introduction should be academic in tone and content."

**Response:** The "Personal Motivation" section has been removed from the Introduction entirely. Personal context has been moved to the Acknowledgements section, as you suggested. The Introduction now opens directly with the scientific problem setting.

---

## 2. Structure and narrative flow

**Your comment:** "The current version is fuzzy and reads like a list of loosely connected topics... without a coherent progression."

**Response:** The Introduction has been fully restructured following your proposed outline:
1. Motivation and problem setting
2. Foundation models and the rationale for large-scale learning approaches (technical, non-textbook)
3. Scientific aim
4. Thesis scope and contributions (with explicit statement of what is and is not covered)
5. Chapter-by-chapter overview

The "Promises of cellular biology" framing and the Feynman figure reference have been removed.

---

## 3. Basic textbook material

**Your comment:** "There is no need to explain the central dogma or basic principles of gene regulation... Do not confuse Introduction and Background."

**Response:** All pedagogical background material (RNA biology, central dogma, gene regulatory networks, single-cell sequencing, basic ML concepts) has been moved to a dedicated **Background chapter** (Chapter 2), which is intended for readers less familiar with either biology or machine learning. The Introduction no longer contains textbook-level content.

---

## 4. Introduction vs. Background separation

**Your comment:** "For now, in fact, you have no Background/Related Work section, it seems?"

**Response:** A full Background chapter has been added (chapters/background.tex), covering: (1) cell regulation and RNA biology, (2) gene regulatory networks, (3) single-cell sequencing technologies, and (4) foundational ML/AI concepts. This chapter is clearly separated from the Introduction in the thesis structure.

---

## 5. Scope alignment

**Your comment:** "The introduction currently lists many topics that are not addressed in your work."

**Response:** The Scope section now explicitly states what the thesis does not address (perturbation response prediction as a primary target, temporal dynamics, spatial transcriptomics as a primary modality). Any topics that appear on page 20 but are not covered have been removed or qualified.

---

## 6. Language, style, and punctuation

**Your comment:** "Many spelling errors, missing full stops, sloppy language. English does not use a space before '?' or ':'."

**Response:** The manuscript was reviewed with Grammarly and manually corrected. French typography habits (space before '?' and ':') have been removed. The LaTeX document class has been updated from a French thesis template to an English one, which also resolved automatic spacing insertion. The academic register has been reviewed and informal phrasing corrected throughout the Introduction.

---

## 7. Technical depth of the scFM state of the art

**Your comment (via PI):** "When you talk about the state-of-the-art use a more technical tone and give details on the different models instead of just listing and citing some."

**Response:** The Bio-Foundation Models section has been substantially expanded. Each model now includes:
- **Architecture specifics** (attention mechanism, encoder/decoder design, tokenization strategy)
- **Training dataset** (size, source, organism coverage)
- **Pretraining objective** (masked token prediction, autoregressive generation, contrastive learning, etc.)
- **Key benchmark results** and demonstrated capabilities
- **Limitations** identified by independent evaluations

Models now covered in technical depth: scBERT, Geneformer, scGPT, UCE, scFoundation.

---

## Note on versioning

The substantial structural revisions described above were implemented in commits `07ff110`, `35bb09e`, `cd52704`, `d72e822`, and `0a30b9c` (February 26 – March 11, 2026). We note that your email of March 14 indicated the revisions had not been addressed — it is possible you were viewing a cached copy of the PDF, as the updated version was committed and pushed on March 11. The current branch `charlotte-response` contains all revisions described in this letter.

Best regards,
Jérémie Kalfon

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the thesis materials to address reviewer Charlotte Bunne’s feedback, primarily by restructuring the Introduction, expanding the state-of-the-art discussion on single-cell foundation models with more technical detail, and adding written correspondence artifacts (response letter, task list, and archived feedback).

Changes:

  • Restructures chapters/intro.tex to remove personal narrative and substantially expand the scFM state-of-the-art section with more technical detail.
  • Adds a point-by-point response letter (RESPONSE_TO_CHARLOTTE.md) and an execution checklist (CHARLOTTE_TASKS.md).
  • Adds/archives rapporteur/reviewer correspondence (reply_to_rapporteurs.md, charlotte_feedback.txt).

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| reply_to_rapporteurs.md | Adds a draft reply document; currently contains significant character-encoding corruption. |
| charlotte_feedback.txt | Adds archived email thread containing reviewer feedback (and personal email addresses/signatures). |
| chapters/intro.tex | Removes personal narrative and expands single-cell foundation model discussion with architectural/training details. |
| RESPONSE_TO_CHARLOTTE.md | Adds a structured response letter to the reviewer. |
| CHARLOTTE_TASKS.md | Adds an internal task list to track remaining changes for addressing feedback. |

Comment thread RESPONSE_TO_CHARLOTTE.md Outdated
Comment thread reply_to_rapporteurs.md Outdated
Comment thread reply_to_rapporteurs.md
Reply to rapporteurs
Valentina
A central conceptual question that would benefit from more explicit treatment concerns the relationship between the denoising training objective and the nature of the regulatory interactions captured by attention. Since scPRINT is trained to reconstruct downsampled expression profiles, the model is optimized to exploit global co-expression structure. This creates no obvious inductive bias toward direct regulatory interactions over indirect transitive paths of the form A→B→C. The manuscript does not fully address why attention weights in this setting should preferentially reflect direct regulation rather than co-expression. Given that biological interpretability of the inferred networks is a central claim, a more explicit theoretical treatment of this issue would substantially strengthen the work.
As long as steady-state expression data are used, nothing more than co-expression can be achieved, and we do not make a claim to the contrary. The goal is not to infer GRNs per se, but to understand the ability of foundation models to leverage an understanding of gene relationships (albeit through co-expression patterns) to achieve their tasks, and how this general understanding enables them to perform many other downstream tasks. However, we do believe that foundation models can go beyond co-expression. Indeed, using ESM3 embeddings confers knowledge of protein structure and evolutionary relationships, and using gene location provides additional information on the probability of co-regulation. Working across species further provides patterns of expression not just within cells but across kingdoms. Obviously, nothing is causal yet without interventional or temporal data, and that point is left to be worked out.
Comment thread charlotte_feedback.txt Outdated
Comment on lines +12 to +16
On 23 Jan 2026, at 15:51, Jérémie Kalfon <jkobject@gmail.com> wrote:

Dear Charlotte, Valentina,

You will find, available through this link: https://github.com/jkobject/Thesis/blob/main/main.pdf, my Ph.D. manuscript to be evaluated.
Comment thread chapters/intro.tex Outdated
Comment thread chapters/intro.tex Outdated

\subsection{Current Single-Cell Foundation Models and Their Limitations}
In 2023, a year after Geneformer, several additional foundation models were released. scGPT \cite{cuiScGPTBuildingFoundation2024} showcased a GPT-style architecture and presented various losses for fine-tuning. It was the first example of systematic fine-tuning in single-cell and a more in-depth benchmark across four abilities: cell type prediction, gene network inference, perturbation prediction, and batch correction. However, it did not outperform state-of-the-art methods \cite{boiarskyDeepDiveSingleCell2023, alsabbaghFoundationModelsMeet2023}. At the same time, Universal Cell Embedding (UCE) \cite{rosenUniversalCellEmbeddings2023} demonstrated cross-species training to achieve state-of-the-art cross-species cell embeddings, introducing a contrastive loss function for cell representation learning (see Figure ~\ref{fig:UCE}).
In 2023, a year after Geneformer, several additional foundation models were released. scGPT \cite{cuiScGPTBuildingFoundation2024} was the first to apply a GPT-style generative architecture to single-cell transcriptomics. Unlike its BERT-style predecessors, scGPT uses causal (unidirectional) self-attention, processing gene tokens sequentially and predicting each gene's expression conditioned on preceding genes. It was pretrained on approximately 33 million human cells from the CellxGene corpus using three objectives: autoregressive gene expression generation, masked value prediction, and a cell-level generation task. scGPT introduced the first systematic fine-tuning protocol and benchmarked across four tasks: cell-type annotation, GRN inference, perturbation prediction, and batch correction. However, independent evaluations \cite{boiarskyDeepDiveSingleCell2023, alsabbaghFoundationModelsMeet2023} demonstrated that scGPT does not consistently outperform dedicated state-of-the-art methods on any of these tasks, and that the causal attention design imposes an artificial gene ordering with no biological motivation.
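
The difference between causal (unidirectional) and bidirectional self-attention described in the revised paragraph can be made concrete with a small sketch. It only computes attention weights from raw scores, it is not taken from scGPT or any other model's code, and the function name `attention_weights` is a hypothetical one chosen for illustration.

```python
import math


def attention_weights(scores, causal=False):
    """Row-wise softmax over raw attention scores, optionally with a causal mask."""
    weights = []
    for i, row in enumerate(scores):
        # Under a causal mask, token i may only attend to tokens j <= i.
        visible = [s if (not causal or j <= i) else -math.inf
                   for j, s in enumerate(row)]
        peak = max(visible)  # finite, since j == i is always visible
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three gene tokens with identical raw scores.
flat_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
bidirectional = attention_weights(flat_scores)
causal = attention_weights(flat_scores, causal=True)
```

With the mask, the first token attends only to itself and the second only to the first two, so which genes can influence which depends on the order in which they are fed to the model. Without the mask every token sees every other token and the input order carries no such constraint.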
@chatgpt-codex-connector

💡 Codex Review

<!DOCTYPE html>

[P2] Replace Geneformer asset with a real PDF

The file is committed as Geneformer.pdf but its content starts with HTML (<!DOCTYPE html>), meaning this is a saved Nature landing page rather than a PDF document. Any workflow or user that opens context_papers/*.pdf with PDF tooling (viewer, parser, text extractor) will fail or get unusable content, so the reference corpus is currently broken for offline/document-processing use.


<!DOCTYPE html>

[P2] Replace scGPT asset with a real PDF

This file is also stored with a .pdf extension but contains HTML (<!DOCTYPE html>), so it is not a valid PDF artifact. As a result, consumers expecting actual PDFs in context_papers/ cannot reliably open or parse this reference, which breaks reproducibility for anyone using these files as a local paper dataset.
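
Both findings above come down to files whose first bytes are HTML rather than a PDF header. A minimal sketch of a check that could catch this before committing is shown below. The helper names `looks_like_pdf` and `find_fake_pdfs` are hypothetical, and the check is deliberately strict: it assumes the `%PDF-` marker sits at the very start of the file, whereas some PDF readers tolerate leading bytes.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file begins with the PDF header marker."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC


def find_fake_pdfs(directory):
    """List names of *.pdf files in `directory` that do not start with %PDF-."""
    return sorted(
        candidate.name
        for candidate in Path(directory).glob("*.pdf")
        if not looks_like_pdf(candidate)
    )
```

Running `find_fake_pdfs("context_papers")` from the repository root would list any `.pdf` files that are really saved web pages, assuming the directory layout described in the review comments.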

jkobject and others added 10 commits March 15, 2026 16:14
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…ecdote, vulgarisation language from background.tex
… background content

- Section 1: thesis goals with concrete results (scPRINT-2 SOTA numbers)
- Section 2: formal GRN problem statement with math (X, G, E, W notation),
  detailed method descriptions (GENIE3/GRNBoost2, pySCENIC, PIDC, SCODE),
  comparative table, benchmarking section (BEELINE, SERGIO, GeneRNI, BenGRN),
  metrics (AUROC, AUPRC, EPR with formulas), ground truths (OmniPath, ENCODE,
  perturb-seq, MCalla intersection)
- Section 3: transformer self-attention math, masked gene modeling objective,
  encoding challenges, efficient attention (Flash Attention, Performer, criss-cross),
  bio FMs (ESM2, AlphaFold2, Nucleotide Transformer), detailed scFM reviews
  (scBERT, Geneformer, scGPT, UCE, scFoundation), brief (scCello, LangCell)
- Section 4: contributions chapters kept verbatim from previous intro
- main.tex: remove \input{chapters/background} and its associated header/counter
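
For readers unfamiliar with the EPR metric listed in this commit, here is a minimal sketch of the early precision ratio as it is commonly defined in BEELINE-style benchmarks: precision among the top-k predicted edges, with k the number of ground-truth edges, divided by the precision expected from a random predictor. The exact formula used in the thesis is not shown in this PR, so the definition, the function name `early_precision_ratio`, and the toy data are assumptions.

```python
def early_precision_ratio(edge_scores, true_edges, n_possible_edges):
    """Early precision over the top-k predictions, divided by the random baseline.

    edge_scores: dict mapping (regulator, target) -> confidence score.
    true_edges: set of (regulator, target) pairs in the ground-truth network.
    n_possible_edges: number of candidate edges a random predictor could pick.
    """
    k = len(true_edges)
    if k == 0 or n_possible_edges == 0:
        raise ValueError("need a non-empty ground truth and candidate set")
    # Rank predicted edges by decreasing score and keep the top k.
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    hits = sum(1 for edge in ranked[:k] if edge in true_edges)
    early_precision = hits / k
    # A random predictor's expected precision equals the ground-truth edge density.
    random_precision = k / n_possible_edges
    return early_precision / random_precision


# Toy example: 3 genes, hence 6 possible directed edges without self-loops.
scores = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.4, ("C", "A"): 0.1}
truth = {("A", "B"), ("B", "C")}
epr = early_precision_ratio(scores, truth, n_possible_edges=6)
```

In the toy example one of the top two predictions is correct (early precision 0.5) against an edge density of 1/3, giving a ratio of 1.5. Values above 1 indicate a ranking better than random.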

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jkobject changed the title from "Charlotte Bunne reviewer feedback — restructured intro & expanded scFM section" to "Response to Charlotte — complete intro restructuring, background removed, personal narrative removed" on Mar 16, 2026
jkobject added 10 commits March 16, 2026 16:10
1. Table 1: reduced font size (scriptsize) and text length
2. Existing benchmarks: mention only BEELINE, moved simulated data tools here
3. Ground truths: removed intersections, added simulated expression & gene networks
4. Foundation models section: restructured to start general (transformers/vision/NLP) → biology → scFMs
5. UCE: removed ESM2 embeddings requirement sentence
6. Removed 'Additional models' and 'Key bottlenecks' sections
7. Geneformer: detailed why reported comparisons failed (Boiarsky 2023 findings)
- Added Dosovitskiy 2021 (Vision Transformer) citation
- Added Schaffter 2011 (GeneNetWeaver) citation
- Added API glossary entry
@jkobject jkobject merged commit 974d240 into main Mar 19, 2026
1 check failed