Technical Appendix

This appendix keeps the operational details in one place so the README can stay clean. For biological context on the liver cell compartments, see biology_primer_liver_fibrosis.md.

Repository Structure

config/        Dataset paths, marker panels, scoring weights
workflow/      Ordered R workflow stages
src/R/         Shared R helpers
scripts/       Data prep, validation, evidence enrichment, checks
nextflow/      Local and AWS workflow scaffold
dashboard/     Shiny app and dashboard-ready CSVs
reports/       Executive report, written responses, figures, tables
docs/          Method walkthrough and technical appendix
data/demo/     Tiny tracked demo dataset
data/metadata/ Curated public-data manifests

Inputs

Primary input:

  • GSE136103 GEO processed count matrices: barcodes.tsv.gz, genes.tsv.gz, matrix.mtx.gz
  • Manifest generated by workflow/02_curate_metadata.R

Validation inputs:

  • GSE244832 processed liver matrix and metadata for MASH/HSC validation
  • GSE207310 bulk count tables and GEO phenotype metadata for NAFLD/NASH directionality
  • GSE136103 blood and mouse libraries for marker specificity and preclinical conservation checks

For proprietary or future datasets, the expected minimum metadata are:

Field          Purpose
sample_id      Stable sample identifier
donor          Biological replicate
disease_state  Control, MASH, cirrhosis, fibrosis bin, or source label
tissue         Liver, blood, other tissue source
species        Human, mouse, other
assay_type     scRNA-seq, snRNA-seq, bulk RNA-seq, spatial
batch          Chemistry, site, run, or processing batch

Recommended additional fields are fibrosis stage, MASLD/MASH category, biopsy source, sex, age, BMI, diabetes status, medication, and any other clinical covariates available.
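Before running the workflow on a new dataset, it can help to confirm that the sample sheet carries the minimum fields listed above. The following is a minimal sketch, not part of the repository: the function name `missing_metadata_fields` is hypothetical, and it assumes a comma-separated file whose first line is a plain, unquoted header.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report required metadata columns missing from a CSV header.
# Assumes a comma-separated file whose first line is a plain (unquoted) header.

REQUIRED_FIELDS="sample_id donor disease_state tissue species assay_type batch"

missing_metadata_fields() {
  local csv="$1" header field missing=""
  # Read the header line and strip a trailing carriage return, if any.
  header="$(head -n 1 "$csv" | tr -d '\r')"
  for field in $REQUIRED_FIELDS; do
    # Wrap in commas so that "batch" does not match a column such as "batch_id".
    case ",$header," in
      *",$field,"*) ;;
      *) missing="$missing $field" ;;
    esac
  done
  # Print the missing fields (space-separated) and return non-zero if any.
  missing="${missing# }"
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}
```

Calling `missing_metadata_fields` with the path to a sample sheet prints nothing and returns 0 when every required column is present. Otherwise it prints the missing column names and returns 1, so it can gate a later step in a script.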

Main Local Commands

make check
make fetch-data
make curate
make analyze
make refine-labels
make pseudobulk
make pathfindr
make prioritize
make validation
make hsc-validation
make gse244832-focused
make gse207310-validation
make secondary-validation
make evidence
make translational-evidence
make dashboard
make render-summary
make validate-repo

The full local workflow is:

make all

Environment

  • R 4.6.0
  • Seurat 5.5.0
  • pathfindR 2.6.0
  • Java/OpenJDK for pathfindR active-subnetwork search and Nextflow
  • renv.lock for package pinning
  • Dockerfile for containerized execution
  • Nextflow plus Java for local and AWS pipeline execution
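A quick preflight check can confirm that the command-line tools in this list are on PATH before a long run starts. This is a minimal sketch under assumptions: `missing_tools` is a hypothetical helper, and the executable names in the example call (`Rscript`, `java`, `nextflow`, `docker`, `make`) are inferred from the list above rather than taken from the repository's own checks.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper: list required executables that are not on PATH.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    # command -v succeeds only when the name resolves to something runnable.
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  missing="${missing# }"
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}

# Example call (executable names are assumptions based on the environment list):
# missing_tools Rscript java nextflow docker make
```

The helper only checks presence, not versions. Package versions remain pinned through renv.lock.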

Restore R packages:

Rscript -e "renv::restore()"

Main Outputs

Executive and interpretation:

  • reports/executive_submission_summary.html
  • reports/executive_submission_summary.Rmd
  • reports/executive_submission_summary.md
  • docs/analysis_walkthrough.md
  • docs/biology_primer_liver_fibrosis.md
  • reports/screening_responses/README.md

Core tables:

  • reports/tables/qc_by_library.csv
  • reports/tables/qc_filtered_by_library_compartment.csv
  • reports/tables/compartment_de_cell_level_exploratory.csv
  • reports/tables/pseudobulk_de_by_refined_state.csv
  • reports/tables/pseudobulk_priority_gene_de.csv
  • reports/tables/hallmark_pathway_enrichment.csv
  • reports/tables/pathfindr_pseudobulk_run_summary.csv
  • reports/tables/pathfindr_pseudobulk_reactome_enrichment.csv
  • reports/tables/ranked_biomarker_target_candidates_translational.csv
  • reports/tables/target_prioritization_scoring_components.csv
  • reports/tables/target_prioritization_scoring_method.csv

Validation tables:

  • reports/tables/gse244832_focused_object_candidate_summary.csv
  • reports/tables/validation_gse207310_candidate_lm_results.csv
  • reports/tables/gse136103_blood_candidate_marker_role_summary.csv
  • reports/tables/gse136103_mouse_candidate_ortholog_summary.csv

Core figures:

  • reports/figures/required_compartment_marker_dotplot.png
  • reports/figures/umap_refined_cell_states.png
  • reports/figures/pseudobulk_priority_gene_de.png
  • reports/figures/pathfindr_pseudobulk_reactome_barplot.png
  • reports/figures/pathfindr_pseudobulk_reactome_dotplot.png
  • reports/figures/ranked_candidate_scores.png
  • reports/figures/gse244832_focused_object_validation_heatmap.png
  • reports/figures/gse207310_candidate_validation_heatmap.png
  • reports/figures/gse136103_blood_candidate_marker_heatmap.png
  • reports/figures/gse136103_mouse_candidate_ortholog_heatmap.png

Data Policy

Tracked:

  • code
  • config
  • metadata manifests
  • demo dataset
  • compact tables and figures
  • dashboard-ready CSVs
  • reports and documentation

Not tracked:

  • raw GEO archives
  • extracted validation matrices
  • large Seurat objects
  • logs
  • private notes or assignment context

Large analysis objects should live in git-ignored folders for local runs and in S3 or EFS for AWS runs.
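One way to catch a large object before it is committed by accident is to scan a directory for files above a size limit. The sketch below is illustrative only: `large_files` is a hypothetical helper, and any byte limit passed to it is an assumption, since the repository states no specific size threshold.

```bash
#!/usr/bin/env bash
# Hypothetical check: print files under a directory that exceed a byte limit.
large_files() {
  local dir="$1" max_bytes="$2"
  # -size +Nc matches files strictly larger than N bytes; .git internals are skipped.
  find "$dir" -type f -not -path '*/.git/*' -size +"${max_bytes}c" | LC_ALL=C sort
}
```

The function scans every file under the directory, tracked or not, so any hit should be compared against the ignore rules before acting on it.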

Public Evidence Sources

The target-evidence enrichment uses:

  • MyGene.info for identifiers
  • Open Targets for tractability and safety-liability annotations
  • ClinicalTrials.gov for liver-context trial text matches
  • ClinVar for gene-level clinical variant counts
  • UniProt for protein localization, tissue specificity, and function
  • PubMed for perturbation and safety literature signal
  • babelgene for human-to-mouse orthology

These are triage layers. They do not replace donor-aware disease biology, validation data, protein localization, or perturbation experiments.

Nextflow And AWS Pattern

Local demo:

make nextflow-demo

The demo outputs a compact report plus QC, embedding, candidate DE, pathway-theme, and ranked-candidate artifacts under reports/nextflow_demo/. It is a small contract test for a dataset-independent workflow, not a substitute for the full Seurat run.

Direct Nextflow run:

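# Put a Java runtime on PATH for Nextflow. This is the Apple Silicon Homebrew OpenJDK location; adjust it for other systems.
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"
nextflow run nextflow/fibrotarget_demo -profile local --outdir reports/nextflow_demo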

AWS pattern:

export NXF_WORK=s3://<bucket>/fibrotarget-liver/work
export PIPELINE_IMAGE=<account>.dkr.ecr.us-west-2.amazonaws.com/fibrotarget-liver:latest

nextflow run nextflow/fibrotarget_demo -profile aws \
  --input s3://<bucket>/demo/demo_samplesheet.csv \
  --outdir s3://<bucket>/results/fibrotarget-demo

Expected production services:

  • S3 for raw, processed, report, and dashboard outputs
  • ECR for the analysis image
  • AWS Batch or ECS for execution
  • CloudWatch for logs
  • Parameter Store or Secrets Manager for protected configuration

Current Limits

  • Cell-level DE is exploratory; donor-level pseudobulk is the preferred inferential layer.
  • GSE244832 validation is limited to a focused local object; a full all-gene reanalysis needs larger compute.
  • GSE136103 mouse validation has one healthy and one fibrotic mouse sample, so it is a conservation screen, not a powered preclinical DE analysis.
  • Macrophage candidates need a macrophage-focused external atlas and spatial validation before therapeutic nomination.
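To make the first limit concrete: donor-level pseudobulk sums counts across all cells from one donor, so each donor contributes a single observation per gene instead of one per cell. The sketch below shows only that aggregation step. It is not the repository's Seurat-based implementation, and both the `pseudobulk_counts` name and the long-format input layout (`donor,cell,gene,count`) are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: sum per-cell counts into one value per donor and gene.
# Input: CSV with header donor,cell,gene,count (an assumed layout).
pseudobulk_counts() {
  awk -F, '
    NR == 1 { next }                      # skip the header row
    { total[$1 "," $3] += $4 }            # accumulate counts per donor,gene pair
    END { for (key in total) print key "," total[key] }
  ' "$1" | LC_ALL=C sort
}
```

Each output row has the form `donor,gene,summed_count`. Downstream testing then compares donors rather than cells, which avoids treating cells from the same donor as independent replicates.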