BCR-GEX-Tutorial/BCR_GEX_Tutorial_Part3.Rmd at main · Part-coder-part-batman/BCR-GEX-Tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "BCR + GEX Integration Tutorial -- Part 3: Repertoire Visualization"
subtitle: "Clonality donuts, phylogenetic trees, and shared clone analysis"
author: "Nachi Nathan"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    theme: flatly
    code_folding: show
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo    = TRUE,
  message = TRUE,
  warning = FALSE
)
```

***

# Prerequisites
Before running this part, the following must be in place.

**Input files required**

| File | Where it comes from |
|------|-------------|
| `P1_LN_bcr_data.tsv` | Part 1 output, or download from Zenodo: https://zenodo.org/records/20324083 |
| `P1_PT_bcr_data.tsv` | Part 1 output, or download from Zenodo: https://zenodo.org/records/20324083 |
| `integrated_S7_annotated.rds` | Part 2 output, or download from Zenodo: https://zenodo.org/records/20324083 |
| `BCR_viz_functions.R` | This tutorial repo, at `ROOT_DIR/` |

**IQ-TREE 2 (required for phylogenetic trees only)**
The tree section uses the dowser package, which calls IQ-TREE 2 externally to
build the tree topology. IQ-TREE 2 must be installed before running
Section 8.4. The donut plot sections (8.1--8.3) and all of Step 9 do not
require IQ-TREE and can be run independently.

1. Download the IQ-TREE 2 binary for your operating system from:
   https://github.com/Cibiv/IQ-TREE/releases
   (look for the latest release and download the zip for your OS)
2. Unzip and note the path to the executable:
   - Windows: `C:/tools/iqtree2/bin/iqtree2.exe`
   - macOS/Linux: `/usr/local/bin/iqtree2`
3. Either add the bin folder to your system PATH so it can be called as
   `iqtree2`, or supply the full path to the `exec` argument in
   `build_bcr_trees()` as shown in Section 8.4.

**R packages required**

```r
install.packages(c("dplyr", "ggplot2", "data.table", "writexl", "tibble",
                   "tidyr", "patchwork", "scales", "stringr",
                   "dowser", "ape", "RColorBrewer", "ggraph"))
install.packages("Seurat")
# scRepertoire
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("scRepertoire")
# ggtree is on Bioconductor -- cannot be installed with install.packages()
BiocManager::install("ggtree")
```

***

# Overview

This part covers BCR repertoire visualization using two tool ecosystems.
Step 8 uses Immcantation and custom functions to visualize clonal structure
within each sample -- donut plots at the whole-sample, per-isotype, and
per-cluster level, followed by germline-rooted phylogenetic trees for the top
expanded clones. Step 9 uses scRepertoire to ask a different question: are any
clones shared between the LN and PT samples from the same patient?

The dataset comes from Nathan et al., *Immunity* (2026). The two samples used
throughout this tutorial -- P1_LN and P1_PT -- are the lymph node and primary
tumor from the same ovarian cancer patient.

***

# Setup

Set these variables before running anything else. All paths downstream are
derived from these.

```{r setup-session}
suppressPackageStartupMessages({
  library(Seurat)
  library(dplyr)
  library(ggplot2)
  library(data.table)
  library(writexl)
  library(tibble)
  library(tidyr)
  library(patchwork)
  library(scales)
  library(stringr)
  library(dowser)
  library(ggtree)
  library(ape)
  library(RColorBrewer)
  library(scRepertoire)
  library(ggraph)
})

# Baseline RNG seed for the session. Individual stochastic chunks (clonalNetwork)
# are also seeded immediately before they run, so they remain reproducible when
# executed in isolation.
set.seed(42)

ROOT_DIR <- "C:/Users/natha/OneDrive/Documents/Immcantation/BCR_Tutorial/Github"

# BCR_viz_functions.R lives in ROOT_DIR alongside the tutorial Rmd files
source(file.path(ROOT_DIR, "BCR_viz_functions.R"))

# Input: integrated annotated Seurat object from Part 2
rds_in <- file.path(ROOT_DIR, "RDS_Objects", "integrated_S7_annotated.rds")

# Input: per-sample BCR TSV files from Part 1
bcr_tsv <- list(
  P1_LN = file.path(ROOT_DIR, "P1_LN", "Output", "P1_LN_bcr_data.tsv"),
  P1_PT = file.path(ROOT_DIR, "P1_PT", "Output", "P1_PT_bcr_data.tsv")
)

# Output: annotated BCR files, summary tables, and plots
out_dir  <- file.path(ROOT_DIR, "BCR_Data", "Annotated")
plot_dir <- file.path(ROOT_DIR, "Plots", "Part3")
rds_dir  <- file.path(ROOT_DIR, "RDS_Objects")
dir.create(out_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(plot_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(rds_dir,  recursive = TRUE, showWarnings = FALSE)

message("Ready.")
```

***

# Step 8: BCR Visualization

***

## 8.1 Load the annotated Seurat object and extract cell type labels

We load the integrated object from Part 2 and extract the
barcode-to-cell-type mapping from its metadata. We only need three columns:
the cell barcode, the `cell_type` label, and the `sample_id`. The barcodes in
the integrated object carry the sample suffix (e.g. `ACGT...TGCA-1_P1_LN`)
because this was appended during Part 2. The BCR `cell_id` column carries the
same suffix from Part 1. The join in Step 8.2 therefore works directly without
any barcode manipulation.

Not all BCR cells will match a Seurat barcode. Cells that were filtered out
during GEX QC (low quality, doublets) in Part 2 are present in the BCR data
but absent from the Seurat object. These will have `cell_type = NA` and are
retained in the BCR data for completeness. They appear in combined and
per-isotype donuts -- their BCR sequences are valid -- but are excluded from
per-cluster donuts and from tree tip labels, since they have no cluster
assignment. In this dataset, 87.8% of P1_LN heavy chain cells and 78.1% of
P1_PT heavy chain cells matched a Seurat barcode. Match rates vary depending
on GEX QC stringency and are expected to be somewhat lower for the tumor
sample, where more cells are filtered during QC.

```{r load-seurat}
if (!file.exists(rds_in)) {
  stop("Cannot find integrated RDS at:\n  ", rds_in,
       "\nRun Part 2 first.")
}

obj <- readRDS(rds_in)
message("Integrated object loaded: ", ncol(obj), " cells | ",
        length(unique(obj$cell_type)), " cell types")

# Extract barcode -> cell_type table from Seurat metadata
# rownames of @meta.data are the cell barcodes
cell_type_map <- obj@meta.data %>%
  tibble::rownames_to_column("cell_id") %>%
  dplyr::select(cell_id, cell_type, sample_id)

message("Cell type distribution:")
print(table(cell_type_map$cell_type))
```

***

## 8.2 Join cell type onto BCR data

For each sample we load the BCR TSV from Part 1 and left-join the `cell_type`
column by `cell_id`. The result is a long-format data frame with one row per
contig (heavy and light chains in separate rows). This long format is required
by all visualization functions and by dowser for tree building.

We use `data.table::fread()` to load the TSV files. These files can be large
and `fread` is substantially faster than `read_tsv` for files stored on
OneDrive, which syncs in the background during reading and can cause `read_tsv`
to hang for several minutes.

```{r join-cell-type}
bcr_annotated <- list()

for (sid in names(bcr_tsv)) {

  tsv_path <- bcr_tsv[[sid]]

  if (!file.exists(tsv_path)) {
    stop("Cannot find BCR TSV for ", sid, " at:\n  ", tsv_path,
         "\nRun Part 1 first.")
  }

  message("\n--- Processing: ", sid, " ---")

  # fread is used here rather than read_tsv for speed on OneDrive-backed paths
  bcr <- data.table::fread(tsv_path, sep = "\t", data.table = FALSE)
  message("  BCR contigs loaded: ", nrow(bcr))

  # Filter cell_type_map to this sample and join by cell_id
  ct_this_sample <- dplyr::filter(cell_type_map, sample_id == sid) %>%
    dplyr::select(cell_id, cell_type)

  bcr <- dplyr::left_join(bcr, ct_this_sample, by = "cell_id")

  # Report match rate on heavy chains only (one per cell)
  n_matched   <- sum(!is.na(bcr$cell_type[bcr$locus == "IGH"]))
  n_igh_total <- sum(bcr$locus == "IGH")
  pct_matched <- round(100 * n_matched / n_igh_total, 1)

  message("  Cell type matched: ", n_matched, " / ", n_igh_total,
          " heavy chain cells (", pct_matched, "%)")

  # Match rates below 50% suggest a barcode format mismatch between
  # Part 1 and Part 2. Check that sample suffixes are consistent.
  if (pct_matched < 50) {
    warning("  Less than 50% of heavy chain cells matched to a cell type. ",
            "Check that barcode suffixes are consistent between Part 1 and Part 2.")
  }

  # Cells not in the Seurat object are left as NA -- they were filtered out
  # during GEX QC and have no valid cluster assignment. They are retained in
  # the BCR data because their BCR sequences are valid, but they will be
  # excluded from per-cluster plots and tree tip labels automatically.

  # Compute VDJ duplicate counts within each clone.
  # For each chain, trim the sequence to V-start -> J-end (same coordinates
  # used by the tree builder) and count how many cells in the same clone share
  # an identical VDJ DNA sequence. This becomes Duplicate_DNA_H / _L in the
  # wide-format table and tells you how many cells made the exact same antibody.
  bcr <- bcr %>%
    dplyr::mutate(
      VDJ_DNA = dplyr::if_else(
        !is.na(v_sequence_start) & !is.na(j_sequence_end) &
          v_sequence_start > 0   & j_sequence_end > 0,
        substr(sequence, v_sequence_start, j_sequence_end),
        sequence
      )
    ) %>%
    dplyr::group_by(clone_id, locus, VDJ_DNA) %>%
    dplyr::mutate(
      Duplicate_DNA_H = dplyr::if_else(locus == "IGH",             dplyr::n(), NA_integer_),
      Duplicate_DNA_L = dplyr::if_else(locus %in% c("IGK", "IGL"), dplyr::n(), NA_integer_)
    ) %>%
    dplyr::ungroup() %>%
    dplyr::select(-VDJ_DNA)

  bcr_annotated[[sid]] <- bcr
  message("  Done: ", nrow(bcr), " total contigs | ",
          length(unique(bcr$cell_id[bcr$locus == "IGH"])), " cells with heavy chain")
}
```

***

### 8.2a Save the annotated long-format BCR data

We save the annotated BCR data for each sample as a TSV. This is the
authoritative input for all BCR visualization downstream, and also feeds into
Step 9. If you need to re-run any plots without re-loading the Seurat object,
you can load these files directly and skip Steps 8.1--8.2.

```{r save-long-format}
for (sid in names(bcr_annotated)) {
  out_path <- file.path(out_dir, paste0(sid, "_S8_bcr_annotated.tsv"))
  data.table::fwrite(bcr_annotated[[sid]], file = out_path, sep = "\t")
  message("Saved: ", out_path)
}
```

***

### 8.2b Build and save the wide-format summary table

The long-format BCR data has two rows per cell (one for the heavy chain, one
for the light chain). For inspection and tabular analysis it is useful to have
a single-row-per-cell summary with both chains side by side. We build this here
and save it as an Excel file.

The table includes: cell barcode, clone ID, clone size, sample, cell type,
isotype, light chain type (kappa/lambda), VDJ duplicate counts for heavy and
light chains, somatic hypermutation counts and frequencies for both chains,
V/D/J gene calls, and the full nucleotide sequences.

```{r wide-format-table}
for (sid in names(bcr_annotated)) {

  bcr <- bcr_annotated[[sid]]

  heavy <- bcr %>%
    dplyr::filter(locus == "IGH") %>%
    dplyr::transmute(
      cell_id,
      clone_id,
      clone_count,
      sample_id,
      cell_type,
      isotype        = c_call,
      Duplicate_DNA_H,
      mu_count_H     = mu_count,
      mu_freq_H      = mu_freq,
      v_call_H       = v_call,
      d_call_H       = d_call,
      j_call_H       = j_call,
      sequence_H     = sequence
    )

  light <- bcr %>%
    dplyr::filter(locus %in% c("IGK", "IGL")) %>%
    dplyr::transmute(
      cell_id,
      light_chain = dplyr::case_when(
        locus == "IGK" ~ "Kappa",
        locus == "IGL" ~ "Lambda",
        TRUE           ~ locus
      ),
      Duplicate_DNA_L,
      mu_count_L = mu_count,
      mu_freq_L  = mu_freq,
      v_call_L   = v_call,
      j_call_L   = j_call,
      sequence_L = sequence
    )

  wide <- dplyr::left_join(heavy, light, by = "cell_id") %>%
    dplyr::arrange(dplyr::desc(clone_count))

  out_path <- file.path(out_dir, paste0(sid, "_S8_final_table.xlsx"))
  writexl::write_xlsx(wide, out_path)
  message(sid, ": wide-format table saved (", nrow(wide), " cells) -- ", out_path)
}
```

***

## 8.3 Donut plots

Donut plots are one of the most intuitive ways to visualize clonal structure in
BCR data. Each slice represents one clone, with arc length proportional to the
number of cells in that clone. Singletons (clones with only one cell) are
aggregated into a single white arc -- they correctly represent the proportion of
the repertoire that is non-expanded, but individual singleton clones carry no
meaningful information at this level of analysis.

Expanded clones are colored by their dominant heavy chain isotype using the
standard palette defined in `BCR_viz_functions.R`, giving an immediate visual
read of the isotype composition of the expanded repertoire.

Three levels of granularity are shown:

- **Combined** -- one donut for the whole sample showing all clones. Each slice
  represents one expanded clone colored by its heavy chain isotype; singletons
  are aggregated into a single white arc. The relative size of the white arc
  vs the colored arcs gives an immediate read of how clonal the repertoire is.
- **Per isotype** -- one donut per isotype, showing clonal structure within
  each isotype class. Each expanded clone gets a shade within that isotype's
  color family; singletons are white.
- **Per cluster** -- one donut per annotated cell type, showing the clonal
  landscape within each population. BCR cells with no Seurat match (NA cell
  type) are excluded here since they have no cluster assignment. In this
  dataset, PC clusters show the highest clonality. Naive B cells show the
  lowest clonality -- near-zero expansion is expected for this population. GC
  clusters can also show substantial expansion. MBC subsets vary.

***

### 8.3.1 Combined donut -- P1_LN

```{r donut-combined-ln, fig.height=6, fig.width=6}
donuts_ln <- plot_combined_donut(bcr_annotated[["P1_LN"]], sample_name = "P1_LN")
print(donuts_ln$all_clones)
```

***

### 8.3.2 Combined donut -- P1_PT

```{r donut-combined-pt, fig.height=6, fig.width=6}
donuts_pt <- plot_combined_donut(bcr_annotated[["P1_PT"]], sample_name = "P1_PT")
print(donuts_pt$all_clones)
```

***

### 8.3.3 Per-isotype donuts -- P1_LN

Each plot shows the clonal structure within one isotype. Expanded clones are
shown in shades of the isotype's color family (shades of green for IGHG1,
shades of blue for IGHM etc.), with darker shades indicating larger clones.
Singletons are white. Isotypes with fewer than 5 cells are skipped.

```{r donut-isotype-ln, fig.height=6, fig.width=6}
iso_donuts_ln <- plot_isotype_donuts(bcr_annotated[["P1_LN"]], sample_name = "P1_LN")

for (iso in names(iso_donuts_ln)) {
  print(iso_donuts_ln[[iso]]$all_clones)
}
```

***

### 8.3.4 Per-isotype donuts -- P1_PT

```{r donut-isotype-pt, fig.height=6, fig.width=6}
iso_donuts_pt <- plot_isotype_donuts(bcr_annotated[["P1_PT"]], sample_name = "P1_PT")

for (iso in names(iso_donuts_pt)) {
  print(iso_donuts_pt[[iso]]$all_clones)
}
```

***

### 8.3.5 Per-cluster donuts -- P1_LN

Clusters with fewer than 10 cells are skipped automatically. BCR cells that
did not match a Seurat barcode (NA cell type) are excluded from this section.

```{r donut-cluster-ln, fig.height=6, fig.width=6}
cluster_donuts_ln <- plot_cluster_donuts(
  bcr_annotated[["P1_LN"]],
  sample_name   = "P1_LN",
  cell_type_col = "cell_type"
)

for (cl in names(cluster_donuts_ln)) {
  print(cluster_donuts_ln[[cl]]$all_clones)
}
```

***

### 8.3.6 Per-cluster donuts -- P1_PT

```{r donut-cluster-pt, fig.height=6, fig.width=6}
cluster_donuts_pt <- plot_cluster_donuts(
  bcr_annotated[["P1_PT"]],
  sample_name   = "P1_PT",
  cell_type_col = "cell_type"
)

for (cl in names(cluster_donuts_pt)) {
  print(cluster_donuts_pt[[cl]]$all_clones)
}
```

***

## 8.4 Phylogenetic trees

Phylogenetic trees reconstruct the evolutionary history of a B cell clone --
how cells diversified from a common unmutated ancestor through rounds of
somatic hypermutation in the germinal center. Unlike generic phylogenetic tools,
dowser builds trees specifically designed for B cell clonal evolution. The key
difference is that the unmutated germline sequence (reconstructed in Part 1 by
`createGermlines()`) is used as the known root of the tree. This means the root
is not inferred from the data but is fixed to the true biological ancestor,
giving the tree a meaningful directionality: branch length represents mutations
accumulated from the germline, and the root-to-tip direction represents time
and affinity maturation.

Each tip represents one unique sequence (or a group of cells sharing an
identical sequence, collapsed to reduce redundancy). Tip size reflects how many
cells share that sequence. Tips are colored by isotype, showing how isotype
switching occurred within the clone. Cell type labels are shown as colored
rectangles next to tips where the cell type is known and at least two cells
share that sequence -- singleton tips are shown as points only to avoid
cluttering small trees.

We build trees for the top 5 expanded clones per sample by default. You can
change `top_n` or supply specific clone IDs via `clone_ids` to inspect
particular clones of interest.

***

### 8.4a Cluster color palette

The tree tip labels use the same cell type color palette as the UMAP in Part 2.

```{r cluster-colors}
cluster_colors <- c(
  "MBC_Basal"    = "#F8766D",
  "MBC_Atypical" = "#D89000",
  "Naive"        = "#A3A500",
  "PC"           = "#39B600",
  "GC"           = "#00BF7D",
  "GC_Cycling"   = "#00BFC4",
  "MBC_CR2pos"   = "#00B0F6",
  "PC_2"         = "#9590FF",
  "Plasmablast"  = "#E76BF3",
  "Unsure"       = "#FF62BC"
)
```

***

### 8.4.1 Trees -- P1_LN

```{r trees-ln, fig.height=8, fig.width=10}
# If IQ-TREE 2 is not on your system PATH, supply the full path:
#   Windows : exec = "C:/tools/iqtree2/bin/iqtree2.exe"
#   macOS   : exec = "/usr/local/bin/iqtree2"

tree_plots_ln <- build_bcr_trees(
  bcr_data       = bcr_annotated[["P1_LN"]],
  cluster_colors = cluster_colors,
  top_n          = 5,
  cell_type_col  = "cell_type",
  exec           = "iqtree2"
)

for (cid in names(tree_plots_ln)) {
  if (!is.null(tree_plots_ln[[cid]])) {
    print(tree_plots_ln[[cid]])
  }
}
```

***

### 8.4.2 Trees -- P1_PT

```{r trees-pt, fig.height=8, fig.width=10}
tree_plots_pt <- build_bcr_trees(
  bcr_data       = bcr_annotated[["P1_PT"]],
  cluster_colors = cluster_colors,
  top_n          = 5,
  cell_type_col  = "cell_type",
  exec           = "iqtree2"
)

for (cid in names(tree_plots_pt)) {
  if (!is.null(tree_plots_pt[[cid]])) {
    print(tree_plots_pt[[cid]])
  }
}
```

***

### 8.4.3 Building a tree for a specific clone

To build a tree for a particular clone of interest rather than the top N by
size, supply the clone ID directly via the `clone_ids` argument. Clone IDs
follow the format `<sample_id>_<random_code>_<clone_size>_<isotypes>` and can
be found in the wide-format Excel table saved in Section 8.2b.

```{r trees-specific, eval=FALSE}
# Replace with your clone ID of interest
target_clone <- "P1_LN_4aFs_6_G2G4"

tree_specific <- build_bcr_trees(
  bcr_data       = bcr_annotated[["P1_LN"]],
  cluster_colors = cluster_colors,
  clone_ids      = target_clone,
  cell_type_col  = "cell_type",
  exec           = "iqtree2"
)

print(tree_specific[[target_clone]])
```

***

### 8.4.4 Save tree plots to PDF

```{r save-trees}
tree_dir <- file.path(plot_dir, "Trees")
dir.create(tree_dir, recursive = TRUE, showWarnings = FALSE)

for (sid in c("P1_LN", "P1_PT")) {
  plots <- if (sid == "P1_LN") tree_plots_ln else tree_plots_pt

  for (cid in names(plots)) {
    if (!is.null(plots[[cid]])) {
      out_file <- file.path(tree_dir, paste0(cid, "_tree.pdf"))
      ggplot2::ggsave(out_file, plots[[cid]], width = 10, height = 8)
      message("Saved: ", out_file)
    }
  }
}

message("All tree plots saved.")
```

***

# Step 9: Cross-Sample Clonal Analysis with scRepertoire

The analyses in Step 8 are per-sample: clones were defined independently for
each sample in Part 1, so a clone in P1_LN and a clone in P1_PT with identical
V(D)J sequences are treated as separate lineages. Step 9 asks a different
question: are any clones shared between the LN and PT samples from the same
patient? To answer this we use scRepertoire, which applies a cross-sample
clonotype definition (CTstrict) based on shared V gene usage and greater than
85% CDR3 nucleotide similarity. This is intentionally distinct from the
Immcantation clone IDs used in Step 8. The two definitions serve different
purposes and should not be mixed.

**A note on clonotype definitions:** scRepertoire's CTstrict and Immcantation's
`hierarchicalClones()` are not equivalent. Immcantation uses a mixture-model
distance threshold on full heavy-chain sequences with light-chain splitting;
scRepertoire uses a fixed 85% CDR3 similarity cutoff. CTstrict is specifically
designed for cross-sample comparison where you do not want to re-run Immcantation
on the combined data. Do not replace the Immcantation clone IDs with CTstrict
for within-sample analyses -- they serve different purposes.

***

## 9.1 Prepare AIRR data for scRepertoire

scRepertoire's `combineBCR()` expects a named list of data frames, one per
sample, with specific column names. Our Immcantation AIRR data uses slightly
different names (e.g. `cell_id` instead of `barcode`, `locus` instead of
`chain`, `v_call` instead of `v_gene`). The function below maps between them.

Note that `cdr3` (amino acid) and `cdr3_nt` (nucleotide) are both mapped from
the `junction` and `junction_aa` columns. scRepertoire uses `cdr3_nt` for the
CTstrict distance calculation.

```{r prepare-screp}
prepare_for_scRepertoire <- function(df) {

  # barcode and sample identity
  if (!"barcode"   %in% names(df) && "cell_id"   %in% names(df)) df$barcode   <- df$cell_id
  if (!"sample"    %in% names(df) && "sample_id" %in% names(df)) df$sample    <- df$sample_id

  # chain locus
  if (!"chain"     %in% names(df) && "locus"     %in% names(df)) df$chain     <- df$locus

  # gene calls
  if (!"v_gene"    %in% names(df) && "v_call"    %in% names(df)) df$v_gene    <- df$v_call
  if (!"d_gene"    %in% names(df) && "d_call"    %in% names(df)) df$d_gene    <- df$d_call
  if (!"j_gene"    %in% names(df) && "j_call"    %in% names(df)) df$j_gene    <- df$j_call
  if (!"c_gene"    %in% names(df) && "c_call"    %in% names(df)) df$c_gene    <- df$c_call

  # CDR3 sequences -- scRepertoire uses cdr3 (aa) and cdr3_nt for CTstrict
  if (!"cdr3"      %in% names(df) && "junction_aa" %in% names(df)) df$cdr3    <- df$junction_aa
  if (!"cdr3_nt"   %in% names(df) && "junction"    %in% names(df)) df$cdr3_nt <- df$junction
  if (!"cdr3_aa"   %in% names(df) && "junction_aa" %in% names(df)) df$cdr3_aa <- df$junction_aa

  # UMI count as a reads proxy (optional but avoids NAs in scRepertoire internals)
  if (!"reads"     %in% names(df) && "umi_count"   %in% names(df)) df$reads   <- df$umi_count
  if (!"reads"     %in% names(df))                                  df$reads   <- 1L

  return(df)
}
```

Now load both samples, apply the mapping, and run `combineBCR()`.
`removeMulti = TRUE` drops cells with more than one heavy chain (these were
already filtered in Part 1, so this is a safety net only).

```{r combine-bcr}
# Load the annotated BCR TSVs saved in Section 8.2a
annotated_tsv <- list(
  P1_LN = file.path(out_dir, "P1_LN_S8_bcr_annotated.tsv"),
  P1_PT = file.path(out_dir, "P1_PT_S8_bcr_annotated.tsv")
)

airr_list <- lapply(annotated_tsv, function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  data.table::fread(path, sep = "\t", data.table = FALSE)
})

airr_list <- lapply(airr_list, prepare_for_scRepertoire)

# Quick check: confirm required columns are present
required_cols <- c("barcode", "chain", "v_gene", "j_gene", "cdr3_nt")
for (sid in names(airr_list)) {
  missing <- setdiff(required_cols, colnames(airr_list[[sid]]))
  if (length(missing) > 0)
    stop(sid, ": missing required columns after mapping: ",
         paste(missing, collapse = ", "))
}

message("Running combineBCR() across ", length(airr_list), " samples...")

# Note on barcode format: combineBCR() prepends the sample name to each
# barcode, producing e.g. "P1_LN_ACTGCTCA...-1_P1_LN". This double
# suffix/prefix is expected -- our cell_ids already carry the sample suffix
# from Part 1, and combineBCR() adds its own prefix on top. The ct_lookup
# step in 9.3.1 strips the leading prefix to recover the original cell_id
# format used throughout Parts 1-8.

combined_BCR <- combineBCR(
  airr_list,
  samples     = names(airr_list),
  removeNA    = FALSE,
  removeMulti = TRUE
)

# Confirm output
message("combineBCR() complete.")
message("Cells per sample:")
print(sapply(combined_BCR, nrow))
message("Columns in combined_BCR output:")
print(colnames(combined_BCR[[1]]))
message("Example CTstrict values (P1_LN):")
print(head(combined_BCR[["P1_LN"]]$CTstrict, 10))

# Save
saveRDS(combined_BCR, file.path(rds_dir, "combined_BCR_S9.rds"))
message("Saved: combined_BCR_S9.rds")
```

***

## 9.2 Repertoire-level characterization

Before looking at shared clones, we characterize each sample's repertoire
structure independently.

***

### 9.2.1 Clonal homeostasis

`clonalHomeostasis()` shows what fraction of each sample's repertoire is
occupied by clones in different size classes. Each bar represents one sample;
the stacked segments represent the proportion of cells belonging to rare,
small, medium, large, and hyperexpanded clones. This gives an immediate read
of how oligoclonal each repertoire is.

For B cell data the default proportional bins (Rare < 0.0001, Small < 0.001,
Medium < 0.01, Large < 0.1, Hyperexpanded < 1) are appropriate.

In this patient, PT shows substantially more clonal expansion than LN, with a
notable fraction of cells in the Medium and Large categories. LN is almost
entirely composed of Small and Rare clones.

```{r homeostasis, fig.height=5, fig.width=6}
p_homeostasis <- clonalHomeostasis(
  combined_BCR,
  cloneCall = "strict"
)

p_homeostasis

ggsave(file.path(plot_dir, "clonal_homeostasis.pdf"),
       p_homeostasis, width = 6, height = 5)
message("Saved: clonal_homeostasis.pdf")
```

***

## 9.3 Quantify shared clones

Here we break down the repertoire overlap into specific numbers: how many
CTstrict clones are shared between LN and PT, and how many cells belong to
those shared clones. We also build a master table that joins CTstrict onto the
per-cell BCR data from Step 8.

**A note on method and reproducibility:** This tutorial uses pure CTstrict for
both identifying shared clones and counting cells within them. If you compare
these numbers against a paper that used a hybrid approach -- using CTstrict to
identify shared clones but Immcantation `clone_id` to count cells -- you will
get different totals. Immcantation groups cells more aggressively than CTstrict,
so the hybrid approach typically counts more cells per shared clone. Neither is
wrong; they answer slightly different questions. The pure CTstrict approach used
here is more reproducible because it depends only on scRepertoire.

**A note on scRepertoire version:** This tutorial was written and tested against
scRepertoire v1.12.0. The CTstrict string format used to identify shared clones
changed between major versions, so the same input data can produce slightly
different shared clone counts depending on which version is installed. With
v1.12.0, the shared count for P1 in Section 9.3.3 below is 151. On other
versions you may see a different number for the same set of cells and the same
underlying Immcantation clones -- the cells matched are identical, only the
string used to compare them is formatted differently. To exactly reproduce the
counts shown here, pin scRepertoire to v1.12.0. Otherwise small numeric
differences are expected and do not reflect a bug.

***

### 9.3.1 Build the CTstrict lookup

`combineBCR()` prepends the sample name to each barcode (e.g.
`P1_LN_ACGT...-1_P1_LN`). The BCR data from Step 8 uses `cell_id` in the
format `ACGT...-1_P1_LN`. We strip the leading `<sample>_` prefix from the
combined_BCR barcodes so they match the Step 8 cell_id format.

```{r ct-lookup}
ct_lookup <- do.call(rbind, lapply(names(combined_BCR), function(sid) {
  df <- combined_BCR[[sid]]
  data.frame(
    sample_id = df$sample,
    # Remove the leading "<sample>_" prefix that combineBCR adds
    cell_id   = sub(paste0("^", sid, "_"), "", df$barcode),
    CTstrict  = df$CTstrict,
    CTaa      = df$CTaa,
    CTnt      = df$CTnt,
    CTgene    = df$CTgene,
    stringsAsFactors = FALSE
  )
}))

message("CTstrict lookup: ", nrow(ct_lookup), " rows")
message("Example cell_id values after stripping:")
print(head(ct_lookup$cell_id, 6))
```

***

### 9.3.2 Join CTstrict onto the master BCR table

Load both annotated BCR TSVs from Step 8, combine them, and join the CTstrict
column. This master table is the single input for the shared clone dot plot
in Step 9.6.

```{r master-table}
bcr_all <- do.call(rbind, lapply(names(annotated_tsv), function(sid) {
  data.table::fread(annotated_tsv[[sid]], sep = "\t", data.table = FALSE)
}))

message("Master BCR table: ", nrow(bcr_all), " rows | ",
        length(unique(bcr_all$cell_id[bcr_all$locus == "IGH"])),
        " cells with heavy chain")

# Left-join CTstrict by sample_id + cell_id
bcr_all <- dplyr::left_join(
  bcr_all,
  ct_lookup,
  by = c("sample_id", "cell_id")
)

# Check match rate
n_igh     <- sum(bcr_all$locus == "IGH")
n_matched <- sum(!is.na(bcr_all$CTstrict[bcr_all$locus == "IGH"]))
message("CTstrict matched: ", n_matched, " / ", n_igh,
        " heavy chain cells (",
        round(100 * n_matched / n_igh, 1), "%)")

# If the match rate is below 50%, check that the barcode format is consistent
# between Part 1 and Part 2. The cell_id in Step 8 should look like
# "ACGT...-1_P1_LN" and the stripped combined_BCR barcode should match exactly.
if (n_matched / n_igh < 0.5) {
  warning("Less than 50% of heavy chain cells matched a CTstrict. ",
          "Check barcode format consistency between Part 1 and Part 2.")
}

# Save
data.table::fwrite(bcr_all,
                   file.path(out_dir, "All_S9_bcr_with_CTstrict.tsv"),
                   sep = "\t")
message("Saved: All_S9_bcr_with_CTstrict.tsv")
```

***

### 9.3.3 Count shared clones and cells

```{r shared-counts}
# Tissue label (LN or PT) derived from sample_id
bcr_all <- bcr_all %>%
  dplyr::mutate(
    tissue = dplyr::case_when(
      grepl("_LN$|^LN", sample_id) ~ "LN",
      grepl("_PT$|^PT", sample_id) ~ "PT",
      TRUE ~ "UNK"
    )
  )

# Restrict to heavy chain rows with a valid CTstrict
bcr_igh <- bcr_all %>%
  dplyr::filter(locus == "IGH", !is.na(CTstrict), CTstrict != "")

# Which CTstrict clones appear in both LN and PT?
clones_ln <- unique(bcr_igh$CTstrict[bcr_igh$tissue == "LN"])
clones_pt <- unique(bcr_igh$CTstrict[bcr_igh$tissue == "PT"])
shared_cts <- intersect(clones_ln, clones_pt)

message("CTstrict clones in LN only:   ", length(setdiff(clones_ln, clones_pt)))
message("CTstrict clones in PT only:   ", length(setdiff(clones_pt, clones_ln)))
message("CTstrict clones shared LN+PT: ", length(shared_cts))

# Cells per tissue belonging to shared clones
# (raw counts only -- no percentages; denominator choice is non-trivial)
cells_ln_shared <- sum(bcr_igh$CTstrict[bcr_igh$tissue == "LN"] %in% shared_cts)
cells_pt_shared <- sum(bcr_igh$CTstrict[bcr_igh$tissue == "PT"] %in% shared_cts)

message("LN cells in shared clones:    ", cells_ln_shared)
message("PT cells in shared clones:    ", cells_pt_shared)
message("Total shared cells:           ", cells_ln_shared + cells_pt_shared)

# Summary table of shared clones with cell counts per tissue
shared_clone_tbl <- bcr_igh %>%
  dplyr::filter(CTstrict %in% shared_cts) %>%
  dplyr::count(CTstrict, tissue, name = "n_cells") %>%
  tidyr::pivot_wider(
    names_from  = tissue,
    values_from = n_cells,
    values_fill = 0L
  ) %>%
  dplyr::mutate(total_cells = LN + PT) %>%
  dplyr::arrange(dplyr::desc(total_cells))

message("\nShared clone table (top 10):")
print(knitr::kable(head(shared_clone_tbl, 10), caption = "Top 10 shared clones by total cell count"))
```

***

### 9.3.4 Visualize shared vs. exclusive cell counts

With only two samples, summary statistics like the Morisita overlap index
reduce to a single number and are not informative as a standalone plot.
Instead, we show a stacked bar chart of cell counts broken down by tissue
and sharing status: how many IGH cells in each sample belong to
tissue-exclusive clones vs. clones shared with the other tissue.

```{r shared-bar, fig.height=4, fig.width=5}
shared_bar_df <- bcr_igh %>%
  dplyr::mutate(
    status = dplyr::if_else(CTstrict %in% shared_cts, "Shared", "Exclusive")
  ) %>%
  dplyr::count(tissue, status, name = "n_cells") %>%
  dplyr::mutate(
    tissue = factor(tissue, levels = c("LN", "PT")),
    status = factor(status, levels = c("Shared", "Exclusive"))
  )

p_shared_bar <- ggplot(shared_bar_df,
                       aes(x = tissue, y = n_cells, fill = status)) +
  geom_col(width = 0.55) +
  geom_text(aes(label = n_cells),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_manual(
    values = c(Shared = "#2166AC", Exclusive = "#B2B2B2"),
    name   = "Clone status"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank()) +
  labs(
    x     = "Tissue",
    y     = "Number of cells (IGH)",
    title = "Cells in shared vs. exclusive clones -- P1"
  )

p_shared_bar

ggsave(file.path(plot_dir, "shared_cells_bar_P1.pdf"),
       p_shared_bar, width = 5, height = 4)
message("Saved: shared_cells_bar_P1.pdf")
```

***

## 9.4 Attach clonal data to the Seurat object

`combineExpression()` adds the scRepertoire clonotype metadata to the Seurat
object's `@meta.data` slot. After this step, every cell in the UMAP has a
CTstrict label, a cloneSize category, and a clonal frequency, enabling
visualization of clonal structure directly on the dimensional reduction.

The barcode format in the Seurat object is `ACGT...-1_P1_LN` (barcode + sample
suffix, added during integration in Part 2). The format in combined_BCR after
`combineBCR()` is `P1_LN_ACGT...-1_P1_LN` (sample prefix + barcode + original
suffix). We harmonize these before calling `combineExpression()` by stripping
the leading `<sample>_` prefix from combined_BCR barcodes.

```{r attach-seurat}
seu <- readRDS(rds_in)
message("Seurat object loaded: ", ncol(seu), " cells")

# Harmonize barcodes: strip leading "<sample>_" prefix from combined_BCR
combined_BCR_harmonized <- lapply(combined_BCR, function(df) {
  df$barcode <- sub(paste0("^", df$sample, "_"), "", df$barcode)
  df
})

# Overlap check before proceeding
airr_barcodes <- unique(unlist(lapply(combined_BCR_harmonized,
                                       function(x) x$barcode)))
n_overlap <- sum(colnames(seu) %in% airr_barcodes)
message("Barcode overlap (Seurat intersect AIRR): ", n_overlap, " / ", ncol(seu))

if (n_overlap == 0) {
  stop(
    "Zero overlap after barcode harmonization.\n",
    "Check that cell_id format in Step 8 TSVs matches Seurat colnames.\n",