Hürthle cell carcinoma (HCC) is a subtype of thyroid cancer, accounting for 3-5% of all thyroid malignancies. HCC is characterised by an abundance of malfunctioning mitochondria and poor response to radioiodine therapy. While prior studies have documented mitochondrial complex I DNA mutations and metabolomic vulnerabilities in HCC, the transcriptomic landscape remains largely unexplored. No studies to date have specifically characterised isoform switching events in HCC.
The analysis is performed on NCBI GEO dataset and explores the expression profiles of different isoforms in HCC tissues. The original study explored metabolomic profiles of HCC and identified that mitochondrial complex I loss along with lipid peroxide stress is a vulnerability in HCC. This analysis performed on a subset of samples explores the isoform profiles in HCC, identifies some major genes undergoing functional isoform switching and highlights alternative splicing mechanisms that may drive the pathogenesis of the disease.
- Quantify the expression of transcripts between normal & HCC tissues.
- Identify transcript isoforms in HCC tissues.
- Predict functional consequences & alternative splicing event mechanisms in HCC tissues.
The dataset for this analysis have been obtained from NCBI GEO with accession ID GSE228870. The dataset was subset to include 7 HCC samples & 7 matched normal thyroid samples.
- Data Import
-
Download paired-end raw reads (fastq files) from NCBI GEO.
-
Tool: fasterq-dump
- Initial QC
-
Check the quality of raw reads, including per base sequence quality, adapter sequences, GC content etc.
-
Tool: fastqc
- QC
-
All-in-one processing to remove low quality sequences, over represented sequences & adapter trimming.
-
Tool: fastp
- Quantification
-
Mapping & quantification of reads against a reference transcriptome (GRCh38).
-
Tool: Salmon
- Post-Alignment QC
-
Check the mapping quality & mapping rate of reads against the reference transcriptome.
-
Tool: MultiQC
- Isoform Switch Analysis
-
Isoform switches in tumor samples along with their functional consequences including Non-sense mediated decay (NMD) sensitivity, intron retention & coding potential of isoform was analyzed.
-
Tool: IsoformSwitchAnalyzeR
- Visualization
-
Statistically significant switches, alternative splicing events & consequence summary for different genes was visualized.
-
Tool: IsoformSwitchAnalyzeR
Why 7 matched pairs instead of the full dataset?
The full dataset contains samples lacking matched normal controls. For isoform switch analysis, tumour-normal pairing is essential to study transcriptomic variation. Computational constraints also informed this decision. Only samples with confirmed matched pairs were retained, yielding 7 HCC and 7 normal thyroid samples.
Why Salmon over HISAT2 + featureCounts?
Isoform-level quantification requires transcript-level resolution. HISAT2 + featureCounts is optimised for gene-level count matrices and would have required additional assembly steps (e.g. StringTie) to recover novel isoforms, which was outside the scope of this analysis. Salmon's quasi-mapping approach quantifies directly against the reference transcriptome at transcript resolution, is computationally efficient, and its output integrates directly with IsoformSwitchAnalyzeR. --validateMappings was enabled to improve mapping accuracy by removing invalid multi-mapping reads.
Why GENCODE v44 as reference?
GENCODE v44 provides the most comprehensive human transcript annotation, including complete genome assembly (GRCh38.p14) and a matched GTF file required for transcript-level quantification. Using a comprehensive annotation is critical for isoform analysis.
Why isoform-level analysis over standard DEG?
Standard DEG analysis (e.g. DESeq2 on gene-level counts) would not capture scenarios where total gene expression is stable but isoform usage shifts — a pattern that can have profound functional consequences. DIXDC1 is a direct example from this dataset: gene-level analysis would have missed it entirely.
- Read Mapping & Sample Quality
-
Salmon automatically inferred the library type to be ISR (Inward, Stranded & Reverse).
-
All samples had high mapping rate (above 90%) against the reference transcriptome.
- Sample Clustering
-
Principal Component Analysis (PCA) showed that tissue condition (tumor vs normal) was the main driver of transcriptomic heterogeneity, with PC1 accounting for 28.3% of total variance.
-
PC2 captured minor within-group heterogeneity.
- Isoform Analysis & Switch Consequences
-
Differential splicing analysis identified 808 significant isoform switches in 658 distinct genes, involving 911 isoforms.
-
After filtering for functional consequences, 452 genes had 578 isoform switches that were known to cause functional consequences including intron retention, NMD, and structural modifications.
- Top Significant Switching Genes
-
GAB2, DIXDC1, CRTC1 & CRB3 were the most heavily altered genes.
-
Global alternative splicing mechanisms showed that 548 isoforms contain single intron retention (IR), while 143 isoforms contained multiple IR.
- Functional Consequence Profiling
-
Functional consequence analysis showed that tumor upregulated isoforms alter the transcript’s coding potential and ORF structure.
-
Analysis showed balanced rate of intron retention and intron gain while small subset of switches resulted in altered NMD sensitivity.

- Alternative Splicing & Genome-Wide Enrichment
-
Analysis of alternative splicing events showed that alternative transcription start site (ATSS) and alternative transcription termination sites (ATTS) are the main events driving isoform variation within tumor samples.
-
Genome-wide splicing enrichment analysis identified that transcripts undergoing ATSS and ATTS events showed a significant enrichment toward "gain" features, indicating preferential usage of alternative upstream initiation and downstream termination coordinates in tumor tissue.
Detailed Results for DIXDC1 gene
The most significant isoform switch within DIXDC1 genes shows that longer, protein-coding transcript ENST00000440460.7 in tumor samples, while the transcript ENST00000618522.4 is suppressed in tumor samples. While gene expression remains almost the same in tumor and normal samples, individual transcripts showed that upregulated tumor isoform gained intrinsically disordered region (IDR). This altered isoform expression suggests a role of this isoform in HCC profile.
The dominance of ATSS and ATTS events as the primary drivers of isoform variation in HCC tumour samples is a notable finding. These events alter which promoter regions are utilised and where transcription terminates, potentially placing transcripts under different regulatory controls or producing proteins with altered N- or C-terminal domains. Whether this pattern reflects broader dysregulation of transcriptional machinery in HCC, or is specific to the mitochondrial-rich tumour microenvironment of this cancer type, is an open question this analysis cannot resolve but warrants further investigation.
A balanced rate of NMD gain and loss across switching events was unexpected. A bias toward NMD gain would suggest tumour cells are selectively degrading certain transcripts; a bias toward loss would suggest escape from surveillance. The balanced pattern may reflect heterogeneous switching across tumour samples rather than a coordinated programme, though this remains uncharacterised.
- Absence of Decoy Sequences in Transcript Quantification
Transcript-level abundance estimation using Salmon was performed without incorporating a selective-alignment decoy index. This potentially increases the rate of false-positive multi-mapping reads, particularly from genomic regions with high sequence homology or unannotated pseudogenes.
- Lack of Protein Domain Structural Integration
Downstream functional consequence analysis utilizing IsoformSwitchAnalyzeR was conducted without integrating external topological or protein domain annotation databases. Consequently, while macro-structural changes like ORF lengths and IDRs were captured, localized disruptions to specific functional or catalytic protein domains remain uncharacterized.
Future directions: Integrating decoy-aware Salmon indexing to reduce false-positive multi-mapping, adding protein domain annotation via Pfam or SignalP within IsoformSwitchAnalyzeR, and cross-referencing the top switching genes against known HCC driver mutations would each meaningfully extend this analysis.
Isoform_Switch_Analysis/
├── salmon_quant/ # Main directory for performing Salmon Quantification
│ ├── data/ # Accession List for selected samples
│ ├── fastp_output/ # For fastp.json files
│ ├── fastqc/ # For fastqc reports
│ ├── multiqc/ # For multiqc report
│ ├── salmon_output/ # Contains salmon quant.sf files for each sample
│ └── scripts/ # Scripts for getting data, initial QC & salmon quantification
│
└── isoformswitchanalyzer/ # Directory containing isoformswitchanalyzer
├── Plots/ # Directory for Plots (from isoformswitchanalyzer)
├── results/ # Directory for saving tabular results
├── switchanalyzer_output/ # Contains fasta files obtained from isoformswitchanalyzer
├── IsoformSwitchAnalyze.R # R script for IsoformSwitchAnalyzeR
├── metadata.csv # CSV file for selected samples & their condition
├── isoformswitchanalyzer.Rproj # R project
└── Readme.md # Readme.md file
└── env.yaml # Conda env for salmon
└── r_session_info.txt # R package versions


