This repository contains implementations of algorithms, data pipelines, and analysis tools developed during the Bioinformatics 2 curriculum at the Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University. The focus is on the mathematical implementation of biological algorithms and on efficient data processing.
- Overview
- Project Modules
  - Data Engineering and Chemoinformatics
  - Genomics and Sequence Analysis
  - Structural Bioinformatics
- Technical Stack
- Disclaimer and Missing Data Files
- License
A major component of the coursework involved the prediction and mathematical validation of macromolecular structures. This includes an implementation of the Nussinov algorithm, which utilizes dynamic programming for RNA secondary structure prediction based on base-pair maximization.
To validate structural models, the Kabsch algorithm was implemented using Singular Value Decomposition (SVD) to calculate the Root-Mean-Square Deviation (RMSD) between atomic coordinates. Additionally, the projects utilize the PyMOL API to calculate validation metrics such as TM-scores and GDT_TS. Protein threading was performed using the I-TASSER suite.
| Module | Topic | Key Techniques |
|---|---|---|
| 01_Biological_Databases | Data Access & Chemoinformatics | PDB/PubChem/NCBI APIs, TF-IDF, RDKit |
| 02_Sequence_Analysis | Sequence Alignment & Visualization | BLAST, Polars, Matplotlib |
| 03_Genome_Analysis | Restriction Site Analysis | Biopython, E. coli genome |
| 04_Genome_Assembly | De Novo Assembly | De Bruijn graphs, k-mers |
| 05_Protein_Domains | Domain Architecture | PyHMMER, Pfam, HMM profiles |
| 06_Protein_Structure_Prediction | Structure Modeling | I-TASSER, AlphaFold 2/3 |
| 07_Structure_Validation | Model Quality Assessment | Kabsch algorithm, RMSD, TM-score |
| 08_RNA_Structure | Secondary Structure Prediction | Nussinov algorithm, Dynamic Programming |
| 09_scRNA_Seq | Single-Cell Transcriptomics | R, Seurat-style analysis |
| 10_Crystallography | Electron Density Analysis | Crystallographic maps |
This section contains scripts for acquiring and processing biological data from public repositories (NCBI E-utils, PDB, PubChem). I used the Polars library to handle larger datasets more efficiently than with pandas.
The work also includes:
- NLP techniques: Applying TF-IDF vectorization to scientific literature for MeSH term analysis.
- Chemoinformatics: Using RDKit to compute physicochemical properties and visualize molecular structures.
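To illustrate the TF-IDF weighting mentioned in the list above, here is a minimal pure-Python sketch. It is not the code from the notebooks: the function name `tf_idf`, the whitespace tokenizer, and the three toy documents are assumptions made for illustration. It uses the plain textbook formula (term frequency times `log(N / df)`), whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights


# Toy corpus, made up for this example.
docs = ["protein folding", "protein domain", "protein rna binding"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", w)
```

A term that occurs in every document ("protein" here) gets weight zero, which is the property that makes TF-IDF useful for picking out terms specific to a single abstract.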
These projects focus on the logic of genome assembly. I implemented De Bruijn graphs to demonstrate how short sequencing reads are assembled into contigs.
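The idea can be sketched in a few lines. The example below is a simplified illustration, not the module's implementation: the function names, the toy reads, and `k=4` are assumptions. Each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by following nodes that have exactly one distinct successor. A full assembler would also check in-degrees and handle sequencing errors.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point
        node = successors.pop()
        if node in visited:
            break  # avoid looping forever on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads (made up for this example).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous_path(graph, "ATG"))  # ATGGCGTGCA
```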
For protein domain analysis, I used PyHMMER to query the Pfam database. This involved applying Hidden Markov Models (HMMs) and statistical filtering (E-values, sequence coverage) to correctly identify domain architectures.
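The filtering step can be illustrated without PyHMMER itself. In the sketch below, `DomainHit` is a hypothetical record type, not a PyHMMER class, and the thresholds (`1e-5`, `0.5`) are placeholder assumptions. Coverage is taken here as the fraction of the profile HMM length spanned by the alignment; the notebooks may define it differently.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical container for one domain hit (not a PyHMMER type)."""
    domain: str
    evalue: float
    hmm_from: int    # first aligned profile position, 1-based inclusive
    hmm_to: int      # last aligned profile position, inclusive
    hmm_length: int  # total length of the profile HMM

    @property
    def coverage(self):
        """Fraction of the profile covered by the alignment."""
        return (self.hmm_to - self.hmm_from + 1) / self.hmm_length


def filter_hits(hits, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits that are both statistically significant and sufficiently complete."""
    return [
        h for h in hits
        if h.evalue <= max_evalue and h.coverage >= min_coverage
    ]


# Made-up hits for illustration only.
hits = [
    DomainHit("domain_A", 1e-30, 1, 95, 100),   # significant, nearly full length
    DomainHit("domain_B", 1e-12, 40, 60, 200),  # significant, but a short fragment
    DomainHit("domain_C", 0.3, 1, 80, 80),      # full length, but not significant
]
print([h.domain for h in filter_hits(hits)])  # ['domain_A']
```

Applying both criteria together discards short fragmentary matches as well as full-length but statistically weak ones.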
This part covers the prediction and validation of macromolecular structures, including protein threading (I-TASSER) and a custom implementation of the Nussinov algorithm for RNA secondary structure prediction.
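For reference, base-pair maximization in the Nussinov style can be sketched as follows. This is a compact illustration and not the code from `08_RNA_Structure`: allowing G-U wobble pairs and requiring at least three unpaired bases in a hairpin loop (`min_loop=3`) are assumptions of this sketch.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Fill dp[i][j] with the maximum number of base pairs in seq[i..j]."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp


def traceback(seq, dp, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


seq = "GGGAAAUCC"
dp = nussinov(seq)
print(dp[0][-1], traceback(seq, dp))  # 3 (((...)))
```

The table is filled by increasing subsequence length, so every subproblem a cell depends on is already solved when it is needed. The run time is cubic in the sequence length.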
To validate structural models, I implemented the Kabsch algorithm using Singular Value Decomposition (SVD) to calculate RMSD between atomic coordinates. Additionally, I utilized the PyMOL API to compute TM-scores and GDT_TS for comparing predicted models against experimental data.
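A minimal NumPy sketch of the Kabsch superposition is shown below. It is an illustration under stated assumptions rather than the repository's code: the function name `kabsch_rmsd` and the toy coordinates are made up, and both inputs are assumed to be `(N, 3)` arrays with atoms already in matching order.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between point sets P and Q after optimally superposing P onto Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)

    # 1. Remove translation by centering both sets on their centroids.
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)

    # 2. Covariance matrix and its singular value decomposition.
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)

    # 3. Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # 4. Rotate P (row vectors, hence R.T) and measure the deviation.
    P_rot = P_c @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))


# Toy check: Q is P rotated by 90 degrees about z and shifted.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(P, Q))  # close to zero, up to floating-point error
```

The determinant check matters for nearly planar or mirrored coordinate sets: without it, the SVD solution can be a reflection, which gives a lower but physically meaningless RMSD.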
| Category | Tools |
|---|---|
| Data Analysis | Polars, NumPy, SciPy, Scikit-learn |
| Bioinformatics | Biopython, PyHMMER, RDKit, ViennaRNA |
| Structural Tools | PyMOL API, I-TASSER, AlphaFold |
| Visualization | Matplotlib, Seaborn |
| Environment | Jupyter Notebooks, R Markdown |
Warning
This repository has been reorganized from its original academic structure into a thematic portfolio. As a result, some relative file paths inside the notebooks (.ipynb) and scripts (.py) may reference directories that have been moved or renamed.
While this might prevent some scripts from executing "out of the box" without path adjustments, the logic, implementation details, and code structure remain fully intact. The notebooks were also translated from Polish to English, so some phrasing might sound a bit unnatural or incorrect. Some leftovers of my beautiful Polish language may still be found in the code.
Some large data files are excluded from this repository. Each module's README contains instructions for obtaining necessary files. Key exclusions:
| File | Module | Source |
|---|---|---|
| `e_coli.fasta` | 03_Genome_Analysis | NCBI |
| Protein FASTAs | 05_Protein_Domains | UniProt batch download |
| `*.cif` files | 06_Protein_Structure_Prediction | AlphaFold and I-TASSER |
| scRNA data | 09_scRNA_Seq | Course materials |
| `*.map` files | 10_Crystallography | PDB electron density server |
- My solutions, code, and documentation are licensed under the MIT License.
- Course materials and exercise descriptions (embedded in notebooks) remain the intellectual property of their respective authors at Jagiellonian University and are included here for educational and portfolio purposes with attribution.
Maintained by Aleksander Stanuch — Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University.
