Biography Parser & Story Generation: LLM Fine-tuning

📖 The Story Behind This Project

Scenario 1:

Imagine a Wikipedia editor facing a mountain of biographies to process. Each article looks like this:

Zulima Aban was a Venezuelan astronomer born on December 5, 1905, in Valencia, Spain. She studied at the University of Valencia...

The job? Extract structured data for the info box:

Name: Zulima Aban
Birth Date: 05 December 1905
Birth Place: Valencia, Spain
Occupation: Astronomer

Just like Google Translate converts between languages, this system converts between data formats and human stories.

Scenario 2:

Designed an automatic writing bot capable of completing prompts and mimics the tone of specific literary styles (a example output from this project):

📌 Project Overview

This project demonstrates practical large language models (LLMs GPT-3.5) fine-tuning for structured data extraction, text generation, and prompt engineering. It also explored model evaluation, cost-performance trade-offs, and limitations of large language models in real-world NLP tasks. 📓 View Full Notebook

Zero-shot learning task: Summarization, Q&A, simplification, translation
Few-shot learning task: Subject classification
Fine-tuning:
- Biography Generation: Structured data → Natural language biographies
- Biography Parsing: Natural text → Structured Wikipedia infobox format
Custom task: Sci-fi story generation with output length control

🤖 Outputs Examples

    {
        "notable_type": "artist",
        "attrs": {
            "name": "Igor Ivanov",
            "gender": "male",
            "nationality": "Russian",
            "birth_date": "18 October 1970",
            "birth_place": "Moscow, Russian Soviet Federative Socialist Republic",
            "known_for": "paintings of portraits, city scenes, still lifes, and flowers",
            "notable_works": "Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit",
            "movement": "realism",
            "alma_mater": "the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture)",
            "awards": "People's Artist of the USSR, Gold Medals, Hero of Socialist Labour",
            "elected": "People's Artist of Russia, Professor of the Academy",
            "mother": "Anastasia Nikolaevna Ivanova",
            "father": "Nikolay Ivanovich Ivanov",
            "partner": "Alexandra Alexandrovna Vasilieva",
            "children": "Alexandra Ivanovna Vasilieva, Anastasia Ivanovna Vasilieva"
        },

And the generated biography text:

biographies: "Igor Ivanov was born on October 18, 1970 in Moscow, Russia. His mother is Anastasia Nikolaevna Ivanova, and his father is Nikolay Ivanovich Ivanov. Ivanov is a Russian artist who specializes in oil paintings of portraits, city scenes, still lifes, and flowers. Ivanov attended the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture), where he received a bachelor's degree in painting and a master's degree in art history. His notable works include Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit and movement realism. He has received several awards, including the People's Artist of the USSR, Hero of Socialist Labour. Ivanov is also a Professor of the Academy, and he was elected a People's Artist of Russia in 2006. He is married to Alexandra Alexandrovna Vasilieva, and they have two daughters, Alexandra Ivanovna Vasilieva and Anastasia Ivanovna Vasilieva."

🎯 Key Results

Cost-Effective Training

Biography Parsing: 44.8% F-Score on 237 test cases with 100 training examples.
15% ROI improvement over few-shot learning with only $2 training cost and 100 examples
Key takeaways: Data Quality > Model Scale. Careful data curation outperforms brute-force scaling

Attribute Extraction Performance

Strong Performance (85%+ F-Score):
- Demographics: Gender (91.3%), Birth Date (90.5%), Name (86.1%), Nationality (87.9%)
- The highest performing attributes are those with clear, unambiguous mentions in text, like basic biographical facts.
Challenges (<15% F-Score):
- Specialized fields: Doctoral Students, Olympics, Academic Fields
- It is challenging to obtain high-quality training data for these rare attributes.

Performance by Category

The biography parsing model extracted structured attributes with 60+ types (name, birth date, education, etc.), an F-score is computed for each category. The result shows that the model excels at simple, frequently-occurring attributes but struggles with complex, rare ones.

Category	F-Score	Extraction Performance
Demographics	90.0%	Excellent extraction of gender, name, nationality
Temporal	87.8%	Strong performance on birth/death dates
Relationships	70.0%	Moderate success with family connections
Professional	50.2%	Struggles with occupation, awards
Educational	45.6%	Challenges with degrees, thesis details

Training Progress

Rapid learning phase (0-100 steps): Loss drops from 1.2 → 0.67
Refinement phase (100-200 steps): Loss decreases to 0.38
Convergence phase (200-300 steps): Final loss stabilizes at 0.30-0.55
Total training time: ~18 minutes, 300 epochs
Final loss shows minor discrepancy between notebook (0.55) and smoothed curve (0.30) due to training variance.

Gender Bias Detection

Discovered significant stereotypes in GPT-3.5 outputs by analyzing the professors names in generated biographies:

Physics professors: 100% male names
Gender Studies professors: 100% female names
Computer Science: 60% male, 40% female
Bias Awareness: Pre-trained LLMs encode biases from training data. Fine-tuning alone insufficient for bias mitigation.

Error Analysis

Primary error types:

Missing Attributes (34.1%): Model fails to extract present information
Wrong Values (25.7%): Incorrect attribute values
Hallucination (17.2%): Model invents non-existent information

🛠️ Technical Skills

Natural Language Processing (NLP): Implemented text preprocessing pipelines, tokenization, generation, and feature extraction for biographical text.
Large Language Models (LLM): GPT-3.5-turbo, model fine-tuning, prompt engineering
Data preprocessing and Model Evaluation (precision, recall, F-score, cross-validation)
Machine Learning: Designed and trained models for generation tasks
Python (OpenAI API, pandas, json, matplotlib)
Jupyter Notebook

📁 Files Structure

biography-parser-story-generation/
├── biography_parser_story_generation.ipynb   # Main analysis notebook
├── visualizations/                           # Performance charts
├── visualization.py                          # Visualization generator script
└── README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Biography Parser & Story Generation: LLM Fine-tuning

📖 The Story Behind This Project

Scenario 1:

Scenario 2:

📌 Project Overview

🤖 Outputs Examples

🎯 Key Results

Cost-Effective Training

Attribute Extraction Performance

Performance by Category

Training Progress

Gender Bias Detection

Error Analysis

🛠️ Technical Skills

📁 Files Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
visualizations		visualizations
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
biography_parser_story_generation.ipynb		biography_parser_story_generation.ipynb
visualization.py		visualization.py

Folders and files

Latest commit

History

Repository files navigation

Biography Parser & Story Generation: LLM Fine-tuning

📖 The Story Behind This Project

Scenario 1:

Scenario 2:

📌 Project Overview

🤖 Outputs Examples

🎯 Key Results

Cost-Effective Training

Attribute Extraction Performance

Performance by Category

Training Progress

Gender Bias Detection

Error Analysis

🛠️ Technical Skills

📁 Files Structure

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages