Skip to content

Chengyuli33/biography-parser-text-generation

Repository files navigation

Biography Parser & Story Generation: LLM Fine-tuning

📖 The Story Behind This Project

Scenario 1:

Imagine a Wikipedia editor facing a mountain of biographies to process. Each article looks like this:

Zulima Aban was a Venezuelan astronomer born on December 5, 1905, in Valencia, Spain. She studied at the University of Valencia...

The job? Extract structured data for the info box:

Name: Zulima Aban
Birth Date: 05 December 1905
Birth Place: Valencia, Spain
Occupation: Astronomer

Just like Google Translate converts between languages, this system converts between data formats and human stories.

Scenario 2:

Designed an automatic writing bot capable of completing prompts and mimics the tone of specific literary styles (a example output from this project):

📌 Project Overview

This project demonstrates practical large language models (LLMs GPT-3.5) fine-tuning for structured data extraction, text generation, and prompt engineering. It also explored model evaluation, cost-performance trade-offs, and limitations of large language models in real-world NLP tasks. 📓 View Full Notebook

  • Zero-shot learning task: Summarization, Q&A, simplification, translation

  • Few-shot learning task: Subject classification

  • Fine-tuning:

    • Biography Generation: Structured data → Natural language biographies
    • Biography Parsing: Natural text → Structured Wikipedia infobox format
  • Custom task: Sci-fi story generation with output length control

🤖 Outputs Examples

    {
        "notable_type": "artist",
        "attrs": {
            "name": "Igor Ivanov",
            "gender": "male",
            "nationality": "Russian",
            "birth_date": "18 October 1970",
            "birth_place": "Moscow, Russian Soviet Federative Socialist Republic",
            "known_for": "paintings of portraits, city scenes, still lifes, and flowers",
            "notable_works": "Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit",
            "movement": "realism",
            "alma_mater": "the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture)",
            "awards": "People's Artist of the USSR, Gold Medals, Hero of Socialist Labour",
            "elected": "People's Artist of Russia, Professor of the Academy",
            "mother": "Anastasia Nikolaevna Ivanova",
            "father": "Nikolay Ivanovich Ivanov",
            "partner": "Alexandra Alexandrovna Vasilieva",
            "children": "Alexandra Ivanovna Vasilieva, Anastasia Ivanovna Vasilieva"
        },

And the generated biography text:

biographies: "Igor Ivanov was born on October 18, 1970 in Moscow, Russia. His mother is Anastasia Nikolaevna Ivanova, and his father is Nikolay Ivanovich Ivanov. Ivanov is a Russian artist who specializes in oil paintings of portraits, city scenes, still lifes, and flowers. Ivanov attended the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture), where he received a bachelor's degree in painting and a master's degree in art history. His notable works include Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit and movement realism. He has received several awards, including the People's Artist of the USSR, Hero of Socialist Labour. Ivanov is also a Professor of the Academy, and he was elected a People's Artist of Russia in 2006. He is married to Alexandra Alexandrovna Vasilieva, and they have two daughters, Alexandra Ivanovna Vasilieva and Anastasia Ivanovna Vasilieva."

🎯 Key Results

Cost-Effective Training

  • Biography Parsing: 44.8% F-Score on 237 test cases with 100 training examples.
  • 15% ROI improvement over few-shot learning with only $2 training cost and 100 examples
  • Key takeaways: Data Quality > Model Scale. Careful data curation outperforms brute-force scaling

Cost vs Performance Trade-off (Fine-tuning vs Few-shot)

Attribute Extraction Performance

  • Strong Performance (85%+ F-Score):

    • Demographics: Gender (91.3%), Birth Date (90.5%), Name (86.1%), Nationality (87.9%)
    • The highest performing attributes are those with clear, unambiguous mentions in text, like basic biographical facts.
  • Challenges (<15% F-Score):

    • Specialized fields: Doctoral Students, Olympics, Academic Fields
    • It is challenging to obtain high-quality training data for these rare attributes.

Best vs Worst Attributes

Performance by Category

The biography parsing model extracted structured attributes with 60+ types (name, birth date, education, etc.), an F-score is computed for each category. The result shows that the model excels at simple, frequently-occurring attributes but struggles with complex, rare ones.

Category F-Score Extraction Performance
Demographics 90.0% Excellent extraction of gender, name, nationality
Temporal 87.8% Strong performance on birth/death dates
Relationships 70.0% Moderate success with family connections
Professional 50.2% Struggles with occupation, awards
Educational 45.6% Challenges with degrees, thesis details

Performance by Category

Training Progress

  • Rapid learning phase (0-100 steps): Loss drops from 1.2 → 0.67
  • Refinement phase (100-200 steps): Loss decreases to 0.38
  • Convergence phase (200-300 steps): Final loss stabilizes at 0.30-0.55
  • Total training time: ~18 minutes, 300 epochs
  • Final loss shows minor discrepancy between notebook (0.55) and smoothed curve (0.30) due to training variance.

Training Loss Curve

Gender Bias Detection

Discovered significant stereotypes in GPT-3.5 outputs by analyzing the professors names in generated biographies:

  • Physics professors: 100% male names
  • Gender Studies professors: 100% female names
  • Computer Science: 60% male, 40% female
  • Bias Awareness: Pre-trained LLMs encode biases from training data. Fine-tuning alone insufficient for bias mitigation.

Gender Bias in Attribute Extraction

Error Analysis

Primary error types:

  • Missing Attributes (34.1%): Model fails to extract present information
  • Wrong Values (25.7%): Incorrect attribute values
  • Hallucination (17.2%): Model invents non-existent information

Error Analysis

🛠️ Technical Skills

  • Natural Language Processing (NLP): Implemented text preprocessing pipelines, tokenization, generation, and feature extraction for biographical text.
  • Large Language Models (LLM): GPT-3.5-turbo, model fine-tuning, prompt engineering
  • Data preprocessing and Model Evaluation (precision, recall, F-score, cross-validation)
  • Machine Learning: Designed and trained models for generation tasks
  • Python (OpenAI API, pandas, json, matplotlib)
  • Jupyter Notebook

📁 Files Structure

biography-parser-story-generation/
├── biography_parser_story_generation.ipynb   # Main analysis notebook
├── visualizations/                           # Performance charts
├── visualization.py                          # Visualization generator script
└── README.md

About

✍️ Fine-tuned NLP system for Wiki bio parsing and sci-fi story generation, featuring text generation, structured data extraction, multi-attribute evaluation (60+ fields), cost–performance trade-off analysis, and bias detection.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors