Imagine a Wikipedia editor facing a mountain of biographies to process. Each article looks like this:
Zulima Aban was a Venezuelan astronomer born on December 5, 1905, in Valencia, Spain. She studied at the University of Valencia...
The job? Extract structured data for the info box:
Name: Zulima Aban
Birth Date: 05 December 1905
Birth Place: Valencia, Spain
Occupation: Astronomer
Just like Google Translate converts between languages, this system converts between data formats and human stories.
Designed an automatic writing bot capable of completing prompts and mimics the tone of specific literary styles (a example output from this project):
This project demonstrates practical large language models (LLMs GPT-3.5) fine-tuning for structured data extraction, text generation, and prompt engineering. It also explored model evaluation, cost-performance trade-offs, and limitations of large language models in real-world NLP tasks. 📓 View Full Notebook
-
Zero-shot learning task: Summarization, Q&A, simplification, translation
-
Few-shot learning task: Subject classification
-
Fine-tuning:
- Biography Generation: Structured data → Natural language biographies
- Biography Parsing: Natural text → Structured Wikipedia infobox format
-
Custom task: Sci-fi story generation with output length control
{
"notable_type": "artist",
"attrs": {
"name": "Igor Ivanov",
"gender": "male",
"nationality": "Russian",
"birth_date": "18 October 1970",
"birth_place": "Moscow, Russian Soviet Federative Socialist Republic",
"known_for": "paintings of portraits, city scenes, still lifes, and flowers",
"notable_works": "Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit",
"movement": "realism",
"alma_mater": "the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture)",
"awards": "People's Artist of the USSR, Gold Medals, Hero of Socialist Labour",
"elected": "People's Artist of Russia, Professor of the Academy",
"mother": "Anastasia Nikolaevna Ivanova",
"father": "Nikolay Ivanovich Ivanov",
"partner": "Alexandra Alexandrovna Vasilieva",
"children": "Alexandra Ivanovna Vasilieva, Anastasia Ivanovna Vasilieva"
},
And the generated biography text:
biographies: "Igor Ivanov was born on October 18, 1970 in Moscow, Russia. His mother is Anastasia Nikolaevna Ivanova, and his father is Nikolay Ivanovich Ivanov. Ivanov is a Russian artist who specializes in oil paintings of portraits, city scenes, still lifes, and flowers. Ivanov attended the Moscow College of Art (now in the Moscow Institute of Painting, Sculpture and Architecture), where he received a bachelor's degree in painting and a master's degree in art history. His notable works include Portraits of Celebrities, Still Life with Flowers, Flowers and Fruit and movement realism. He has received several awards, including the People's Artist of the USSR, Hero of Socialist Labour. Ivanov is also a Professor of the Academy, and he was elected a People's Artist of Russia in 2006. He is married to Alexandra Alexandrovna Vasilieva, and they have two daughters, Alexandra Ivanovna Vasilieva and Anastasia Ivanovna Vasilieva."
- Biography Parsing: 44.8% F-Score on 237 test cases with 100 training examples.
- 15% ROI improvement over few-shot learning with only $2 training cost and 100 examples
- Key takeaways: Data Quality > Model Scale. Careful data curation outperforms brute-force scaling
-
Strong Performance (85%+ F-Score):
- Demographics: Gender (91.3%), Birth Date (90.5%), Name (86.1%), Nationality (87.9%)
- The highest performing attributes are those with clear, unambiguous mentions in text, like basic biographical facts.
-
Challenges (<15% F-Score):
- Specialized fields: Doctoral Students, Olympics, Academic Fields
- It is challenging to obtain high-quality training data for these rare attributes.
The biography parsing model extracted structured attributes with 60+ types (name, birth date, education, etc.), an F-score is computed for each category. The result shows that the model excels at simple, frequently-occurring attributes but struggles with complex, rare ones.
| Category | F-Score | Extraction Performance |
|---|---|---|
| Demographics | 90.0% | Excellent extraction of gender, name, nationality |
| Temporal | 87.8% | Strong performance on birth/death dates |
| Relationships | 70.0% | Moderate success with family connections |
| Professional | 50.2% | Struggles with occupation, awards |
| Educational | 45.6% | Challenges with degrees, thesis details |
- Rapid learning phase (0-100 steps): Loss drops from 1.2 → 0.67
- Refinement phase (100-200 steps): Loss decreases to 0.38
- Convergence phase (200-300 steps): Final loss stabilizes at 0.30-0.55
- Total training time: ~18 minutes, 300 epochs
- Final loss shows minor discrepancy between notebook (0.55) and smoothed curve (0.30) due to training variance.
Discovered significant stereotypes in GPT-3.5 outputs by analyzing the professors names in generated biographies:
- Physics professors: 100% male names
- Gender Studies professors: 100% female names
- Computer Science: 60% male, 40% female
- Bias Awareness: Pre-trained LLMs encode biases from training data. Fine-tuning alone insufficient for bias mitigation.
Primary error types:
- Missing Attributes (34.1%): Model fails to extract present information
- Wrong Values (25.7%): Incorrect attribute values
- Hallucination (17.2%): Model invents non-existent information
- Natural Language Processing (NLP): Implemented text preprocessing pipelines, tokenization, generation, and feature extraction for biographical text.
- Large Language Models (LLM): GPT-3.5-turbo, model fine-tuning, prompt engineering
- Data preprocessing and Model Evaluation (precision, recall, F-score, cross-validation)
- Machine Learning: Designed and trained models for generation tasks
- Python (OpenAI API, pandas, json, matplotlib)
- Jupyter Notebook
biography-parser-story-generation/
├── biography_parser_story_generation.ipynb # Main analysis notebook
├── visualizations/ # Performance charts
├── visualization.py # Visualization generator script
└── README.md






