AI4CHEMIA/IgnitionGPT
IgnitionGPT: Automated Extraction of Ignition-Quality Metrics Using LLMs

Author: Abdulelah S. Alshehri
Affiliation: Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia

πŸ“‘ Abstract

Several ignition-quality metrics, including Research Octane Number (RON), Motor Octane Number (MON), and Cetane Number (CN), are critical for predictive combustion modeling, engine optimization, and emissions assessment. However, experimental determination is resource-intensive and limited in scope, while existing datasets, typically comprising 200-500 compounds, capture only a small fraction of the millions of potential candidates within the 1-20 carbon hydrocarbon space. Conventional rule-based text-mining systems have sought to leverage unstructured literature but remain constrained by poor scalability, domain-specific nomenclature, and weak contextual inference. To address these limitations, we introduce IgnitionGPT, a domain-adapted large language model (LLM) fine-tuned from the general-purpose GPT-4.1-mini for automated extraction of ignition-relevant properties from unstructured scientific documents. The fine-tuning process employed a high-fidelity, human-annotated dataset of 304 sources (263 peer-reviewed articles and 41 patents) in JSONL format, encompassing 581 compounds across diverse chemical classes. Sequential training and validation experiments demonstrate that IgnitionGPT converges rapidly, outperforming the zero-shot performance of the general-purpose GPT-4.1-mini baseline (F1 of 47.8%) and reaching 100% extraction accuracy with as little as 9% of the dataset. The resulting model can significantly expand ignition-property coverage across chemical classes including alkanes, alcohols, ethers, furans, and aromatics, thereby enabling improved surrogate modeling, enhanced generalization to out-of-distribution chemistries, and more reliable integration into combustion simulations. All code and data are released for academic use to ensure reproducibility and foster expansion of ignition-based LLMs. Collectively, this work establishes a scalable framework for automated property extraction that bridges the gap between experimental limitations and the data-driven requirements of next-generation fuel design.

πŸš€ Key Features

  • Fine-tuned from GPT-4.1-mini for ignition-property extraction.
  • Training dataset: 304 curated sources (263 peer-reviewed articles, 41 patents).
  • Covers 581 compounds across diverse chemical classes.
  • Achieves 100% extraction accuracy with as little as 9% of the training data.
  • Outperforms zero-shot GPT-4.1-mini baseline (F1 β‰ˆ 48%).
  • Covers chemical classes such as alkanes, alcohols, ethers, furans, and aromatics.
  • Enables improved surrogate modeling and integration into combustion simulations.
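A minimal sketch of how extracted output might be consumed downstream. The JSON schema, field names, and the plausibility bounds are assumptions for illustration, not the model's documented output format; the RON/MON/CN values for ethanol are commonly cited literature figures used here only as an example.

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse a hypothetical IgnitionGPT JSON response into a validated record.

    Assumed schema (illustrative only):
      {"compound": str, "RON": number|null, "MON": number|null, "CN": number|null}
    """
    record = json.loads(raw)
    out = {"compound": record["compound"]}
    for metric in ("RON", "MON", "CN"):
        value = record.get(metric)
        # Ignition metrics are reported on bounded scales; treat wildly
        # out-of-range numbers as likely extraction errors.
        if value is not None and not (-60.0 <= float(value) <= 150.0):
            raise ValueError(f"{metric}={value} outside plausible range")
        out[metric] = None if value is None else float(value)
    return out

# Example: a hypothetical response for ethanol.
rec = parse_extraction('{"compound": "ethanol", "RON": 109, "MON": 90, "CN": 8}')
```

Validating ranges at parse time catches a common LLM-extraction failure mode (a stray page or table number captured as a property value) before it reaches a surrogate model.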

πŸ“‚ Dataset

  • Format: JSONL (annotated ignition-property data).
  • Sources: Peer-reviewed articles + patents.
  • License: MIT.
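One plausible layout for a single JSONL record, if the annotations follow the chat-style fine-tuning message format (an assumption — consult the released files for the actual schema; the 2-methylfuran snippet and RON value are illustrative):

```python
import json

# Hypothetical annotated record in a chat-style fine-tuning format:
# system instruction, source-text excerpt, and the target structured output.
example = {
    "messages": [
        {"role": "system",
         "content": "Extract RON, MON, and CN values from the text as JSON."},
        {"role": "user",
         "content": "The RON of 2-methylfuran was measured at 103.0 in a CFR engine."},
        {"role": "assistant",
         "content": json.dumps({"compound": "2-methylfuran",
                                "RON": 103.0, "MON": None, "CN": None})},
    ]
}

line = json.dumps(example)   # JSONL stores exactly one such record per line
restored = json.loads(line)  # reading the file back is line-by-line symmetric
```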
