AI4CHEMIA/IgnitionGPT
IgnitionGPT: Automated Extraction of Ignition-Quality Metrics Using LLMs

Author: Abdulelah S. Alshehri
Affiliation: Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia

πŸ“‘ Abstract

Several ignition-quality metrics, including Research Octane Number (RON), Motor Octane Number (MON), and Cetane Number (CN), are critical for predictive combustion modeling, engine optimization, and emissions assessment. However, experimental determination is resource-intensive and limited in scope, while existing datasets, typically comprising 200-500 compounds, capture only a small fraction of the millions of potential candidates within the 1-20 carbon hydrocarbon space. Conventional rule-based text-mining systems have sought to leverage unstructured literature but remain constrained by poor scalability, domain-specific nomenclature, and weak contextual inference. To address these limitations, we introduce IgnitionGPT, a domain-adapted large language model (LLM) fine-tuned from the general-purpose GPT-4.1-mini for automated extraction of ignition-relevant properties from unstructured scientific documents. The fine-tuning process employed a high-fidelity, human-annotated dataset of 304 sources (263 peer-reviewed articles and 41 patents) in JSONL format, encompassing 581 compounds across diverse chemical classes. Sequential training and validation experiments demonstrate that IgnitionGPT converges rapidly, outperforming the zero-shot performance of the general-purpose GPT-4.1-mini baseline (F1 of 47.8%) and reaching 100% extraction accuracy with as little as 9% of the dataset. The resulting model can significantly expand ignition-property coverage across chemical classes including alkanes, alcohols, ethers, furans, and aromatics, thereby enabling improved surrogate modeling, enhanced generalization to out-of-distribution chemistries, and more reliable integration into combustion simulations. All code and data are released for academic use to ensure reproducibility and foster expansion of ignition-based LLMs. Collectively, this work establishes a scalable framework for automated property extraction that bridges the gap between experimental limitations and the data-driven requirements of next-generation fuel design.

πŸš€ Key Features

  • Fine-tuned from GPT-4.1-mini for ignition-property extraction.
  • Training dataset: 304 curated sources (263 peer-reviewed articles, 41 patents).
  • Covers 581 compounds across diverse chemical classes.
  • Achieves 100% extraction accuracy with as little as 9% of the training data.
  • Outperforms zero-shot GPT-4.1-mini baseline (F1 β‰ˆ 48%).
  • Covers chemical classes such as alkanes, alcohols, ethers, furans, and aromatics.
  • Enables improved surrogate modeling and integration into combustion simulations.
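A minimal sketch of how extracted output might be consumed downstream. The JSON schema, field names, and the plausibility bounds are assumptions for illustration, not the model's documented output format; the RON/MON/CN values for ethanol are commonly cited literature figures used here only as an example.

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse a hypothetical IgnitionGPT JSON response into a validated record.

    Assumed schema (illustrative only):
      {"compound": str, "RON": number|null, "MON": number|null, "CN": number|null}
    """
    record = json.loads(raw)
    out = {"compound": record["compound"]}
    for metric in ("RON", "MON", "CN"):
        value = record.get(metric)
        # Ignition metrics are reported on bounded scales; treat wildly
        # out-of-range numbers as likely extraction errors.
        if value is not None and not (-60.0 <= float(value) <= 150.0):
            raise ValueError(f"{metric}={value} outside plausible range")
        out[metric] = None if value is None else float(value)
    return out

# Example: a hypothetical response for ethanol.
rec = parse_extraction('{"compound": "ethanol", "RON": 109, "MON": 90, "CN": 8}')
```

Validating ranges at parse time catches a common LLM-extraction failure mode (a stray page or table number captured as a property value) before it reaches a surrogate model.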

πŸ“‚ Dataset

  • Format: JSONL (annotated ignition-property data).
  • Sources: Peer-reviewed articles + patents.
  • License: MIT.
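One plausible layout for a single JSONL record, if the annotations follow the chat-style fine-tuning message format (an assumption — consult the released files for the actual schema; the 2-methylfuran snippet and RON value are illustrative):

```python
import json

# Hypothetical annotated record in a chat-style fine-tuning format:
# system instruction, source-text excerpt, and the target structured output.
example = {
    "messages": [
        {"role": "system",
         "content": "Extract RON, MON, and CN values from the text as JSON."},
        {"role": "user",
         "content": "The RON of 2-methylfuran was measured at 103.0 in a CFR engine."},
        {"role": "assistant",
         "content": json.dumps({"compound": "2-methylfuran",
                                "RON": 103.0, "MON": None, "CN": None})},
    ]
}

line = json.dumps(example)   # JSONL stores exactly one such record per line
restored = json.loads(line)  # reading the file back is line-by-line symmetric
```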
