Self-contained end-to-end pipeline that turns raw Kisan Call Centre (KCC) query data into a clean, ranked FAQ CSV of answer-distinct agricultural questions with professionally generated English Q&A pairs.
Raw KCC CSV → Clustering → LLM Repair → Unique-Q Extraction → Filtered FAQ → Q&A Generation
- Setup
- Running the Pipeline
- Pipeline Stages in Detail
- CLI Reference
- Resuming an Interrupted Run
- Output Files
- Irrelevant Corpus Filter
- Running Individual Scripts
- Hardware Requirements
# From the project root
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Install PyTorch separately with your CUDA version, e.g.:
pip install torch --index-url https://download.pytorch.org/whl/cu121
# For Stage 7 (Q&A generation), install vLLM:
pip install "vllm>=0.6.0"
# Full run, with Stage 4 on Claude Haiku (Anthropic Batch API)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--api-key sk-ant-...
# Full run on the local Qwen model only (no API key)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--model /path/to/qwen2.5-7b-instruct
# Local run that skips Stage 7 (Q&A generation)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--model /path/to/qwen2.5-7b-instruct \
--skip-qa-gen
Note: --raw-file should contain a Crop column and a QueryText column (or query_text). The pipeline keeps only rows where Crop matches --crop exactly.
flowchart LR
A["Raw CSV"] --> B["Stage 1\nPhase 1\nHP Screening"]
B --> C["Stage 2\nPhase 2\nLLM Eval"]
C --> D["Stage 3\nCluster Repair\nA→B→C→D→E"]
D --> E["Stage 4\nUnique-Q\nExtraction"]
E --> F["Stage 5\nDedup"]
F --> G["Stage 6\nCorpus Filter"]
G --> H["Stage 7\nQ&A Gen\nvLLM Batch"]
H --> I["FAQ Q&A CSV"]
Script: pipeline/hyperparameter_tuning.py
The pipeline runs a grid search over HDBSCAN + UMAP hyperparameter combinations to find configurations that produce good clusters.
- Queries are embedded with paraphrase-multilingual-mpnet-base-v2
- A hybrid distance matrix is computed: α × cosine(dense) + (1-α) × jaccard(TF-IDF)
- Each config is scored on geometric metrics: 50 < n_clusters < 1000, noise_ratio < 30 %, clusters_for_85pct > 5
- Noise points are reassigned via KNN (nearest non-noise neighbour)
- Viable configs are saved as candidates for Stage 2
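The hybrid distance in the list above can be sketched in a few lines. This is a minimal illustration, not the code in pipeline/hyperparameter_tuning.py: it assumes that "cosine" means cosine distance on the dense embeddings, that "jaccard" is taken over the set of terms with non-zero TF-IDF weight, and that α = 0.7; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def jaccard_distance(terms_a, terms_b):
    """1 - |A and B| / |A or B| over two sets of terms."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def hybrid_distance_matrix(embeddings, term_sets, alpha=0.7):
    """alpha * cosine(dense) + (1 - alpha) * jaccard(terms), for every pair."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dense = cosine_distance(embeddings[i], embeddings[j])
            sparse = jaccard_distance(term_sets[i], term_sets[j])
            dist[i][j] = dist[j][i] = alpha * dense + (1 - alpha) * sparse
    return dist


# Toy data: two near-duplicate queries and one unrelated query.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
term_sets = [{"maize", "sowing"}, {"maize", "sowing"}, {"wheat", "rust"}]
matrix = hybrid_distance_matrix(embeddings, term_sets, alpha=0.5)
```

A precomputed matrix of this shape is what a clusterer configured for precomputed distances would consume; the weighting lets lexical overlap break ties between queries whose embeddings are close.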
Controls: --grid-mode, --max-queries
Outputs: phase1_candidates.csv, phase1_results.pkl
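The viability thresholds listed for this stage can be expressed as a small screening function. The sketch below makes two assumptions that the README does not spell out: noise points carry the label -1 (the HDBSCAN convention), and clusters_for_85pct is the number of largest clusters needed to cover 85 % of the non-noise queries. The function names are hypothetical.

```python
from collections import Counter


def clusters_for_coverage(labels, coverage=0.85):
    """How many of the largest clusters are needed to reach the coverage share."""
    sizes = sorted(Counter(l for l in labels if l != -1).values(), reverse=True)
    target = coverage * sum(sizes)
    covered = 0
    for count, size in enumerate(sizes, start=1):
        covered += size
        if covered >= target:
            return count
    return 0


def is_viable(labels):
    """Apply the three Stage 1 thresholds to one clustering result."""
    n_clusters = len({l for l in labels if l != -1})
    noise_ratio = sum(1 for l in labels if l == -1) / len(labels)
    return (
        50 < n_clusters < 1000
        and noise_ratio < 0.30
        and clusters_for_coverage(labels) > 5
    )


# 60 clusters of 10 queries each plus 100 noise points.
good_labels = [c for c in range(60) for _ in range(10)] + [-1] * 100
# Only 10 clusters: fails the n_clusters lower bound.
bad_labels = [c for c in range(10) for _ in range(10)]
```

Here is_viable(good_labels) passes all three checks, while is_viable(bad_labels) is rejected because it has too few clusters.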
Script: pipeline/llm_evaluator_hf.py
Model: Qwen2.5-7B-Instruct (local GPU)
The top-K candidates from Phase 1 are evaluated by the LLM on four axes per cluster sample:
| Score | Prompt asks | Good = |
|---|---|---|
| Coherence | Are all questions about the SAME topic? | A (same) |
| Separation | Should these two clusters stay separate? | A (different) |
| Merge | Should these near clusters merge? | A (merge) |
| Outlier | Which question doesn't belong? | -1 (none) |
A composite score is computed and the winning hyperparameter config is selected.
Controls: --phase2-top-k, --coverage-cap
Outputs: phase2_scores.csv
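The README does not give the composite formula, so the following is only one plausible way to combine the four axes: take the fraction of "good" verdicts on each axis (using the Good column of the table above) and average them with equal weights. The equal weighting, the input layout, and all names are assumptions made for illustration.

```python
# "Good" verdict per axis, taken from the table above.
GOOD_VERDICT = {"coherence": "A", "separation": "A", "merge": "A", "outlier": -1}


def fraction_good(verdicts, good):
    """Share of LLM verdicts on one axis that match the good value."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v == good) / len(verdicts)


def composite_score(verdicts_by_axis):
    """Equal-weight mean of the per-axis good fractions (assumed weighting)."""
    parts = [
        fraction_good(verdicts_by_axis.get(axis, []), good)
        for axis, good in GOOD_VERDICT.items()
    ]
    return sum(parts) / len(parts)


example = {
    "coherence": ["A", "A"],
    "separation": ["A", "B"],
    "merge": ["A"],
    "outlier": [-1, -1],
}
score = composite_score(example)
```

With a scheme like this, the config with the highest score among the top-K candidates would be the one carried into Stage 3.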
Script: pipeline/cluster_repair.py
Model: Qwen2.5-7B-Instruct (local GPU)
The best clustering is re-run and then cleaned up in five steps:
| Step | What it does |
|---|---|
| A: Diverse Reps | Selects k=3 maximally diverse representative questions per cluster using greedy furthest-point selection |
| B: Cross-Crop Filter | LLM removes questions that are about a different crop than --crop |
| C: Coherence + Split | LLM rates each cluster A/B/C; clusters rated C are sent to a detailed split prompt to create sub-clusters |
| D: Merge | Clusters with cosine similarity ≥ --merge-sim are proposed as merge candidates; LLM confirms |
| E: Back-Mapping | Every original raw row is mapped to its final cluster; outputs the working files for Stage 4 |
Controls: --diverse-k, --coherence-flag, --merge-sim
Outputs: repaired_clusters.csv, cluster_questions.csv, raw_row_mapping.csv
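Step A's greedy furthest-point selection can be illustrated as follows. This is a sketch of the general technique, not the code in pipeline/cluster_repair.py: it assumes cosine distance on question embeddings and seeds the selection with the first question in the cluster, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def diverse_representatives(vectors, k=3, seed=0):
    """Pick up to k indices, each time taking the point furthest from those chosen."""
    if not vectors:
        return []
    chosen = [seed]
    # Distance from every point to its nearest already-chosen representative.
    min_dist = [cosine_distance(v, vectors[seed]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [
            min(d, cosine_distance(v, vectors[nxt]))
            for d, v in zip(min_dist, vectors)
        ]
    return chosen


# Index 1 is almost identical to index 0, so it is picked last (here: not at all).
cluster_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
reps = diverse_representatives(cluster_vectors, k=3)
```

Because each pick maximises the distance to the nearest existing representative, near-duplicate phrasings are left out and the LLM sees the spread of the cluster rather than its densest corner.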
Script: pipeline/unique_question_finder.py
Model: Claude Haiku (Batch API) or Qwen2.5-7B (local)
Within each cluster, not all questions need a different answer. This stage groups questions that would get the same agricultural advice into a single answer-distinct group.
- The representative question of each group = the one with the highest raw call frequency
- Cross-cluster deduplication is then run: groups from different clusters that are near-identical (embedding cosine ≥ --dedup-thresh, default 0.92) are merged and their frequencies summed
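The cross-cluster merge described above might look like the sketch below. It uses a union-find structure to group near-identical questions, which is an implementation choice of this example rather than something the README specifies; it also assumes the merged group keeps the highest-frequency question as its representative. The field names follow the output columns documented later, except embedding, which is hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_near_identical(groups, threshold=0.92):
    """Merge groups from different clusters whose embeddings are near-identical."""
    parent = list(range(len(groups)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            a, b = groups[i], groups[j]
            if a["cluster_id"] == b["cluster_id"]:
                continue  # only cross-cluster pairs are considered here
            if cosine_similarity(a["embedding"], b["embedding"]) >= threshold:
                parent[find(j)] = find(i)

    buckets = {}
    for idx, group in enumerate(groups):
        buckets.setdefault(find(idx), []).append(group)

    merged = []
    for members in buckets.values():
        best = max(members, key=lambda g: g["raw_frequency"])
        merged.append({
            "representative_question": best["representative_question"],
            "raw_frequency": sum(g["raw_frequency"] for g in members),
            "merged_from": [g["unique_q_id"] for g in members] if len(members) > 1 else [],
        })
    return sorted(merged, key=lambda g: g["raw_frequency"], reverse=True)


groups = [
    {"unique_q_id": "1_0", "cluster_id": 1, "raw_frequency": 10,
     "representative_question": "maize sowing time", "embedding": [1.0, 0.0]},
    {"unique_q_id": "2_0", "cluster_id": 2, "raw_frequency": 25,
     "representative_question": "best time to sow maize", "embedding": [0.999, 0.01]},
    {"unique_q_id": "2_1", "cluster_id": 2, "raw_frequency": 5,
     "representative_question": "stem borer control", "embedding": [0.0, 1.0]},
]
faq = merge_near_identical(groups)
```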
LLM provider selection:
- Pass --api-key → Claude Haiku via Anthropic Batch API (faster, no GPU needed for this stage)
- Omit --api-key → Qwen 7B via local HF (slower, requires GPU)
Outputs: unique_questions.csv, unique_questions_freq.csv, unique_question_mapping.csv
Script: pipeline/dedup_freq_csv.py
Case-insensitive deduplication of representative_question in the FAQ CSV.
When exact-string duplicates exist (after strip + lowercase), the one with the higher raw_frequency is kept. Ranks are removed from the final output.
Outputs: unique_questions_freq.csv (overwritten in-place, cleaned)
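The deduplication rule is simple enough to show directly. The sketch below works on a list of row dicts rather than on the CSV file, and it assumes the rank column is literally named rank; both are simplifications for illustration, and the function name is hypothetical.

```python
def dedup_by_question(rows):
    """Keep one row per normalised question, preferring the higher raw_frequency."""
    best = {}
    for row in rows:
        key = row["representative_question"].strip().lower()
        if key not in best or row["raw_frequency"] > best[key]["raw_frequency"]:
            best[key] = row
    # Drop the (assumed) rank column and order by frequency, highest first.
    cleaned = [{k: v for k, v in row.items() if k != "rank"} for row in best.values()]
    return sorted(cleaned, key=lambda r: r["raw_frequency"], reverse=True)


rows = [
    {"representative_question": "Maize sowing time ", "raw_frequency": 12, "rank": 1},
    {"representative_question": "maize sowing time", "raw_frequency": 30, "rank": 2},
    {"representative_question": "Stem borer control", "raw_frequency": 20, "rank": 3},
]
deduped = dedup_by_question(rows)
```

Only exact matches after normalisation are collapsed here; near-duplicates with different wording are the job of the embedding-based merge in Stage 4.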
Script: pipeline/filter_faq_corpus.py
Config: config/irrelevant_corpus.yaml
Removes rows whose questions or cluster labels contain keywords from the irrelevant corpus. Filtered categories include:
| Category | Examples removed |
|---|---|
| contact_details | "Contact no of KVK…", "Chief Agricultural Officer phone number" |
| market_price | "Mandi rate of maize", "MSP price details" |
| weather | "Weather forecast for Amritsar", "Rainfall today" |
| subsidy_scheme | "Subsidy on maize seeds", "Government scheme for machinery" |
| incomplete_call | "Wrong number", "Call disconnected" |
| banking_admin | "KCC loan details", "Bank manager contact" |
The filter checks three columns: representative_question, cluster_label, and answer_label.
Removed rows are saved to corpus_filtered_out.csv for auditing.
Controls: --corpus-file, --skip-corpus-filter
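A keyword filter over the three checked columns could be sketched like this. It assumes plain case-insensitive substring matching, which may differ from what pipeline/filter_faq_corpus.py actually does; the matched_category field and the function name are hypothetical additions for the audit output.

```python
CHECKED_COLUMNS = ("representative_question", "cluster_label", "answer_label")


def split_by_corpus(rows, keywords_by_category):
    """Return (kept, removed); a row is removed if any checked column has a keyword."""
    kept, removed = [], []
    for row in rows:
        texts = [str(row.get(col, "")).lower() for col in CHECKED_COLUMNS]
        hit = None
        for category, keywords in keywords_by_category.items():
            if any(kw.lower() in text for kw in keywords for text in texts):
                hit = category
                break
        if hit is None:
            kept.append(row)
        else:
            removed.append({**row, "matched_category": hit})
    return kept, removed


keywords_by_category = {
    "market_price": ["mandi rate", "msp"],
    "weather": ["weather forecast"],
}
faq_rows = [
    {"representative_question": "Mandi rate of maize",
     "cluster_label": "prices", "answer_label": "rate"},
    {"representative_question": "How to control stem borer in maize?",
     "cluster_label": "pest control", "answer_label": "stem borer"},
]
kept, removed = split_by_corpus(faq_rows, keywords_by_category)
```

Substring matching is deliberately blunt: a short keyword such as "msp" can match inside a longer word, so short entries in the corpus deserve a manual check against corpus_filtered_out.csv.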
Script: pipeline/vllm_batch_qa_generator.py
Model: Qwen2.5-7B-Instruct via vLLM offline batch inference
Generates professional English Q&A pairs for each unique question in the FAQ.
- Uses vLLM's offline LLM.generate() for high-throughput batch processing
- Crop-specific system prompts with expert agricultural hints
- Automatic irrelevant-crop rejection (synonym-aware)
- Mandatory English output: translates Hindi/Hinglish input automatically
- Each answer includes: step-by-step technical guidance, chemical dosages, safety notes, and a KVK referral footer
Output format per row:
| Column | Description |
|---|---|
| Generated_Question | Polished English version of the farmer's question |
| Generated_Category | One of: Disease, Pest, Fertilizer and Nutrient, Variety, Agronomy, Other, IRRELEVANT_CROP |
| Generated_Answer | 200–400 word professional technical answer |
Controls: --skip-qa-gen, --model
Outputs: unique_questions_freq_qa.csv
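Since the output format is fixed, a quick sanity check on generated rows can catch malformed model output before the CSV is published. The checker below is a hypothetical helper, not part of the pipeline; it assumes that rows rejected as IRRELEVANT_CROP are exempt from the 200–400 word range.

```python
ALLOWED_CATEGORIES = {
    "Disease", "Pest", "Fertilizer and Nutrient", "Variety",
    "Agronomy", "Other", "IRRELEVANT_CROP",
}
REQUIRED_COLUMNS = ("Generated_Question", "Generated_Category", "Generated_Answer")


def check_qa_row(row, min_words=200, max_words=400):
    """Return a list of problems found in one generated Q&A row (empty if none)."""
    problems = []
    for col in REQUIRED_COLUMNS:
        if not str(row.get(col) or "").strip():
            problems.append(f"missing {col}")
    category = row.get("Generated_Category")
    if category and category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category}")
    n_words = len(str(row.get("Generated_Answer") or "").split())
    # Assumption: rejected rows need no full-length answer.
    if category != "IRRELEVANT_CROP" and not min_words <= n_words <= max_words:
        problems.append(f"answer has {n_words} words")
    return problems


ok_row = {
    "Generated_Question": "How do I control stem borer in maize?",
    "Generated_Category": "Pest",
    "Generated_Answer": " ".join(["word"] * 250),
}
short_row = {**ok_row, "Generated_Answer": "too short"}
odd_row = {**ok_row, "Generated_Category": "Weather"}
```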
Standalone usage:
python pipeline/vllm_batch_qa_generator.py \
--input outputs/repair/maize_makka/unique_questions_freq.csv \
--crop "Maize Makka" \
--model /path/to/qwen2.5-7b-instruct \
--gpu-util 0.90
python run_pipeline.py --help
| Argument | Description |
|---|---|
| --raw-file PATH | Path to raw KCC CSV (Crop + QueryText columns required) |
| --crop NAME | Crop name exactly as in the Crop column, e.g. "Maize Makka" |
| Argument | Default | Description |
|---|---|---|
| --api-key KEY | - | Anthropic API key; if omitted, local Qwen is used for Stage 4 |
| --model PATH | qwen2.5-7b-instruct | Local HF model path (Stages 2, 3, and 4 if no API key) |
| --gpu-id N | 0 | CUDA device index |
| --batch-size N | 8 | LLM inference batch size |
| Argument | Default | Description |
|---|---|---|
| --output-dir PATH | outputs/repair | Base directory for all outputs |
| --skip-phase1 | - | Skip Stage 1; load existing phase1_results.pkl |
| --skip-phase2 | - | Skip Stage 2; use best config from phase2_scores.csv |
| --skip-repair | - | Skip Stage 3; use existing cluster_questions.csv |
| --skip-unique-q | - | Skip Stage 4; use existing unique_questions_freq.csv |
| --skip-corpus-filter | - | Skip Stage 6 |
| --skip-qa-gen | - | Skip Stage 7 (Q&A generation via vLLM) |
| --corpus-file PATH | config/irrelevant_corpus.yaml | Path to irrelevant keywords YAML |
| Argument | Default | Description |
|---|---|---|
| --grid-mode | medium | HP search grid: quick (18) / medium (108) / full (240) / exhaustive (480) |
| --max-queries | 20000 | Max unique queries to cluster (random sample if exceeded) |
| --phase2-top-k | 5 | Number of Phase 1 candidates to LLM-evaluate in Stage 2 |
| --coverage-cap | 0.80 | Fraction of query volume covered in Phase 2 evaluation |
| Argument | Default | Description |
|---|---|---|
| --diverse-k | 3 | Number of diverse reps per cluster in Step A |
| --coherence-flag | C | LLM rating that triggers a split (B = aggressive, C = conservative) |
| --merge-sim | 0.82 | Cosine similarity threshold for merge candidates in Step D |
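For readers who want to see how the documented flags and defaults fit together, the tables above can be mirrored in an argparse parser. This is a reconstruction from the reference tables only, not the parser that ships in run_pipeline.py; value types and the choices lists are inferred from the descriptions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented run_pipeline.py flags and defaults."""
    p = argparse.ArgumentParser(prog="run_pipeline.py")
    # Required
    p.add_argument("--raw-file", required=True)
    p.add_argument("--crop", required=True)
    # LLM
    p.add_argument("--api-key", default=None)
    p.add_argument("--model", default="qwen2.5-7b-instruct")
    p.add_argument("--gpu-id", type=int, default=0)
    p.add_argument("--batch-size", type=int, default=8)
    # Pipeline control
    p.add_argument("--output-dir", default="outputs/repair")
    for flag in ("--skip-phase1", "--skip-phase2", "--skip-repair",
                 "--skip-unique-q", "--skip-corpus-filter", "--skip-qa-gen"):
        p.add_argument(flag, action="store_true")
    p.add_argument("--corpus-file", default="config/irrelevant_corpus.yaml")
    # Clustering and evaluation
    p.add_argument("--grid-mode", default="medium",
                   choices=["quick", "medium", "full", "exhaustive"])
    p.add_argument("--max-queries", type=int, default=20000)
    p.add_argument("--phase2-top-k", type=int, default=5)
    p.add_argument("--coverage-cap", type=float, default=0.80)
    # Repair
    p.add_argument("--diverse-k", type=int, default=3)
    p.add_argument("--coherence-flag", default="C", choices=["B", "C"])
    p.add_argument("--merge-sim", type=float, default=0.82)
    return p


args = build_parser().parse_args(
    ["--raw-file", "data/raw/punjab_maize_raw.csv", "--crop", "Maize Makka", "--skip-phase1"]
)
```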
Each stage writes intermediate files. If a run fails, you can resume from any stage:
# Resume from Stage 2 (Phase 1 already done)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--skip-phase1
# Resume from Stage 4 (Stages 1–3 done)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--skip-phase1 --skip-phase2 --skip-repair \
--api-key sk-ant-...
# Run only Stage 7 (Q&A generation on existing FAQ)
python run_pipeline.py \
--raw-file data/raw/punjab_maize_raw.csv \
--crop "Maize Makka" \
--skip-phase1 --skip-phase2 --skip-repair \
--skip-unique-q --skip-corpus-filter
# Run Stage 7 standalone
python pipeline/vllm_batch_qa_generator.py \
--input outputs/repair/maize_makka/unique_questions_freq.csv \
--crop "Maize Makka" \
--model /path/to/qwen2.5-7b-instruct
# Run only Stage 6 (filter on existing FAQ)
python pipeline/filter_faq_corpus.py \
--input outputs/repair/maize_makka/unique_questions_freq.csv \
--corpus config/irrelevant_corpus.yaml
All outputs are written to outputs/repair/<crop_slug>/:
| File | Description |
|---|---|
| phase1_candidates.csv | All valid HP configs from grid search |
| phase1_results.pkl | Serialized clustering results (used for resuming) |
| phase2_scores.csv | LLM coherence/separation/merge scores per config |
| repaired_clusters.csv | Cluster summary after Steps A–E |
| cluster_questions.csv | Every question per repaired cluster |
| raw_row_mapping.csv | Every original raw row → final cluster ID |
| unique_questions.csv | All unique question groups with full metadata |
| unique_questions_freq.csv | FAQ questions, ranked by call frequency |
| unique_questions_freq_qa.csv | FAQ Q&A pairs: questions + generated answers (Stage 7) |
| unique_question_mapping.csv | Every input question → unique group ID |
| corpus_filtered_out.csv | Rows removed by Stage 6 (for auditing) |
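The example paths in this README map the crop "Maize Makka" to the directory maize_makka. One slug rule consistent with that single example is lowercasing and replacing runs of non-alphanumeric characters with underscores; the pipeline's actual rule may differ, and both function names below are hypothetical.

```python
import re
from pathlib import Path


def crop_slug(crop):
    """Lowercase the crop name and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", crop.lower()).strip("_")


def crop_output_dir(crop, output_dir="outputs/repair"):
    """Directory where all files for one crop would be written."""
    return Path(output_dir) / crop_slug(crop)


maize_dir = crop_output_dir("Maize Makka")
```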
| Column | Description |
|---|---|
| unique_q_id | Group identifier, format <cluster_id>_<group_idx> |
| representative_question | The question text shown in the FAQ |
| raw_frequency | Number of times this question (or equivalent) was asked |
| pct_of_total | Percentage of total calls for the crop |
| cluster_id | Source cluster ID |
| cluster_label | Short topic label for the cluster |
| answer_label | Short label for this answer group |
| n_questions_in_group | How many distinct phrasings were grouped together |
| was_cluster_split | Whether the cluster was split during repair |
| parent_cluster | Original cluster before split (if applicable) |
| merged_from | Source group IDs if this group was merged |
| merged_cross_cluster | True if the merge crossed cluster boundaries |
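Two of these columns are derived values, and a short sketch shows how they relate to the rest. It assumes pct_of_total is raw_frequency divided by the total call count for the crop (defaulting to the sum over the rows given) and that unique_q_id can be split on its last underscore; the helper names are hypothetical.

```python
def split_unique_q_id(unique_q_id):
    """Split '<cluster_id>_<group_idx>' on the last underscore."""
    cluster_id, group_idx = unique_q_id.rsplit("_", 1)
    return cluster_id, int(group_idx)


def rank_with_pct(rows, total_calls=None):
    """Sort rows by raw_frequency (descending) and attach pct_of_total."""
    total = total_calls if total_calls is not None else sum(r["raw_frequency"] for r in rows)
    ranked = sorted(rows, key=lambda r: r["raw_frequency"], reverse=True)
    return [
        {**r, "pct_of_total": round(100.0 * r["raw_frequency"] / total, 2)}
        for r in ranked
    ]


sample = [
    {"unique_q_id": "12_0", "raw_frequency": 30},
    {"unique_q_id": "7_1", "raw_frequency": 50},
    {"unique_q_id": "12_3", "raw_frequency": 20},
]
ranked = rank_with_pct(sample)
```

Passing total_calls explicitly matters when rows have already been filtered out, since the percentage is defined against all calls for the crop rather than against the rows that remain.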
The corpus filter config lives in config/irrelevant_corpus.yaml.
To add new irrelevant categories, open the YAML and add a new block:
categories:
  your_category:
    description: "What this category covers"
    keywords:
      - keyword one
      - multi word phrase
    typos:
      - comman typo
The filter will pick it up automatically on the next run; no code changes needed.
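To make the "no code changes" point concrete, the sketch below flattens a parsed config of this shape into one lowercase keyword list per category, treating typos as additional keywords. The input dict stands in for what a YAML loader such as PyYAML's yaml.safe_load would return for the block above; the nesting of typos under the category and the function name are assumptions.

```python
def flatten_corpus(config):
    """Map each category name to a sorted list of lowercase keywords and typos."""
    flat = {}
    for name, block in config.get("categories", {}).items():
        terms = list(block.get("keywords", [])) + list(block.get("typos", []))
        flat[name] = sorted({t.strip().lower() for t in terms if t and t.strip()})
    return flat


# Stand-in for the parsed YAML block shown above.
parsed = {
    "categories": {
        "your_category": {
            "description": "What this category covers",
            "keywords": ["keyword one", "Multi Word Phrase"],
            "typos": ["comman typo"],
        }
    }
}
corpus = flatten_corpus(parsed)
```

Because the loop iterates over whatever categories are present, a new block in the YAML shows up in the result without any change to the code.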
| Stage | CPU/GPU | Notes |
|---|---|---|
| Stage 1: Clustering | GPU recommended | Embedding ~20k queries takes ~2 min on GPU, ~15 min on CPU |
| Stage 2: LLM Eval | GPU required | Loads Qwen2.5-7B (~16 GB VRAM in bfloat16) |
| Stage 3: Repair | GPU required | Same Qwen model |
| Stage 4: Unique-Q | GPU or API | Claude Haiku via --api-key requires no GPU |
| Stage 5: Dedup | CPU only | Instant |
| Stage 6: Filter | CPU only | ~1 sec for 1000 rows |
| Stage 7: Q&A Gen | GPU required | vLLM loads Qwen2.5-7B (~16 GB VRAM); ~50–100 rq/s on RTX 4090+ |
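The ~16 GB figure can be cross-checked with back-of-the-envelope arithmetic: bfloat16 stores each parameter in 2 bytes. The parameter count of roughly 7.6 billion used below is an assumption taken from the published size of Qwen2.5-7B, and the estimate covers weights only.

```python
def weights_gb(n_params, bytes_per_param=2):
    """Approximate weight memory in GB (10^9 bytes); bfloat16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


# Assumed parameter count for Qwen2.5-7B; weights only, no KV cache or activations.
qwen_weights_gb = weights_gb(7.6e9)
```

That comes to about 15 GB for the weights alone, which accounts for most of the quoted figure; KV cache and activations need headroom on top of it, so a 24 GB card is a more comfortable fit than a 16 GB one.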