vicharanashala/FAQCluster

kcc_faq - KCC FAQ Generation Pipeline

Self-contained end-to-end pipeline that turns raw Kisan Call Centre (KCC) query data into a clean, ranked FAQ CSV of answer-distinct agricultural questions with professionally generated English Q&A pairs.

Raw KCC CSV → Clustering → LLM Repair → Unique-Q Extraction → Filtered FAQ → Q&A Generation

Table of Contents

  1. Setup
  2. Running the Pipeline
  3. Pipeline Stages in Detail
  4. CLI Reference
  5. Resuming an Interrupted Run
  6. Output Files
  7. Irrelevant Corpus Filter
  8. Running Individual Scripts
  9. Hardware Requirements

Setup

# From the project root
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Install PyTorch separately with your CUDA version, e.g.:
pip install torch --index-url https://download.pytorch.org/whl/cu121

# For Stage 7 (Q&A generation), install vLLM:
pip install "vllm>=0.6.0"   # quoted so the shell does not treat >= as a redirect

Running the Pipeline

Option A - With Claude Haiku (recommended for Stage 4)

python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --api-key sk-ant-...

Option B - Fully local (Qwen 7B for all LLM stages + Q&A generation)

python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --model /path/to/qwen2.5-7b-instruct

Option C - Stages 1–6 only (skip Q&A generation)

python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --model /path/to/qwen2.5-7b-instruct \
    --skip-qa-gen

Note: --raw-file should contain a Crop column and a QueryText column (or query_text). The pipeline filters rows where Crop matches --crop exactly.
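
The row-selection rule in the note can be sketched as follows. This is a minimal illustration, not the pipeline's actual loader: the function name `filter_rows_by_crop` is hypothetical, and preferring `QueryText` over `query_text` when both exist is an assumption.

```python
import csv
import io

def filter_rows_by_crop(csv_text, crop):
    """Return the query strings of rows whose Crop value equals `crop` exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    queries = []
    for row in reader:
        # Exact, case-sensitive comparison, as described in the note above.
        if row.get("Crop") != crop:
            continue
        # Accept either column name (assumed precedence: QueryText first).
        text = (row.get("QueryText") or row.get("query_text") or "").strip()
        if text:
            queries.append(text)
    return queries

sample = (
    "Crop,QueryText\n"
    "Maize Makka,Fall armyworm control\n"
    "maize makka,Sowing time\n"
    "Wheat,Rust control\n"
)
print(filter_rows_by_crop(sample, "Maize Makka"))
```

Because the match is exact, the lower-case `maize makka` row is dropped here, so it is worth checking the spelling and casing used in the Crop column before a run.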


Pipeline Stages in Detail

flowchart LR
    A["📄 Raw CSV"] --> B["Stage 1\nPhase 1\nHP Screening"]
    B --> C["Stage 2\nPhase 2\nLLM Eval"]
    C --> D["Stage 3\nCluster Repair\nA→B→C→D→E"]
    D --> E["Stage 4\nUnique-Q\nExtraction"]
    E --> F["Stage 5\nDedup"]
    F --> G["Stage 6\nCorpus Filter"]
    G --> H["Stage 7\nQ&A Gen\nvLLM Batch"]
    H --> I["✅ FAQ Q&A CSV"]

Stage 1 - Phase 1: Hyperparameter Screening

Script: pipeline/hyperparameter_tuning.py

The pipeline runs a grid search over HDBSCAN + UMAP hyperparameter combinations and keeps the configurations whose clusterings pass the geometric checks listed below.

  • Queries are embedded with paraphrase-multilingual-mpnet-base-v2
  • A hybrid distance matrix is computed: α × cosine(dense) + (1 − α) × jaccard(TF-IDF)
  • Each config is scored on geometric metrics:
    • 50 < n_clusters < 1000
    • noise_ratio < 30 %
    • clusters_for_85pct > 5
  • Noise points are reassigned via KNN (nearest non-noise neighbour)
  • Viable configs are saved as candidates for Stage 2
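
The hybrid distance in the list above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the code in pipeline/hyperparameter_tuning.py: both terms are treated as distances (1 minus similarity), the Jaccard term is computed on token sets rather than on real TF-IDF features, and the default `alpha=0.7` is an arbitrary placeholder.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def jaccard_distance(a, b):
    """1 - |intersection| / |union| on two sets of tokens."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def hybrid_distance_matrix(embeddings, token_sets, alpha=0.7):
    """Symmetric matrix of alpha * cosine + (1 - alpha) * jaccard distances."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dense = cosine_distance(embeddings[i], embeddings[j])
            sparse = jaccard_distance(token_sets[i], token_sets[j])
            dist[i][j] = dist[j][i] = alpha * dense + (1 - alpha) * sparse
    return dist

emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tokens = [{"maize", "urea"}, {"maize", "urea"}, {"armyworm"}]
matrix = hybrid_distance_matrix(emb, tokens)
```

A matrix of this shape is what a clusterer configured for precomputed distances would consume; whether the pipeline passes it to HDBSCAN directly or after UMAP is not stated here.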

Controls: --grid-mode, --max-queries

Outputs: phase1_candidates.csv, phase1_results.pkl
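
The three geometric checks from this stage can be expressed as a small predicate. The sketch below is hedged: the function names are hypothetical, noise points are assumed to carry the label -1 (the HDBSCAN convention), and `clusters_for_85pct` is interpreted as the number of largest clusters needed to cover 85 % of the clustered (non-noise) points, which the source does not define explicitly.

```python
from collections import Counter

def clusters_for_coverage(labels, coverage=0.85):
    """Number of largest clusters needed to cover `coverage` of non-noise points."""
    sizes = sorted(Counter(l for l in labels if l != -1).values(), reverse=True)
    total = sum(sizes)
    covered = 0
    for count, size in enumerate(sizes, start=1):
        covered += size
        if covered >= coverage * total:
            return count
    return 0  # no clustered points at all

def is_viable(labels):
    """Apply the Stage 1 thresholds to one clustering result."""
    if not labels:
        return False
    n_clusters = len({l for l in labels if l != -1})
    noise_ratio = sum(1 for l in labels if l == -1) / len(labels)
    return (
        50 < n_clusters < 1000
        and noise_ratio < 0.30
        and clusters_for_coverage(labels) > 5
    )

balanced = [i % 60 for i in range(600)] + [-1] * 100   # 60 clusters, ~14 % noise
too_noisy = [i % 60 for i in range(600)] + [-1] * 400  # 40 % noise
print(is_viable(balanced), is_viable(too_noisy))
```

The coverage check guards against clusterings where a handful of giant clusters absorb nearly every query, which would leave little for the later stages to separate.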


Stage 2 - Phase 2: LLM Evaluation

Script: pipeline/llm_evaluator_hf.py
Model: Qwen2.5-7B-Instruct (local GPU)

The top-K candidates from Phase 1 are evaluated by the LLM on four axes per cluster sample:

| Score | Prompt asks | Good = |
|---|---|---|
| Coherence | Are all questions about the SAME topic? | A (same) |
| Separation | Should these two clusters stay separate? | A (different) |
| Merge | Should these near clusters merge? | A (merge) |
| Outlier | Which question doesn't belong? | -1 (none) |

A composite score is computed and the winning hyperparameter config is selected.

Controls: --phase2-top-k, --coverage-cap

Outputs: phase2_scores.csv


Stage 3 - Cluster Repair (Steps A–E)

Script: pipeline/cluster_repair.py
Model: Qwen2.5-7B-Instruct (local GPU)

The best clustering is re-run and then cleaned up in five steps:

| Step | What it does |
|---|---|
| A - Diverse Reps | Selects k=3 maximally diverse representative questions per cluster using greedy furthest-point selection |
| B - Cross-Crop Filter | LLM removes questions that are about a different crop than --crop |
| C - Coherence + Split | LLM rates each cluster A/B/C; clusters rated C are sent to a detailed split prompt to create sub-clusters |
| D - Merge | Clusters with cosine similarity ≥ --merge-sim are proposed as merge candidates; LLM confirms |
| E - Back-Mapping | Every original raw row is mapped to its final cluster; outputs the working files for Stage 4 |
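
Step A's greedy furthest-point selection can be sketched as below. This is a generic version of the technique, not the code in pipeline/cluster_repair.py: the function name is hypothetical, cosine distance is assumed as the metric, and seeding with the first question in the cluster (index 0) is an assumption.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def diverse_representatives(vectors, k=3):
    """Return indices of up to k points chosen by greedy furthest-point selection."""
    if not vectors:
        return []
    selected = [0]  # assumed seed: the first question in the cluster
    # Distance from every point to its nearest already-selected point.
    min_dist = [cosine_distance(v, vectors[0]) for v in vectors]
    while len(selected) < min(k, len(vectors)):
        candidates = [i for i in range(len(vectors)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [
            min(d, cosine_distance(v, vectors[nxt]))
            for d, v in zip(min_dist, vectors)
        ]
    return selected

cluster = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(diverse_representatives(cluster, k=3))
```

In this toy cluster the near-duplicate of the seed (index 1) is skipped in favour of the two points that are furthest from everything already chosen, which is the property that makes the representatives useful as an LLM sample.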

Controls: --diverse-k, --coherence-flag, --merge-sim

Outputs: repaired_clusters.csv, cluster_questions.csv, raw_row_mapping.csv
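
Step D's candidate proposal can be illustrated with a pairwise similarity scan. The sketch assumes that similarity is measured between cluster centroid vectors, which the source does not state; `merge_candidates` is a hypothetical name, and the LLM confirmation step is deliberately left out.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_candidates(centroids, merge_sim=0.82):
    """Return (id_a, id_b, similarity) for every cluster pair at or above merge_sim."""
    pairs = []
    for (a, va), (b, vb) in combinations(sorted(centroids.items()), 2):
        sim = cosine_similarity(va, vb)
        if sim >= merge_sim:
            pairs.append((a, b, round(sim, 3)))
    return pairs

centroids = {0: [1.0, 0.0], 1: [0.9, 0.3], 2: [0.0, 1.0]}
candidates = merge_candidates(centroids)
print(candidates)
```

Each returned pair would then be shown to the LLM, and only confirmed pairs would actually be merged.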


Stage 4 - Unique Question Extraction

Script: pipeline/unique_question_finder.py
Model: Claude Haiku (Batch API) or Qwen2.5-7B (local)

Within each cluster, not all questions need a different answer. This stage groups questions that would get the same agricultural advice into a single answer-distinct group.

  • The representative question of each group = the one with the highest raw call frequency
  • Cross-cluster deduplication is then run: groups from different clusters that are near-identical (embedding cosine ≥ --dedup-thresh, default 0.92) are merged and their frequencies summed
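
The cross-cluster deduplication rule above can be sketched as a greedy pass over the groups. This is a simplified illustration, not the code in pipeline/unique_question_finder.py: the dictionary keys (`question`, `frequency`, `cluster_id`, `embedding`) are hypothetical, and visiting groups in descending frequency so that the most frequent phrasing survives is an assumption consistent with the representative-question rule.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dedup_groups(groups, dedup_thresh=0.92):
    """Merge near-identical groups from different clusters, summing frequencies."""
    merged = []
    for group in sorted(groups, key=lambda g: g["frequency"], reverse=True):
        for kept in merged:
            same_cluster = kept["cluster_id"] == group["cluster_id"]
            sim = cosine_similarity(kept["embedding"], group["embedding"])
            if not same_cluster and sim >= dedup_thresh:
                kept["frequency"] += group["frequency"]
                break
        else:
            merged.append(dict(group))  # copy so the input is not mutated
    return merged

groups = [
    {"question": "Fall armyworm control in maize", "frequency": 40,
     "cluster_id": 3, "embedding": [1.0, 0.0]},
    {"question": "How to control fall armyworm", "frequency": 25,
     "cluster_id": 7, "embedding": [0.99, 0.05]},
    {"question": "Urea dose for maize", "frequency": 30,
     "cluster_id": 5, "embedding": [0.0, 1.0]},
]
result = dedup_groups(groups)
print([(g["question"], g["frequency"]) for g in result])
```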

LLM provider selection:

  • Pass --api-key → Claude Haiku via Anthropic Batch API (faster, no GPU needed for this stage)
  • Omit --api-key → Qwen 7B via local HF (slower, requires GPU)

Outputs: unique_questions.csv, unique_questions_freq.csv, unique_question_mapping.csv


Stage 5 - Final Deduplication

Script: pipeline/dedup_freq_csv.py

Case-insensitive deduplication of representative_question in the FAQ CSV. When exact-string duplicates exist (after strip + lowercase), the one with the higher raw_frequency is kept. Ranks are removed from the final output.

Outputs: unique_questions_freq.csv (overwritten in-place, cleaned)
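
The deduplication rule can be sketched in a few lines. The column names `representative_question` and `raw_frequency` come from the FAQ schema; the function name and the use of plain dictionaries instead of a CSV reader are illustrative assumptions.

```python
def dedup_by_question(rows):
    """Keep one row per question (strip + lowercase), preferring higher raw_frequency."""
    best = {}
    for row in rows:
        key = row["representative_question"].strip().lower()
        if key not in best or row["raw_frequency"] > best[key]["raw_frequency"]:
            best[key] = row
    # Dict order follows the first appearance of each normalised question.
    return list(best.values())

rows = [
    {"representative_question": "Urea dose for maize", "raw_frequency": 12},
    {"representative_question": " urea dose for maize ", "raw_frequency": 30},
    {"representative_question": "Sowing time", "raw_frequency": 5},
]
deduped = dedup_by_question(rows)
print([r["raw_frequency"] for r in deduped])
```

Note that, unlike the Stage 4 merge, frequencies are not summed here: the lower-frequency duplicate is simply dropped, as the description above states.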


Stage 6 - Irrelevant Corpus Filter

Script: pipeline/filter_faq_corpus.py
Config: config/irrelevant_corpus.yaml

Removes rows whose questions or cluster labels contain keywords from the irrelevant corpus. Filtered categories include:

| Category | Examples removed |
|---|---|
| contact_details | "Contact no of KVK…", "Chief Agricultural Officer phone number" |
| market_price | "Mandi rate of maize", "MSP price details" |
| weather | "Weather forecast for Amritsar", "Rainfall today" |
| subsidy_scheme | "Subsidy on maize seeds", "Government scheme for machinery" |
| incomplete_call | "Wrong number", "Call disconnected" |
| banking_admin | "KCC loan details", "Bank manager contact" |

The filter checks three columns: representative_question, cluster_label, and answer_label. Removed rows are saved to corpus_filtered_out.csv for auditing.
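
A keyword filter over the three columns might look like the sketch below. It assumes plain case-insensitive substring matching, which may differ from how pipeline/filter_faq_corpus.py actually matches; the function name and the extra `matched_category` field are hypothetical, and the corpus is passed as an in-memory dictionary rather than loaded from YAML.

```python
def split_by_corpus(rows, corpus):
    """Split rows into (kept, removed) using category -> keyword lists."""
    columns = ("representative_question", "cluster_label", "answer_label")
    kept, removed = [], []
    for row in rows:
        haystack = " ".join(str(row.get(c, "")) for c in columns).lower()
        hit = next(
            (category for category, keywords in corpus.items()
             if any(kw.lower() in haystack for kw in keywords)),
            None,
        )
        if hit is None:
            kept.append(row)
        else:
            # Record why the row was dropped, to support auditing.
            removed.append({**row, "matched_category": hit})
    return kept, removed

corpus = {
    "market_price": ["mandi rate", "msp price"],
    "weather": ["weather forecast", "rainfall"],
}
rows = [
    {"representative_question": "Mandi rate of maize",
     "cluster_label": "market", "answer_label": "price"},
    {"representative_question": "Fall armyworm control",
     "cluster_label": "pest", "answer_label": "spray"},
]
kept, removed = split_by_corpus(rows, corpus)
print(len(kept), removed[0]["matched_category"])
```

Substring matching is blunt: a short keyword can match inside an unrelated word, so reviewing corpus_filtered_out.csv after adding keywords is a sensible habit.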

Controls: --corpus-file, --skip-corpus-filter


Stage 7 - Q&A Generation (vLLM Batch Inference)

Script: pipeline/vllm_batch_qa_generator.py
Model: Qwen2.5-7B-Instruct via vLLM offline batch inference

Generates professional English Q&A pairs for each unique question in the FAQ.

  • Uses vLLM's offline LLM.generate() for high-throughput batch processing
  • Crop-specific system prompts with expert agricultural hints
  • Automatic irrelevant-crop rejection (synonym-aware)
  • Mandatory English output: translates Hindi/Hinglish input automatically
  • Each answer includes: step-by-step technical guidance, chemical dosages, safety notes, and a KVK referral footer

Output format per row:

| Column | Description |
|---|---|
| Generated_Question | Polished English version of the farmer's question |
| Generated_Category | One of: Disease, Pest, Fertilizer and Nutrient, Variety, Agronomy, Other, IRRELEVANT_CROP |
| Generated_Answer | 200–400 word professional technical answer |

Controls: --skip-qa-gen, --model

Outputs: unique_questions_freq_qa.csv
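
For orientation, offline batch generation with vLLM follows the pattern sketched below. This is not the code in pipeline/vllm_batch_qa_generator.py: the prompt wording, the function names, and the sampling values (`temperature=0.2`, `max_tokens=700`) are placeholders, and the real script may apply the model's chat template rather than a plain-text prompt. The vLLM import is deferred so that the prompt builder can be tried without a GPU.

```python
def build_prompts(questions, crop):
    """Build one plain-text prompt per question (placeholder wording)."""
    system = (
        f"You are an agricultural expert for the crop '{crop}'. "
        "Answer in English with step-by-step guidance."
    )
    return [f"{system}\n\nFarmer question: {q}\nAnswer:" for q in questions]

def generate_answers(prompts, model_path, gpu_util=0.90):
    """Run all prompts through vLLM in one offline batch."""
    # Imported lazily: vLLM is only needed when generation is actually run.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path, gpu_memory_utilization=gpu_util)
    params = SamplingParams(temperature=0.2, max_tokens=700)
    outputs = llm.generate(prompts, params)
    # One RequestOutput per prompt; take the first completion of each.
    return [out.outputs[0].text.strip() for out in outputs]

prompts = build_prompts(
    ["makka me sundi ka ilaj", "Urea dose for maize"], "Maize Makka"
)
print(len(prompts))
```

Submitting every prompt in a single `generate` call lets vLLM schedule and batch the requests itself, which is where the throughput advantage over per-question calls comes from.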

Standalone usage:

python pipeline/vllm_batch_qa_generator.py \
    --input  outputs/repair/maize_makka/unique_questions_freq.csv \
    --crop   "Maize Makka" \
    --model  /path/to/qwen2.5-7b-instruct \
    --gpu-util 0.90

CLI Reference

python run_pipeline.py --help

I/O (Required)

| Argument | Description |
|---|---|
| --raw-file PATH | Path to raw KCC CSV (Crop + QueryText columns required) |
| --crop NAME | Crop name exactly as in the Crop column, e.g. "Maize Makka" |

Model / API

| Argument | Default | Description |
|---|---|---|
| --api-key KEY | - | Anthropic API key; if omitted, local Qwen is used for Stage 4 |
| --model PATH | qwen2.5-7b-instruct | Local HF model path (Stages 2, 3, and 7; also Stage 4 if no API key) |
| --gpu-id N | 0 | CUDA device index |
| --batch-size N | 8 | LLM inference batch size |

Pipeline Control

| Argument | Default | Description |
|---|---|---|
| --output-dir PATH | outputs/repair | Base directory for all outputs |
| --skip-phase1 | - | Skip Stage 1; load existing phase1_results.pkl |
| --skip-phase2 | - | Skip Stage 2; use best config from phase2_scores.csv |
| --skip-repair | - | Skip Stage 3; use existing cluster_questions.csv |
| --skip-unique-q | - | Skip Stage 4; use existing unique_questions_freq.csv |
| --skip-corpus-filter | - | Skip Stage 6 |
| --skip-qa-gen | - | Skip Stage 7 (Q&A generation via vLLM) |
| --corpus-file PATH | config/irrelevant_corpus.yaml | Path to irrelevant keywords YAML |

Tuning Parameters

| Argument | Default | Description |
|---|---|---|
| --grid-mode | medium | HP search grid: quick (18) / medium (108) / full (240) / exhaustive (480) |
| --max-queries | 20000 | Max unique queries to cluster (random sample if exceeded) |
| --phase2-top-k | 5 | Number of Phase 1 candidates to LLM-evaluate in Stage 2 |
| --coverage-cap | 0.80 | Fraction of query volume covered in Phase 2 evaluation |

Repair Parameters

| Argument | Default | Description |
|---|---|---|
| --diverse-k | 3 | Number of diverse reps per cluster in Step A |
| --coherence-flag | C | LLM rating that triggers a split (B = aggressive, C = conservative) |
| --merge-sim | 0.82 | Cosine similarity threshold for merge candidates in Step D |
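
One plausible reading of --coherence-flag is a threshold on the A/B/C rating. The sketch below assumes that A is the best rating and C the worst, and that setting the flag to B splits clusters rated B or C; the source only says that B is the aggressive setting, so treat this as an interpretation rather than the script's logic.

```python
# Assumed ordering: A (coherent) < B (borderline) < C (incoherent).
RATING_ORDER = {"A": 0, "B": 1, "C": 2}

def should_split(rating, coherence_flag="C"):
    """True if a cluster with this LLM rating should go to the split prompt."""
    return RATING_ORDER[rating] >= RATING_ORDER[coherence_flag]

for rating in "ABC":
    print(rating, should_split(rating, "C"), should_split(rating, "B"))
```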

Resuming an Interrupted Run

Each stage writes intermediate files. If a run fails, you can resume from any stage:

# Resume from Stage 2 (Phase 1 already done)
python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --skip-phase1

# Resume from Stage 4 (Stages 1–3 done)
python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --skip-phase1 --skip-phase2 --skip-repair \
    --api-key sk-ant-...

# Run only Stage 7 (Q&A generation on existing FAQ)
python run_pipeline.py \
    --raw-file data/raw/punjab_maize_raw.csv \
    --crop "Maize Makka" \
    --skip-phase1 --skip-phase2 --skip-repair \
    --skip-unique-q --skip-corpus-filter

# Run Stage 7 standalone
python pipeline/vllm_batch_qa_generator.py \
    --input  outputs/repair/maize_makka/unique_questions_freq.csv \
    --crop   "Maize Makka" \
    --model  /path/to/qwen2.5-7b-instruct

# Run only Stage 6 (filter on existing FAQ)
python pipeline/filter_faq_corpus.py \
    --input  outputs/repair/maize_makka/unique_questions_freq.csv \
    --corpus config/irrelevant_corpus.yaml

Output Files

All outputs are written to outputs/repair/<crop_slug>/:

| File | Description |
|---|---|
| phase1_candidates.csv | All valid HP configs from grid search |
| phase1_results.pkl | Serialized clustering results (used for resuming) |
| phase2_scores.csv | LLM coherence/separation/merge scores per config |
| repaired_clusters.csv | Cluster summary after Steps A–E |
| cluster_questions.csv | Every question per repaired cluster |
| raw_row_mapping.csv | Every original raw row → final cluster ID |
| unique_questions.csv | All unique question groups with full metadata |
| ⭐ unique_questions_freq.csv | FAQ questions, ranked by call frequency |
| ⭐ unique_questions_freq_qa.csv | FAQ Q&A pairs: questions + generated answers (Stage 7) |
| unique_question_mapping.csv | Every input question → unique group ID |
| corpus_filtered_out.csv | Rows removed by Stage 6 (for auditing) |

FAQ Output Schema (unique_questions_freq.csv)

| Column | Description |
|---|---|
| unique_q_id | Group identifier, format <cluster_id>_<group_idx> |
| representative_question | The question text shown in the FAQ |
| raw_frequency | Number of times this question (or equivalent) was asked |
| pct_of_total | Percentage of total calls for the crop |
| cluster_id | Source cluster ID |
| cluster_label | Short topic label for the cluster |
| answer_label | Short label for this answer group |
| n_questions_in_group | How many distinct phrasings were grouped together |
| was_cluster_split | Whether the cluster was split during repair |
| parent_cluster | Original cluster before split (if applicable) |
| merged_from | Source group IDs if this group was merged |
| merged_cross_cluster | True if the merge crossed cluster boundaries |

Irrelevant Corpus Filter

The corpus filter config lives in config/irrelevant_corpus.yaml. To add new irrelevant categories, open the YAML and add a new block:

categories:
  your_category:
    description: "What this category covers"
    keywords:
      - keyword one
      - multi word phrase
    typos:
      - comman typo

The filter will pick it up automatically on the next run; no code changes are needed.


Hardware Requirements

| Stage | CPU/GPU | Notes |
|---|---|---|
| Stage 1 - Clustering | GPU recommended | Embedding ~20k queries takes ~2 min on GPU, ~15 min on CPU |
| Stage 2 - LLM Eval | GPU required | Loads Qwen2.5-7B (~16 GB VRAM in bfloat16) |
| Stage 3 - Repair | GPU required | Same Qwen model |
| Stage 4 - Unique-Q | GPU or API | Claude Haiku via --api-key requires no GPU |
| Stage 5 - Dedup | CPU only | Instant |
| Stage 6 - Filter | CPU only | ~1 sec for 1000 rows |
| Stage 7 - Q&A Gen | GPU required | vLLM loads Qwen2.5-7B (~16 GB VRAM); ~50–100 requests/s on RTX 4090+ |
