This repository contains the code for Ranking Improved Self-Consistency (RISC).
The repository is based on the paper “Boosting Self-Consistency with Ranking”, accepted to ACL SRW 2026.
Figure. Comparison of RISC against Self-Consistency, Stable Rank, ReASC, and CISC on three datasets. RISC consistently outperforms the baselines on the QA datasets across all LLM call budgets, while remaining competitive on MATH500.
- `SC_all_datasets_checkpoints_optimised_refine_fast.py`: core feature engineering, embedding cache, and ranking utilities.
- `run_feature_sets_fast.py`: dataset loading and the source of truth for dataset configs.
- `reranker_configs.py`: shared reranker configs:
  - dataset cache paths
  - feature sets
  - default LightGBM parameters
  - LightGBM hyperparameter grid
  - `make_feature_cfg(...)`
- `export_fixed5_feature_tables_one_split.py`: exports one split at a time:
  - `--split train` builds a multi-budget train feature table
  - `--split test` builds a test feature table budget-by-budget
- `run_feature_sets_fast_custom_feature_paths.py`: trains and evaluates a reranker from already prepared feature pickles.
- `ablation_rerankers_fast_no_search_updated_paths_two_ablation.py`: full-model, leave-one-feature-out, and optional leave-two-features-out ablations.
- `data_samples/`: example labeled input CSV files for testing the feature-export pipeline.
The main feature set uses the following five signals.
share_ratio_to_best(a) = c(a) / c(a*)
where:
- `c(a)` is the number of sampled chains that end with answer `a`
- `a*` is the most frequent answer for the current question
This measures how close the candidate answer's support is to the strongest vote winner.
ans_len_min(a) = min_r len(r)
where the minimum is taken over all sampled traces r that produce answer a.
This captures the shortest formulation observed for the candidate answer.
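As a minimal sketch of the two count-based features above (the function name, variable names, and data layout are illustrative, not the repository's actual API), both can be computed from the sampled (answer, trace) pairs of one question:

```python
from collections import Counter

def count_features(samples):
    """Compute share_ratio_to_best and ans_len_min per candidate answer.

    `samples` is a list of (answer, trace_text) pairs sampled for one
    question. Returns {answer: (share_ratio_to_best, ans_len_min)}.
    """
    counts = Counter(answer for answer, _ in samples)
    best_count = max(counts.values())  # c(a*): support of the vote winner
    features = {}
    for answer in counts:
        # Shortest trace (here measured in characters) ending in this answer
        min_len = min(len(trace) for a, trace in samples if a == answer)
        features[answer] = (counts[answer] / best_count, min_len)
    return features
```

By construction, the vote winner always gets `share_ratio_to_best = 1.0`, so the feature encodes how far each candidate trails the majority answer.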
ans_dist2_to_id_centroid(a) = ||e_a - mu_id||^2
where:
- `e_a` is the embedding of candidate answer `a`
- `mu_id` is the centroid of all sampled answer embeddings for the same question
Lower values mean the candidate answer lies closer to the overall semantic center of the sampled responses.
step_to_chain_centroid_min = min_t cos(s_t, mu_chain)
where:
- `s_t` is the embedding of reasoning step `t`
- `mu_chain` is the centroid of all step embeddings in the chain
This feature measures the weakest local coherence of any reasoning step with respect to the overall chain semantics.
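A minimal sketch of this minimum-cosine feature (names and data layout are illustrative):

```python
import math

def min_step_cosine_to_chain_centroid(step_embs):
    """min_t cos(s_t, mu_chain): cosine similarity of each step embedding
    to the centroid of all step embeddings, minimised over the steps."""
    n, dim = len(step_embs), len(step_embs[0])
    centroid = [sum(s[d] for s in step_embs) / n for d in range(dim)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    return min(cosine(s, centroid) for s in step_embs)
```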
shared_checkpoint_count = sum_i 1[p_i is shared]
where:
- `p_i` is a prefix checkpoint
- `1[...]` is an indicator that equals 1 when the checkpoint is supported by semantically aligned prefixes from other traces
A checkpoint is counted as shared when it matches prefixes from other traces with:
- cosine similarity above a threshold
- depth difference within a tolerance
- enough relative support across distinct traces
This counts how many semantically aligned intermediate reasoning checkpoints are shared across independent sampled traces.
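The three matching conditions above can be sketched as follows. This is a simplified stand-in under stated assumptions: the function name, default thresholds, and the (depth, embedding) data layout are illustrative, and the repository's actual matching logic may differ in detail:

```python
import math

def shared_checkpoint_count(chain_ckpts, other_traces,
                            sim_thr=0.75, depth_tol=1, rel_support_thr=0.5):
    """Count checkpoints of one chain that are 'shared': matched by a
    semantically similar prefix, at a nearby depth, in a large-enough
    fraction of the other sampled traces."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    count = 0
    for depth, emb in chain_ckpts:
        # A trace supports the checkpoint if one of its prefix checkpoints
        # is close in embedding space and at a similar depth.
        supporting = sum(
            any(cosine(emb, e2) >= sim_thr and abs(depth - d2) <= depth_tol
                for d2, e2 in trace)
            for trace in other_traces
        )
        if other_traces and supporting / len(other_traces) >= rel_support_thr:
            count += 1
    return count
```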
Feature export creates .pkl files with features and a metadata JSON.
Embedding caches are loaded from:
- `--cache-path`, if provided
- otherwise, the dataset default cache path in `run_feature_sets_fast.py`
If the cache file does not exist, the script starts from an empty cache and can save it at the end.
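The load-or-start-empty pattern looks roughly like this. The repository stores caches with joblib; the sketch below uses the stdlib `pickle` only to stay dependency-free, and the helper names are illustrative:

```python
import os
import pickle

def load_cache(path):
    """Load an embedding cache if the file exists, else start empty."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}  # empty cache: embeddings are computed and added on demand

def save_cache(cache, path):
    """Persist the (possibly newly populated) cache at the end of a run."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(cache, f)
```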
For raw labeled CSV input, the feature-export pipeline expects the following columns:
- `id`
- `question`
- `final_answer_new`
- `rationale`
- `row_hit`
Notes:
- `id` is mandatory and is used as the question or group identifier.
- `final_answer_new` is the answer column expected by the current export pipeline.
- `row_hit` is the supervision label used downstream for ranking and evaluation.
- `rationale` is the reasoning portion of the trace, without the final answer.
Example input data is provided in the `data_samples` folder.
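A quick sanity check for a labeled input CSV can be written with the stdlib `csv` module (the helper name is illustrative; the export pipeline does its own loading):

```python
import csv

REQUIRED_COLUMNS = {"id", "question", "final_answer_new", "rationale", "row_hit"}

def validate_labeled_csv(path):
    """Check that a labeled CSV has the expected columns and return the
    rows grouped by question id."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        groups = {}
        for row in reader:
            groups.setdefault(row["id"], []).append(row)
    return groups
```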
```bash
python export_fixed5_feature_tables_one_split.py \
  --dataset hotpotqa \
  --split train \
  --train-csv-path "/path/to/hotpotqa_train.csv" \
  --train-budgets 2 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 \
  --output-path reranker_feature_tables/hotpotqa/train_features.pkl \
  --cache-path caches_olmo/hotpotqa_embedding_caches.joblib \
  --prefix-sim-threshold 0.75 \
  --prefix-rel-support-thr 0.5 \
  --metadata-output-path reranker_feature_tables/hotpotqa/train_metadata.json
```

With explicit test budgets:
```bash
python export_fixed5_feature_tables_one_split.py \
  --dataset hotpotqa \
  --split test \
  --test-csv-path "/path/to/hotpotqa_test.csv" \
  --test-budgets 1 2 3 4 5 10 20 40 60 80 100 \
  --output-path reranker_feature_tables/hotpotqa/test_features.pkl \
  --cache-path caches_olmo/hotpotqa_embedding_caches.joblib \
  --prefix-sim-threshold 0.75 \
  --prefix-rel-support-thr 0.5 \
  --metadata-output-path reranker_feature_tables/hotpotqa/test_metadata.json
```

This step uses already prepared feature pickles.
Inputs:
- `--train-features-path`
- `--test-features-path`
- `--prepared-metadata-path` (should point to the train metadata JSON)
What happens inside:
- loads prepared train/test features
- optionally runs LightGBM hyperparameter search on a train/validation split of the prepared train table
- refits on the full prepared train table
- scores the test table budget-by-budget
- saves reranker, scores, best params, and metadata
By default, hyperparameter search is enabled and uses:
`DEFAULT_LGB_PARAMS` and `PARAM_GRID` from `reranker_configs.py`.
To skip the search and use the defaults directly, add `--no-hparam-search`.
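The search-then-refit loop described above can be sketched generically. The actual pipeline trains LightGBM models with `DEFAULT_LGB_PARAMS` and `PARAM_GRID`; here `fit_fn` and `score_fn` are illustrative stand-ins for that training and evaluation code:

```python
from itertools import product

def grid_search(param_grid, fit_fn, score_fn, train, valid):
    """Exhaustively try every parameter combination: fit on `train`,
    score on `valid`, keep the best setting, then refit on the full
    prepared train table (train + valid) with the winning params."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        model = fit_fn(params, train)
        score = score_fn(model, valid)
        if score > best_score:
            best_score, best_params = score, params
    final_model = fit_fn(best_params, train + valid)  # refit on everything
    return final_model, best_params, best_score
```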
Example:
```bash
python run_feature_sets_fast_custom_feature_paths.py \
  --dataset popqa \
  --feature-set set7_hyp_search_adaptive \
  --train-features-path reranker_feature_tables/popqa/train_features.pkl \
  --test-features-path reranker_feature_tables/popqa/test_features.pkl \
  --prepared-metadata-path reranker_feature_tables/popqa/train_metadata.json \
  --device cuda \
  --batch-size 1024 \
  --budgets 2 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 \
  --test-budgets 1 2 3 4 5 10 20 40 60 80 100 \
  --reranker-dir rerankers \
  --scores-dir reranker_scores \
  --metadata-dir reranker_metadata \
  --prefix-sim-threshold 0.75 \
  --prefix-rel-support-thr 0.5
```

The ablation script supports:
- full model
- leave-one-feature-out variants
- optional leave-two-features-out variants via `--include-two-feature-ablation`
It can:
- load explicit feature-table paths
- reuse signature-matched cached feature tables
- or recompute them if allowed
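Enumerating the ablation variants reduces to subset generation. A minimal sketch (the function name is illustrative, not the script's API):

```python
from itertools import combinations

def ablation_feature_sets(features, include_two_out=False):
    """List the ablation variants: the full feature set, every
    leave-one-feature-out subset, and optionally every
    leave-two-features-out subset."""
    variants = [("full", list(features))]
    for f in features:
        variants.append((f"minus_{f}", [x for x in features if x != f]))
    if include_two_out:
        for f, g in combinations(features, 2):
            variants.append((f"minus_{f}_{g}",
                             [x for x in features if x not in (f, g)]))
    return variants
```

For the five-feature set this yields 6 variants, or 16 with the two-feature ablation enabled (1 full + 5 leave-one-out + 10 leave-two-out).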
Example:
```bash
python ablation_rerankers_fast_no_search_updated_paths_two_ablation.py \
  --dataset math500 \
  --feature-set set7_hyp_search_adaptive \
  --train-features-path reranker_feature_tables/math500/train_features.pkl \
  --test-features-path reranker_feature_tables/math500/test_features.pkl \
  --prepared-metadata-path reranker_feature_tables/math500/train_metadata.json \
  --device cuda \
  --batch-size 1024 \
  --budgets 2 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 \
  --test-budgets 1 2 3 4 5 10 20 40 60 80 100 \
  --lgbm-params-path reranker_metadata/math500_set7_hyp_search_adaptive_best_params.json \
  --reranker-dir reranker_ablations/rerankers \
  --scores-dir reranker_ablations/reranker_scores \
  --metadata-dir reranker_ablations/reranker_metadata \
  --include-two-feature-ablation \
  --skip-feature-recalc
```

Notes:

- Train filtering in feature export is enabled through `filter_questions_for_reranker(...)`.
- `--test-budgets` is supported in export, training, and ablation scripts. If provided, it overrides range-based budget generation.
- For downstream training scripts, the metadata path should point to the train metadata JSON, not the test metadata JSON.
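The `--test-budgets` override behaviour described above can be sketched as (function name, signature, and the fallback schedule are illustrative):

```python
def resolve_budgets(explicit_budgets, max_budget, step):
    """If explicit budgets are given (e.g. via --test-budgets), use them;
    otherwise fall back to a range-based schedule."""
    if explicit_budgets:
        return sorted(set(explicit_budgets))
    return list(range(step, max_budget + 1, step))
```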
