Final Edit #1
Draft
vardhjain wants to merge 23 commits into
- Both GraphRAG.ipynb and Plain_RAG.ipynb are now fully self-contained (no shared_utils import dependency) with all shared constants, prompts, FuzzyEvaluator, Evaluator, and call_ollama defined inline identically
- GraphRAG: credentials via Colab Secrets (ARANGO_PASS) with env var fallback
- Plain_RAG: GPU-aware FAISS with faiss-gpu-cu12 swap instruction
- Add Comparison.ipynb for side-by-side accuracy/F1/latency/confusion charts
- Add shared_utils.py for local script runs
- Add run_graphrag.py, run_plainrag.py, run_comparison.py standalone scripts
The Ollama install script requires zstd for extraction (new in recent versions). Added `apt-get install -y zstd -q` before the curl install. Also expanded ensure_ollama() to search multiple candidate paths and fall back to a filesystem find rather than hard-coding /usr/local/bin. Both Plain_RAG and GraphRAG notebooks updated identically.
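
The lookup order described above (candidate paths first, filesystem search as a last resort) could look roughly like the sketch below. It is not the notebook's actual `ensure_ollama()`: the function name `find_binary`, the candidate list, and the search roots are all assumptions made for illustration.

```python
import os
import shutil
from typing import Optional, Sequence

# Hypothetical candidate locations; the PR does not list the paths it checks.
CANDIDATE_PATHS = ("/usr/local/bin/ollama", "/usr/bin/ollama", "/opt/ollama/bin/ollama")


def find_binary(
    name: str = "ollama",
    candidates: Sequence[str] = CANDIDATE_PATHS,
    search_roots: Sequence[str] = ("/usr", "/opt"),
) -> Optional[str]:
    """Return a path to an executable called `name`, or None if none is found."""
    # 1. Anything already on PATH wins.
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # 2. Known install locations.
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # 3. Last resort: walk the filesystem under the given roots.
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                path = os.path.join(dirpath, name)
                if os.access(path, os.X_OK):
                    return path
    return None
```

Keeping the walk as the final step means the slow search only runs when the install landed somewhere unexpected.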
Implements the agreed revamp: an importable src/kgqa package, a 4-arm ablation (plain -> plain_rr -> graph -> graph_concepts), and the fairness fixes from the audit.

Science fixes:
- Single shared ChunkStore (identical corpus + per-section chunking across arms)
- Reranker promoted to its own arm so the graph never gets it as a hidden edge
- Label leakage removed: ingestion stores no question-derived title and no final_decision; graph context uses generic "=== STUDY n ===" labels
- MeSH Concepts/MENTIONS now used via a concept-hop arm (graph_concepts)
- Seeded random sampling (n=200) + paired McNemar significance test
- Fixed the NameError in the graph-expansion fallback

Repo hygiene:
- src/ package, scripts/ (ingest, run_benchmark, compare), thin Colab notebooks
- tests/ (17 pytest cases, CPU-only via fakes), ruff config, GitHub Actions CI
- README, requirements, .gitignore, LICENSE (MIT), .env.example
- Docs (PDF/PPTX) moved to docs/; .DS_Store untracked; superseded files removed
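
Seeded sampling is what makes the paired test valid: every arm has to answer the same questions. A minimal sketch of that idea follows, assuming question IDs are sortable; the function name `sample_question_ids` is hypothetical, and the defaults mirror the n=200 and seed 42 quoted in this PR.

```python
import random
from typing import Iterable, List


def sample_question_ids(all_ids: Iterable[int], n: int = 200, seed: int = 42) -> List[int]:
    """Draw the same n question IDs on every run, regardless of input order."""
    # Sorting first removes any dependence on how the dataset happened to be iterated.
    pool = sorted(all_ids)
    # A private Random instance avoids disturbing (or depending on) global RNG state.
    rng = random.Random(seed)
    return rng.sample(pool, n)
```

Drawing the IDs once and handing the same list to each arm keeps the per-question outcomes aligned for the significance test.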
- concept-hop AQL now ranks neighbours by shared-concept count first and reconstructs abstracts only for the top-N (was building an abstract for every candidate on every query; 200x on a full run)
- remove faiss-cpu: PlainRAG now uses the shared numpy-cosine ChunkStore
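
The rank-first, reconstruct-later ordering can be illustrated outside the database. The sketch below is plain Python rather than the AQL the PR actually changed, and every name in it (`top_concept_neighbours`, `build_context`, the `mentions` mapping, `load_abstract`) is an assumption for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple


def top_concept_neighbours(
    seed_id: str, mentions: Dict[str, Set[str]], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank other studies by how many concepts they share with the seed study."""
    seed_concepts = mentions.get(seed_id, set())
    shared: Counter = Counter()
    for study_id, concepts in mentions.items():
        if study_id == seed_id:
            continue
        overlap = len(seed_concepts & concepts)
        if overlap:
            shared[study_id] = overlap
    # Highest overlap first; ties broken by ID so the order is deterministic.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


def build_context(
    seed_id: str,
    mentions: Dict[str, Set[str]],
    load_abstract: Callable[[str], str],
    top_n: int = 3,
) -> List[str]:
    """Reconstruct abstracts only for the top-N neighbours, not for every candidate."""
    return [load_abstract(sid) for sid, _ in top_concept_neighbours(seed_id, mentions, top_n)]
```

The expensive step (`load_abstract` here) runs at most `top_n` times per query, which is the saving the commit describes.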
- default ARANGO_HOST -> new Oasis deployment (581c546a8d66), overridable via env
- notebooks request A100 GPU + High-RAM; add nvidia-smi check and a labeled-only ingest smoke test before the full run
A single Ollama 500/timeout previously aborted an entire arm, and since the server is shared across arms one crash cascaded into all four failing.

- run_benchmark: per-question retry (3x) with automatic Ollama restart between attempts; checkpoint results every 25 questions; the script now owns Ollama health (health-check + (re)start + warm) instead of relying on a one-shot start
- llm: cap generation via num_predict and keep the model resident (keep_alive), so a runaway reasoning chain can't stall/crash the server
- config: LLM_NUM_CTX / LLM_NUM_PREDICT / LLM_KEEP_ALIVE / LLM_TIMEOUT now env-tunable (shrink on small-VRAM GPUs); defaults 4096 / 1024 / 30m / 180s
- notebook: drop --no-ollama-start so the runner can self-heal; add a GPU-memory tuning hint
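
The retry-and-checkpoint loop described above might be structured like this. It is a sketch under assumptions, not the code in run_benchmark: `run_arm`, `answer_fn`, `restart_fn`, and the JSON checkpoint format are hypothetical, while the 3 attempts and the 25-question interval come from the commit message.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional


def run_arm(
    questions: List[str],
    answer_fn: Callable[[str], str],
    restart_fn: Callable[[], None],
    checkpoint_path: str,
    max_attempts: int = 3,
    checkpoint_every: int = 25,
    backoff_s: float = 0.0,
) -> List[Dict[str, Optional[str]]]:
    """Answer every question, retrying failures and saving partial results."""
    results: List[Dict[str, Optional[str]]] = []
    for index, question in enumerate(questions, start=1):
        record: Dict[str, Optional[str]] = {"question": question, "answer": None, "error": None}
        for attempt in range(1, max_attempts + 1):
            try:
                record["answer"] = answer_fn(question)
                record["error"] = None
                break
            except Exception as exc:  # a real runner would catch narrower error types
                record["error"] = repr(exc)
                if attempt < max_attempts:
                    restart_fn()  # e.g. restart and warm the LLM server
                    time.sleep(backoff_s)
        # A question that exhausts its attempts is recorded, not fatal to the arm.
        results.append(record)
        if index % checkpoint_every == 0 or index == len(questions):
            Path(checkpoint_path).write_text(json.dumps(results))
    return results
```

Recording a failed question instead of raising is what stops one server error from taking down the whole arm.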
- clone cell now %cd /content + rm -rf before clone, so re-running can't nest a second checkout (caused a doubled results path)
- benchmark secrets cell sets LLM_NUM_CTX=8192 / LLM_NUM_PREDICT=1024 for the 80GB A100: full graph context, bounded generation (~halves runtime; identical across arms so the comparison is unaffected)
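
For orientation, here is one way the env-tunable limits could feed an Ollama generate request. The helper name `ollama_payload` is hypothetical and this is not the repo's llm module. The `options.num_ctx`, `options.num_predict`, and `keep_alive` fields are part of Ollama's generate API, and the fallback values are the defaults quoted in this PR.

```python
import os
from typing import Any, Dict, Mapping, Optional


def ollama_payload(
    prompt: str,
    model: str = "deepseek-r1:8b",
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build a JSON body for Ollama's /api/generate from LLM_* environment variables."""
    env = os.environ if env is None else env
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Keep the model loaded between questions instead of reloading it each time.
        "keep_alive": env.get("LLM_KEEP_ALIVE", "30m"),
        "options": {
            # Context window size in tokens.
            "num_ctx": int(env.get("LLM_NUM_CTX", "4096")),
            # Hard cap on generated tokens, so a long reasoning chain cannot run away.
            "num_predict": int(env.get("LLM_NUM_PREDICT", "1024")),
        },
    }
```

Because each arm would build its request the same way, changing `LLM_NUM_CTX` shifts all arms together rather than favouring one.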
n=200, seed 42, deepseek-r1:8b on A100. Parent-document expansion is the decisive win (plain_rr -> graph +22.5pp, McNemar p<0.0001); reranker +7pp (n.s.); concept-hop -2pp (n.s.) at 5x latency. plain/plain_rr fall below the majority baseline — context sufficiency, which the graph supplies, dominates.
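
The p-values above come from a paired McNemar test. The PR does not say which variant it uses, so the sketch below shows the exact (binomial) form as one possibility, assuming each arm yields a per-question list of correct/incorrect flags in the same question order. The function name `mcnemar_exact` is hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two arms scored on the same questions."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both arms must be scored on the same questions")
    # Only discordant pairs carry information about which arm is better.
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Under the null, discordant pairs split 50/50: sum the Binomial(n, 0.5) tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Questions both arms get right, or both get wrong, drop out of the statistic. A large accuracy gap can be significant while a smaller one on the same 200 questions is not.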
- results/ablation.png + results/summary.md generated from the n=200 run
- revert notebooks/02_benchmark.ipynb to the clean (output-free) wrapper that Colab's "Created using Colab" save had filled with execution outputs
Bring the repo up to industry/community standards for a public release:

- Makefile (install/test/lint/ingest/benchmark/compare/clean) + `make help`
- CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, CHANGELOG, CITATION.cff
- .github/ ISSUE_TEMPLATE (bug, feature, config) + PULL_REQUEST_TEMPLATE
- assets/architecture.svg + README badges and embedded diagram
- .pre-commit-config.yaml (ruff + hygiene hooks), .editorconfig
- .gitattributes to enforce LF (Makefile-safe) and tidy linguist stats
- richer pyproject metadata (authors, urls, classifiers, keywords, dev extras)

Deliberately no configs/ dir: configuration lives in src/kgqa/config.py (typed + env-overridable), which is documented in CONTRIBUTING.
So anyone can run it out of the box:

- default ARANGO_HOST is now http://localhost:8529 (no specific deployment baked in)
- add docker-compose.yml for a one-command local ArangoDB
- .env.example and README document both paths (local Docker / cloud Oasis)
- notebooks read ARANGO_HOST + ARANGO_PASS from Colab Secrets (nothing hardcoded) and clone the main branch
- README setup/run instructions generalized
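
The "local default, override via environment" pattern above can be sketched in a few lines. This is an illustration rather than the contents of src/kgqa/config.py: `load_arango_settings` and the returned keys are hypothetical names, and only the two variable names and the localhost default are taken from this PR.

```python
import os
from typing import Dict, Mapping, Optional


def load_arango_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Resolve ArangoDB connection settings, falling back to a local instance."""
    env = os.environ if env is None else env
    return {
        # A local Docker instance is the default; a cloud deployment overrides it.
        "host": env.get("ARANGO_HOST", "http://localhost:8529"),
        # No default password is baked in; an empty string means "not provided".
        "password": env.get("ARANGO_PASS", ""),
    }
```

Passing a mapping explicitly, as in `load_arango_settings({"ARANGO_HOST": "https://example.invalid:8529"})`, makes the override easy to exercise in a unit test without touching the real environment.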
- app/chat_app.py: live GraphRAG chat over the winning `graph` arm; answers cite source PubMed IDs (--share for a public link, --concepts for the concept arm)
- app/dashboard.py: Streamlit dashboard of the ablation (bars, McNemar, per-class confusion matrices when raw results are present); reads results/ only, so it deploys to Streamlit Cloud
- BaseRetriever.chat(): conversational answer + retrieved source pubids (tested)
- compare.py now also emits results/summary.json (structured metrics) for the dashboard
- requirements-app.txt, make chat/dashboard/install-app, app/ added to ruff + CI
- README "Interactive demo & dashboard" section + app/README deployment notes
The dashboard is the hosted public demo (no LLM/DB/GPU; reads results/ only). The chat stays local/Colab (it needs a GPU + a persistent ArangoDB).

- app/requirements.txt: light deploy deps (streamlit/pandas/scikit-learn). Streamlit Cloud reads the entrypoint's dir first, so this is used and the heavy root requirements.txt is ignored for the deploy; benchmark users keep their full deps.
- .streamlit/config.toml: clean light theme matching the ablation figure.
- dashboard: richer set_page_config (icon, About/menu links, methodology expander); per-class section degrades cleanly when raw results are absent.
- README: "Open in Streamlit" badge + "Live demo" section; app/README has the exact Streamlit Cloud deploy steps.
Match the presentation style of the EthicLens repo — centered title with emoji + tagline, a badge row, a quick-nav line (live demo / results / why / setup), and emoji on section headers. Nav-target headings kept plain so anchors stay stable.
- new unit tests for data sampling, the Ollama client, evaluation report/save, and ChunkStore.from_dataset (mocking the datasets/requests boundaries); 18 -> 24 tests
- CI runs pytest --cov and uploads to Codecov; README gets a coverage badge
- pytest-cov added to dev deps; coverage config in pyproject
- docs/index.md + docs/_config.yml (Jekyll Cayman theme): a landing page with the architecture diagram, results table, ablation figure, honest findings, and links to the demo / report / slides
- README: docs badge + a Documentation section

Enable via Settings -> Pages -> Deploy from branch -> main -> /docs.
- assets/dashboard.png: real screenshot of the running dashboard, embedded (clickable) in the README Live demo section
- dashboard: group the accuracy/macro-F1 bars (were misleadingly stacked) and replace the redundant figure with a horizontal latency-by-arm chart
- bump deploy pin to streamlit>=1.39 (grouped/horizontal bar options)
- assets/chat.png: the Gradio chat interface (title, description, example questions), embedded in the README chat section
- chat_app.py: drop the 'theme' kwarg: gr.ChatInterface no longer accepts it on Gradio 5/6 (requirements pin gradio>=4), so a fresh install would have crashed
The app deployed to Streamlit's auto-generated subdomain rather than the custom kgqa-ablation; repoint the README badge/nav/screenshot link, the docs site links, and the About website to the live URL so nothing 404s.
Codecov rejected tokenless uploads ('Token required - not valid tokenless
upload'), so the badge stayed unknown. Use the repo upload token via the
CODECOV_TOKEN secret.
No description provided.