Local-first academic research assistant for discovering, ranking, retrieving, summarizing, comparing, and exporting academic papers from API-friendly sources.
This app helps you research a topic without relying on Google Scholar scraping.
You give it:
- a topic
- optional subtopics
- include keywords
- exclude keywords
- a date range
- a maximum number of papers
- a ranking preference
It then:
- queries trusted sources
- normalizes and deduplicates results
- ranks the papers
- saves metadata locally
- downloads accessible PDFs when available
- extracts text from downloaded PDFs
- generates grounded per-paper summaries
- lets you compare papers and produce a synthesis report
- exports JSON and Markdown outputs
The app is local-first, inspectable, and reproducible. Both the CLI and the Streamlit GUI call the same backend pipeline.
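The inputs listed above map naturally onto a single request object. A minimal sketch of what such a model might look like (field names here are illustrative; the app's actual data models live in `app/models.py`):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchRequest:
    """Illustrative shape of a discovery request (not the app's actual class)."""
    topic: str
    subtopics: list[str] = field(default_factory=list)
    include_keywords: list[str] = field(default_factory=list)
    exclude_keywords: list[str] = field(default_factory=list)
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    max_papers: int = 10
    ranking: str = "relevance"  # "relevance" | "recency" | "citations"


request = SearchRequest(
    topic="large language models for robot planning",
    exclude_keywords=["survey"],
    year_from=2020,
    max_papers=10,
)
```

Both the CLI and the GUI would build one of these and hand it to the shared pipeline, which is what keeps their behavior identical.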
- Searches arXiv
- Searches Semantic Scholar
- Searches Crossref
- Preserves source provenance
- Merges duplicate papers across sources
- Supports ranking by relevance, recency, or citations
- Saves metadata locally
- Downloads PDFs when technically accessible
- Extracts text from downloaded PDFs
- Keeps metadata even when a PDF is unavailable
- Logs failures and continues
For each paper, the app can produce:
- title
- authors
- year
- venue
- DOI / arXiv ID / source links
- abstract
- summary
- main problem
- proposed method
- key findings
- limitations
- takeaways
- confidence note
Confidence is labeled as:
- `metadata only`
- `abstract-based`
- `fuller text-based`
The app can generate a report with:
- executive summary
- selected papers
- recurring themes
- major methods used
- strongest findings
- common limitations
- disagreements or gaps
- recommended reading order
- CLI for scripted or terminal-based use
- Streamlit GUI for interactive exploration
The app is designed around stable, programmatic sources:
- arXiv
- Semantic Scholar
- Crossref
It does not use Google Scholar scraping as the main mechanism.
- Build a `SearchRequest` from CLI or GUI input.
- Optionally load a recent cached discovery response from `data/cache/`.
- Query enabled sources.
- Normalize all returned papers into one common schema.
- Deduplicate by DOI, arXiv ID, or normalized title/year.
- Filter by date and exclude terms.
- Rank the papers using weighted signals.
- Optionally download PDFs for top-ranked results.
- Optionally extract text from downloaded PDFs.
- Generate per-paper summaries from abstract and, when available, extracted text.
- Persist results under `data/processed/` and `outputs/`.
- Display the results in the CLI or GUI.
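The deduplication step (by DOI, arXiv ID, or normalized title/year) can be sketched roughly like this; the helper names are hypothetical, not the app's actual code:

```python
import re


def dedup_key(paper: dict) -> str:
    """Return a stable identity key: DOI first, then arXiv ID,
    then a normalized title plus year as a last resort."""
    if paper.get("doi"):
        return "doi:" + paper["doi"].lower()
    if paper.get("arxiv_id"):
        return "arxiv:" + paper["arxiv_id"].lower()
    # Normalize the title: lowercase, collapse non-alphanumerics to spaces.
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return f"title:{title}:{paper.get('year')}"


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep the first paper seen for each identity key."""
    seen: dict[str, dict] = {}
    for p in papers:
        seen.setdefault(dedup_key(p), p)
    return list(seen.values())
```

A real implementation would also merge provenance fields from the duplicates it drops, so no source attribution is lost.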
Ranking uses a weighted heuristic rather than a learned model.
Current signals include:
- relevance to the query text
- title match quality
- recency
- citation count
- text quality
- source quality
User preference can bias the final score toward `relevance`, `recency`, or `citations`.
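Conceptually, the heuristic is a weighted sum of normalized signals, with the preferred signal's weight boosted. A sketch under assumed signal names and an illustrative bias factor (the real weights live in `config.yaml`):

```python
def score(signals: dict[str, float],
          weights: dict[str, float],
          preference: str) -> float:
    """Weighted average of normalized (0..1) signals.
    The user-preferred signal gets its weight doubled (illustrative factor)."""
    adjusted = dict(weights)
    if preference in adjusted:
        adjusted[preference] *= 2.0
    total_weight = sum(adjusted.values())
    return sum(adjusted[k] * signals.get(k, 0.0) for k in adjusted) / total_weight


weights = {"relevance": 0.4, "recency": 0.2, "citations": 0.2, "source_quality": 0.2}
s = score(
    {"relevance": 0.9, "recency": 0.5, "citations": 0.1, "source_quality": 1.0},
    weights,
    preference="recency",
)
```

Because the result is renormalized by the total weight, the score stays in the 0..1 range regardless of which preference is boosted.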
The summaries are grounded only in retrieved data.
The current behavior is:
- if full text is unavailable, use abstract text
- if full text is available, use it as stronger evidence
- prefer abstract text for headline summary when it is cleaner than raw extracted PDF text
- never invent findings beyond retrieved metadata, abstract, or extracted text
This is still heuristic summarization, not a full scientific reading agent.
```
app/
  config.py              Settings and path setup
  logging_utils.py       Logging configuration
  main.py                Typer CLI
  models.py              Shared data models
  pipeline.py            Shared discovery/retrieval/summarization pipeline
  ranking.py             Ranking signals and sorting
  storage.py             Persistence, exports, cache, cleanup
  summarization.py       Per-paper summary extraction
  reporting.py           Cross-paper synthesis report generation
  gui/streamlit_app.py   Streamlit GUI
  retrieval/
    pdf_fetcher.py       PDF download
    text_extractor.py    PDF text extraction
  sources/
    arxiv_client.py
    semantic_scholar_client.py
    crossref_client.py
config.yaml              Paths, ranking weights, retrieval settings
data/
  raw/                   Downloaded PDFs
  processed/             Discovery JSON, metadata, analysis, extracted text
  cache/                 Cached discovery responses
outputs/                 Exported result JSON and synthesis reports
tests/                   Smoke tests
```
The GUI does not implement separate logic. It calls the same backend used by the CLI.
That keeps behavior consistent across:
- discovery
- ranking
- retrieval
- summarization
- synthesis
- storage
`data/raw/`: Downloaded PDF files.

`data/processed/`: Per-query and per-paper processed outputs, including:

- `discovery.json`
- `metadata.json`
- `analysis.json`
- `extracted_text.txt`

`data/cache/`: Cached discovery responses. These are reused for 24 hours by default.

`outputs/`: Exportable end-user artifacts, such as:

- search results JSON
- synthesis report Markdown
- synthesis report JSON
Generated files are ignored via .gitignore, but the directories themselves are tracked.
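One common way to ignore generated files while keeping the directories themselves tracked is ignore-everything rules plus `.gitkeep` placeholders. A sketch of what such rules might look like (illustrative, not the repo's actual `.gitignore`):

```gitignore
# Ignore generated contents, keep the directories via .gitkeep
data/raw/*
data/processed/*
data/cache/*
outputs/*
!data/raw/.gitkeep
!data/processed/.gitkeep
!data/cache/.gitkeep
!outputs/.gitkeep
```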
Recommended:
```
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

For tests:

```
pip install -e ".[dev]"
```

Optional:

- set `SEMANTIC_SCHOLAR_API_KEY` if you have one
Basic discovery:

```
research-app discover "large language models for robot planning" --max-papers 10
```

JSON output to stdout:

```
research-app discover "large language models for robot planning" --output-json
```

Fresh run without cache:

```
research-app discover "large language models for robot planning" --no-use-cache
```

Disable PDF download and text extraction:

```
research-app discover "large language models for robot planning" --no-fetch-pdfs --no-extract-text
```

Generate a synthesis report from saved results:

```
research-app synthesize outputs/large-language-models-for-robot-planning_results.json
```

Clear generated local data:

```
research-app clear-data
```

Flags:

- `--fetch-pdfs` / `--no-fetch-pdfs` controls PDF retrieval
- `--extract-text` / `--no-extract-text` controls text extraction
- `--use-cache` / `--no-use-cache` controls discovery cache usage
Run:
```
streamlit run app/gui/streamlit_app.py
```

- Use the sidebar to enter topic, filters, sources, ranking, and retrieval options.
- Click `Search`.
- Browse ranked results in the Results tab.
- Open a paper in the Paper Detail tab.
- Select multiple papers in Compare & Report.
- Generate a synthesis report.
- Export Markdown, report JSON, or selected paper metadata.
- Use the Status & Logs tab to inspect warnings, source behavior, retrieval events, and cache activity.
- sidebar search controls
- pagination in results
- quick open/select actions
- detail view for one paper
- paper comparison table
- report viewer
- export buttons
- local data clear control
- status/log visibility
After a search, you will typically see:
- `data/processed/<query>/discovery.json`
- `data/processed/<query>/<paper>/metadata.json`
- `data/processed/<query>/<paper>/analysis.json`
- `data/raw/<paper>/paper.pdf` if retrieved
- `data/processed/<paper>/extracted_text.txt` if extracted
- `outputs/<query>_results.json`
After synthesis:
- `outputs/synthesis_report.md`
- `outputs/synthesis_report.json`
Main settings live in config.yaml.
You can change:
- data paths
- cache TTL
- request timeout
- ranking weights
- retrieval settings
- source defaults
Current notable settings:
- cache TTL: 24 hours
- max PDF downloads per query: 3
- local-first
- simple Python codebase
- shared backend for CLI and GUI
- inspectable JSON outputs
- provenance-aware result merging
- graceful fallback when a source fails
- no fabricated paper findings
- easy to extend
- ranking is heuristic, not learned
- summary extraction is heuristic
- limitations extraction can be noisy or repetitive
- full-text extraction quality depends on the PDF
- some papers only have metadata, so summaries stay sparse
- Semantic Scholar can rate-limit unauthenticated requests
- no user database or multi-project workspace yet
- no advanced pagination or batch background jobs yet
High-value next steps:
- add better caching controls and cache invalidation tools
- add retries and backoff per source client
- add pagination at the source-query level, not just the GUI level
- improve deduplication with fuzzy title similarity
- improve ranking with query expansion and venue-aware bonuses
- add better section-aware PDF parsing
- detect abstract/introduction/results sections from extracted text
- improve summary extraction with structured templates per paper type
- add citation graph support through Semantic Scholar when available
- add report templates for literature review, survey, and annotated bibliography
- add background jobs for long retrieval runs
- add a local SQLite index for faster browsing across past searches
- add regression tests around cached results and report quality
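For the fuzzy-title deduplication item, Python's standard library already gets close. A sketch using `difflib` (the 0.9 threshold is an assumption, not a tuned value):

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as duplicates when their similarity ratio
    clears the threshold, after whitespace/case normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold
```

This would slot in as a fallback after the exact DOI/arXiv-ID checks, catching duplicates whose titles differ only in punctuation or capitalization across sources.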
Read these files in order:
1. `README.md`
2. `app/models.py`
3. `app/pipeline.py`
4. `app/ranking.py`
5. `app/summarization.py`
6. `app/reporting.py`
7. `app/gui/streamlit_app.py`
8. `config.yaml`
That is enough to understand nearly all important behavior.
This app is a working local academic research assistant with:
- backend discovery across trusted sources
- local metadata and retrieval pipeline
- grounded per-paper summaries
- cross-paper synthesis
- CLI access
- Streamlit GUI
- caching
- exportable outputs
It is already useful as a research workflow tool, and the next gains are mostly about improving ranking quality, extraction quality, and operational polish rather than changing the basic architecture.