Local-first academic research assistant for discovering, ranking, retrieving, summarizing, comparing, and exporting academic papers from API-friendly sources.
This app helps you research a topic without relying on Google Scholar scraping.
You give it:
- a topic
- optional subtopics
- include keywords
- exclude keywords
- a date range
- a maximum number of papers
- a ranking preference
It then:
- queries trusted sources
- normalizes and deduplicates results
- ranks the papers
- saves metadata locally
- downloads accessible PDFs when available
- extracts text from downloaded PDFs
- generates grounded per-paper summaries
- lets you compare papers and produce a synthesis report
- exports JSON and Markdown outputs
The app is local-first, inspectable, and reproducible. Both the CLI and the Streamlit GUI call the same backend pipeline.
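The inputs listed above map naturally onto a single request object. A minimal sketch of what such a model might look like (field names here are illustrative; the app's actual data models live in `app/models.py`):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchRequest:
    """Illustrative shape of a discovery request (not the app's actual class)."""
    topic: str
    subtopics: list[str] = field(default_factory=list)
    include_keywords: list[str] = field(default_factory=list)
    exclude_keywords: list[str] = field(default_factory=list)
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    max_papers: int = 10
    ranking: str = "relevance"  # "relevance" | "recency" | "citations"


request = SearchRequest(
    topic="large language models for robot planning",
    exclude_keywords=["survey"],
    year_from=2020,
    max_papers=10,
)
```

Both the CLI and the GUI would build one of these and hand it to the shared pipeline, which is what keeps their behavior identical.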
- Searches arXiv
- Searches Semantic Scholar
- Searches Crossref
- Preserves source provenance
- Merges duplicate papers across sources
- Supports ranking by relevance, recency, or citations
- Saves metadata locally
- Downloads PDFs when technically accessible
- Extracts text from downloaded PDFs
- Keeps metadata even when a PDF is unavailable
- Logs failures and continues
For each paper, the app can produce:
- title
- authors
- year
- venue
- DOI / arXiv ID / source links
- abstract
- summary
- main problem
- proposed method
- key findings
- limitations
- takeaways
- confidence note
Confidence is labeled as:
- `metadata only`
- `abstract-based`
- `fuller text-based`
The app can generate a report with:
- executive summary
- selected papers
- recurring themes
- major methods used
- strongest findings
- common limitations
- disagreements or gaps
- recommended reading order
- CLI for scripted or terminal-based use
- Streamlit GUI for interactive exploration
The app is designed around stable, programmatic sources:
- arXiv
- Semantic Scholar
- Crossref
It does not use Google Scholar scraping as the main mechanism.
- Build a `SearchRequest` from CLI or GUI input.
- Optionally load a recent cached discovery response from `data/cache/`.
- Query enabled sources.
- Normalize all returned papers into one common schema.
- Deduplicate by DOI, arXiv ID, or normalized title/year.
- Filter by date and exclude terms.
- Rank the papers using weighted signals.
- Optionally download PDFs for top-ranked results.
- Optionally extract text from downloaded PDFs.
- Generate per-paper summaries from abstract and, when available, extracted text.
- Persist results under `data/processed/` and `outputs/`.
- Display the results in the CLI or GUI.
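The deduplication step (by DOI, arXiv ID, or normalized title/year) can be sketched roughly like this; the helper names are hypothetical, not the app's actual code:

```python
import re


def dedup_key(paper: dict) -> str:
    """Return a stable identity key: DOI first, then arXiv ID,
    then a normalized title plus year as a last resort."""
    if paper.get("doi"):
        return "doi:" + paper["doi"].lower()
    if paper.get("arxiv_id"):
        return "arxiv:" + paper["arxiv_id"].lower()
    # Normalize the title: lowercase, collapse non-alphanumerics to spaces.
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return f"title:{title}:{paper.get('year')}"


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep the first paper seen for each identity key."""
    seen: dict[str, dict] = {}
    for p in papers:
        seen.setdefault(dedup_key(p), p)
    return list(seen.values())
```

A real implementation would also merge provenance fields from the duplicates it drops, so no source attribution is lost.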
Ranking uses a weighted heuristic rather than a learned model.
Current signals include:
- relevance to the query text
- title match quality
- recency
- citation count
- text quality
- source quality
User preference can bias the final score toward `relevance`, `recency`, or `citations`.
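Conceptually, the heuristic is a weighted sum of normalized signals, with the preferred signal's weight boosted. A sketch under assumed signal names and an illustrative bias factor (the real weights live in `config.yaml`):

```python
def score(signals: dict[str, float],
          weights: dict[str, float],
          preference: str) -> float:
    """Weighted average of normalized (0..1) signals.
    The user-preferred signal gets its weight doubled (illustrative factor)."""
    adjusted = dict(weights)
    if preference in adjusted:
        adjusted[preference] *= 2.0
    total_weight = sum(adjusted.values())
    return sum(adjusted[k] * signals.get(k, 0.0) for k in adjusted) / total_weight


weights = {"relevance": 0.4, "recency": 0.2, "citations": 0.2, "source_quality": 0.2}
s = score(
    {"relevance": 0.9, "recency": 0.5, "citations": 0.1, "source_quality": 1.0},
    weights,
    preference="recency",
)
```

Because the result is renormalized by the total weight, the score stays in the 0..1 range regardless of which preference is boosted.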
The summaries are grounded only in retrieved data.
The current behavior is:
- if full text is unavailable, use abstract text
- if full text is available, use it as stronger evidence
- prefer abstract text for headline summary when it is cleaner than raw extracted PDF text
- never invent findings beyond retrieved metadata, abstract, or extracted text
This is still heuristic summarization, not a full scientific reading agent.
```
app/
  config.py              Settings and path setup
  logging_utils.py       Logging configuration
  main.py                Typer CLI
  models.py              Shared data models
  pipeline.py            Shared discovery/retrieval/summarization pipeline
  ranking.py             Ranking signals and sorting
  storage.py             Persistence, exports, cache, cleanup
  summarization.py       Per-paper summary extraction
  reporting.py           Cross-paper synthesis report generation
  gui/streamlit_app.py   Streamlit GUI
  retrieval/
    pdf_fetcher.py       PDF download
    text_extractor.py    PDF text extraction
  sources/
    arxiv_client.py
    semantic_scholar_client.py
    crossref_client.py
config.yaml              Paths, ranking weights, retrieval settings
data/
  raw/                   Downloaded PDFs
  processed/             Discovery JSON, metadata, analysis, extracted text
  cache/                 Cached discovery responses
outputs/                 Exported result JSON and synthesis reports
tests/                   Smoke tests
```
The GUI does not implement separate logic. It calls the same backend used by the CLI.
That keeps behavior consistent across:
- discovery
- ranking
- retrieval
- summarization
- synthesis
- storage
`data/raw/`: Downloaded PDF files.

`data/processed/`: Per-query and per-paper processed outputs, including:

- `discovery.json`
- `metadata.json`
- `analysis.json`
- `extracted_text.txt`

`data/cache/`: Cached discovery responses. These are reused for 24 hours by default.

`outputs/`: Exportable end-user artifacts, such as:

- search results JSON
- synthesis report Markdown
- synthesis report JSON
Generated files are ignored via .gitignore, but the directories themselves are tracked.
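One common way to ignore generated files while keeping the directories themselves tracked is ignore-everything rules plus `.gitkeep` placeholders. A sketch of what such rules might look like (illustrative, not the repo's actual `.gitignore`):

```gitignore
# Ignore generated contents, keep the directories via .gitkeep
data/raw/*
data/processed/*
data/cache/*
outputs/*
!data/raw/.gitkeep
!data/processed/.gitkeep
!data/cache/.gitkeep
!outputs/.gitkeep
```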
Recommended:
```
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

For tests:

```
pip install -e ".[dev]"
```

Optional:

- set `SEMANTIC_SCHOLAR_API_KEY` if you have one
Basic discovery:

```
research-app discover "large language models for robot planning" --max-papers 10
```

JSON output to stdout:

```
research-app discover "large language models for robot planning" --output-json
```

Fresh run without cache:

```
research-app discover "large language models for robot planning" --no-use-cache
```

Disable PDF download and text extraction:

```
research-app discover "large language models for robot planning" --no-fetch-pdfs --no-extract-text
```

Generate a synthesis report from saved results:

```
research-app synthesize outputs/large-language-models-for-robot-planning_results.json
```

Clear generated local data:

```
research-app clear-data
```

Flags:

- `--fetch-pdfs` / `--no-fetch-pdfs` controls PDF retrieval
- `--extract-text` / `--no-extract-text` controls text extraction
- `--use-cache` / `--no-use-cache` controls discovery cache usage
Run:
```
streamlit run app/gui/streamlit_app.py
```

- Use the sidebar to enter topic, filters, sources, ranking, and retrieval options.
- Click `Search`.
- Browse ranked results in the Results tab.
- Open a paper in the Paper Detail tab.
- Select multiple papers in Compare & Report.
- Generate a synthesis report.
- Export Markdown, report JSON, or selected paper metadata.
- Use the Status & Logs tab to inspect warnings, source behavior, retrieval events, and cache activity.
- sidebar search controls
- pagination in results
- quick open/select actions
- detail view for one paper
- paper comparison table
- report viewer
- export buttons
- local data clear control
- status/log visibility
After a search, you will typically see:
- `data/processed/<query>/discovery.json`
- `data/processed/<query>/<paper>/metadata.json`
- `data/processed/<query>/<paper>/analysis.json`
- `data/raw/<paper>/paper.pdf` if retrieved
- `data/processed/<paper>/extracted_text.txt` if extracted
- `outputs/<query>_results.json`
After synthesis:
- `outputs/synthesis_report.md`
- `outputs/synthesis_report.json`
Main settings live in config.yaml.
You can change:
- data paths
- cache TTL
- request timeout
- ranking weights
- retrieval settings
- source defaults
Current notable settings:
- cache TTL: 24 hours
- max PDF downloads per query: 3
- local-first
- simple Python codebase
- shared backend for CLI and GUI
- inspectable JSON outputs
- provenance-aware result merging
- graceful fallback when a source fails
- no fabricated paper findings
- easy to extend
- ranking is heuristic, not learned
- summary extraction is heuristic
- limitations extraction can be noisy or repetitive
- full-text extraction quality depends on the PDF
- some papers only have metadata, so summaries stay sparse
- Semantic Scholar can rate-limit unauthenticated requests
- no user database or multi-project workspace yet
- no advanced pagination or batch background jobs yet
High-value next steps:
- add better caching controls and cache invalidation tools
- add retries and backoff per source client
- add pagination at the source-query level, not just the GUI level
- improve deduplication with fuzzy title similarity
- improve ranking with query expansion and venue-aware bonuses
- add better section-aware PDF parsing
- detect abstract/introduction/results sections from extracted text
- improve summary extraction with structured templates per paper type
- add citation graph support through Semantic Scholar when available
- add report templates for literature review, survey, and annotated bibliography
- add background jobs for long retrieval runs
- add a local SQLite index for faster browsing across past searches
- add regression tests around cached results and report quality
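For the fuzzy-title deduplication item, Python's standard library already gets close. A sketch using `difflib` (the 0.9 threshold is an assumption, not a tuned value):

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as duplicates when their similarity ratio
    clears the threshold, after whitespace/case normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold
```

This would slot in as a fallback after the exact DOI/arXiv-ID checks, catching duplicates whose titles differ only in punctuation or capitalization across sources.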
Read these files in order:
1. `README.md`
2. `app/models.py`
3. `app/pipeline.py`
4. `app/ranking.py`
5. `app/summarization.py`
6. `app/reporting.py`
7. `app/gui/streamlit_app.py`
8. `config.yaml`
That is enough to understand nearly all important behavior.
This app is a working local academic research assistant with:
- backend discovery across trusted sources
- local metadata and retrieval pipeline
- grounded per-paper summaries
- cross-paper synthesis
- CLI access
- Streamlit GUI
- caching
- exportable outputs
It is already useful as a research workflow tool, and the next gains are mostly about improving ranking quality, extraction quality, and operational polish rather than changing the basic architecture.