edisimo/research_app
Research App

Local-first academic research assistant for discovering, ranking, retrieving, summarizing, comparing, and exporting academic papers from API-friendly sources.

What The App Does

This app helps you research a topic without relying on Google Scholar scraping.

You give it:

  • a topic
  • optional subtopics
  • include keywords
  • exclude keywords
  • a date range
  • a maximum number of papers
  • a ranking preference

It then:

  1. queries trusted sources
  2. normalizes and deduplicates results
  3. ranks the papers
  4. saves metadata locally
  5. downloads accessible PDFs when available
  6. extracts text from downloaded PDFs
  7. generates grounded per-paper summaries
  8. lets you compare papers and produce a synthesis report
  9. exports JSON and Markdown outputs

The app is local-first, inspectable, and reproducible. Both the CLI and the Streamlit GUI call the same backend pipeline.

Main Features

Discovery

  • Searches arXiv
  • Searches Semantic Scholar
  • Searches Crossref
  • Preserves source provenance
  • Merges duplicate papers across sources
  • Supports ranking by relevance, recency, or citations

Retrieval

  • Saves metadata locally
  • Downloads PDFs when technically accessible
  • Extracts text from downloaded PDFs
  • Keeps metadata even when a PDF is unavailable
  • Logs failures and continues
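
The log-failures-and-continue behavior can be sketched as a guarded loop; fetch_pdf here is a hypothetical stand-in for the real fetcher in app/retrieval/pdf_fetcher.py:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retrieval")

def fetch_pdf(paper: dict) -> bytes:
    # Hypothetical fetcher; the real one lives in
    # app/retrieval/pdf_fetcher.py.
    if not paper.get("pdf_url"):
        raise ValueError("no accessible PDF")
    return b"%PDF-..."

def retrieve_all(papers: list[dict]) -> dict[str, bytes]:
    fetched: dict[str, bytes] = {}
    for paper in papers:
        try:
            fetched[paper["id"]] = fetch_pdf(paper)
        except Exception as exc:
            # Keep the paper's metadata, log the failure, move on.
            log.warning("PDF fetch failed for %s: %s", paper["id"], exc)
    return fetched

result = retrieve_all([
    {"id": "a", "pdf_url": "https://example.org/a.pdf"},
    {"id": "b"},  # metadata only, no PDF
])
```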

Summarization

For each paper, the app can produce:

  • title
  • authors
  • year
  • venue
  • DOI / arXiv ID / source links
  • abstract
  • summary
  • main problem
  • proposed method
  • key findings
  • limitations
  • takeaways
  • confidence note

Confidence is labeled as:

  • metadata only
  • abstract-based
  • fuller text-based

Cross-Paper Synthesis

The app can generate a report with:

  • executive summary
  • selected papers
  • recurring themes
  • major methods used
  • strongest findings
  • common limitations
  • disagreements or gaps
  • recommended reading order

Interfaces

  • CLI for scripted or terminal-based use
  • Streamlit GUI for interactive exploration

Supported Sources

The app is designed around stable, programmatic sources:

  • arXiv
  • Semantic Scholar
  • Crossref

It deliberately avoids Google Scholar scraping as the primary discovery mechanism.

How It Works

End-To-End Flow

  1. Build a SearchRequest from CLI or GUI input.
  2. Optionally load a recent cached discovery response from data/cache/.
  3. Query enabled sources.
  4. Normalize all returned papers into one common schema.
  5. Deduplicate by DOI, arXiv ID, or normalized title/year.
  6. Filter by date and exclude terms.
  7. Rank the papers using weighted signals.
  8. Optionally download PDFs for top-ranked results.
  9. Optionally extract text from downloaded PDFs.
  10. Generate per-paper summaries from abstract and, when available, extracted text.
  11. Persist results under data/processed/ and outputs/.
  12. Display the results in CLI or GUI.
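
Step 5's deduplication can be sketched as a best-available-identifier key; this is a simplification under assumed field names, not the pipeline's actual code:

```python
import re

def dedup_key(paper: dict) -> str:
    # Prefer strong identifiers; fall back to normalized title/year.
    if paper.get("doi"):
        return "doi:" + paper["doi"].lower()
    if paper.get("arxiv_id"):
        return "arxiv:" + paper["arxiv_id"].lower()
    title = re.sub(r"[^a-z0-9]+", " ", paper.get("title", "").lower()).strip()
    return f"title:{title}:{paper.get('year')}"

def deduplicate(papers: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for p in papers:
        seen.setdefault(dedup_key(p), p)  # keep the first occurrence
    return list(seen.values())

merged = deduplicate([
    {"title": "A Paper", "year": 2024, "doi": "10.1/abc"},
    {"title": "A  Paper!", "year": 2024, "doi": "10.1/ABC"},  # same DOI
    {"title": "Another Paper", "year": 2023},
])
```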

Ranking Logic

Ranking uses a weighted heuristic rather than a learned model.

Current signals include:

  • relevance to the query text
  • title match quality
  • recency
  • citation count
  • text quality
  • source quality

User preference can bias the final score toward:

  • relevance
  • recency
  • citations
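
The weighted scoring can be sketched as a dot product of signals and weights, with the user preference adding a bias to one weight. The weights below are illustrative placeholders, not the values in config.yaml:

```python
def score(signals: dict[str, float], preference: str = "relevance") -> float:
    # Illustrative weights; the real ones live in config.yaml.
    weights = {
        "relevance": 0.35, "title_match": 0.15, "recency": 0.2,
        "citations": 0.15, "text_quality": 0.1, "source_quality": 0.05,
    }
    # The user's ranking preference biases one signal upward.
    weights[preference] = weights.get(preference, 0.0) + 0.25
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

papers = [
    {"id": "old", "signals": {"relevance": 0.9, "recency": 0.1, "citations": 0.8}},
    {"id": "new", "signals": {"relevance": 0.6, "recency": 1.0, "citations": 0.1}},
]
ranked = sorted(papers, key=lambda p: score(p["signals"], "recency"), reverse=True)
```

With a "recency" preference, the fresher paper outranks the more-cited one even at lower relevance.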

Summary Logic

The summaries are grounded only in retrieved data.

The current behavior is:

  • if full text is unavailable, use abstract text
  • if full text is available, use it as stronger evidence
  • prefer the abstract for the headline summary when it is cleaner than raw extracted PDF text
  • never invent findings beyond retrieved metadata, abstract, or extracted text

This is still heuristic summarization, not a full scientific reading agent.

Architecture

app/
  config.py                 Settings and path setup
  logging_utils.py          Logging configuration
  main.py                   Typer CLI
  models.py                 Shared data models
  pipeline.py               Shared discovery/retrieval/summarization pipeline
  ranking.py                Ranking signals and sorting
  storage.py                Persistence, exports, cache, cleanup
  summarization.py          Per-paper summary extraction
  reporting.py              Cross-paper synthesis report generation
  gui/streamlit_app.py      Streamlit GUI
  retrieval/
    pdf_fetcher.py          PDF download
    text_extractor.py       PDF text extraction
  sources/
    arxiv_client.py
    semantic_scholar_client.py
    crossref_client.py
config.yaml                 Paths, ranking weights, retrieval settings
data/
  raw/                      Downloaded PDFs
  processed/                Discovery JSON, metadata, analysis, extracted text
  cache/                    Cached discovery responses
outputs/                    Exported result JSON and synthesis reports
tests/                      Smoke tests

Key Design Choice

The GUI does not implement separate logic. It calls the same backend used by the CLI.

That keeps behavior consistent across:

  • discovery
  • ranking
  • retrieval
  • summarization
  • synthesis
  • storage

Local Storage Layout

data/raw/

Downloaded PDF files.

data/processed/

Per-query and per-paper processed outputs, including:

  • discovery.json
  • metadata.json
  • analysis.json
  • extracted_text.txt

data/cache/

Cached discovery responses. These are reused for 24 hours by default.

outputs/

Exportable end-user artifacts, such as:

  • search results JSON
  • synthesis report Markdown
  • synthesis report JSON

Generated files are ignored via .gitignore, but the directories themselves are tracked.

Setup

Recommended:

python -m venv .venv
source .venv/bin/activate
pip install -e .

For tests:

pip install -e ".[dev]"

Optional:

  • set SEMANTIC_SCHOLAR_API_KEY if you have one

How To Use The App

CLI

Basic discovery:

research-app discover "large language models for robot planning" --max-papers 10

JSON output to stdout:

research-app discover "large language models for robot planning" --output-json

Fresh run without cache:

research-app discover "large language models for robot planning" --no-use-cache

Disable PDF download and text extraction:

research-app discover "large language models for robot planning" --no-fetch-pdfs --no-extract-text

Generate a synthesis report from saved results:

research-app synthesize outputs/large-language-models-for-robot-planning_results.json

Clear generated local data:

research-app clear-data

CLI Notes

  • --fetch-pdfs/--no-fetch-pdfs controls PDF retrieval
  • --extract-text/--no-extract-text controls text extraction
  • --use-cache/--no-use-cache controls discovery cache usage

Streamlit GUI

Run:

streamlit run app/gui/streamlit_app.py

GUI Workflow

  1. Use the sidebar to enter topic, filters, sources, ranking, and retrieval options.
  2. Click Search.
  3. Browse ranked results in the Results tab.
  4. Open a paper in the Paper Detail tab.
  5. Select multiple papers in Compare & Report.
  6. Generate a synthesis report.
  7. Export Markdown, report JSON, or selected paper metadata.
  8. Use the Status & Logs tab to inspect warnings, source behavior, retrieval events, and cache activity.

GUI Features

  • sidebar search controls
  • pagination in results
  • quick open/select actions
  • detail view for one paper
  • paper comparison table
  • report viewer
  • export buttons
  • local data clear control
  • status/log visibility

Main Output Files

After a search, you will typically see:

  • data/processed/<query>/discovery.json
  • data/processed/<query>/<paper>/metadata.json
  • data/processed/<query>/<paper>/analysis.json
  • data/raw/<paper>/paper.pdf if retrieved
  • data/processed/&lt;query&gt;/&lt;paper&gt;/extracted_text.txt if extracted
  • outputs/<query>_results.json

After synthesis:

  • outputs/synthesis_report.md
  • outputs/synthesis_report.json
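
Because every artifact is plain JSON, the outputs are easy to inspect from a script. The key names below are assumptions; inspect the real results file and adjust:

```python
import json
import tempfile
from pathlib import Path

def list_titles(results_path: Path) -> list[str]:
    # Assumes the exported results hold paper records with a "title"
    # field, either as a top-level list or under a "papers" key.
    data = json.loads(results_path.read_text())
    papers = data if isinstance(data, list) else data.get("papers", [])
    return [p.get("title", "<untitled>") for p in papers]

# Demo against a hypothetical results file.
demo = Path(tempfile.mkdtemp()) / "demo_results.json"
demo.write_text(json.dumps({"papers": [{"title": "Paper A"}, {"title": "Paper B"}]}))
titles = list_titles(demo)
```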

Configuration

Main settings live in config.yaml.

You can change:

  • data paths
  • cache TTL
  • request timeout
  • ranking weights
  • retrieval settings
  • source defaults

Current notable settings:

  • cache TTL: 24 hours
  • max PDF downloads per query: 3
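
For orientation, a config.yaml of that shape might look like the following. Every key name and value here is illustrative only; consult the shipped config.yaml for the real schema:

```yaml
# Illustrative shape only; the shipped config.yaml is authoritative.
paths:
  raw: data/raw
  processed: data/processed
  cache: data/cache
  outputs: outputs
cache:
  ttl_hours: 24                 # documented default
retrieval:
  request_timeout_seconds: 30   # assumed value
  max_pdf_downloads: 3          # documented default
ranking:
  weights:                      # assumed values
    relevance: 0.35
    recency: 0.2
    citations: 0.15
```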

What Is Good About The Current Version

  • local-first
  • simple Python codebase
  • shared backend for CLI and GUI
  • inspectable JSON outputs
  • provenance-aware result merging
  • graceful fallback when a source fails
  • no fabricated paper findings
  • easy to extend

Known Limitations

  • ranking is heuristic, not learned
  • summary extraction is heuristic
  • limitations extraction can be noisy or repetitive
  • full-text extraction quality depends on the PDF
  • some papers only have metadata, so summaries stay sparse
  • Semantic Scholar can rate-limit unauthenticated requests
  • no user database or multi-project workspace yet
  • no advanced pagination or batch background jobs yet

How To Make It Better

High-value next steps:

  • add better caching controls and cache invalidation tools
  • add retries and backoff per source client
  • add pagination at the source-query level, not just the GUI level
  • improve deduplication with fuzzy title similarity
  • improve ranking with query expansion and venue-aware bonuses
  • add better section-aware PDF parsing
  • detect abstract/introduction/results sections from extracted text
  • improve summary extraction with structured templates per paper type
  • add citation graph support through Semantic Scholar when available
  • add report templates for literature review, survey, and annotated bibliography
  • add background jobs for long retrieval runs
  • add a local SQLite index for faster browsing across past searches
  • add regression tests around cached results and report quality

If You Want To Understand The Whole App Quickly

Read these files in order:

  1. README.md
  2. app/models.py
  3. app/pipeline.py
  4. app/ranking.py
  5. app/summarization.py
  6. app/reporting.py
  7. app/gui/streamlit_app.py
  8. config.yaml

That is enough to understand nearly all important behavior.

Bottom Line

This app is a working local academic research assistant with:

  • backend discovery across trusted sources
  • local metadata and retrieval pipeline
  • grounded per-paper summaries
  • cross-paper synthesis
  • CLI access
  • Streamlit GUI
  • caching
  • exportable outputs

It is already useful as a research workflow tool, and the next gains are mostly about improving ranking quality, extraction quality, and operational polish rather than changing the basic architecture.

About

LLM-created app for fast research on topics
