Broad literature analysis of the subiculum.
This project uses the NCBI Entrez E-Utilities API to programmatically access PubMed metadata related to neuroscience research papers. It strictly adheres to the NCBI API guidelines, data usage policies, and the Entrez programming utilities usage guidelines to ensure respectful and compliant access to public biomedical research data.
Compliance Summary
- Rate Limits Respected: No more than 3 requests per second, with large batch queries run during off-peak hours (9 PM–5 AM ET or weekends).
- Identification Headers Included: All requests include the tool and email parameters for accountability.
- Metadata Only: These API requests retrieve public abstracts and metadata only; no copyrighted content is downloaded.
- Proper Attribution: The NCBI and U.S. National Library of Medicine (NLM) are clearly acknowledged as the data providers.
- Non-Commercial Academic Use: All data usage is strictly for academic research and educational purposes.
For more information, please see the NCBI Data Usage Policies and API guidelines.
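The rate-limit and identification requirements above can be sketched in Python. This is a minimal illustration, not the project's actual code: `build_esearch_url` and `RateLimiter` are hypothetical names, and the `tool`/`email` values are placeholders you would replace with your own.

```python
import time
import urllib.parse

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(term, tool="subiculum-lit-analysis",
                      email="your.email@example.com", api_key=None):
    """Build an ESearch URL carrying the tool/email identification
    parameters NCBI asks every client to send."""
    params = {"db": "pubmed", "term": term, "tool": tool, "email": email}
    if api_key:
        params["api_key"] = api_key  # raises the limit from 3 to 10 req/s
    return f"{EUTILS_BASE}/esearch.fcgi?" + urllib.parse.urlencode(params)

class RateLimiter:
    """Space requests out so we never exceed the allowed rate
    (3/s without an API key, 10/s with one)."""
    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```

Calling `RateLimiter(3).wait()` before every request keeps a key-less client under the 3 req/s ceiling.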
This project uses pixi for cross-platform reproducibility and environment management, including precise library pinning. To get started, run the following commands:
- Install Pixi, then clone the repo:

```bash
# Clone the repo and enter it
git clone https://github.com/yourname/subiculum-lit-analysis.git
cd subiculum-lit-analysis
```

- Create and activate the Pixi environment:

```bash
pixi install
pixi shell
```

- Configure settings:

```bash
# Copy the settings template
cp settings-template.yaml settings.yaml
# Edit settings.yaml with your information
# At minimum, update the email field (REQUIRED by NCBI)
# Optionally add your NCBI API key for 10 req/s (vs 3 req/s without)
```

Key settings to configure:
| Setting | Description | Required | Default |
|---|---|---|---|
| `pubmed.email` | Your email address (NCBI requirement) | Yes | `your.email@example.com` |
| `pubmed.api_key` | NCBI API key for higher rate limits | No | Empty (will prompt securely) |
| `batch.size` | Number of papers per batch | No | 100 |
| `logging.level` | Log verbosity (DEBUG/INFO/WARNING/ERROR) | No | INFO |
| `dry_run.enabled` | Test mode without database writes | No | false |
Get an NCBI API key (free) at: https://www.ncbi.nlm.nih.gov/account/settings/
See settings-template.yaml for full configuration options and detailed descriptions.
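Assuming the nested key layout implied by the dotted setting names above (the authoritative layout is in settings-template.yaml), a minimal settings.yaml might look like:

```yaml
pubmed:
  email: "your.email@example.com"  # REQUIRED by NCBI
  api_key: ""                      # optional; enables 10 req/s

batch:
  size: 100

logging:
  level: INFO

dry_run:
  enabled: false
```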
- Test API connectivity (recommended):

```bash
pixi run test-api
```

Expected output:

```
✨ Pixi task (test-api): bash scripts/test_api.sh (Test PubMed API connectivity; fetches 3 articles)
=== PubMed API Test ===
Fetching 3 articles...
→ Querying: subiculum%5BTitle/Abstract%5D
✓ Retrieved PMIDs: 41213811,41194528,41174500
✓ Saved 3 articles to: data/test/pubmed_api_test.xml
✓ File size: 94877 bytes
✅ PASSED: PubMed API is working correctly
```

This verifies your connection to NCBI's PubMed API by fetching 3 sample articles and saving them to data/test/pubmed_api_test.xml. The test passes if it successfully retrieves and validates the XML data.
- Run integration tests (optional, recommended for development):

```bash
pixi run integration-tests
```

Expected output:

```
=== Testing PubMed API Integration ===
Step 1: Running ESearch...
✓ Found 3461 total papers
✓ WebEnv: MCID_67890abcdef...
✓ Query key: 1
Step 2: Fetching 3 papers via EFetch...
✓ Retrieved 94877 bytes of XML
Step 3: Validating XML structure...
✓ Found 3 PubmedArticle elements
✅ API Integration Test PASSED

=== Testing XML Parsing ===
Step 1: Loading fixture from pubmed_sample.xml...
✓ Loaded 94877 bytes of XML
Step 2: Parsing XML...
✓ Parsed 3 papers
Step 3: Validating first paper structure...
✓ PMID: 41213811
✓ Title: Hippocampal Subfield Susceptibility...
✓ All 13 expected fields present
Step 4: Validating nested entities...
✓ Authors: 9 found
✓ MeSH terms: 8 found
✓ Citations: 85 found
✅ XML Parsing Test PASSED

=== Testing Database Integration ===
Step 1: Creating in-memory database...
Step 2: Applying schema from schema.sql...
✓ Schema applied successfully
Step 3: Loading and parsing XML fixture...
✓ Parsed 3 papers from fixture
Step 4: Inserting papers into database...
✓ Papers inserted: 3
✓ Authors inserted: 18
✓ Citations inserted: 254
✓ MeSH terms inserted: 24
Step 5: Verifying data in tables...
✓ Papers in DB: 3
✓ Authors in DB: 18
Step 6: Verifying foreign key constraints...
✓ No orphaned paper-author links
✓ No orphaned author references
✅ Database Integration Test PASSED
```

Note: The test-api check above is a quick connectivity test using bash/curl, while the integration tests validate the full Python ETL pipeline components. Run the API test first to ensure network connectivity, then run the integration tests to validate the implementation.
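The XML-parsing step tested above can be sketched with Python's standard library. This is a minimal illustration, not the project's actual parser: the element names follow PubMed's EFetch XML format, but the real implementation extracts many more fields (citations, MeSH terms, etc.), and `parse_articles` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Tiny inline fixture mimicking the shape of PubMed EFetch XML
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>41213811</PMID>
      <Article>
        <ArticleTitle>Hippocampal Subfield Susceptibility...</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Smith</LastName><ForeName>Alex</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_articles(xml_text):
    """Extract a minimal record for each PubmedArticle element."""
    root = ET.fromstring(xml_text)
    papers = []
    for art in root.iter("PubmedArticle"):
        papers.append({
            "pmid": art.findtext(".//PMID"),
            "title": art.findtext(".//ArticleTitle"),
            "authors": [
                f"{a.findtext('ForeName')} {a.findtext('LastName')}"
                for a in art.iter("Author")
            ],
        })
    return papers

papers = parse_articles(SAMPLE)
```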
The pixi init-db task creates the subiculum_literature.db SQLite database and applies the full schema from schema.sql. The schema includes tables for papers, authors, citations, MeSH terms, keywords, grants, and more. See the Database Schema Design entry in the project's GitHub wiki for table descriptions, normalization rationale, and query examples.
This project queries PubMed using a hybrid of MeSH terms and title/abstract matches to capture a high-quality dataset of neuroscience papers related to the subiculum: (subiculum[MeSH Terms] OR subiculum[Title/Abstract]). This approach balances precision and coverage, capturing both curated, professionally indexed papers and the latest not-yet-indexed research.
Current query: `subiculum[Title/Abstract]`
Results: 3,576 papers in database (99.7% success rate)

Future expansions:
- Broader MeSH-based query: `subiculum[MeSH]` (~10K papers)
- Related structures: entorhinal cortex, CA1, dentate gyrus
- Incremental updates for newly published papers
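Incremental updates could lean on ESearch's date-filtering parameters (`datetype`, `mindate`, `maxdate`). The helper below is a hypothetical sketch, not the project's code; it only assembles the parameter dictionary a client would send.

```python
def incremental_esearch_params(term, mindate,
                               tool="subiculum-lit-analysis",
                               email="your.email@example.com"):
    """ESearch parameters restricted to records added on/after mindate.

    datetype="edat" filters on the Entrez date (when the record entered
    PubMed), which suits incremental updates better than publication date.
    """
    return {
        "db": "pubmed",
        "term": term,
        "usehistory": "y",     # keep results on the history server for EFetch
        "datetype": "edat",
        "mindate": mindate,    # e.g. "2025/01/01"
        "maxdate": "3000",     # open-ended upper bound
        "tool": tool,
        "email": email,
    }
```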
To fetch all papers from PubMed and populate the database:

```bash
# 1. Initialize the database (creates tables)
pixi run init-db

# 2. Run the complete ETL pipeline
pixi run run-pipeline
```

Expected output:

```
Starting ETL pipeline
Search query: subiculum[Title/Abstract]
Total papers found: 3,461
Already fetched: 0 papers
Papers remaining to fetch: 3,461

--- Batch 1/35 (papers 1-100) ---
Parsed 100 papers from XML
Batch results: 100 inserted, 0 failed, 0 skipped
...

=== Pipeline Complete ===
Total papers inserted: 3,430
Total papers failed: 22
Total papers in database: 3,430
```

Database size: 41 MB (3,576 papers)
What happens:
- ESearch: Queries PubMed for all matching PMIDs
- Batch fetching: Retrieves papers in batches of 100
- XML parsing: Extracts metadata (title, authors, citations, etc.)
- Database loading: Inserts into SQLite with transactions
- Idempotency: Tracks successfully fetched papers in the fetch_log table

On subsequent runs:
- Pipeline skips already-fetched papers (uses the fetch_log table)
- Only fetches new/failed papers
- Safe to re-run anytime
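The batching and idempotency logic above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the one-column fetch_log schema here is a simplification of whatever the real table stores.

```python
import sqlite3

def _ensure_log(conn):
    # Simplified fetch_log: one row per successfully fetched PMID
    conn.execute("CREATE TABLE IF NOT EXISTS fetch_log (pmid TEXT PRIMARY KEY)")

def remaining_pmids(conn, all_pmids):
    """Return PMIDs with no entry in fetch_log, so re-runs skip done work."""
    _ensure_log(conn)
    done = {row[0] for row in conn.execute("SELECT pmid FROM fetch_log")}
    return [p for p in all_pmids if p not in done]

def mark_fetched(conn, pmids):
    """Record successfully fetched PMIDs."""
    _ensure_log(conn)
    conn.executemany("INSERT OR IGNORE INTO fetch_log (pmid) VALUES (?)",
                     [(p,) for p in pmids])
    conn.commit()

def batches(items, size=100):
    """Yield fixed-size batches (the pipeline fetches 100 papers at a time)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

A re-run simply calls `remaining_pmids` first, so only new or previously failed papers are fetched.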
Some papers may fail due to duplicate citations in PubMed XML data. To retry:

```bash
pixi run python scripts/fetch_failed_papers.py
# Output:
# Found 22 unique failed PMIDs
# [1/22] Fetching PMID 28673769...
#   Deduplicated citations: 84 → 83
# ✓ Successfully inserted PMID 28673769
# ...
# === Retry Complete ===
# Successfully inserted: 21/22
# Still failing: 1
```

This script deduplicates citations before inserting, recovering most failures.
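The deduplication step comes down to dropping repeated cited PMIDs before the database insert (the "84 → 83" line above). A minimal sketch, with a hypothetical function name:

```python
def dedupe_citations(citations):
    """Drop repeated cited PMIDs while preserving first-seen order,
    so the citation insert no longer violates uniqueness constraints."""
    seen = set()
    unique = []
    for pmid in citations:
        if pmid not in seen:
            seen.add(pmid)
            unique.append(pmid)
    return unique
```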
To expand the dataset with different search queries:

- Edit settings.yaml:

```yaml
search:
  query: "subiculum[MeSH Terms]"  # Broader query (~10K papers)
```

- Run the pipeline again:

```bash
pixi run run-pipeline
```

The pipeline automatically skips existing papers and only fetches new ones.
Alternative search strategies:

| Query | Papers | Description |
|---|---|---|
| `subiculum[Title/Abstract]` | ~3,500 | Current - high precision |
| `subiculum[MeSH Terms]` | ~10,000 | Professionally indexed papers |
| `subiculum[Title/Abstract] AND alzheimer*[Title/Abstract]` | ~200 | Disease-specific subset |
| `subiculum[Title/Abstract] AND epilep*[Title/Abstract]` | ~150 | Epilepsy research |
| `hippocampus[MeSH] AND CA1[Title/Abstract]` | ~5,000 | Related hippocampal subfield |