Goodreads Scraper

A fast CLI + Python tool to scrape Goodreads book details — title, author, rating, ratings/reviews counts, description, ISBN, page count, format, cover image — to CSV or JSON. No API key, no JS rendering.

Features

  • Clean book metadata — title, author(s), average rating, ratings count, reviews count, description, ISBN, number of pages, format, publisher, language, and cover image URL
  • Two structured sources, no browser — reads Goodreads' own schema.org JSON-LD and Next.js __NEXT_DATA__ blobs that ship in the page HTML, so there's no Selenium/Playwright and no fragile CSS selectors
  • Takes an id or a URL — pass 3735293 or https://www.goodreads.com/book/show/3735293-clean-code
  • Multiple books at once — list several ids/URLs on one command
  • CSV / JSON / JSONL output — ready for Excel, pandas, or a database
  • Python API — use it as a library
  • Minimal dependencies — just requests + beautifulsoup4

Installation

pip install goodreads-scraper

Requires Python 3.10+.

Quick Start

# Scrape one book by id (prints JSON)
goodreads 3735293

# By full URL, to CSV
goodreads "https://www.goodreads.com/book/show/3735293-clean-code" -f csv -o clean-code.csv

# Several books at once, to JSONL
goodreads 3735293 5907 11870085 -f jsonl -o books.jsonl

Example JSON record (goodreads 3735293):

{
  "title": "Clean Code: A Handbook of Agile Software Craftsmanship",
  "author": "Robert C. Martin",
  "authors": ["Robert C. Martin"],
  "rating": 4.19,
  "ratings_count": 28173,
  "reviews_count": 1502,
  "description": "Even good software developers leave a trail of...",
  "isbn": "9780132350884",
  "num_pages": 464,
  "format": "Paperback",
  "publisher": "Prentice Hall",
  "language": "English",
  "cover_url": "https://i.gr-assets.com/images/S/.../3735293.jpg",
  "book_id": "3735293",
  "url": "https://www.goodreads.com/book/show/3735293"
}
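As a rough illustration of how a record shaped like the one above maps onto the JSONL and CSV formats, here is a minimal standard-library sketch. It is not the tool's own writer: the helper names `to_jsonl` and `to_csv` are hypothetical, and joining list values with "; " is an assumption about how multi-author fields might be flattened.

```python
import csv
import io
import json

# Trimmed-down record reusing a few field names from the example above.
records = [
    {
        "book_id": "3735293",
        "title": "Clean Code: A Handbook of Agile Software Craftsmanship",
        "authors": ["Robert C. Martin"],
        "rating": 4.19,
        "publisher": None,  # a missing field, as on sparse book pages
    }
]


def to_jsonl(rows):
    """Serialize one JSON object per line (hypothetical helper)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def to_csv(rows):
    """Write a header plus one row per record (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Assumption: list values are flattened into a "; "-separated string.
        flat = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)  # None is written as an empty cell
    return buffer.getvalue()


print(to_jsonl(records))
print(to_csv(records))
```

JSONL keeps nested values such as the `authors` list intact, while CSV needs them flattened, which is worth keeping in mind when choosing a format for pandas or a database load.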

CLI Reference

goodreads [OPTIONS] BOOK [BOOK ...]
| Argument / Flag | Default | Description |
| --- | --- | --- |
| BOOK | | One or more Goodreads book ids (3735293) or book-page URLs |
| --format, -f | json | Output format: csv, json, jsonl |
| --output, -o FILE | stdout | Write to file |
| --count | off | Print only how many books were scraped |

A book that fails to scrape logs an error to stderr and is skipped; the rest still come through.
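The same skip-and-continue behavior can be approximated when calling the Python API in a loop. The sketch below is an assumption about how one might do this, not the CLI's actual code: `scrape_many` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for a real scrape call so the example runs offline.

```python
import sys


def scrape_many(books, scrape_fn):
    """Scrape each book; log failures to stderr and keep going."""
    results = []
    for book in books:
        try:
            results.append(scrape_fn(book))
        except Exception as exc:  # broad on purpose: one bad book must not stop the batch
            print(f"error: {book}: {exc}", file=sys.stderr)
    return results


def fake_scrape(book):
    """Offline stand-in for a real scrape call (hypothetical)."""
    if book == "bad-id":
        raise ValueError("could not parse book id")
    return {"book_id": str(book)}


print(scrape_many([3735293, "bad-id", 5907], fake_scrape))
```

In real use, `scrape_fn` would be the `scrape` function from `goodreads_scraper`; the exception types it raises are not documented here, which is why the sketch catches `Exception`.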

Python API

from goodreads_scraper import scrape, parse_book_id, parse_html

# Scrape by id or URL
book = scrape(3735293)
print(book["title"], book["rating"], book["ratings_count"])

book = scrape("https://www.goodreads.com/book/show/3735293-clean-code")

# Resolve an id from a URL without hitting the network
parse_book_id("https://www.goodreads.com/book/show/3735293-clean-code")  # -> "3735293"

# Parse a page you already have in hand (no network)
with open("book.html", encoding="utf-8") as f:
    record = parse_html(f.read())

How it works

Goodreads renders each book detail page server-side and embeds two machine-readable data sources directly in the HTML — neither needs JavaScript:

  1. JSON-LD (<script type="application/ld+json">, a schema.org Book) — the most stable source for title, author, average rating, and ratings/reviews counts. Used first.
  2. __NEXT_DATA__ (<script id="__NEXT_DATA__">, a Next.js/Apollo cache) — fills in the description (HTML stripped), ISBN-13, page count, format, publisher, and cover image.

The scraper fetches https://www.goodreads.com/book/show/{book_id} with a full Chrome request fingerprint, merges both sources, and returns clean fields.
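To make the two-source idea concrete, here is a minimal sketch that pulls both script blobs out of a page and merges them, using only the standard library. It is not the package's implementation (which depends on beautifulsoup4). The names `ScriptCollector`, `extract_sources`, and `merge` are hypothetical, the sample HTML is made up, and the `publisher` key under `pageProps` is an assumed placeholder: the real `__NEXT_DATA__` layout on Goodreads is more deeply nested.

```python
import json
from html.parser import HTMLParser

SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Book", "name": "Clean Code",
 "aggregateRating": {"ratingValue": 4.19, "ratingCount": 28173}}
</script>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"publisher": "Prentice Hall"}}}
</script>
</head><body></body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the raw text of the JSON-LD and __NEXT_DATA__ script tags."""

    def __init__(self):
        super().__init__()
        self.found = {}       # source name -> raw JSON text
        self._current = None  # source currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attrs = dict(attrs)
        name = None
        if attrs.get("type") == "application/ld+json":
            name = "json_ld"
        elif attrs.get("id") == "__NEXT_DATA__":
            name = "next_data"
        # Keep only the first occurrence of each source.
        if name and name not in self.found:
            self._current = name
            self.found[name] = ""

    def handle_data(self, data):
        if self._current:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._current = None


def extract_sources(html):
    """Return the parsed JSON of whichever sources the page contains."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return {name: json.loads(text) for name, text in parser.found.items()}


def merge(sources):
    """Take core fields from JSON-LD, then fill gaps from __NEXT_DATA__."""
    ld = sources.get("json_ld", {})
    aggregate = ld.get("aggregateRating", {})
    page_props = sources.get("next_data", {}).get("props", {}).get("pageProps", {})
    return {
        "title": ld.get("name"),
        "rating": aggregate.get("ratingValue"),
        "ratings_count": aggregate.get("ratingCount"),
        "publisher": page_props.get("publisher"),  # hypothetical key
    }


print(merge(extract_sources(SAMPLE_HTML)))
```

Using `.get()` throughout means a page that lacks one source or one field yields `None` for that field instead of raising an error.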

Limitations

This is a deliberately lightweight tool. Know these constraints before relying on it:

  • Detail pages only. It scrapes a book page when you already know its id or URL. It does not search or discover books — Goodreads' search and listing endpoints sit behind AWS WAF and reject scripted requests, so you need to get book ids from elsewhere (a browser, your existing library export, the Goodreads URL bar, etc.).
  • No official API. Goodreads shut down its public API in late 2020, so this reads the public page HTML instead.
  • Page-structure dependent. If Goodreads changes its JSON-LD or __NEXT_DATA__ layout, parsing may need an update. Some older/sparse book pages omit fields (e.g. publisher or cover); those come back as null.
  • Be polite. Throttle your requests; don't hammer the site.
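One simple way to throttle is a fixed pause between consecutive requests. The sketch below assumes nothing about built-in rate limiting in the package: `throttled` is a hypothetical helper, the 2-second delay is an arbitrary example value, and the demo passes a fake fetch function and a recording `sleep` so it runs instantly and offline.

```python
import time


def throttled(items, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(item) for each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(item))
    return results


# Demo: record the pauses instead of actually sleeping.
pauses = []
books = throttled(
    ["3735293", "5907", "11870085"],
    fetch=lambda book_id: {"book_id": book_id},  # stand-in for a real scrape call
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(books), pauses)
```

In real use, `fetch` would be the `scrape` function from `goodreads_scraper` and the `sleep` argument would be left at its default.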

💡 Don't want to write code or hunt for book IDs? Thunderbit is an AI web scraper Chrome extension that scrapes Goodreads (and any site) in 2 clicks, no code.

Development

git clone https://github.com/thunderbit-operations/goodreads-scraper.git
cd goodreads-scraper
pip install -e ".[dev]"
pytest

Tests run fully offline against a saved book-page fixture.

Legal

Scrape responsibly and at a polite rate. Only collect publicly available data, and review Goodreads' Terms of Service and your local regulations before use.

License

MIT — Built by Thunderbit, AI-powered web scraper & data extraction tools.
