Classifieds Analyser

Purpose

Track how listings for specific search terms change over time. The scraper periodically collects listings and stores them in a database. The analyser then provides statistics on pricing, discounts, listing lifespan, and online/offline distribution.

Constraints

No official API is available; the site is scraped directly using HTTP requests and HTML parsing
Requests are minimized to avoid overloading the site's resources
Searches which are specified over json-files need to result in less than 1250 results, otherwise data will be incomplete: the maximum number of pages which are displayed is 50, with 25 results on each page

Architecture

The project is a Gradle multi-module Kotlin/Spring Boot project with the following modules:

scraper — fetches listings and writes them to the database
analyser — reads the database and prints statistical reports
shared — common data model and database access, used by both scraper and analyser

Railway Deployment

Three services run on Railway:

Service	Description
PostgreSQL	Managed database with a persistent volume
Scraper	Cron job that runs the scraper on a schedule
Backup	Cron job that dumps the database and pushes it to Dropbox

Deployment is triggered manually on Railway after a new Docker image has been pushed. The Docker image is built and published to the GitHub Container Registry (ghcr.io) automatically via GitHub Actions on every commit to main.

Infrastructure is managed with Terraform. Database schema migrations are managed with Flyway and run automatically on scraper startup.

Local Development

Prerequisites

JDK 21
Docker (for running a local test database)
IntelliJ IDEA (both modules are run directly from the IDE)
Railway CLI (for accessing the production database)

Running the Scraper

Run ScraperApplicationKt from IntelliJ with the test Spring profile active. This connects to a local PostgreSQL instance on port 5433.

Running the Analyser

Run AnalysisApplicationKt from IntelliJ. This connects to the same local database and prints statistical reports to the console.

Accessing the Production Database

Use the Railway CLI to open a tunnel to the production database:

railway connect

This allows read access to the live database without risking accidental writes, provided a read-only database user is configured.

CI/CD

On every commit to main, a GitHub Actions workflow:

Builds a new Docker image of the scraper
Tags it with the build number (e.g. ghcr.io/tschy/classifieds-scraper:42)
Pushes it to the GitHub Container Registry

Railway is then updated manually to pull the new image. This gives explicit control over which version is running in production and makes rollbacks straightforward.

Crawling Configuration

The scraping interval is fixed at once per hour in the current version.

Search Configurations

Search configurations are defined as JSON files in shared/src/main/resources/search-configs/. The filename must match the name field in the JSON.

Example — rennraeder-berlin.json:

{
  "name": "rennraeder-berlin",
  "category": "fahrraeder",
  "art": "herren",
  "plz": "10115",
  "searchTerm": "rennrad",
  "radius": 30
}

The name field is the unique identifier of a configuration. It represents the intention of the search and must be chosen carefully — any change to the search parameters requires a new name. Configurations in the database are immutable: the scraper never overwrites an existing configuration. If a new JSON file is added with a name that doesn't yet exist in the database, a new entry is created. This ensures that every historical scrape can be traced back to the exact parameters that produced it.

["config-name-one", "config-name-two"]

To activate a config, add its name field to this list. To deactivate it, remove it. The config file itself is never deleted.

If no configs should run, keep the file as an empty array, do not leave it blank:

[]

URL Construction

The Kleinanzeigen search URL contains a location/category segment (e.g. k0c217l3411r10) that must be determined manually by performing the search in a browser and copying the resulting URL. This is intentional — it allows you to verify that the search results match your expectations before committing to a configuration.

Data Model

Each listing is stored with:

Listing ID, title, price, old price (if discounted), negotiable flag
Created date (as reported by the site)
First and last scrape timestamp
Scrape count
Online/offline flag
Reference to the search configuration that produced it (search_config_id)

Each scrape run is recorded in the scrape table with:

Scrape timestamp
Reference to the search configuration used (search_config_id)

Analysis Output

The analyser reports the following per age group:

Number of listings
Percentage discounted
Online/offline distribution
Discount rate broken down by online and offline listings

Planned Analysis

Average number of new listings per 24 hours
Average listing lifespan
Average price of disappearing vs. active listings
Relationship between price and disappearance rate
Relationship between number of photos and price
Relationship between listing text length and price
Keyword influence on price (NLP, future)

Acknowledgements

Developed by Tamara Schymura, with mentoring by matey-jack and AI-assisted development.

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.github/workflows		.github/workflows
analysis		analysis
doc		doc
gradle/wrapper		gradle/wrapper
ops/kleinanzeigen-backup		ops/kleinanzeigen-backup
scraper		scraper
shared		shared
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
TODO.md		TODO.md
build.gradle.kts		build.gradle.kts
docker-compose.yml		docker-compose.yml
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
run.sh		run.sh
settings.gradle.kts		settings.gradle.kts
spec.md		spec.md
tasks.md		tasks.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Classifieds Analyser

Purpose

Constraints

Architecture

Railway Deployment

Local Development

Prerequisites

Running the Scraper

Running the Analyser

Accessing the Production Database

CI/CD

Crawling Configuration

Search Configurations

URL Construction

Data Model

Analysis Output

Planned Analysis

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Classifieds Analyser

Purpose

Constraints

Architecture

Railway Deployment

Local Development

Prerequisites

Running the Scraper

Running the Analyser

Accessing the Production Database

CI/CD

Crawling Configuration

Search Configurations

URL Construction

Data Model

Analysis Output

Planned Analysis

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages