helloworld-crawler

A web crawler that scrapes IT interview questions, ratings, and experiences from helloworld.rs, a popular Serbian IT job market platform.

Built this to help people prep for interviews by seeing what companies actually ask. It pulls company names, positions, interview questions, difficulty ratings, and more, then exports everything to CSV, JSON, or Excel.

Installation

Download

Grab the latest .exe from Releases. No Python or dependencies required.

From source

Requires Python 3.10+.

git clone https://github.com/dulait/helloworld-crawler.git
cd helloworld-crawler
pip install -e .

Run the CLI with python -m helloworld_crawler or the GUI with python entry_gui.py.

GUI

The GUI lets you configure everything visually: output folder, file formats, number of pages, proxies. Hit Start, watch the progress bar, get your files. Hit Stop at any time and whatever has been scraped so far is saved.

CLI

python -m helloworld_crawler                                           # scrape everything with defaults

python -m helloworld_crawler --pages 50 --format json                  # first 50 pages, JSON only

python -m helloworld_crawler --format xlsx --output ./data/results     # Excel output to a custom path

python -m helloworld_crawler --proxy-file proxies.txt --concurrency 20 # use proxies, 20 parallel requests
Option              Default            Description
--pages N           auto-detect        Number of pages to scrape
--output PATH       ./interview_data   Output file path (without extension)
--format FMT        all                csv, json, xlsx, or all
--concurrency N     10                 Parallel requests
--proxy-file PATH   none               Path to a proxy list file
--verbose           off                Debug logging

Proxies

When you scrape a website, every request comes from your IP address. If you send a lot of requests, the site might notice and temporarily block you. A proxy is a middleman server: your request goes to the proxy first, and the proxy forwards it to the website. From the website's perspective, the request came from the proxy's IP, not yours.

This crawler supports rotating proxies, meaning each request can go through a different proxy server. This spreads the load across multiple IPs so no single one gets flagged.
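Round-robin rotation can be sketched in a few lines. This is an illustrative example, not the crawler's actual internals; the proxy URLs and the `next_proxy` helper are made up for the demo:

```python
from itertools import cycle

# Pool of proxies to rotate through (placeholder addresses).
PROXIES = [
    "http://1.2.3.4:8080",
    "socks5://5.6.7.8:1080",
]
_rotation = cycle(PROXIES)

def next_proxy() -> dict:
    """Return a requests-style proxies dict for the next proxy in the pool."""
    url = next(_rotation)
    return {"http": url, "https": url}

# Consecutive requests get consecutive proxies, wrapping around at the end.
first = next_proxy()
second = next_proxy()
```

Each outgoing request would pass the returned dict to the HTTP client, so successive requests leave through different IPs.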

You don't need proxies for small scrapes, but for the full site (~1300 pages) they're recommended.

In the GUI, paste your proxies directly into the text box (one per line).

In the CLI, create a text file:

http://1.2.3.4:8080
socks5://5.6.7.8:1080

Then pass it with --proxy-file proxies.txt.
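Loading such a file is straightforward. A minimal sketch (the `load_proxies` name is hypothetical, not the crawler's exact loader), which also tolerates blank lines and `#` comments:

```python
def load_proxies(path: str) -> list[str]:
    """Read one proxy URL per line, skipping blank lines and # comments."""
    with open(path, encoding="utf-8") as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        ]
```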

Both HTTP and SOCKS5 proxies are supported. The crawler also rotates User-Agent headers on every request automatically, so each request looks like it's coming from a different browser.
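User-Agent rotation amounts to picking a fresh header per request. A rough sketch, assuming a small hand-picked list (the crawler ships its own):

```python
import random

# A few example browser User-Agent strings (illustrative, not the crawler's list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def random_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```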

What gets scraped

Each interview entry includes:

  • Company name and position
  • Interview questions
  • Date, rating, and recommendation
  • Metadata like difficulty, format (online/in-person), duration, and outcome
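A single exported entry (JSON format) might look roughly like this; field names and values are illustrative, not the exact schema:

```json
{
  "company": "Example d.o.o.",
  "position": "Backend Developer",
  "questions": ["Explain the difference between a list and a tuple."],
  "date": "2024-03-15",
  "rating": 4,
  "recommendation": true,
  "difficulty": "medium",
  "format": "online",
  "duration": "60 min",
  "outcome": "offer"
}
```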

Building executables

If you're working on the project and want to build the .exe files locally:

pip install -e .[build]
python build.py          # builds both CLI and GUI
python build.py cli      # CLI only
python build.py gui      # GUI only

Executables end up in dist/. Pushing to main also triggers a GitHub Actions workflow that builds and publishes a new release automatically (version is bumped based on conventional commits).

Note

This crawler respects the rules set by helloworld.rs. The /iskustva path is allowed by the site's robots.txt. If the site ever asks for this to stop, the repo gets taken down, no questions asked. All data is used for educational purposes only.
