A web crawler that scrapes IT interview questions, ratings, and experiences from helloworld.rs, a popular Serbian IT job market platform.
Built this to help people prep for interviews by seeing what companies actually ask. It pulls company names, positions, interview questions, difficulty ratings, and more, then exports everything to CSV, JSON, or Excel.
Grab the latest .exe from Releases. No Python or dependencies required.
Requires Python 3.10+.
git clone https://github.com/dulait/helloworld-crawler.git
cd helloworld-crawler
pip install -e .

Run the CLI with `python -m helloworld_crawler` or the GUI with `python entry_gui.py`.
The GUI lets you configure everything visually: output folder, file formats, number of pages, proxies. Hit Start, watch the progress bar, and get your files. Hit Stop at any time and it saves whatever it has scraped so far.
python -m helloworld_crawler
python -m helloworld_crawler --pages 50 --format json
python -m helloworld_crawler --format xlsx --output ./data/results
python -m helloworld_crawler --proxy-file proxies.txt --concurrency 20

| Option | Default | Description |
|---|---|---|
| `--pages N` | auto-detect | Number of pages to scrape |
| `--output PATH` | `./interview_data` | Output file path (without extension) |
| `--format` | `all` | `csv`, `json`, `xlsx`, or `all` |
| `--concurrency N` | `10` | Parallel requests |
| `--proxy-file PATH` | none | Path to a proxy list file |
| `--verbose` | off | Debug logging |
When you scrape a website, every request comes from your IP address. If you send a lot of requests, the site might notice and temporarily block you. A proxy is a middleman server: your request goes to the proxy first, and the proxy forwards it to the website. From the website's perspective, the request came from the proxy's IP, not yours.
This crawler supports rotating proxies, meaning each request can go through a different proxy server. This spreads the load across multiple IPs so no single one gets flagged.
You don't need proxies for small scrapes, but for the full site (~1300 pages) they're recommended.
In the GUI, paste your proxies directly into the text box (one per line).
In the CLI, create a text file:
http://1.2.3.4:8080
socks5://5.6.7.8:1080
Then pass it with --proxy-file proxies.txt.
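The rotation itself is simple in concept; a minimal sketch of how a proxy list can be cycled through (illustrative only, not the crawler's actual code — `load_proxies` and `next_proxy` are hypothetical names):

```python
from itertools import cycle

def load_proxies(lines):
    """Parse a proxy list: one URL per line; blanks and #-comments skipped."""
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]

proxies = load_proxies([
    "http://1.2.3.4:8080",
    "socks5://5.6.7.8:1080",
])
rotation = cycle(proxies)

def next_proxy():
    """Return a requests-style proxies mapping for the next proxy in rotation."""
    url = next(rotation)
    return {"http": url, "https": url}
```

Each call to `next_proxy()` hands back the next server in the list, wrapping around when it reaches the end, so requests are spread evenly across all proxies.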
Both HTTP and SOCKS5 proxies are supported. The crawler also rotates User-Agent headers on every request automatically, so each request looks like it's coming from a different browser.
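User-Agent rotation follows the same idea; a minimal sketch, where the header pool shown is illustrative and not the crawler's actual list:

```python
import random

# Illustrative User-Agent pool; the real pool would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/119.0 Safari/537.36",
]

def random_headers():
    """Pick a fresh User-Agent header for each outgoing request."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```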
Each interview entry includes:
- Company name and position
- Interview questions
- Date, rating, and recommendation
- Metadata like difficulty, format (online/in-person), duration, and outcome
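Once exported, the data is easy to slice with a few lines of Python. A sketch of filtering the JSON output — note the field names here are assumptions for illustration, not the crawler's documented schema:

```python
import json

# Hypothetical shape of one exported entry; actual field names may differ.
sample = """[
  {"company": "ExampleCorp", "position": "Backend Developer",
   "questions": ["Explain ACID.", "What is a deadlock?"],
   "rating": 4, "difficulty": "medium", "format": "online"}
]"""

entries = json.loads(sample)

# Collect all questions asked by a given company.
questions = [q for e in entries
             if e["company"] == "ExampleCorp"
             for q in e["questions"]]
```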
If you're working on the project and want to build the .exe files locally:
pip install -e .[build]
python build.py # builds both CLI and GUI
python build.py cli # CLI only
python build.py gui   # GUI only

Executables end up in dist/. Pushing to main also triggers a GitHub Actions workflow that builds and publishes a new release automatically (the version is bumped based on conventional commits).
This crawler respects the rules set by helloworld.rs. The /iskustva path is allowed by the site's robots.txt. If the site ever asks for this to stop, the repo gets taken down, no questions asked. All data is used for educational purposes only.