A versatile web scraping application built to efficiently gather data from a variety of websites.
This project is ideal for automating data extraction tasks and transforming raw HTML data into structured formats.
This service application allows users to specify target websites and define the elements to extract, enabling seamless and customizable public data scraping. The tool is designed to handle a range of web scraping scenarios, from simple data extraction to complex, multi-page crawling tasks.
- **User-friendly configuration** - easily set up scraping tasks through a simple UI or via supported API requests
- **Customizable scraping rules** - users can specify target elements using CSS selectors
- **Multiple users support** - run scraper tasks for several users at once
- **Direct link to data monitor service** - configure scraped data thresholds and notifiers in the supplementary monitor
- **REST API** - output data can be retrieved remotely by sending a request to the appropriate REST API endpoint (see the sketch after this list)
- **Error handling** - built-in mechanisms to manage failed requests and handle dynamic content
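The actual routes are described in the requests documentation under `docs/` and can be tested with Bruno (listed below for API testing). As a rough, hypothetical sketch only - the `/api/v1/data` path, the `API_TOKEN` variable, and the bearer-token authentication are assumptions, not the documented API - fetching output data from a running instance with the built-in `fetch` of Node.js 18+ could look like this:

```javascript
// Hypothetical sketch only: the endpoint path and bearer-token auth below are
// assumptions - check the requests documentation in docs/ for the real API.
const BASE_URL = `http://${process.env.SERVER_ADDRESS}:${process.env.SERVER_PORT}`;

async function fetchScrapedData(token) {
  // Node.js 18+ ships a global fetch, so no extra HTTP client is needed
  const response = await fetch(`${BASE_URL}/api/v1/data`, {   // hypothetical endpoint
    headers: { Authorization: `Bearer ${token}` },            // hypothetical auth scheme
  });
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

fetchScrapedData(process.env.API_TOKEN) // API_TOKEN is a placeholder, not a documented variable
  .then((data) => console.log(data))
  .catch((error) => console.error(error));
```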
- Node.js 18+ (service - LINK)
- MongoDB (data storage - LINK)
- Docker (all-in-one approach - LINK)
- Bruno (API testing - LINK)
It's strongly recommended to use the Docker containers approach for installing and running the web-scraper service. However, if for whatever reason that's not an option, a local installation is also possible and requires the following steps:
- Clone the repository:
  `git clone https://github.com/piopon/web-scraper.git`
- Navigate to the project directory:
  `cd web-scraper`
- Install dependencies:
  `npm install`
Before running the application service, create a `.env` file with the following data:
# connection parameters
SERVER_ADDRESS=[STRING] # service address
SERVER_PORT=[INTEGER] # service port number
# database settings
DB_PORT=[INTEGER] # the port for database connection
DB_NAME=[STRING] # the name of the database
DB_ADDRESS=[STRING] # the IP address for database
DB_USER=[STRING] # database user (authentication)
DB_PASSWORD=[STRING] # database password (authentication)
# data monitor settings
MONITOR_ADDRESS=[STRING] # monitor service address
MONITOR_PORT=[INTEGER] # monitor service port number
# scraper settings
SCRAP_INACTIVE_DAYS=[INTEGER] # number of days from last login to treat user as inactive
SCRAP_INTERVAL_SEC=[INTEGER] # default seconds interval between each scrap operation
SCRAP_EXTRAS_TYPE=[ENUM] # the type of extra value in data output
# internal hash and secrets
ENCRYPT_SALT=[STRING|INTEGER] # randomize salt value
SESSION_SHA=[STRING] # hash for session cookie
JWT_SECRET=[STRING] # hash for JSON Web Token
CHALLENGE_PREFIX=[STRING] # challenge token prefix
CHALLENGE_JOIN=[STRING] # challenge token separator
CHALLENGE_EOL_MINS=[INTEGER] # challenge deadline minutes
CHALLENGE_EOL_SEPARATOR=[STRING] # challenge deadline separator
# external authentication
GOOGLE_CLIENT_ID=[STRING] # external Google login client ID
GOOGLE_CLIENT_SECRET=[STRING] # secret for external Google login client
# demo functionality
DEMO_MODE=[ENUM] # the demo session mode (duplicate OR overwrite)
DEMO_BASE=[STRING] # base demo email
DEMO_USER=[STRING] # user template email
DEMO_PASS=[STRING] # base demo user password
# CI functionality
CI_USER=[STRING] # CI user email
CI_PASS=[STRING] # CI user password
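These variables are read from the environment when the service starts. A minimal sketch of how a few of them can be consumed in Node.js (assuming the common `dotenv` package loads the `.env` file; the actual loading logic lives in `src/` and may differ):

```javascript
// Minimal sketch: load the .env file and validate a few of the variables listed above.
// Assumes the common dotenv package - the real service code in src/ may load them differently.
import dotenv from "dotenv";

dotenv.config(); // copies key=value pairs from .env into process.env

const config = {
  serverAddress: process.env.SERVER_ADDRESS,
  serverPort: Number(process.env.SERVER_PORT),
  dbName: process.env.DB_NAME,
  scrapIntervalSec: Number(process.env.SCRAP_INTERVAL_SEC),
};

// Fail fast on obviously broken settings instead of starting with undefined values
if (!config.serverAddress || Number.isNaN(config.serverPort)) {
  throw new Error("SERVER_ADDRESS and SERVER_PORT must be set in the .env file");
}

console.log(`web-scraper configured for ${config.serverAddress}:${config.serverPort}`);
```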
There are two supported ways to run the web-scraper service:
- LOCAL
  - Start the MongoDB instance
  - Go to the web-scraper directory and use the command: `npm run start`
    This will invoke the web-scraper locally on your platform.
- DOCKER
  - Go to the web-scraper directory and use the command: `docker compose up -d`
    This will invoke the web-scraper in a Docker container in detached mode (argument `-d`).
    To display the logs of the service, type the command: `docker logs scraper`
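Regardless of the chosen method, a quick programmatic check that the service answers on the configured address can be done with the built-in `fetch` of Node.js 18+ (assuming the root path serves the web UI, which is where the next step navigates anyway):

```javascript
// Quick check (run as an .mjs module): does the freshly started service respond?
// Assumes the root path returns the web UI; adjust the URL if your setup differs.
const url = `http://${process.env.SERVER_ADDRESS}:${process.env.SERVER_PORT}/`;

const response = await fetch(url);
console.log(`${url} -> ${response.status} ${response.statusText}`);
```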
After the service is up and running, the next steps are as follows:
- Open the web browser and navigate to the configured `IP:PORT` address.
  Log in to your account, create a new one, or open a demo session.
- Customize your scraping tasks by modifying configuration groups and observers, and fill in all components' data.

- Go to the data monitor service and define thresholds and notifiers to make use of the scraped values.
  This button is available only when the MONITOR_ADDRESS (and optionally MONITOR_PORT) environment variable is defined.

After correctly adding the first observer, your data is being scraped from the specified location!
Check the created users directory for scraped data values, or for error details if the configuration is incorrect.
Also keep in mind that initially this directory may contain folders for CI and demo users, depending on the configuration.
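For a quick overview of what has been scraped so far, the per-user folders can be listed programmatically; a small sketch, assuming the output lands in a `users` directory inside the project root (the location may differ in your installation):

```javascript
// Minimal sketch (run as an .mjs module): list per-user output folders and their files.
// The "users" directory location is an assumption - adjust the path for your installation.
import { readdir } from "node:fs/promises";
import { join } from "node:path";

const usersDir = join(process.cwd(), "users");
const entries = await readdir(usersDir, { withFileTypes: true });

for (const entry of entries.filter((item) => item.isDirectory())) {
  const files = await readdir(join(usersDir, entry.name));
  console.log(`${entry.name}: ${files.join(", ") || "(empty)"}`);
}
```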
The current status of web-scraper's components can be quickly checked in the bottom right corner:

After a successful login this panel also contains additional links to:
web-scraper/
├── .github/workflows/ # GitHub workflows for CI/CD
├── docs/ # Requests documentation and docs assets
├── public/ # Frontend UI source files
├── src/ # Backend source files
├── test/ # Unit tests logic
├── .dockerignore # List of files ignored by Docker
├── .gitignore # List of files ignored by GIT
├── CODEOWNERS # List of code owners
├── docker-compose.yml # Docker compose file for this service and MongoDB
├── Dockerfile # Docker container recipe
├── LICENSE # GPL-2.0 license description
├── package-lock.json # Node.js snapshot of the dependency tree
├── package.json # Node.js project metadata
└── README.md # Top-level project description (this document)
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a clear description of your changes.
This project is licensed under the GPL-2.0 license. See the LICENSE file for details.
For questions or suggestions, feel free to contact me through GitHub or via email.
Created by PNK with ❤ @ 2023-2025


