On macOS:

- Install Homebrew (if not already installed):

  ```bash
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  ```

- Install and start Redis (required for the Celery task queue):

  ```bash
  brew install redis
  brew services start redis
  ```

- Install Python dependencies:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

On Ubuntu:

- Update the package list and install Redis:

  ```bash
  sudo apt update
  sudo apt install redis-server python3-pip python3-venv
  sudo systemctl enable --now redis-server
  ```

- Set up the Python virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```

Use the provided run script to start all three components (Redis, Celery worker, Flask) in one command:

```bash
./run.sh
```

All output is sent to syslog under the tag `link-scanner`. To tail logs:

```bash
# macOS
log stream --predicate 'senderImagePath contains "logger"' --info | grep "link-scanner"

# Ubuntu
journalctl -f -t "link-scanner"
```

Press Ctrl+C in the terminal running `run.sh`. This will cleanly shut down all three processes.

If you prefer to run each component separately:

- Redis server:

  ```bash
  redis-server
  ```

- Celery worker:

  ```bash
  source .venv/bin/activate
  celery -A tasks worker --loglevel=info
  ```

- Flask app:

  ```bash
  source .venv/bin/activate
  python3 app.py
  ```

`scanner.py` can be run directly from the command line to scan a sitemap for a search string:

```bash
python3 scanner.py <sitemap_url> <search_string> [options]
```

`<search_string>` can be a regular expression.

| Argument | Description |
|---|---|
| `sitemap_url` | The URL of the sitemap to scan (e.g. `https://example.com/sitemap.xml`) |
| `search_string` | The string (or regex) to search for |

| Flag | Description |
|---|---|
| `-s`, `--silent` | Suppress "skip" messages for 4xx HTTP errors |
| `-a`, `--all` | Search the entire HTML source text instead of only `<a href>` attributes |
| `-t`, `--timeout <seconds>` | Request timeout in seconds (default: 10) |
| `-d`, `--debug` | Enable debug logging to stdout |
| `-u`, `--url` | Print each URL as it is being scanned |

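
To make the difference between the default mode and `--all` concrete, here is a minimal sketch of the two search strategies. It is not taken from `scanner.py`; the names `HrefCollector` and `find_matches` are hypothetical, and the real implementation may parse pages differently.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def find_matches(html, pattern, search_all=False):
    """Return the matching text for one page (illustrative only)."""
    regex = re.compile(pattern)
    if search_all:
        # Equivalent of --all: search the raw HTML source text.
        return [m.group(0) for m in regex.finditer(html)]
    # Default: search only the href values of <a> tags.
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if regex.search(href)]


page = '<p>Call 555-1234</p><a href="mailto:info@example.com">Email</a>'
print(find_matches(page, "mailto:"))                        # href-only search
print(find_matches(page, r"\d{3}-\d{4}"))                   # no link contains a number
print(find_matches(page, r"\d{3}-\d{4}", search_all=True))  # found in the page text
```

In this sketch the phone number is only found when `search_all=True`, because it appears in the page body rather than in a link.
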
Results are written to stdout as CSV with columns: Index, URL, Found Text.
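
If the results need further processing, the CSV output can be read back with Python's standard `csv` module. The sketch below assumes that the scanner may emit a header row, that `Index` is an integer, and that fields containing commas are quoted in the usual CSV way; none of this is confirmed by the source. `read_results` is a hypothetical helper name.

```python
import csv
import io


def read_results(csv_text):
    """Parse scanner CSV output into a list of dicts (illustrative only)."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, url, found = record
        if index.strip().lower() == "index":
            continue  # header row, if the scanner emits one (assumption)
        rows.append({"index": int(index), "url": url, "found": found})
    return rows


sample = (
    "Index,URL,Found Text\n"
    "1,https://example.com/contact,mailto:info@example.com\n"
    '2,https://example.com/about,"Smith, Jones"\n'
)
for row in read_results(sample):
    print(row["index"], row["url"], row["found"])
```

For a saved file, the same function can be given the file's contents, for example `read_results(open("results.csv", newline="").read())`.
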
Search all page links for `mailto:`:

```bash
python3 scanner.py https://example.com/sitemap.xml "mailto:"
```

Search full HTML source for a phone number pattern, with debug output:

```bash
python3 scanner.py https://example.com/sitemap.xml "\d{3}-\d{4}" --all --debug
```

Redirect results to a file:

```bash
python3 scanner.py https://example.com/sitemap.xml "contact" -s > results.csv
```

The following diagram shows the request flow between the components:

```mermaid
graph TD
    subgraph Client_Side [Client Side]
        Browser["User Browser<br/>(HTML/JS Frontend)"]
    end
    subgraph Server_Side [Server Side]
        Flask["Flask Web Server<br/>(app.py)"]
        Redis[("Redis<br/>Message Broker & Result Backend")]
        Worker["Celery Worker<br/>(tasks.py + scanner.py)"]
    end
    subgraph Internet [External]
        Target["Target Websites<br/>(Sitemaps & HTML)"]
    end
    %% Flow
    Browser -- "1. Start Scan (POST /api/scan)" --> Flask
    Flask -- "2. Enqueue Task" --> Redis
    Flask -. "3. Return Task ID" .-> Browser
    Redis -- "4. Distribute Task" --> Worker
    Worker -- "5. Fetch Sitemap & Scan URLs" --> Target
    Target -- "6. Return HTML Content" --> Worker
    Worker -- "7. Update Progress & Results" --> Redis
    Browser -- "8. Poll Status (GET /api/status)" --> Flask
    Flask -- "9. Query Task State" --> Redis
    Redis -- "10. Return State/Result" --> Flask
    Flask -. "11. Return JSON Response" .-> Browser
```

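
Steps 8 to 11 of the flow describe a client polling for task state. The sketch below shows that polling loop in isolation. The response shape (a JSON object with a `state` key), the custom `PROGRESS` state, and the function name `poll_until_done` are assumptions for illustration; only `SUCCESS`, `FAILURE` and `REVOKED` are Celery's built-in terminal states. The HTTP call is injected as a callable so the example runs without a server.

```python
import time

# Celery's built-in terminal task states.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def poll_until_done(fetch_status, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_status() until the task reaches a terminal state.

    fetch_status is any callable returning a dict with a "state" key
    (assumed response shape, not confirmed by the project).
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status.get("state") in TERMINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")


# Simulated responses standing in for repeated GET /api/status calls.
responses = iter([
    {"state": "PENDING"},
    {"state": "PROGRESS", "current": 5, "total": 10},  # hypothetical custom state
    {"state": "SUCCESS", "result": ["https://example.com/contact"]},
])
final = poll_until_done(lambda: next(responses), interval=0, sleep=lambda s: None)
print(final["state"])
```

In a real client, `fetch_status` would wrap the request to the status endpoint and decode its JSON body. Capping the number of attempts keeps the client from polling forever if a task never completes.
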

Known issues:

- Regex injection risk (`scanner.py`): raw user input is passed directly to `re.search()` with no validation; malformed patterns (e.g. unbalanced parentheses) can crash Celery workers.
- Hardcoded test values (`templates/index.html`): default form values point to `healymarketinggroup.com`/`Microsoft`; should be cleared or replaced with placeholder examples.
- No Celery task timeout (`tasks.py`): long-running scans can hang indefinitely; no `time_limit` or `soft_time_limit` configured on the task.
- No rate limiting (`scanner.py`): requests to target sites are made without throttling, which may trigger rate limits or IP blocks on large sitemaps.
- Unused variable (`scanner.py:142`): `x=1` is initialized outside the `if __name__` block; should be moved inside for clarity.
- Silent failure on non-HTML content (`scanner.py`): returns `False` without logging when the response `Content-Type` is not HTML, making it hard to distinguish from "not found".

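
Two of the items above, pattern validation and rate limiting, could be addressed along the following lines. This is a sketch under stated assumptions, not the project's code: `compile_pattern`, `Throttle`, the 200-character limit and the interval value are all hypothetical. Validating the pattern only catches syntax errors; it does not protect against slow, heavily backtracking patterns, which is where a Celery `time_limit` would still be needed.

```python
import re
import time


def compile_pattern(raw, max_length=200):
    """Return a compiled regex, or raise ValueError with a readable message."""
    if not raw or len(raw) > max_length:
        raise ValueError("search pattern must be 1-%d characters" % max_length)
    try:
        return re.compile(raw)
    except re.error as exc:
        # Malformed input (e.g. unbalanced parentheses) ends up here.
        raise ValueError("invalid regular expression: %s" % exc) from exc


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


try:
    compile_pattern("(unbalanced")
except ValueError as exc:
    print("rejected:", exc)

throttle = Throttle(min_interval=0.05)  # hypothetical value
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()  # call before each request to a target site
    print("would fetch", url)
```

Rejecting a bad pattern before the task is enqueued lets the web layer return a clear error instead of failing inside a worker.
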
Copyright 2026 Bill Roth
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.