wgroth2/link-scanner
README for Website String Search

Installation: macOS

  1. Install Homebrew (if not already installed):

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install & Start Redis: Redis is required for the Celery task queue.

    brew install redis
    brew services start redis
  3. Install Python Dependencies:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Installation: Ubuntu

  1. Update Package List and Install Redis:

    sudo apt update
    sudo apt install redis-server python3-pip python3-venv
    sudo systemctl enable --now redis-server
  2. Setup Python Virtual Environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Setup

Use the provided run script to start all three components (Redis, Celery worker, Flask) in one command:

./run.sh

All output is sent to syslog under the tag link-scanner. To tail logs:

# macOS
log stream --predicate 'senderImagePath contains "logger"' --info | grep "link-scanner"

# Ubuntu
journalctl -f -t "link-scanner"

Stopping the Application

Press Ctrl+C in the terminal running run.sh. This will cleanly shut down all three processes.

Manual Setup (Alternative)

If you prefer to run each component separately:

  1. Redis Server:

    redis-server
  2. Celery Worker:

    source .venv/bin/activate
    celery -A tasks worker --loglevel=info
  3. Flask App:

    source .venv/bin/activate
    python3 app.py

Command Line Usage (scanner.py)

scanner.py can be run directly from the command line to scan a sitemap for a search string.

python3 scanner.py <sitemap_url> <search_string> [options]

A note on search strings

search_string may be a plain string or a regular expression.

Positional Arguments

| Argument | Description |
| --- | --- |
| sitemap_url | The URL of the sitemap to scan (e.g. https://example.com/sitemap.xml) |
| search_string | The string (or regex) to search for |

Options

| Flag | Description |
| --- | --- |
| -s, --silent | Suppress "skip" messages for 4xx HTTP errors |
| -a, --all | Search the entire HTML source text instead of only <a href> attributes |
| -t, --timeout <seconds> | Request timeout in seconds (default: 10) |
| -d, --debug | Enable debug logging to stdout |
| -u, --url | Print each URL as it is being scanned |
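The flags above map naturally onto an argparse setup. The sketch below mirrors the documented interface; it is an illustration of how the parser might look, not scanner.py's actual code, and attribute names such as search_all and show_urls are assumptions:

```python
# Illustrative argparse setup matching the documented scanner.py flags.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a sitemap for a search string")
    parser.add_argument("sitemap_url", help="URL of the sitemap to scan")
    parser.add_argument("search_string", help="String (or regex) to search for")
    parser.add_argument("-s", "--silent", action="store_true",
                        help='Suppress "skip" messages for 4xx HTTP errors')
    parser.add_argument("-a", "--all", action="store_true", dest="search_all",
                        help="Search the entire HTML source, not only <a href> attributes")
    parser.add_argument("-t", "--timeout", type=int, default=10, metavar="SECONDS",
                        help="Request timeout in seconds (default: 10)")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Enable debug logging to stdout")
    parser.add_argument("-u", "--url", action="store_true", dest="show_urls",
                        help="Print each URL as it is being scanned")
    return parser

args = build_parser().parse_args(
    ["https://example.com/sitemap.xml", "mailto:", "--timeout", "5", "-a"])
print(args.timeout, args.search_all, args.silent)  # → 5 True False
```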

Output

Results are written to stdout as CSV with columns: Index, URL, Found Text.
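Because the output is plain CSV, it can be post-processed with the stdlib csv module. The sample row below is invented, and this assumes the column names form a header row as listed above:

```python
# Reading scanner.py's CSV output back in (sample data is invented).
import csv, io

sample = io.StringIO(
    "Index,URL,Found Text\n"
    '1,https://example.com/contact,"mailto:info@example.com"\n'
)
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["Index"], row["URL"], row["Found Text"])
```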

Examples

Search all page links for "mailto:":

python3 scanner.py https://example.com/sitemap.xml "mailto:"

Search full HTML source for a phone number pattern, with debug output:

python3 scanner.py https://example.com/sitemap.xml "\d{3}-\d{4}" --all --debug

Redirect results to a file:

python3 scanner.py https://example.com/sitemap.xml "contact" -s > results.csv

Architecture

```mermaid
graph TD
    subgraph Client_Side [Client Side]
        Browser["User Browser<br/>(HTML/JS Frontend)"]
    end

    subgraph Server_Side [Server Side]
        Flask["Flask Web Server<br/>(app.py)"]
        Redis[("Redis<br/>Message Broker & Result Backend")]
        Worker["Celery Worker<br/>(tasks.py + scanner.py)"]
    end

    subgraph Internet [External]
        Target["Target Websites<br/>(Sitemaps & HTML)"]
    end

    %% Flow
    Browser -- "1. Start Scan (POST /api/scan)" --> Flask
    Flask -- "2. Enqueue Task" --> Redis
    Flask -. "3. Return Task ID" .-> Browser

    Redis -- "4. Distribute Task" --> Worker
    Worker -- "5. Fetch Sitemap & Scan URLs" --> Target
    Target -- "6. Return HTML Content" --> Worker

    Worker -- "7. Update Progress & Results" --> Redis

    Browser -- "8. Poll Status (GET /api/status)" --> Flask
    Flask -- "9. Query Task State" --> Redis
    Redis -- "10. Return State/Result" --> Flask
    Flask -. "11. Return JSON Response" .-> Browser
```
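The numbered flow in the diagram can be sketched with a stdlib-only stand-in: a queue.Queue plays the role of Redis as broker, a dict plays the result backend, and a thread plays the Celery worker. This is purely illustrative, not the app's actual code, which uses Celery's task machinery for steps 4–11:

```python
# Toy analogue of the enqueue/work/poll flow shown in the diagram.
import queue, threading, time, uuid

task_queue = queue.Queue()   # stands in for Redis as message broker
results = {}                 # stands in for Redis as result backend

def worker():
    while True:
        task_id, sitemap_url = task_queue.get()        # 4. distribute task
        results[task_id] = {"state": "PROGRESS"}       # 7. update progress
        # 5-6. (the real worker fetches the sitemap and scans each page here)
        results[task_id] = {"state": "SUCCESS",
                            "result": f"scanned {sitemap_url}"}
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def start_scan(sitemap_url):                           # 1-3. POST /api/scan
    task_id = str(uuid.uuid4())
    results[task_id] = {"state": "PENDING"}
    task_queue.put((task_id, sitemap_url))
    return task_id

def poll_status(task_id):                              # 8-11. GET /api/status
    return results.get(task_id, {"state": "PENDING"})

tid = start_scan("https://example.com/sitemap.xml")
while poll_status(tid)["state"] != "SUCCESS":
    time.sleep(0.01)
print(poll_status(tid)["state"])  # → SUCCESS
```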


Notes

Issues Worth Addressing

  1. Regex injection risk (scanner.py) — raw user input is passed directly to re.search() with no validation; malformed patterns (e.g. unbalanced parentheses) can crash Celery workers.
  2. Hardcoded test values (templates/index.html) — default form values point to healymarketinggroup.com / Microsoft; should be cleared or replaced with placeholder examples.
  3. No Celery task timeout (tasks.py) — long-running scans can hang indefinitely; no time_limit or soft_time_limit configured on the task.
  4. No rate limiting (scanner.py) — requests to target sites are made without throttling, which may trigger rate limits or IP blocks on large sitemaps.
  5. Unused variable (scanner.py:142) — x=1 is initialized outside the if __name__ block; should be moved inside for clarity.
  6. Silent failure on non-HTML content (scanner.py) — returns False without logging when the response Content-Type is not HTML, making it hard to distinguish from "not found".
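For item 1, one low-cost mitigation is to compile the user's pattern before it ever reaches a worker, so malformed regexes are rejected at the web layer. The validate_pattern helper below is a hypothetical sketch, not part of scanner.py:

```python
# Hypothetical guard against malformed user-supplied regexes (item 1 above).
import re

def validate_pattern(pattern: str) -> re.Pattern:
    """Compile the user's pattern, raising ValueError on invalid syntax."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid search pattern {pattern!r}: {exc}") from exc

print(validate_pattern(r"\d{3}-\d{4}").pattern)   # valid pattern compiles
try:
    validate_pattern("(unbalanced")               # invalid: rejected up front
except ValueError as exc:
    print(exc)
```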

License

Copyright 2026 Bill Roth

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

About

A web app for scanning website pages for links.
