wgroth2/link-scanner
README for Website String Search

Installation: macOS

  1. Install Homebrew (if not already installed):

    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install & Start Redis: Redis is required for the Celery task queue.

    brew install redis
    brew services start redis
  3. Install Python Dependencies:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Installation: Ubuntu

  1. Update Package List and Install Redis:

    sudo apt update
    sudo apt install redis-server python3-pip python3-venv
    sudo systemctl enable --now redis-server
  2. Setup Python Virtual Environment:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Setup

Use the provided run script to start all three components (Redis, Celery worker, Flask) in one command:

./run.sh

All output is sent to syslog under the tag link-scanner. To tail logs:

# macOS
log stream --predicate 'senderImagePath contains "logger"' --info | grep "link-scanner"

# Ubuntu
journalctl -f -t "link-scanner"

Stopping the Application

Press Ctrl+C in the terminal running run.sh. This will cleanly shut down all three processes.

Manual Setup (Alternative)

If you prefer to run each component separately:

  1. Redis Server:

    redis-server
  2. Celery Worker:

    source .venv/bin/activate
    celery -A tasks worker --loglevel=info
  3. Flask App:

    source .venv/bin/activate
    python3 app.py

Command Line Usage (scanner.py)

scanner.py can be run directly from the command line to scan a sitemap for a search string.

python3 scanner.py <sitemap_url> <search_string> [options]

A note on search strings

search_string may be a plain string or a regular expression.

Positional Arguments

| Argument | Description |
| --- | --- |
| sitemap_url | The URL of the sitemap to scan (e.g. https://example.com/sitemap.xml) |
| search_string | The string (or regex) to search for |

Options

| Flag | Description |
| --- | --- |
| -s, --silent | Suppress "skip" messages for 4xx HTTP errors |
| -a, --all | Search the entire HTML source text instead of only <a href> attributes |
| -t, --timeout <seconds> | Request timeout in seconds (default: 10) |
| -d, --debug | Enable debug logging to stdout |
| -u, --url | Print each URL as it is being scanned |
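The flags above map naturally onto an argparse setup. The sketch below mirrors the documented interface; it is an illustration of how the parser might look, not scanner.py's actual code, and attribute names such as search_all and show_urls are assumptions:

```python
# Illustrative argparse setup matching the documented scanner.py flags.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a sitemap for a search string")
    parser.add_argument("sitemap_url", help="URL of the sitemap to scan")
    parser.add_argument("search_string", help="String (or regex) to search for")
    parser.add_argument("-s", "--silent", action="store_true",
                        help='Suppress "skip" messages for 4xx HTTP errors')
    parser.add_argument("-a", "--all", action="store_true", dest="search_all",
                        help="Search the entire HTML source, not only <a href> attributes")
    parser.add_argument("-t", "--timeout", type=int, default=10, metavar="SECONDS",
                        help="Request timeout in seconds (default: 10)")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Enable debug logging to stdout")
    parser.add_argument("-u", "--url", action="store_true", dest="show_urls",
                        help="Print each URL as it is being scanned")
    return parser

args = build_parser().parse_args(
    ["https://example.com/sitemap.xml", "mailto:", "--timeout", "5", "-a"])
print(args.timeout, args.search_all, args.silent)  # → 5 True False
```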

Output

Results are written to stdout as CSV with columns: Index, URL, Found Text.
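Because the output is plain CSV, it can be post-processed with the stdlib csv module. The sample row below is invented, and this assumes the column names form a header row as listed above:

```python
# Reading scanner.py's CSV output back in (sample data is invented).
import csv, io

sample = io.StringIO(
    "Index,URL,Found Text\n"
    '1,https://example.com/contact,"mailto:info@example.com"\n'
)
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["Index"], row["URL"], row["Found Text"])
```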

Examples

Search all page links for "mailto:":

python3 scanner.py https://example.com/sitemap.xml "mailto:"

Search full HTML source for a phone number pattern, with debug output:

python3 scanner.py https://example.com/sitemap.xml "\d{3}-\d{4}" --all --debug

Redirect results to a file:

python3 scanner.py https://example.com/sitemap.xml "contact" -s > results.csv

Architecture

```mermaid
graph TD
    subgraph Client_Side [Client Side]
        Browser["User Browser<br/>(HTML/JS Frontend)"]
    end

    subgraph Server_Side [Server Side]
        Flask["Flask Web Server<br/>(app.py)"]
        Redis[("Redis<br/>Message Broker & Result Backend")]
        Worker["Celery Worker<br/>(tasks.py + scanner.py)"]
    end

    subgraph Internet [External]
        Target["Target Websites<br/>(Sitemaps & HTML)"]
    end

    %% Flow
    Browser -- "1. Start Scan (POST /api/scan)" --> Flask
    Flask -- "2. Enqueue Task" --> Redis
    Flask -. "3. Return Task ID" .-> Browser

    Redis -- "4. Distribute Task" --> Worker
    Worker -- "5. Fetch Sitemap & Scan URLs" --> Target
    Target -- "6. Return HTML Content" --> Worker

    Worker -- "7. Update Progress & Results" --> Redis

    Browser -- "8. Poll Status (GET /api/status)" --> Flask
    Flask -- "9. Query Task State" --> Redis
    Redis -- "10. Return State/Result" --> Flask
    Flask -. "11. Return JSON Response" .-> Browser
```
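The numbered flow in the diagram can be sketched with a stdlib-only stand-in: a queue.Queue plays the role of Redis as broker, a dict plays the result backend, and a thread plays the Celery worker. This is purely illustrative, not the app's actual code, which uses Celery's task machinery for steps 4–11:

```python
# Toy analogue of the enqueue/work/poll flow shown in the diagram.
import queue, threading, time, uuid

task_queue = queue.Queue()   # stands in for Redis as message broker
results = {}                 # stands in for Redis as result backend

def worker():
    while True:
        task_id, sitemap_url = task_queue.get()        # 4. distribute task
        results[task_id] = {"state": "PROGRESS"}       # 7. update progress
        # 5-6. (the real worker fetches the sitemap and scans each page here)
        results[task_id] = {"state": "SUCCESS",
                            "result": f"scanned {sitemap_url}"}
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def start_scan(sitemap_url):                           # 1-3. POST /api/scan
    task_id = str(uuid.uuid4())
    results[task_id] = {"state": "PENDING"}
    task_queue.put((task_id, sitemap_url))
    return task_id

def poll_status(task_id):                              # 8-11. GET /api/status
    return results.get(task_id, {"state": "PENDING"})

tid = start_scan("https://example.com/sitemap.xml")
while poll_status(tid)["state"] != "SUCCESS":
    time.sleep(0.01)
print(poll_status(tid)["state"])  # → SUCCESS
```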


Notes

Issues Worth Addressing

  1. Regex injection risk (scanner.py) — raw user input is passed directly to re.search() with no validation; malformed patterns (e.g. unbalanced parentheses) can crash Celery workers.
  2. Hardcoded test values (templates/index.html) — default form values point to healymarketinggroup.com / Microsoft; should be cleared or replaced with placeholder examples.
  3. No Celery task timeout (tasks.py) — long-running scans can hang indefinitely; no time_limit or soft_time_limit configured on the task.
  4. No rate limiting (scanner.py) — requests to target sites are made without throttling, which may trigger rate limits or IP blocks on large sitemaps.
  5. Unused variable (scanner.py:142) — x=1 is initialized outside the if __name__ block; should be moved inside for clarity.
  6. Silent failure on non-HTML content (scanner.py) — returns False without logging when the response Content-Type is not HTML, making it hard to distinguish from "not found".
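For item 1, one low-cost mitigation is to compile the user's pattern before it ever reaches a worker, so malformed regexes are rejected at the web layer. The validate_pattern helper below is a hypothetical sketch, not part of scanner.py:

```python
# Hypothetical guard against malformed user-supplied regexes (item 1 above).
import re

def validate_pattern(pattern: str) -> re.Pattern:
    """Compile the user's pattern, raising ValueError on invalid syntax."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"Invalid search pattern {pattern!r}: {exc}") from exc

print(validate_pattern(r"\d{3}-\d{4}").pattern)   # valid pattern compiles
try:
    validate_pattern("(unbalanced")               # invalid: rejected up front
except ValueError as exc:
    print(exc)
```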

License

Copyright 2026 Bill Roth

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

About

A web app for scanning website pages for links.
