A flexible, configurable web content spider for extracting structured content from websites with different layouts. Perfect for mirroring API documentation, creating offline archives, or building custom knowledge bases from multiple sources.
- Path-specific rules: Apply different extraction patterns based on URL paths or domains
- Content transformation: Extract content as Markdown or HTML with customizable selectors
- Parallel processing: Multi-threaded crawling for efficiency
- Configurable: YAML/TOML configuration files or command-line options
- Polite crawling: Built-in rate limiting and respectful bot behavior
- Cross-domain support: Follow specific external links with domain-specific rules
- Intuitive file organization: Maintains source URL structure in output files
```bash
# Clone the repository
git clone https://github.com/tahouse/markdown-spider.git
cd markdown-spider

# Install dependencies
pip install -r requirements.txt
```

Requirements:

- Python 3.7+
- BeautifulSoup4
- Requests
- Click
- PyYAML
- Markdownify
```bash
# Crawl a website with default settings
python crawl.py --url https://example.com --output-dir ./example_docs
```

```bash
# Generate a sample configuration file
python crawl.py --generate-config my_config.yaml

# Edit the configuration file with your settings,
# then run the spider with your config
python crawl.py --config my_config.yaml --debug
```

Options:
```text
-u, --url TEXT              Base URL to start crawling from
-o, --output-dir TEXT       Directory to save files
-d, --max-depth INTEGER     Maximum crawl depth
-t, --num-threads INTEGER   Number of worker threads
-r, --throttle FLOAT        Delay between requests in seconds
--debug                     Enable debug logging
--domain-only               Only crawl URLs on the same domain
-f, --format [md|html]      Output format
--user-agent TEXT           Custom User-Agent string
--max-children INTEGER      Maximum number of child URLs to process per page
-c, --config TEXT           Path to YAML or TOML configuration file
-g, --generate-config TEXT  Generate a sample configuration file
--help                      Show this message and exit
```
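As a sketch of how options like these map onto Click (a listed dependency), the snippet below wires up a subset of the table above. The function body is a hypothetical placeholder, not the spider's actual implementation:

```python
import click

@click.command()
@click.option("-u", "--url", help="Base URL to start crawling from")
@click.option("-o", "--output-dir", default="./output", help="Directory to save files")
@click.option("-d", "--max-depth", type=int, default=3, help="Maximum crawl depth")
@click.option("-r", "--throttle", type=float, default=0.5, help="Delay between requests in seconds")
@click.option("-f", "--format", "fmt", type=click.Choice(["md", "html"]), default="md", help="Output format")
@click.option("--debug", is_flag=True, help="Enable debug logging")
def crawl(url, output_dir, max_depth, throttle, fmt, debug):
    """Crawl a website and save extracted content (placeholder body)."""
    click.echo(f"Crawling {url} to depth {max_depth}, saving to {output_dir}")
```

In the real script this command would be invoked under a `if __name__ == "__main__":` guard.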
```yaml
# Base configuration
url: "https://example.com/docs/"
output_dir: "./example_docs"
max_depth: 3
num_threads: 8
throttle: 0.5
same_domain_only: false
file_extension: ".md"

# HTTP settings
headers:
  "User-Agent": "My Web Crawler"
  "Accept-Language": "en-US,en;q=0.9"
timeout: 10

# Path-specific configurations
path_configs:
  - path_prefix: "https://example.com/docs/"
    target_content: ["main", "article.content"]
    ignore_selectors:
      - "nav"
      - "footer"
      - ".sidebar"
    exclude_patterns:
      - "/deprecated/"
      - "/beta/"
    description: "Main documentation"
  - path_prefix: "https://api.example.com/"
    target_content: ["div.api-content"]
    ignore_selectors:
      - "header"
      - ".api-sidebar"
    description: "API reference"
```

Example: crawling the Pulumi GCP API docs while following links into the Google Cloud documentation:

```yaml
url: "https://www.pulumi.com/registry/packages/gcp/api-docs/"
output_dir: "./pulumi_gcp_docs"
max_depth: 3
num_threads: 12
same_domain_only: false
path_configs:
  - path_prefix: "https://www.pulumi.com/registry/packages/gcp/api-docs/"
    target_content: ["div.docs-main-content"]
    ignore_selectors:
      - "nav"
      - "footer"
      - ".header-nav"
    exclude_patterns:
      - "/typescript/"
      - "/go/"
      - "/csharp/"
    description: "Pulumi GCP API docs"
  - path_prefix: "https://cloud.google.com/"
    target_content: [".devsite-article-body", "main", "article"]
    ignore_selectors:
      - "nav"
      - "header"
      - "footer"
    description: "Google Cloud documentation"
```

Example: archiving a tech blog:

```yaml
url: "https://techblog.example.com/"
output_dir: "./blog_archive"
max_depth: 2
file_extension: ".md"
path_configs:
  - path_prefix: "https://techblog.example.com/articles/"
    target_content: ["article.post-content"]
    ignore_selectors:
      - "aside"
      - ".author-bio"
      - ".social-share"
    description: "Blog articles"
```

```bash
# Follow specific external links with domain-specific rules
python crawl.py --url https://startingsite.com --domain-only false --config external_rules.yaml

# Test with a small, shallow crawl first
python crawl.py --url https://example.com --max-depth 1 --max-children 2 --debug

# Extract content as HTML instead of Markdown
python crawl.py --url https://example.com --format html
```

- Be respectful: Use appropriate throttling (delay between requests)
- Set descriptive User-Agent: Identify your spider appropriately
- Test with small crawls: Use `--max-depth` and `--max-children` for testing
- Check robots.txt: Ensure you're allowed to crawl the target site
- Update selectors: Website layouts may change; keep your selectors updated
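For the robots.txt check, Python's standard library can evaluate rules before you crawl. This standalone sketch is not part of crawl.py; the rules and URLs below are illustrative:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(rules, "My Web Crawler", "https://example.com/docs/"))       # True
print(allowed_by_robots(rules, "My Web Crawler", "https://example.com/private/x"))   # False
```

In a live check you would fetch `/robots.txt` from the site root first (or use `RobotFileParser.set_url` plus `read()`).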
- Empty content files: Check your CSS selectors in `target_content`
- Missing pages: Verify URL patterns in `valid_paths` and check `exclude_patterns`
- Slow crawling: Adjust `num_threads` and `throttle` values
- HTTP errors: Check your `headers` and ensure the site allows crawling
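When content files come out empty, it helps to test `target_content` and `ignore_selectors` against a saved page outside the crawler. A minimal sketch with BeautifulSoup (a listed dependency); the HTML here is a stand-in for a fetched page:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>site menu</nav>
  <main><h1>Title</h1><p>Body text.</p></main>
  <footer>copyright</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop elements matched by ignore_selectors, as the spider's config would
for selector in ["nav", "footer"]:
    for element in soup.select(selector):
        element.decompose()

# Keep the first target_content selector that matches anything
for selector in ["main", "article.content"]:
    matches = soup.select(selector)
    if matches:
        print(matches[0].get_text(strip=True))
        break
```

If nothing prints, the selectors do not match the page's markup and need updating.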
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Distributed under the Apache 2.0 License. See LICENSE for more information.
Your Name - @yourtwitter - email@example.com
Project Link: https://github.com/yourusername/web-content-spider
Made with ❤️ for web content preservation and knowledge sharing