A Streamlit-based web application that leverages Crawl4AI to analyze and visualize website structures. This tool helps users understand website hierarchies by generating an interactive directory tree of any website's structure.
- Crawls any website URL using Crawl4AI
- Generates a hierarchical directory tree structure
- Filters and processes internal links
- Handles various URL patterns and structures
- Clean, intuitive Streamlit interface
- Simple URL input with validation
- Clear directory structure display
- Error handling and feedback
- Comprehensive logging system
- Timestamped log files
- Multiple log levels (INFO, DEBUG, WARNING, ERROR)
- Detailed crawl and processing information
streamlit>=1.29.0
pytest>=7.4.0
pytest-asyncio>=0.23.0
crawl4ai>=0.4.0
crawl4ai/
├── app.py # Main Streamlit application
├── logs/ # Directory for log files
├── requirements.txt # Project dependencies
└── README.md # Project documentation
- Install dependencies:
  pip install -r requirements.txt
- Start the Streamlit app:
  streamlit run app.py
- Enter any website URL to generate its directory structure
# Main async function handling the crawl and structure generation
async def get_website_structure(url: str):
    # Configure crawler
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    try:
        async with AsyncWebCrawler(config=browser_config) as crawler:
            # Perform the crawl
            result = await crawler.arun(url=url, config=crawler_config)
            if result.success and hasattr(result, 'links'):
                # Generate directory structure
                structure = build_directory_structure(url, result.links)
                if structure:
                    tree_text = format_directory_tree(structure)
                    return True, tree_text
            return False, "Crawl completed but no internal links were found"
    except Exception as e:
        logger.error(f"Error during crawl: {e}")
        return False, f"Error during crawl: {e}"

def build_directory_structure(base_url: str, links: dict) -> dict:
    from urllib.parse import urlparse

    structure = {}
    base_domain = urlparse(base_url).netloc
    # Process only internal links
    internal_links = links.get('internal', [])
    logger.info(f"Processing {len(internal_links)} internal links from {base_url}")
    for link in internal_links:
        if 'href' not in link:
            continue
        url = link['href']
        parsed = urlparse(url)
        # Skip links that point to a different domain
        if parsed.netloc and parsed.netloc != base_domain:
            continue
        # Extract and clean the path; skip empty and anchor-only links
        path = parsed.path.strip('/')
        if not path or url.startswith('#'):
            continue
        # Build tree structure
        parts = [p for p in path.split('/') if p]
        current = structure
        for part in parts:
            if part not in current:
                current[part] = {}
            current = current[part]
    return structure

def format_directory_tree(structure: dict, prefix: str = "") -> str:
    lines = []
    items = list(structure.items())
    for i, (name, subtree) in enumerate(items):
        # Determine if this is the last item at this level
        is_last = i == len(items) - 1
        # Choose the appropriate connector symbols
        connector = "└── " if is_last else "├── "
        child_prefix = "    " if is_last else "│   "
        # Add this item to the output
        lines.append(f"{prefix}{connector}{name}")
        # Recursively process children
        if subtree:
            lines.append(format_directory_tree(subtree, prefix + child_prefix))
    return "\n".join(lines)

# Setup logging configuration
log_dir = 'logs'
os.makedirs(log_dir, exist_ok=True)
log_file = os.path.join(log_dir, f'crawl_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
logger = logging.getLogger('crawl4ai_app')
logger.setLevel(logging.DEBUG)
# File handler with formatting
fh = logging.FileHandler(log_file)
fh.setLevel(logging.DEBUG)
fh_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
fh.setFormatter(fh_formatter)
logger.addHandler(fh)
# Example log entries
logger.info("Starting crawl with configuration:")
logger.debug("Processing URL: example.com")
logger.warning("No subdirectories found")
logger.error("Failed to connect to website")

- URL Input & Validation
  url = st.text_input("Enter Website URL", help="Enter the full URL including http:// or https://")
- Crawl Process
  - User submits URL
  - AsyncWebCrawler initiates crawl
  - Retrieves all links from the website
  - Filters internal vs external links
- Structure Building
  - Process internal links
  - Extract paths
  - Build hierarchical dictionary
  - Format as tree structure
- Output Generation
  # Example output format
  Directory structure:
  ├── docs
  │   ├── api
  │   │   └── reference
  │   └── guides
  └── blog
      └── posts
- Logging Flow
  - Configuration details logged at start
  - URL processing logged as DEBUG
  - Structure generation logged as INFO
  - Errors and warnings captured
  - All data written to timestamped log file
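The structure-building and tree-formatting steps in the flow above can be sketched end to end. This is a self-contained sketch using hypothetical sample data shaped like the `links` dictionary Crawl4AI returns (`example.com` and the paths are illustrative, not from a real crawl):

```python
from urllib.parse import urlparse

def build_directory_structure(base_url: str, links: dict) -> dict:
    """Fold same-domain link paths into a nested dict."""
    structure = {}
    base_domain = urlparse(base_url).netloc
    for link in links.get('internal', []):
        if 'href' not in link:
            continue
        parsed = urlparse(link['href'])
        if parsed.netloc and parsed.netloc != base_domain:
            continue  # external link, skip
        path = parsed.path.strip('/')
        if not path:
            continue  # root or anchor-only link
        current = structure
        for part in path.split('/'):
            current = current.setdefault(part, {})
    return structure

def format_directory_tree(structure: dict, prefix: str = "") -> str:
    """Render the nested dict with box-drawing connectors."""
    lines = []
    items = list(structure.items())
    for i, (name, subtree) in enumerate(items):
        is_last = i == len(items) - 1
        connector = "└── " if is_last else "├── "
        child_prefix = "    " if is_last else "│   "
        lines.append(f"{prefix}{connector}{name}")
        if subtree:
            lines.append(format_directory_tree(subtree, prefix + child_prefix))
    return "\n".join(lines)

# Hypothetical crawl result in Crawl4AI's links shape
links = {'internal': [
    {'href': 'https://example.com/docs/api'},
    {'href': 'https://example.com/docs/guides'},
    {'href': 'https://example.com/blog'},
]}
tree = format_directory_tree(build_directory_structure('https://example.com', links))
print(tree)
# ├── docs
# │   ├── api
# │   └── guides
# └── blog
```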
- Add content extraction functionality
- Enhance error handling and user feedback
- Add progress indicators for long-running operations
- Implement caching for improved performance
- Add support for multi-URL crawling
- Add performance benchmarks and monitoring
- Universal URL Support
  - Enhanced the directory structure generation to work with any website URL
  - Removed dependency on specific URL patterns (previously limited to mkdocs)
  - Properly handles URLs from the same domain while filtering external links
- Improved Path Processing
  - Added robust URL parsing using urllib.parse
  - Better handling of path components and subdirectories
  - Filters out anchor links and empty path segments
- Diagnostic Logging System
  - Implemented comprehensive logging system
  - Logs are stored in timestamped files under the logs directory
  - Different log levels for various types of information:
    - INFO: General process information
    - DEBUG: Detailed data about paths and links
    - WARNING: Non-critical issues
    - ERROR: Exceptions and critical problems
  - Log format includes timestamps and severity levels
- UI Improvements
  - Cleaned up the user interface
  - Moved diagnostic information to log files
  - Focus on displaying the final directory structure
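The path processing described in the changelog can be illustrated with a short urllib.parse sketch (the URLs are hypothetical):

```python
from urllib.parse import urlparse

# External links are detected by comparing netloc against the base domain
base_domain = urlparse("https://example.com").netloc
external = urlparse("https://other.com/page")
print(external.netloc != base_domain)  # True: filtered out

# Path components are split and cleaned; empty segments disappear
path = urlparse("https://example.com/docs//guides/").path.strip("/")
parts = [p for p in path.split("/") if p]
print(parts)  # ['docs', 'guides']

# Anchor-only links parse to an empty path, so they are skipped
print(urlparse("#section").path)  # ''
```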
Log files are created with the following format:
crawl_YYYYMMDD_HHMMSS.log
Each log entry contains:
- Timestamp
- Log level
- Detailed message
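The entry format above follows from the handler configuration shown earlier. A minimal sketch, writing to a temporary directory instead of the app's logs/ folder (the logger name here is illustrative):

```python
import logging
import os
import tempfile
from datetime import datetime

# Timestamped file name matching the crawl_YYYYMMDD_HHMMSS.log pattern
log_dir = tempfile.mkdtemp()
log_file = os.path.join(
    log_dir, f"crawl_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
)

logger = logging.getLogger("crawl4ai_app_demo")
logger.setLevel(logging.DEBUG)
fh = logging.FileHandler(log_file)
fh.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logger.addHandler(fh)

logger.info("Starting crawl with configuration:")
logger.debug("Processing URL: example.com")
fh.flush()

with open(log_file) as f:
    entries = f.read().splitlines()

# Each entry is "timestamp - level - message"
print(entries[0].split(" - ", 2)[1])  # 'INFO'
```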
The logs capture:
- Crawl configuration details
- Processing of internal links
- Path generation information
- Any errors or exceptions
- Final structure generation status
The application now works with any website URL and automatically generates a directory structure based on the site's organization. The diagnostic information is available in the logs directory for debugging and monitoring purposes.
When contributing to this project, please:
- Update documentation as needed