DocuCraft

A Streamlit-based web application that leverages Crawl4AI to analyze and visualize website structures. This tool helps users understand website hierarchies by generating an interactive directory tree of any website's structure.

Features

1. Website Structure Analysis

  • Crawls any website URL using Crawl4AI
  • Generates a hierarchical directory tree structure
  • Filters and processes internal links
  • Handles various URL patterns and structures

2. User Interface

  • Clean, intuitive Streamlit interface
  • Simple URL input with validation
  • Clear directory structure display
  • Error handling and feedback

3. Diagnostic Logging

  • Comprehensive logging system
  • Timestamped log files
  • Multiple log levels (INFO, DEBUG, WARNING, ERROR)
  • Detailed crawl and processing information

Dependencies

streamlit>=1.29.0
pytest>=7.4.0
pytest-asyncio>=0.23.0
crawl4ai>=0.4.0

Project Structure

crawl4ai/
├── app.py                    # Main Streamlit application
├── logs/                     # Directory for log files
├── requirements.txt          # Project dependencies
└── README.md                 # Project documentation

Running the Application

  1. Install dependencies:

     pip install -r requirements.txt

  2. Start the Streamlit app:

     streamlit run app.py

  3. Enter any website URL to generate its directory structure

How It Works

1. Core Application Flow

# Main async function handling the crawl and structure generation
async def get_website_structure(url: str):
    # Configure crawler
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    try:
        async with AsyncWebCrawler(config=browser_config) as crawler:
            # Perform the crawl
            result = await crawler.arun(url=url, config=crawler_config)

            if result.success and hasattr(result, 'links'):
                # Generate directory structure
                structure = build_directory_structure(url, result.links)
                if structure:
                    tree_text = format_directory_tree(structure)
                    return True, tree_text
            # Crawl finished but yielded no usable links or structure
            return False, "No internal links found for this URL"
    except Exception as e:
        logger.error(f"Error during crawl: {str(e)}")
        return False, f"Error during crawl: {str(e)}"

2. Directory Structure Generation

def build_directory_structure(base_url: str, links: dict) -> dict:
    from urllib.parse import urlparse
    structure = {}
    base_domain = urlparse(base_url).netloc
    
    # Process only internal links
    internal_links = links.get('internal', [])
    logger.info(f"Processing {len(internal_links)} internal links from {base_url}")
    
    for link in internal_links:
        if 'href' not in link:
            continue
            
        url = link['href']
        parsed = urlparse(url)
        
        # Filter same-domain URLs
        if parsed.netloc and parsed.netloc != base_domain:
            continue
            
        # Extract and clean path (urlparse strips #fragments into .fragment,
        # so anchor-only links produce an empty path and are skipped here)
        path = parsed.path.strip('/')
        if not path:
            continue
            
        # Build tree structure
        parts = [p for p in path.split('/') if p]
        current = structure
        for part in parts:
            if part not in current:
                current[part] = {}
            current = current[part]
    
    return structure
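The function above can be exercised standalone against a hand-built links payload. A minimal sketch with the logger calls dropped and the tree-building loop condensed via setdefault (the sample URLs are illustrative):

```python
from urllib.parse import urlparse

def build_directory_structure(base_url: str, links: dict) -> dict:
    # Same logic as above, minus logging
    structure = {}
    base_domain = urlparse(base_url).netloc
    for link in links.get('internal', []):
        if 'href' not in link:
            continue
        parsed = urlparse(link['href'])
        # Skip links that point at a different domain
        if parsed.netloc and parsed.netloc != base_domain:
            continue
        path = parsed.path.strip('/')
        if not path:
            continue
        # Descend/create one dict level per path segment
        current = structure
        for part in path.split('/'):
            if part:
                current = current.setdefault(part, {})
    return structure

links = {'internal': [
    {'href': 'https://example.com/docs/api'},
    {'href': 'https://example.com/docs/guides'},
    {'href': 'https://other.com/page'},    # dropped: external domain
    {'href': 'https://example.com/'},      # dropped: empty path
]}
print(build_directory_structure('https://example.com', links))
# → {'docs': {'api': {}, 'guides': {}}}
```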

3. Tree Formatting

def format_directory_tree(structure: dict, prefix: str = "") -> str:
    lines = []
    items = list(structure.items())
    
    for i, (name, subtree) in enumerate(items):
        # Determine if this is the last item at this level
        is_last = i == len(items) - 1
        
        # Choose the appropriate connector symbols
        connector = "└── " if is_last else "├── "
        child_prefix = "    " if is_last else "│   "
        
        # Add this item to the output
        lines.append(f"{prefix}{connector}{name}")
        
        # Recursively process children
        if subtree:
            lines.append(format_directory_tree(subtree, prefix + child_prefix))
    
    return "\n".join(lines)

4. Logging System

import os
import logging
from datetime import datetime

# Setup logging configuration
log_dir = 'logs'
os.makedirs(log_dir, exist_ok=True)
log_file = os.path.join(log_dir, f'crawl_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')

logger = logging.getLogger('crawl4ai_app')
logger.setLevel(logging.DEBUG)

# File handler with formatting
fh = logging.FileHandler(log_file)
fh.setLevel(logging.DEBUG)
fh_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
fh.setFormatter(fh_formatter)
logger.addHandler(fh)

# Example log entries
logger.info("Starting crawl with configuration:")
logger.debug("Processing URL: example.com")
logger.warning("No subdirectories found")
logger.error("Failed to connect to website")

Data Flow

  1. URL Input & Validation

    url = st.text_input(
        "Enter Website URL",
        help="Enter the full URL including http:// or https://"
    )
  2. Crawl Process

    • User submits URL
    • AsyncWebCrawler initiates crawl
    • Retrieves all links from the website
    • Filters internal vs external links
  3. Structure Building

    • Process internal links
    • Extract paths
    • Build hierarchical dictionary
    • Format as tree structure
  4. Output Generation

    # Example output format
    Directory structure:
    ├── docs
    │   ├── api
    │   │   └── reference
    │   └── guides
    └── blog
        └── posts
  5. Logging Flow

    • Configuration details logged at start
    • URL processing logged as DEBUG
    • Structure generation logged as INFO
    • Errors and warnings captured
    • All data written to timestamped log file

Next Steps

  1. Add content extraction functionality
  2. Enhance error handling and user feedback
  3. Add progress indicators for long-running operations
  4. Implement caching for improved performance
  5. Add support for multi-URL crawling
  6. Add performance benchmarks and monitoring

Recent Updates (2025-02-09)

Directory Structure Generation Improvements

  1. Universal URL Support

    • Enhanced the directory structure generation to work with any website URL
    • Removed dependency on specific URL patterns (previously limited to mkdocs)
    • Properly handles URLs from the same domain while filtering external links
  2. Improved Path Processing

    • Added robust URL parsing using urllib.parse
    • Better handling of path components and subdirectories
    • Filters out anchor links and empty path segments
  3. Diagnostic Logging System

    • Implemented comprehensive logging system
    • Logs are stored in timestamped files under the logs directory
    • Different log levels for various types of information:
      • INFO: General process information
      • DEBUG: Detailed data about paths and links
      • WARNING: Non-critical issues
      • ERROR: Exceptions and critical problems
    • Log format includes timestamps and severity levels
  4. UI Improvements

    • Cleaned up the user interface
    • Moved diagnostic information to log files
    • Focus on displaying the final directory structure
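The path processing described in item 2 leans on urllib.parse from the standard library; an illustrative look at how it separates the pieces the crawler filters on (the example URL is made up):

```python
from urllib.parse import urlparse

parsed = urlparse('https://example.com/docs/api/?q=1#section')
print(parsed.netloc)    # 'example.com'  -> used for the same-domain check
print(parsed.path)      # '/docs/api/'   -> stripped and split into tree segments
print(parsed.fragment)  # 'section'      -> anchors never leak into the path
```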

Log File Details

Log files are created with the following format:

crawl_YYYYMMDD_HHMMSS.log

Each log entry contains:

  • Timestamp
  • Log level
  • Detailed message
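With the `%(asctime)s - %(levelname)s - %(message)s` formatter configured in the logging setup, an entry's shape can be reproduced standalone (the record fields here are illustrative):

```python
import logging

formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
# Build a record by hand, as the logger would for logger.info(...)
record = logging.LogRecord('crawl4ai_app', logging.INFO, 'app.py', 0,
                           'Starting crawl with configuration:', None, None)
print(formatter.format(record))
# e.g. 2025-02-09 14:32:01,123 - INFO - Starting crawl with configuration:
```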

The logs capture:

  • Crawl configuration details
  • Processing of internal links
  • Path generation information
  • Any errors or exceptions
  • Final structure generation status

Usage

The application now works with any website URL and automatically generates a directory structure based on the site's organization. The diagnostic information is available in the logs directory for debugging and monitoring purposes.

Contributing

When contributing to this project, please update the documentation as needed alongside your changes.
