DocuCraft

A Streamlit-based web application that leverages Crawl4AI to analyze and visualize website structures. This tool helps users understand website hierarchies by generating an interactive directory tree of any website's structure.

Features

1. Website Structure Analysis

  • Crawls any website URL using Crawl4AI
  • Generates a hierarchical directory tree structure
  • Filters and processes internal links
  • Handles various URL patterns and structures

2. User Interface

  • Clean, intuitive Streamlit interface
  • Simple URL input with validation
  • Clear directory structure display
  • Error handling and feedback

3. Diagnostic Logging

  • Comprehensive logging system
  • Timestamped log files
  • Multiple log levels (INFO, DEBUG, WARNING, ERROR)
  • Detailed crawl and processing information

Dependencies

streamlit>=1.29.0
pytest>=7.4.0
pytest-asyncio>=0.23.0
crawl4ai>=0.4.0

Project Structure

crawl4ai/
├── app.py                    # Main Streamlit application
├── logs/                     # Directory for log files
├── requirements.txt          # Project dependencies
└── README.md                 # Project documentation

Running the Application

  1. Install dependencies:

     pip install -r requirements.txt

  2. Start the Streamlit app:

     streamlit run app.py

  3. Enter any website URL to generate its directory structure

How It Works

1. Core Application Flow

# Main async function handling the crawl and structure generation
async def get_website_structure(url: str):
    # Configure crawler
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    try:
        async with AsyncWebCrawler(config=browser_config) as crawler:
            # Perform the crawl
            result = await crawler.arun(url=url, config=crawler_config)

            if result.success and hasattr(result, 'links'):
                # Generate directory structure
                structure = build_directory_structure(url, result.links)
                if structure:
                    tree_text = format_directory_tree(structure)
                    return True, tree_text
            # Crawl finished but yielded no usable links or structure
            return False, "No internal links found for this URL"
    except Exception as e:
        logger.error(f"Error during crawl: {str(e)}")
        return False, f"Error during crawl: {str(e)}"

2. Directory Structure Generation

def build_directory_structure(base_url: str, links: dict) -> dict:
    from urllib.parse import urlparse
    structure = {}
    base_domain = urlparse(base_url).netloc
    
    # Process only internal links
    internal_links = links.get('internal', [])
    logger.info(f"Processing {len(internal_links)} internal links from {base_url}")
    
    for link in internal_links:
        if 'href' not in link:
            continue
            
        url = link['href']
        parsed = urlparse(url)
        
        # Filter same-domain URLs
        if parsed.netloc and parsed.netloc != base_domain:
            continue
            
        # Extract and clean path (urlparse strips #fragments into .fragment,
        # so anchor-only links produce an empty path and are skipped here)
        path = parsed.path.strip('/')
        if not path:
            continue
            
        # Build tree structure
        parts = [p for p in path.split('/') if p]
        current = structure
        for part in parts:
            if part not in current:
                current[part] = {}
            current = current[part]
    
    return structure
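The function above can be exercised standalone against a hand-built links payload. A minimal sketch with the logger calls dropped and the tree-building loop condensed via setdefault (the sample URLs are illustrative):

```python
from urllib.parse import urlparse

def build_directory_structure(base_url: str, links: dict) -> dict:
    # Same logic as above, minus logging
    structure = {}
    base_domain = urlparse(base_url).netloc
    for link in links.get('internal', []):
        if 'href' not in link:
            continue
        parsed = urlparse(link['href'])
        # Skip links that point at a different domain
        if parsed.netloc and parsed.netloc != base_domain:
            continue
        path = parsed.path.strip('/')
        if not path:
            continue
        # Descend/create one dict level per path segment
        current = structure
        for part in path.split('/'):
            if part:
                current = current.setdefault(part, {})
    return structure

links = {'internal': [
    {'href': 'https://example.com/docs/api'},
    {'href': 'https://example.com/docs/guides'},
    {'href': 'https://other.com/page'},    # dropped: external domain
    {'href': 'https://example.com/'},      # dropped: empty path
]}
print(build_directory_structure('https://example.com', links))
# → {'docs': {'api': {}, 'guides': {}}}
```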

3. Tree Formatting

def format_directory_tree(structure: dict, prefix: str = "") -> str:
    lines = []
    items = list(structure.items())
    
    for i, (name, subtree) in enumerate(items):
        # Determine if this is the last item at this level
        is_last = i == len(items) - 1
        
        # Choose the appropriate connector symbols
        connector = "└── " if is_last else "├── "
        child_prefix = "    " if is_last else "│   "
        
        # Add this item to the output
        lines.append(f"{prefix}{connector}{name}")
        
        # Recursively process children
        if subtree:
            lines.append(format_directory_tree(subtree, prefix + child_prefix))
    
    return "\n".join(lines)

4. Logging System

import os
import logging
from datetime import datetime

# Setup logging configuration
log_dir = 'logs'
os.makedirs(log_dir, exist_ok=True)
log_file = os.path.join(log_dir, f'crawl_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')

logger = logging.getLogger('crawl4ai_app')
logger.setLevel(logging.DEBUG)

# File handler with formatting
fh = logging.FileHandler(log_file)
fh.setLevel(logging.DEBUG)
fh_formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
fh.setFormatter(fh_formatter)
logger.addHandler(fh)

# Example log entries
logger.info("Starting crawl with configuration:")
logger.debug("Processing URL: example.com")
logger.warning("No subdirectories found")
logger.error("Failed to connect to website")

Data Flow

  1. URL Input & Validation

    url = st.text_input(
        "Enter Website URL",
        help="Enter the full URL including http:// or https://"
    )
  2. Crawl Process

    • User submits URL
    • AsyncWebCrawler initiates crawl
    • Retrieves all links from the website
    • Filters internal vs external links
  3. Structure Building

    • Process internal links
    • Extract paths
    • Build hierarchical dictionary
    • Format as tree structure
  4. Output Generation

    # Example output format
    Directory structure:
    ├── docs
    │   ├── api
    │   │   └── reference
    │   └── guides
    └── blog
        └── posts
  5. Logging Flow

    • Configuration details logged at start
    • URL processing logged as DEBUG
    • Structure generation logged as INFO
    • Errors and warnings captured
    • All data written to timestamped log file

Next Steps

  1. Add content extraction functionality
  2. Enhance error handling and user feedback
  3. Add progress indicators for long-running operations
  4. Implement caching for improved performance
  5. Add support for multi-URL crawling
  6. Add performance benchmarks and monitoring

Recent Updates (2025-02-09)

Directory Structure Generation Improvements

  1. Universal URL Support

    • Enhanced the directory structure generation to work with any website URL
    • Removed dependency on specific URL patterns (previously limited to mkdocs)
    • Properly handles URLs from the same domain while filtering external links
  2. Improved Path Processing

    • Added robust URL parsing using urllib.parse
    • Better handling of path components and subdirectories
    • Filters out anchor links and empty path segments
  3. Diagnostic Logging System

    • Implemented comprehensive logging system
    • Logs are stored in timestamped files under the logs directory
    • Different log levels for various types of information:
      • INFO: General process information
      • DEBUG: Detailed data about paths and links
      • WARNING: Non-critical issues
      • ERROR: Exceptions and critical problems
    • Log format includes timestamps and severity levels
  4. UI Improvements

    • Cleaned up the user interface
    • Moved diagnostic information to log files
    • Focus on displaying the final directory structure
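The path processing described in item 2 leans on urllib.parse from the standard library; an illustrative look at how it separates the pieces the crawler filters on (the example URL is made up):

```python
from urllib.parse import urlparse

parsed = urlparse('https://example.com/docs/api/?q=1#section')
print(parsed.netloc)    # 'example.com'  -> used for the same-domain check
print(parsed.path)      # '/docs/api/'   -> stripped and split into tree segments
print(parsed.fragment)  # 'section'      -> anchors never leak into the path
```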

Log File Details

Log files are created with the following format:

crawl_YYYYMMDD_HHMMSS.log

Each log entry contains:

  • Timestamp
  • Log level
  • Detailed message
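With the `%(asctime)s - %(levelname)s - %(message)s` formatter configured in the logging setup, an entry's shape can be reproduced standalone (the record fields here are illustrative):

```python
import logging

formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
# Build a record by hand, as the logger would for logger.info(...)
record = logging.LogRecord('crawl4ai_app', logging.INFO, 'app.py', 0,
                           'Starting crawl with configuration:', None, None)
print(formatter.format(record))
# e.g. 2025-02-09 14:32:01,123 - INFO - Starting crawl with configuration:
```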

The logs capture:

  • Crawl configuration details
  • Processing of internal links
  • Path generation information
  • Any errors or exceptions
  • Final structure generation status

Usage

The application now works with any website URL and automatically generates a directory structure based on the site's organization. The diagnostic information is available in the logs directory for debugging and monitoring purposes.

Contributing

When contributing to this project, please update the documentation as needed alongside your changes.
