fix(mdict): resolve database compatibility and non-ASCII filename issues by sdy623 · Pull Request #2 · VimWei/MdxScraper

sdy623 · 2025-10-07T06:26:56Z

BREAKING CHANGES:

Force rebuild of incompatible .mdx.db files created by other tools
Validate table structure before use (9 columns required)

Changes:

Add database table structure validation on initialization
Implement automatic index rebuild for incompatible databases
Fix SQL injection vulnerabilities in lookup_indexes() and get_keys()
Add comprehensive error handling for sqlite3 operations
Validate query result tuple length before accessing indices
Replace string formatting with parameterized queries for security

Database Schema Validation:

Check MDX_INDEX table exists before use
Verify 9-column structure: key_text, file_path, file_pos, compressed_size, decompressed_size, record_block_type, record_start, record_end, offset
Compare actual columns against expected schema
Auto-rebuild if structure mismatch detected
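
For reviewers, the validation step can be sketched as below. This is a minimal illustration, not the code in this PR: the function name `index_is_compatible` is hypothetical, and the sketch assumes column order matters because rows are later read by position (for example `result[8]`).

```python
import sqlite3

# Column names exactly as listed in the PR description.
EXPECTED_COLUMNS = [
    "key_text", "file_path", "file_pos", "compressed_size",
    "decompressed_size", "record_block_type", "record_start",
    "record_end", "offset",
]


def index_is_compatible(conn: sqlite3.Connection) -> bool:
    """Return True if MDX_INDEX exists with the expected 9 columns.

    Hypothetical helper: the name and the strict ordering check are
    assumptions made for this sketch.
    """
    try:
        rows = conn.execute("PRAGMA table_info(MDX_INDEX)").fetchall()
    except sqlite3.Error:
        return False
    # PRAGMA table_info yields one row per column, with the column name
    # at index 1. A missing table yields no rows at all.
    actual = [row[1] for row in rows]
    return actual == EXPECTED_COLUMNS
```

A caller would rebuild the index whenever this returns `False`, which covers both a missing table and a table written by another tool.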

Security Improvements:

Replace unsafe SQL string formatting with parameterized queries
Prevent SQL injection in keyword lookups
Sanitize wildcard queries (* → %)
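
The following sketch shows the general shape of a parameterized wildcard lookup. It assumes the `MDX_INDEX` table and `key_text` column named above; the function `find_keys` is a hypothetical stand-in and does not reproduce the actual `get_keys()` signature.

```python
import sqlite3


def find_keys(conn: sqlite3.Connection, pattern=None):
    """Return key_text values, optionally filtered by a '*' wildcard.

    Hypothetical helper for illustration only.
    """
    if pattern is None:
        rows = conn.execute("SELECT key_text FROM MDX_INDEX").fetchall()
    else:
        # Translate the user-facing '*' wildcard to SQL's '%'.
        # Note: '_' and '%' typed by the user remain LIKE wildcards here.
        like = pattern.replace("*", "%")
        # The pattern is bound as a parameter, never formatted into the SQL.
        rows = conn.execute(
            "SELECT key_text FROM MDX_INDEX WHERE key_text LIKE ?",
            (like,),
        ).fetchall()
    return [row[0] for row in rows]
```

Because the pattern is passed as a bound value, a keyword containing quotes is matched literally instead of altering the statement.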

Bug Fixes:

Fix IndexError when accessing result[8] with incompatible databases
Handle databases created by mdict-utils, GoldenDict, or older versions
Gracefully handle corrupt or incomplete database entries
Support MDX files with non-ASCII characters in filename

Error Handling:

Add try/except blocks for sqlite3.Error
Log detailed warnings for incompatible table structures
Skip incomplete index entries instead of crashing
Provide informative error messages for debugging
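
One way the "skip instead of crash" behaviour could look is sketched here. The function name `read_index_rows` and the log wording are assumptions for illustration; only the table name, the `key_text` column, and the 9-field requirement come from the description above.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)

EXPECTED_FIELDS = 9  # per the 9-column schema described in this PR


def read_index_rows(conn: sqlite3.Connection, keyword: str):
    """Return only complete index rows for keyword (hypothetical helper)."""
    try:
        rows = conn.execute(
            "SELECT * FROM MDX_INDEX WHERE key_text = ?", (keyword,)
        ).fetchall()
    except sqlite3.Error as exc:
        # A missing table or corrupt file lands here rather than propagating.
        logger.warning("Index lookup failed for %r: %s", keyword, exc)
        return []

    complete = []
    for row in rows:
        # Check the tuple length before any positional access such as row[8].
        if len(row) < EXPECTED_FIELDS:
            logger.warning(
                "Skipping incomplete index entry for %r: %d of %d fields",
                keyword, len(row), EXPECTED_FIELDS,
            )
            continue
        complete.append(row)
    return complete
```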

Compatibility:

Works with databases from multiple MDX tools
Maintains backward compatibility with existing code
Automatically upgrades old database formats

Fixes: #
Related: Non-ASCII filename support, database version conflicts

Add comprehensive headless API for MDX dictionary queries without GUI dependency, enabling programmatic access to all dictionary features with clean Python interface.

Features:
- Expose core APIs: Dictionary, WordParser, mdx2html, mdx2pdf, mdx2img
- Support both simple queries and batch conversions
- Auto-fallback strategies (case-insensitive, hyphen removal, link following)
- Optional dependency groups: [gui], [conversion], [all]
- Command-line argument support for all example scripts

Core APIs:
- Dictionary: Main interface for MDX/MDD queries with automatic resource management
- WordParser: Parse input files with lesson markers and comments
- mdx2html: Convert word lists to HTML with CSS embedding and image support
- mdx2pdf: Generate PDFs with wkhtmltopdf integration
- mdx2img: Export to images (PNG/JPEG/WEBP) with optimization

Query Word Tools:
- query_word.py: Fast single-word query with complete HTML output
  * Extract and embed dictionary internal CSS styles
  * Auto-embed images as base64 from .mdd files
  * Two-layer CSS system (dictionary + custom beautification)
  * Standalone HTML files ready for offline use
- single_word_query.py: Advanced queries with multiple output modes

Batch Conversion:
- batch_conversion.py: Batch convert word lists to HTML/PDF/images
  * Timestamped output files: YYYYMMDD-HHMMSS_{input_stem}.{ext}
  * Prevent file overwrites with automatic naming
  * Support custom PDF/image options
  * Track invalid words for review

Example Scripts with CLI Arguments:
- basic_query.py: --mdx <path>
- batch_conversion.py: --mdx-file, --input-file, --output-dir
- custom_styles.py: --mdx <path>
- progress_callback.py: --mdx <path>
- query_word.py: <word> --mdx --output [--no-images]
- single_word_query.py: <word> --mdx [--mode simple|complete|custom-css]

Dependency Management:
- Core: No GUI dependencies required
- Optional [gui]: PySide6, markdown for GUI features
- Optional [conversion]: wkhtmltopdf for PDF/image generation
- Install flexibility: pip install .[gui] or .[conversion] or .[all]

CSS Integration:
- Automatic extraction of dictionary internal CSS via merge_css()
- Image embedding with embed_images() for standalone files
- Custom beautification styles layered on top
- Responsive design with mobile support

Documentation:
- HEADLESS_API.md: Complete API reference
- HEADLESS_LIBRARY_SUMMARY.md: Architecture and design decisions
- QUERY_WORD_UPDATE.md: Detailed query_word.py enhancements
- WORD_QUERY_GUIDE.md: Usage guide with examples
- QUERY_SUMMARY.md: Feature comparison matrix
- AUDIO_IMPLEMENTATION_SUMMARY.md: Audio support details
- DATABASE_COMPATIBILITY.md: Table structure compatibility guide
- NON_ASCII_FILENAME_SUPPORT.md: Non-ASCII filename handling

Breaking Changes:
- Restructure pyproject.toml with optional dependency groups
- Update __init__.py to expose headless APIs
- GUI components now optional, install with [gui] extra

Architecture:
- Clean separation of concerns (core, GUI, conversion)
- Context manager support for proper resource cleanup
- Type hints for better IDE support
- Comprehensive error handling and logging

Testing:
- All example scripts support command-line arguments
- Tested with ASCII and non-ASCII filenames
- Validated CSS extraction and image embedding
- Confirmed batch conversion with timestamped outputs

Migration Guide:
- Existing GUI functionality unchanged
- New headless features work alongside GUI
- No breaking changes for current GUI users
- Progressive enhancement approach
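
The timestamped naming scheme `YYYYMMDD-HHMMSS_{input_stem}.{ext}` mentioned for batch conversion could be produced roughly as follows. This is a sketch under stated assumptions: the helper name `timestamped_output` and its parameters are hypothetical and are not taken from batch_conversion.py.

```python
from datetime import datetime
from pathlib import Path


def timestamped_output(input_file, output_dir, ext, now=None):
    """Build an output path of the form YYYYMMDD-HHMMSS_{input_stem}.{ext}.

    Hypothetical helper; `now` is injectable so the result is testable.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    stem = Path(input_file).stem  # works for non-ASCII names as well
    return Path(output_dir) / f"{stamp}_{stem}.{ext.lstrip('.')}"
```

Since the stamp has one-second resolution, two runs on the same input within the same second would still collide, so a real implementation may need an extra existence check.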

sdy623 added 3 commits October 7, 2025 15:25

Add the Audio Process

13d677d


sdy623 wants to merge 3 commits into VimWei:main from sdy623:main