fix(mdict): resolve database compatibility and non-ASCII filename issues #2

Open
sdy623 wants to merge 3 commits into VimWei:main from sdy623:main

sdy623 commented Oct 7, 2025

BREAKING CHANGES:

  • Force rebuild of incompatible .mdx.db files created by other tools
  • Validate table structure before use (9 columns required)

Changes:

  • Add database table structure validation on initialization
  • Implement automatic index rebuild for incompatible databases
  • Fix SQL injection vulnerabilities in lookup_indexes() and get_keys()
  • Add comprehensive error handling for sqlite3 operations
  • Validate query result tuple length before accessing indices
  • Replace string formatting with parameterized queries for security

Database Schema Validation:

  • Check MDX_INDEX table exists before use
  • Verify 9-column structure: key_text, file_path, file_pos, compressed_size, decompressed_size, record_block_type, record_start, record_end, offset
  • Compare actual columns against expected schema
  • Auto-rebuild if structure mismatch detected
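
The validation steps above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `index_schema_is_compatible` is hypothetical, and requiring the columns in exactly the listed order is an assumption.

```python
import sqlite3

# Column names as listed in this PR description.
EXPECTED_COLUMNS = [
    "key_text", "file_path", "file_pos", "compressed_size",
    "decompressed_size", "record_block_type", "record_start",
    "record_end", "offset",
]


def index_schema_is_compatible(conn: sqlite3.Connection) -> bool:
    """Return True if MDX_INDEX exists and has exactly the expected columns."""
    table = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        ("MDX_INDEX",),
    ).fetchone()
    if table is None:
        return False
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    actual = [row[1] for row in conn.execute("PRAGMA table_info(MDX_INDEX)")]
    # Assumption: order matters as well as names.
    return actual == EXPECTED_COLUMNS
```

A caller could trigger the index rebuild whenever this returns False.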

Security Improvements:

  • Replace unsafe SQL string formatting with parameterized queries
  • Prevent SQL injection in keyword lookups
  • Sanitize wildcard queries (* → %)
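
As a rough sketch of the parameterized lookup described above: the function name `find_keys` is hypothetical, and matching wildcard patterns with `LIKE` is an assumption about how the `*` → `%` translation is used.

```python
import sqlite3


def find_keys(conn: sqlite3.Connection, pattern: str) -> list[str]:
    """Look up keys, treating '*' in the pattern as a wildcard."""
    if "*" in pattern:
        # Assumed mapping: '*' becomes SQL's '%' and the match uses LIKE.
        sql = "SELECT key_text FROM MDX_INDEX WHERE key_text LIKE ? ORDER BY key_text"
        params = (pattern.replace("*", "%"),)
    else:
        sql = "SELECT key_text FROM MDX_INDEX WHERE key_text = ? ORDER BY key_text"
        params = (pattern,)
    # The pattern is passed as a bound parameter, never spliced into the SQL.
    return [row[0] for row in conn.execute(sql, params)]
```

Because the keyword only ever travels as a bound parameter, input such as `x' OR '1'='1` is compared as a literal string and matches nothing.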

Bug Fixes:

  • Fix IndexError when accessing result[8] with incompatible databases
  • Handle databases created by mdict-utils, GoldenDict, or older versions
  • Gracefully handle corrupt or incomplete database entries
  • Support MDX files with non-ASCII characters in filename
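
The `result[8]` IndexError fix amounts to checking the tuple length before indexing. A minimal sketch, with the hypothetical helper name `complete_rows`:

```python
EXPECTED_FIELDS = 9  # column count stated in this PR


def complete_rows(rows):
    """Yield only rows that have every expected field; skip the rest."""
    for row in rows:
        if len(row) < EXPECTED_FIELDS:
            # Incompatible or truncated entry: reading row[8] would raise IndexError.
            continue
        yield row
```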

Error Handling:

  • Add try/except blocks for sqlite3.Error
  • Log detailed warnings for incompatible table structures
  • Skip incomplete index entries instead of crashing
  • Provide informative error messages for debugging
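
One way the error handling above might look, as a sketch only. The function name `fetch_index_rows` and the choice to return an empty list on failure are assumptions, not taken from this PR.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def fetch_index_rows(conn: sqlite3.Connection, keyword: str) -> list[tuple]:
    """Return index rows for keyword, or an empty list if the query fails."""
    try:
        cursor = conn.execute(
            "SELECT * FROM MDX_INDEX WHERE key_text = ?", (keyword,)
        )
        return cursor.fetchall()
    except sqlite3.Error as exc:
        # sqlite3.Error is the base class, so this covers a missing table,
        # a malformed file, a locked database, and similar failures.
        logger.warning("Index lookup for %r failed: %s", keyword, exc)
        return []
```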

Compatibility:

  • Works with databases from multiple MDX tools
  • Maintains backward compatibility with existing code
  • Automatically upgrades old database formats

Fixes: #
Related: Non-ASCII filename support, database version conflicts

sdy623 added 3 commits October 7, 2025 15:25
Add a comprehensive headless API for MDX dictionary queries with no GUI dependency,
enabling programmatic access to all dictionary features through a clean Python interface.

Features:
- Expose core APIs: Dictionary, WordParser, mdx2html, mdx2pdf, mdx2img
- Support both simple queries and batch conversions
- Auto-fallback strategies (case-insensitive, hyphen removal, link following)
- Optional dependency groups: [gui], [conversion], [all]
- Command-line argument support for all example scripts
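
The auto-fallback strategies listed above could be sketched like this. It is illustrative only: `lookup_with_fallback`, its `lookup` callable, and the hop limit are hypothetical, and it assumes redirects use the `@@@LINK=` marker commonly seen in MDX entries.

```python
from typing import Callable, Optional

LINK_PREFIX = "@@@LINK="  # assumed redirect marker in MDX entries


def lookup_with_fallback(
    lookup: Callable[[str], Optional[str]], word: str, max_hops: int = 5
) -> Optional[str]:
    """Try the word as given, then lowercased, then with hyphens removed."""
    candidates = [word, word.lower(), word.replace("-", "")]
    for candidate in candidates:
        entry = lookup(candidate)
        hops = 0
        # Follow redirects, with a hop limit so a redirect cycle cannot loop forever.
        while entry is not None and entry.startswith(LINK_PREFIX) and hops < max_hops:
            entry = lookup(entry[len(LINK_PREFIX):].strip())
            hops += 1
        if entry is not None and not entry.startswith(LINK_PREFIX):
            return entry
    return None
```

Here `lookup` stands in for whatever exact-match query the dictionary exposes; a plain `dict.get` is enough to try the logic out.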

Core APIs:
- Dictionary: Main interface for MDX/MDD queries with automatic resource management
- WordParser: Parse input files with lesson markers and comments
- mdx2html: Convert word lists to HTML with CSS embedding and image support
- mdx2pdf: Generate PDFs with wkhtmltopdf integration
- mdx2img: Export to images (PNG/JPEG/WEBP) with optimization

Query Word Tools:
- query_word.py: Fast single-word query with complete HTML output
  * Extract and embed dictionary internal CSS styles
  * Auto-embed images as base64 from .mdd files
  * Two-layer CSS system (dictionary + custom beautification)
  * Standalone HTML files ready for offline use
- single_word_query.py: Advanced queries with multiple output modes

Batch Conversion:
- batch_conversion.py: Batch convert word lists to HTML/PDF/images
  * Timestamped output files: YYYYMMDD-HHMMSS_{input_stem}.{ext}
  * Prevent file overwrites with automatic naming
  * Support custom PDF/image options
  * Track invalid words for review
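
The timestamped naming scheme above maps directly onto `strftime`. A minimal sketch, with a hypothetical helper name and an injectable `now` so the result is reproducible:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_output_path(
    input_file: str, output_dir: str, ext: str, now: Optional[datetime] = None
) -> Path:
    """Build <output_dir>/YYYYMMDD-HHMMSS_<input_stem>.<ext>."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return Path(output_dir) / f"{stamp}_{Path(input_file).stem}.{ext}"
```

Two runs within the same second would still collide under this scheme, so it reduces overwrites without strictly ruling them out.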

Example Scripts with CLI Arguments:
- basic_query.py: --mdx <path>
- batch_conversion.py: --mdx-file, --input-file, --output-dir
- custom_styles.py: --mdx <path>
- progress_callback.py: --mdx <path>
- query_word.py: <word> --mdx --output [--no-images]
- single_word_query.py: <word> --mdx [--mode simple|complete|custom-css]
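
For illustration, the `query_word.py` interface above could be declared with `argparse` roughly as follows. Only the argument names come from this commit message; making `--mdx` required and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Query one word from an MDX dictionary."
    )
    parser.add_argument("word", help="headword to look up")
    # Assumption: the dictionary path is mandatory.
    parser.add_argument("--mdx", required=True, help="path to the .mdx file")
    parser.add_argument("--output", help="path of the HTML file to write")
    parser.add_argument(
        "--no-images", action="store_true", help="skip image embedding"
    )
    return parser
```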

Dependency Management:
- Core: No GUI dependencies required
- Optional [gui]: PySide6, markdown for GUI features
- Optional [conversion]: wkhtmltopdf for PDF/image generation
- Install flexibility: pip install .[gui] or .[conversion] or .[all]
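
With GUI packages moved into an optional extra, core code needs a way to detect them at runtime. A small sketch of one common approach; the helper names are hypothetical and not part of this PR:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_gui() -> None:
    """Raise a helpful error when the optional GUI dependency is missing."""
    if not has_module("PySide6"):
        raise ImportError(
            "GUI features need PySide6; install with: pip install .[gui]"
        )
```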

CSS Integration:
- Automatic extraction of dictionary internal CSS via merge_css()
- Image embedding with embed_images() for standalone files
- Custom beautification styles layered on top
- Responsive design with mobile support
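
Image embedding for standalone files typically means rewriting `src` attributes into base64 data URIs. The sketch below shows that idea under stated assumptions: `inline_images` is a hypothetical stand-in, not the `embed_images()` function from this PR, and the `resources` mapping of names to bytes is an assumed shape for data read from a .mdd file.

```python
import base64
import mimetypes
import re

SRC_PATTERN = re.compile(r'src="([^"]+)"')


def inline_images(html: str, resources: dict[str, bytes]) -> str:
    """Replace src="name" with a base64 data URI when name is in resources."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        data = resources.get(name)
        if data is None:
            return match.group(0)  # leave unknown references untouched
        mime = mimetypes.guess_type(name)[0] or "application/octet-stream"
        encoded = base64.b64encode(data).decode("ascii")
        return f'src="data:{mime};base64,{encoded}"'

    return SRC_PATTERN.sub(replace, html)
```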

Documentation:
- HEADLESS_API.md: Complete API reference
- HEADLESS_LIBRARY_SUMMARY.md: Architecture and design decisions
- QUERY_WORD_UPDATE.md: Detailed query_word.py enhancements
- WORD_QUERY_GUIDE.md: Usage guide with examples
- QUERY_SUMMARY.md: Feature comparison matrix
- AUDIO_IMPLEMENTATION_SUMMARY.md: Audio support details
- DATABASE_COMPATIBILITY.md: Table structure compatibility guide
- NON_ASCII_FILENAME_SUPPORT.md: Non-ASCII filename handling

Breaking Changes:
- Restructure pyproject.toml with optional dependency groups
- Update __init__.py to expose headless APIs
- GUI components now optional, install with [gui] extra

Architecture:
- Clean separation of concerns (core, GUI, conversion)
- Context manager support for proper resource cleanup
- Type hints for better IDE support
- Comprehensive error handling and logging
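
Context manager support usually comes down to `__enter__` and `__exit__`. A minimal sketch using a hypothetical `IndexStore` class that owns a SQLite connection; it is not the `Dictionary` class from this PR.

```python
import sqlite3


class IndexStore:
    """Hypothetical wrapper that owns a SQLite connection."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)

    def __enter__(self) -> "IndexStore":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions; returning None lets
        # any exception propagate after cleanup.
        self.close()

    def close(self) -> None:
        self.conn.close()
```

Used as `with IndexStore(path) as store: ...`, the connection is closed even if the body raises.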

Testing:
- All example scripts support command-line arguments
- Tested with ASCII and non-ASCII filenames
- Validated CSS extraction and image embedding
- Confirmed batch conversion with timestamped outputs

Migration Guide:
- Existing GUI functionality unchanged
- New headless features work alongside GUI
- No breaking changes for current GUI users
- Progressive enhancement approach