Compress Files!!!
In modern computing, efficient data storage and transmission are critical. Standard text files (ASCII/UTF-8) use a fixed width of 8 bits per character, which is wasteful for characters that appear frequently (e.g., 'e', 'a', ' ').
The Solution: This project implements Canonical Huffman Coding, a lossless data compression algorithm widely used in industrial standards like ZIP, GZIP, and JPEG.
- Efficiency: We assign shorter binary codes to frequent characters and longer codes to rare ones.
- Real-World Optimization: Unlike basic implementations that store the entire tree structure, our compressor uses Canonical state (storing only bit lengths) and Bit Packing (writing actual binary bits, not ASCII '0's and '1's). This mimics professional compression tools, achieving significantly smaller file sizes.
- GHC (Glasgow Haskell Compiler) (Version 8.0 or higher)
- Cabal (Haskell build tool) - Usually installed with GHC
The project uses Cabal for dependency management and includes parallel processing support.
# Build the project (downloads dependencies automatically)
cabal build
# Or use cabal install to make it available globally
cabal installBenefits of using Cabal:
- Automatic dependency management (
parallel,bytestring,containers) - Built-in parallel processing support (uses all CPU cores)
- Optimized compilation with
-O2flag - Reproducible builds
Using Cabal (Recommended):
# Compress a file (parallel processing enabled automatically)
cabal run compressor -- -c <input_file> <output_file>
# Decompress a file
cabal run compressor -- -d <input_file> <output_file>
# Show help
cabal run compressor -- -hAlternative: Direct Compilation with GHC
If you prefer not to use Cabal:
ghc -O2 --make Main.hs -o compressor -package containers -package bytestring -package parallel -threaded -rtsopts
./compressor -c data.txt data.huff# Compress 'data.txt' into 'data.huff'
cabal run compressor -- -c data.txt data.huff
# Decompress 'data.huff' back to 'restored.txt'
cabal run compressor -- -d data.huff restored.txt
# Verify the files match
diff data.txt restored.txtFlags:
-c: Compress a file-d: Decompress a file-h: Show help message
Command:
cabal run compressor -- -c example.txt compressed.huffOutput:
Reading example.txt...
Analyzing frequencies (parallel)...
Building Canonical Huffman Tree...
Encoding and Bit-Packing...
Compressed to compressed.huff
Compression Complete.
Command:
cabal run compressor -- -d compressed.huff restored.txtOutput:
Reading compressed.huff...
Reconstructing Tree from Canonical Lengths...
Decoding...
Writing to restored.txt...
Decompression Complete.
$ time cabal run compressor -- -c large_test.txt output.huff
real 0m1.047s # Wall-clock time
user 0m1.107s # CPU time (parallel processing)
sys 0m0.869s # System I/O timeAnalysis: user + sys > real proves parallel processing is working! Multiple CPU cores are being utilized simultaneously.
This project strictly adheres to Functional Programming principles to ensure reliability and testability.
The core logic resides in Processing.hs and uses no mutable state. Functions take input and produce output deterministically.
- Example:
canonicalCodestransforms a length table into code assignments without modifying external memory.-- Processing.hs canonicalCodes :: BitLenTable -> CodeTable
We use recursion instead of loops for tree traversal and bit stream decoding. This allows us to handle potentially infinite data streams or complex tree structures safely.
- Example: The
decodefunction uses a recursive state machine to traverse the Huffman Tree.-- Processing.hs go (b:bs) node bitIdx | bitIdx < 0 = go bs node 7 -- Recursive call to next byte
We define a custom data structure to model the Huffman Tree, ensuring type safety and preventing invalid tree states.
- Example:
-- DataTypes.hs data HuffmanTree = Leaf Char Int | Node Int HuffmanTree HuffmanTree
We utilize Haskell's powerful list processing functions (map, sortBy, foldl) to manipulate data concisely.
- Example: Sorting characters by frequency and length:
-- Processing.hs let sorted = sortBy (\(c1, l1) (c2, l2) -> ...) lenTable
The implementation uses Haskell's parallel library for concurrent frequency counting, automatically distributing work across multiple CPU cores.
-
Example: Parallel frequency analysis in
Utils.hs:frequencyCountParallel :: String -> [(Char, Int)] frequencyCountParallel str = let chunks = chunksOf chunkSize str localCounts = parMap rdeepseq frequencyCount chunks in mergeFrequencies localCounts
-
Benefits:
- Achieves 2-4x speedup on large files
- No manual thread management required
- Race-condition free (pure functions guarantee safety)
The code is separated into single-responsibility modules:
Main.hs: CLI Argument Parsing and IO Orchestration.Processing.hs: Pure compression/decompression logic.IOHandler.hs: Binary file reading/writing.DataTypes.hs: Type definitions.Utils.hs: General helper functions and parallel processing.