
Canonical Huffman Compressor

Compress Files!!!

📝 Problem Description

In modern computing, efficient data storage and transmission are critical. Standard text encodings (ASCII, or the single-byte range of UTF-8) spend a full 8 bits on every character, which is wasteful for characters that appear frequently (e.g., 'e', 'a', ' ').

The Solution: This project implements Canonical Huffman Coding, a lossless data compression algorithm widely used in industrial standards like ZIP, GZIP, and JPEG.

  • Efficiency: We assign shorter binary codes to frequent characters and longer codes to rare ones.
  • Real-World Optimization: Unlike basic implementations that store the entire tree structure, our compressor uses canonical codes (storing only the bit lengths) and bit packing (writing actual binary bits, not ASCII '0's and '1's). This mimics professional compression tools and achieves significantly smaller file sizes; a bit-packing sketch follows below.
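
To illustrate the bit-packing idea, here is a minimal sketch (not the project's exact code) that folds each group of eight bits into a byte, most significant bit first:

import Data.Bits (setBit)
import Data.Word (Word8)

-- Pack a bit stream into bytes, MSB first. The final byte is
-- zero-padded; real formats also record how many bits are valid.
packBits :: [Bool] -> [Word8]
packBits [] = []
packBits bits =
    let (byte, rest) = splitAt 8 bits
        padded       = take 8 (byte ++ repeat False)
        toWord       = foldl (\acc (i, b) -> if b then setBit acc i else acc) 0
    in toWord (zip [7,6..0] padded) : packBits rest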

🚀 Instructions to Run

Prerequisites

  • GHC (Glasgow Haskell Compiler) (Version 8.0 or higher)
  • Cabal (Haskell build tool) - Usually installed with GHC

1. Build with Cabal (Recommended)

The project uses Cabal for dependency management and includes parallel processing support.

# Build the project (downloads dependencies automatically)
cabal build

# Or use cabal install to make it available globally
cabal install

Benefits of using Cabal:

  • Automatic dependency management (parallel, bytestring, containers)
  • Built-in parallel processing support (uses all CPU cores)
  • Optimized compilation with -O2 flag
  • Reproducible builds

2. Usage

Using Cabal (Recommended):

# Compress a file (parallel processing enabled automatically)
cabal run compressor -- -c <input_file> <output_file>

# Decompress a file
cabal run compressor -- -d <input_file> <output_file>

# Show help
cabal run compressor -- -h

Alternative: Direct Compilation with GHC

If you prefer not to use Cabal:

ghc -O2 --make Main.hs -o compressor -package containers -package bytestring -package parallel -threaded -rtsopts
./compressor -c data.txt data.huff

3. Examples

# Compress 'data.txt' into 'data.huff'
cabal run compressor -- -c data.txt data.huff

# Decompress 'data.huff' back to 'restored.txt'
cabal run compressor -- -d data.huff restored.txt

# Verify the files match
diff data.txt restored.txt

Flags:

  • -c : Compress a file
  • -d : Decompress a file
  • -h : Show help message
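
As a rough sketch of how a Main module might dispatch on these flags (hypothetical; the project's actual Main.hs may differ):

import System.Environment (getArgs)
import System.Exit (exitFailure)

main :: IO ()
main = do
    args <- getArgs
    case args of
        ["-c", input, output] -> compressFile input output
        ["-d", input, output] -> decompressFile input output
        ["-h"]                -> putStrLn usage
        _                     -> putStrLn usage >> exitFailure
  where
    usage = "Usage: compressor (-c|-d) <input_file> <output_file>"

-- Placeholder bodies so the sketch compiles; the real versions would
-- call into the compression pipeline.
compressFile, decompressFile :: FilePath -> FilePath -> IO ()
compressFile i o   = putStrLn ("compressing " ++ i ++ " -> " ++ o)
decompressFile i o = putStrLn ("decompressing " ++ i ++ " -> " ++ o)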

💻 Sample Input/Output

Compression Workflow

Command:

cabal run compressor -- -c example.txt compressed.huff

Output:

Reading example.txt...
Analyzing frequencies (parallel)...
Building Canonical Huffman Tree...
Encoding and Bit-Packing...
Compressed to compressed.huff
Compression Complete.

Decompression Workflow

Command:

cabal run compressor -- -d compressed.huff restored.txt

Output:

Reading compressed.huff...
Reconstructing Tree from Canonical Lengths...
Decoding...
Writing to restored.txt...
Decompression Complete.

Performance Metrics

$ time cabal run compressor -- -c large_test.txt output.huff

real    0m1.047s   # Wall-clock time
user    0m1.107s   # CPU time summed across all cores
sys     0m0.869s   # Time spent in the kernel (system calls, I/O)

Analysis: user + sys exceeding real indicates that work was spread across multiple CPU cores simultaneously rather than running serially on one core.
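
Note: when running the GHC-built binary directly, the threaded runtime defaults to a single core unless you pass RTS options (made possible by the -threaded and -rtsopts flags used above); the Cabal setup may already configure this for you:

# Request 4 cores explicitly (assumes the binary was built with -threaded -rtsopts)
./compressor -c data.txt data.huff +RTS -N4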


🧠 Functional Programming Concepts Used

This project strictly adheres to Functional Programming principles to ensure reliability and testability.

1. Pure Functions & Immutability

The core logic resides in Processing.hs and uses no mutable state. Functions take input and produce output deterministically.

  • Example: canonicalCodes transforms a length table into code assignments without mutating any external state (a fuller sketch follows below).
    -- Processing.hs
    canonicalCodes :: BitLenTable -> CodeTable
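
A minimal sketch of the canonical assignment (concrete list types stand in here for the project's BitLenTable and CodeTable aliases, which may be defined differently):

import Data.Bits (shiftL)
import Data.List (sortBy)

-- Sort symbols by (code length, symbol). The first code is 0; each
-- subsequent code is the previous one plus 1, shifted left whenever
-- the code length increases.
canonicalCodes :: [(Char, Int)] -> [(Char, Int, Int)]  -- (symbol, length, code)
canonicalCodes lenTable = go 0 0 sorted
  where
    sorted = sortBy (\(c1, l1) (c2, l2) -> compare (l1, c1) (l2, c2)) lenTable
    go _    _       []              = []
    go next prevLen ((c, len) : xs) =
        let code = next `shiftL` (len - prevLen)
        in (c, len, code) : go (code + 1) len xs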

2. Recursion

We use recursion instead of loops for tree traversal and bit-stream decoding. Combined with Haskell's laziness, this handles long (even unbounded) bit streams and deep tree structures without explicit loop state.

  • Example: The decode function uses a recursive state machine to traverse the Huffman Tree.
    -- Processing.hs
    go (b:bs) node bitIdx
        | bitIdx < 0 = go bs node 7 -- byte exhausted: recurse onto the next byte, resetting the bit index
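
A simplified, self-contained version of the same idea, decoding a plain bit list against the HuffmanTree type shown in the next section (the real decoder additionally walks packed bytes with a bit index, as above):

data HuffmanTree = Leaf Char Int
                 | Node Int HuffmanTree HuffmanTree

decodeBits :: HuffmanTree -> [Bool] -> String
decodeBits root = go root
  where
    -- A degenerate single-leaf tree would need special handling, omitted here.
    go (Leaf c _)   []     = [c]              -- final symbol
    go (Leaf c _)   bs     = c : go root bs   -- emit and restart at the root
    go (Node _ l r) (b:bs) = go (if b then r else l) bs
    go (Node _ _ _) []     = []               -- trailing padding bits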

3. Algebraic Data Types (ADTs)

We define a custom data structure to model the Huffman Tree, ensuring type safety and preventing invalid tree states.

  • Example:
    -- DataTypes.hs
    data HuffmanTree = Leaf Char Int
                     | Node Int HuffmanTree HuffmanTree
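
For example, the tree itself can be built by repeatedly merging the two lowest-weight subtrees (a sketch using a simple sorted-list queue; the project's construction may differ):

import Data.List (insertBy, sortBy)
import Data.Ord (comparing)

weight :: HuffmanTree -> Int
weight (Leaf _ w)   = w
weight (Node w _ _) = w

-- Start from frequency-sorted leaves; merge the two lightest trees
-- until a single root remains.
buildTree :: [(Char, Int)] -> HuffmanTree
buildTree freqs = go (sortBy (comparing weight) [Leaf c w | (c, w) <- freqs])
  where
    go []           = error "buildTree: empty frequency table"
    go [t]          = t
    go (a : b : ts) = go (insertBy (comparing weight) (Node (weight a + weight b) a b) ts)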

4. Higher-Order Functions

We utilize Haskell's powerful list processing functions (map, sortBy, foldl) to manipulate data concisely.

  • Example: Sorting characters by frequency and length:
    -- Processing.hs
    let sorted = sortBy (\(c1, l1) (c2, l2) -> ...) lenTable
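
Spelling out one plausible comparison (not necessarily the project's exact lambda), the ordering can also be written with comparing, and frequency counting itself is a strict left fold:

import Data.List (foldl', sortBy)
import Data.Ord (comparing)
import qualified Data.Map.Strict as Map

-- Order the length table by (code length, then character).
sortLenTable :: [(Char, Int)] -> [(Char, Int)]
sortLenTable = sortBy (comparing (\(c, l) -> (l, c)))

-- Count character frequencies by folding into a strict Map.
frequencyCount :: String -> [(Char, Int)]
frequencyCount = Map.toList . foldl' (\m c -> Map.insertWith (+) c 1 m) Map.empty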

5. Concurrency & Parallelism

The implementation uses Haskell's parallel library to count character frequencies in parallel, automatically distributing chunks of the input across multiple CPU cores.

  • Example: Parallel frequency analysis in Utils.hs (its helpers are sketched after the list below):

    frequencyCountParallel :: String -> [(Char, Int)]
    frequencyCountParallel str = 
        let chunks = chunksOf chunkSize str
            localCounts = parMap rdeepseq frequencyCount chunks
        in mergeFrequencies localCounts
  • Benefits:

    • Achieves 2-4x speedup on large files
    • No manual thread management required
    • Race-condition free (pure functions guarantee safety)
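
The snippet relies on Control.Parallel.Strategies (for parMap and rdeepseq) plus a few helpers; minimal versions might look like this (the names match the snippet, but these bodies are assumptions):

import qualified Data.Map.Strict as Map

chunkSize :: Int
chunkSize = 65536  -- arbitrary; tune to input size and core count

-- Split the input into fixed-size chunks, one per parallel spark.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Sum the per-chunk counts for matching characters.
mergeFrequencies :: [[(Char, Int)]] -> [(Char, Int)]
mergeFrequencies = Map.toList . Map.fromListWith (+) . concat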

6. Modular Design

The code is separated into single-responsibility modules:

  • Main.hs: CLI Argument Parsing and IO Orchestration.
  • Processing.hs: Pure compression/decompression logic.
  • IOHandler.hs: Binary file reading/writing.
  • DataTypes.hs: Type definitions.
  • Utils.hs: General helper functions and parallel processing.

About

A file compressor that uses Canonical Huffman Coding. Written in Haskell.
