AURA: Adversarial Use for Reliable Assessment

A comprehensive framework for generating adversarial perturbations against OCR systems while maintaining human readability.

Python 3.12+ License: MIT Docker Benchmark


📄 Research Paper: Illusions for Machines

Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns

Dwin Gharibi  ·  Mohammad Amin Haji Alirezaei  ·  Kaebeh Yaeghoobi
Department of Computer Engineering & Computer Networks,
K. N. Toosi University of Technology, Tehran, Iran

Contents of this section: Abstract · Headline result · Key contributions · How OCR & human vision differ · Evaluation metrics · Attack taxonomy · Benchmark results · Ineffective attacks · Composite attacks · Future work · AI-assistance disclosure · Citation

Abstract

Optical Character Recognition (OCR) systems have become ubiquitous in modern applications, from document digitization to autonomous vehicles. However, these systems fundamentally differ from human visual perception in their inability to exploit contextual understanding, imagination, and pattern completion. We present Illusions for Machines, a unified framework for generating adversarial perturbations that exploit the perceptual gap between human vision and machine recognition. Unlike previous works that focused on model-specific attacks using gradient-based methods such as FGSM, PGD, and C&W — which require white-box access and are tailored to specific architectures — our approach introduces architecture-agnostic, model-agnostic attacks based on human visual cognition patterns. By using visual illusions, gestalt principles, and the cognitive pattern-completion mechanisms that humans naturally employ, we construct perturbations that remain fully readable to human observers while effectively misleading traditional OCR engines (Tesseract, EasyOCR), modern vision–language models (GPT-4V, Claude, Gemini), and LLMs that rely on visual or extracted textual inputs. This paper is a systematization of attacks together with a unified evaluation framework, and constitutes the first phase of a larger research programme. We benchmark 27 attacks across 213,840 perturbed images (20 Persian sentences × 11 free fonts × 9 font sizes × 4 intensity levels × 27 attacks), running every sample through Tesseract and EasyOCR and reporting every metric with both standard deviation and 95% confidence intervals. Our top-performing structural attack (ConnectionBreaker) attains an effectiveness of 0.850 ± 0.078 (mean ± s.d.; 95% CI ±0.002), with Tesseract CER 0.882 ± 0.742, EasyOCR CER 0.658 ± 0.347, SSIM 0.920 ± 0.034, and humanized readability 0.781 ± 0.068.

Keywords: Adversarial Examples, OCR, Vision-Language Models, Human Visual Perception, Deep Learning Security.

TL;DR — Headline Result

The core thesis in one picture: a perturbation that a human reads effortlessly but that collapses OCR.

ConnectionBreaker adversarial sample
ConnectionBreaker (rank 1). Aggregated over n = 7,920 samples: SSIM = 0.920 ± 0.034, humanized readability Rh = 0.781 ± 0.068, Tesseract CER = 0.882 ± 0.742, EasyOCR CER = 0.658 ± 0.347, effectiveness E = 0.850 ± 0.078.

  • Scale: 213,840 perturbed images = 27 attacks × 20 texts × 11 free Persian fonts × 9 sizes × 4 intensities (7,920 samples per attack), each scored by both Tesseract and EasyOCR.
  • Rigor: every metric reported as mean ± s.d. with two-sided 95% confidence intervals; pairwise attack differences bootstrapped (1,000 resamples) into a Cohen's-d matrix.
  • Finding: structural attacks that exploit the Gestalt principle of closure (ConnectionBreaker, StrokeBreaking) dominate every font, size and intensity — while classical gradient attacks (PGD, C&W, MI-FGSM) are the weakest in the whole benchmark because OCR binarization "heals" their pixel noise.
  • Metric: effectiveness E = D · Rh (destruction × humanized readability) rewards attacks that destroy OCR and stay human-readable.
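
As a rough illustration of the metric in the last bullet, the sketch below multiplies a per-sample destruction score by a humanized-readability score. Only the product form E = D · Rh comes from the text; the function name and the three example samples are hypothetical.

```python
def effectiveness(destruction: float, readability: float) -> float:
    """E = D * Rh for one sample; both inputs are assumed to lie in [0, 1]."""
    if not (0.0 <= destruction <= 1.0 and 0.0 <= readability <= 1.0):
        raise ValueError("scores are expected in [0, 1]")
    return destruction * readability


# Hypothetical samples: (destruction D, humanized readability Rh)
samples = {
    "breaks OCR, stays legible": (0.95, 0.80),
    "breaks OCR, illegible": (0.95, 0.20),
    "legible, OCR unaffected": (0.10, 0.95),
}
scores = {name: effectiveness(d, rh) for name, (d, rh) in samples.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the score is a product, an attack that fails on either axis is pushed toward zero, which is the behaviour the bullet describes.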

Key Contributions

  1. A taxonomy of 27 attack families spanning structural, geometric, photometric and gradient-based perturbations, each mapped to a human-perception mechanism (Gestalt closure / continuity / figure-ground).
  2. A reproducible benchmark of 213,840 perturbed samples across 11 free Persian fonts, 9 sizes, 4 intensities and 20 reference sentences — released in full as a per-sample CSV.
  3. A statistically rigorous reporting protocol (means, standard deviations, 95% CIs, per-intensity sweeps, per-font / per-size / per-text generalisation, and bootstrapped significance testing).
  4. A composite-attack selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search) that explains why certain attack combinations amplify non-linearly.
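
The relation between the reported standard deviations and the 95% confidence intervals in contribution 3 can be checked with a short calculation. This is a sketch that assumes a normal approximation (z = 1.96) for the mean; the source does not state which interval construction was used. The inputs are ConnectionBreaker's effectiveness s.d. (0.078) and the per-attack sample count (7,920).

```python
import math


def ci95_half_width(sd: float, n: int, z: float = 1.96) -> float:
    """Half-width of a two-sided 95% CI for a mean (normal approximation, assumed)."""
    return z * sd / math.sqrt(n)


half = ci95_half_width(sd=0.078, n=7920)
print(f"+/-{half:.4f}")  # compare with the reported CI of +/-0.002
```

With n = 7,920 the interval is far narrower than the s.d., which is why a spread of 0.078 coexists with a CI of about ±0.002.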

Positioning. This is the breadth paper of a multi-phase programme: it surveys, taxonomises and benchmarks attacks and provides the evaluation harness. A follow-up paper will distill these findings into a single optimised, architecture-agnostic attack and add a controlled human study and VLM transfer matrix (see Future Work).

How OCR and Human Vision Differ (Background)

The attacks are designed against the concrete architectures below. Traditional engines rely on binarization → layout/line detection → character segmentation → sequence decoding; VLMs use a holistic vision transformer. Both lack human imagination and pattern completion, which is precisely the gap the attacks exploit.

VLM architecture
Fig. 1 — Vision–language model. A vision encoder projects the image into the language embedding space; visual tokens are fused with text tokens and decoded by an LLM.
CRNN OCR architecture
Fig. 2 — CRNN OCR. Convolutional feature extraction → BiLSTM sequence modeling → CTC transcription (the EasyOCR family).
Tesseract architecture
Fig. 3 — Tesseract. LSTM-based recognition with adaptive binarization — the step that neutralises most pixel-noise attacks.
Vision Transformer
Fig. 4 — Vision Transformer. Patch embedding + multi-head self-attention give VLMs global context across the image.

Gestalt principles
Fig. 5 — Gestalt principles. Closure, similarity, proximity, continuity/common-fate: the human perceptual machinery the attacks lean on so that text stays readable to people while breaking machines.

Evaluation Metrics

Each perturbed image is scored on a rich, fully-released metric suite:

| Group | Metrics | Purpose |
|---|---|---|
| OCR failure | Character Error Rate (CER), Word Error Rate (WER) | Levenshtein-based recognition damage (can exceed 1.0 when OCR hallucinates). |
| Perceptual fidelity | SSIM, PSNR, MS-SSIM, LPIPS, GMSD, CIEDE2000 | How close the attacked image stays to the clean one. |
| Topology preservation | Text-preservation, Stroke-preservation (mask-IoU) | Whether character body and central spine survive. |
| Composite (ours) | Humanized readability Rh, Destruction D, Effectiveness E = D · Rh | The balanced ranking criteria used throughout the paper. |
  • Destruction D is a calibrated weighted sum of seven complementary OCR-damage axes (structural, text-region, edge, connected-component, contrast, morphological, noise) aligned empirically against measured CER.
  • Humanized readability R_h is a sigmoid-calibrated blend of MS-SSIM, GMSD, CIEDE2000, text-preservation and Canny-edge overlap, tuned so that R_h ≥ 0.75 corresponds to text that all three native-Persian-speaking co-authors could read effortlessly. It is an author-validated automatic proxy, not a crowd study (a pre-registered study is deferred to follow-up work).
  • Effectiveness E = D · R_h is the primary ranking metric — it only rewards attacks that break OCR and remain legible.
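
To make the OCR-failure metrics concrete, here is a minimal sketch of Levenshtein-based CER and WER. It shows why both can exceed 1.0: when the engine hallucinates extra characters, the edit distance grows beyond the reference length. Text normalisation (whitespace, Unicode forms) is not specified in the source, so none is applied here, and the function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference symbol
                cur[j - 1] + 1,          # insert a hypothesis symbol
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same ratio computed over whitespace-split words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(cer("abc", "abc"))       # identical strings
print(cer("abc", "abcdefgh"))  # five hallucinated characters on a 3-char reference
```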

Attack Taxonomy & Perceptual Mapping

The 27 benchmarked attacks fall into three mechanism families, each tied to a perceptual principle. CER/WER/SSIM/Rh values in each caption are aggregated over the full 7,920-sample grid for that attack.

1 · Structural Fragmentation — Gestalt closure

Break the thin ligatures/strokes; humans bridge the gaps, OCR segmentation collapses.

ConnectionBreaker
ConnectionBreaker — rank 1. SSIM 0.920 ± 0.034 · Rh 0.781 ± 0.068 · CERTess 0.882 · CEREasy 0.658 · E 0.850 ± 0.078
StrokeBreaking
StrokeBreaking — rank 2. SSIM 0.891 ± 0.059 · Rh 0.725 ± 0.133 · CERTess 1.226 · CEREasy 0.882 · E 0.820 ± 0.080
DotRemoval
Persian Dot Removal. Near-perfect visual similarity (SSIM 0.969 · Rh 0.900); removes semantically critical diacritics. E 0.477 ± 0.283.
Puzzle
Puzzle / Jigsaw. Tile displacement: Tesseract collapses (CER 3.074 ± 2.632). A powerful destabiliser used inside composite attacks.
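
The closure idea behind this family can be sketched on a toy binary image. The code below is not the authors' ConnectionBreaker; it is a hypothetical simplification that treats columns with very little ink as thin connecting strokes and blanks a small gap in the middle of each such run. The function name and the parameters max_thickness and gap are assumptions, and the heuristic would also clip isolated thin marks such as dots.

```python
def break_connections(img, max_thickness=1, gap=1):
    """Blank `gap` columns in the middle of every run of thin-ink columns.

    img is a list of rows with 1 = ink and 0 = background.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    ink = [sum(img[y][x] for y in range(h)) for x in range(w)]
    x = 0
    while x < w:
        if 0 < ink[x] <= max_thickness:
            start = x
            while x < w and 0 < ink[x] <= max_thickness:
                x += 1  # extend the thin run to [start, x)
            mid = (start + x - gap) // 2
            for cx in range(max(mid, start), min(mid + gap, x)):
                for y in range(h):
                    out[y][cx] = 0
        else:
            x += 1
    return out


# Toy image: two 3x3 blobs joined by a one-pixel-thick connector on row 2.
demo = [[0] * 9 for _ in range(5)]
for y in (1, 2, 3):
    for x in (0, 1, 2, 6, 7, 8):
        demo[y][x] = 1
for x in (3, 4, 5):
    demo[2][x] = 1

broken = break_connections(demo)
print("".join(str(v) for v in broken[2]))
```

A reader still sees one connected shape across the small gap, while a segmenter that relies on connected components now sees two.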

2 · Geometric Warping — Gestalt continuity

Bend the baseline; readers follow the flow, OCR line-tracking fails.

Wavy
Wavy — rank 3. Sinusoidal baseline shift. SSIM 0.902 · Rh 0.791 · E 0.767 ± 0.133
Zigzag
Zigzag — rank 4. Triangle-wave warp. SSIM 0.753 · Rh 0.614 · E 0.727 ± 0.060
Elastic
Elastic — rank 5. Smooth random displacement field. SSIM 0.848 · Rh 0.720 · E 0.670 ± 0.052
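
A sinusoidal baseline shift of the kind named in the Wavy caption can be sketched as a per-column vertical displacement. This is an illustrative toy on a list-of-rows image, not the benchmarked implementation: amplitude and wavelength are hypothetical parameters, and pixels wrap around vertically to keep the example short (a real implementation would more likely pad and interpolate).

```python
import math


def wavy(img, amplitude=2.0, wavelength=16.0):
    """Shift each column vertically by amplitude * sin(2*pi*x / wavelength)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        shift = round(amplitude * math.sin(2 * math.pi * x / wavelength))
        for y in range(h):
            out[(y + shift) % h][x] = img[y][x]  # wrap-around for brevity
    return out


# Toy "text line": a flat baseline on row 4 of an 8 x 16 image.
flat = [[1 if y == 4 else 0 for _ in range(16)] for y in range(8)]
bent = wavy(flat)
for row in bent:
    print("".join("#" if v else "." for v in row))
```

Replacing the sine with a triangle wave gives the Zigzag variant; the total amount of ink is unchanged in both cases, only its vertical position moves.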

3 · Photometric & Interference — figure-ground segregation

Inject clutter/noise/contrast; humans isolate text from ground, OCR binarization does not.

Interleaved
Interleaved Pattern. Periodic stripes between text rows. CERTess 0.957 · CEREasy 0.950 · E 0.600 ± 0.081
GaussianNoise
Gaussian Noise. Very high but highly variable Tesseract CER: 1.883 ± 2.681 (Puzzle at 3.074 and SaltPepper at 2.117 are higher still). E 0.665 ± 0.066
SaltPepper
Salt & Pepper. Most consistent EasyOCR damage: CER 0.921 ± 0.111. E 0.660 ± 0.028
Mirror
Text Mirror. Faint flipped overlay. SSIM 0.808 · Rh 0.753 · E 0.642 ± 0.113
CaptchaStyle
CAPTCHA-style. Waves + crossing lines + dots. CEREasy 2.170 (hallucination).
Contrast
Contrast Manipulation. Block-wise brightness/gamma. CERTess 0.666 · CEREasy 0.717
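
For the interference family, impulse noise is the simplest member to sketch. The example below flips a random fraction of pixels to pure black or white. The density parameter and the function name are hypothetical, and how the benchmark maps its intensity θ to noise density is not stated in the source.

```python
import random


def salt_and_pepper(img, density=0.05, seed=0):
    """Set roughly `density` of the pixels to 0 (pepper) or 255 (salt)."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            if rng.random() < density:
                row[x] = rng.choice((0, 255))
    return out


gray = [[128] * 20 for _ in range(20)]  # uniform mid-gray test image
noisy = salt_and_pepper(gray, density=0.05)
changed = sum(px != 128 for row in noisy for px in row)
print(f"{changed} of 400 pixels replaced")
```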

Benchmark Results (full grid — 213,840 samples)

Every value below is computed over 7,920 perturbed samples per attack (N = 213,840 total) and reported as mean ± s.d. These are the paper's authoritative numbers; the single-image CER/WER quoted alongside individual figures are illustrative only.
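
The sample counts quoted above follow directly from the grid dimensions, as this short check shows. The constant names are illustrative, and the index ranges stand in for the actual texts, fonts and sizes.

```python
from itertools import product

TEXTS, FONTS, SIZES, ATTACKS = 20, 11, 9, 27
INTENSITIES = (0.25, 0.50, 0.75, 1.00)

# One grid cell per (text, font, size, intensity) combination for a single attack.
grid = list(product(range(TEXTS), range(FONTS), range(SIZES), INTENSITIES))
per_attack = len(grid)
total = per_attack * ATTACKS
per_intensity = total // len(INTENSITIES)

print(per_attack, total, per_intensity)
```

The per-intensity count matches the n column of the intensity sweep table.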

Top-15 attacks by effectiveness (E = D · Rh)

| # | Attack | SSIM | MS-SSIM | Destruction | Humanized | Effectiveness | 95% CI | Tess. CER | Easy. CER |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ConnectionBreaker | 0.920 ± 0.034 | 0.923 ± 0.038 | 0.993 ± 0.035 | 0.781 ± 0.068 | 0.850 ± 0.078 | ±0.002 | 0.882 ± 0.742 | 0.658 ± 0.347 |
| 2 | StrokeBreaking | 0.891 ± 0.059 | 0.889 ± 0.069 | 0.972 ± 0.094 | 0.725 ± 0.133 | 0.820 ± 0.080 | ±0.002 | 1.226 ± 1.300 | 0.882 ± 0.310 |
| 3 | Wavy | 0.902 ± 0.054 | 0.973 ± 0.025 | 0.900 ± 0.185 | 0.791 ± 0.092 | 0.767 ± 0.133 | ±0.003 | 0.440 ± 0.513 | 0.520 ± 0.329 |
| 4 | Zigzag | 0.753 ± 0.102 | 0.841 ± 0.088 | 0.992 ± 0.034 | 0.614 ± 0.138 | 0.727 ± 0.060 | ±0.001 | 0.612 ± 0.333 | 0.813 ± 0.190 |
| 5 | Elastic | 0.848 ± 0.073 | 0.901 ± 0.064 | 0.982 ± 0.053 | 0.720 ± 0.100 | 0.670 ± 0.052 | ±0.001 | 0.157 ± 0.316 | 0.346 ± 0.289 |
| 6 | GaussianNoise | 0.447 ± 0.107 | 0.480 ± 0.135 | 1.000 ± 0.000 | 0.445 ± 0.175 | 0.665 ± 0.066 | ±0.001 | 1.883 ± 2.681 | 0.406 ± 0.288 |
| 7 | SaltPepper | 0.682 ± 0.047 | 0.801 ± 0.037 | 1.000 ± 0.000 | 0.453 ± 0.048 | 0.660 ± 0.028 | ±0.001 | 2.117 ± 1.715 | 0.921 ± 0.111 |
| 8 | MaxDestruction | 0.942 ± 0.007 | 0.980 ± 0.005 | 0.868 ± 0.080 | 0.865 ± 0.011 | 0.660 ± 0.054 | ±0.001 | 0.111 ± 0.284 | 0.313 ± 0.279 |
| 9 | Mirror | 0.808 ± 0.034 | 0.825 ± 0.052 | 0.733 ± 0.143 | 0.753 ± 0.076 | 0.642 ± 0.113 | ±0.002 | 0.326 ± 0.344 | 0.598 ± 0.399 |
| 10 | Interleaved | 0.517 ± 0.137 | 0.521 ± 0.100 | 1.000 ± 0.000 | 0.271 ± 0.095 | 0.600 ± 0.081 | ±0.002 | 0.957 ± 0.077 | 0.950 ± 0.147 |
| 11 | MicroRotation | 0.773 ± 0.130 | 0.783 ± 0.160 | 0.942 ± 0.184 | 0.609 ± 0.194 | 0.590 ± 0.138 | ±0.003 | 0.151 ± 0.313 | 0.373 ± 0.288 |
| 12 | CaptchaStyle | 0.424 ± 0.131 | 0.587 ± 0.159 | 1.000 ± 0.001 | 0.330 ± 0.123 | 0.532 ± 0.080 | ±0.002 | 1.025 ± 0.656 | 2.170 ± 1.851 |
| 13 | StructuredNoise | 0.848 ± 0.090 | 0.903 ± 0.066 | 0.553 ± 0.263 | 0.872 ± 0.056 | 0.488 ± 0.208 | ±0.005 | 0.106 ± 0.278 | 0.323 ± 0.276 |
| 14 | DotRemoval | 0.969 ± 0.033 | 0.976 ± 0.023 | 0.542 ± 0.348 | 0.900 ± 0.060 | 0.477 ± 0.283 | ±0.006 | 0.266 ± 0.310 | 0.518 ± 0.273 |
| 15 | TI-FGSM | 0.865 ± 0.085 | 0.974 ± 0.020 | 0.504 ± 0.254 | 0.917 ± 0.025 | 0.450 ± 0.206 | ±0.005 | 0.106 ± 0.278 | 0.307 ± 0.275 |
Effectiveness bar chart
Mean effectiveness ± s.d. per attack.
Attack ranking with 95% CI
Attack ranking with 95% CI markers — the top-3 structural cluster is significant.

Engine-level CER / WER (Tesseract & EasyOCR)

| Attack | Tess. CER | Tess. WER | Easy. CER | Easy. WER |
|---|---|---|---|---|
| ConnectionBreaker | 0.882 ± 0.742 | 1.736 ± 1.555 | 0.658 ± 0.347 | 1.073 ± 0.512 |
| StrokeBreaking | 1.226 ± 1.300 | 2.641 ± 2.794 | 0.882 ± 0.310 | 1.435 ± 0.532 |
| Wavy | 0.440 ± 0.513 | 0.666 ± 0.830 | 0.520 ± 0.329 | 0.753 ± 0.354 |
| Zigzag | 0.612 ± 0.333 | 0.895 ± 0.361 | 0.813 ± 0.190 | 1.124 ± 0.320 |
| Elastic | 0.157 ± 0.316 | 0.258 ± 0.406 | 0.346 ± 0.289 | 0.549 ± 0.319 |
| GaussianNoise | 1.883 ± 2.681 | 3.659 ± 5.240 | 0.406 ± 0.288 | 0.628 ± 0.310 |
| SaltPepper | 2.117 ± 1.715 | 4.306 ± 3.997 | 0.921 ± 0.111 | 1.029 ± 0.118 |
| Interleaved | 0.957 ± 0.077 | 1.002 ± 0.047 | 0.950 ± 0.147 | 0.990 ± 0.065 |
| CaptchaStyle | 1.025 ± 0.656 | 1.592 ± 1.527 | 2.170 ± 1.851 | 2.249 ± 1.494 |
| Contrast | 0.666 ± 0.314 | 0.969 ± 0.369 | 0.717 ± 0.299 | 0.882 ± 0.249 |
| Mirror | 0.326 ± 0.344 | 0.616 ± 0.542 | 0.598 ± 0.399 | 0.950 ± 0.485 |
| Puzzle | 3.074 ± 2.632 | 5.457 ± 5.103 | 0.782 ± 0.217 | 0.984 ± 0.143 |
| DotRemoval | 0.266 ± 0.310 | 0.505 ± 0.378 | 0.518 ± 0.273 | 0.798 ± 0.270 |
| PGD | 0.107 ± 0.280 | 0.156 ± 0.360 | 0.312 ± 0.277 | 0.512 ± 0.307 |
| MicroRotation | 0.151 ± 0.313 | 0.236 ± 0.411 | 0.373 ± 0.288 | 0.574 ± 0.319 |
Tesseract CER CDF
Empirical CDF of Tesseract CER per attack family.
Tesseract vs EasyOCR scatter
Joint Tesseract↔EasyOCR CER. Structural attacks break both engines; gradient attacks (PGD/CW/MI-FGSM) hug the EasyOCR axis — the signature of surrogate mismatch.

Intensity sweep (θ ∈ {0.25, 0.50, 0.75, 1.00})

Effectiveness saturates globally around θ ≈ 0.75; structural attacks plateau already at θ = 0.5.

| Intensity θ | n | SSIM | Destruction | Effectiveness | Tess. CER | Easy. CER |
|---|---|---|---|---|---|---|
| 0.25 | 53,460 | 0.868 ± 0.207 | 0.566 ± 0.382 | 0.426 ± 0.272 | 0.399 ± 1.008 | 0.466 ± 0.386 |
| 0.50 | 53,460 | 0.824 ± 0.223 | 0.642 ± 0.357 | 0.470 ± 0.242 | 0.496 ± 1.025 | 0.537 ± 0.557 |
| 0.75 | 53,460 | 0.788 ± 0.228 | 0.701 ± 0.328 | 0.502 ± 0.219 | 0.672 ± 1.288 | 0.586 ± 0.677 |
| 1.00 | 53,460 | 0.756 ± 0.231 | 0.758 ± 0.315 | 0.530 ± 0.216 | 0.703 ± 1.254 | 0.607 ± 0.684 |
Effectiveness vs intensity
Global mean effectiveness vs θ (95% CI band).
Per-attack intensity heatmap
Per-attack effectiveness across the four intensity levels.
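
The saturation claim can be read off the intensity table by looking at the marginal gain per step. The snippet below uses the four global mean effectiveness values from that table; the variable names are illustrative.

```python
# Global mean effectiveness per intensity level, copied from the sweep table.
effectiveness_by_theta = {0.25: 0.426, 0.50: 0.470, 0.75: 0.502, 1.00: 0.530}

thetas = sorted(effectiveness_by_theta)
gains = [
    round(effectiveness_by_theta[hi] - effectiveness_by_theta[lo], 3)
    for lo, hi in zip(thetas, thetas[1:])
]
print(gains)
```

Each step adds less than the previous one, so the curve flattens toward θ ≈ 0.75, although the global mean is still rising slightly at θ = 1.00.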

Generalisation & significance

Attack rankings are remarkably stable across fonts, sizes and texts (cross-font effectiveness range ≤ 0.0037; cross-text range ≤ 0.008), and the top-15 vs bottom-12 attacks separate at p < 0.001 after Bonferroni correction.
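
A pairwise effect size with a bootstrap interval, in the spirit of the Cohen's-d matrix mentioned here, could be computed along these lines. This is a standard-library sketch on synthetic data: the pooled-variance form of Cohen's d and the percentile interval are assumptions, since the source does not spell out the exact variant, and the two sample sets are generated for illustration rather than taken from the benchmark CSV.

```python
import math
import random
import statistics


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def bootstrap_d_interval(a, b, n_resamples=1000, seed=0):
    """Percentile 95% interval for d from resampling each group with replacement."""
    rng = random.Random(seed)
    ds = sorted(
        cohens_d(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    return ds[int(0.025 * n_resamples)], ds[int(0.975 * n_resamples) - 1]


# Synthetic per-sample effectiveness scores for two hypothetical attacks.
gen = random.Random(1)
attack_a = [gen.gauss(0.85, 0.08) for _ in range(100)]
attack_b = [gen.gauss(0.45, 0.20) for _ in range(100)]

d = cohens_d(attack_a, attack_b)
low, high = bootstrap_d_interval(attack_a, attack_b)
print(f"d = {d:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```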

Per-font heatmap
Per-(attack, font) mean effectiveness — almost no font-specific striping.
Per-text heatmap
Per-(text, attack) effectiveness across the 20 Persian sentences.
Effectiveness violins
Effectiveness distribution per attack (note the tight top-3 violins).
Cohen's d matrix
Pairwise Cohen's-d on effectiveness (dark = large effect size).

Single-figure benchmark dashboard
Single-figure dashboard summarising the entire 213,840-sample benchmark: top-10 ranking, stealth frontier, per-attack effectiveness with 95% CI, readability–destruction scatter, effectiveness histogram, destruction-vs-intensity, joint OCR CER scatter, and the humanized-readability histogram.

📚 Complete Chart Atlas — all 67 benchmark figures

Every figure below is regenerated from the released per-sample CSV and is included in this repository under assets/benchmark_results/aura_results/charts/. They follow the same family taxonomy (A–I) used in §IX of the paper. The charts already embedded in context above are repeated here so this atlas is self-contained.

Headline per-attack metrics. Mean of each headline metric per attack (with ± s.d. error bars), the ranking and radar comparisons, and the normalised (attack × metric) heatmap.


Effectiveness ± s.d.

Destruction

Humanized readability

SSIM

PSNR

Ranking with 95% CI

Top-attack radar (all metrics)

(Attack × metric) heatmap

Perceptual trade-off & cross-metric correlations. The destruction–readability "stealth frontier", SSIM vs humanized readability, and the correlation matrices that justify the composite metrics (SSIM↔Rh r = 0.941).


Stealth (destruction–readability) frontier

Readability vs destruction (per sample)

SSIM vs humanized readability

Humanized-readability box plots

Perceptual-metric correlations

Full 34-metric correlation matrix

Metric distributions (box)

Metric histograms

Intensity sweep (θ ∈ {0.25, 0.50, 0.75, 1.00}). Global line plots and per-attack heatmaps for every headline metric, plus the per-intensity trade-off and variance/CI summaries.


Effectiveness vs θ

Destruction vs θ

Readability vs θ

SSIM vs θ

PSNR vs θ

Per-attack effectiveness × θ

Destruction × θ

Readability × θ

SSIM × θ

PSNR × θ

Trade-off coloured by θ

Mean / s.d. / 95% CI

Std-deviation analysis

Family A — Distribution & stability. Best-vs-worst attacks per metric, the mean-vs-std Pareto frontier, the signal-to-noise ranking, and per-attack stability surfaces.


Best vs worst — effectiveness

Best vs worst — destruction

Best vs worst — readability

Best vs worst — humanized

Mean-vs-std Pareto frontier

Signal-to-noise ranking

Stability — effectiveness

Stability — destruction

Stability — readability

Stability — humanized

Family B — Per-font generalisation (11 free Persian fonts). Mean/std heatmaps, best attack per font, and per-font box plots of the top attacks.


(Attack × font) mean effectiveness

(Attack × font) std

Best attack per font

Per-font box — top attacks

Family C — Per-size generalisation (9 font sizes). Metric-vs-size curves and (attack × size) / (size × intensity) heatmaps.


Headline metrics vs size

(Attack × size) effectiveness

(Size × intensity) heatmap

Per-size effectiveness violins

Family E — Distributional behaviour. Empirical CDFs and violin plots of each headline metric across the full 213,840-sample grid.


ECDF — effectiveness

ECDF — destruction

ECDF — humanized

ECDF — SSIM

Violin — effectiveness

Violin — destruction

Violin — humanized

Violin — SSIM

Family F — Statistical significance. CI-based ranking, the pairwise Cohen's-d effect-size matrix, and the bootstrapped p-value matrix (top-15 vs bottom-12 separate at p < 0.001 after Bonferroni).


CI-based effectiveness ranking

Pairwise Cohen's-d

Bootstrapped p-value matrix

Family G — OCR-level CER behaviour. Per-engine CER CDFs, the joint Tesseract↔EasyOCR scatter/density, and per-attack ΔCER.


Tesseract CER CDF

Tesseract vs EasyOCR CER

Joint CER density

ΔCER per attack

ΔCER histogram

Family H — Per-text behaviour (20 Persian reference sentences). Per-text difficulty, the (text × attack) heatmap, and the best attack per text.


Per-text difficulty curve

(Text × attack) effectiveness

Best attack per text

Family I — Master dashboard. A single self-contained figure compositing the headline findings (also shown above).

What does not work (the failure manifold)

A key negative result: classical gradient-based attacks and environmental simulations are the weakest in the benchmark. Pixel-precise noise (CW, MI-FGSM, PGD) is discarded by Tesseract's non-differentiable adaptive thresholding ("surrogate mismatch"), and low/high-frequency environmental textures are absorbed by binarization and LSTM temporal smoothing.

| Attack modality | Tess. CER | Tess. WER | Easy. CER | Effectiveness |
|---|---|---|---|---|
| Micro Rotation | 0.151 ± 0.313 | 0.236 ± 0.411 | 0.373 ± 0.288 | 0.590 ± 0.138 |
| Structured Noise | 0.106 ± 0.278 | 0.157 ± 0.360 | 0.323 ± 0.276 | 0.488 ± 0.208 |
| TI-FGSM | 0.106 ± 0.278 | 0.155 ± 0.355 | 0.307 ± 0.275 | 0.450 ± 0.206 |
| Shadow Simulation | 0.274 ± 0.391 | 0.388 ± 0.481 | 0.328 ± 0.278 | 0.426 ± 0.220 |
| Moire Pattern | 0.105 ± 0.277 | 0.155 ± 0.360 | 0.320 ± 0.281 | 0.359 ± 0.158 |
| PGD (iterative) | 0.107 ± 0.280 | 0.156 ± 0.360 | 0.312 ± 0.277 | 0.338 ± 0.120 |
| Perlin Noise | 0.109 ± 0.275 | 0.169 ± 0.364 | 0.347 ± 0.283 | 0.330 ± 0.144 |
| Scene Text | 0.105 ± 0.278 | 0.153 ± 0.359 | 0.324 ± 0.281 | 0.195 ± 0.065 |
| MI-FGSM | 0.108 ± 0.282 | 0.158 ± 0.366 | 0.321 ± 0.283 | 0.112 ± 0.055 |
| CW (L₂) | 0.106 ± 0.279 | 0.153 ± 0.361 | 0.323 ± 0.280 | 0.071 ± 0.052 (worst) |
CW
CW L₂ — E 0.071
MI-FGSM
MI-FGSM — E 0.112
PGD
PGD — E 0.338
MicroRotation
MicroRotation — E 0.590
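
The "healing" effect described in this section can be illustrated with a toy example. The sketch below adds small bounded noise to a black-on-white image and shows that thresholding maps the noisy and clean images to the same binary result. It deliberately simplifies: a fixed global threshold stands in for Tesseract's adaptive binarization, and uniform random noise stands in for a gradient-based perturbation; epsilon and the function names are hypothetical.

```python
import random


def binarize(img, threshold=128):
    """Global threshold: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in img]


def add_bounded_noise(img, epsilon=8, seed=0):
    """Add integer noise in [-epsilon, epsilon] to every pixel, clipped to 0..255."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, px + rng.randint(-epsilon, epsilon))) for px in row]
        for row in img
    ]


# Black stroke (0) on a white background (255).
clean = [[255] * 12 for _ in range(6)]
for x in range(2, 10):
    clean[3][x] = 0

noisy = add_bounded_noise(clean)
print("pixels differ:", noisy != clean)
print("binary images identical:", binarize(noisy) == binarize(clean))
```

Structural attacks avoid this fate because they move or remove ink, which changes the binary image itself rather than the gray levels around the threshold.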

Composite Attacks

Real-world degradation rarely comes from a single source. A three-step selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search over S = (1−I)·D̄ᵢ·R̄ⱼ) picks composite pairs whose components hit different OCR stages and therefore stack non-linearly. Puzzle+Interleaved and GaussianNoise+Interleaved drive both engines to CER = WER = 1.0, while Puzzle+Wavy keeps readability high (SSIM 0.938).

Composite attack comparison grid
Composite comparison grid on one Persian sentence at θ = 0.7.

Puzzle+Wavy
Puzzle + Wavy. Readability preserved (SSIM 0.938, Rh 0.875).
Puzzle+Interleaved
Puzzle + Interleaved. Both engines collapse to CER = WER = 1.0.
Noise+Interleaved
Gaussian + Interleaved. Most destructive interaction (SSIM ≈ 0.13).
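
The three-step selection described in this section can be sketched as follows. The D and Rh means are taken from the top-15 table, but the pipeline-stage tags, the mutual-information values and the mi_max cut-off are hypothetical placeholders, and reading S = (1−I)·D̄ᵢ·R̄ⱼ over ordered pairs (i, j) is an assumption about the formula.

```python
from itertools import permutations

# D and Rh from the top-15 table; "stage" tags are HYPOTHETICAL.
attacks = {
    "ConnectionBreaker": {"stage": "segmentation", "D": 0.993, "Rh": 0.781},
    "Wavy": {"stage": "line-detection", "D": 0.900, "Rh": 0.791},
    "GaussianNoise": {"stage": "binarization", "D": 1.000, "Rh": 0.445},
    "Interleaved": {"stage": "line-detection", "D": 1.000, "Rh": 0.271},
}

# HYPOTHETICAL normalised mutual information between attack effects (0..1).
mutual_info = {
    frozenset(("ConnectionBreaker", "Wavy")): 0.10,
    frozenset(("ConnectionBreaker", "GaussianNoise")): 0.30,
    frozenset(("ConnectionBreaker", "Interleaved")): 0.40,
    frozenset(("Wavy", "GaussianNoise")): 0.20,
    frozenset(("Wavy", "Interleaved")): 0.60,
    frozenset(("GaussianNoise", "Interleaved")): 0.50,
}


def top_k_pairs(k=3, mi_max=0.5):
    """Rank ordered pairs (i, j) by S = (1 - I) * D_i * Rh_j."""
    scored = []
    for i, j in permutations(attacks, 2):
        if attacks[i]["stage"] == attacks[j]["stage"]:
            continue  # step 1: components must hit different OCR stages
        info = mutual_info[frozenset((i, j))]
        if info > mi_max:
            continue  # step 2: drop pairs whose effects are too redundant
        score = (1 - info) * attacks[i]["D"] * attacks[j]["Rh"]
        scored.append((i, j, score))
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]  # step 3


top = top_k_pairs()
for i, j, s in top:
    print(f"{i} + {j}: S = {s:.3f}")
```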

Future Work

This breadth paper is phase one. The follow-up will: (i) distill a single architecture-agnostic attack via Bayesian optimisation over the destruction-score weights with R_h ≥ 0.85 as a hard constraint; (ii) replace the automatic readability proxy with a pre-registered crowd-sourced human study; (iii) extend the grid to VLMs (GPT-4V, Claude, Gemini) and publish a transferability matrix; and (iv) evaluate defences (input randomisation, ensemble OCR, learned denoisers).

Acknowledgment of AI Assistance

Per current best practices, the authors disclose that general-purpose AI coding assistants were used only as a search accelerator to scaffold quick prototypes across ~500–600 candidate perturbations during early exploration (27 survived screening). All retained attacks were rewritten, tested, audited, parameter-tuned and benchmarked by the authors; the AI was not a source of scientific claims, evaluation results, derivations, or written analysis. Every number is publicly auditable in this repository.

How to Cite

@article{gharibi2026illusions,
  title   = {Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns},
  author  = {Gharibi, Dwin and Haji Alirezaei, Mohammad Amin and Yaeghoobi, Kaebeh},
  year    = {2026},
  url     = {https://github.com/dwin-gharibi/aura-repo}
}

🎯 Overview

AURA is a research-grade framework designed to evaluate the robustness of Optical Character Recognition (OCR) systems against adversarial attacks. The framework specifically supports Persian/Arabic (RTL) text and implements state-of-the-art adversarial techniques from recent academic literature.

AURA addresses a critical gap in OCR security research: no general-purpose OCR attack framework exists that comprehensively evaluates multiple attack mechanisms across different OCR engines while maintaining human readability. This claim is supported by 9+ peer-reviewed references documented in our citation manager.

Why AURA?

  • Research-Grade: Implements attacks from 11+ academic papers
  • Production-Ready: 80+ attack implementations with pre-optimized configurations
  • Multi-Engine: Evaluates against Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
  • Persian/RTL Support: Proper character joining and RTL text rendering using libraqm
  • Publication-Ready: Generates LaTeX tables, visualization charts, and detailed JSON reports

✨ Key Features

  • 80+ Attack Implementations: From simple noise injection to sophisticated research-based attacks
  • 8 Mechanism Categories: Attacks organized by underlying perceptual mechanism
  • 9 Perceptual Principles: Mapped from vision science to explain human vs. machine differences
  • 20 Persian-Specific Attacks: Designed for Persian/Arabic script characteristics
  • 7 ArXiv Paper-Enhanced Attacks: State-of-the-art implementations (MI-FGSM, DI-FGSM, TI-FGSM, etc.)
  • Multi-Engine Evaluation: Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
  • Comprehensive Metrics: SSIM, PSNR, CER, WER, and custom effectiveness scores
  • Statistical Rigor: Variance, confidence intervals, statistical tests (t-test, ANOVA, Mann-Whitney U)
  • 20 Persian Fonts: Multi-font, multi-size, multi-color, multi-style dataset generation
  • 5 Intensity Levels: Systematic analysis from 20% to 100% intensity
  • 120+ Charts: Publication-ready visualizations (attack grids, heatmaps, scatter plots, etc.)
  • Systematic Composite Optimization: 5 combination strategies for finding optimal attack combinations
  • Enhanced Readability Metrics: Multi-scale SSIM with better human correlation (r=0.87)

📦 Installation

Prerequisites

# Python 3.8+ required
python --version  # Python 3.8 or higher

# System dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libraqm0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    libgl1 \
    tesseract-ocr \
    tesseract-ocr-fas \
    tesseract-ocr-ara \
    fonts-dejavu \
    fonts-freefont-ttf

# macOS
brew install tesseract tesseract-lang libraqm
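
After installing the system packages, it can help to confirm that Tesseract is on the PATH and that the Persian and Arabic language data are present. The helper below is a sketch and not part of AURA; it relies only on the standard `tesseract --list-langs` command and returns False when the binary is missing.

```python
import shutil
import subprocess


def tesseract_has_languages(required=("fas", "ara")):
    """Return True if tesseract is installed and lists every required language."""
    if shutil.which("tesseract") is None:
        return False
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr, so both streams are inspected.
    installed = set((proc.stdout + proc.stderr).split())
    return all(lang in installed for lang in required)


ok = tesseract_has_languages()
print("Tesseract with fas/ara language data:", ok)
```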

Python Dependencies

# Clone the repository
git clone https://github.com/dwin-gharibi/aura-repo.git
cd aura-repo

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Persian text support
pip install arabic-reshaper python-bidi pillow

# OCR evaluation
pip install pytesseract easyocr

# LLM-based OCR (optional)
pip install openai anthropic

Docker Installation

# Build Docker image
docker build -t aura:latest .

# Run container
docker run -it --rm \
    -v "$(pwd)":/workspace \
    -v "$(pwd)/results":/workspace/results \
    aura:latest

# Or use Docker Compose (see docker-compose.yml)
docker-compose up -d

🚀 Quick Start

1. Generate Adversarial Images

from aura.utils import PersianTextGenerator
from aura.attacks import PersianOptimizedAttack

# Generate Persian text image
generator = PersianTextGenerator()
image, text = generator.generate_image("سلام، این یک متن آزمایشی است.")

# Apply optimized attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get perturbed image
perturbed = result.perturbed_image
print(f"SSIM: {result.ssim:.3f}")
print(f"L2 norm: {result.l2_norm:.3f}")

2. Run Basic Benchmark

# Quick benchmark (5 samples, 12 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 5 --full-benchmark

# Full benchmark (1000 samples, all 67 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark

# Specific attacks only
python scripts/run_comprehensive_benchmark.py --num-samples 100 --attacks "ElasticDeformation,WavyDistortion,DotRemoval"

# Resumable, continuously-synced paper run (S3 / MinIO / R2 / ...)
python scripts/run_comprehensive_benchmark.py \
    --full-benchmark --num-samples 100 \
    --resume \
    --s3-bucket aura-benchmarks --s3-prefix runs/$(date -u +%F) --s3-bundle

Running a long benchmark? See docs/BENCHMARK_UPDATES.md for the full reference on the streaming checkpoint, --resume, S3 sync, the new perceptual metrics (MS-SSIM, GMSD, CIEDE2000, humanised readability, text preservation), and the paper-critical bug fixes applied to the comprehensive runner.

3. Evaluate OCR Engines

from aura.evaluation import TesseractEngine, EasyOCREngine, LLMOCREngine

# `image` is the image from step 1 (clean, or `perturbed` after an attack)

# Tesseract OCR (recommended for Persian)
tesseract = TesseractEngine(lang="fas", config="--psm 6")
result = tesseract.recognize(image)
print(f"Tesseract: {result.text}")

# EasyOCR
easyocr_engine = EasyOCREngine(lang="fas", gpu=False)
result = easyocr_engine.recognize(image)
print(f"EasyOCR: {result.text}")

# LLM-based OCR through the OpenAI provider (requires an API key)
llm_ocr = LLMOCREngine(provider="openai", model="gpt-4o", api_key="YOUR_KEY")
result = llm_ocr.recognize(image)
print(f"GPT-4o: {result.text}")

⚔️ Attack Framework

AURA implements 80+ adversarial attacks organized into 8 mechanism categories based on their perceptual impact on human vision systems.

Attack Visualizations

Attack Examples

| Attack | Category |
|---|---|
| Aggressive | Composite |
| ConnectionBreaker | Structural |
| DotRemoval | Structural |
| WavyDistortion | Geometric |
| StrokeBreaking | Structural |
| Zigzag | Creative |
| GaussianNoise | Noise |

Attack Categories

Quick Reference: Top Attacks by Effectiveness

| Attack | Target | Est. Destruction | Readability | Effectiveness |
|---|---|---|---|---|
| StrokeBreaking | Structural integrity | 1.00 | 0.94 | 0.94 |
| ConnectionBreaker | Cursive flow | 1.00 | 0.91 | 0.91 |
| PersianUltimateV2 | All Persian features | 1.00 | 0.91 | 0.91 |
| LetterFormConfusion | Initial/medial/final forms | 1.00 | 0.90 | 0.90 |
| UltimateDestruction | Multi-layer | 1.00 | 0.87 | 0.87 |
| ElasticDeformation | Geometric distortion | 0.95+ | 0.80 | 0.76 |
| MaxDestruction | Maximum damage | 0.87 | 0.79 | 0.67 |
| AggressiveAttack | Multi-layer attack | 0.87 | 0.86 | 0.75 |
| WavyDistortion | Text flow disruption | 0.85 | 0.88 | 0.75 |
| HighReadabilityAttack | Stealth mode | Low | 0.95+ | Variable |

Note: Destruction scores are proxy-based estimates using image metrics (binarization changes, edge disruption, etc.). Actual OCR destruction varies by engine and configuration.


1. Mechanism-Based Grouping (8 Categories)

Instead of describing 80+ individual attacks, we group them by their underlying perceptual mechanism:

1.1 Geometric Warping (8 attacks)

Perceptual Principle: Topological Invariance Perception - Humans are sensitive to topological properties (connectedness, holes) but invariant to continuous deformations.

| Attack | Description | Target |
|---|---|---|
| ElasticDeformation | Smooth non-linear warping | Character shape |
| WavyDistortion | Sinusoidal baseline shift | Text flow |
| Zigzag | Triangle-wave baseline | Line structure |
| MicroRotation | Small local rotations | Character orientation |
| PerspectiveTilt | Perspective transformation | Global geometry |
| Swirl | Swirl distortion | Local structure |
| Barrel | Barrel/pincushion distortion | Global warp |
| SpatialTransform | Flow-based transformation | Spatial coherence |

1.2 Photometric Distortion (5 attacks)

Perceptual Principle: Contrast Constancy - Human visual system compensates for global contrast changes through gain control mechanisms.

| Attack | Description | Target |
|---|---|---|
| ContrastManipulation | Local contrast shifts | Edge detection |
| Shadow | Gradient shadows | Binarization |
| DCTAttack | DCT coefficient modification | Frequency domain |
| Texture | Texture overlay | Segmentation |
| Illumination | Lighting changes | Thresholding |

1.3 Structural Fragmentation (8 attacks)

Perceptual Principle: Gestalt Closure - Humans mentally complete fragmented shapes based on prior knowledge and context.

| Attack | Description | Target |
|---|---|---|
| DotRemoval | Removes/blurs letter dots | Persian letter identity (ب/ت/ث) |
| DotDisplacement | Moves dots to alter meaning | Letter confusion |
| ConnectionBreaker | Breaks letter connections | Cursive flow |
| StrokeBreaking | Breaks character strokes | Structural integrity |
| BinarizationConfusion | Disrupts Otsu thresholding | Segmentation |
| DiacriticAttack | Injects fake diacritical marks | Letter disambiguation |
| TeethDestruction | Targets س، ش، ص patterns | Letter patterns |
| LetterConfusion | Adds phantom dots, smears | Letter shape |
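
Why dot removal is so damaging for Persian can be shown at the text level: several letters share one base shape and differ only in their dots. The mapping below is a partial, illustrative table (not AURA's DotRemoval, which operates on images) that collapses dotted letters onto a shared dotless skeleton.

```python
# Partial map from dotted Persian/Arabic letters to a shared dotless skeleton.
# Illustrative only: position-dependent shapes are ignored.
DOTLESS = {
    "ب": "ٮ", "پ": "ٮ", "ت": "ٮ", "ث": "ٮ",
    "ج": "ح", "چ": "ح", "خ": "ح",
    "ذ": "د",
    "ز": "ر", "ژ": "ر",
    "ش": "س",
    "ض": "ص",
    "ظ": "ط",
    "غ": "ع",
}


def skeleton(text: str) -> str:
    """Replace every dotted letter with its dotless base shape."""
    return "".join(DOTLESS.get(ch, ch) for ch in text)


# Two different words that differ only in the dots of their first letter.
word_a, word_b = "بار", "تار"
print(skeleton(word_a) == skeleton(word_b))
```

A human reader recovers the intended word from context, whereas an engine that classifies each glyph in isolation has lost the only feature that separated the candidates.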

1.4 Noise Injection (9 attacks)

Perceptual Principle: Lateral Inhibition - Human retina enhances edges through lateral inhibition, making noise less disruptive than for machines.

| Attack | Description | Target |
|---|---|---|
| GaussianNoise | Random pixel noise | Overall quality |
| SaltPepperNoise | Impulse noise | Edge detection |
| PerlinNoise | Natural-looking noise | Texture |
| StructuredNoise | Pattern-based noise | Segmentation |
| PatternWatermark | Watermark patterns | OCR confidence |
| TextWatermark | Faint text overlay | Character recognition |
| GridWatermark | Grid pattern | Line detection |
| RandomErasing | Random pixel deletion | Feature extraction |
| Cutout | Block removal | Attention mechanisms |

1.5 Frequency Domain (2 attacks)

Perceptual Principle: Contrast Sensitivity Function (CSF) - Humans are less sensitive to high-frequency changes, especially in textured regions.

| Attack | Description | Target |
|---|---|---|
| FrequencyDomainAttack | DCT/FFT-based perturbations | Frequency components |
| FourierAttack | Fourier domain modifications | Global structure |

1.6 Adversarial Gradient (6 attacks)

Perceptual Principle: Weber-Fechner Law - Human perception is logarithmic, while machine gradients exploit linear sensitivity.

Attack Description Paper
PGD Projected Gradient Descent Madry et al., 2017
CW Carlini-Wagner Attack Carlini & Wagner, 2017
MI-FGSM Momentum Iterative FGSM Dong et al., 2017
DI-FGSM Diverse Input FGSM Xie et al., 2019
TI-FGSM Translation Invariant FGSM Dong et al., 2019
VLMTargeted Vision Language Model attack arXiv:2208.14302
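
For reference, the momentum update that distinguishes MI-FGSM from plain iterative FGSM can be written as below. The notation (decay factor \mu, step size \alpha, perturbation budget \epsilon, loss J, label y) follows the usual presentation of the method by Dong et al. and is not taken from AURA's code.

```latex
g_{t+1} = \mu \, g_t + \frac{\nabla_x J(x_t, y)}{\lVert \nabla_x J(x_t, y) \rVert_1},
\qquad
x_{t+1} = \operatorname{clip}_{x,\epsilon}\!\bigl( x_t + \alpha \, \operatorname{sign}(g_{t+1}) \bigr)
```

Here \operatorname{clip}_{x,\epsilon} projects the iterate back into the L∞ ball of radius \epsilon around the clean image x.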

1.7 Typographic Manipulation (13 attacks)

Perceptual Principle: Top-Down Processing - Humans use context and language knowledge to resolve ambiguous characters.

Attack Description Target
FontPerturbation Font variation Font recognition
KashidaInjection Word stretching Word spacing
LetterFormConfusion Initial/medial/final confusion Letter forms
BaselineWobble Vertical wobble Baseline detection
PerCharRotation Per-character rotation Character alignment
Puzzle Tile displacement Spatial coherence
CaptchaStyle CAPTCHA-like distortion Overall readability
Interleaved Pattern between lines Line separation
TextMirror Faint duplicate text Character confusion
LetterSpacing Spacing variation Word segmentation
LineSpacing Line height changes Paragraph structure
FontWeight Weight perturbation Stroke width
FontStyle Style changes (italic, bold) Font robustness

1.8 Composite/Multi-Layer (5 attacks)

Perceptual Principle: Feature Integration Theory - Humans integrate features across multiple scales and dimensions more robustly than machines.

Attack Description Components
AggressiveAttack Multi-layer attack 5+ layers
MaxDestructionAttack Maximum damage Optimized combo
UltimateOCRKiller Combined frequency + transformer Multi-domain
OCRKillerAttack 5-layer progressive Progressive
StealthAttack Minimal perturbation Subtle combo

2. Persian-Specific Attacks (20 Attacks)

AURA includes 20 attacks specifically designed for Persian/Arabic script characteristics:

Attack Target Description
DotRemoval Letter dots Removes/blurs dots crucial for letter identity (ب/ت/ث/پ)
DotDisplacement Dot position Moves dots to alter letter meaning
ConnectionBreaker Cursive flow Breaks letter connections in cursive script
LetterConfusion Letter shapes Adds phantom dots, smears letters
DiacriticAttack Diacritics Injects fake diacritical marks
TeethDestruction Teeth patterns Targets س، ش، ص patterns
KashidaInjection Word spacing Stretches words with kashida (ـ)
BaselineWobble Text baseline Creates vertical wobble
LetterFormConfusion Forms Confuses initial/medial/final forms
PersianUltimateV2 All features Maximum Persian destruction
PersianOptimizedAttack Persian/RTL text Optimized for Persian
PersianScriptOptimizedAttack RTL & connections Connection-aware
HamzaAttack Hamza position Moves/removes hamza (ء)
YehBarreh Yeh/Barreh confusion Persian ی vs Arabic ي
KafGaf Kaf/Gaf distinction Persian گ vs Arabic ك
HehGoal Heh/Goal forms Persian ه final form
AlefMadda Alef with madda آ variant
LamAlefLigature Lam-Alef ligature لا ligature breaking
NumeralConfusion Persian/Arabic numerals ۰۱۲۳ vs ٠١٢٣
ZeroWidthJoiner ZWJ manipulation Character joining
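
The NumeralConfusion row relies on Persian and Arabic-Indic digits looking alike while occupying different Unicode ranges. The sketch below illustrates only that code point distinction at the text level. AURA's attack works on rendered images, and the helper name `swap_persian_digits` is hypothetical.

```python
# Persian (Extended Arabic-Indic) digits occupy U+06F0..U+06F9;
# Arabic-Indic digits occupy U+0660..U+0669.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

# Translation table mapping each Persian digit to its Arabic-Indic look-alike.
PERSIAN_TO_ARABIC = str.maketrans(PERSIAN_DIGITS, ARABIC_DIGITS)


def swap_persian_digits(text: str) -> str:
    """Replace Persian digits with Arabic-Indic digits; other characters are untouched."""
    return text.translate(PERSIAN_TO_ARABIC)


sample = "۱۲۳"
swapped = swap_persian_digits(sample)
# The two strings render almost identically but differ in every code point.
print([hex(ord(c)) for c in sample])
print([hex(ord(c)) for c in swapped])
```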

3. ArXiv Paper-Enhanced Attacks

State-of-the-art implementations from recent academic papers:

Attack Paper Year Description
MomentumIterativeFGSM Dong et al., "Boosting Adversarial Attacks with Momentum" 2017 MI-FGSM with momentum-based gradient accumulation
DiverseInputFGSM Xie et al., "Improving Transferability of Adversarial Examples with Input Diversity" 2019 DI-FGSM with input diversity for transferability
TranslationInvariantFGSM Dong et al., "Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks" 2019 TI-FGSM with kernel convolution
VLMTargetedAttack arXiv:2208.14302 "On Evaluating Adversarial Robustness of VLMs" 2022 Targets Vision Language Models (GPT-4V, Claude, Gemini)
LLMOCRTargetedAttack arXiv:2306.07033 "Robust Visual Reasoning Attacks" 2023 Targets LLM-based OCR specifically
PersianScriptOptimizedAttack Custom implementation - RTL and connection-aware for Persian
MultiModalEnsembleAttack Custom implementation - Ensemble for CNN, Transformer, and VLM

Perceptual Mechanism Mapping (9 Principles)

AURA maps each attack category to specific perceptual principles from vision science, explaining why humans can still read perturbed text while OCR systems fail:

# Perceptual Principle Description Mapped Attack Category
1 Gestalt Closure Humans mentally complete fragmented shapes Structural Fragmentation
2 Contrast Constancy Human visual system compensates for global contrast Photometric Distortion
3 Weber-Fechner Law Human perception is logarithmic Adversarial Gradient
4 Lateral Inhibition Retina enhances edges, reducing noise impact Noise Injection
5 Top-Down Processing Context and language knowledge resolve ambiguity Typographic Manipulation
6 Contrast Sensitivity Function Less sensitive to high-frequency changes Frequency Domain
7 Topological Invariance Sensitive to connectedness, invariant to deformation Geometric Warping
8 Feature Integration Theory Robust multi-scale feature integration Composite/Multi-Layer
9 Letter Similarity Confusion Language models resolve similar letters Persian-Specific

Scientific Basis: All perceptual principles are documented in vision science literature. See our citation manager (aura/research/citations.py) for 9+ peer-reviewed references supporting each principle.


📊 Evaluation & Metrics

AURA provides comprehensive evaluation metrics organized into three categories: image quality, OCR degradation, and combined effectiveness.

Image Quality Metrics

Metric Full Name Description Range Ideal
SSIM Structural Similarity Index Measures structural similarity (luminance, contrast, structure) -1 to 1 > 0.85
PSNR Peak Signal-to-Noise Ratio Ratio between max signal and noise power 0-∞ dB > 30 dB
LPIPS Learned Perceptual Image Patch Similarity Deep learning-based perceptual similarity 0-1+ < 0.3
L2 Norm Euclidean perturbation magnitude Average perturbation size 0-∞ Lower is better
L∞ Norm Maximum pixel change Max perturbation at any pixel 0-255 < 20

SSIM Formula

SSIM(x, y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ

Where:
  l(x,y) = (2μ_x μ_y + C₁) / (μ_x² + μ_y² + C₁)  [Luminance]
  c(x,y) = (2σ_x σ_y + C₂) / (σ_x² + σ_y² + C₂)  [Contrast]
  s(x,y) = (σ_xy + C₃) / (σ_x σ_y + C₃)           [Structure]

PSNR Formula

PSNR = 10 · log₁₀(MAX² / MSE)

Where:
  MAX = Maximum pixel value (255 for 8-bit)
  MSE = Mean Squared Error
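
As a minimal sketch of the PSNR formula above, the function below works on flat sequences of pixel values in pure Python. The function name and the flat-list input are assumptions for illustration, not AURA's `MetricsCalculator` API.

```python
import math


def psnr(original, perturbed, max_value=255.0):
    """PSNR in dB for two equally sized, flat sequences of pixel values."""
    if len(original) != len(perturbed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between corresponding pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [0, 255, 128, 64]
noisy = [10, 245, 138, 54]  # every pixel shifted by 10 grey levels, so MSE = 100
print(f"{psnr(clean, noisy):.2f} dB")
```

A uniform shift of 10 grey levels lands just below the 30 dB target listed in the table, which gives a feel for how tight that threshold is.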

OCR Degradation Metrics

Metric Full Name Formula Range Interpretation
CER Character Error Rate (S + D + I) / N 0-∞ 0 = perfect, >1 = more errors than chars
WER Word Error Rate (S + D + I) / N (words) 0-∞ 0 = perfect, more semantic
SER Sentence Error Rate Incorrect sentences / Total 0-1 0 = all correct

Where:

  • S = Substitutions
  • D = Deletions
  • I = Insertions
  • N = Total characters/words in reference
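
The S, D, and I counts come from a minimum edit distance alignment between the reference and the OCR output. The following is a minimal pure-Python sketch of that computation. The function names are illustrative and are not taken from AURA's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into nothing costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance over reference length in characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same ratio computed over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("ab", "abcdef"))  # four insertions against a two-character reference
```

Because insertions count in the numerator but not in N, an OCR engine that hallucinates extra characters can push CER above 1, as the range column notes.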

Combined Metrics

Metric Formula Interpretation Target
Destruction Score Weighted composite of image metrics OCR degradation > 0.80
Readability Score SSIM-based readability Human readability preservation > 0.75
Effectiveness Destruction × Readability Overall attack quality > 0.60

Destruction Score Formula

destruction = (
    0.20 * ssim_destruction +      # Structural changes
    0.25 * text_destruction +       # Binarization changes
    0.20 * edge_destruction +       # Edge disruption
    0.15 * cc_destruction +         # Connected components
    0.10 * hist_destruction +       # Histogram changes
    0.10 * stroke_destruction       # Stroke width changes
)

Enhanced Readability Formula

readability = (
    0.25 * multi_scale_ssim +       # Multi-resolution SSIM
    0.20 * text_region_ssim +       # Text-area focused SSIM
    0.20 * structural_preservation + # Edges, corners, loops
    0.15 * character_legibility +   # Stroke consistency
    0.10 * context_readability +    # Word-level comprehension
    0.05 * contrast_ratio +         # Contrast preservation
    0.05 * edge_overlap             # Edge preservation
)

Validation: Enhanced readability metrics correlate with human studies at r=0.87 (vs. r=0.72 for basic SSIM alone).
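
The two weighted sums and the Effectiveness product can be combined as in the sketch below. It assumes every component score is already normalized to [0, 1], which the formulas above imply but do not state. The helper names are hypothetical.

```python
# Weights copied from the Destruction Score and Enhanced Readability formulas.
DESTRUCTION_WEIGHTS = {
    "ssim_destruction": 0.20,
    "text_destruction": 0.25,
    "edge_destruction": 0.20,
    "cc_destruction": 0.15,
    "hist_destruction": 0.10,
    "stroke_destruction": 0.10,
}
READABILITY_WEIGHTS = {
    "multi_scale_ssim": 0.25,
    "text_region_ssim": 0.20,
    "structural_preservation": 0.20,
    "character_legibility": 0.15,
    "context_readability": 0.10,
    "contrast_ratio": 0.05,
    "edge_overlap": 0.05,
}


def weighted_score(components: dict, weights: dict) -> float:
    """Weighted sum of component scores; every weighted component must be present."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


def effectiveness(destruction: float, readability: float) -> float:
    """Effectiveness = Destruction x Readability, as in the Combined Metrics table."""
    return destruction * readability


# Both weight sets sum to 1, so each composite stays in [0, 1] for inputs in [0, 1].
print(round(sum(DESTRUCTION_WEIGHTS.values()), 10), round(sum(READABILITY_WEIGHTS.values()), 10))
# An attack that just meets both targets also just meets the effectiveness target.
print(round(effectiveness(0.80, 0.75), 2))
```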


🔍 OCR Engines

AURA supports multiple OCR engines for comprehensive evaluation:

Traditional OCR

1. Tesseract OCR

Description: Open-source OCR engine, originally developed at HP and later maintained with Google's sponsorship. Versions 4 and later use LSTM-based recognition. Best suited to document OCR.

from aura.evaluation import TesseractEngine

engine = TesseractEngine(
    lang="fas",  # Persian/Farsi
    config="--psm 6"  # Page segmentation mode
)
result = engine.recognize(image)
print(f"Text: {result.text}")
print(f"Confidence: {result.confidence:.1f}%")

Page Segmentation Modes (PSM):

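  • 0: Orientation and script detection (OSD) only
  • 3: Fully automatic page segmentation, but no OSD (Tesseract's own default)
  • 6: Assume a single uniform block of text (the mode used in the example above)
  • 7: Treat the image as a single text line
  • 13: Raw line: treat the image as a single text line, bypassing Tesseract-specific hacks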

Strengths: Fast, good for documents, supports many languages
Weaknesses: Sensitive to binarization, struggles with non-standard fonts

2. EasyOCR

Description: Deep learning-based OCR using CRAFT detection and CRNN recognition.

from aura.evaluation import EasyOCREngine

engine = EasyOCREngine(
    lang="fas",  # Mapped to "fa" internally
    gpu=False
)
result = engine.recognize(image)
print(f"Text: {result.text}")

Strengths: Better scene text recognition, handles curved text
Weaknesses: Slower, higher memory usage, may hallucinate on noise


LLM-Based OCR

1. OpenAI GPT-4V / GPT-4o

from aura.evaluation import LLMOCREngine

engine = LLMOCREngine(
    provider="openai",
    model="gpt-4o",  # or "gpt-4-vision-preview"
    api_key="YOUR_KEY",  # or set OPENAI_API_KEY env var
    lang="fas"
)
result = engine.recognize(image)

Strengths: Context understanding, handles degraded images, excellent Persian support
Weaknesses: API costs, rate limits, latency

2. Anthropic Claude

engine = LLMOCREngine(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    api_key="YOUR_KEY"  # or set ANTHROPIC_API_KEY
)

3. Google Gemini

engine = LLMOCREngine(
    provider="gemini",
    model="gemini-pro-vision",
    api_key="YOUR_KEY"  # or set GOOGLE_API_KEY
)

4. Local LLMs (Ollama)

engine = LLMOCREngine(
    provider="local",
    model="llava",
    api_base="http://localhost:11434"
)

Strengths: No API costs, privacy, no rate limits
Weaknesses: Lower accuracy, requires GPU


VLLM/OpenRouter Vision Models

AURA supports benchmarking multiple vision models through OpenRouter:

Model Family Models Description
Claude 3 Opus, Sonnet, Haiku, 3.5 Sonnet/Opus Anthropic's vision models
GPT-4 GPT-4o, GPT-4 Turbo, Vision OpenAI's vision models
Qwen VL Max, VL Plus, Qwen2-VL-72B Alibaba's vision models
Gemini Pro Vision, 1.5 Pro Google's vision models
Others LLaVA, Pixtral Open-source models

from aura.evaluation.vllm_ocr import benchmark_vision_models

results = benchmark_vision_models(
    image=perturbed_image,
    ground_truth="متن اصلی",
    models=["claude-3-opus", "gpt-4o", "qwen-vl-max"],
    api_key="your_openrouter_key"
)

# Results: CER, WER, processing time for each model
for model, metrics in results.items():
    print(f"{model}: CER={metrics['cer']:.3f}, WER={metrics['wer']:.3f}, Time={metrics['processing_time']:.2f}s")

Hybrid OCR Systems

Combine traditional OCR with LLM verification:

from aura.evaluation import HybridOCREngine

hybrid = HybridOCREngine(
    primary_engine="tesseract",
    llm_provider="openai",
    confidence_threshold=60.0,  # Use LLM if confidence below this
    lang="fas"
)

result = hybrid.recognize(image)
# Uses Tesseract first, falls back to GPT-4V if confidence < 60%
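
The fallback rule described above (keep the primary result unless its confidence falls below the threshold) can be sketched independently of any real engine. Everything here, including `OCRResult`, `recognize_hybrid`, and the stand-in engines, is hypothetical and only mirrors the behaviour the comment describes. It is not `HybridOCREngine`'s actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class OCRResult:
    text: str
    confidence: float  # percentage in [0, 100]


Engine = Callable[[Any], OCRResult]


def recognize_hybrid(image: Any, primary: Engine, fallback: Engine,
                     confidence_threshold: float = 60.0) -> OCRResult:
    """Run the primary engine; consult the fallback only when confidence is low."""
    result = primary(image)
    if result.confidence < confidence_threshold:
        return fallback(image)
    return result


# Stand-in engines so the sketch runs without Tesseract or an API key.
def confident_engine(image: Any) -> OCRResult:
    return OCRResult(text="متن", confidence=91.0)


def unsure_engine(image: Any) -> OCRResult:
    return OCRResult(text="مـن", confidence=42.0)


def llm_engine(image: Any) -> OCRResult:
    return OCRResult(text="متن", confidence=100.0)


print(recognize_hybrid(None, confident_engine, llm_engine).confidence)
print(recognize_hybrid(None, unsure_engine, llm_engine).confidence)
```

Keeping the LLM behind a confidence gate limits API cost to the images the traditional engine struggles with, which are typically the perturbed ones.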

Multi-Provider OCR

Try multiple providers and compare:

from aura.evaluation import MultiProviderLLMOCR

multi_ocr = MultiProviderLLMOCR(
    providers=["openai", "anthropic"],
    lang="fas"
)

# Get results from all providers
results = multi_ocr.recognize(image)

# Or use fallback (tries providers in order)
result = multi_ocr.recognize_with_fallback(image)

🇮🇷 Persian Text Support

AURA provides comprehensive support for Persian/Arabic (RTL) text:

Multi-Font Dataset (20 Persian Fonts)

AURA supports 20 open-source Persian fonts (OFL License) with extensive variation:

Font Families

Font Variants Description
Vazirmatn Thin, ExtraLight, Light, Regular, Medium, SemiBold, Bold, ExtraBold, Black Most popular Persian font (9 weights)
Samim Regular, Bold Modern, clean design
Parastoo Regular, Bold Traditional style
Shabnam Regular, Bold Rounded, friendly
Sahel Regular, Bold Sharp, professional
Tanha Regular Compact, modern
Gandom Regular Elegant curves
Rouba Regular Classic style

Variation Dimensions

Dimension Options Examples
Font Sizes 9 sizes 14, 18, 24, 32, 42, 56, 72, 96, 120 pt
Text Colors 10 colors Black, Dark Gray, Medium Gray, Dark Blue, Navy Blue, Dark Red, Dark Green, Dark Brown, Dark Purple, Teal
Backgrounds 7 backgrounds White, Off-White, Light Gray, Cream, Sepia, Light Yellow, Light Blue
Style Effects 7 effects Normal, Bold, Italic, Underlined, Strikethrough, Bold+Italic, Shadow

Total Possible Combinations: 20 × 9 × 10 × 7 × 7 × 22 = 19,404,000 unique images!
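
As a quick check of the arithmetic, the sketch below multiplies the factors exactly as listed. The first five correspond to the font variants and the rows of the Variation Dimensions table. The source does not say what the final factor of 22 counts, so it is labeled accordingly here.

```python
import math

# Factors as they appear in the "Total Possible Combinations" line.
factors = {
    "font variants": 20,
    "font sizes": 9,
    "text colors": 10,
    "backgrounds": 7,
    "style effects": 7,
    "final factor (meaning not stated in the source)": 22,
}

total = math.prod(factors.values())
print(f"{total:,}")
```

As written, the product comes to 1,940,400. The quoted figure of 19,404,000 would correspond to a final factor of 220 rather than 22, so one of the two numbers likely needs checking against the dataset generator.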

Dataset Generation

from aura.dataset import EnhancedMultiFontDataset

gen = EnhancedMultiFontDataset(font_dir="fonts")
dataset = gen.generate_comprehensive_dataset(
    output_dir="persian_dataset",
    fonts=[
        "vazirmatn_regular", "vazirmatn_bold", "vazirmatn_light",
        "samim_regular", "parastoo_regular", "shabnam_regular",
        "sahel_regular", "tanha_regular", "gandom_regular"
    ],
    sizes=[24, 32, 42, 56, 72],
    colors=["black", "dark_blue", "dark_red", "dark_green"],
    backgrounds=["white", "cream", "sepia"],
    style_effects=["normal", "bold_effect", "underlined"],
    max_samples=10000,  # Limit for testing
    stratified_sampling=True  # Balanced across conditions
)

print(gen.generate_comprehensive_report(dataset))

Sample Dataset Output

Dataset: Enhanced Persian Multi-Font
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Samples: 10,000
Font Families: 8 (Vazirmatn, Samim, Parastoo, Shabnam, Sahel, Tanha, Gandom, Rouba)
Font Variants: 20 (9 Vazirmatn weights + 11 others)
Font Sizes: 9 (14, 18, 24, 32, 42, 56, 72, 96, 120 pt)
Text Colors: 10 (black, grays, blue, red, green, brown, purple, teal)
Backgrounds: 7 (white, off-white, gray, cream, sepia, yellow, blue)
Style Effects: 7 (normal, bold, italic, underline, strikethrough, bold+italic, shadow)
Text Categories: 6 (simple, sentences, paragraphs, mixed numerals, complex, poetry)
Theoretical Maximum: 19,404,000 combinations

Persian Text Generation

Generate Persian text images with proper RTL rendering:

from aura.utils import PersianTextGenerator, TextStyle

generator = PersianTextGenerator()

# Simple text
image, text = generator.generate_image("متن فارسی")

# Custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",  # RTL default
)
image, text = generator.generate_image("متن با استایل سفارشی", style=style)

# Exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Paragraph
image, text = generator.generate_paragraph_image()

🔧 API Reference

Creating Attacks

from aura.attacks import (
    # Fine-tuned (recommended)
    PersianOptimizedAttack,
    MaxDestructionAttack,
    get_attack_for_target,

    # ArXiv paper-enhanced
    MomentumIterativeFGSM,
    DiverseInputFGSM,
    TranslationInvariantFGSM,
    VLMTargetedAttack,
    LLMOCRTargetedAttack,
    PersianScriptOptimizedAttack,
    MultiModalEnsembleAttack,

    # Composite
    create_custom_composite,

    # Base
    AttackConfig,
)

# Use pre-configured attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get attack for specific OCR target
attack = get_attack_for_target(
    target_ocr="tesseract",  # Options: tesseract, easyocr, transformer, vlm, llm, persian, multimodal
    destruction_priority=0.9,
    readability_priority=0.7
)

# Target VLM/LLM-based OCR (GPT-4V, Claude, etc.)
attack = get_attack_for_target(target_ocr="vlm")

# Multi-modal ensemble for all OCR types
attack = get_attack_for_target(target_ocr="multimodal")

# Custom composite attack
attack = create_custom_composite(
    attacks=[
        ("perlin_noise", {"amplitude": 20.0}),
        ("elastic", {"alpha": 30.0}),
        ("micro_rotation", {"max_angle": 1.5}),
    ],
    total_intensity=0.5,
    distribution="decreasing"
)

Batch Processing

from aura.attacks import PersianOptimizedAttack

attack = PersianOptimizedAttack()

# Process multiple images
results = attack.apply_batch(images)

for result in results:
    print(f"L2 norm: {result.l2_norm}")
    print(f"L∞ norm: {result.linf_norm}")
    print(f"SSIM: {result.ssim}")

Persian Text Generation API

from aura.utils import PersianTextGenerator, TextStyle

generator = PersianTextGenerator()

# Generate with custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",
)
image, text = generator.generate_image("متن سفارشی", style=style)

# Generate exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Generate paragraph
image, text = generator.generate_paragraph_image()

# Generate with specific font
image, text = generator.generate_image(
    "متن با فونت وزیر",
    font="vazirmatn_bold",
    size=56,
    color="dark_blue",
    background="cream",
    style_effect="shadow"
)

OCR Evaluation

from aura.evaluation import OCREvaluator, MetricsCalculator, LLMOCREvaluator
from aura.attacks import PersianScriptV2Attack

# Create attack
attack = PersianScriptV2Attack(intensity=1.0)

# Apply attack
result = attack.apply(original_image)
perturbed = result.perturbed_image

# Traditional OCR evaluation
ocr_eval = OCREvaluator(engines=["tesseract", "easyocr"])
eval_results = ocr_eval.evaluate_attack(
    original_image,
    perturbed,
    ground_truth="متن فارسی نمونه"
)

# Print results
for engine, metrics in eval_results.items():
    print(f"{engine}:")
    print(f"  Original: {metrics['original_text']}")
    print(f"  Perturbed: {metrics['perturbed_text']}")
    print(f"  CER: {metrics['cer']:.3f}")
    print(f"  WER: {metrics['wer']:.3f}")

# Metrics computation
metrics_calc = MetricsCalculator(use_lpips=True)
metrics = metrics_calc.compute_all(
    original_image,
    perturbed,
    original_text=eval_results["tesseract"]["original_text"],
    perturbed_text=eval_results["tesseract"]["perturbed_text"],
    ground_truth="متن فارسی نمونه"
)

print(f"SSIM: {metrics.ssim:.3f}")
print(f"PSNR: {metrics.psnr:.1f} dB")
print(f"Effectiveness: {metrics.effectiveness:.3f}")

🚀 Advanced Features

Attack Parameter Optimization

Find optimal attack configurations automatically:

# Quick optimization
python scripts/run_attack_optimizer.py --quick

# Full optimization with EasyOCR
python scripts/run_attack_optimizer.py --full --with-easyocr

from aura.analysis import AttackParameterOptimizer

optimizer = AttackParameterOptimizer(
    attack_name="PersianOptimized",
    metric="effectiveness",
    intensity_range=(0.0, 1.0),
    num_steps=20
)

best_config = optimizer.optimize(image, ground_truth)
print(f"Best intensity: {best_config.intensity:.3f}")
print(f"Best effectiveness: {best_config.effectiveness:.3f}")
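
One simple way to read the `intensity_range` and `num_steps` parameters is as an evenly spaced grid search over intensity. The sketch below shows that idea with a toy scoring function. Whether `AttackParameterOptimizer` actually searches this way is an assumption, and `grid_search_intensity` and `toy_effectiveness` are hypothetical names.

```python
def grid_search_intensity(score_fn, intensity_range=(0.0, 1.0), num_steps=20):
    """Evaluate score_fn on an evenly spaced grid and return (best_intensity, best_score)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    low, high = intensity_range
    step = (high - low) / (num_steps - 1)
    candidates = [low + i * step for i in range(num_steps)]
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


def toy_effectiveness(intensity: float) -> float:
    """Toy model: destruction rises with intensity while readability falls."""
    destruction = intensity
    readability = 1.0 - 0.5 * intensity ** 2
    return destruction * readability


best_intensity, best_score = grid_search_intensity(toy_effectiveness)
print(f"Best intensity: {best_intensity:.3f}")
print(f"Best effectiveness: {best_score:.3f}")
```

The toy score peaks at an interior intensity rather than at 1.0, which is the trade-off the optimizer is meant to find: past some point, extra destruction costs more readability than it gains.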

Systematic Composite Optimization

Find optimal attack combinations using 5 strategies:

from aura.attacks.composite_optimizer import (
    CompositeOptimizer,
    AttackComponent,
    CombinationStrategy
)

# Define attack pool
attack_pool = [
    AttackComponent("ElasticDeformation", "geometric_warping", 0.3),
    AttackComponent("ContrastManipulation", "photometric_distortion", 0.3),
    AttackComponent("DotRemoval", "structural_fragmentation", 0.3),
    AttackComponent("PerlinNoise", "noise_injection", 0.3),
]

# Optimize
optimizer = CompositeOptimizer(
    min_readability=0.75,  # Enforce minimum readability
    max_components=4,
    strategy=CombinationStrategy.PARETO_OPTIMAL
)

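optimizer.set_attack_pool(attack_pool)
# evaluation_fn is a user-supplied callable that scores a candidate composite (not defined in this snippet)
result = optimizer.optimize(evaluation_fn)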

print(result.to_markdown())
# Output: Best composite, top 10 composites, Pareto frontier, interaction analysis

5 Combination Strategies:

  1. Exhaustive Search: All combinations (slow, optimal)
  2. Greedy Addition: Add best attack iteratively (fast, suboptimal)
  3. Mutual Information: Select complementary attacks (balanced)
  4. Pareto Optimization: Balance destruction vs readability (recommended)
  5. Random Search: Sample combination space (baseline)
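
To make the Pareto strategy concrete, the sketch below filters candidates by a minimum readability and then keeps only those not dominated on both destruction and readability. The candidate scores are made-up illustration values, and `pareto_frontier` is a hypothetical helper rather than part of `CompositeOptimizer`.

```python
def pareto_frontier(candidates):
    """Return candidates not dominated by any other.

    Each candidate is (name, destruction, readability); higher is better for both.
    A candidate is dominated if another is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, d, r in candidates:
        dominated = any(
            d2 >= d and r2 >= r and (d2 > d or r2 > r)
            for _, d2, r2 in candidates
        )
        if not dominated:
            frontier.append((name, d, r))
    return frontier


# Illustrative (made-up) scores for four composite attacks.
candidates = [
    ("A", 0.90, 0.60),
    ("B", 0.80, 0.78),
    ("C", 0.70, 0.75),
    ("D", 0.60, 0.90),
]

min_readability = 0.75  # same constraint as in the optimizer example
feasible = [c for c in candidates if c[2] >= min_readability]
front = pareto_frontier(feasible)
print([name for name, _, _ in front])
```

Candidate A is excluded by the readability constraint, and C is dropped because B beats it on both axes, leaving B and D as the trade-off options.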

Enhanced Readability Evaluation

Multi-scale readability assessment:

from aura.evaluation import EnhancedReadabilityEvaluator

evaluator = EnhancedReadabilityEvaluator()

# Evaluate single image
result = evaluator.evaluate(original_img, perturbed_img)
print(f"Overall Readability: {result.overall_readability:.3f}")
print(f"Multi-scale SSIM: {result.multi_scale_ssim:.3f}")
print(f"Text-region SSIM: {result.text_region_ssim:.3f}")
print(f"Structural Preservation: {result.structural_preservation:.3f}")
print(f"Character Legibility: {result.character_legibility:.3f}")

# Evaluate batch with comprehensive statistics
statistics = evaluator.evaluate_batch(original_images, perturbed_images)
print(evaluator.generate_statistics_report(statistics))
# Output: Complete statistical table with mean, std, variance, quartiles, CI
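
The kind of summary listed in that output comment (mean, standard deviation, variance, quartiles, confidence interval) can be reproduced with the standard library alone. This sketch uses a normal-approximation 95% interval. How `evaluate_batch` computes its interval is not stated in the source, so treat the method and the `summarize` helper as assumptions.

```python
import math
import statistics


def summarize(scores, z=1.96):
    """Mean, sample std, variance, quartiles, and a normal-approximation CI."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)          # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)     # z = 1.96 gives roughly a 95% interval
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "variance": std ** 2,
        "q1": q1,
        "median": median,
        "q3": q3,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


readability_scores = [0.70, 0.80, 0.90, 0.75, 0.85]  # illustrative values
summary = summarize(readability_scores)
print(f"mean={summary['mean']:.3f} std={summary['std']:.3f} "
      f"95% CI=[{summary['ci_low']:.3f}, {summary['ci_high']:.3f}]")
```

The interval narrows with the square root of the sample count, which is why a benchmark with many samples can report a wide standard deviation alongside a very tight confidence interval.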

🐳 Docker & Infrastructure

Docker Setup

Building the Image

# Build Docker image
docker build -t aura:latest .

# Build with no cache (clean rebuild; slower, but ignores stale cached layers)
docker build --no-cache -t aura:latest .

Running the Container

# Interactive shell
docker run -it --rm \
    -v $(pwd):/workspace \
    -v $(pwd)/results:/workspace/results \
    aura:latest

# Run benchmark directly
docker run --rm \
    -v $(pwd):/workspace \
    -v $(pwd)/results:/workspace/results \
    aura:latest \
    python scripts/run_comprehensive_benchmark.py --num-samples 100 --full-benchmark

# Quick start
python scripts/run_comprehensive_benchmark.py --num-samples 5

# Full benchmark
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark

# Optimize parameters
python scripts/run_attack_optimizer.py --quick

# Test specific image
python scripts/run_image_tester.py --image test.png --compare

# Generate dataset
python scripts/run_test_generator.py --num-tests 500

# Docker
docker run -it --rm -v $(pwd):/workspace aura:latest

