AURA: Adversarial Use for Reliable Assessment

A comprehensive framework for generating adversarial perturbations against OCR systems while maintaining human readability.

📄 Research Paper: Illusions for Machines

Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns

Dwin Gharibi · Mohammad Amin Haji Alirezaei · Kaebeh Yaeghoobi
Department of Computer Engineering & Computer Networks,
K. N. Toosi University of Technology, Tehran, Iran

Contents of this section: Abstract · Headline result · Key contributions · How OCR & human vision differ · Evaluation metrics · Attack taxonomy · Benchmark results · Ineffective attacks · Composite attacks · Future work · AI-assistance disclosure · Citation

Abstract

Optical Character Recognition (OCR) systems have become ubiquitous in modern applications, from document digitization to autonomous vehicles. However, these systems fundamentally differ from human visual perception in their inability to exploit contextual understanding, imagination, and pattern completion. We present Illusions for Machines, a unified framework for generating adversarial perturbations that exploit the perceptual gap between human vision and machine recognition. Unlike previous works that focused on model-specific attacks using gradient-based methods such as FGSM, PGD, and C&W — which require white-box access and are tailored to specific architectures — our approach introduces architecture-agnostic, model-agnostic attacks based on human visual cognition patterns. By using visual illusions, gestalt principles, and the cognitive pattern-completion mechanisms that humans naturally employ, we construct perturbations that remain fully readable to human observers while effectively misleading traditional OCR engines (Tesseract, EasyOCR), modern vision–language models (GPT-4V, Claude, Gemini), and LLMs that rely on visual or extracted textual inputs. This paper is a systematization of attacks together with a unified evaluation framework, and constitutes the first phase of a larger research programme. We benchmark 27 attacks across 213,840 perturbed images (20 Persian sentences × 11 free fonts × 9 font sizes × 4 intensity levels × 27 attacks), running every sample through Tesseract and EasyOCR and reporting every metric with both standard deviation and 95% confidence intervals. Our top-performing structural attack (ConnectionBreaker) attains an effectiveness of 0.850 ± 0.078 (mean ± s.d.; 95% CI ±0.002), with Tesseract CER 0.882 ± 0.742, EasyOCR CER 0.658 ± 0.347, SSIM 0.920 ± 0.034, and humanized readability 0.781 ± 0.068.

Keywords: Adversarial Examples, OCR, Vision-Language Models, Human Visual Perception, Deep Learning Security.

TL;DR — Headline Result

The core thesis in one picture: a perturbation that a human reads effortlessly but that collapses OCR.

_{ConnectionBreaker (rank 1). Aggregated over n = 7,920 samples: SSIM = 0.920 ± 0.034, humanized readability R_h = 0.781 ± 0.068, Tesseract CER = 0.882 ± 0.742, EasyOCR CER = 0.658 ± 0.347, effectiveness E = 0.850 ± 0.078.}

Scale: 27 attacks and 213,840 perturbed images in total. Each attack is run on 20 texts × 11 free Persian fonts × 9 sizes × 4 intensities = 7,920 samples, and every sample is scored by both Tesseract and EasyOCR.
Rigor: every metric reported as mean ± s.d. with two-sided 95% confidence intervals; pairwise attack differences bootstrapped (1,000 resamples) into a Cohen's-d matrix.
Finding: structural attacks that exploit the Gestalt principle of closure (ConnectionBreaker, StrokeBreaking) dominate every font, size and intensity — while classical gradient attacks (PGD, C&W, MI-FGSM) are the weakest in the whole benchmark because OCR binarization "heals" their pixel noise.
Metric: effectiveness E = D · R_h (destruction × humanized readability) rewards attacks that destroy OCR and stay human-readable.

Key Contributions

A taxonomy of 27 attack families spanning structural, geometric, photometric and gradient-based perturbations, each mapped to a human-perception mechanism (Gestalt closure / continuity / figure-ground).
A reproducible benchmark of 213,840 perturbed samples across 11 free Persian fonts, 9 sizes, 4 intensities and 20 reference sentences — released in full as a per-sample CSV.
A statistically rigorous reporting protocol (means, standard deviations, 95% CIs, per-intensity sweeps, per-font / per-size / per-text generalisation, and bootstrapped significance testing).
A composite-attack selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search) that explains why certain attack combinations amplify non-linearly.
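The grid arithmetic behind the sample counts can be checked with a short sketch. The dimension counts and the four intensity values come from the text; the concrete font names and point sizes are not listed in this section, so plain indices stand in for them.

```python
from itertools import product

# Grid dimensions as stated in the text. Index ranges are placeholders
# for the actual attack names, sentences, fonts and sizes.
N_ATTACKS, N_TEXTS, N_FONTS, N_SIZES = 27, 20, 11, 9
INTENSITIES = (0.25, 0.50, 0.75, 1.00)

def benchmark_grid():
    """Yield one (attack, text, font, size, intensity) tuple per sample."""
    return product(range(N_ATTACKS), range(N_TEXTS), range(N_FONTS),
                   range(N_SIZES), INTENSITIES)

total = sum(1 for _ in benchmark_grid())
per_attack = total // N_ATTACKS
per_intensity = total // len(INTENSITIES)
print(total, per_attack, per_intensity)  # 213840 7920 53460
```

The three numbers match the totals quoted in this README: 213,840 samples overall, 7,920 per attack, and 53,460 per intensity level.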

Positioning. This is the breadth paper of a multi-phase programme: it surveys, taxonomises and benchmarks attacks and provides the evaluation harness. A follow-up paper will distill these findings into a single optimised, architecture-agnostic attack and add a controlled human study and VLM transfer matrix (see Future Work).

How OCR and Human Vision Differ (Background)

The attacks are designed against the concrete architectures below. Traditional engines rely on binarization → layout/line detection → character segmentation → sequence decoding; VLMs use a holistic vision transformer. Both lack human imagination and pattern completion, which is precisely the gap the attacks exploit.

_{Fig. 1 — Vision–language model. A vision encoder projects the image into the language embedding space; visual tokens are fused with text tokens and decoded by an LLM.}	_{Fig. 2 — CRNN OCR. Convolutional feature extraction → BiLSTM sequence modeling → CTC transcription (the EasyOCR family).}
_{Fig. 3 — Tesseract. LSTM-based recognition with adaptive binarization — the step that neutralises most pixel-noise attacks.}	_{Fig. 4 — Vision Transformer. Patch embedding + multi-head self-attention give VLMs global context across the image.}

_{Fig. 5 — Gestalt principles. Closure, similarity, proximity, continuity/common-fate: the human perceptual machinery the attacks lean on so that text stays readable to people while breaking machines.}

Evaluation Metrics

Each perturbed image is scored on a rich, fully-released metric suite:

Group	Metrics	Purpose
OCR failure	Character Error Rate (CER), Word Error Rate (WER)	Levenshtein-based recognition damage (can exceed 1.0 when OCR hallucinates).
Perceptual fidelity	SSIM, PSNR, MS-SSIM, LPIPS, GMSD, CIEDE2000	How close the attacked image stays to the clean one.
Topology preservation	Text-preservation, Stroke-preservation (mask-IoU)	Whether character body and central spine survive.
Composite (ours)	Humanized readability R_h, Destruction D, Effectiveness E = D · R_h	The balanced ranking criteria used throughout the paper.
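To make the "can exceed 1.0" remark concrete, here is a minimal sketch of Levenshtein-based CER and WER. It is a textbook implementation written for illustration, not the project's own scoring code, and the function names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

reference = "\u0633\u0644\u0627\u0645"            # the word "salam", 4 characters
hallucinated = " ".join([reference] * 3)          # OCR repeats the word three times
print(cer(reference, reference))                  # 0.0
print(cer(reference, hallucinated))               # 2.5
```

Because the denominator is the reference length, an engine that emits far more text than was on the page needs more insertions than there are reference characters, which pushes the rate above 1.0.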

Destruction D is a calibrated weighted sum of seven complementary OCR-damage axes (structural, text-region, edge, connected-component, contrast, morphological, noise) aligned empirically against measured CER.
Humanized readability R_h is a sigmoid-calibrated blend of MS-SSIM, GMSD, CIEDE2000, text-preservation and Canny-edge overlap, tuned so that R_h ≥ 0.75 corresponds to text that all three native-Persian-speaking co-authors could read effortlessly. It is an author-validated automatic proxy, not a crowd study (a pre-registered study is deferred to follow-up work).
Effectiveness E = D · R_h is the primary ranking metric — it only rewards attacks that break OCR and remain legible.
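The product form of E can be illustrated with a small sketch. The per-sample (D, R_h) pairs below are hypothetical, and the sketch assumes E is computed per sample and then averaged, which this section does not state explicitly.

```python
from statistics import mean

def effectiveness(destruction, readability):
    """E = D * R_h for one sample; both inputs are expected in [0, 1]."""
    if not (0.0 <= destruction <= 1.0 and 0.0 <= readability <= 1.0):
        raise ValueError("D and R_h must lie in [0, 1]")
    return destruction * readability

# Hypothetical per-sample (D, R_h) pairs, for illustration only.
samples = [
    (1.0, 0.90),   # breaks OCR and stays legible
    (1.0, 0.20),   # breaks OCR but is unreadable to people
    (0.1, 0.95),   # legible but barely affects OCR
    (0.8, 0.80),   # balanced
]
scores = [effectiveness(d, r) for d, r in samples]
for (d, r), e in zip(samples, scores):
    print(f"D={d:.2f}  R_h={r:.2f}  E={e:.3f}")
print(f"mean E = {mean(scores):.3f}")
```

A product penalises either factor being small, so an attack cannot score well by maximising destruction alone or readability alone.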

Attack Taxonomy & Perceptual Mapping

The 27 benchmarked attacks fall into three mechanism families, each tied to a perceptual principle. CER/WER/SSIM/R_h values in each caption are aggregated over the full 7,920-sample grid for that attack.

1 · Structural Fragmentation — Gestalt closure

Break the thin ligatures/strokes; humans bridge the gaps, OCR segmentation collapses.

_{ConnectionBreaker — rank 1. SSIM 0.920 ± 0.034 · R_h 0.781 ± 0.068 · CER_Tess 0.882 · CER_Easy 0.658 · E 0.850 ± 0.078}	_{StrokeBreaking — rank 2. SSIM 0.891 ± 0.059 · R_h 0.725 ± 0.133 · CER_Tess 1.226 · CER_Easy 0.882 · E 0.820 ± 0.080}
_{Persian Dot Removal. Near-perfect visual similarity (SSIM 0.969 · R_h 0.900); removes semantically critical diacritics. E 0.477 ± 0.283.}	_{Puzzle / Jigsaw. Tile displacement: Tesseract collapses (CER 3.074 ± 2.632). A powerful destabiliser used inside composite attacks.}

2 · Geometric Warping — Gestalt continuity

Bend the baseline; readers follow the flow, OCR line-tracking fails.

_{Wavy — rank 3. Sinusoidal baseline shift. SSIM 0.902 · R_h 0.791 · E 0.767 ± 0.133}

_{Zigzag — rank 4. Triangle-wave warp. SSIM 0.753 · R_h 0.614 · E 0.727 ± 0.060}

_{Elastic — rank 5. Smooth random displacement field. SSIM 0.848 · R_h 0.720 · E 0.670 ± 0.052}
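As a rough illustration of a sinusoidal baseline shift, the sketch below moves each pixel column up or down by a sine of its x position. It works on a plain list-of-lists grayscale image and is only a toy stand-in: the function name and parameters are hypothetical, and the project's Wavy attack may differ in interpolation, parameterisation and boundary handling.

```python
import math

def wavy_shift(image, amplitude, wavelength, background=255):
    """Shift column x vertically by round(amplitude * sin(2*pi*x / wavelength))."""
    height, width = len(image), len(image[0])
    out = [[background] * width for _ in range(height)]
    for x in range(width):
        dy = round(amplitude * math.sin(2 * math.pi * x / wavelength))
        for y in range(height):
            src = y - dy                     # source row for this output pixel
            if 0 <= src < height:
                out[y][x] = image[src][x]
    return out

# A 9x8 white image with one black horizontal stroke on row 4.
height, width = 9, 8
image = [[255] * width for _ in range(height)]
for x in range(width):
    image[4][x] = 0

warped = wavy_shift(image, amplitude=2, wavelength=8)
rows = [next(y for y in range(height) if warped[y][x] == 0) for x in range(width)]
print(rows)  # [4, 5, 6, 5, 4, 3, 2, 3]
```

The stroke keeps all of its ink and stays connected column to column, but its baseline is no longer a straight line, which is the property a line-tracking stage relies on.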

3 · Photometric & Interference — figure-ground segregation

Inject clutter/noise/contrast; humans isolate text from ground, OCR binarization does not.

_{Interleaved Pattern. Periodic stripes between text rows. CER_Tess 0.957 · CER_Easy 0.950 · E 0.600 ± 0.081}	_{Gaussian Noise. Large but highly variable Tesseract CER: 1.883 ± 2.681. E 0.665 ± 0.066}	_{Salt & Pepper. Most consistent EasyOCR damage: CER 0.921 ± 0.111. E 0.660 ± 0.028}
_{Text Mirror. Faint flipped overlay. SSIM 0.808 · R_h 0.753 · E 0.642 ± 0.113}	_{CAPTCHA-style. Waves + crossing lines + dots. CER_Easy 2.170 (hallucination).}	_{Contrast Manipulation. Block-wise brightness/gamma. CER_Tess 0.666 · CER_Easy 0.717}

Benchmark Results (full grid — 213,840 samples)

Every value below is computed over 7,920 perturbed samples per attack (N = 213,840 total) and reported as mean ± s.d. These are the paper's authoritative numbers; the single-image CER/WER quoted alongside individual figures are illustrative only.

Top-15 attacks by effectiveness (E = D · R_h)

#	Attack	SSIM	MS-SSIM	Destruction	Humanized	Effectiveness	95% CI	Tess. CER	Easy. CER
1	ConnectionBreaker	0.920 ± 0.034	0.923 ± 0.038	0.993 ± 0.035	0.781 ± 0.068	0.850 ± 0.078	±0.002	0.882 ± 0.742	0.658 ± 0.347
2	StrokeBreaking	0.891 ± 0.059	0.889 ± 0.069	0.972 ± 0.094	0.725 ± 0.133	0.820 ± 0.080	±0.002	1.226 ± 1.300	0.882 ± 0.310
3	Wavy	0.902 ± 0.054	0.973 ± 0.025	0.900 ± 0.185	0.791 ± 0.092	0.767 ± 0.133	±0.003	0.440 ± 0.513	0.520 ± 0.329
4	Zigzag	0.753 ± 0.102	0.841 ± 0.088	0.992 ± 0.034	0.614 ± 0.138	0.727 ± 0.060	±0.001	0.612 ± 0.333	0.813 ± 0.190
5	Elastic	0.848 ± 0.073	0.901 ± 0.064	0.982 ± 0.053	0.720 ± 0.100	0.670 ± 0.052	±0.001	0.157 ± 0.316	0.346 ± 0.289
6	GaussianNoise	0.447 ± 0.107	0.480 ± 0.135	1.000 ± 0.000	0.445 ± 0.175	0.665 ± 0.066	±0.001	1.883 ± 2.681	0.406 ± 0.288
7	SaltPepper	0.682 ± 0.047	0.801 ± 0.037	1.000 ± 0.000	0.453 ± 0.048	0.660 ± 0.028	±0.001	2.117 ± 1.715	0.921 ± 0.111
8	MaxDestruction	0.942 ± 0.007	0.980 ± 0.005	0.868 ± 0.080	0.865 ± 0.011	0.660 ± 0.054	±0.001	0.111 ± 0.284	0.313 ± 0.279
9	Mirror	0.808 ± 0.034	0.825 ± 0.052	0.733 ± 0.143	0.753 ± 0.076	0.642 ± 0.113	±0.002	0.326 ± 0.344	0.598 ± 0.399
10	Interleaved	0.517 ± 0.137	0.521 ± 0.100	1.000 ± 0.000	0.271 ± 0.095	0.600 ± 0.081	±0.002	0.957 ± 0.077	0.950 ± 0.147
11	MicroRotation	0.773 ± 0.130	0.783 ± 0.160	0.942 ± 0.184	0.609 ± 0.194	0.590 ± 0.138	±0.003	0.151 ± 0.313	0.373 ± 0.288
12	CaptchaStyle	0.424 ± 0.131	0.587 ± 0.159	1.000 ± 0.001	0.330 ± 0.123	0.532 ± 0.080	±0.002	1.025 ± 0.656	2.170 ± 1.851
13	StructuredNoise	0.848 ± 0.090	0.903 ± 0.066	0.553 ± 0.263	0.872 ± 0.056	0.488 ± 0.208	±0.005	0.106 ± 0.278	0.323 ± 0.276
14	DotRemoval	0.969 ± 0.033	0.976 ± 0.023	0.542 ± 0.348	0.900 ± 0.060	0.477 ± 0.283	±0.006	0.266 ± 0.310	0.518 ± 0.273
15	TI-FGSM	0.865 ± 0.085	0.974 ± 0.020	0.504 ± 0.254	0.917 ± 0.025	0.450 ± 0.206	±0.005	0.106 ± 0.278	0.307 ± 0.275
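The 95% CI column can be sanity-checked from the standard deviations and the per-attack sample size. The sketch assumes the usual normal approximation for the mean (half-width = 1.96 · s.d. / √n); the authors' exact procedure is not stated here, so treat this as a consistency check rather than a description of their method.

```python
import math

def ci95_half_width(std_dev, n, z=1.96):
    """Half-width of a two-sided 95% CI for a mean (normal approximation)."""
    return z * std_dev / math.sqrt(n)

N_PER_ATTACK = 7920
# Effectiveness standard deviations taken from the table above.
for name, sd in [("ConnectionBreaker", 0.078), ("Wavy", 0.133), ("DotRemoval", 0.283)]:
    print(f"{name}: ±{ci95_half_width(sd, N_PER_ATTACK):.3f}")
# ConnectionBreaker: ±0.002
# Wavy: ±0.003
# DotRemoval: ±0.006
```

The intervals are narrow because n is large, not because individual samples agree closely: the s.d. column describes per-sample spread, while the CI describes uncertainty in the mean.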

_{Mean effectiveness ± s.d. per attack.}

_{Attack ranking with 95% CI markers — the top-3 structural cluster is significant.}

Engine-level CER / WER (Tesseract & EasyOCR)

Attack	Tess. CER	Tess. WER	Easy. CER	Easy. WER
ConnectionBreaker	0.882 ± 0.742	1.736 ± 1.555	0.658 ± 0.347	1.073 ± 0.512
StrokeBreaking	1.226 ± 1.300	2.641 ± 2.794	0.882 ± 0.310	1.435 ± 0.532
Wavy	0.440 ± 0.513	0.666 ± 0.830	0.520 ± 0.329	0.753 ± 0.354
Zigzag	0.612 ± 0.333	0.895 ± 0.361	0.813 ± 0.190	1.124 ± 0.320
Elastic	0.157 ± 0.316	0.258 ± 0.406	0.346 ± 0.289	0.549 ± 0.319
GaussianNoise	1.883 ± 2.681	3.659 ± 5.240	0.406 ± 0.288	0.628 ± 0.310
SaltPepper	2.117 ± 1.715	4.306 ± 3.997	0.921 ± 0.111	1.029 ± 0.118
Interleaved	0.957 ± 0.077	1.002 ± 0.047	0.950 ± 0.147	0.990 ± 0.065
CaptchaStyle	1.025 ± 0.656	1.592 ± 1.527	2.170 ± 1.851	2.249 ± 1.494
Contrast	0.666 ± 0.314	0.969 ± 0.369	0.717 ± 0.299	0.882 ± 0.249
Mirror	0.326 ± 0.344	0.616 ± 0.542	0.598 ± 0.399	0.950 ± 0.485
Puzzle	3.074 ± 2.632	5.457 ± 5.103	0.782 ± 0.217	0.984 ± 0.143
DotRemoval	0.266 ± 0.310	0.505 ± 0.378	0.518 ± 0.273	0.798 ± 0.270
PGD	0.107 ± 0.280	0.156 ± 0.360	0.312 ± 0.277	0.512 ± 0.307
MicroRotation	0.151 ± 0.313	0.236 ± 0.411	0.373 ± 0.288	0.574 ± 0.319

_{Empirical CDF of Tesseract CER per attack family.}

_{Joint Tesseract↔EasyOCR CER. Structural attacks break both engines; gradient attacks (PGD/CW/MI-FGSM) hug the EasyOCR axis — the signature of surrogate mismatch.}

Intensity sweep (θ ∈ {0.25, 0.50, 0.75, 1.00})

Effectiveness saturates globally around θ ≈ 0.75; structural attacks plateau already at θ = 0.5.

Intensity θ	n	SSIM	Destruction	Effectiveness	Tess. CER	Easy. CER
0.25	53,460	0.868 ± 0.207	0.566 ± 0.382	0.426 ± 0.272	0.399 ± 1.008	0.466 ± 0.386
0.50	53,460	0.824 ± 0.223	0.642 ± 0.357	0.470 ± 0.242	0.496 ± 1.025	0.537 ± 0.557
0.75	53,460	0.788 ± 0.228	0.701 ± 0.328	0.502 ± 0.219	0.672 ± 1.288	0.586 ± 0.677
1.00	53,460	0.756 ± 0.231	0.758 ± 0.315	0.530 ± 0.216	0.703 ± 1.254	0.607 ± 0.684

_{Global mean effectiveness vs θ (95% CI band).}

_{Per-attack effectiveness across the four intensity levels.}

Generalisation & significance

Attack rankings are remarkably stable across fonts, sizes and texts (cross-font effectiveness range ≤ 0.0037; cross-text range ≤ 0.008), and the top-15 vs bottom-12 attacks separate at p < 0.001 after Bonferroni correction.
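For readers unfamiliar with the effect-size machinery, here is a minimal sketch of Cohen's d with a percentile bootstrap over 1,000 resamples. The data are synthetic draws with hypothetical parameters, and the pooled-variance form of d and the percentile interval are assumptions; the released analysis code may use different variants.

```python
import random
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def bootstrap_d(a, b, resamples=1000, seed=0):
    """Percentile 95% interval for d, resampling each group with replacement."""
    rng = random.Random(seed)
    ds = sorted(
        cohens_d(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(resamples)
    )
    return ds[resamples * 25 // 1000], ds[resamples * 975 // 1000 - 1]

# Synthetic effectiveness scores for two hypothetical attacks.
rng = random.Random(1)
attack_a = [rng.gauss(0.80, 0.08) for _ in range(200)]
attack_b = [rng.gauss(0.70, 0.08) for _ in range(200)]

d = cohens_d(attack_a, attack_b)
low, high = bootstrap_d(attack_a, attack_b)
print(f"d = {d:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

Repeating this for every pair of attacks fills a matrix like the Cohen's-d figure; an interval that excludes zero indicates a separation that is unlikely to be a sampling artefact.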

_{Per-(attack, font) mean effectiveness — almost no font-specific striping.}	_{Per-(text, attack) effectiveness across the 20 Persian sentences.}
_{Effectiveness distribution per attack (note the tight top-3 violins).}	_{Pairwise Cohen's-d on effectiveness (dark = large effect size).}

_{Single-figure dashboard summarising the entire 213,840-sample benchmark: top-10 ranking, stealth frontier, per-attack effectiveness with 95% CI, readability–destruction scatter, effectiveness histogram, destruction-vs-intensity, joint OCR CER scatter, and the humanized-readability histogram.}

📚 Complete Chart Atlas — all 67 benchmark figures

Every figure below is regenerated from the released per-sample CSV and is included in this repository under assets/benchmark_results/aura_results/charts/. They follow the same family taxonomy (A–I) used in §IX of the paper. The charts already embedded in context above are repeated here so this atlas is self-contained.

Headline per-attack metrics. Mean of each headline metric per attack (with ± s.d. error bars), the ranking and radar comparisons, and the normalised (attack × metric) heatmap.

_{Effectiveness ± s.d.}	_Destruction	_{Humanized readability}
_SSIM	_PSNR	_{Ranking with 95% CI}
_{Top-attack radar (all metrics)}	_{(Attack × metric) heatmap}

Perceptual trade-off & cross-metric correlations. The destruction–readability "stealth frontier", SSIM vs humanized readability, and the correlation matrices that justify the composite metrics (SSIM↔R_h r = 0.941).

_{Stealth (destruction–readability) frontier}	_{Readability vs destruction (per sample)}	_{SSIM vs humanized readability}
_{Humanized-readability box plots}	_{Perceptual-metric correlations}	_{Full 34-metric correlation matrix}
_{Metric distributions (box)}	_{Metric histograms}

Intensity sweep (θ ∈ {0.25, 0.50, 0.75, 1.00}). Global line plots and per-attack heatmaps for every headline metric, plus the per-intensity trade-off and variance/CI summaries.

_{Effectiveness vs θ}	_{Destruction vs θ}	_{Readability vs θ}
_{SSIM vs θ}	_{PSNR vs θ}	_{Per-attack effectiveness × θ}
_{Destruction × θ}	_{Readability × θ}	_{SSIM × θ}
_{PSNR × θ}	_{Trade-off coloured by θ}	_{Mean / s.d. / 95% CI}
_{Std-deviation analysis}

Family A — Distribution & stability. Best-vs-worst attacks per metric, the mean-vs-std Pareto frontier, the signal-to-noise ranking, and per-attack stability surfaces.

_{Best vs worst — effectiveness}	_{Best vs worst — destruction}	_{Best vs worst — readability}
_{Best vs worst — humanized}	_{Mean-vs-std Pareto frontier}	_{Signal-to-noise ranking}
_{Stability — effectiveness}	_{Stability — destruction}	_{Stability — readability}
_{Stability — humanized}

Family B — Per-font generalisation (11 free Persian fonts). Mean/std heatmaps, best attack per font, and per-font box plots of the top attacks.

_{(Attack × font) mean effectiveness}	_{(Attack × font) std}	_{Best attack per font}
_{Per-font box — top attacks}

Family C — Per-size generalisation (9 font sizes). Metric-vs-size curves and (attack × size) / (size × intensity) heatmaps.

_{Headline metrics vs size}	_{(Attack × size) effectiveness}	_{(Size × intensity) heatmap}
_{Per-size effectiveness violins}

Family E — Distributional behaviour. Empirical CDFs and violin plots of each headline metric across the full 213,840-sample grid.

_{ECDF — effectiveness}	_{ECDF — destruction}	_{ECDF — humanized}	_{ECDF — SSIM}
_{Violin — effectiveness}	_{Violin — destruction}	_{Violin — humanized}	_{Violin — SSIM}

Family F — Statistical significance. CI-based ranking, the pairwise Cohen's-d effect-size matrix, and the bootstrapped p-value matrix (top-15 vs bottom-12 separate at p < 0.001 after Bonferroni).

_{CI-based effectiveness ranking}

_{Pairwise Cohen's-d}

_{Bootstrapped p-value matrix}

Family G — OCR-level CER behaviour. Per-engine CER CDFs, the joint Tesseract↔EasyOCR scatter/density, and per-attack ΔCER.

_{Tesseract CER CDF}	_{Tesseract vs EasyOCR CER}	_{Joint CER density}
_{ΔCER per attack}	_{ΔCER histogram}

Family H — Per-text behaviour (20 Persian reference sentences). Per-text difficulty, the (text × attack) heatmap, and the best attack per text.

_{Per-text difficulty curve}

_{(Text × attack) effectiveness}

_{Best attack per text}

Family I — Master dashboard. A single self-contained figure compositing the headline findings (also shown above).

What does not work (the failure manifold)

A key negative result: classical gradient-based attacks and environmental simulations are the weakest in the benchmark. Pixel-precise noise (CW, MI-FGSM, PGD) is discarded by Tesseract's non-differentiable adaptive thresholding ("surrogate mismatch"), and low/high-frequency environmental textures are absorbed by binarization and LSTM temporal smoothing.

Attack modality	Tess. CER	Tess. WER	Easy. CER	Effectiveness
Micro Rotation	0.151 ± 0.313	0.236 ± 0.411	0.373 ± 0.288	0.590 ± 0.138
Structured Noise	0.106 ± 0.278	0.157 ± 0.360	0.323 ± 0.276	0.488 ± 0.208
TI-FGSM	0.106 ± 0.278	0.155 ± 0.355	0.307 ± 0.275	0.450 ± 0.206
Shadow Simulation	0.274 ± 0.391	0.388 ± 0.481	0.328 ± 0.278	0.426 ± 0.220
Moire Pattern	0.105 ± 0.277	0.155 ± 0.360	0.320 ± 0.281	0.359 ± 0.158
PGD (iterative)	0.107 ± 0.280	0.156 ± 0.360	0.312 ± 0.277	0.338 ± 0.120
Perlin Noise	0.109 ± 0.275	0.169 ± 0.364	0.347 ± 0.283	0.330 ± 0.144
Scene Text	0.105 ± 0.278	0.153 ± 0.359	0.324 ± 0.281	0.195 ± 0.065
MI-FGSM	0.108 ± 0.282	0.158 ± 0.366	0.321 ± 0.283	0.112 ± 0.055
CW (L₂)	0.106 ± 0.279	0.153 ± 0.361	0.323 ± 0.280	0.071 ± 0.052 (worst)
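The "healing" effect described above can be illustrated with a toy binarization sketch. It uses a fixed global threshold as a simplified stand-in for Tesseract's adaptive thresholding, and uniform random noise as a stand-in for a small bounded pixel perturbation; both are assumptions made to keep the example short.

```python
import random

def binarize(image, threshold=128):
    """Map every pixel to 0 (ink) or 255 (paper)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]

def add_bounded_noise(image, epsilon, seed=0):
    """Add integer noise in [-epsilon, epsilon] to each pixel, clipped to 0..255."""
    rng = random.Random(seed)
    return [[min(255, max(0, px + rng.randint(-epsilon, epsilon))) for px in row]
            for row in image]

# A 5x7 white image with a black stroke on row 2.
clean = [[255] * 7 for _ in range(5)]
for x in range(7):
    clean[2][x] = 0

noisy = add_bounded_noise(clean, epsilon=8)   # small pixel-level perturbation
broken = [row[:] for row in clean]
broken[2][3] = broken[2][4] = 255             # structural gap in the stroke

print(binarize(noisy) == binarize(clean))     # True: the noise is thresholded away
print(binarize(broken) == binarize(clean))    # False: the gap survives binarization
```

Any perturbation that never moves a pixel across the threshold is invisible after binarization, whereas a structural edit changes the binary image that the later recognition stages actually see.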

_{CW L₂ — E 0.071}

_{MI-FGSM — E 0.112}

_{PGD — E 0.338}

_{MicroRotation — E 0.590}

Composite Attacks

Real-world degradation rarely comes from a single source. A three-step selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search over S = (1−I)·D̄ᵢ·R̄ⱼ) picks composite pairs whose components hit different OCR stages and therefore stack non-linearly. Puzzle+Interleaved and GaussianNoise+Interleaved drive both engines to CER = WER = 1.0, while Puzzle+Wavy keeps readability high (SSIM 0.938).
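The top-k synergy search can be sketched as follows. The sketch assumes that I is a mutual-information score normalised to [0, 1] for each pair of attacks, that D̄ and R̄ are per-attack mean destruction and readability, and that pairs are ordered because S uses D̄ of the first attack and R̄ of the second. Attack names and all numbers are hypothetical.

```python
from itertools import permutations

def synergy(i, j, destruction, readability, mutual_info):
    """S = (1 - I_ij) * mean D of attack i * mean R of attack j."""
    info = mutual_info[frozenset((i, j))]
    return (1.0 - info) * destruction[i] * readability[j]

def top_k_pairs(destruction, readability, mutual_info, k=3):
    """Score every ordered pair of distinct attacks and keep the k best."""
    scored = [
        (synergy(i, j, destruction, readability, mutual_info), i, j)
        for i, j in permutations(destruction, 2)
    ]
    return sorted(scored, reverse=True)[:k]

# Hypothetical per-attack means and pairwise normalised mutual information.
destruction = {"attack_a": 0.9, "attack_b": 0.6, "attack_c": 0.8}
readability = {"attack_a": 0.5, "attack_b": 0.9, "attack_c": 0.7}
mutual_info = {
    frozenset(("attack_a", "attack_b")): 0.1,   # hit different pipeline stages
    frozenset(("attack_a", "attack_c")): 0.8,   # largely redundant
    frozenset(("attack_b", "attack_c")): 0.3,
}

best = top_k_pairs(destruction, readability, mutual_info, k=2)
for score, i, j in best:
    print(f"{i} + {j}: S = {score:.3f}")
```

The (1 − I) factor down-weights pairs whose effects overlap, so the search favours combinations that damage different OCR stages rather than stacking two similar perturbations.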

_{Composite comparison grid on one Persian sentence at θ = 0.7.}

_{Puzzle + Wavy. Readability preserved (SSIM 0.938, R_h 0.875).}

_{Puzzle + Interleaved. Both engines collapse to CER = WER = 1.0.}

_{Gaussian + Interleaved. Most destructive interaction (SSIM ≈ 0.13).}

Future Work

This breadth paper is phase one. The follow-up will: (i) distill a single architecture-agnostic attack via Bayesian optimisation over the destruction-score weights with R_h ≥ 0.85 as a hard constraint; (ii) replace the automatic readability proxy with a pre-registered crowd-sourced human study; (iii) extend the grid to VLMs (GPT-4V, Claude, Gemini) and publish a transferability matrix; and (iv) evaluate defences (input randomisation, ensemble OCR, learned denoisers).

Acknowledgment of AI Assistance

Per current best practices, the authors disclose that general-purpose AI coding assistants were used only as a search accelerator to scaffold quick prototypes across ~500–600 candidate perturbations during early exploration (27 survived screening). All retained attacks were rewritten, tested, audited, parameter-tuned and benchmarked by the authors; the AI was not a source of scientific claims, evaluation results, derivations, or written analysis. Every number is publicly auditable in this repository.

How to Cite

@article{gharibi2026illusions,
  title   = {Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns},
  author  = {Gharibi, Dwin and Haji Alirezaei, Mohammad Amin and Yaeghoobi, Kaebeh},
  year    = {2026},
  url     = {https://github.com/dwin-gharibi/aura-repo}
}

📋 Table of Contents

📄 Research Paper: Illusions for Machines
Overview
Key Features
Installation
Quick Start
Attack Framework
Evaluation & Metrics
OCR Engines
Persian Text Support
- Multi-Font Dataset
- Text Generation
API Reference
Advanced Features
Project Structure
Docker & Infrastructure
Citation & References
Contributing
License
Acknowledgments

🎯 Overview

AURA is a research-grade framework designed to evaluate the robustness of Optical Character Recognition (OCR) systems against adversarial attacks. The framework specifically supports Persian/Arabic (RTL) text and implements state-of-the-art adversarial techniques from recent academic literature.

AURA addresses a critical gap in OCR security research: no general-purpose OCR attack framework exists that comprehensively evaluates multiple attack mechanisms across different OCR engines while maintaining human readability. This claim is supported by 9+ peer-reviewed references documented in our citation manager.

Why AURA?

Research-Grade: Implements attacks from 11+ academic papers
Production-Ready: 80+ attack implementations with pre-optimized configurations
Multi-Engine: Evaluates against Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
Persian/RTL Support: Proper character joining and RTL text rendering using libraqm
Publication-Ready: Generates LaTeX tables, visualization charts, and detailed JSON reports

✨ Key Features

80+ Attack Implementations: From simple noise injection to sophisticated research-based attacks
8 Mechanism Categories: Attacks organized by underlying perceptual mechanism
9 Perceptual Principles: Mapped from vision science to explain human vs. machine differences
20 Persian-Specific Attacks: Designed for Persian/Arabic script characteristics
7 ArXiv Paper-Enhanced Attacks: State-of-the-art implementations (MI-FGSM, DI-FGSM, TI-FGSM, etc.)
Multi-Engine Evaluation: Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
Comprehensive Metrics: SSIM, PSNR, CER, WER, and custom effectiveness scores
Statistical Rigor: Variance, confidence intervals, statistical tests (t-test, ANOVA, Mann-Whitney U)
20 Persian Fonts: Multi-font, multi-size, multi-color, multi-style dataset generation
5 Intensity Levels: Systematic analysis from 20% to 100% intensity
120+ Charts: Publication-ready visualizations (attack grids, heatmaps, scatter plots, etc.)
Systematic Composite Optimization: 5 combination strategies for finding optimal attack combinations
Enhanced Readability Metrics: Multi-scale SSIM with better human correlation (r=0.87)

📦 Installation

Prerequisites

# Python 3.8+ required
python --version  # Python 3.8 or higher

# System dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libraqm0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    libgl1 \
    tesseract-ocr \
    tesseract-ocr-fas \
    tesseract-ocr-ara \
    fonts-dejavu \
    fonts-freefont-ttf

# macOS
brew install tesseract tesseract-lang libraqm

Python Dependencies

# Clone the repository
git clone https://github.com/dwin-gharibi/aura-repo.git
cd aura-repo

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Persian text support
pip install arabic-reshaper python-bidi pillow

# OCR evaluation
pip install pytesseract easyocr

# LLM-based OCR (optional)
pip install openai anthropic

Docker Installation

# Build Docker image
docker build -t aura:latest .

# Run container
docker run -it --rm \
    -v "$(pwd)":/workspace \
    -v "$(pwd)/results":/workspace/results \
    aura:latest

# Or use Docker Compose (see docker-compose.yml)
docker-compose up -d

🚀 Quick Start

1. Generate Adversarial Images

from aura.utils import PersianTextGenerator
from aura.attacks import PersianOptimizedAttack

# Generate Persian text image
generator = PersianTextGenerator()
image, text = generator.generate_image("سلام، این یک متن آزمایشی است.")

# Apply optimized attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get perturbed image
perturbed = result.perturbed_image
print(f"SSIM: {result.ssim:.3f}")
print(f"L2 norm: {result.l2_norm:.3f}")

2. Run Basic Benchmark

# Quick benchmark (5 samples, 12 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 5 --full-benchmark

# Full benchmark (1000 samples, all 67 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark

# Specific attacks only
python scripts/run_comprehensive_benchmark.py --num-samples 100 --attacks "ElasticDeformation,WavyDistortion,DotRemoval"

# Resumable, continuously-synced paper run (S3 / MinIO / R2 / ...)
python scripts/run_comprehensive_benchmark.py \
    --full-benchmark --num-samples 100 \
    --resume \
    --s3-bucket aura-benchmarks --s3-prefix runs/$(date -u +%F) --s3-bundle

Running a long benchmark? See docs/BENCHMARK_UPDATES.md for the full reference on the streaming checkpoint, --resume, S3 sync, the new perceptual metrics (MS-SSIM, GMSD, CIEDE2000, humanised readability, text preservation), and the paper-critical bug fixes applied to the comprehensive runner.

3. Evaluate OCR Engines

import os

from aura.evaluation import TesseractEngine, EasyOCREngine, LLMOCREngine

# Tesseract OCR (recommended for Persian)
tesseract = TesseractEngine(lang="fas", config="--psm 6")
result = tesseract.recognize(image)
print(f"Tesseract: {result.text}")

# EasyOCR (named easyocr_engine so it does not shadow the easyocr package)
easyocr_engine = EasyOCREngine(lang="fas", gpu=False)
result = easyocr_engine.recognize(image)
print(f"EasyOCR: {result.text}")

# LLM-based OCR through the OpenAI API (requires an API key).
# Reading the key from the environment keeps it out of source control.
llm_ocr = LLMOCREngine(provider="openai", model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
result = llm_ocr.recognize(image)
print(f"GPT-4o: {result.text}")

⚔️ Attack Framework

AURA implements 80+ adversarial attacks organized into 8 mechanism categories based on their perceptual impact on human vision systems.

Attack Visualizations

Attack Examples

Attack	Visualization	Category
Aggressive		Composite
ConnectionBreaker		Structural
DotRemoval		Structural
WavyDistortion		Geometric
StrokeBreaking		Structural
Zigzag		Creative
GaussianNoise		Noise

Attack Categories

Quick Reference: Top Attacks by Effectiveness

Attack	Target	Est. Destruction	Readability	Effectiveness
`StrokeBreaking`	Structural integrity	1.00	0.94	0.94
`ConnectionBreaker`	Cursive flow	1.00	0.91	0.91
`PersianUltimateV2`	All Persian features	1.00	0.91	0.91
`LetterFormConfusion`	Initial/medial/final forms	1.00	0.90	0.90
`UltimateDestruction`	Multi-layer	1.00	0.87	0.87
`ElasticDeformation`	Geometric distortion	0.95+	0.80	0.76
`MaxDestruction`	Maximum damage	0.87	0.79	0.67
`AggressiveAttack`	Multi-layer attack	0.87	0.86	0.75
`WavyDistortion`	Text flow disruption	0.85	0.88	0.75
`HighReadabilityAttack`	Stealth mode	Low	0.95+	Variable

Note: Destruction scores are proxy-based estimates using image metrics (binarization changes, edge disruption, etc.). Actual OCR destruction varies by engine and configuration.

1. Mechanism-Based Grouping (8 Categories)

Instead of describing 80+ individual attacks, we group them by their underlying perceptual mechanism:

1.1 Geometric Warping (8 attacks)

Perceptual Principle: Topological Invariance Perception - Humans are sensitive to topological properties (connectedness, holes) but invariant to continuous deformations.

Attack	Description	Target
`ElasticDeformation`	Smooth non-linear warping	Character shape
`WavyDistortion`	Sinusoidal baseline shift	Text flow
`Zigzag`	Triangle-wave baseline	Line structure
`MicroRotation`	Small local rotations	Character orientation
`PerspectiveTilt`	Perspective transformation	Global geometry
`Swirl`	Swirl distortion	Local structure
`Barrel`	Barrel/pincushion distortion	Global warp
`SpatialTransform`	Flow-based transformation	Spatial coherence

1.2 Photometric Distortion (5 attacks)

Perceptual Principle: Contrast Constancy - Human visual system compensates for global contrast changes through gain control mechanisms.

Attack	Description	Target
`ContrastManipulation`	Local contrast shifts	Edge detection
`Shadow`	Gradient shadows	Binarization
`DCTAttack`	DCT coefficient modification	Frequency domain
`Texture`	Texture overlay	Segmentation
`Illumination`	Lighting changes	Thresholding

1.3 Structural Fragmentation (8 attacks)

Perceptual Principle: Gestalt Closure - Humans mentally complete fragmented shapes based on prior knowledge and context.

Attack	Description	Target
`DotRemoval`	Removes/blurs letter dots	Persian letter identity (ب/ت/ث)
`DotDisplacement`	Moves dots to alter meaning	Letter confusion
`ConnectionBreaker`	Breaks letter connections	Cursive flow
`StrokeBreaking`	Breaks character strokes	Structural integrity
`BinarizationConfusion`	Disrupts Otsu thresholding	Segmentation
`DiacriticAttack`	Injects fake diacritical marks	Letter disambiguation
`TeethDestruction`	Targets س، ش، ص patterns	Letter patterns
`LetterConfusion`	Adds phantom dots, smears	Letter shape
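Why dots carry so much information in Persian can be shown at the text level. The sketch below maps four letters that share one base shape onto the Unicode dotless form; it is an analogy only, since the attacks in this table operate on pixels rather than on characters, and the mapping is deliberately limited to the ب/پ/ت/ث group.

```python
# Letters that share the same base shape and differ only by their dots.
DOTLESS_BEH = "\u066E"  # ARABIC LETTER DOTLESS BEH
SKELETON = {
    "\u0628": DOTLESS_BEH,  # beh
    "\u067E": DOTLESS_BEH,  # peh
    "\u062A": DOTLESS_BEH,  # teh
    "\u062B": DOTLESS_BEH,  # theh
}

def strip_dots(text):
    """Replace each dotted letter in the group by its dotless skeleton."""
    return "".join(SKELETON.get(ch, ch) for ch in text)

fever = "\u062A\u0628"  # teh + beh, the Persian word for "fever"
idol = "\u0628\u062A"   # beh + teh, the Persian word for "idol"

print(fever == idol)                          # False: different words
print(strip_dots(fever) == strip_dots(idol))  # True: identical once dots are gone
```

A human reader can usually restore the missing dots from sentence context, while a recogniser that classifies each glyph from its shape has nothing left to distinguish the two words.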

1.4 Noise Injection (9 attacks)

Perceptual Principle: Lateral Inhibition - Human retina enhances edges through lateral inhibition, making noise less disruptive than for machines.

Attack	Description	Target
`GaussianNoise`	Random pixel noise	Overall quality
`SaltPepperNoise`	Impulse noise	Edge detection
`PerlinNoise`	Natural-looking noise	Texture
`StructuredNoise`	Pattern-based noise	Segmentation
`PatternWatermark`	Watermark patterns	OCR confidence
`TextWatermark`	Faint text overlay	Character recognition
`GridWatermark`	Grid pattern	Line detection
`RandomErasing`	Random pixel deletion	Feature extraction
`Cutout`	Block removal	Attention mechanisms
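As a simple instance of the noise family, the sketch below implements impulse (salt-and-pepper) noise on a list-of-lists grayscale image. The function name and the `density` parameter are hypothetical; how the framework maps its intensity setting onto noise density is not specified here.

```python
import random

def salt_pepper(image, density, seed=0):
    """Set a random fraction of pixels to pure black (0) or pure white (255)."""
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must lie in [0, 1]")
    rng = random.Random(seed)
    out = [row[:] for row in image]          # leave the input untouched
    for row in out:
        for x in range(len(row)):
            if rng.random() < density:
                row[x] = rng.choice((0, 255))
    return out

gray = [[128] * 16 for _ in range(16)]       # flat mid-grey test image
noisy = salt_pepper(gray, density=0.1)
changed = sum(px != 128 for row in noisy for px in row)
print(f"{changed} of {16 * 16} pixels replaced")
```

Unlike low-amplitude Gaussian noise, each impulse is a full-contrast pixel, so it can survive thresholding and show up as spurious specks or holes in the binarised strokes.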

1.5 Frequency Domain (2 attacks)

Perceptual Principle: Contrast Sensitivity Function (CSF) - Humans are less sensitive to high-frequency changes, especially in textured regions.

Attack	Description	Target
`FrequencyDomainAttack`	DCT/FFT-based perturbations	Frequency components
`FourierAttack`	Fourier domain modifications	Global structure

1.6 Adversarial Gradient (6 attacks)

Perceptual Principle: Weber-Fechner Law - Human perception is logarithmic, while machine gradients exploit linear sensitivity.

Attack	Description	Paper
`PGD`	Projected Gradient Descent	Madry et al., 2017
`CW`	Carlini-Wagner Attack	Carlini & Wagner, 2017
`MI-FGSM`	Momentum Iterative FGSM	Dong et al., 2017
`DI-FGSM`	Diverse Input FGSM	Xie et al., 2019
`TI-FGSM`	Translation Invariant FGSM	Dong et al., 2019
`VLMTargeted`	Vision Language Model attack	arXiv:2208.14302
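For orientation, the core PGD update under an L∞ budget can be sketched on a toy problem. The loss here is a linear function with a known gradient, standing in for a differentiable surrogate model; the random start used in the original PGD formulation is omitted, and all names and values are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_linf(x0, grad_fn, epsilon, step, iterations):
    """Iterated signed-gradient ascent, projected onto the L-inf ball around x0."""
    x = list(x0)
    for _ in range(iterations):
        grad = grad_fn(x)
        x = [xi + step * sign(gi) for xi, gi in zip(x, grad)]
        # Projection: stay within epsilon of x0 and inside the pixel range [0, 1].
        x = [min(max(xi, max(x0i - epsilon, 0.0)), min(x0i + epsilon, 1.0))
             for xi, x0i in zip(x, x0)]
    return x

# Toy surrogate: loss(x) = sum(w_i * x_i), so the gradient is simply w.
weights = [1.0, -2.0, 0.5]
x0 = [0.5, 0.5, 0.95]
adv = pgd_linf(x0, lambda x: weights, epsilon=0.1, step=0.04, iterations=5)
print([round(v, 3) for v in adv])  # [0.6, 0.4, 1.0]
```

Every coordinate ends at most ε away from its starting value, which is why such perturbations stay visually subtle and, as the benchmark above reports, are easily removed by a binarization step.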

1.7 Typographic Manipulation (13 attacks)

Perceptual Principle: Top-Down Processing - Humans use context and language knowledge to resolve ambiguous characters.

Attack	Description	Target
`FontPerturbation`	Font variation	Font recognition
`KashidaInjection`	Word stretching	Word spacing
`LetterFormConfusion`	Initial/medial/final confusion	Letter forms
`BaselineWobble`	Vertical wobble	Baseline detection
`PerCharRotation`	Per-character rotation	Character alignment
`Puzzle`	Tile displacement	Spatial coherence
`CaptchaStyle`	CAPTCHA-like distortion	Overall readability
`Interleaved`	Pattern between lines	Line separation
`TextMirror`	Faint duplicate text	Character confusion
`LetterSpacing`	Spacing variation	Word segmentation
`LineSpacing`	Line height changes	Paragraph structure
`FontWeight`	Weight perturbation	Stroke width
`FontStyle`	Style changes (italic, bold)	Font robustness

1.8 Composite/Multi-Layer (5 attacks)

Perceptual Principle: Feature Integration Theory - Humans integrate features across multiple scales and dimensions more robustly than machines.

Attack	Description	Components
`AggressiveAttack`	Multi-layer attack	5+ layers
`MaxDestructionAttack`	Maximum damage	Optimized combo
`UltimateOCRKiller`	Combined frequency + transformer	Multi-domain
`OCRKillerAttack`	5-layer progressive	Progressive
`StealthAttack`	Minimal perturbation	Subtle combo

2. Persian-Specific Attacks (20 Attacks)

AURA includes 20 attacks specifically designed for Persian/Arabic script characteristics:

Attack	Target	Description
`DotRemoval`	Letter dots	Removes/blurs dots crucial for letter identity (ب/ت/ث/پ)
`DotDisplacement`	Dot position	Moves dots to alter letter meaning
`ConnectionBreaker`	Cursive flow	Breaks letter connections in cursive script
`LetterConfusion`	Letter shapes	Adds phantom dots, smears letters
`DiacriticAttack`	Diacritics	Injects fake diacritical marks
`TeethDestruction`	Teeth patterns	Targets س، ش، ص patterns
`KashidaInjection`	Word spacing	Stretches words with kashida (ـ)
`BaselineWobble`	Text baseline	Creates vertical wobble
`LetterFormConfusion`	Forms	Confuses initial/medial/final forms
`PersianUltimateV2`	All features	Maximum Persian destruction
`PersianOptimizedAttack`	Persian/RTL text	Optimized for Persian
`PersianScriptOptimizedAttack`	RTL & connections	Connection-aware
`HamzaAttack`	Hamza position	Moves/removes hamza (ء)
`YehBarreh`	Yeh/Barreh confusion	Persian ی vs Arabic ي
`KafGaf`	Kaf/Gaf distinction	Persian گ vs Arabic ك
`HehGoal`	Heh/Goal forms	Persian ه final form
`AlefMadda`	Alef with madda	آ variant
`LamAlefLigature`	Lam-Alef ligature	لا ligature breaking
`NumeralConfusion`	Persian/Arabic numerals	۰۱۲۳ vs ٠١٢٣
`ZeroWidthJoiner`	ZWJ manipulation	Character joining
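
Several of the rows above rely on Persian and Arabic characters that look nearly identical but have different code points. The sketch below shows that gap at the text level using only the Unicode values; it does not reproduce AURA's attacks, and the helper names are assumptions for illustration.

```python
# Persian (Extended Arabic-Indic) digits are U+06F0..U+06F9;
# Arabic-Indic digits are U+0660..U+0669.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

# Farsi yeh (U+06CC) and keheh (U+06A9) vs. Arabic yeh (U+064A) and kaf (U+0643).
PERSIAN_LETTERS = "\u06cc\u06a9"
ARABIC_LETTERS = "\u064a\u0643"

PERSIAN_TO_ARABIC = str.maketrans(
    PERSIAN_DIGITS + PERSIAN_LETTERS,
    ARABIC_INDIC_DIGITS + ARABIC_LETTERS,
)

def to_arabic_lookalikes(text: str) -> str:
    """Swap Persian digits/letters for their Arabic look-alikes (hypothetical helper)."""
    return text.translate(PERSIAN_TO_ARABIC)

def code_points(text: str) -> list[str]:
    """List each character's code point, e.g. 'U+06F1'."""
    return [f"U+{ord(ch):04X}" for ch in text]

original = "\u06f1\u06f2\u06f3"            # Persian digits 1, 2, 3
swapped = to_arabic_lookalikes(original)
print(code_points(original), code_points(swapped))
```

A string comparison or CER computation treats the swapped text as entirely different even though a reader sees almost the same glyphs, so ground-truth text and OCR output should use the same convention before scoring.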

3. ArXiv Paper-Enhanced Attacks

State-of-the-art implementations from recent academic papers:

Attack	Paper	Year	Description
`MomentumIterativeFGSM`	Dong et al., "Boosting Adversarial Attacks with Momentum"	2017	MI-FGSM with momentum-based gradient accumulation
`DiverseInputFGSM`	Xie et al., "Improving Transferability of Adversarial Examples with Input Diversity"	2019	DI-FGSM with input diversity for transferability
`TranslationInvariantFGSM`	Dong et al., "Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks"	2019	TI-FGSM with kernel convolution
`VLMTargetedAttack`	arXiv:2208.14302 "On Evaluating Adversarial Robustness of VLMs"	2022	Targets Vision Language Models (GPT-4V, Claude, Gemini)
`LLMOCRTargetedAttack`	arXiv:2306.07033 "Robust Visual Reasoning Attacks"	2023	Targets LLM-based OCR specifically
`PersianScriptOptimizedAttack`	Custom implementation	-	RTL and connection-aware for Persian
`MultiModalEnsembleAttack`	Custom implementation	-	Ensemble for CNN, Transformer, and VLM

Perceptual Mechanism Mapping (9 Principles)

AURA maps each attack category to specific perceptual principles from vision science, explaining why humans can still read perturbed text while OCR systems fail:

#	Perceptual Principle	Description	Mapped Attack Category
1	Gestalt Closure	Humans mentally complete fragmented shapes	Structural Fragmentation
2	Contrast Constancy	Human visual system compensates for global contrast	Photometric Distortion
3	Weber-Fechner Law	Human perception is logarithmic	Adversarial Gradient
4	Lateral Inhibition	Retina enhances edges, reducing noise impact	Noise Injection
5	Top-Down Processing	Context and language knowledge resolve ambiguity	Typographic Manipulation
6	Contrast Sensitivity Function	Less sensitive to high-frequency changes	Frequency Domain
7	Topological Invariance	Sensitive to connectedness, invariant to deformation	Geometric Warping
8	Feature Integration Theory	Robust multi-scale feature integration	Composite/Multi-Layer
9	Letter Similarity Confusion	Readers' language knowledge disambiguates similar-looking letters	Persian-Specific

Scientific Basis: All perceptual principles are documented in vision science literature. See our citation manager (aura/research/citations.py) for 9+ peer-reviewed references supporting each principle.

📊 Evaluation & Metrics

AURA provides comprehensive evaluation metrics organized into three categories: image quality, OCR degradation, and combined effectiveness.

Image Quality Metrics

Metric	Full Name	Description	Range	Ideal
SSIM	Structural Similarity Index	Measures structural similarity (luminance, contrast, structure)	-1 to 1	> 0.85
PSNR	Peak Signal-to-Noise Ratio	Ratio between max signal and noise power	0-∞ dB	> 30 dB
LPIPS	Learned Perceptual Image Patch Similarity	Deep learning-based perceptual similarity	0-1+	< 0.3
L2 Norm	Euclidean perturbation magnitude	Average perturbation size	0-∞	Lower is better
L∞ Norm	Maximum pixel change	Max perturbation at any pixel	0-255	< 20

SSIM Formula

SSIM(x, y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ

Where:
  l(x,y) = (2μ_x μ_y + C₁) / (μ_x² + μ_y² + C₁)  [Luminance]
  c(x,y) = (2σ_x σ_y + C₂) / (σ_x² + σ_y² + C₂)  [Contrast]
  s(x,y) = (σ_xy + C₃) / (σ_x σ_y + C₃)           [Structure]
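
As a rough illustration of the three terms, the sketch below evaluates them once over the whole image with α = β = γ = 1. It assumes the commonly used constants C₁ = (0.01·MAX)², C₂ = (0.03·MAX)², C₃ = C₂/2; production SSIM implementations average the index over local windows instead, and AURA's own computation may differ.

```python
from statistics import fmean

def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM over two equal-length pixel sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    c3 = c2 / 2  # assumed convention, not stated in the formula above

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure

pixels = [0, 64, 128, 192, 255]
print(global_ssim(pixels, pixels))        # identical images score 1
print(global_ssim(pixels, pixels[::-1]))  # reversed ramp: negative structure term
```

The second call shows why the range extends below zero: the means and variances match, but the covariance is negative.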

PSNR Formula

PSNR = 10 · log₁₀(MAX² / MSE)

Where:
  MAX = Maximum pixel value (255 for 8-bit)
  MSE = Mean Squared Error
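
A minimal sketch of the formula for flat sequences of 8-bit pixel values; the function name `psnr` is illustrative rather than AURA's API.

```python
import math

def psnr(original, perturbed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(perturbed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise power
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel off by one gray level gives MSE = 1, i.e. 20·log10(255) dB.
print(round(psnr([10, 20, 30], [11, 21, 31]), 2))  # 48.13
```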

OCR Degradation Metrics

Metric	Full Name	Formula	Range	Interpretation
CER	Character Error Rate	(S + D + I) / N	0-∞	0 = perfect, >1 = more errors than chars
WER	Word Error Rate	(S + D + I) / N (words)	0-∞	0 = perfect; word-level, so closer to semantic errors
SER	Sentence Error Rate	Incorrect sentences / Total	0-1	0 = all correct

Where:

S = Substitutions
D = Deletions
I = Insertions
N = Total characters/words in reference
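
The numerator S + D + I is the Levenshtein edit distance between reference and OCR output. The sketch below computes CER and WER that way; it compares raw code points without any Unicode normalization, and the function names are illustrative, not AURA's.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer("abc", "abd"))    # one substitution out of three characters
print(cer("ab", "abcdef"))  # four insertions against two characters: CER = 2.0
```

The second example shows how CER exceeds 1 when the OCR engine hallucinates more characters than the reference contains.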

Combined Metrics

Metric	Formula	Interpretation	Target
Destruction Score	Weighted composite of image metrics	OCR degradation	> 0.80
Readability Score	SSIM-based readability	Human readability preservation	> 0.75
Effectiveness	Destruction × Readability	Overall attack quality	> 0.60

Destruction Score Formula

destruction = (
    0.20 * ssim_destruction +      # Structural changes
    0.25 * text_destruction +       # Binarization changes
    0.20 * edge_destruction +       # Edge disruption
    0.15 * cc_destruction +         # Connected components
    0.10 * hist_destruction +       # Histogram changes
    0.10 * stroke_destruction       # Stroke width changes
)

Enhanced Readability Formula

readability = (
    0.25 * multi_scale_ssim +       # Multi-resolution SSIM
    0.20 * text_region_ssim +       # Text-area focused SSIM
    0.20 * structural_preservation + # Edges, corners, loops
    0.15 * character_legibility +   # Stroke consistency
    0.10 * context_readability +    # Word-level comprehension
    0.05 * contrast_ratio +         # Contrast preservation
    0.05 * edge_overlap             # Edge preservation
)

Validation: Enhanced readability metrics correlate with human studies at r=0.87 (vs. r=0.72 for basic SSIM alone).
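
The two weighted sums and their product can be sketched as follows. The weights are copied from the formulas above; the helper names and the assumption that every component is already scaled to [0, 1] are illustrative.

```python
import math

DESTRUCTION_WEIGHTS = {
    "ssim_destruction": 0.20, "text_destruction": 0.25, "edge_destruction": 0.20,
    "cc_destruction": 0.15, "hist_destruction": 0.10, "stroke_destruction": 0.10,
}
READABILITY_WEIGHTS = {
    "multi_scale_ssim": 0.25, "text_region_ssim": 0.20,
    "structural_preservation": 0.20, "character_legibility": 0.15,
    "context_readability": 0.10, "contrast_ratio": 0.05, "edge_overlap": 0.05,
}

def weighted_score(components: dict, weights: dict) -> float:
    """Weighted sum of component scores; every weighted component must be present."""
    missing = weights.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)

def effectiveness(destruction: float, readability: float) -> float:
    return destruction * readability

# Both weight sets sum to 1, so scores stay in [0, 1] for inputs in [0, 1].
assert math.isclose(sum(DESTRUCTION_WEIGHTS.values()), 1.0)
assert math.isclose(sum(READABILITY_WEIGHTS.values()), 1.0)

# An attack sitting exactly on both targets lands on the effectiveness target.
print(round(effectiveness(0.80, 0.75), 2))  # 0.6
```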

🔍 OCR Engines

AURA supports multiple OCR engines for comprehensive evaluation:

Traditional OCR

1. Tesseract OCR

Description: Open-source OCR engine, originally developed at HP and later sponsored by Google. Versions 4 and later use LSTM-based recognition. Best for document OCR.

from aura.evaluation import TesseractEngine

engine = TesseractEngine(
    lang="fas",  # Persian/Farsi
    config="--psm 6"  # Page segmentation mode
)
result = engine.recognize(image)
print(f"Text: {result.text}")
print(f"Confidence: {result.confidence:.1f}%")

Page Segmentation Modes (PSM):

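0: Orientation and script detection (OSD) only
3: Fully automatic page segmentation, but no OSD (Tesseract's default)
6: Assume a single uniform block of text (used in the example above)
7: Treat the image as a single text line
13: Raw line: treat the image as a single text line, bypassing Tesseract-specific hacks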

Strengths: Fast, good for documents, supports many languages
Weaknesses: Sensitive to binarization, struggles with non-standard fonts

2. EasyOCR

Description: Deep learning-based OCR using CRAFT detection and CRNN recognition.

from aura.evaluation import EasyOCREngine

engine = EasyOCREngine(
    lang="fas",  # Mapped to "fa" internally
    gpu=False
)
result = engine.recognize(image)
print(f"Text: {result.text}")

Strengths: Better scene text recognition, handles curved text
Weaknesses: Slower, higher memory usage, may hallucinate on noise

LLM-Based OCR

1. OpenAI GPT-4V / GPT-4o

from aura.evaluation import LLMOCREngine

engine = LLMOCREngine(
    provider="openai",
    model="gpt-4o",  # or "gpt-4-vision-preview"
    api_key="YOUR_KEY",  # or set OPENAI_API_KEY env var
    lang="fas"
)
result = engine.recognize(image)

Strengths: Context understanding, handles degraded images, excellent Persian support
Weaknesses: API costs, rate limits, latency

2. Anthropic Claude

engine = LLMOCREngine(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    api_key="YOUR_KEY"  # or set ANTHROPIC_API_KEY
)

3. Google Gemini

engine = LLMOCREngine(
    provider="gemini",
    model="gemini-pro-vision",
    api_key="YOUR_KEY"  # or set GOOGLE_API_KEY
)

4. Local LLMs (Ollama)

engine = LLMOCREngine(
    provider="local",
    model="llava",
    api_base="http://localhost:11434"
)

Strengths: No API costs, privacy, no rate limits
Weaknesses: Lower accuracy, requires GPU

VLLM/OpenRouter Vision Models

AURA supports benchmarking multiple vision models through OpenRouter:

Model Family	Models	Description
Claude 3	Opus, Sonnet, Haiku, 3.5 Sonnet/Opus	Anthropic's vision models
GPT-4	GPT-4o, GPT-4 Turbo, Vision	OpenAI's vision models
Qwen	VL Max, VL Plus, Qwen2-VL-72B	Alibaba's vision models
Gemini	Pro Vision, 1.5 Pro	Google's vision models
Others	LLaVA, Pixtral	Open-source models

from aura.evaluation.vllm_ocr import benchmark_vision_models

results = benchmark_vision_models(
    image=perturbed_image,
    ground_truth="متن اصلی",
    models=["claude-3-opus", "gpt-4o", "qwen-vl-max"],
    api_key="your_openrouter_key"
)

# Results: CER, WER, processing time for each model
for model, metrics in results.items():
    print(f"{model}: CER={metrics['cer']:.3f}, WER={metrics['wer']:.3f}, Time={metrics['processing_time']:.2f}s")

Hybrid OCR Systems

Combine traditional OCR with LLM verification:

from aura.evaluation import HybridOCREngine

hybrid = HybridOCREngine(
    primary_engine="tesseract",
    llm_provider="openai",
    confidence_threshold=60.0,  # Use LLM if confidence below this
    lang="fas"
)

result = hybrid.recognize(image)
# Uses Tesseract first, falls back to GPT-4V if confidence < 60%
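
The fallback rule described above can be sketched independently of any engine. `OCRResult`, `recognize_with_fallback` and the two stand-in engines are hypothetical names for illustration; `HybridOCREngine` may implement the decision differently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class OCRResult:
    text: str
    confidence: float  # percent, 0-100
    engine: str

def recognize_with_fallback(
    image: object,
    primary: Callable[[object], OCRResult],
    fallback: Callable[[object], OCRResult],
    confidence_threshold: float = 60.0,
) -> OCRResult:
    """Return the primary result unless its confidence is below the threshold."""
    result = primary(image)
    if result.confidence < confidence_threshold:
        return fallback(image)  # only pay for the LLM call when needed
    return result

# Stand-in engines so the sketch runs without Tesseract or an API key.
def fake_tesseract(image):
    return OCRResult(text="???", confidence=41.5, engine="tesseract")

def fake_llm(image):
    return OCRResult(text="متن فارسی", confidence=95.0, engine="llm")

print(recognize_with_fallback(None, fake_tesseract, fake_llm).engine)  # llm
```

Keeping the threshold as a parameter makes it easy to measure how often a given attack pushes the primary engine below it.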

Multi-Provider OCR

Try multiple providers and compare:

from aura.evaluation import MultiProviderLLMOCR

multi_ocr = MultiProviderLLMOCR(
    providers=["openai", "anthropic"],
    lang="fas"
)

# Get results from all providers
results = multi_ocr.recognize(image)

# Or use fallback (tries providers in order)
result = multi_ocr.recognize_with_fallback(image)

🇮🇷 Persian Text Support

AURA provides comprehensive support for Persian/Arabic (RTL) text:

Multi-Font Dataset (20 Persian Fonts)

AURA supports 20 open-source Persian fonts (OFL License) with extensive variation:

Font Families

Font	Variants	Description
Vazirmatn	Thin, ExtraLight, Light, Regular, Medium, SemiBold, Bold, ExtraBold, Black	Most popular Persian font (9 weights)
Samim	Regular, Bold	Modern, clean design
Parastoo	Regular, Bold	Traditional style
Shabnam	Regular, Bold	Rounded, friendly
Sahel	Regular, Bold	Sharp, professional
Tanha	Regular	Compact, modern
Gandom	Regular	Elegant curves
Rouba	Regular	Classic style

Variation Dimensions

Dimension	Options	Examples
Font Sizes	9 sizes	14, 18, 24, 32, 42, 56, 72, 96, 120 pt
Text Colors	10 colors	Black, Dark Gray, Medium Gray, Dark Blue, Navy Blue, Dark Red, Dark Green, Dark Brown, Dark Purple, Teal
Backgrounds	7 backgrounds	White, Off-White, Light Gray, Cream, Sepia, Light Yellow, Light Blue
Style Effects	7 effects	Normal, Bold, Italic, Underlined, Strikethrough, Bold+Italic, Shadow

Total Possible Combinations: 20 × 9 × 10 × 7 × 7 × 22 = 1,940,400 unique images
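
The product of the six factors can be checked directly. The first five labels follow the tables above; the last factor, 22, is not itemized there and is carried over as given under a placeholder key.

```python
import math

factors = {
    "font_variants": 20,
    "font_sizes": 9,
    "text_colors": 10,
    "backgrounds": 7,
    "style_effects": 7,
    "unlabeled_factor": 22,  # taken from the product above; not itemized in the tables
}

total = math.prod(factors.values())
print(f"{total:,}")  # 1,940,400
```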

Dataset Generation

from aura.dataset import EnhancedMultiFontDataset

gen = EnhancedMultiFontDataset(font_dir="fonts")
dataset = gen.generate_comprehensive_dataset(
    output_dir="persian_dataset",
    fonts=[
        "vazirmatn_regular", "vazirmatn_bold", "vazirmatn_light",
        "samim_regular", "parastoo_regular", "shabnam_regular",
        "sahel_regular", "tanha_regular", "gandom_regular"
    ],
    sizes=[24, 32, 42, 56, 72],
    colors=["black", "dark_blue", "dark_red", "dark_green"],
    backgrounds=["white", "cream", "sepia"],
    style_effects=["normal", "bold_effect", "underlined"],
    max_samples=10000,  # Limit for testing
    stratified_sampling=True  # Balanced across conditions
)

print(gen.generate_comprehensive_report(dataset))

Sample Dataset Output

Dataset: Enhanced Persian Multi-Font
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Samples: 10,000
Font Families: 8 (Vazirmatn, Samim, Parastoo, Shabnam, Sahel, Tanha, Gandom, Rouba)
Font Variants: 20 (9 Vazirmatn weights + 11 others)
Font Sizes: 9 (14, 18, 24, 32, 42, 56, 72, 96, 120 pt)
Text Colors: 10 (black, grays, blue, red, green, brown, purple, teal)
Backgrounds: 7 (white, off-white, gray, cream, sepia, yellow, blue)
Style Effects: 7 (normal, bold, italic, underline, strikethrough, bold+italic, shadow)
Text Categories: 6 (simple, sentences, paragraphs, mixed numerals, complex, poetry)
Theoretical Maximum: 1,940,400 combinations

Persian Text Generation

Generate Persian text images with proper RTL rendering:

from aura.utils import PersianTextGenerator, TextStyle

generator = PersianTextGenerator()

# Simple text
image, text = generator.generate_image("متن فارسی")

# Custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",  # RTL default
)
image, text = generator.generate_image("متن با استایل سفارشی", style=style)

# Exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Paragraph
image, text = generator.generate_paragraph_image()

🔧 API Reference

Creating Attacks

from aura.attacks import (
    # Fine-tuned (recommended)
    PersianOptimizedAttack,
    MaxDestructionAttack,
    get_attack_for_target,

    # ArXiv paper-enhanced
    MomentumIterativeFGSM,
    DiverseInputFGSM,
    TranslationInvariantFGSM,
    VLMTargetedAttack,
    LLMOCRTargetedAttack,
    PersianScriptOptimizedAttack,
    MultiModalEnsembleAttack,

    # Composite
    create_custom_composite,

    # Base
    AttackConfig,
)

# Use pre-configured attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get attack for specific OCR target
attack = get_attack_for_target(
    target_ocr="tesseract",  # Options: tesseract, easyocr, transformer, vlm, llm, persian, multimodal
    destruction_priority=0.9,
    readability_priority=0.7
)

# Target VLM/LLM-based OCR (GPT-4V, Claude, etc.)
attack = get_attack_for_target(target_ocr="vlm")

# Multi-modal ensemble for all OCR types
attack = get_attack_for_target(target_ocr="multimodal")

# Custom composite attack
attack = create_custom_composite(
    attacks=[
        ("perlin_noise", {"amplitude": 20.0}),
        ("elastic", {"alpha": 30.0}),
        ("micro_rotation", {"max_angle": 1.5}),
    ],
    total_intensity=0.5,
    distribution="decreasing"
)

Batch Processing

from aura.attacks import PersianOptimizedAttack

attack = PersianOptimizedAttack()

# Process multiple images
results = attack.apply_batch(images)

for result in results:
    print(f"L2 norm: {result.l2_norm}")
    print(f"L∞ norm: {result.linf_norm}")
    print(f"SSIM: {result.ssim}")

Persian Text Generation API

from aura.utils import PersianTextGenerator, TextStyle

generator = PersianTextGenerator()

# Generate with custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",
)
image, text = generator.generate_image("متن سفارشی", style=style)

# Generate exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Generate paragraph
image, text = generator.generate_paragraph_image()

# Generate with specific font
image, text = generator.generate_image(
    "متن با فونت وزیر",
    font="vazirmatn_bold",
    size=56,
    color="dark_blue",
    background="cream",
    style_effect="shadow"
)

OCR Evaluation

from aura.evaluation import OCREvaluator, MetricsCalculator, LLMOCREvaluator
from aura.attacks import PersianScriptV2Attack

# Create attack
attack = PersianScriptV2Attack(intensity=1.0)

# Apply attack
result = attack.apply(original_image)
perturbed = result.perturbed_image

# Traditional OCR evaluation
ocr_eval = OCREvaluator(engines=["tesseract", "easyocr"])
eval_results = ocr_eval.evaluate_attack(
    original_image,
    perturbed,
    ground_truth="متن فارسی نمونه"
)

# Print results
for engine, metrics in eval_results.items():
    print(f"{engine}:")
    print(f"  Original: {metrics['original_text']}")
    print(f"  Perturbed: {metrics['perturbed_text']}")
    print(f"  CER: {metrics['cer']:.3f}")
    print(f"  WER: {metrics['wer']:.3f}")

# Metrics computation
metrics_calc = MetricsCalculator(use_lpips=True)
metrics = metrics_calc.compute_all(
    original_image,
    perturbed,
    original_text=eval_results["tesseract"]["original_text"],
    perturbed_text=eval_results["tesseract"]["perturbed_text"],
    ground_truth="متن فارسی نمونه"
)

print(f"SSIM: {metrics.ssim:.3f}")
print(f"PSNR: {metrics.psnr:.1f} dB")
print(f"Effectiveness: {metrics.effectiveness:.3f}")

🚀 Advanced Features

Attack Parameter Optimization

Find optimal attack configurations automatically:

# Quick optimization
python scripts/run_attack_optimizer.py --quick

# Full optimization with EasyOCR
python scripts/run_attack_optimizer.py --full --with-easyocr

from aura.analysis import AttackParameterOptimizer

optimizer = AttackParameterOptimizer(
    attack_name="PersianOptimized",
    metric="effectiveness",
    intensity_range=(0.0, 1.0),
    num_steps=20
)

best_config = optimizer.optimize(image, ground_truth)
print(f"Best intensity: {best_config.intensity:.3f}")
print(f"Best effectiveness: {best_config.effectiveness:.3f}")
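
One simple way to read `intensity_range` and `num_steps` is as an evenly spaced grid search. The sketch below assumes that interpretation and a caller-supplied scoring function; `AttackParameterOptimizer` may use a different search procedure.

```python
def grid_search_intensity(score_fn, intensity_range=(0.0, 1.0), num_steps=20):
    """Evaluate score_fn on an evenly spaced grid and return (best_intensity, best_score)."""
    low, high = intensity_range
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (high - low) / (num_steps - 1)
    candidates = [low + i * step for i in range(num_steps)]
    scored = [(score_fn(t), t) for t in candidates]  # one evaluation per candidate
    best_score, best_intensity = max(scored)
    return best_intensity, best_score

# Toy effectiveness curve: destruction rises with intensity, readability falls.
def toy_effectiveness(intensity):
    destruction = intensity
    readability = 1.0 - intensity
    return destruction * readability

best_t, best_e = grid_search_intensity(toy_effectiveness, num_steps=21)
print(f"Best intensity: {best_t:.3f}, effectiveness: {best_e:.3f}")
```

In practice `score_fn` would apply the attack at the given intensity, run OCR, and return the effectiveness metric, which makes each evaluation expensive and is why the number of steps is kept small.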

Systematic Composite Optimization

Find optimal attack combinations using 5 strategies:

from aura.attacks.composite_optimizer import (
    CompositeOptimizer,
    AttackComponent,
    CombinationStrategy
)

# Define attack pool
attack_pool = [
    AttackComponent("ElasticDeformation", "geometric_warping", 0.3),
    AttackComponent("ContrastManipulation", "photometric_distortion", 0.3),
    AttackComponent("DotRemoval", "structural_fragmentation", 0.3),
    AttackComponent("PerlinNoise", "noise_injection", 0.3),
]

# Optimize
optimizer = CompositeOptimizer(
    min_readability=0.75,  # Enforce minimum readability
    max_components=4,
    strategy=CombinationStrategy.PARETO_OPTIMAL
)

optimizer.set_attack_pool(attack_pool)
result = optimizer.optimize(evaluation_fn)

print(result.to_markdown())
# Output: Best composite, top 10 composites, Pareto frontier, interaction analysis

5 Combination Strategies:

Exhaustive Search: All combinations (slow, optimal)
Greedy Addition: Add best attack iteratively (fast, suboptimal)
Mutual Information: Select complementary attacks (balanced)
Pareto Optimization: Balance destruction vs readability (recommended)
Random Search: Sample combination space (baseline)
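
To illustrate the Pareto strategy, the sketch below filters candidate composites by a readability floor and keeps those not dominated on both destruction and readability. The tuple layout and function name are assumptions; `CompositeOptimizer` may structure its results differently.

```python
def pareto_frontier(candidates, min_readability=0.0):
    """Return candidates that no other feasible candidate dominates.

    Each candidate is a (name, destruction, readability) tuple; higher is better
    on both axes.
    """
    feasible = [c for c in candidates if c[2] >= min_readability]
    frontier = []
    for name, d, r in feasible:
        dominated = any(
            d2 >= d and r2 >= r and (d2 > d or r2 > r)
            for _, d2, r2 in feasible
        )
        if not dominated:
            frontier.append((name, d, r))
    return frontier

composites = [
    ("elastic+dots", 0.90, 0.76),
    ("contrast+perlin", 0.80, 0.85),
    ("perlin only", 0.70, 0.80),      # dominated by contrast+perlin
    ("everything", 0.95, 0.60),       # fails the readability floor
]
for name, d, r in pareto_frontier(composites, min_readability=0.75):
    print(f"{name}: destruction={d:.2f}, readability={r:.2f}")
```

The frontier leaves the final choice between higher destruction and higher readability to the user instead of collapsing both into one number.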

Enhanced Readability Evaluation

Multi-scale readability assessment:

from aura.evaluation import EnhancedReadabilityEvaluator

evaluator = EnhancedReadabilityEvaluator()

# Evaluate single image
result = evaluator.evaluate(original_img, perturbed_img)
print(f"Overall Readability: {result.overall_readability:.3f}")
print(f"Multi-scale SSIM: {result.multi_scale_ssim:.3f}")
print(f"Text-region SSIM: {result.text_region_ssim:.3f}")
print(f"Structural Preservation: {result.structural_preservation:.3f}")
print(f"Character Legibility: {result.character_legibility:.3f}")

# Evaluate batch with comprehensive statistics
statistics = evaluator.evaluate_batch(original_images, perturbed_images)
print(evaluator.generate_statistics_report(statistics))
# Output: Complete statistical table with mean, std, variance, quartiles, CI
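
The kind of summary listed in that comment can be sketched with the standard library. The 95% interval here uses a normal approximation (z = 1.96), which is an assumption; `generate_statistics_report` may compute its interval differently.

```python
import math
import statistics

def summarize(scores, z=1.96):
    """Mean, sample std, variance, quartiles and a normal-approximation CI."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)          # sample standard deviation
    half_width = z * std / math.sqrt(n)     # CI half-width for the mean
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "n": n, "mean": mean, "std": std, "variance": std ** 2,
        "q1": q1, "median": median, "q3": q3,
        "ci_low": mean - half_width, "ci_high": mean + half_width,
    }

readability_scores = [0.78, 0.81, 0.74, 0.80, 0.77, 0.83]
stats = summarize(readability_scores)
print(f"{stats['mean']:.3f} ± {stats['std']:.3f} "
      f"(95% CI {stats['ci_low']:.3f} to {stats['ci_high']:.3f})")
```

Because the half-width shrinks with the square root of the sample count, a large benchmark can report a wide standard deviation alongside a very narrow confidence interval.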

🐳 Docker & Infrastructure

Docker Setup

Building the Image

# Build Docker image
docker build -t aura:latest .

# Build with no cache (forces a clean rebuild; slower)
docker build --no-cache -t aura:latest .

Running the Container

# Interactive shell
docker run -it --rm \
    -v $(pwd):/workspace \
    -v $(pwd)/results:/workspace/results \
    aura:latest

# Run benchmark directly
docker run --rm \
    -v $(pwd):/workspace \
    -v $(pwd)/results:/workspace/results \
    aura:latest \
    python scripts/run_comprehensive_benchmark.py --num-samples 100 --full-benchmark

# Quick start
python scripts/run_comprehensive_benchmark.py --num-samples 5

# Full benchmark
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark

# Optimize parameters
python scripts/run_attack_optimizer.py --quick

# Test specific image
python scripts/run_image_tester.py --image test.png --compare

# Generate dataset
python scripts/run_test_generator.py --num-tests 500

# Docker
docker run -it --rm -v $(pwd):/workspace aura:latest
