A comprehensive framework for generating adversarial perturbations against OCR systems while maintaining human readability.
Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns
Dwin Gharibi · Mohammad Amin Haji Alirezaei · Kaebeh Yaeghoobi
Department of Computer Engineering & Computer Networks,
K. N. Toosi University of Technology, Tehran, Iran
Contents of this section: Abstract · Headline result · Key contributions · How OCR & human vision differ · Evaluation metrics · Attack taxonomy · Benchmark results · Ineffective attacks · Composite attacks · Future work · AI-assistance disclosure · Citation
Optical Character Recognition (OCR) systems have become ubiquitous in modern applications, from document digitization to autonomous vehicles. However, these systems fundamentally differ from human visual perception in their inability to exploit contextual understanding, imagination, and pattern completion. We present Illusions for Machines, a unified framework for generating adversarial perturbations that exploit the perceptual gap between human vision and machine recognition. Unlike previous works that focused on model-specific attacks using gradient-based methods such as FGSM, PGD, and C&W — which require white-box access and are tailored to specific architectures — our approach introduces architecture-agnostic, model-agnostic attacks based on human visual cognition patterns. By using visual illusions, gestalt principles, and the cognitive pattern-completion mechanisms that humans naturally employ, we construct perturbations that remain fully readable to human observers while effectively misleading traditional OCR engines (Tesseract, EasyOCR), modern vision–language models (GPT-4V, Claude, Gemini), and LLMs that rely on visual or extracted textual inputs. This paper is a systematization of attacks together with a unified evaluation framework, and constitutes the first phase of a larger research programme. We benchmark 27 attacks across 213,840 perturbed images (20 Persian sentences × 11 free fonts × 9 font sizes × 4 intensity levels × 27 attacks), running every sample through Tesseract and EasyOCR and reporting every metric with both standard deviation and 95% confidence intervals. Our top-performing structural attack (ConnectionBreaker) attains an effectiveness of 0.850 ± 0.078 (mean ± s.d.; 95% CI ±0.002), with Tesseract CER 0.882 ± 0.742, EasyOCR CER 0.658 ± 0.347, SSIM 0.920 ± 0.034, and humanized readability 0.781 ± 0.068.
Keywords: Adversarial Examples, OCR, Vision-Language Models, Human Visual Perception, Deep Learning Security.
The core thesis in one picture: a perturbation that a human reads effortlessly but that collapses OCR.
ConnectionBreaker (rank 1). Aggregated over n = 7,920 samples: SSIM = 0.920 ± 0.034, humanized readability Rh = 0.781 ± 0.068, Tesseract CER = 0.882 ± 0.742, EasyOCR CER = 0.658 ± 0.347, effectiveness E = 0.850 ± 0.078.
- Scale: 27 attacks × 7,920 samples per attack = 213,840 perturbed images, where 7,920 = 20 texts × 11 free Persian fonts × 9 sizes × 4 intensities; every image is scored by both Tesseract and EasyOCR.
- Rigor: every metric reported as mean ± s.d. with two-sided 95% confidence intervals; pairwise attack differences bootstrapped (1,000 resamples) into a Cohen's-d matrix.
- Finding: structural attacks that exploit the Gestalt principle of closure (ConnectionBreaker, StrokeBreaking) dominate every font, size and intensity — while classical gradient attacks (PGD, C&W, MI-FGSM) are the weakest in the whole benchmark because OCR binarization "heals" their pixel noise.
- Metric: effectiveness E = D · Rh (destruction × humanized readability) rewards attacks that destroy OCR and stay human-readable.
- A taxonomy of 27 attack families spanning structural, geometric, photometric and gradient-based perturbations, each mapped to a human-perception mechanism (Gestalt closure / continuity / figure-ground).
- A reproducible benchmark of 213,840 perturbed samples across 11 free Persian fonts, 9 sizes, 4 intensities and 20 reference sentences — released in full as a per-sample CSV.
- A statistically rigorous reporting protocol (means, standard deviations, 95% CIs, per-intensity sweeps, per-font / per-size / per-text generalisation, and bootstrapped significance testing).
- A composite-attack selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search) that explains why certain attack combinations amplify non-linearly.
Positioning. This is the breadth paper of a multi-phase programme: it surveys, taxonomises and benchmarks attacks and provides the evaluation harness. A follow-up paper will distill these findings into a single optimised, architecture-agnostic attack and add a controlled human study and VLM transfer matrix (see Future Work).
The attacks are designed against the concrete architectures below. Traditional engines rely on binarization → layout/line detection → character segmentation → sequence decoding; VLMs use a holistic vision transformer. Both lack human imagination and pattern completion, which is precisely the gap the attacks exploit.
Fig. 5 — Gestalt principles. Closure, similarity, proximity, continuity/common-fate: the human perceptual machinery the attacks lean on so that text stays readable to people while breaking machines.
Each perturbed image is scored on a rich, fully-released metric suite:
| Group | Metrics | Purpose |
|---|---|---|
| OCR failure | Character Error Rate (CER), Word Error Rate (WER) | Levenshtein-based recognition damage (can exceed 1.0 when OCR hallucinates). |
| Perceptual fidelity | SSIM, PSNR, MS-SSIM, LPIPS, GMSD, CIEDE2000 | How close the attacked image stays to the clean one. |
| Topology preservation | Text-preservation, Stroke-preservation (mask-IoU) | Whether character body and central spine survive. |
| Composite (ours) | Humanized readability Rh, Destruction D, Effectiveness E = D · Rh | The balanced ranking criteria used throughout the paper. |
- Destruction `D` is a calibrated weighted sum of seven complementary OCR-damage axes (structural, text-region, edge, connected-component, contrast, morphological, noise) aligned empirically against measured CER.
- Humanized readability `R_h` is a sigmoid-calibrated blend of MS-SSIM, GMSD, CIEDE2000, text-preservation and Canny-edge overlap, tuned so that `R_h ≥ 0.75` ⇔ text all three native-Persian-speaking co-authors could read effortlessly. It is an author-validated automatic proxy, not a crowd study (a pre-registered study is deferred to follow-up work).
- Effectiveness `E = D · R_h` is the primary ranking metric; it only rewards attacks that break OCR and remain legible.
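The metric table notes that CER and WER are Levenshtein-based and can exceed 1.0 when the OCR engine hallucinates extra text. A minimal sketch of that computation follows, assuming the usual definition (edit distance divided by reference length); the function names are illustrative and are not taken from the AURA codebase.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # drop a reference symbol
                curr[j - 1] + 1,          # absorb an extra hypothesis symbol
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: the same computation over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(cer("سلام", "سلام"))    # identical strings
print(cer("ab", "abxyz"))      # three hallucinated characters on a 2-character reference
```

Because the denominator is the reference length, a hypothesis that is much longer than the reference pushes the rate above 1.0, which is how per-attack means such as a Tesseract CER above 2 can arise.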
The 27 benchmarked attacks fall into three mechanism families, each tied to a perceptual principle. CER/WER/SSIM/Rh values in each caption are aggregated over the full 7,920-sample grid for that attack.
- Structural (Gestalt closure): break the thin ligatures/strokes; humans bridge the gaps, OCR segmentation collapses.
- Geometric (continuity): bend the baseline; readers follow the flow, OCR line-tracking fails.
- Photometric (figure-ground): inject clutter/noise/contrast; humans isolate text from ground, OCR binarization does not.
Every value below is computed over 7,920 perturbed samples per attack (N = 213,840 total) and reported as mean ± s.d. These are the paper's authoritative numbers; the single-image CER/WER values quoted alongside individual figures are illustrative only.
| # | Attack | SSIM | MS-SSIM | Destruction | Humanized | Effectiveness | 95% CI | Tess. CER | Easy. CER |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ConnectionBreaker | 0.920 ± 0.034 | 0.923 ± 0.038 | 0.993 ± 0.035 | 0.781 ± 0.068 | 0.850 ± 0.078 | ±0.002 | 0.882 ± 0.742 | 0.658 ± 0.347 |
| 2 | StrokeBreaking | 0.891 ± 0.059 | 0.889 ± 0.069 | 0.972 ± 0.094 | 0.725 ± 0.133 | 0.820 ± 0.080 | ±0.002 | 1.226 ± 1.300 | 0.882 ± 0.310 |
| 3 | Wavy | 0.902 ± 0.054 | 0.973 ± 0.025 | 0.900 ± 0.185 | 0.791 ± 0.092 | 0.767 ± 0.133 | ±0.003 | 0.440 ± 0.513 | 0.520 ± 0.329 |
| 4 | Zigzag | 0.753 ± 0.102 | 0.841 ± 0.088 | 0.992 ± 0.034 | 0.614 ± 0.138 | 0.727 ± 0.060 | ±0.001 | 0.612 ± 0.333 | 0.813 ± 0.190 |
| 5 | Elastic | 0.848 ± 0.073 | 0.901 ± 0.064 | 0.982 ± 0.053 | 0.720 ± 0.100 | 0.670 ± 0.052 | ±0.001 | 0.157 ± 0.316 | 0.346 ± 0.289 |
| 6 | GaussianNoise | 0.447 ± 0.107 | 0.480 ± 0.135 | 1.000 ± 0.000 | 0.445 ± 0.175 | 0.665 ± 0.066 | ±0.001 | 1.883 ± 2.681 | 0.406 ± 0.288 |
| 7 | SaltPepper | 0.682 ± 0.047 | 0.801 ± 0.037 | 1.000 ± 0.000 | 0.453 ± 0.048 | 0.660 ± 0.028 | ±0.001 | 2.117 ± 1.715 | 0.921 ± 0.111 |
| 8 | MaxDestruction | 0.942 ± 0.007 | 0.980 ± 0.005 | 0.868 ± 0.080 | 0.865 ± 0.011 | 0.660 ± 0.054 | ±0.001 | 0.111 ± 0.284 | 0.313 ± 0.279 |
| 9 | Mirror | 0.808 ± 0.034 | 0.825 ± 0.052 | 0.733 ± 0.143 | 0.753 ± 0.076 | 0.642 ± 0.113 | ±0.002 | 0.326 ± 0.344 | 0.598 ± 0.399 |
| 10 | Interleaved | 0.517 ± 0.137 | 0.521 ± 0.100 | 1.000 ± 0.000 | 0.271 ± 0.095 | 0.600 ± 0.081 | ±0.002 | 0.957 ± 0.077 | 0.950 ± 0.147 |
| 11 | MicroRotation | 0.773 ± 0.130 | 0.783 ± 0.160 | 0.942 ± 0.184 | 0.609 ± 0.194 | 0.590 ± 0.138 | ±0.003 | 0.151 ± 0.313 | 0.373 ± 0.288 |
| 12 | CaptchaStyle | 0.424 ± 0.131 | 0.587 ± 0.159 | 1.000 ± 0.001 | 0.330 ± 0.123 | 0.532 ± 0.080 | ±0.002 | 1.025 ± 0.656 | 2.170 ± 1.851 |
| 13 | StructuredNoise | 0.848 ± 0.090 | 0.903 ± 0.066 | 0.553 ± 0.263 | 0.872 ± 0.056 | 0.488 ± 0.208 | ±0.005 | 0.106 ± 0.278 | 0.323 ± 0.276 |
| 14 | DotRemoval | 0.969 ± 0.033 | 0.976 ± 0.023 | 0.542 ± 0.348 | 0.900 ± 0.060 | 0.477 ± 0.283 | ±0.006 | 0.266 ± 0.310 | 0.518 ± 0.273 |
| 15 | TI-FGSM | 0.865 ± 0.085 | 0.974 ± 0.020 | 0.504 ± 0.254 | 0.917 ± 0.025 | 0.450 ± 0.206 | ±0.005 | 0.106 ± 0.278 | 0.307 ± 0.275 |
- Figure: Mean effectiveness ± s.d. per attack.
- Figure: Attack ranking with 95% CI markers; the top-3 structural cluster is significant.
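The 95% CI column in the ranking table is not accompanied by a formula. As a rough check, the sketch below assumes a normal-approximation interval (z = 1.96) over the 7,920 samples per attack; that estimator is an assumption rather than something the source states. The s.d. values are copied from the Effectiveness column.

```python
import math

# Grid dimensions as stated in the text.
texts, fonts, sizes, intensities, attacks = 20, 11, 9, 4, 27
n_per_attack = texts * fonts * sizes * intensities   # samples per attack
n_total = n_per_attack * attacks                     # whole benchmark


def ci95_half_width(sd, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a mean (assumed estimator)."""
    return z * sd / math.sqrt(n)


# Effectiveness s.d. values copied from the ranking table.
effectiveness_sd = {"ConnectionBreaker": 0.078, "Wavy": 0.133, "DotRemoval": 0.283}

print(n_per_attack, n_total)
for name, sd in effectiveness_sd.items():
    print(f"{name}: ±{ci95_half_width(sd, n_per_attack):.3f}")
```

Under this assumption the three half-widths round to 0.002, 0.003 and 0.006, matching the table rows for those attacks.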
| Attack | Tess. CER | Tess. WER | Easy. CER | Easy. WER |
|---|---|---|---|---|
| ConnectionBreaker | 0.882 ± 0.742 | 1.736 ± 1.555 | 0.658 ± 0.347 | 1.073 ± 0.512 |
| StrokeBreaking | 1.226 ± 1.300 | 2.641 ± 2.794 | 0.882 ± 0.310 | 1.435 ± 0.532 |
| Wavy | 0.440 ± 0.513 | 0.666 ± 0.830 | 0.520 ± 0.329 | 0.753 ± 0.354 |
| Zigzag | 0.612 ± 0.333 | 0.895 ± 0.361 | 0.813 ± 0.190 | 1.124 ± 0.320 |
| Elastic | 0.157 ± 0.316 | 0.258 ± 0.406 | 0.346 ± 0.289 | 0.549 ± 0.319 |
| GaussianNoise | 1.883 ± 2.681 | 3.659 ± 5.240 | 0.406 ± 0.288 | 0.628 ± 0.310 |
| SaltPepper | 2.117 ± 1.715 | 4.306 ± 3.997 | 0.921 ± 0.111 | 1.029 ± 0.118 |
| Interleaved | 0.957 ± 0.077 | 1.002 ± 0.047 | 0.950 ± 0.147 | 0.990 ± 0.065 |
| CaptchaStyle | 1.025 ± 0.656 | 1.592 ± 1.527 | 2.170 ± 1.851 | 2.249 ± 1.494 |
| Contrast | 0.666 ± 0.314 | 0.969 ± 0.369 | 0.717 ± 0.299 | 0.882 ± 0.249 |
| Mirror | 0.326 ± 0.344 | 0.616 ± 0.542 | 0.598 ± 0.399 | 0.950 ± 0.485 |
| Puzzle | 3.074 ± 2.632 | 5.457 ± 5.103 | 0.782 ± 0.217 | 0.984 ± 0.143 |
| DotRemoval | 0.266 ± 0.310 | 0.505 ± 0.378 | 0.518 ± 0.273 | 0.798 ± 0.270 |
| PGD | 0.107 ± 0.280 | 0.156 ± 0.360 | 0.312 ± 0.277 | 0.512 ± 0.307 |
| MicroRotation | 0.151 ± 0.313 | 0.236 ± 0.411 | 0.373 ± 0.288 | 0.574 ± 0.319 |
Effectiveness saturates globally around θ ≈ 0.75; structural attacks plateau already at θ = 0.5.
| Intensity θ | n | SSIM | Destruction | Effectiveness | Tess. CER | Easy. CER |
|---|---|---|---|---|---|---|
| 0.25 | 53,460 | 0.868 ± 0.207 | 0.566 ± 0.382 | 0.426 ± 0.272 | 0.399 ± 1.008 | 0.466 ± 0.386 |
| 0.50 | 53,460 | 0.824 ± 0.223 | 0.642 ± 0.357 | 0.470 ± 0.242 | 0.496 ± 1.025 | 0.537 ± 0.557 |
| 0.75 | 53,460 | 0.788 ± 0.228 | 0.701 ± 0.328 | 0.502 ± 0.219 | 0.672 ± 1.288 | 0.586 ± 0.677 |
| 1.00 | 53,460 | 0.756 ± 0.231 | 0.758 ± 0.315 | 0.530 ± 0.216 | 0.703 ± 1.254 | 0.607 ± 0.684 |
- Figure: Global mean effectiveness vs θ (95% CI band).
- Figure: Per-attack effectiveness across the four intensity levels.
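The flattening of the global curve can be read directly off the intensity table. The short sketch below computes the gain in mean effectiveness between consecutive intensity levels, using only the four values from that table.

```python
# Global mean effectiveness per intensity level, copied from the table.
effectiveness = {0.25: 0.426, 0.50: 0.470, 0.75: 0.502, 1.00: 0.530}

thetas = sorted(effectiveness)
gains = {
    (lo, hi): round(effectiveness[hi] - effectiveness[lo], 3)
    for lo, hi in zip(thetas, thetas[1:])
}

for (lo, hi), gain in gains.items():
    print(f"theta {lo:.2f} -> {hi:.2f}: +{gain:.3f}")
```

Each step adds less than the previous one (+0.044, +0.032, +0.028), so most of the global gain is already realised by θ = 0.75.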
Attack rankings are remarkably stable across fonts, sizes and texts (cross-font effectiveness range ≤ 0.0037; cross-text range ≤ 0.008), and the top-15 vs bottom-12 attacks separate at p < 0.001 after Bonferroni correction.
Single-figure dashboard summarising the entire 213,840-sample benchmark: top-10 ranking, stealth frontier, per-attack effectiveness with 95% CI, readability–destruction scatter, effectiveness histogram, destruction-vs-intensity, joint OCR CER scatter, and the humanized-readability histogram.
Every figure below is regenerated from the released per-sample CSV and is included in this repository under `assets/benchmark_results/aura_results/charts/`. They follow the same family taxonomy (A–I) used in §IX of the paper. The charts already embedded in context above are repeated here so this atlas is self-contained.
Headline per-attack metrics. Mean of each headline metric per attack (with ± s.d. error bars), the ranking and radar comparisons, and the normalised (attack × metric) heatmap.
- Figure: Effectiveness ± s.d.
- Figure: Destruction
- Figure: Humanized readability
- Figure: SSIM
- Figure: PSNR
- Figure: Ranking with 95% CI
- Figure: Top-attack radar (all metrics)
- Figure: (Attack × metric) heatmap
Perceptual trade-off & cross-metric correlations. The destruction–readability "stealth frontier", SSIM vs humanized readability, and the correlation matrices that justify the composite metrics (SSIM↔Rh r = 0.941).
Intensity sweep (θ ∈ {0.25, 0.50, 0.75, 1.00}). Global line plots and per-attack heatmaps for every headline metric, plus the per-intensity trade-off and variance/CI summaries.
Family A — Distribution & stability. Best-vs-worst attacks per metric, the mean-vs-std Pareto frontier, the signal-to-noise ranking, and per-attack stability surfaces.
Family B — Per-font generalisation (11 free Persian fonts). Mean/std heatmaps, best attack per font, and per-font box plots of the top attacks.
- Figure: (Attack × font) mean effectiveness
- Figure: (Attack × font) std
- Figure: Best attack per font
- Figure: Per-font box plots, top attacks
Family C — Per-size generalisation (9 font sizes). Metric-vs-size curves and (attack × size) / (size × intensity) heatmaps.
- Figure: Headline metrics vs size
- Figure: (Attack × size) effectiveness
- Figure: (Size × intensity) heatmap
- Figure: Per-size effectiveness violins
Family E — Distributional behaviour. Empirical CDFs and violin plots of each headline metric across the full 213,840-sample grid.
- Figure: ECDF, effectiveness
- Figure: ECDF, destruction
- Figure: ECDF, humanized readability
- Figure: ECDF, SSIM
- Figure: Violin, effectiveness
- Figure: Violin, destruction
- Figure: Violin, humanized readability
- Figure: Violin, SSIM
Family F — Statistical significance. CI-based ranking, the pairwise Cohen's-d effect-size matrix, and the bootstrapped p-value matrix (top-15 vs bottom-12 separate at p < 0.001 after Bonferroni).
- Figure: CI-based effectiveness ranking
- Figure: Pairwise Cohen's d
- Figure: Bootstrapped p-value matrix
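Family F rests on pairwise Cohen's d values and a bootstrap over per-sample scores. The sketch below shows one common way to compute both; the pooled-s.d. form of Cohen's d and the percentile interval are assumptions, while the 1,000 resamples follow the text. The scores are synthetic and the function names are hypothetical.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5


def bootstrap_d(a, b, n_resamples=1000, seed=0):
    """Percentile 95% interval for Cohen's d, resampling each group with replacement."""
    rng = random.Random(seed)
    ds = sorted(
        cohens_d(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    return ds[int(0.025 * n_resamples)], ds[int(0.975 * n_resamples) - 1]


# Synthetic effectiveness scores for illustration only; not benchmark data.
rng = random.Random(1)
attack_a = [rng.gauss(0.85, 0.08) for _ in range(200)]
attack_b = [rng.gauss(0.45, 0.20) for _ in range(200)]

d = cohens_d(attack_a, attack_b)
low, high = bootstrap_d(attack_a, attack_b)
print(f"d = {d:.2f}, 95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

Repeating this for every pair of attacks fills the effect-size matrix; a Bonferroni correction then divides the significance threshold by the number of pairwise comparisons.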
Family G — OCR-level CER behaviour. Per-engine CER CDFs, the joint Tesseract↔EasyOCR scatter/density, and per-attack ΔCER.
- Figure: Tesseract CER CDF
- Figure: Tesseract vs EasyOCR CER
- Figure: Joint CER density
- Figure: ΔCER per attack
- Figure: ΔCER histogram
Family H — Per-text behaviour (20 Persian reference sentences). Per-text difficulty, the (text × attack) heatmap, and the best attack per text.
- Figure: Per-text difficulty curve
- Figure: (Text × attack) effectiveness
- Figure: Best attack per text
Family I — Master dashboard. A single self-contained figure compositing the headline findings (also shown above).
A key negative result: classical gradient-based attacks and environmental simulations are the weakest in the benchmark. Pixel-precise noise (CW, MI-FGSM, PGD) is discarded by Tesseract's non-differentiable adaptive thresholding ("surrogate mismatch"), and low/high-frequency environmental textures are absorbed by binarization and LSTM temporal smoothing.
| Attack modality | Tess. CER | Tess. WER | Easy. CER | Effectiveness |
|---|---|---|---|---|
| Micro Rotation | 0.151 ± 0.313 | 0.236 ± 0.411 | 0.373 ± 0.288 | 0.590 ± 0.138 |
| Structured Noise | 0.106 ± 0.278 | 0.157 ± 0.360 | 0.323 ± 0.276 | 0.488 ± 0.208 |
| TI-FGSM | 0.106 ± 0.278 | 0.155 ± 0.355 | 0.307 ± 0.275 | 0.450 ± 0.206 |
| Shadow Simulation | 0.274 ± 0.391 | 0.388 ± 0.481 | 0.328 ± 0.278 | 0.426 ± 0.220 |
| Moire Pattern | 0.105 ± 0.277 | 0.155 ± 0.360 | 0.320 ± 0.281 | 0.359 ± 0.158 |
| PGD (iterative) | 0.107 ± 0.280 | 0.156 ± 0.360 | 0.312 ± 0.277 | 0.338 ± 0.120 |
| Perlin Noise | 0.109 ± 0.275 | 0.169 ± 0.364 | 0.347 ± 0.283 | 0.330 ± 0.144 |
| Scene Text | 0.105 ± 0.278 | 0.153 ± 0.359 | 0.324 ± 0.281 | 0.195 ± 0.065 |
| MI-FGSM | 0.108 ± 0.282 | 0.158 ± 0.366 | 0.321 ± 0.283 | 0.112 ± 0.055 |
| CW (L₂) | 0.106 ± 0.279 | 0.153 ± 0.361 | 0.323 ± 0.280 | 0.071 ± 0.052 (worst) |
- Figure: CW L₂ (E = 0.071)
- Figure: MI-FGSM (E = 0.112)
- Figure: PGD (E = 0.338)
- Figure: MicroRotation (E = 0.590)
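The "healing" effect described above can be illustrated on a toy image. The sketch below uses a fixed global threshold as a simplified stand-in for Tesseract's adaptive thresholding, and bounded uniform noise as a stand-in for a small-budget pixel perturbation; both choices, and the value ε = 8, are assumptions made for illustration.

```python
import random


def binarize(image, threshold=128):
    """Global threshold: 1 for ink (dark pixels), 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def add_bounded_noise(image, epsilon, seed=0):
    """Add uniform integer noise in [-epsilon, epsilon], clipped to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, px + rng.randint(-epsilon, epsilon))) for px in row]
        for row in image
    ]


# Toy 4x6 glyph: 0 is black ink, 255 is white paper.
clean = [
    [255, 0, 0, 0, 0, 255],
    [255, 0, 255, 255, 0, 255],
    [255, 0, 255, 255, 0, 255],
    [255, 0, 0, 0, 0, 255],
]
noisy = add_bounded_noise(clean, epsilon=8)

changed_pixels = sum(
    a != b for row_a, row_b in zip(clean, noisy) for a, b in zip(row_a, row_b)
)
same_binary = binarize(clean) == binarize(noisy)
print(f"pixels changed: {changed_pixels}, binarized output identical: {same_binary}")
```

As long as the perturbation budget is far smaller than the distance between a pixel and the threshold, the binarized image that the recognizer actually sees is unchanged, which is consistent with the near-baseline CER of the gradient attacks in the table.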
Real-world degradation rarely comes from a single source. A three-step selection framework (pipeline-stage tagging → mutual-information filter → top-k synergy search over S = (1−I)·D̄ᵢ·R̄ⱼ) picks composite pairs whose components hit different OCR stages and therefore stack non-linearly. Puzzle+Interleaved and GaussianNoise+Interleaved drive both engines to CER = WER = 1.0, while Puzzle+Wavy keeps readability high (SSIM 0.938).
Composite comparison grid on one Persian sentence at θ = 0.7.
- Figure: Puzzle + Wavy. Readability preserved (SSIM 0.938, Rh 0.875).
- Figure: Puzzle + Interleaved. Both engines collapse to CER = WER = 1.0.
- Figure: Gaussian + Interleaved. Most destructive interaction (SSIM ≈ 0.13).
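The three-step selection (stage tagging, mutual-information filter, top-k search over S = (1−I)·D̄ᵢ·R̄ⱼ) can be sketched as follows. The mean D and Rh values come from the ranking table; the stage tags, the mutual-information values, the 0.5 filter threshold, and the reading of I as a normalised value in [0, 1] are all hypothetical placeholders, so the resulting ranking is illustrative only.

```python
from itertools import permutations

# Mean destruction (D) and humanized readability (Rh) from the ranking table.
stats = {
    "Wavy":          {"D": 0.900, "Rh": 0.791},
    "GaussianNoise": {"D": 1.000, "Rh": 0.445},
    "Interleaved":   {"D": 1.000, "Rh": 0.271},
}
# Hypothetical pipeline-stage tags (step 1); not taken from the paper.
stage = {
    "Wavy": "line-detection",
    "GaussianNoise": "binarization",
    "Interleaved": "line-detection",
}
# Hypothetical normalised mutual information between attack effects (step 2).
mutual_info = {
    frozenset({"Wavy", "GaussianNoise"}): 0.20,
    frozenset({"Wavy", "Interleaved"}): 0.70,
    frozenset({"GaussianNoise", "Interleaved"}): 0.40,
}


def top_k_pairs(k=2, max_mi=0.5):
    """Rank ordered pairs (i, j) by S = (1 - I) * D_i * Rh_j (step 3)."""
    scored = []
    for i, j in permutations(stats, 2):
        info = mutual_info[frozenset({i, j})]
        if stage[i] == stage[j] or info > max_mi:   # same stage or redundant
            continue
        score = (1 - info) * stats[i]["D"] * stats[j]["Rh"]
        scored.append((round(score, 3), i, j))
    return sorted(scored, reverse=True)[:k]


for score, i, j in top_k_pairs():
    print(f"{i} + {j}: S = {score}")
```

The score is asymmetric in i and j because it takes destruction from the first component and readability from the second, so both orderings of each pair are evaluated.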
This breadth paper is phase one. The follow-up will: (i) distill a single architecture-agnostic attack via Bayesian optimisation over the destruction-score weights with R_h ≥ 0.85 as a hard constraint; (ii) replace the automatic readability proxy with a pre-registered crowd-sourced human study; (iii) extend the grid to VLMs (GPT-4V, Claude, Gemini) and publish a transferability matrix; and (iv) evaluate defences (input randomisation, ensemble OCR, learned denoisers).
Per current best practices, the authors disclose that general-purpose AI coding assistants were used only as a search accelerator to scaffold quick prototypes across ~500–600 candidate perturbations during early exploration (27 survived screening). All retained attacks were rewritten, tested, audited, parameter-tuned and benchmarked by the authors; the AI was not a source of scientific claims, evaluation results, derivations, or written analysis. Every number is publicly auditable in this repository.
```bibtex
@article{gharibi2026illusions,
  title  = {Illusions for Machines: Adversarial Attacks on OCR via Human Visual Patterns},
  author = {Gharibi, Dwin and Haji Alirezaei, Mohammad Amin and Yaeghoobi, Kaebeh},
  year   = {2026},
  url    = {https://github.com/dwin-gharibi/aura-repo}
}
```

- 📄 Research Paper: Illusions for Machines
- Overview
- Key Features
- Installation
- Quick Start
- Attack Framework
- Evaluation & Metrics
- OCR Engines
- Persian Text Support
- API Reference
- Advanced Features
- Project Structure
- Docker & Infrastructure
- Citation & References
- Contributing
- License
- Acknowledgments
AURA is a research-grade framework designed to evaluate the robustness of Optical Character Recognition (OCR) systems against adversarial attacks. The framework specifically supports Persian/Arabic (RTL) text and implements state-of-the-art adversarial techniques from recent academic literature.
AURA addresses a critical gap in OCR security research: no general-purpose OCR attack framework exists that comprehensively evaluates multiple attack mechanisms across different OCR engines while maintaining human readability. This claim is supported by 9+ peer-reviewed references documented in our citation manager.
- Research-Grade: Implements attacks from 11+ academic papers
- Production-Ready: 80+ attack implementations with pre-optimized configurations
- Multi-Engine: Evaluates against Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
- Persian/RTL Support: Proper character joining and RTL text rendering using libraqm
- Publication-Ready: Generates LaTeX tables, visualization charts, and detailed JSON reports
- 80+ Attack Implementations: From simple noise injection to sophisticated research-based attacks
- 8 Mechanism Categories: Attacks organized by underlying perceptual mechanism
- 9 Perceptual Principles: Mapped from vision science to explain human vs. machine differences
- 20 Persian-Specific Attacks: Designed for Persian/Arabic script characteristics
- 7 ArXiv Paper-Enhanced Attacks: State-of-the-art implementations (MI-FGSM, DI-FGSM, TI-FGSM, etc.)
- Multi-Engine Evaluation: Tesseract, EasyOCR, and LLM-based OCR (GPT-4V, Claude, Gemini)
- Comprehensive Metrics: SSIM, PSNR, CER, WER, and custom effectiveness scores
- Statistical Rigor: Variance, confidence intervals, statistical tests (t-test, ANOVA, Mann-Whitney U)
- 20 Persian Fonts: Multi-font, multi-size, multi-color, multi-style dataset generation
- 5 Intensity Levels: Systematic analysis from 20% to 100% intensity
- 120+ Charts: Publication-ready visualizations (attack grids, heatmaps, scatter plots, etc.)
- Systematic Composite Optimization: 5 combination strategies for finding optimal attack combinations
- Enhanced Readability Metrics: Multi-scale SSIM with better human correlation (r=0.87)
```bash
# Python 3.8+ required
python --version  # Python 3.8 or higher

# System dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libraqm0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    libgl1 \
    tesseract-ocr \
    tesseract-ocr-fas \
    tesseract-ocr-ara \
    fonts-dejavu \
    fonts-freefont-ttf

# macOS
brew install tesseract tesseract-lang libraqm
```

```bash
# Clone the repository
git clone https://github.com/dwin-gharibi/aura-repo.git
cd aura-repo

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
```bash
# Persian text support
pip install arabic-reshaper python-bidi pillow

# OCR evaluation
pip install pytesseract easyocr

# LLM-based OCR (optional)
pip install openai anthropic
```

```bash
# Build Docker image
docker build -t aura:latest .

# Run container
docker run -it --rm \
    -v "$(pwd)":/workspace \
    -v "$(pwd)/results":/workspace/results \
    aura:latest

# Or use Docker Compose (see docker-compose.yml)
docker-compose up -d
```

```python
from aura.utils import PersianTextGenerator
from aura.attacks import PersianOptimizedAttack

# Generate Persian text image
generator = PersianTextGenerator()
image, text = generator.generate_image("سلام، این یک متن آزمایشی است.")

# Apply optimized attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get perturbed image
perturbed = result.perturbed_image
print(f"SSIM: {result.ssim:.3f}")
print(f"L2 norm: {result.l2_norm:.3f}")
```

```bash
# Quick benchmark (5 samples, 12 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 5 --full-benchmark

# Full benchmark (1000 samples, all 67 attacks)
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark

# Specific attacks only
python scripts/run_comprehensive_benchmark.py --num-samples 100 --attacks "ElasticDeformation,WavyDistortion,DotRemoval"

# Resumable, continuously-synced paper run (S3 / MinIO / R2 / ...)
python scripts/run_comprehensive_benchmark.py \
    --full-benchmark --num-samples 100 \
    --resume \
    --s3-bucket aura-benchmarks --s3-prefix runs/$(date -u +%F) --s3-bundle
```

Running a long benchmark? See `docs/BENCHMARK_UPDATES.md` for the full reference on the streaming checkpoint, `--resume`, S3 sync, the new perceptual metrics (MS-SSIM, GMSD, CIEDE2000, humanised readability, text preservation), and the paper-critical bug fixes applied to the comprehensive runner.
```python
from aura.evaluation import TesseractEngine, EasyOCREngine, LLMOCREngine

# Tesseract OCR (recommended for Persian)
tesseract = TesseractEngine(lang="fas", config="--psm 6")
result = tesseract.recognize(image)
print(f"Tesseract: {result.text}")

# EasyOCR
easyocr = EasyOCREngine(lang="fas", gpu=False)
result = easyocr.recognize(image)
print(f"EasyOCR: {result.text}")

# GPT-4V OCR (requires API key)
llm_ocr = LLMOCREngine(provider="openai", model="gpt-4o", api_key="YOUR_KEY")
result = llm_ocr.recognize(image)
print(f"GPT-4V: {result.text}")
```

AURA implements 80+ adversarial attacks organized into 8 mechanism categories based on their perceptual impact on human vision systems.
| Attack | Category |
|---|---|
| Aggressive | Composite |
| ConnectionBreaker | Structural |
| DotRemoval | Structural |
| WavyDistortion | Geometric |
| StrokeBreaking | Structural |
| Zigzag | Creative |
| GaussianNoise | Noise |
| Attack | Target | Est. Destruction | Readability | Effectiveness |
|---|---|---|---|---|
| `StrokeBreaking` | Structural integrity | 1.00 | 0.94 | 0.94 |
| `ConnectionBreaker` | Cursive flow | 1.00 | 0.91 | 0.91 |
| `PersianUltimateV2` | All Persian features | 1.00 | 0.91 | 0.91 |
| `LetterFormConfusion` | Initial/medial/final forms | 1.00 | 0.90 | 0.90 |
| `UltimateDestruction` | Multi-layer | 1.00 | 0.87 | 0.87 |
| `ElasticDeformation` | Geometric distortion | 0.95+ | 0.80 | 0.76 |
| `MaxDestruction` | Maximum damage | 0.87 | 0.79 | 0.67 |
| `AggressiveAttack` | Multi-layer attack | 0.87 | 0.86 | 0.75 |
| `WavyDistortion` | Text flow disruption | 0.85 | 0.88 | 0.75 |
| `HighReadabilityAttack` | Stealth mode | Low | 0.95+ | Variable |
Note: Destruction scores are proxy-based estimates using image metrics (binarization changes, edge disruption, etc.). Actual OCR destruction varies by engine and configuration.
Instead of describing 80+ individual attacks, we group them by their underlying perceptual mechanism:
Perceptual Principle: Topological Invariance Perception - Humans are sensitive to topological properties (connectedness, holes) but invariant to continuous deformations.
| Attack | Description | Target |
|---|---|---|
| `ElasticDeformation` | Smooth non-linear warping | Character shape |
| `WavyDistortion` | Sinusoidal baseline shift | Text flow |
| `Zigzag` | Triangle-wave baseline | Line structure |
| `MicroRotation` | Small local rotations | Character orientation |
| `PerspectiveTilt` | Perspective transformation | Global geometry |
| `Swirl` | Swirl distortion | Local structure |
| `Barrel` | Barrel/pincushion distortion | Global warp |
| `SpatialTransform` | Flow-based transformation | Spatial coherence |
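To make the "sinusoidal baseline shift" concrete, here is a minimal pure-Python sketch that displaces each pixel column vertically by a sine of its x position. It is not AURA's `WavyDistortion` implementation; the function name, the nearest-pixel rounding, and the amplitude and wavelength defaults are illustrative assumptions.

```python
import math


def wavy_distortion(image, amplitude=2.0, wavelength=8.0, background=0):
    """Shift each column vertically by a sinusoid of its x position."""
    height, width = len(image), len(image[0])
    out = [[background] * width for _ in range(height)]
    for x in range(width):
        shift = round(amplitude * math.sin(2 * math.pi * x / wavelength))
        for y in range(height):
            src = y - shift                 # source row for this output pixel
            if 0 <= src < height:
                out[y][x] = image[src][x]
    return out


# A flat one-pixel baseline on row 3 of a 7x16 image (1 = ink).
flat = [[1 if y == 3 else 0 for _ in range(16)] for y in range(7)]
wavy = wavy_distortion(flat)

# Row index of the ink pixel in each column after the distortion.
baseline_rows = [next(y for y in range(7) if wavy[y][x]) for x in range(16)]
print(baseline_rows)
```

The stroke stays continuous and keeps all of its ink, so a reader can still follow it, while the baseline no longer lies on a single row, which is the property a line tracker relies on.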
Perceptual Principle: Contrast Constancy - Human visual system compensates for global contrast changes through gain control mechanisms.
| Attack | Description | Target |
|---|---|---|
| `ContrastManipulation` | Local contrast shifts | Edge detection |
| `Shadow` | Gradient shadows | Binarization |
| `DCTAttack` | DCT coefficient modification | Frequency domain |
| `Texture` | Texture overlay | Segmentation |
| `Illumination` | Lighting changes | Thresholding |
Perceptual Principle: Gestalt Closure - Humans mentally complete fragmented shapes based on prior knowledge and context.
| Attack | Description | Target |
|---|---|---|
| `DotRemoval` | Removes/blurs letter dots | Persian letter identity (ب/ت/ث) |
| `DotDisplacement` | Moves dots to alter meaning | Letter confusion |
| `ConnectionBreaker` | Breaks letter connections | Cursive flow |
| `StrokeBreaking` | Breaks character strokes | Structural integrity |
| `BinarizationConfusion` | Disrupts Otsu thresholding | Segmentation |
| `DiacriticAttack` | Injects fake diacritical marks | Letter disambiguation |
| `TeethDestruction` | Targets س، ش، ص patterns | Letter patterns |
| `LetterConfusion` | Adds phantom dots, smears | Letter shape |
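The idea of breaking thin letter connections can be illustrated on a tiny binary image. The sketch below is a toy heuristic, not AURA's `ConnectionBreaker`: it treats runs of columns with very little ink as connectors and erases a short gap in the middle of each run. The function name, thresholds, and the column-count heuristic are all assumptions.

```python
def break_connections(image, max_thickness=1, min_run=3, gap=1):
    """Erase a short gap in the middle of each thin horizontal stroke run."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    ink_per_column = [sum(image[y][x] for y in range(height)) for x in range(width)]

    run_start = None
    for x in range(width + 1):              # the extra step flushes a trailing run
        thin = x < width and 0 < ink_per_column[x] <= max_thickness
        if thin and run_start is None:
            run_start = x
        elif not thin and run_start is not None:
            run_length = x - run_start
            if run_length >= min_run:
                first = run_start + (run_length - gap) // 2
                for gx in range(first, first + gap):
                    for y in range(height):
                        out[y][gx] = 0      # clear the whole column inside the gap
            run_start = None
    return out


# Two 3-pixel-tall "letters" joined by a 1-pixel connector on the bottom row (1 = ink).
joined = [
    [1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
]
broken = break_connections(joined)
for row in broken:
    print("".join("#" if px else "." for px in row))
```

A one-pixel gap splits the single connected component into two, which is the kind of change that disturbs segmentation while the gap remains easy for a reader to bridge.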
Perceptual Principle: Lateral Inhibition - Human retina enhances edges through lateral inhibition, making noise less disruptive than for machines.
| Attack | Description | Target |
|---|---|---|
| `GaussianNoise` | Random pixel noise | Overall quality |
| `SaltPepperNoise` | Impulse noise | Edge detection |
| `PerlinNoise` | Natural-looking noise | Texture |
| `StructuredNoise` | Pattern-based noise | Segmentation |
| `PatternWatermark` | Watermark patterns | OCR confidence |
| `TextWatermark` | Faint text overlay | Character recognition |
| `GridWatermark` | Grid pattern | Line detection |
| `RandomErasing` | Random pixel deletion | Feature extraction |
| `Cutout` | Block removal | Attention mechanisms |
Perceptual Principle: Contrast Sensitivity Function (CSF) - Humans are less sensitive to high-frequency changes, especially in textured regions.
| Attack | Description | Target |
|---|---|---|
| `FrequencyDomainAttack` | DCT/FFT-based perturbations | Frequency components |
| `FourierAttack` | Fourier domain modifications | Global structure |
Perceptual Principle: Weber-Fechner Law - Human perception is logarithmic, while machine gradients exploit linear sensitivity.
| Attack | Description | Paper |
|---|---|---|
| `PGD` | Projected Gradient Descent | Madry et al., 2017 |
| `CW` | Carlini-Wagner Attack | Carlini & Wagner, 2017 |
| `MI-FGSM` | Momentum Iterative FGSM | Dong et al., 2017 |
| `DI-FGSM` | Diverse Input FGSM | Xie et al., 2019 |
| `TI-FGSM` | Translation Invariant FGSM | Dong et al., 2019 |
| `VLMTargeted` | Vision Language Model attack | arXiv:2208.14302 |
Perceptual Principle: Top-Down Processing - Humans use context and language knowledge to resolve ambiguous characters.
| Attack | Description | Target |
|---|---|---|
| `FontPerturbation` | Font variation | Font recognition |
| `KashidaInjection` | Word stretching | Word spacing |
| `LetterFormConfusion` | Initial/medial/final confusion | Letter forms |
| `BaselineWobble` | Vertical wobble | Baseline detection |
| `PerCharRotation` | Per-character rotation | Character alignment |
| `Puzzle` | Tile displacement | Spatial coherence |
| `CaptchaStyle` | CAPTCHA-like distortion | Overall readability |
| `Interleaved` | Pattern between lines | Line separation |
| `TextMirror` | Faint duplicate text | Character confusion |
| `LetterSpacing` | Spacing variation | Word segmentation |
| `LineSpacing` | Line height changes | Paragraph structure |
| `FontWeight` | Weight perturbation | Stroke width |
| `FontStyle` | Style changes (italic, bold) | Font robustness |
Perceptual Principle: Feature Integration Theory - Humans integrate features across multiple scales and dimensions more robustly than machines.
| Attack | Description | Components |
|---|---|---|
| `AggressiveAttack` | Multi-layer attack | 5+ layers |
| `MaxDestructionAttack` | Maximum damage | Optimized combo |
| `UltimateOCRKiller` | Combined frequency + transformer | Multi-domain |
| `OCRKillerAttack` | 5-layer progressive | Progressive |
| `StealthAttack` | Minimal perturbation | Subtle combo |
AURA includes 20 attacks specifically designed for Persian/Arabic script characteristics:
| Attack | Target | Description |
|---|---|---|
| `DotRemoval` | Letter dots | Removes/blurs dots crucial for letter identity (ب/ت/ث/پ) |
| `DotDisplacement` | Dot position | Moves dots to alter letter meaning |
| `ConnectionBreaker` | Cursive flow | Breaks letter connections in cursive script |
| `LetterConfusion` | Letter shapes | Adds phantom dots, smears letters |
| `DiacriticAttack` | Diacritics | Injects fake diacritical marks |
| `TeethDestruction` | Teeth patterns | Targets س، ش، ص patterns |
| `KashidaInjection` | Word spacing | Stretches words with kashida (ـ) |
| `BaselineWobble` | Text baseline | Creates vertical wobble |
| `LetterFormConfusion` | Forms | Confuses initial/medial/final forms |
| `PersianUltimateV2` | All features | Maximum Persian destruction |
| `PersianOptimizedAttack` | Persian/RTL text | Optimized for Persian |
| `PersianScriptOptimizedAttack` | RTL & connections | Connection-aware |
| `HamzaAttack` | Hamza position | Moves/removes hamza (ء) |
| `YehBarreh` | Yeh/Barreh confusion | Persian ی vs Arabic ي |
| `KafGaf` | Kaf/Gaf distinction | Persian گ vs Arabic ك |
| `HehGoal` | Heh/Goal forms | Persian ه final form |
| `AlefMadda` | Alef with madda | آ variant |
| `LamAlefLigature` | Lam-Alef ligature | لا ligature breaking |
| `NumeralConfusion` | Persian/Arabic numerals | ۰۱۲۳ vs ٠١٢٣ |
| `ZeroWidthJoiner` | ZWJ manipulation | Character joining |
State-of-the-art implementations from recent academic papers:
| Attack | Paper | Year | Description |
|---|---|---|---|
| MomentumIterativeFGSM | Dong et al., "Boosting Adversarial Attacks with Momentum" | 2017 | MI-FGSM with momentum-based gradient accumulation |
| DiverseInputFGSM | Xie et al., "Improving Transferability of Adversarial Examples with Input Diversity" | 2019 | DI-FGSM with input diversity for transferability |
| TranslationInvariantFGSM | Dong et al., "Evading Defenses to Transferable Adversarial Examples" | 2019 | TI-FGSM with kernel convolution |
| VLMTargetedAttack | arXiv:2208.14302 "On Evaluating Adversarial Robustness of VLMs" | 2022 | Targets Vision Language Models (GPT-4V, Claude, Gemini) |
| LLMOCRTargetedAttack | arXiv:2306.07033 "Robust Visual Reasoning Attacks" | 2023 | Targets LLM-based OCR specifically |
| PersianScriptOptimizedAttack | Custom implementation | - | RTL and connection-aware for Persian |
| MultiModalEnsembleAttack | Custom implementation | - | Ensemble for CNN, Transformer, and VLM |
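The momentum update behind MI-FGSM can be sketched in a few lines. The block below follows the published rule (accumulate the L1-normalised gradient with decay `mu`, then step along its sign) on a toy linear loss in pure Python. How AURA's `MomentumIterativeFGSM` class is implemented is not shown in this section, and `mi_fgsm`, `grad_fn`, `toy_loss`, and `toy_grad` are hypothetical names.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def mi_fgsm(x, grad_fn, eps, steps=10, mu=1.0, lo=0.0, hi=1.0):
    """Momentum Iterative FGSM on a flat list of pixel values in [lo, hi]."""
    alpha = eps / steps            # per-step size, so the total stays within eps
    momentum = [0.0] * len(x)      # accumulated gradient g_t
    adv = list(x)
    for _ in range(steps):
        grad = grad_fn(adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # guard against a zero gradient
        # g_{t+1} = mu * g_t + grad / ||grad||_1
        momentum = [mu * g + v / l1 for g, v in zip(momentum, grad)]
        # x_{t+1} = x_t + alpha * sign(g_{t+1})
        adv = [a + alpha * sign(g) for a, g in zip(adv, momentum)]
        # Project back into the eps-ball around x and the valid pixel range.
        adv = [min(max(a, max(xi - eps, lo)), min(xi + eps, hi))
               for a, xi in zip(adv, x)]
    return adv


# Toy example: a linear "loss" whose gradient is the constant weight vector.
WEIGHTS = [1.0, -2.0, 0.5]


def toy_loss(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


def toy_grad(x):
    return list(WEIGHTS)


if __name__ == "__main__":
    clean = [0.5, 0.5, 0.5]
    adversarial = mi_fgsm(clean, toy_grad, eps=0.1)
    print(adversarial, toy_loss(clean), toy_loss(adversarial))
```

A real attack would replace `toy_grad` with the gradient of a recognition loss with respect to the input image, which requires white-box access to the model.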
AURA maps each attack category to specific perceptual principles from vision science, explaining why humans can still read perturbed text while OCR systems fail:
| # | Perceptual Principle | Description | Mapped Attack Category |
|---|---|---|---|
| 1 | Gestalt Closure | Humans mentally complete fragmented shapes | Structural Fragmentation |
| 2 | Contrast Constancy | Human visual system compensates for global contrast | Photometric Distortion |
| 3 | Weber-Fechner Law | Human perception is logarithmic | Adversarial Gradient |
| 4 | Lateral Inhibition | Retina enhances edges, reducing noise impact | Noise Injection |
| 5 | Top-Down Processing | Context and language knowledge resolve ambiguity | Typographic Manipulation |
| 6 | Contrast Sensitivity Function | Less sensitive to high-frequency changes | Frequency Domain |
| 7 | Topological Invariance | Sensitive to connectedness, invariant to deformation | Geometric Warping |
| 8 | Feature Integration Theory | Robust multi-scale feature integration | Composite/Multi-Layer |
| 9 | Letter Similarity Confusion | Language models resolve similar letters | Persian-Specific |
Scientific Basis: All perceptual principles are documented in the vision science literature. See our citation manager (aura/research/citations.py) for 9+ peer-reviewed references supporting each principle.
AURA provides comprehensive evaluation metrics organized into three categories: image quality, OCR degradation, and combined effectiveness.
| Metric | Full Name | Description | Range | Ideal |
|---|---|---|---|---|
| SSIM | Structural Similarity Index | Measures structural similarity (luminance, contrast, structure) | -1 to 1 | > 0.85 |
| PSNR | Peak Signal-to-Noise Ratio | Ratio between max signal and noise power | 0-∞ dB | > 30 dB |
| LPIPS | Learned Perceptual Image Patch Similarity | Deep learning-based perceptual similarity | 0-1+ | < 0.3 |
| L2 Norm | Euclidean perturbation magnitude | Average perturbation size | 0-∞ | Lower is better |
| L∞ Norm | Maximum pixel change | Max perturbation at any pixel | 0-255 | < 20 |
SSIM(x, y) = [l(x,y)]^α · [c(x,y)]^β · [s(x,y)]^γ
Where:
l(x,y) = (2μ_x μ_y + C₁) / (μ_x² + μ_y² + C₁) [Luminance]
c(x,y) = (2σ_x σ_y + C₂) / (σ_x² + σ_y² + C₂) [Contrast]
s(x,y) = (σ_xy + C₃) / (σ_x σ_y + C₃) [Structure]
PSNR = 10 · log₁₀(MAX² / MSE)
Where:
MAX = Maximum pixel value (255 for 8-bit)
MSE = Mean Squared Error
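The two formulas above can be checked on small inputs with the following pure-Python sketch. It computes SSIM from whole-image statistics (a single window, no sliding window or Gaussian weighting), so values will differ from library implementations that average local windows. The constants C₁ = (0.01·MAX)², C₂ = (0.03·MAX)², and C₃ = C₂/2 are the conventional choices and are an assumption here, since the section does not state them. The function names are hypothetical.

```python
import math


def mse(x, y):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE); infinite for identical inputs."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def global_ssim(x, y, max_value=255.0, alpha=1.0, beta=1.0, gamma=1.0):
    """SSIM = l^alpha * c^beta * s^gamma from whole-image statistics."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    c1 = (0.01 * max_value) ** 2   # assumed conventional constants
    c2 = (0.03 * max_value) ** 2
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    # With the default exponents of 1 a negative structure term is fine;
    # non-integer exponents would need extra care.
    return (luminance ** alpha) * (contrast ** beta) * (structure ** gamma)


if __name__ == "__main__":
    original = [0, 50, 100, 150, 200, 250]
    perturbed = [10, 60, 110, 160, 210, 240]   # every pixel changed by 10
    print(f"PSNR: {psnr(original, perturbed):.2f} dB")
    print(f"SSIM: {global_ssim(original, perturbed):.4f}")
```

In this example every pixel differs by 10, so MSE = 100 and the PSNR is about 28.13 dB, just under the 30 dB target listed in the table.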
| Metric | Full Name | Formula | Range | Interpretation |
|---|---|---|---|---|
| CER | Character Error Rate | (S + D + I) / N | 0-∞ | 0 = perfect, >1 = more errors than chars |
| WER | Word Error Rate | (S + D + I) / N (words) | 0-∞ | 0 = perfect; word-level, so closer to semantic impact than CER |
| SER | Sentence Error Rate | Incorrect sentences / Total | 0-1 | 0 = all correct |
Where:
- S = Substitutions
- D = Deletions
- I = Insertions
- N = Total characters/words in reference
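The error rates above reduce to a Levenshtein edit distance divided by the reference length. The sketch below is a minimal pure-Python version for illustration. It is not AURA's implementation, the function names are hypothetical, and an empty reference is not handled.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions + deletions + insertions.

    Works on any two sequences, e.g. strings (characters) or lists (words).
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: the same ratio on whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def ser(references, hypotheses):
    """Sentence Error Rate: fraction of sentences that are not exact matches."""
    wrong = sum(r != h for r, h in zip(references, hypotheses))
    return wrong / len(references)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))               # 3 edits over 6 characters
    print(cer("ab", "xyzxyz"))                    # more errors than reference characters
    print(wer("the cat sat", "the cat sat down")) # 1 insertion over 3 words
```

The second call shows why CER is unbounded above: when the OCR output is much longer than the reference, insertions alone can exceed N.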
| Metric | Formula | Interpretation | Target |
|---|---|---|---|
| Destruction Score | Weighted composite of image metrics | OCR degradation | > 0.80 |
| Readability Score | SSIM-based readability | Human readability preservation | > 0.75 |
| Effectiveness | Destruction × Readability | Overall attack quality | > 0.60 |
destruction = (
    0.20 * ssim_destruction +    # Structural changes
    0.25 * text_destruction +    # Binarization changes
    0.20 * edge_destruction +    # Edge disruption
    0.15 * cc_destruction +      # Connected components
    0.10 * hist_destruction +    # Histogram changes
    0.10 * stroke_destruction    # Stroke width changes
)

readability = (
    0.25 * multi_scale_ssim +          # Multi-resolution SSIM
    0.20 * text_region_ssim +          # Text-area focused SSIM
    0.20 * structural_preservation +   # Edges, corners, loops
    0.15 * character_legibility +      # Stroke consistency
    0.10 * context_readability +       # Word-level comprehension
    0.05 * contrast_ratio +            # Contrast preservation
    0.05 * edge_overlap                # Edge preservation
)

Validation: Enhanced readability metrics correlate with human studies at r=0.87 (vs. r=0.72 for basic SSIM alone).
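To make the weighting concrete, the sketch below evaluates both composites and the effectiveness product for one set of component scores. The weights are the ones listed above. The component values are invented for illustration, and `weighted_score` and `effectiveness` are hypothetical helper names rather than AURA functions.

```python
DESTRUCTION_WEIGHTS = {
    "ssim_destruction": 0.20,
    "text_destruction": 0.25,
    "edge_destruction": 0.20,
    "cc_destruction": 0.15,
    "hist_destruction": 0.10,
    "stroke_destruction": 0.10,
}

READABILITY_WEIGHTS = {
    "multi_scale_ssim": 0.25,
    "text_region_ssim": 0.20,
    "structural_preservation": 0.20,
    "character_legibility": 0.15,
    "context_readability": 0.10,
    "contrast_ratio": 0.05,
    "edge_overlap": 0.05,
}


def weighted_score(components, weights):
    """Weighted sum of component scores; every weighted component must be present."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


def effectiveness(destruction_components, readability_components):
    """Return (destruction, readability, destruction * readability)."""
    d = weighted_score(destruction_components, DESTRUCTION_WEIGHTS)
    r = weighted_score(readability_components, READABILITY_WEIGHTS)
    return d, r, d * r


if __name__ == "__main__":
    # Illustrative values only: every component set to the same score.
    destruction_in = {name: 0.9 for name in DESTRUCTION_WEIGHTS}
    readability_in = {name: 0.8 for name in READABILITY_WEIGHTS}
    d, r, e = effectiveness(destruction_in, readability_in)
    print(f"destruction={d:.2f} readability={r:.2f} effectiveness={e:.2f}")
```

Because both weight sets sum to 1, each composite stays in the same 0 to 1 range as its inputs, and the product penalises an attack that scores well on only one of the two axes.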
AURA supports multiple OCR engines for comprehensive evaluation:
Description (Tesseract): Open-source OCR engine, originally developed at HP and later sponsored by Google. Uses LSTM-based recognition. Best for document OCR.
from aura.evaluation import TesseractEngine

engine = TesseractEngine(
    lang="fas",        # Persian/Farsi
    config="--psm 6"   # Page segmentation mode
)
result = engine.recognize(image)
print(f"Text: {result.text}")
print(f"Confidence: {result.confidence:.1f}%")

Page Segmentation Modes (PSM):
- 0: Orientation and script detection (OSD) only
- 3: Fully automatic page segmentation, no OSD (Tesseract's own default)
- 6: Assume a single uniform block of text (the mode set in the example above)
- 7: Treat the image as a single text line
- 13: Raw line: treat the image as a single text line, bypassing Tesseract-specific hacks
Strengths: Fast, good for documents, supports many languages
Weaknesses: Sensitive to binarization, struggles with non-standard fonts
Description (EasyOCR): Deep learning-based OCR using CRAFT text detection and CRNN recognition.
from aura.evaluation import EasyOCREngine

engine = EasyOCREngine(
    lang="fas",  # Mapped to "fa" internally
    gpu=False
)
result = engine.recognize(image)
print(f"Text: {result.text}")

Strengths: Better scene text recognition, handles curved text
Weaknesses: Slower, higher memory usage, may hallucinate on noise
from aura.evaluation import LLMOCREngine

engine = LLMOCREngine(
    provider="openai",
    model="gpt-4o",       # or "gpt-4-vision-preview"
    api_key="YOUR_KEY",   # or set OPENAI_API_KEY env var
    lang="fas"
)
result = engine.recognize(image)

Strengths: Context understanding, handles degraded images, excellent Persian support
Weaknesses: API costs, rate limits, latency
# Anthropic Claude
engine = LLMOCREngine(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    api_key="YOUR_KEY"  # or set ANTHROPIC_API_KEY
)

# Google Gemini
engine = LLMOCREngine(
    provider="gemini",
    model="gemini-pro-vision",
    api_key="YOUR_KEY"  # or set GOOGLE_API_KEY
)

# Local model served over HTTP (11434 is Ollama's default port)
engine = LLMOCREngine(
    provider="local",
    model="llava",
    api_base="http://localhost:11434"
)

Strengths: No API costs, privacy, no rate limits
Weaknesses: Lower accuracy, requires GPU
AURA supports benchmarking multiple vision models through OpenRouter:
| Model Family | Models | Description |
|---|---|---|
| Claude 3 | Opus, Sonnet, Haiku, 3.5 Sonnet/Opus | Anthropic's vision models |
| GPT-4 | GPT-4o, GPT-4 Turbo, Vision | OpenAI's vision models |
| Qwen | VL Max, VL Plus, Qwen2-VL-72B | Alibaba's vision models |
| Gemini | Pro Vision, 1.5 Pro | Google's vision models |
| Others | LLaVA, Pixtral | Open-source models |
from aura.evaluation.vllm_ocr import benchmark_vision_models

results = benchmark_vision_models(
    image=perturbed_image,
    ground_truth="متن اصلی",
    models=["claude-3-opus", "gpt-4o", "qwen-vl-max"],
    api_key="your_openrouter_key"
)

# Results: CER, WER, processing time for each model
for model, metrics in results.items():
    print(f"{model}: CER={metrics['cer']:.3f}, WER={metrics['wer']:.3f}, Time={metrics['processing_time']:.2f}s")

Combine traditional OCR with LLM verification:
from aura.evaluation import HybridOCREngine

hybrid = HybridOCREngine(
    primary_engine="tesseract",
    llm_provider="openai",
    confidence_threshold=60.0,  # Use LLM if confidence below this
    lang="fas"
)
result = hybrid.recognize(image)
# Uses Tesseract first, falls back to GPT-4V if confidence < 60%

Try multiple providers and compare:
from aura.evaluation import MultiProviderLLMOCR

multi_ocr = MultiProviderLLMOCR(
    providers=["openai", "anthropic"],
    lang="fas"
)
# Get results from all providers
results = multi_ocr.recognize(image)
# Or use fallback (tries providers in order)
result = multi_ocr.recognize_with_fallback(image)

AURA provides comprehensive support for Persian/Arabic (RTL) text:
AURA supports 20 open-source Persian fonts (OFL License) with extensive variation:
| Font | Variants | Description |
|---|---|---|
| Vazirmatn | Thin, ExtraLight, Light, Regular, Medium, SemiBold, Bold, ExtraBold, Black | Most popular Persian font (9 weights) |
| Samim | Regular, Bold | Modern, clean design |
| Parastoo | Regular, Bold | Traditional style |
| Shabnam | Regular, Bold | Rounded, friendly |
| Sahel | Regular, Bold | Sharp, professional |
| Tanha | Regular | Compact, modern |
| Gandom | Regular | Elegant curves |
| Rouba | Regular | Classic style |
| Dimension | Options | Examples |
|---|---|---|
| Font Sizes | 9 sizes | 14, 18, 24, 32, 42, 56, 72, 96, 120 pt |
| Text Colors | 10 colors | Black, Dark Gray, Medium Gray, Dark Blue, Navy Blue, Dark Red, Dark Green, Dark Brown, Dark Purple, Teal |
| Backgrounds | 7 backgrounds | White, Off-White, Light Gray, Cream, Sepia, Light Yellow, Light Blue |
| Style Effects | 7 effects | Normal, Bold, Italic, Underlined, Strikethrough, Bold+Italic, Shadow |
Total Possible Combinations: 20 × 9 × 10 × 7 × 7 × 22 = 1,940,400 unique images!

from aura.dataset import EnhancedMultiFontDataset

gen = EnhancedMultiFontDataset(font_dir="fonts")
dataset = gen.generate_comprehensive_dataset(
    output_dir="persian_dataset",
    fonts=[
        "vazirmatn_regular", "vazirmatn_bold", "vazirmatn_light",
        "samim_regular", "parastoo_regular", "shabnam_regular",
        "sahel_regular", "tanha_regular", "gandom_regular"
    ],
    sizes=[24, 32, 42, 56, 72],
    colors=["black", "dark_blue", "dark_red", "dark_green"],
    backgrounds=["white", "cream", "sepia"],
    style_effects=["normal", "bold_effect", "underlined"],
    max_samples=10000,        # Limit for testing
    stratified_sampling=True  # Balanced across conditions
)
print(gen.generate_comprehensive_report(dataset))

Dataset: Enhanced Persian Multi-Font
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total Samples: 10,000
Font Families: 8 (Vazirmatn, Samim, Parastoo, Shabnam, Sahel, Tanha, Gandom, Rouba)
Font Variants: 20 (9 Vazirmatn weights + 11 others)
Font Sizes: 9 (14, 18, 24, 32, 42, 56, 72, 96, 120 pt)
Text Colors: 10 (black, grays, blue, red, green, brown, purple, teal)
Backgrounds: 7 (white, off-white, gray, cream, sepia, yellow, blue)
Style Effects: 7 (normal, bold, italic, underline, strikethrough, bold+italic, shadow)
Text Categories: 6 (simple, sentences, paragraphs, mixed numerals, complex, poetry)
Theoretical Maximum: 1,940,400 combinations
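How `stratified_sampling=True` balances the sample inside AURA is not described in this section. One plausible reading, sketched below with the standard library only, is to enumerate the rendering conditions and draw the same number of combinations for each value of one chosen dimension. The function `stratified_sample` and its parameters are hypothetical.

```python
import itertools
import random


def stratified_sample(dimensions, max_samples, stratify_by, seed=0):
    """Draw up to max_samples combinations, spread evenly over one dimension."""
    rng = random.Random(seed)
    names = list(dimensions)
    # Group every combination by its value on the stratification dimension.
    strata = {value: [] for value in dimensions[stratify_by]}
    for combo in itertools.product(*(dimensions[name] for name in names)):
        record = dict(zip(names, combo))
        strata[record[stratify_by]].append(record)
    # Equal quota per stratum, capped by what each stratum actually holds.
    per_stratum = max_samples // len(strata)
    sample = []
    for records in strata.values():
        sample.extend(rng.sample(records, min(per_stratum, len(records))))
    return sample


if __name__ == "__main__":
    dimensions = {
        "font": ["vazirmatn_regular", "samim_regular", "parastoo_regular"],
        "size": [24, 32, 42],
        "color": ["black", "dark_blue"],
        "background": ["white", "cream"],
    }
    sample = stratified_sample(dimensions, max_samples=12, stratify_by="font")
    for record in sample:
        print(record)
```

Enumerating the full product is fine at this scale but would be wasteful for millions of combinations, where sampling each dimension independently per stratum is the usual alternative.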
Generate Persian text images with proper RTL rendering:
from aura.utils import PersianTextGenerator, TextStyle

generator = PersianTextGenerator()

# Simple text
image, text = generator.generate_image("متن فارسی")

# Custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",  # RTL default
)
image, text = generator.generate_image("متن با استایل سفارشی", style=style)

# Exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Paragraph
image, text = generator.generate_paragraph_image()

from aura.attacks import (
    # Fine-tuned (recommended)
    PersianOptimizedAttack,
    MaxDestructionAttack,
    get_attack_for_target,
    # ArXiv paper-enhanced
    MomentumIterativeFGSM,
    DiverseInputFGSM,
    TranslationInvariantFGSM,
    VLMTargetedAttack,
    LLMOCRTargetedAttack,
    PersianScriptOptimizedAttack,
    MultiModalEnsembleAttack,
    # Composite
    create_custom_composite,
    # Base
    AttackConfig,
)

# Use pre-configured attack
attack = PersianOptimizedAttack(intensity=1.0)
result = attack.apply(image)

# Get attack for specific OCR target
attack = get_attack_for_target(
    target_ocr="tesseract",  # Options: tesseract, easyocr, transformer, vlm, llm, persian, multimodal
    destruction_priority=0.9,
    readability_priority=0.7
)

# Target VLM/LLM-based OCR (GPT-4V, Claude, etc.)
attack = get_attack_for_target(target_ocr="vlm")

# Multi-modal ensemble for all OCR types
attack = get_attack_for_target(target_ocr="multimodal")

# Custom composite attack
attack = create_custom_composite(
    attacks=[
        ("perlin_noise", {"amplitude": 20.0}),
        ("elastic", {"alpha": 30.0}),
        ("micro_rotation", {"max_angle": 1.5}),
    ],
    total_intensity=0.5,
    distribution="decreasing"
)

from aura.attacks import PersianOptimizedAttack
attack = PersianOptimizedAttack()

# Process multiple images
results = attack.apply_batch(images)
for result in results:
    print(f"L2 norm: {result.l2_norm}")
    print(f"L∞ norm: {result.linf_norm}")
    print(f"SSIM: {result.ssim}")

from aura.utils import PersianTextGenerator, TextStyle
generator = PersianTextGenerator()

# Generate with custom style
style = TextStyle(
    font_size=42,
    font_weight="bold",
    line_spacing=1.8,
    alignment="right",
)
image, text = generator.generate_image("متن سفارشی", style=style)

# Generate exam-style questions
image, text = generator.generate_exam_style(num_questions=5)

# Generate paragraph
image, text = generator.generate_paragraph_image()

# Generate with specific font
image, text = generator.generate_image(
    "متن با فونت وزیر",
    font="vazirmatn_bold",
    size=56,
    color="dark_blue",
    background="cream",
    style_effect="shadow"
)

from aura.evaluation import OCREvaluator, MetricsCalculator, LLMOCREvaluator
from aura.attacks import PersianScriptV2Attack

# Create attack
attack = PersianScriptV2Attack(intensity=1.0)

# Apply attack
result = attack.apply(original_image)
perturbed = result.perturbed_image

# Traditional OCR evaluation
ocr_eval = OCREvaluator(engines=["tesseract", "easyocr"])
eval_results = ocr_eval.evaluate_attack(
    original_image,
    perturbed,
    ground_truth="متن فارسی نمونه"
)

# Print results
for engine, metrics in eval_results.items():
    print(f"{engine}:")
    print(f"  Original: {metrics['original_text']}")
    print(f"  Perturbed: {metrics['perturbed_text']}")
    print(f"  CER: {metrics['cer']:.3f}")
    print(f"  WER: {metrics['wer']:.3f}")

# Metrics computation
metrics_calc = MetricsCalculator(use_lpips=True)
metrics = metrics_calc.compute_all(
    original_image,
    perturbed,
    original_text=eval_results["tesseract"]["original_text"],
    perturbed_text=eval_results["tesseract"]["perturbed_text"],
    ground_truth="متن فارسی نمونه"
)
print(f"SSIM: {metrics.ssim:.3f}")
print(f"PSNR: {metrics.psnr:.1f} dB")
print(f"Effectiveness: {metrics.effectiveness:.3f}")

Find optimal attack configurations automatically:
# Quick optimization
python scripts/run_attack_optimizer.py --quick
# Full optimization with EasyOCR
python scripts/run_attack_optimizer.py --full --with-easyocr

from aura.analysis import AttackParameterOptimizer
optimizer = AttackParameterOptimizer(
    attack_name="PersianOptimized",
    metric="effectiveness",
    intensity_range=(0.0, 1.0),
    num_steps=20
)
best_config = optimizer.optimize(image, ground_truth)
print(f"Best intensity: {best_config.intensity:.3f}")
print(f"Best effectiveness: {best_config.effectiveness:.3f}")

Find optimal attack combinations using 5 strategies:
from aura.attacks.composite_optimizer import (
    CompositeOptimizer,
    AttackComponent,
    CombinationStrategy
)

# Define attack pool
attack_pool = [
    AttackComponent("ElasticDeformation", "geometric_warping", 0.3),
    AttackComponent("ContrastManipulation", "photometric_distortion", 0.3),
    AttackComponent("DotRemoval", "structural_fragmentation", 0.3),
    AttackComponent("PerlinNoise", "noise_injection", 0.3),
]

# Optimize
optimizer = CompositeOptimizer(
    min_readability=0.75,  # Enforce minimum readability
    max_components=4,
    strategy=CombinationStrategy.PARETO_OPTIMAL
)
optimizer.set_attack_pool(attack_pool)
# evaluation_fn is a user-supplied scoring callable (not defined in this snippet)
result = optimizer.optimize(evaluation_fn)
print(result.to_markdown())
# Output: Best composite, top 10 composites, Pareto frontier, interaction analysis

5 Combination Strategies:
- Exhaustive Search: All combinations (slow, optimal)
- Greedy Addition: Add best attack iteratively (fast, suboptimal)
- Mutual Information: Select complementary attacks (balanced)
- Pareto Optimization: Balance destruction vs readability (recommended)
- Random Search: Sample combination space (baseline)
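The Pareto strategy in the list above can be illustrated with a short dominance filter. The sketch assumes each candidate composite has already been scored for destruction and readability, drops candidates below the readability floor, and keeps those that no other candidate beats on both axes. It is not the `CompositeOptimizer` implementation: `Candidate` and `pareto_front` are hypothetical names, and the scores are invented.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    destruction: float
    readability: float


def pareto_front(candidates, min_readability=0.75):
    """Return feasible candidates that are not dominated on both scores."""
    feasible = [c for c in candidates if c.readability >= min_readability]
    front = []
    for c in feasible:
        dominated = any(
            o.destruction >= c.destruction
            and o.readability >= c.readability
            and (o.destruction > c.destruction or o.readability > c.readability)
            for o in feasible
        )
        if not dominated:
            front.append(c)
    # Most destructive first, so the trade-off reads top to bottom.
    return sorted(front, key=lambda c: c.destruction, reverse=True)


if __name__ == "__main__":
    scored = [
        Candidate("DotRemoval", 0.70, 0.90),
        Candidate("DotRemoval+PerlinNoise", 0.85, 0.80),
        Candidate("ElasticDeformation", 0.60, 0.85),   # beaten by DotRemoval on both axes
        Candidate("AllFour", 0.95, 0.60),              # below the readability floor
        Candidate("ContrastManipulation", 0.50, 0.95),
    ]
    for c in pareto_front(scored):
        print(f"{c.name}: destruction={c.destruction:.2f}, readability={c.readability:.2f}")
```

The result is a set of trade-offs rather than a single winner; picking one point from the front still requires a preference such as the effectiveness product.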
Multi-scale readability assessment:
from aura.evaluation import EnhancedReadabilityEvaluator

evaluator = EnhancedReadabilityEvaluator()

# Evaluate single image
result = evaluator.evaluate(original_img, perturbed_img)
print(f"Overall Readability: {result.overall_readability:.3f}")
print(f"Multi-scale SSIM: {result.multi_scale_ssim:.3f}")
print(f"Text-region SSIM: {result.text_region_ssim:.3f}")
print(f"Structural Preservation: {result.structural_preservation:.3f}")
print(f"Character Legibility: {result.character_legibility:.3f}")

# Evaluate batch with comprehensive statistics
statistics = evaluator.evaluate_batch(original_images, perturbed_images)
print(evaluator.generate_statistics_report(statistics))
# Output: Complete statistical table with mean, std, variance, quartiles, CI

# Build Docker image
docker build -t aura:latest .
# Build with no cache (clean rebuild; slower, but ignores stale layers)
docker build --no-cache -t aura:latest .

# Interactive shell
docker run -it --rm \
  -v $(pwd):/workspace \
  -v $(pwd)/results:/workspace/results \
  aura:latest
# Run benchmark directly
docker run --rm \
  -v $(pwd):/workspace \
  -v $(pwd)/results:/workspace/results \
  aura:latest \
  python scripts/run_comprehensive_benchmark.py --num-samples 100 --full-benchmark

# Quick start
python scripts/run_comprehensive_benchmark.py --num-samples 5
# Full benchmark
python scripts/run_comprehensive_benchmark.py --num-samples 1000 --full-benchmark
# Optimize parameters
python scripts/run_attack_optimizer.py --quick
# Test specific image
python scripts/run_image_tester.py --image test.png --compare
# Generate dataset
python scripts/run_test_generator.py --num-tests 500
# Docker
docker run -it --rm -v $(pwd):/workspace aura:latest