fix(docker): bake DeepFace/Facenet weights + self-healing cache volume#104

Merged: ahmetabdullahgultekin merged 1 commit into main from fix/2026-05-12-bake-mini-fasnet-models on May 28, 2026
@ahmetabdullahgultekin

Summary

Closes the 4th recurrence of feedback_readonly_rootfs_cache_dirs
(prior offenders: DeepFace, Numba, UniFace; now MiniFASNet). Today's
hot-fix manually docker cp'd the two MiniFASNet .pth weights into the
running container's volume; that fix depended on operators remembering to
repeat it and would have vanished on the next docker volume rm.

This PR shifts the fix to the image layer + entrypoint shim so the volume
becomes self-healing and operator memory is no longer a dependency.

Why the bug exists in the first place

docker-compose.prod.yml runs the bio container with:

  • read_only: true rootfs
  • Named volume biometric_models mounted at /tmp/.deepface
  • App user uid 100 / gid 101 (adduser --system)

The named volume is created by Docker owned by root:root. When DeepFace
0.0.98 tries to download 2.7_80x80_MiniFASNetV2.pth on first inference,
it cannot write under uid 100 → silently falls back → anti-spoof verdict
collapses to a false-positive. Team A (PR forthcoming) is fixing the
runtime error-path; this PR fixes the build-time + ops layer.
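
The silent write failure described above can be made visible with a startup preflight. The following is a minimal sketch, not Team A's fix and not part of DeepFace: `assert_cache_writable` is a hypothetical helper that tries to create a probe file in the cache directory and reports the calling uid when it cannot.

```shell
#!/bin/sh
# Sketch only: assert_cache_writable is a hypothetical helper, not a DeepFace API.

# Return 0 if the current user can create a file in DIR, 1 otherwise.
assert_cache_writable() {
    dir=$1
    probe="$dir/.write-probe.$$"
    # The subshell contains the redirection failure so the caller keeps running.
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "cache dir $dir is not writable by uid $(id -u)" >&2
        return 1
    fi
    rm -f "$probe"
}
```

Called as `assert_cache_writable /tmp/.deepface` under uid 100, a root-owned volume would produce an explicit error instead of a silent fallback.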

What changed

1. Dockerfile — new model-fetcher builder stage

Downloads the four critical weights with SHA256 verification, then
COPYs them into the runtime stage at /opt/baked-models/.deepface with
--chown=100:101. The baked weights are deterministic because each curl is
followed by sha256sum -c against an ARG-pinned hash, so the build fails if
the upstream content ever changes.

| Model | SHA256 | Upstream |
| --- | --- | --- |
| facenet512_weights.h5 | 3f76b5117a9ca574d536af8199e6720089eb4ad3dc7e93534496d88265de864f | serengil/deepface_models@v1.0 |
| centerface.onnx | 77e394b51108381b4c4f7b4baf1c64ca9f4aba73e5e803b2636419578913b5fe | Star-Clouds/CenterFace@master |
| 2.7_80x80_MiniFASNetV2.pth | a5eb02e1843f19b5386b953cc4c9f011c3f985d0ee2bb9819eea9a142099bec0 | minivision-ai/Silent-Face-Anti-Spoofing@master |
| 4_0_0_80x80_MiniFASNetV1SE.pth | 84ee1d37d96894d5e82de5a57df044ef80a58be2b218b5ed7cdfd875ec2f5990 | minivision-ai/Silent-Face-Anti-Spoofing@master |

All four match the running container's live SHAs (captured via
docker exec biometric-api sha256sum ...) AND cross-verify against
upstream — confirmed by curl | sha256sum from the host before opening
this PR.
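
The download-then-verify step can be sketched as below. This is an illustration under assumptions rather than the PR's actual Dockerfile content: the function names `verify_sha256` and `fetch_and_verify` are hypothetical, and the real stage pins its hashes through ARG values as described above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the PR's RUN lines may differ.

# Return 0 only when FILE's SHA256 digest equals EXPECTED.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum -c reads "<digest>  <path>" lines; two spaces separate the fields.
    printf '%s  %s\n' "$expected" "$file" | sha256sum -c - >/dev/null 2>&1
}

# Download URL to DEST; delete DEST and fail if the digest does not match.
fetch_and_verify() {
    url=$1
    dest=$2
    expected=$3
    curl -fsSL -o "$dest" "$url" || return 1
    if ! verify_sha256 "$dest" "$expected"; then
        rm -f "$dest"
        echo "checksum mismatch for $dest" >&2
        return 1
    fi
}
```

Deleting the file on mismatch keeps a tampered or truncated download from being copied into a later stage by accident.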

2. Dockerfile — pinned uid/gid 100/101 explicitly

The previous adduser --system left numbering implicit. Now:

RUN addgroup --system --gid 101 app \
 && adduser --system --ingroup app --uid 100 app

So host-side chown -R 100:101 /var/lib/docker/volumes/... always
matches the in-container app user across rebuilds.

3. deploy/entrypoint.sh (new) — self-healing cache shim

Runs as root, performs two idempotent best-effort operations, then drops
to uid 100 via gosu:

  1. Chown /tmp/.deepface to 100:101 — so any externally-mounted
    root-owned named volume doesn't shadow the baked weights.
  2. Seed missing weight files from /opt/baked-models/.deepface/weights/
    into the cache dir — so a fresh docker volume rm repopulates the
    four critical files on the next boot without operator intervention.

Both steps are fail-soft (|| true); the entrypoint never blocks container
startup.
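
The two idempotent operations can be sketched as follows. This is an assumption-laden outline, not the contents of deploy/entrypoint.sh: `heal_ownership` and `seed_missing` are hypothetical names, and the privilege drop is left out so the snippet can run unprivileged.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; paths and IDs come from the PR text.

# Best-effort recursive ownership fix; never fails the caller.
heal_ownership() {
    dir=$1
    owner=$2
    chown -R "$owner" "$dir" 2>/dev/null || true
}

# Copy each regular file in SRC that is missing from DST; keep existing files.
seed_missing() {
    src=$1
    dst=$2
    mkdir -p "$dst" 2>/dev/null || true
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # cp -p preserves mode and timestamps (and ownership when run as root).
        [ -e "$dst/$name" ] || cp -p "$f" "$dst/$name" 2>/dev/null || true
    done
    return 0
}
```

A real entrypoint would presumably call `heal_ownership /tmp/.deepface 100:101`, then `seed_missing` with /opt/baked-models/.deepface/weights/ as the source, and finish with `exec gosu 100:101 "$@"` so the application never runs as root.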

4. .env.example

Documents the runtime SHA pin required by PR #102
(DEEPFACE_SHA256_REQUIRED=true):

DEEPFACE_FACENET512_SHA256=3f76b5117a9ca574d536af8199e6720089eb4ad3dc7e93534496d88265de864f

Plus the three other SHAs documented inline for audit reference (DeepFace
0.0.98 has no integrity hook for centerface / MiniFASNet today, so they
are documented not enforced).

5. docker-compose.prod.yml

Inline comment documents the new semantics: the volume is now
self-healing, docker volume rm is safe, and removing the volume mount
entirely is also safe (the image-baked layer would be served directly).

Test plan

  • Build image with --no-cache and confirm all four sha256sum -c checks pass during the model-fetcher stage.
  • Inspect the built image: docker run --rm <image> ls -la /opt/baked-models/.deepface/weights/ returns the four files owned by 100:101.
  • Volume-wipe rehydration drill (per Operator Action item 11 in the parent repo):
    • docker compose -f docker-compose.prod.yml --env-file .env.prod down biometric-api
    • docker volume rm biometric-processor_biometric_models
    • docker compose -f docker-compose.prod.yml --env-file .env.prod up -d biometric-api
    • Confirm docker exec biometric-api stat -c '%u:%g' /tmp/.deepface/.deepface/weights/facenet512_weights.h5 returns 100:101.
    • Confirm a face /verify call against the testbed completes without recommended_action=block due to missing MiniFASNet.
  • Confirm bio container boots with DEEPFACE_FACENET512_SHA256 set to the value documented above (Team A PR #102, "fix(verify): enforce anti-spoof block + EAR + aged-threshold + SHA-pin + verify-challenge (2026-05-12 ML review)", enforces this).
  • Smoke test: /api/v1/health returns 200 within 60s of up -d.
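
The 60-second smoke test in the last item could be scripted with a small retry helper. This is a sketch only: `wait_until` is a hypothetical function, and the host and port in the usage line are assumptions, since the PR names only the /api/v1/health path.

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper.

# Retry CMD about once per second until it succeeds or roughly TIMEOUT
# seconds of retries have elapsed (time spent inside CMD is not counted).
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        elapsed=$((elapsed + 1))
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
    done
    return 0
}
```

For example, `wait_until 60 curl -fsS http://localhost:8000/api/v1/health` run right after up -d exits non-zero if the endpoint never answers successfully; localhost:8000 is a placeholder for wherever the API is published.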

Operator notes

Coordinated with parent PR (FIVUCSAS / fix/2026-05-12-bake-mini-fasnet-models) which adds Operator Action item 11 to OPERATOR_ACTIONS_2026-05-12.md with the post-merge cleanup runbook. No prod rebuild from this PR — the operator owns deployment.

Out of scope (intentionally not in this PR)

  • Team A's runtime fix for DeepFace download-failure → false-spoof-verdict (separate concern, separate PR).
  • DeepFace 0.0.98 version bump (separate concern — the bug isn't the version, it's the volume-ownership × read-only-rootfs interaction).
  • Dockerfile.gpu / Dockerfile.optimized parity (not used in prod today; deferred until those paths are reactivated).

Memory references

  • feedback_readonly_rootfs_cache_dirs (4th sighting)
  • feedback_env_file_docker (PR body commands all use --env-file .env.prod)
  • feedback_git_push (used bare git push -u origin <branch>)

🤖 Generated with Claude Code

Closes the 4th recurrence of feedback_readonly_rootfs_cache_dirs
(DeepFace + Numba + UniFace, now MiniFASNet). With read_only:true rootfs
and the cache named volume owned by root:root, DeepFace running as uid
100 silently failed to download MiniFASNet weights on first inference,
collapsing the anti-spoof verdict to a false-positive. Today's hot-fix
manually docker-cp'd the .pth files into the live volume; that fix was
load-bearing on operator memory and would have vanished on the next
`docker volume rm`.

Defense in depth, two layers:

1. Image bake-in. New `model-fetcher` build stage downloads the four
   critical weight files with SHA256 verification:
   - facenet512_weights.h5          3f76b51...
   - centerface.onnx                77e394b...
   - 2.7_80x80_MiniFASNetV2.pth     a5eb02e...
   - 4_0_0_80x80_MiniFASNetV1SE.pth 84ee1d3...
   All four match upstream (serengil/deepface_models, Star-Clouds/CenterFace,
   minivision-ai/Silent-Face-Anti-Spoofing) and the running container's
   live SHAs. COPY'd into the runtime stage at /opt/baked-models/.deepface
   with --chown=100:101.

2. Entrypoint shim (deploy/entrypoint.sh). Runs as root, chowns any
   externally-mounted /tmp/.deepface cache volume to 100:101, seeds
   missing weight files from the baked /opt/baked-models layer (so a
   wiped named volume self-heals on next boot), then drops to uid 100
   via gosu before exec'ing the CMD. Idempotent + best-effort. Pins
   the app user UID/GID to 100/101 explicitly so host-side chown matches
   across rebuilds (the previous --system numbering was implicit and
   drifted).

Companion changes:
- .env.example documents DEEPFACE_FACENET512_SHA256 (required runtime
  pin per PR #102 `DEEPFACE_SHA256_REQUIRED=true`) plus the three other
  SHAs for audit reference.
- docker-compose.prod.yml comments document that the `biometric_models`
  volume is now self-healing and `docker volume rm` is safe (operator
  no longer has to remember the manual docker-cp dance).

Coordinated with parent PR (OPERATOR_ACTIONS_2026-05-12.md item 11)
which gives the post-merge cleanup runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 12, 2026 21:00

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@ahmetabdullahgultekin ahmetabdullahgultekin merged commit 7e436af into main May 28, 2026
9 of 10 checks passed