Brazilian benchmark series + Agente Pedometrista (v0.9.60 -> v0.9.65)#16

Open
HugoMachadoRodrigues wants to merge 10 commits into main from claude/charming-goldberg-141c95

@HugoMachadoRodrigues HugoMachadoRodrigues commented May 6, 2026

Summary

This PR bundles two release series on top of v0.9.59:

Brazilian benchmark series (v0.9.60 -> v0.9.63)

  • v0.9.60: benchmark_bdsolos_sibcs() + .bdsolos_normalize_ordem() + loader extension to capture BDsolos pre-parsed Classe de Solos Nivel 1/2/3.
  • v0.9.61: Thickness-weighted dominant-colour-in-B subordem rule (replaces SiBCS first-match-wins for Argissolos / Latossolos / Nitossolos).
  • v0.9.62: merge_brazilian_pedons() deduplicates BDsolos × FEBR via site$sisb_id (BDsolos Codigo PA ≡ FEBR observacao$sisb_id).
  • v0.9.63: README documents the trajectory.

Agente Pedometrista series (v0.9.64 -> v0.9.65)

  • v0.9.64: setup_local_vlm() (one-call Ollama + Gemma 4 bootstrap; presets light / balanced / best); pedologist_system_prompt(language) persona in PT-BR / EN; default Ollama model lowered to gemma4:e2b (documented here as ~1.5 GB; a later commit in this PR corrects the on-disk size to ~6.7 GB).
  • v0.9.65: run_agent_app(), a modern bslib Shiny UI with 8 tabs (foto / PDF / ficha / espectro / tabela / classificar / trace / chat) wired to the local Gemma. Vignette v10_agente_pedometrista.Rmd. README + status footer updated.

Test plan

  • R CMD check Status: OK (0 errors / 0 warnings / 0 notes)
  • Test suite: 3 821 passing / 0 failing / 21 expected skips
  • 53 new tests / ~137 new expectations across 5 new test files
  • Smoke benchmark on real RJ data validates the colour override fires on 9 % of profiles
  • Empirical sisb_id overlap diagnostic on RJ confirms 590 / 722 BD pedons share a FEBR cross-reference
  • setup_local_vlm() + run_agent_app() validated by parsing the Shiny app.R + checking all 8 nav_panels are wired

Tags

v0.9.60, v0.9.61, v0.9.62, v0.9.63, v0.9.64, v0.9.65

GitHub Releases: v0.9.63 · v0.9.65

🤖 Generated with Claude Code

Hugo Rodrigues and others added 4 commits May 6, 2026 14:24
…files)

- New R/benchmark-bdsolos.R:
  * benchmark_bdsolos_sibcs(pedons): runs classify_sibcs() and computes
    confusion matrix + per-Ordem recall vs reference_sibcs (the pedologist
    truth from BDsolos Classe de Solos Nivel 1/2/3)
  * .bdsolos_normalize_ordem(): maps modern ("ARGISSOLO" -> "Argissolos")
    and pre-1999 legacy names (PODZOLICO, LATOSOL, GLEI, BRUNIZEM, ALUVIAL,
    AREIA, RENDZINA -> SiBCS Ordens), diacritic-aware

- R/bdsolos.R loader extension:
  * Captures Classe de Solos Nivel 1/2/3 columns into
    site$reference_nivel_1/2/3
  * Bug fix .bdsolos_find_header_line(): switched from which.max() to
    first line with >= 5 fields. Real BDsolos data rows can have MORE
    semicolons than the header because free-text "Descricao Original"
    fields contain embedded ';'

- Tests: 10 new tests / 42 expectations
  * normalize_ordem mapping (modern + legacy + diacritics + edge cases)
  * benchmark schema (predictions/confusion/per_ordem/summary)
  * accuracy bounds, max_n, error handling, n_unmapped reporting
  * Integration: load_bdsolos_csv captures niveis 1/2/3

- Smoke test on 100 RJ pedons: 34% Ordem accuracy
  * Argissolos: 67.6% recall (largest class, healthy baseline)
  * 0% recall on Latossolos, Gleissolos, Espodossolos -> v0.9.61 targets

R CMD check Status: OK (3717 / 0 / 20)
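
The header-detection fix described above can be illustrated with a small language-neutral sketch. It is written in Python rather than the package's R, and the function names, the sample rows, and every column name other than "Codigo PA" and "Descricao Original" are hypothetical. It only shows why "first line with >= 5 fields" is more robust than "line with the most fields" when free-text cells contain embedded semicolons.

```python
def find_header_line(lines, sep=";", min_fields=5):
    """Return the 0-based index of the first line with >= min_fields fields."""
    for i, line in enumerate(lines):
        if len(line.split(sep)) >= min_fields:
            return i
    return None  # no plausible header found


def find_header_line_argmax(lines, sep=";"):
    """Previous behaviour: pick the line with the most fields (first on ties)."""
    counts = [len(line.split(sep)) for line in lines]
    return counts.index(max(counts))


# Invented sample: a preamble line, a 5-field header, and a data row whose
# free-text description contains two extra ';' (7 fields in total).
sample = [
    "BDsolos export",
    "Codigo PA;Ordem;Municipio;UF;Descricao Original",
    "101;ARGISSOLO;Itaguai;RJ;relevo ondulado; erosao laminar; pedregoso",
]

print(find_header_line(sample))         # 1 (the real header)
print(find_header_line_argmax(sample))  # 2 (a data row, i.e. the bug)
```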

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rule

- New R/sibcs-color-tuning.R:
  * .classify_b_color(hue, value, chroma): one of VERMELHO /
    VERMELHO_AMARELO / AMARELO / BRUNO_ACINZENTADO / ACINZENTADO
  * .dominant_b_color(pedon): walks all B horizons, returns
    thickness-weighted dominant color category. Tie-break order:
    BRUNO_ACINZ > ACINZ > AMARELO > VERMELHO > V_AMARELO
  * .dominant_b_color_subordem(pedon, ordem_code): ordem-aware
    mapping to SiBCS subordem code (P -> PV/PA/PVA/PBAC/PAC,
    L -> LV/LA/LVA/LB/LVA, N -> NV/NX/NX/NB/NX). Other ordens
    return NA (no override).
  * .apply_color_dominant_override(): post-processor, swaps
    YAML-assigned subordem when dominant disagrees with first-match.

- R/key-sibcs.R: classify_sibcs() wires the override between
  subordem assignment and the v0.9.45 "cor a determinar" fallback
  detection. Override evidence ends up in
  result$trace$color_dominant_override and a warning fires whenever
  the swap happens.

- R/benchmark-bdsolos.R:
  * .bdsolos_normalize_subordem(): maps any case / language form of
    a subordem name to the canonical 2-3 letter SiBCS code
    (PV / PBAC / LVA / etc.). Diacritic-aware. Handles compound
    names (BRUNO-ACINZENTADO, VERMELHO-AMARELO).
  * benchmark_bdsolos_sibcs() now also reports subordem-level
    metrics: predictions$predicted_subordem_code /
    reference_subordem_code / agree_subordem;
    accuracy_subordem (top-level); summary$n_in_scope_sub /
    n_matched_sub.
  * tests/test-v0960-bdsolos-benchmark.R schema test updated for
    new fields.

- Tests: 14 new tests / 37 expectations
  * .classify_b_color mapping for all 5 categories + NA inputs
  * .dominant_b_color thickness-weighted dominant + NA fallback
  * .dominant_b_color_subordem for P/L Ordens + non-color Ordens
  * .apply_color_dominant_override flip + no-op + non-color +
    missing-Munsell paths
  * classify_sibcs() end-to-end: override exposed in trace +
    Cambissolos untouched

Smoke results (RJ benchmark, 100 pedons): 9 / 100 pedons had
their subordem overridden by the dominant-color rule. Ordem
accuracy unchanged (33%) since the override is a 2nd-level rule.

R CMD check Status: OK (3722 / 0 / 20).
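
A minimal sketch of the thickness-weighted dominance rule and its tie-break, in Python rather than the package's R. The tuple layout and function name are assumptions made for illustration; the category names and the tie-break order are taken from the commit message above, and each horizon is assumed to be already categorised by a step like .classify_b_color.

```python
from collections import defaultdict

# Tie-break priority as listed in the commit message (highest first).
TIE_BREAK = [
    "BRUNO_ACINZENTADO",
    "ACINZENTADO",
    "AMARELO",
    "VERMELHO",
    "VERMELHO_AMARELO",
]


def dominant_b_color(b_horizons):
    """b_horizons: list of (top_cm, bottom_cm, category-or-None) tuples.

    Sums thickness per colour category and returns the category with the
    largest total; ties go to the category that appears first in TIE_BREAK.
    Categories are assumed to be among the five names above. Returns None
    when no horizon has a usable category.
    """
    thickness = defaultdict(float)
    for top, bottom, category in b_horizons:
        if category is None or bottom <= top:
            continue  # skip horizons without Munsell data or valid depths
        thickness[category] += bottom - top
    if not thickness:
        return None
    best = max(thickness.values())
    tied = [c for c in thickness if thickness[c] == best]
    return min(tied, key=TIE_BREAK.index)


# A thin red Bt1 over a thick yellow Bt2: first-match-wins would say
# VERMELHO, the thickness-weighted rule says AMARELO.
profile = [(40, 70, "VERMELHO"), (70, 150, "AMARELO")]
print(dominant_b_color(profile))  # AMARELO
```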

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…set)

Joins the BDsolos and FEBR PedonRecord lists by site$sisb_id to
dedupe historic Embrapa pedons that appear in both corpora.

- New R/merge-brazilian.R:
  * merge_brazilian_pedons(bdsolos, febr, prefer = c("bdsolos",
    "febr"), verbose = TRUE): joins two PedonRecord lists by
    site$sisb_id, drops duplicates from the non-preferred side.
    Each surviving pedon is tagged with site$merge_decision
    ("kept_bdsolos" / "kept_febr" / "unique") and site$merge_source.
  * summarize_brazilian_overlap(bdsolos, febr): diagnostic counts
    (n_bdsolos, n_febr, n_shared, n_bdsolos_only, n_febr_only,
    n_unmatchable). Useful before committing to the merge.
  * .get_sisb_id(pedon): NA-safe centralised lookup. Backwards-
    compatible with PedonRecord objects pre-v0.9.62.

- R/bdsolos.R: load_bdsolos_csv() now also assigns
  site$sisb_id <- Codigo PA (BDsolos historical pedon ID,
  identical numbering to FEBR's observacao$sisb_id).

- R/febr.R: read_febr_pedons() now captures observacao$sisb_id
  into site$sisb_id (character, NA-safe).

Empirical RJ overlap scan (722 BDsolos x 884 FEBR):
  590 shared sisb_ids, 132 BD-only, 239 FEBR-only, 55 unmatchable.
  Naive concat 1606 -> after merge 1016 distinct pedons.

Tests: 12 new tests / 28 expectations.
R CMD check Status: OK (3760 / 0 / 20).
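
The join can be sketched as follows, in Python rather than R and with plain dicts standing in for PedonRecord objects. The function name is hypothetical, duplicates within a single source are not handled, and treating merge_source as "the corpus the surviving record came from" is an assumption; the decision labels, the output ordering, and the RJ counts come from this PR.

```python
def merge_by_sisb_id(bdsolos, febr, prefer="bdsolos"):
    """Dedupe two pedon lists on 'sisb_id', keeping the preferred side.

    Pedons whose sisb_id is None can never match and are kept as unique.
    Output order: chosen-from-overlap, then BDsolos-only, then FEBR-only.
    """
    if prefer not in ("bdsolos", "febr"):
        raise ValueError("prefer must be 'bdsolos' or 'febr'")
    bd_ids = {p["sisb_id"] for p in bdsolos if p["sisb_id"] is not None}
    fe_ids = {p["sisb_id"] for p in febr if p["sisb_id"] is not None}
    shared = bd_ids & fe_ids

    def tag(pedon, decision, source):
        return {**pedon, "merge_decision": decision, "merge_source": source}

    chosen, bd_only, fe_only = [], [], []
    for source, pedons, uniques in (
        ("bdsolos", bdsolos, bd_only),
        ("febr", febr, fe_only),
    ):
        for p in pedons:
            if p["sisb_id"] in shared:
                if source == prefer:  # drop the non-preferred duplicate
                    chosen.append(tag(p, f"kept_{source}", source))
            else:
                uniques.append(tag(p, "unique", source))
    return chosen + bd_only + fe_only


bd = [{"sisb_id": "1"}, {"sisb_id": "2"}, {"sisb_id": None}]
fe = [{"sisb_id": "2"}, {"sisb_id": "3"}]
merged = merge_by_sisb_id(bd, fe)
print([p["merge_decision"] for p in merged])
# ['kept_bdsolos', 'unique', 'unique', 'unique']

# Arithmetic behind the RJ figures: 722 + 884 - 590 shared = 1016 distinct.
print(722 + 884 - 590)  # 1016
```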

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eries

- Version badge 0.9.40 -> 0.9.63
- Tests-passing badge 3137 -> 3760
- New "What's new in v0.9.62" section: load_bdsolos_csv,
  read_febr_pedons, benchmark_bdsolos_sibcs, dominant-color-in-B
  override, merge_brazilian_pedons (with the 590/722 RJ overlap
  empirical result)
- Status footer merges Brazilian highlights with USDA/WRB summary
- NEWS.md entry for v0.9.63

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 19:18

Copilot AI left a comment

Pull request overview

This PR extends soilKey’s Brazilian SiBCS pipeline by adding: (1) a BDsolos surveyor-reference benchmark, (2) a dominant-color-in-B post-processor to improve color-driven SiBCS subordem assignment, and (3) a BDsolos×FEBR dedup merge keyed on site$sisb_id, alongside documentation and test-suite updates.

Changes:

  • Add BDsolos benchmark tooling (benchmark_bdsolos_sibcs() + BDsolos SiBCS normalization helpers).
  • Add SiBCS dominant-color-in-B override logic and wire it into classify_sibcs() trace/warnings.
  • Add BDsolos×FEBR merge + overlap diagnostics via site$sisb_id, plus README/NEWS/version bumps and new tests/man pages.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.

| File | Description |
|------|-------------|
| tests/testthat/test-v0962-merge-brazilian.R | Adds tests for sisb_id extraction, merge behavior, overlap summary, and BDsolos loader integration. |
| tests/testthat/test-v0961-sibcs-color-tuning.R | Adds unit and end-to-end tests for dominant-color-in-B override behavior. |
| tests/testthat/test-v0960-bdsolos-benchmark.R | Adds tests for BDsolos Ordem normalization and benchmark schema/behavior. |
| README.md | Adds Brazilian benchmark series narrative + updates badges/status text. |
| R/sibcs-color-tuning.R | Implements dominant B-horizon color categorization, dominance calculation, and override application. |
| R/merge-brazilian.R | Implements merge_brazilian_pedons() and summarize_brazilian_overlap() plus internal helpers. |
| R/key-sibcs.R | Wires dominant-color override into classify_sibcs() and exposes it in trace/warnings. |
| R/febr.R | Captures FEBR observacao$sisb_id into site$sisb_id. |
| R/benchmark-bdsolos.R | Adds BDsolos benchmark and normalization helpers for Ordem/Subordem. |
| R/bdsolos.R | Captures BDsolos Classe de Solos Nivel 1/2/3 and sets site$sisb_id from Codigo PA; improves header detection. |
| NEWS.md | Documents v0.9.60–v0.9.63 Brazilian series and associated tests/metrics. |
| NAMESPACE | Exports benchmark_bdsolos_sibcs(), merge_brazilian_pedons(), summarize_brazilian_overlap(). |
| man/summarize_brazilian_overlap.Rd | Roxygen output for summarize_brazilian_overlap(). |
| man/merge_brazilian_pedons.Rd | Roxygen output for merge_brazilian_pedons(). |
| man/dot-tag_merge_decision.Rd | Roxygen output for internal .tag_merge_decision(). |
| man/dot-get_sisb_id.Rd | Roxygen output for internal .get_sisb_id(). |
| man/dot-dominant_b_color.Rd | Roxygen output for internal .dominant_b_color(). |
| man/dot-dominant_b_color_subordem.Rd | Roxygen output for internal .dominant_b_color_subordem(). |
| man/dot-classify_b_color.Rd | Roxygen output for internal .classify_b_color(). |
| man/dot-BDSOLOS_SITE_PATTERNS.Rd | Updates documented size of .BDSOLOS_SITE_PATTERNS. |
| man/dot-bdsolos_normalize_subordem.Rd | Roxygen output for internal .bdsolos_normalize_subordem(). |
| man/dot-bdsolos_normalize_ordem.Rd | Roxygen output for internal .bdsolos_normalize_ordem(). |
| man/dot-apply_color_dominant_override.Rd | Roxygen output for internal .apply_color_dominant_override(). |
| man/benchmark_bdsolos_sibcs.Rd | Roxygen output for benchmark_bdsolos_sibcs(). |
| DESCRIPTION | Bumps package version to 0.9.63. |

Comment thread README.md Outdated

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg?style=flat-square)](https://lifecycle.r-lib.org/articles/stages.html)
![v0.9.40](https://img.shields.io/badge/version-0.9.40-FF6B35?style=flat-square)
![v0.9.62](https://img.shields.io/badge/version-0.9.62-FF6B35?style=flat-square)
Comment thread R/benchmark-bdsolos.R
Comment on lines +39 to +40
#' \item \code{PODZOLICO}, \code{PODZOLCIO}, \code{LATOSOL}
#' -> \code{Argissolos} (the 1999 SiBCS rename)
Comment thread R/benchmark-bdsolos.R
Comment on lines +140 to +144
toks <- strsplit(ascii, "[ ,;]+")[[1L]]
if (length(toks) < 1L) return(NA_character_)
ord_word <- toks[1L]
sub_word <- if (length(toks) >= 2L) toks[2L] else ""
# Some BDsolos rows use compound names (BRUNO-ACINZENTADO,
Comment thread R/benchmark-bdsolos.R
Comment on lines +295 to +299
predicted_ordem <- character(length(pedons))
predicted_subordem <- character(length(pedons))
predicted_gg <- character(length(pedons))
predicted_sg <- character(length(pedons))
reference_raw <- character(length(pedons))
Comment thread R/merge-brazilian.R
Comment on lines +63 to +65
pedon$site$reference_source <- paste0(prev_src, " | merged:", decision)
} else {
pedon$site$reference_source <- paste0(prev_src, " | ", decision)
Comment thread R/merge-brazilian.R
#' tagged via \code{site$merge_decision} (\code{"kept_bdsolos"},
#' \code{"kept_febr"}, or \code{"unique"}) and \code{site$merge_source}.
#' Pedons appear in the order: chosen-from-overlap first, then
#' unique-to-bdsolos, then unique-to-febr.
Comment thread R/sibcs-color-tuning.R
# * VERMELHO -- hue <= 2.5YR (10R, 7.5R, 5R, 2.5R, 2.5YR)
# * VERMELHO_AMARELO -- hue == 5YR (intermediate)
# * AMARELO -- hue >= 7.5YR with chroma >= 4
# * BRUNO_ACINZENTADO -- value <= 4 AND chroma <= 4 (dark, regardless hue)
Comment thread R/sibcs-color-tuning.R
Comment on lines +51 to +56
# 1. BRUNO_ACINZENTADO: dark (value <= 4, chroma <= 4) and at least
# moderately yellow (hue >= 5YR) -- catches the dark-brown / dark-grey
# end of the B color spectrum.
if (value <= 4 && chroma <= 4 &&
grepl("^(5YR|7\\.5YR|10YR|2\\.5Y|5Y|10Y)\\b", hu)) {
return("BRUNO_ACINZENTADO")
Comment thread R/sibcs-color-tuning.R
Comment on lines +251 to +253
"dominante-de-cor em B (categoria %s, espessura ",
"%.0f cm de %d horizonte(s) classificado(s)/",
"%d horizonte(s) B)."),
Hugo Rodrigues and others added 2 commits May 6, 2026 15:49
…ist persona

One-call bootstrap of the local VLM stack so v0.9.65's agent_app
can offer a single "Configurar Gemma local" button.

- New R/setup-local-vlm.R:
  * setup_local_vlm(model = "balanced"): idempotent. Detects Ollama,
    starts the daemon if needed, pulls the chosen model. Catalog:
    light = gemma4:e2b (~1.5 GB), balanced = gemma4:e4b (~3 GB),
    best = gemma4:31b (~19 GB). Accepts arbitrary model identifiers
    as well. Returns status list (ready, model, ollama_url,
    installed, running, pulled, hint) for direct rendering in a
    Shiny status card.
  * ollama_is_installed(): detects ollama on PATH.
  * ollama_ensure_running(timeout_s = 30): starts ollama serve in
    background and polls until /api/tags answers.
  * ollama_pull_model(model): wraps `ollama pull`; no-op when the
    model is already on disk; rejects empty / NA input.
  * ollama_list_local_models(): queries /api/tags; never throws.
  * .print_ollama_install_hint(): OS-specific install instructions
    (Homebrew / curl-pipe-sh / winget) when Ollama is missing.

- R/vlm-prompts.R:
  * pedologist_system_prompt(language = c("pt-BR", "en")): canonical
    persona installed in every chat session (and exposed for any
    user-built vlm_provider(..., system_prompt = ...)). Trained
    pedometrist; SiBCS 5a + WRB 2022 + KST 13ed; explicit "NEVER
    classify, only extract" + per-attribute confidence + source_quote.

- R/vlm-providers.R:
  * Default Ollama model lowered from gemma4:e4b to gemma4:e2b
    (laptop-friendly default; users opt into bigger via
    setup_local_vlm presets).
  * Documentation updated to point at setup_local_vlm() as the
    one-shot bootstrap path.

- Tests: 13 new tests / ~30 expectations
  * Catalog resolution (light/balanced/best -> model names)
  * Status schema verified (7 documented fields)
  * Error paths (no Ollama on PATH, invalid model, empty / NA input)
  * Daemon lifecycle short-circuits
  * Persona content & language switching (PT-BR / EN)

CRAN-friendly: ships the downloader, NOT the weights. The user runs
setup_local_vlm() once after install; Ollama caches the model in
~/.ollama/models/. No network calls happen at package install time.

R CMD check Status: OK (3821 / 0 / 21).
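
A rough sketch of the preset resolution and the 7-field status list, written in Python rather than R and making no network or subprocess calls. The helper names are hypothetical, and deriving `ready` as "installed and running and pulled" is an assumption; the preset-to-model mapping and the field names are taken from the commit message, and 11434 is Ollama's default port.

```python
# Preset catalog as described in the commit message.
CATALOG = {
    "light": "gemma4:e2b",
    "balanced": "gemma4:e4b",
    "best": "gemma4:31b",
}


def resolve_model(model="balanced"):
    """Map a preset name to a model tag; pass other identifiers through."""
    if not isinstance(model, str) or not model.strip():
        raise ValueError("model must be a non-empty string")
    return CATALOG.get(model, model)


def make_status(model, installed, running, pulled,
                ollama_url="http://localhost:11434", hint=None):
    """Build the status record a UI card could render.

    ASSUMPTION: 'ready' is true only when all three checks passed.
    """
    return {
        "ready": installed and running and pulled,
        "model": resolve_model(model),
        "ollama_url": ollama_url,
        "installed": installed,
        "running": running,
        "pulled": pulled,
        "hint": hint,
    }


status = make_status("light", installed=True, running=True, pulled=False,
                     hint="run setup_local_vlm('light') to pull the model")
print(status["model"], status["ready"])  # gemma4:e2b False
```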

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ma pedometrist

End-to-end soil profile classification driven by the v0.9.64 local
Gemma 4 stack: photo + PDF + field-sheet image + Vis-NIR spectrum
-> deterministic taxonomic key (WRB 2022 + SiBCS 5a + USDA Tax 13).

- New inst/shiny/agent_app/app.R (bslib page_navbar, 8 nav_panels):
  * Foto Munsell  -> extract_munsell_from_photo()
  * PDF / Texto   -> extract_horizons_from_pdf()
  * Ficha Campo   -> extract_site_from_fieldsheet()
  * Espectros     -> fill_from_spectra() (OSSL local-band library)
  * Tabela        -> editable DT for manual correction
  * Classificar   -> classify_all() -> 3 bslib::value_box() cards
  * Trace         -> per-system trace + provenance browser
  * Pedometrista  -> free-form chat (ellmer chat session preserved)

  Persistent 320 px sidebar with provider/model selector, real-time
  Ollama status badges, "Configurar Gemma local" button (modal
  progress), language toggle (PT-BR / EN), session reset.

- New R/run-agent-app.R: run_agent_app(port, launch.browser, ...)
  launcher; soft-fails on missing Suggests with actionable
  install.packages() hint.

- New vignettes/v10_agente_pedometrista.Rmd: walkthrough of setup,
  persona, all 8 tabs, classify_from_documents() programmatic
  equivalent, privacy / data sovereignty rationale, known limits.

- README.md:
  * Version badge 0.9.62 -> 0.9.65
  * Tests-passing badge 3 760 -> 3 821
  * New "What's new in v0.9.65 -- Agente Pedometrista" section
  * Status footer rewritten

- DESCRIPTION: adds bslib + bsicons to Suggests.

- Tests (test-v0965-agent-app.R): 4 verifying launcher exports,
  app.R parseability, all 8 nav_panels wired, persona referenced.

Principle: the LLM never classifies. It only extracts schema-validated
JSON. The taxonomic key is 100 % R, deterministic, YAML-versioned.

R CMD check Status: OK (3821 / 0 / 21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HugoMachadoRodrigues HugoMachadoRodrigues changed the title Brazilian benchmark series: v0.9.60 BDsolos benchmark + v0.9.61 dominant-color tuning + v0.9.62 sisb_id dedup (+ v0.9.63 README) Brazilian benchmark series + Agente Pedometrista (v0.9.60 -> v0.9.65) May 6, 2026
Hugo Rodrigues and others added 4 commits May 6, 2026 16:52
…l-VLM hint

Adds the harness that measures the local Gemma 4 baseline on real
soilKey extraction tasks, so Phase 2 (few-shot) and Phase 3 (LoRA)
decisions are informed by data.

- New R/zzz.R: .onAttach() interactive hint about local Gemma stack:
  * Silent when Ollama not installed.
  * Hint to start daemon when installed but stopped.
  * Hint to run setup_local_vlm("light") when daemon up but
    gemma4:e2b missing.
  * Auto-pull only with options(soilKey.auto_setup_vlm = TRUE) or
    Sys.setenv(SOILKEY_AUTO_SETUP_VLM = "1") (CRAN policy 1.1
    forbids auto-modification of system without explicit consent).
  * Suppress all hints with options(soilKey.suggest_local_vlm = FALSE).
  * .suggest_local_vlm_message() factored out for testability.

- New R/benchmark-vlm-extraction.R:
  * benchmark_vlm_extraction(providers, tasks, fixtures_dir,
    max_per_task): provider-agnostic. 3 tasks (horizons / site /
    munsell) x per-task metrics. Returns long predictions df +
    summary df. Accepts MockVLMProvider for unit tests.
  * list_vlm_fixtures(task): paired (input, golden.json) discovery.
  * make_synthetic_horizons_fixture(pedon, fixture_id): renders any
    PedonRecord into a Markdown profile description and emits the
    structured horizons as the golden answer. Scales fixture set
    from BDsolos / FEBR / KSSL.
  * .metric_horizons_overlap: precision + recall (depth-overlap >=
    80 %) + per-attribute match rate (10 % numeric tolerance).
  * .metric_site_iou: field-level IoU + value-accuracy + recall.
  * .munsell_delta_e: prefers Nickerson Color Difference Index;
    falls back to Lab Euclidean (DeltaE 1976).

- New inst/prompts/extract_site_from_text.md: text-mode companion
  to extract_site_metadata.md. Required because the original prompt
  says "Supplied as an image content block", causing local Gemma
  to return all-null when fed text. Inlines {document_text} and
  explicitly forbids null for visible fields.

- inst/fixtures/vlm_extraction/ -- 4 bundled paired fixtures:
  * horizons/perfil_RJ_argissolo (4-horizon Argissolo Vermelho-Amarelo)
  * horizons/perfil_MG_latossolo (4-horizon Latossolo Vermelho)
  * site/ficha_RJ_001, ficha_MG_002 (field-sheet text + golden site)
  * munsell/README.md -- format spec for user-supplied photo fixtures
    (CRAN size + licence policy).

- vignettes/v11_vlm_extraction_benchmark.Rmd: walkthrough.

- 47 tests / ~70 expectations in test-v0966-benchmark-vlm-extraction.R
  covering fixture discovery, metric correctness on synthetic ground
  truths, end-to-end with MockVLMProvider, and
  .suggest_local_vlm_message shape on all Ollama states.

- README.md: version 0.9.65 -> 0.9.66, tests 3821 -> 3868, new
  "What's new" section with the baseline table.

## Baseline measured (gemma4 8B local, MacBook M1)

| task     | fixture        | precision/iou | recall/value-acc | attr-match |
|----------|----------------|---------------|------------------|-----------|
| horizons | Latossolo MG   | 1.00          | 1.00             | 1.00      |
| horizons | Argissolo RJ   | 1.00          | 1.00             | 1.00      |
| site     | Ficha MG       | 0.79          | 1.00             | 0.79      |
| site     | Ficha RJ       | 0.87          | 0.92             | 0.87      |

Horizons extraction is solved on clean PT-BR profiles. Site is
~83 % IoU with ~96 % value-accuracy on matched fields -- gaps are
inferred fields (e.g. country: BR from a Brazilian state) the 8B
model misses. This baseline is the input for Phase 2 / Phase 3.
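
To make the two metric families concrete, here is a sketch in Python (not the package's R code). The commit message does not spell out the exact definitions, so several choices are assumptions: depth overlap is computed as interval intersection-over-union, matching is greedy and one-to-one, a field counts as present when it is not null, and the field names in the example are hypothetical.

```python
def depth_iou(a, b):
    """Intersection-over-union of two (top_cm, bottom_cm) depth intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def horizons_precision_recall(predicted, golden, min_overlap=0.8):
    """Greedy one-to-one matching of predicted to golden horizons."""
    unmatched = list(golden)
    matched = 0
    for p in predicted:
        best = max(unmatched, key=lambda g: depth_iou(p, g), default=None)
        if best is not None and depth_iou(p, best) >= min_overlap:
            unmatched.remove(best)
            matched += 1
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(golden) if golden else 0.0
    return precision, recall


def site_metrics(predicted, golden):
    """Field-level IoU plus value accuracy on the fields both sides filled."""
    pred_fields = {k for k, v in predicted.items() if v is not None}
    gold_fields = {k for k, v in golden.items() if v is not None}
    both = pred_fields & gold_fields
    union = pred_fields | gold_fields
    iou = len(both) / len(union) if union else 1.0
    value_acc = (
        sum(predicted[k] == golden[k] for k in both) / len(both) if both else 0.0
    )
    return iou, value_acc


golden_h = [(0, 20), (20, 60), (60, 120)]
predicted_h = [(0, 20), (20, 55), (60, 120), (120, 150)]
print(horizons_precision_recall(predicted_h, golden_h))  # (0.75, 1.0)

# A missed inferred field lowers IoU but leaves value accuracy untouched.
golden_s = {"state": "RJ", "country": "BR", "municipality": "Itaguai"}
predicted_s = {"state": "RJ", "country": None, "municipality": "Itaguai"}
iou, value_acc = site_metrics(predicted_s, golden_s)
print(round(iou, 2), value_acc)  # 0.67 1.0
```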

R CMD check Status: OK (3868 / 0 / 21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The on-disk size figures shipped in v0.9.64 - v0.9.66 for the local
Gemma 4 catalog were wrong: I had documented gemma4:e2b at "~1.5 GB"
assuming bare 2B parameters at 4-bit quantisation, but the multimodal
Gemma 4 builds bundle a vision encoder + tokenizers that add ~5 GB
on top. Confirmed locally after the v0.9.66 pull completed:

  $ ollama list
  gemma4:e2b   6.67 GB
  gemma4       8.95 GB   (alias of latest, 8B)

Corrigendum scope (no code logic changes):

- R/setup-local-vlm.R .SOILKEY_OLLAMA_CATALOG -- corrected size_gb
  fields to 6.7 / 8.0 / 19.0; new docstring explaining the ~5 GB
  vision-encoder overhead.
- R/zzz.R .suggest_local_vlm_message() -- "(~1.5 GB)" replaced with
  "(~6.7 GB on disk)".
- R/vlm-providers.R -- both vlm_provider() docstrings updated with
  corrected sizes + multimodal-overhead note.
- vignettes/v10_agente_pedometrista.Rmd -- corrected sizes plus a
  corrigendum callout pointing back to v0.9.67.
- vignettes/v11_vlm_extraction_benchmark.Rmd -- corrected sizes AND
  added a fresh head-to-head benchmark comparing gemma4:e2b vs the
  8B 'gemma4' alias on the four bundled text fixtures.
- README.md -- corrected sizes everywhere; status footer updated.

New baseline finding (e2b vs 8B head-to-head):

| task     | gemma4:e2b | gemma4 (8B) |
|----------|-----------|-------------|
| horizons | 1.00 / 1.00 / 1.00 (both fixtures) | 1.00 / 1.00 / 1.00 (both) |
| site     | 50% ok rate; value-acc 1.00 on matched | 0% ok rate (JSON validation errors) |

- Horizons (text) is solved at both sizes -- locks in gemma4:e2b as
  the soilKey default.
- Site (text) is unstable on both sizes; failure mode is JSON
  validation, not wrong content. When extraction succeeds, value-
  accuracy on matched fields is 100%. This is exactly what Phase 2
  (few-shot demos in the prompt) targets.

R CMD check Status: OK. No tests changed; no API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…risation

Adds schema-correct worked-example prompts for the 3 extraction
tasks (horizons / site-from-text / Munsell-from-photo), opt-in
use_fewshot parameter on every extractor, n_repeats parameter on the
benchmark, and a harder bundled fixture (multi-horizon Chernossolo
BA with PT-BR comma decimals + mixed Munsell umida/seca + CaCO3).

- New inst/prompts/ (3 few-shot variants):
  * extract_horizons_fewshot.md -- 2 worked examples in the SCHEMA-
    CORRECT mixed shape: top_cm / bottom_cm / designation / boundary_*
    are RAW values; munsell_moist / munsell_dry are SINGLE wrapped
    objects (hue + value + chroma + confidence + source_quote);
    everything else (clay_pct, ph_h2o, etc.) is wrapped {value,
    confidence, source_quote}. Earlier draft (separate
    munsell_hue_moist wrappers) caused 0% schema validation -- this
    is the corrected v0.9.68 shape.
  * extract_site_from_text_fewshot.md -- 2 PT-BR + EN examples; id /
    crs raw, everything else wrapped; country inferred from state.
  * extract_munsell_from_photo_fewshot.md -- 2 examples (with /
    without Munsell card; confidence calibration baked in).

- R/vlm-extract.R:
  * extract_horizons_from_pdf(use_fewshot = TRUE) -- new arg, default
    TRUE. Switches prompt to *_fewshot variant.
  * extract_munsell_from_photo(use_fewshot = TRUE) -- same.
  * extract_site_from_fieldsheet(use_fewshot = TRUE) -- accepted but
    image-mode prompt unchanged (text-mode goes through
    .run_one_extraction).

- R/benchmark-vlm-extraction.R:
  * benchmark_vlm_extraction(use_fewshot = TRUE, n_repeats = 1L) --
    two new args. n_repeats runs each fixture N times so the summary
    can report metric_*_mean AND metric_*_sd. Required to distinguish
    real lift from stochastic LLM noise on a small fixture set.
  * .run_one_extraction(use_fewshot) -- forwarded.

- inst/fixtures/vlm_extraction/horizons/perfil_BA_chernossolo_messy:
  4-horizon Chernossolo Argiluvico Carbonatico from a Bahia survey,
  with PT-BR comma decimals, UTM coords noted then converted, mixed
  Munsell umida/seca, CaCO3 equivalents. Smoke at v0.9.68:
  precision = 1.00, recall = 1.00, attr_match = 0.79 with gemma4:e2b
  + few-shot.

- vignettes/v11 + README + NEWS: new "Phase 2" section + status
  footer.

## Honest measurement

Few-shot did NOT move metrics on the 4 simple fixtures because
vanilla gemma4:e2b already nails them. The 50% ok-rate observed
in v0.9.66 was stochastic variance, not a real failure mode -- which
is exactly what the new n_repeats parameter exposes. Few-shot
DOES NOT regress quality, and the harder Chernossolo BA fixture
confirms the system handles non-toy PT-BR profiles cleanly. Real
lift will surface from harder fixtures or smaller models -- not
from the existing toy suite.

R CMD check Status: OK (3868 / 0 / 21 unchanged).
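
Why repeats matter on a small fixture set can be shown with a short sketch, in Python rather than R. The helper name and the scripted outcomes are hypothetical; the point is only that a single run of a stochastic extractor reports 0 or 1, while N runs expose both the mean and the spread.

```python
import statistics


def repeat_metric(run_once, n_repeats=1):
    """Run a stochastic extraction n_repeats times; return (mean, sd).

    The sample standard deviation is undefined for a single run, so sd is
    returned as None when n_repeats == 1.
    """
    scores = [run_once() for _ in range(n_repeats)]
    sd = statistics.stdev(scores) if n_repeats > 1 else None
    return statistics.mean(scores), sd


# Scripted stand-in for an extractor that fails JSON validation one time in
# four (1.0 = schema-valid output, 0.0 = validation error).
outcomes = iter([1.0, 0.0, 1.0, 1.0])
mean, sd = repeat_metric(lambda: next(outcomes), n_repeats=4)
print(mean, sd)  # 0.75 0.5
```

With only one run per fixture, the same extractor would have been scored as either 100 % or 0 %, which is how run-to-run variance can be mistaken for a real failure mode.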

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ polish

Three coherent improvements that close out the Phase 2 roadmap.

(A) 8 BDsolos hard fixtures
    inst/fixtures/vlm_extraction/horizons/bdsolos_RJ_*.{txt,golden.json}
    -- one per SiBCS Ordem (Argissolo, Cambissolo, Chernossolo,
    Espodossolo, Gleissolo, Latossolo, Neossolo, Planossolo).
    Generated via make_synthetic_horizons_fixture() from real RJ
    pedons. Stress-test target: gemma4:e2b + few-shot vs baseline,
    n_repeats = 3, ~12 min on a laptop CPU.

(B) ellmer chat_structured() bridge
    - New R/vlm-types.R:
      * vlm_type_from_soilkey_schema(name): wraps
        ellmer::type_from_schema() reading inst/schemas/<name>.json.
        Returns the ellmer type tree the provider needs for
        chat_structured(type = ...).
      * .provider_supports_structured(provider): TRUE when provider
        exposes chat_structured as a method.
    - validate_or_retry(use_structured = FALSE): new param. When
      TRUE AND provider supports it, replaces the chat-and-parse-
      and-retry loop with a single chat_structured() call that
      returns a schema-validated R list directly. Removes JSON
      validation errors at the protocol level (Anthropic tool
      calls / OpenAI response_format=json_schema / Ollama 0.5+
      format=json_schema / Gemini structured output).
    - extract_horizons_from_pdf(), extract_munsell_from_photo(),
      extract_site_from_fieldsheet(), benchmark_vlm_extraction():
      all accept use_structured (default FALSE for back-compat).

(C) Production polish
    - extract_horizons_from_pdf(): cli::cli_progress_bar() for
      multi-chunk PDFs (no-op for single-chunk).
    - inst/shiny/agent_app/app.R: new sidebar section "Estrategia
      de extracao" with checkboxes for use_fewshot (default TRUE)
      and use_structured (default FALSE). Both flags propagate to
      every extract_*() call inside the app.
    - Model preset labels corrected to v0.9.67 measured sizes
      (light = ~6.7 GB, balanced = ~8 GB, best = ~19 GB).

Tests: 20 new tests / ~45 expectations (test-v0970-structured-
outputs.R) covering type builder, capability probe, structured
fast path, fallback path, and parameter propagation through the
extractor family. Total: 3 888 / 0 / 21.

R CMD check Status: OK.
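
The capability probe and fallback in (B) can be sketched as below. This is Python, not the package's R, and it does not use ellmer: the provider classes, the `extract` helper, and the "required fields" check standing in for full JSON Schema validation are all hypothetical. It shows only the control flow: take the structured path when requested and supported, otherwise fall back to chat, parse, and retry.

```python
import json


def supports_structured(provider):
    """Capability probe: does the provider expose a chat_structured method?"""
    return callable(getattr(provider, "chat_structured", None))


def extract(provider, prompt, schema, use_structured=False, max_retries=2):
    # Fast path: one call, validation delegated to the provider protocol.
    if use_structured and supports_structured(provider):
        return provider.chat_structured(prompt, schema)
    # Fallback: free-text chat, parse, check required fields, retry.
    last_error = None
    for _ in range(max_retries + 1):
        raw = provider.chat(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        missing = [k for k in schema["required"] if k not in data]
        if not missing:
            return data
        last_error = ValueError(f"missing fields: {missing}")
    raise RuntimeError("extraction failed after retries") from last_error


class TextOnlyProvider:
    """Mock provider that replays scripted text replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def chat(self, prompt):
        return next(self._replies)


class StructuredProvider(TextOnlyProvider):
    """Mock provider that also offers a structured call."""

    def chat_structured(self, prompt, schema):
        return {"top_cm": 0, "bottom_cm": 20}


schema = {"required": ["top_cm", "bottom_cm"]}
flaky = TextOnlyProvider(["not json", '{"top_cm": 0, "bottom_cm": 20}'])
print(extract(flaky, "describe horizon", schema, use_structured=True))
# falls back, retries once: {'top_cm': 0, 'bottom_cm': 20}
```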

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants