Skip to content

feat(harvester): add Memovs audiovisual archives API#4024

Open
rerowep wants to merge 1 commit into
rero:stagingfrom
rerowep:wep-harvesting-audiovisual-archives
Open

feat(harvester): add Memovs audiovisual archives API#4024
rerowep wants to merge 1 commit into
rero:stagingfrom
rerowep:wep-harvesting-audiovisual-archives

Conversation

@rerowep
Copy link
Copy Markdown
Contributor

@rerowep rerowep commented Feb 3, 2026

Add new API harvester for Médiathèque Valais audiovisual archives (archives.memovs.ch). This harvester enables automatic ingestion of Film, Photo, and Audio metadata from the Memovs digital archives.

Implementation includes:

  • ApiMemovs class with pagination and date filtering support
  • JSON transformation layer mapping Memovs metadata to RERO-ILS
  • Document type detection (Film, Audio, Photo, Other)
  • Holdings creation with electronic locations
  • Support for creators, contributors, subjects, and descriptions
  • Integration with existing CLI and task infrastructure

Configuration added as VS-MEMO in apisources.yml with endpoint https://archives.memovs.ch/docs/api/ and code 'memovs'.

@rerowep rerowep self-assigned this Feb 3, 2026
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 2 times, most recently from a072a85 to 6377ef8 Compare February 4, 2026 13:09
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 5 times, most recently from e41e4c2 to 6170f6c Compare February 5, 2026 10:49
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch from 6170f6c to f2ff57a Compare February 9, 2026 15:06
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 3 times, most recently from 02670f4 to 07a5ba6 Compare February 19, 2026 15:37
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 2 times, most recently from 9de0c23 to dd9acfe Compare February 21, 2026 22:23
@coveralls
Copy link
Copy Markdown

coveralls commented Feb 22, 2026

Coverage Status

coverage: 91.218% (-0.09%) from 91.312% — rerowep:wep-harvesting-audiovisual-archives into rero:staging

@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch from dd9acfe to 0861a59 Compare March 18, 2026 13:47
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 6 times, most recently from da8348f to 817eb99 Compare May 13, 2026 06:49
@rerowep rerowep marked this pull request as ready for review May 13, 2026 06:52
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

Walkthrough

This pull request introduces a complete Memovs audiovisual archive API harvester for RERO-ILS. The implementation spans database schema extension, a sophisticated JSON-to-RERO-ILS data transformation pipeline, HTTP-based record harvesting with in-memory caching and item/local-field synchronization, comprehensive test coverage across transformation and harvesting logic, and operational deployment configuration. The feature extends existing API harvester infrastructure through a new settings column, a specialized ApiMemovs class with pagination and cache optimization, and integration into the CLI, scheduler, and data configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding a new Memovs audiovisual archives API harvester, which matches the comprehensive implementation across all changed files.
Description check ✅ Passed The description is directly related to the changeset, detailing the implementation of the Memovs API harvester including its key components (ApiMemovs class, JSON transformation, document type detection, holdings creation, and configuration).
Docstring Coverage ✅ Passed Docstring coverage is 98.84% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch from 817eb99 to 00968e3 Compare May 13, 2026 11:40
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
tests/ui/entities/remote_entities/test_remote_entities_api.py (1)

58-60: ⚡ Quick win

Weak logging assertion: consider removing or making it specific.

The assertion only verifies that any elasticsearch log record exists, but doesn't check:

  • Log level (is it actually a warning?)
  • Message content (is it about the 404?)
  • Relationship to the delete_from_index() call

Since this test is primarily about entity creation and deletion, consider either:

  1. Remove the assertion entirely if logging behavior is not central to this test's purpose.
  2. Make it specific if you do want to verify the expected 404 warning:
    es_logs = [r for r in caplog.records if r.name == "elasticsearch"]
    assert len(es_logs) == 1
    assert es_logs[0].levelname == "WARNING"
    assert "404" in es_logs[0].message

The current assertion could pass even if the expected warning doesn't occur, reducing test reliability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ui/entities/remote_entities/test_remote_entities_api.py` around lines
58 - 60, The current weak assertion "assert any(r.name == 'elasticsearch' for r
in caplog.records)" should be tightened or removed: either delete the assertion
if logging isn't central, or replace it with a specific check that filters
caplog.records for r.name == "elasticsearch" and asserts the expected count,
r.levelname == "WARNING", and that the log message contains "404" (to tie it to
the delete_from_index() call) so the test verifies the exact warning produced by
the deletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rero_ils/modules/api_harvester/cli.py`:
- Around line 120-121: The info command currently prints the full api.settings
payload (click.echo(f"\tsettings  : {api.settings}")), which may leak sensitive
data; change this to print a redacted indicator or only safe keys: implement a
small helper (e.g., redact_settings or get_safe_settings_summary) that takes
api.settings and returns either a list of allowed keys/values or a redacted
placeholder like "<redacted>" or a filtered dict with only safe keys, and
replace the direct usage of api.settings in the click.echo call with the
helper's output so sensitive fields are never printed.

In `@rero_ils/modules/api_harvester/memovs/dojson/json/model.py`:
- Around line 557-559: The code handling bf:summary in the model currently
appends its raw value into summaries as label.value, which can be a dict or
list; update the logic in the block around the bf:summary handling (the if
description := self.data.get("bf:summary") / summaries.append call) to normalize
description to a string before storing: if it's a list, join elements (or map to
strings and join); if it's a dict, serialize to a stable string (e.g., extract a
textual field or use JSON serialization); otherwise coerce to str; then append
{"label": [{"value": normalized_description}]}. Ensure the normalization covers
None and empty cases.
- Around line 179-190: The current block that builds identified_by only handles
bf:identifier when it's a list; update the logic around identifiers (the
variable assigned from self.data.get("bf:identifier", [])) to also accept a
single dict by normalizing it into a list (e.g., if isinstance(identifiers,
dict): identifiers = [identifiers]) before the existing comprehension, then keep
the existing filters and fields ("bf:noteType", "rdf:value", note_type logic and
the exclusion of "COTE") so dict-form identifiers are processed the same as list
entries.

In `@rero_ils/modules/api_harvester/utils.py`:
- Around line 78-80: Avoid appending the raw settings value to the update
messages to prevent leaking secrets: keep assigning settings to source.settings
but replace msgs.append(f"settings:{settings}") with a generic message such as
msgs.append("settings:changed") (or similar non-sensitive text). Locate the code
that sets source.settings and updates msgs (the assignment to source.settings
and the msgs.append call) and change only the message text so no secret content
is echoed to logs/CLI.

---

Nitpick comments:
In `@tests/ui/entities/remote_entities/test_remote_entities_api.py`:
- Around line 58-60: The current weak assertion "assert any(r.name ==
'elasticsearch' for r in caplog.records)" should be tightened or removed: either
delete the assertion if logging isn't central, or replace it with a specific
check that filters caplog.records for r.name == "elasticsearch" and asserts the
expected count, r.levelname == "WARNING", and that the log message contains
"404" (to tie it to the delete_from_index() call) so the test verifies the exact
warning produced by the deletion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bcf40606-2037-45f1-894a-9f4a7b571ecd

📥 Commits

Reviewing files that changed from the base of the PR and between 994996e and 00968e3.

📒 Files selected for processing (20)
  • data/apisources.yml
  • data/organisations.json
  • rero_ils/alembic/4b2a1c3d8e9f_apiharvester_location_itemtype_pids.py
  • rero_ils/config.py
  • rero_ils/modules/api_harvester/cli.py
  • rero_ils/modules/api_harvester/memovs/__init__.py
  • rero_ils/modules/api_harvester/memovs/api.py
  • rero_ils/modules/api_harvester/memovs/dojson/__init__.py
  • rero_ils/modules/api_harvester/memovs/dojson/json/__init__.py
  • rero_ils/modules/api_harvester/memovs/dojson/json/model.py
  • rero_ils/modules/api_harvester/models.py
  • rero_ils/modules/api_harvester/utils.py
  • rero_ils/modules/ext.py
  • scripts/setup
  • tests/api_harvester/memovs/__init__.py
  • tests/api_harvester/memovs/test_memovs_api.py
  • tests/api_harvester/memovs/test_memovs_dojson.py
  • tests/api_harvester/test_cli_api_harvester.py
  • tests/data/apisources.yml
  • tests/ui/entities/remote_entities/test_remote_entities_api.py

Comment thread rero_ils/modules/api_harvester/cli.py
Comment thread rero_ils/modules/api_harvester/memovs/dojson/json/model.py Outdated
Comment thread rero_ils/modules/api_harvester/memovs/dojson/json/model.py Outdated
Comment thread rero_ils/modules/api_harvester/utils.py
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch from 00968e3 to 51cc5c5 Compare May 13, 2026 12:51
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch 9 times, most recently from bf05461 to 0473432 Compare May 18, 2026 09:52
Add new API harvester for Médiathèque Valais audiovisual archives
(archives.memovs.ch) enabling automatic ingestion of Film, Photo,
and Audio metadata from the Memovs digital archives into RERO-ILS.

Harvester (api_harvester/memovs/):
* ApiMemovs class with pagination, date filtering, and orphan deletion
* Celery tasks for scheduled harvest and cleanup of removed records
* Configuration as VS-MEMO in apisources.yml with location/item_type PIDs
* Alembic migration for default location and item_type PID settings

BIBFRAME-to-RERO-ILS transformation (dojson/json/model.py):
* Document type detection from bf:content/bf:media/bf:carrier (Film, Audio, Photo, Other)
* Provision activity from bf:provisionActivity with date normalisation
* Electronic locators for landing page and thumbnail
* MEF entity resolution for contributions, subjects, genre/form and places
* Fallback to identifiedBy when MEF ref cannot be resolved
* Physical description from bf:extent split into extent, otherPhysicalDetails and colorContent
* Warning logged when bf:contribution agent carries a URL as rdfs:label

Document and item creation (api.py):
* Holdings created per document with electronic locations
* Items created per bf:hasPart call number, stale items deleted on update
* Local fields synced from bf:note annotations (vsavmat, vsavgeo, vsavfonds...)

Other changes:
* feat(documents): add rpt (Reporter) to contribution roles schema
* refactor(documents): load _CONTRIBUTION_ROLE dynamically from JSON schema

Co-authored-by: Peter Weber <peter.weber@rero.ch>
@rerowep rerowep force-pushed the wep-harvesting-audiovisual-archives branch from 0473432 to 617a1bc Compare May 18, 2026 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants