feat(harvester): add Memovs audiovisual archives API#4024
Conversation
a072a85 to
6377ef8
Compare
e41e4c2 to
6170f6c
Compare
6170f6c to
f2ff57a
Compare
02670f4 to
07a5ba6
Compare
9de0c23 to
dd9acfe
Compare
dd9acfe to
0861a59
Compare
da8348f to
817eb99
Compare
WalkthroughThis pull request introduces a complete Memovs audiovisual archive API harvester for RERO-ILS. The implementation spans database schema extension, a sophisticated JSON-to-RERO-ILS data transformation pipeline, HTTP-based record harvesting with in-memory caching and item/local-field synchronization, comprehensive test coverage across transformation and harvesting logic, and operational deployment configuration. The feature extends existing API harvester infrastructure through a new Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
817eb99 to
00968e3
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/ui/entities/remote_entities/test_remote_entities_api.py (1)
58-60: ⚡ Quick winWeak logging assertion: consider removing or making it specific.
The assertion only verifies that any elasticsearch log record exists, but doesn't check:
- Log level (is it actually a warning?)
- Message content (is it about the 404?)
- Relationship to the
delete_from_index()callSince this test is primarily about entity creation and deletion, consider either:
- Remove the assertion entirely if logging behavior is not central to this test's purpose.
- Make it specific if you do want to verify the expected 404 warning:
es_logs = [r for r in caplog.records if r.name == "elasticsearch"] assert len(es_logs) == 1 assert es_logs[0].levelname == "WARNING" assert "404" in es_logs[0].messageThe current assertion could pass even if the expected warning doesn't occur, reducing test reliability.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ui/entities/remote_entities/test_remote_entities_api.py` around lines 58 - 60, The current weak assertion "assert any(r.name == 'elasticsearch' for r in caplog.records)" should be tightened or removed: either delete the assertion if logging isn't central, or replace it with a specific check that filters caplog.records for r.name == "elasticsearch" and asserts the expected count, r.levelname == "WARNING", and that the log message contains "404" (to tie it to the delete_from_index() call) so the test verifies the exact warning produced by the deletion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rero_ils/modules/api_harvester/cli.py`:
- Around line 120-121: The info command currently prints the full api.settings
payload (click.echo(f"\tsettings : {api.settings}")), which may leak sensitive
data; change this to print a redacted indicator or only safe keys: implement a
small helper (e.g., redact_settings or get_safe_settings_summary) that takes
api.settings and returns either a list of allowed keys/values or a redacted
placeholder like "<redacted>" or a filtered dict with only safe keys, and
replace the direct usage of api.settings in the click.echo call with the
helper's output so sensitive fields are never printed.
In `@rero_ils/modules/api_harvester/memovs/dojson/json/model.py`:
- Around line 557-559: The code handling bf:summary in the model currently
appends its raw value into summaries as label.value, which can be a dict or
list; update the logic in the block around the bf:summary handling (the if
description := self.data.get("bf:summary") / summaries.append call) to normalize
description to a string before storing: if it's a list, join elements (or map to
strings and join); if it's a dict, serialize to a stable string (e.g., extract a
textual field or use JSON serialization); otherwise coerce to str; then append
{"label": [{"value": normalized_description}]}. Ensure the normalization covers
None and empty cases.
- Around line 179-190: The current block that builds identified_by only handles
bf:identifier when it's a list; update the logic around identifiers (the
variable assigned from self.data.get("bf:identifier", [])) to also accept a
single dict by normalizing it into a list (e.g., if isinstance(identifiers,
dict): identifiers = [identifiers]) before the existing comprehension, then keep
the existing filters and fields ("bf:noteType", "rdf:value", note_type logic and
the exclusion of "COTE") so dict-form identifiers are processed the same as list
entries.
In `@rero_ils/modules/api_harvester/utils.py`:
- Around line 78-80: Avoid appending the raw settings value to the update
messages to prevent leaking secrets: keep assigning settings to source.settings
but replace msgs.append(f"settings:{settings}") with a generic message such as
msgs.append("settings:changed") (or similar non-sensitive text). Locate the code
that sets source.settings and updates msgs (the assignment to source.settings
and the msgs.append call) and change only the message text so no secret content
is echoed to logs/CLI.
---
Nitpick comments:
In `@tests/ui/entities/remote_entities/test_remote_entities_api.py`:
- Around line 58-60: The current weak assertion "assert any(r.name ==
'elasticsearch' for r in caplog.records)" should be tightened or removed: either
delete the assertion if logging isn't central, or replace it with a specific
check that filters caplog.records for r.name == "elasticsearch" and asserts the
expected count, r.levelname == "WARNING", and that the log message contains
"404" (to tie it to the delete_from_index() call) so the test verifies the exact
warning produced by the deletion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bcf40606-2037-45f1-894a-9f4a7b571ecd
📒 Files selected for processing (20)
data/apisources.ymldata/organisations.jsonrero_ils/alembic/4b2a1c3d8e9f_apiharvester_location_itemtype_pids.pyrero_ils/config.pyrero_ils/modules/api_harvester/cli.pyrero_ils/modules/api_harvester/memovs/__init__.pyrero_ils/modules/api_harvester/memovs/api.pyrero_ils/modules/api_harvester/memovs/dojson/__init__.pyrero_ils/modules/api_harvester/memovs/dojson/json/__init__.pyrero_ils/modules/api_harvester/memovs/dojson/json/model.pyrero_ils/modules/api_harvester/models.pyrero_ils/modules/api_harvester/utils.pyrero_ils/modules/ext.pyscripts/setuptests/api_harvester/memovs/__init__.pytests/api_harvester/memovs/test_memovs_api.pytests/api_harvester/memovs/test_memovs_dojson.pytests/api_harvester/test_cli_api_harvester.pytests/data/apisources.ymltests/ui/entities/remote_entities/test_remote_entities_api.py
00968e3 to
51cc5c5
Compare
bf05461 to
0473432
Compare
Add new API harvester for Médiathèque Valais audiovisual archives (archives.memovs.ch) enabling automatic ingestion of Film, Photo, and Audio metadata from the Memovs digital archives into RERO-ILS. Harvester (api_harvester/memovs/): * ApiMemovs class with pagination, date filtering, and orphan deletion * Celery tasks for scheduled harvest and cleanup of removed records * Configuration as VS-MEMO in apisources.yml with location/item_type PIDs * Alembic migration for default location and item_type PID settings BIBFRAME-to-RERO-ILS transformation (dojson/json/model.py): * Document type detection from bf:content/bf:media/bf:carrier (Film, Audio, Photo, Other) * Provision activity from bf:provisionActivity with date normalisation * Electronic locators for landing page and thumbnail * MEF entity resolution for contributions, subjects, genre/form and places * Fallback to identifiedBy when MEF ref cannot be resolved * Physical description from bf:extent split into extent, otherPhysicalDetails and colorContent * Warning logged when bf:contribution agent carries a URL as rdfs:label Document and item creation (api.py): * Holdings created per document with electronic locations * Items created per bf:hasPart call number, stale items deleted on update * Local fields synced from bf:note annotations (vsavmat, vsavgeo, vsavfonds...) Other changes: * feat(documents): add rpt (Reporter) to contribution roles schema * refactor(documents): load _CONTRIBUTION_ROLE dynamically from JSON schema Co-authored-by: Peter Weber <peter.weber@rero.ch>
0473432 to
617a1bc
Compare
Add new API harvester for Médiathèque Valais audiovisual archives (archives.memovs.ch). This harvester enables automatic ingestion of Film, Photo, and Audio metadata from the Memovs digital archives.
Implementation includes:
Configuration added as VS-MEMO in apisources.yml with endpoint https://archives.memovs.ch/docs/api/ and code 'memovs'.