Skip to content

CMR-11105: Update resharding to copy index directly from ES, fix how split cluster identifies resharded collection index#2391

Merged
jmaeng72 merged 16 commits intomasterfrom
CMR-11105
Mar 5, 2026
Merged

CMR-11105: Update resharding to copy index directly from ES, fix how split cluster identifies resharded collection index#2391
jmaeng72 merged 16 commits intomasterfrom
CMR-11105

Conversation

@jmaeng72
Copy link
Contributor

@jmaeng72 jmaeng72 commented Mar 3, 2026

Overview

What is the objective?

Fix resharding bugs found during testing

What are the changes?

  • Update resharding to copy the index directly from ES instead of from the index-set
  • Update reshard start api to return a task id to use in reshard status to keep track of the reindex task in elastic better
  • Remove unnecessary channel use in reshard start
  • Update how split cluster identifies a changed/resharded collection index to go to the non-gran cluster
  • Add sys int tests to make sure both granule and collection reshards work properly

What areas of the application does this impact?

  • Ingest
  • Bootstrap

Required Checklist

  • New and existing unit and int tests pass locally and remotely
  • clj-kondo has been run locally and all errors in changed files are corrected
  • I have commented my code, particularly in hard-to-understand areas
  • I have made changes to the documentation (if necessary)
  • My changes generate no new warnings

Additional Checklist

  • I have removed unnecessary/dead code and imports in files I have changed
  • I have cleaned up integration tests by doing one or more of the following:
    • migrated any are2 tests to are3 in files I have changed
    • de-duped, consolidated, removed dead int tests
    • transformed applicable int tests into unit tests
    • reduced number of system state resets by updating fixtures. Ex) (use-fixtures :each (ingest/reset-fixture {})) to be :once instead of :each

Summary by CodeRabbit

  • New Features

    • Resharding now returns a task ID for tracking; status checks and polling require that task ID. Start accepts elastic_name and num_shards; finalize and rollback require elastic_name.
  • Documentation

    • Docs/examples updated with request/response fields, task_id usage, expected statuses (IN_PROGRESS, COMPLETE, FAILED), and note: finalize only when COMPLETE.
  • Improvements

    • Added safer index copy for resharding and explicit mapping/setting retrieval; mapping updates now MERGE properties.

@jmaeng72 jmaeng72 self-assigned this Mar 3, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces dispatcher/channel-based migrate-index with a direct bulk reindex flow that returns an Elasticsearch reindex task-id; APIs, services, tests, and docs now start reindexing, return task-id, and query reindex status by task-id (IN_PROGRESS / COMPLETE / FAILED). Adds index-copy helper and status constants.

Changes

Cohort / File(s) Summary
Bootstrap API & Docs
bootstrap-app/src/cmr/bootstrap/api/resharding.clj, bootstrap-app/README.md
Start now invokes direct reindex and returns :task-id; status/finalize endpoints accept and validate task_id; docs updated with request/response examples and status enum.
Bulk Index & Bootstrap Services
bootstrap-app/src/cmr/bootstrap/data/bulk_index.clj, bootstrap-app/src/cmr/bootstrap/services/bootstrap_service.clj, bootstrap-app/src/cmr/bootstrap/api/messages_bulk_index.clj
bulk/migrate-index now returns ES reindex task-id and updates status using centralized constants; removed dispatcher-based migrate-index and background migrate channel; start-reshard-index signature and call sites updated to surface task-id.
Dispatch Implementations
bootstrap-app/src/cmr/bootstrap/services/dispatch/core.clj, .../impl/async.clj, .../impl/sync.clj, .../impl/message_queue.clj
Removed migrate-index from Dispatch protocol and all dispatcher implementations, their channels, and dispatch maps.
Elasticsearch Utils
elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj, .../es_index_helper.clj, elastic-utils-lib/test/.../es_helper.clj
Added get-reindex-task-status to query _tasks/{id} and validate task/index relation; removed prior wildcard-based helpers and associated unit test block; added get-mapping/get-settings and clarified mapping docstring.
Indexer App
indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj, .../indexer_util.clj, .../services/index_set_service.clj, indexer-app/README.md
Added create-copy-of-index to create shard-adjusted copies preserving mappings/settings; added reshard-status-states constants; integrated task-id validation and index-copy into reshard flows; docs updated for task_id usage and finalize precondition.
Transmit / Client Lib
transmit-lib/src/cmr/transmit/indexer.clj
get-reshard-status signature extended to accept reindex-task-id and include task_id in query params.
Tests & System Integration
system-int-test/.../bootstrap_util.clj, .../reshard_index_test.clj, indexer-app/test/.../index_set_service_test.clj, indexer-app/int-test/.../utility.clj
Tests updated to capture/propagate task-id; polling helpers accept task-id; mocks and expectations use get-reindex-task-status and centralized status constants; expanded integration scenarios and assertions.
Index Mapping
indexer-app/src/cmr/indexer/data/index_set.clj
Added new :keyword2 collection field mapping (m/text-field-keyword-mapping) alongside legacy :keyword.
Minor Docs/Helpers
system-int-test/src/cmr/system_int_test/utils/elastic_util.clj, .../bootstrap_util.clj
Docstring tweak and signature update: wait-for-reshard-complete now requires task-id and propagates it to status queries.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as BootstrapAPI
    participant Service as BootstrapService
    participant Bulk as BulkLayer
    participant ES as Elasticsearch

    Client->>API: POST /start-reshard (elastic_name, num_shards)
    API->>Service: start-reshard-index(index, num_shards, elastic_name)
    Service->>Bulk: bulk/migrate-index(context, index, num_shards, elastic_name)
    Bulk->>ES: POST _reindex (async)
    ES-->>Bulk: 202 {task-id}
    Bulk-->>Service: {:task-id ...}
    Service-->>API: 202 {:task-id ...}
    Client->>API: GET /reshard-status?elastic_name=...&task_id=...
    API->>Service: reshard-status(index, elastic_name, task_id)
    Service->>ES: GET /_tasks/{task-id}
    ES-->>Service: {completed: bool, failures: [...]}
    Service-->>API: {:status IN_PROGRESS/COMPLETE/FAILED, message...}
    API-->>Client: 200 {:status ...}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

hacktoberfest-accepted

Suggested reviewers

  • jceaser
  • DuJuan
  • daniel-zamora
  • eereiter
  • zimzoom

Poem

🐇 I hopped through code to chase a task,
Task-IDs now guide each shuffled ask.
Channels folded, copies made neat,
Reindex tracked, status clear and sweet.
Hooray — resharding leaps to its feet!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: updating resharding to copy index from ES and fixing split cluster identification for resharded collection indices.
Description check ✅ Passed The description covers objectives, changes, and impacted areas comprehensively. However, the Required Checklist is incomplete with unit/int tests unchecked and the Additional Checklist items for code cleanup remain unchecked.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch CMR-11105

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
system-int-test/src/cmr/system_int_test/utils/bootstrap_util.clj (1)

560-576: ⚠️ Potential issue | 🟡 Minor

Fail fast when reshard status is FAILED in polling helper.

Line 565-569 only exits on COMPLETE or timeout. A FAILED status currently burns all retries and hides the real terminal state.

💡 Proposed fix
 (defn wait-for-reshard-complete
   [coll-index-name elastic-name task-id {:keys [max-attempts sleep-ms] :or {max-attempts 50 sleep-ms 1000}}]
   (loop [attempt 1]
     (let [res (get-reshard-status coll-index-name {:elastic-name elastic-name :task-id task-id})
           status (get-in res [:reshard-status])]
       (cond
         (= status "COMPLETE")
         (info (format "Success: Resharding index %s is COMPLETE." coll-index-name))
+
+        (= status "FAILED")
+        (throw (Exception. (format "Resharding index %s FAILED." coll-index-name)))
 
         (>= attempt max-attempts)
         (throw (Exception. (str "Timeout: Resharding failed to complete after " attempt " attempts.")))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@system-int-test/src/cmr/system_int_test/utils/bootstrap_util.clj` around
lines 560 - 576, The polling loop in the reshard helper (where
get-reshard-status is called and status is bound from [:reshard-status]) only
handles "COMPLETE" and timeouts; add an explicit branch for (= status "FAILED")
that throws an Exception immediately (including status and the full res or
relevant error details) so failures don't consume all retries; update the cond
in the loop that references coll-index-name, task-id, max-attempts, sleep-ms to
check for "FAILED" before the timeout branch and include helpful context in the
thrown message.
🧹 Nitpick comments (3)
indexer-app/src/cmr/indexer/data/index_set.clj (1)

551-554: Helpful legacy note; consider adding a cleanup tracker reference.

This context is useful. To avoid indefinite :keyword/:keyword2 drift, consider linking a migration ticket (or TODO with ticket id) near Line 551 so cleanup is explicitly tracked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@indexer-app/src/cmr/indexer/data/index_set.clj` around lines 551 - 554, Add
an explicit cleanup/migration reference near the legacy comment for :keyword and
:keyword2: insert a TODO or NOTE that references the migration/ticket ID (e.g.,
MIGRATE-1234) so future maintainers know there is an actionable task to remove
:keyword or consolidate into :keyword2; update the block containing the comment
and the mapping symbol :keyword2 (in index_set.clj) to include that ticket ID
and a short target (e.g., "remove legacy :keyword by YYYY-MM-DD /
MIGRATE-1234").
bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/sync.clj (1)

125-125: Remove the commented-out dispatch entry to keep the behavior map unambiguous.

Leaving ;:migrate-index migrate-index in the live map adds noise after the functional removal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/sync.clj` at line 125,
Remove the commented-out dispatch entry ";:migrate-index migrate-index" from the
behavior/dispatch map in sync.clj so the live map contains only active entries;
locate the map where dispatch entries are declared (the same block containing
:migrate-index) and delete that commented line to avoid noise and ambiguity.
bootstrap-app/README.md (1)

129-141: Add a language to fenced code blocks in this updated section.

Line 129 starts a fenced block without a language tag, which triggers markdownlint (MD040).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bootstrap-app/README.md` around lines 129 - 141, The fenced code block
showing the curl example is missing a language tag (triggering MD040); update
the opening backticks for that RESPONSE example to include a language (e.g.,
change ``` to ```bash or ```sh) so the curl snippet is properly marked and
markdownlint passes; locate the curl example block in the README section that
starts with the REQUEST/RESPONSE sample and add the language tag to the opening
fence.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bootstrap-app/README.md`:
- Around line 120-133: The README shows inconsistent naming between the reshard
API response field `task-id` and the query param `task_id`; update the docs to
use a single canonical name (prefer `task_id`) across examples and prose or
explicitly state the mapping (e.g., "use the `task-id` value from the
/reshard/<index>/start response as the `task_id` query parameter when calling
/reshard/<index>/status"); ensure all occurrences of `task-id` and `task_id` in
the README are updated to the chosen canonical name (or include the one-line
conversion note) so operators aren’t confused when copying the value from the
start response into the status/finalize requests.

In `@bootstrap-app/src/cmr/bootstrap/data/bulk_index.clj`:
- Around line 80-86: After calling es-helper/migrate-index, check that (:task
result) (bound to reindex-task-id) is non-nil before marking the index as
IN_PROGRESS; if reindex-task-id is nil treat it as a migration failure by
throwing an ex-info (similar to the existing :error path) so clients can poll,
and avoid calling index-set-service/update-resharding-status with a missing task
id. Update the logic around reindex-task-id and the call to
index-set-service/update-resharding-status (using indexer-context,
index-set/index-set-id, source-index, indexer-util/reshard-status-states, and
elastic-name) to only proceed when a valid task id exists.

In `@indexer-app/README.md`:
- Around line 215-217: The markdown example showing the curl command violates
MD040/MD046; fix it by making the code block lint-compliant—either convert the
fenced block to an indented code block (prefix the curl line with four spaces)
or add a fenced code language tag (e.g., ```bash) before and after the curl line
so the example becomes a proper fenced code block. Update the curl example in
the README.md accordingly to match the repo's markdown style.

In `@indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj`:
- Around line 199-202: The current branch that checks (esi-helper/exists? conn
new-index) only logs via error and returns, allowing upstream processing to
continue; change it to fail fast by throwing an exception after logging so the
copy operation aborts immediately. Locate the exists? check and replace the
silent-return path with a failing path such as calling throw (eg. throw (ex-info
...)) or otherwise returning a clearly failing result; include orig-index and
new-index in the exception message so callers see why it failed, and keep the
esi-helper/exists? conn new-index and error(...) usages for context.

In `@indexer-app/src/cmr/indexer/services/index_set_service.clj`:
- Around line 688-691: The reshard start is not atomic: es/create-copy-of-index
creates the target index (target-index) before update-index-set persists
new-index-set, leaving an orphaned ES index if update fails. Fix by making the
start operation atomic: after calling es/create-copy-of-index(context,
elastic-name, index, target-index, num-shards) immediately persist a durable
"reshard-start" marker in the same transactional scope as update-index-set
(e.g., update-index-set should first record resharding-in-progress with
target-index/new-index-set and then return), and if update-index-set fails
rollback/delete the created ES index (call es/delete-index on target-index) so
no orphan remains; alternatively invert to persist the reshard marker first then
create the ES index and finalize by clearing the marker in update-index-set —
ensure operations reference es/create-copy-of-index, update-index-set,
target-index, new-index-set and are idempotent for retries.

In `@system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj`:
- Around line 242-243: The assertions that verify old-index removal are reading
the wrong path and comparing mismatched name formats; update all uses of get-in
on updated-index-set (e.g., where tests reference updated-index-set and
coll-index-set-key) to read [:collection :index-set :indexes] instead of
[:collection :indexes], and change checks that compare prefixed ES index names
(like "1_*" / coll-index-set-key) against canonical index names to compare the
canonical names from the :indexes entries (or strip the prefix before
comparison) so the is/not-any? and equality checks use the same name form; apply
these fixes to the failing assertions that reference updated-index-set and
coll-index-set-key (the occurrences around the old-index removal checks).

---

Outside diff comments:
In `@system-int-test/src/cmr/system_int_test/utils/bootstrap_util.clj`:
- Around line 560-576: The polling loop in the reshard helper (where
get-reshard-status is called and status is bound from [:reshard-status]) only
handles "COMPLETE" and timeouts; add an explicit branch for (= status "FAILED")
that throws an Exception immediately (including status and the full res or
relevant error details) so failures don't consume all retries; update the cond
in the loop that references coll-index-name, task-id, max-attempts, sleep-ms to
check for "FAILED" before the timeout branch and include helpful context in the
thrown message.

---

Nitpick comments:
In `@bootstrap-app/README.md`:
- Around line 129-141: The fenced code block showing the curl example is missing
a language tag (triggering MD040); update the opening backticks for that
RESPONSE example to include a language (e.g., change ``` to ```bash or ```sh) so
the curl snippet is properly marked and markdownlint passes; locate the curl
example block in the README section that starts with the REQUEST/RESPONSE sample
and add the language tag to the opening fence.

In `@bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/sync.clj`:
- Line 125: Remove the commented-out dispatch entry ";:migrate-index
migrate-index" from the behavior/dispatch map in sync.clj so the live map
contains only active entries; locate the map where dispatch entries are declared
(the same block containing :migrate-index) and delete that commented line to
avoid noise and ambiguity.

In `@indexer-app/src/cmr/indexer/data/index_set.clj`:
- Around line 551-554: Add an explicit cleanup/migration reference near the
legacy comment for :keyword and :keyword2: insert a TODO or NOTE that references
the migration/ticket ID (e.g., MIGRATE-1234) so future maintainers know there is
an actionable task to remove :keyword or consolidate into :keyword2; update the
block containing the comment and the mapping symbol :keyword2 (in index_set.clj)
to include that ticket ID and a short target (e.g., "remove legacy :keyword by
YYYY-MM-DD / MIGRATE-1234").

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb1d495 and 0ad93ea.

📒 Files selected for processing (23)
  • bootstrap-app/README.md
  • bootstrap-app/src/cmr/bootstrap/api/messages_bulk_index.clj
  • bootstrap-app/src/cmr/bootstrap/api/resharding.clj
  • bootstrap-app/src/cmr/bootstrap/data/bulk_index.clj
  • bootstrap-app/src/cmr/bootstrap/services/bootstrap_service.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/core.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/async.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/message_queue.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/sync.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_index_helper.clj
  • elastic-utils-lib/test/cmr/elastic_utils/test/es_helper.clj
  • indexer-app/README.md
  • indexer-app/int-test/cmr/indexer/test/utility.clj
  • indexer-app/src/cmr/indexer/data/index_set.clj
  • indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
  • indexer-app/src/cmr/indexer/indexer_util.clj
  • indexer-app/src/cmr/indexer/services/index_set_service.clj
  • indexer-app/test/cmr/indexer/test/services/index_set_service_test.clj
  • system-int-test/src/cmr/system_int_test/utils/bootstrap_util.clj
  • system-int-test/src/cmr/system_int_test/utils/elastic_util.clj
  • system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj
  • transmit-lib/src/cmr/transmit/indexer.clj
💤 Files with no reviewable changes (5)
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/async.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/core.clj
  • bootstrap-app/src/cmr/bootstrap/api/messages_bulk_index.clj
  • bootstrap-app/src/cmr/bootstrap/services/dispatch/impl/message_queue.clj
  • elastic-utils-lib/test/cmr/elastic_utils/test/es_helper.clj

@jmaeng72 jmaeng72 changed the title Cmr 11105 CMR-11105: Update resharding to copy index directly from ES, fix how split cluster identifies resharded collection index Mar 3, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj (1)

240-241: ⚠️ Potential issue | 🟠 Major

Assertion path is missing :index-set prefix.

Line 240 reads (get-in updated-index-set [:collection :indexes]) but based on the index-set structure used elsewhere (e.g., line 237 uses [:index-set :concepts :collection ...]), this should include the :index-set prefix. The current path will always return nil, making the not-any? assertion trivially pass.

🐛 Proposed fix for assertion path
          ;; check orig index is deleted from index-set and elastic
-          _ (is (not-any? #(= (:name %) coll-index-set-key) (get-in updated-index-set [:collection :indexes])))
+          _ (is (not-any? #(= (:name %) coll-index-set-key) (get-in updated-index-set [:index-set :collection :indexes])))
           _ (is (not (es-util/index-exists? coll-index-name elastic-name)))

Also applies to lines 314, 394, and 466 where similar incorrect paths are used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj`
around lines 240 - 241, The assertion paths miss the top-level :index-set key;
update the get-in calls that read from updated-index-set (e.g., the expression
using (get-in updated-index-set [:collection :indexes])) to include the
:index-set prefix (so they read from [:index-set :collection :indexes]) so the
not-any? assertions actually inspect the index-set structure; apply the same fix
to the other similar checks that reference updated-index-set (the ones around
the sections flagged at lines ~314, ~394, ~466) so all assertions use (get-in
updated-index-set [:index-set :collection :indexes]).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj`:
- Around line 148-149: The code calls string/lower-case on description retrieved
via (get-in resp [:task :description]) which can be nil; change the check in the
index-found-in-description computation to guard against nil by normalizing
description to a safe empty string or only performing string/lower-case when
description is non-nil (e.g., use a when-let or (or description "") pattern)
before computing (string/includes? (string/lower-case ...) (string/lower-case
index)); update the index-found-in-description binding accordingly to avoid a
NullPointerException.

---

Duplicate comments:
In `@system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj`:
- Around line 240-241: The assertion paths miss the top-level :index-set key;
update the get-in calls that read from updated-index-set (e.g., the expression
using (get-in updated-index-set [:collection :indexes])) to include the
:index-set prefix (so they read from [:index-set :collection :indexes]) so the
not-any? assertions actually inspect the index-set structure; apply the same fix
to the other similar checks that reference updated-index-set (the ones around
the sections flagged at lines ~314, ~394, ~466) so all assertions use (get-in
updated-index-set [:index-set :collection :indexes]).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bf5551 and 00669db.

📒 Files selected for processing (5)
  • bootstrap-app/src/cmr/bootstrap/api/resharding.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • elastic-utils-lib/test/cmr/elastic_utils/test/es_helper.clj
  • indexer-app/test/cmr/indexer/test/services/index_set_service_test.clj
  • system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj
💤 Files with no reviewable changes (1)
  • elastic-utils-lib/test/cmr/elastic_utils/test/es_helper.clj
🚧 Files skipped from review as they are similar to previous changes (1)
  • bootstrap-app/src/cmr/bootstrap/api/resharding.clj

@codecov-commenter
Copy link

codecov-commenter commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 24.34783% with 87 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.87%. Comparing base (44b9843) to head (b18da01).

Files with missing lines Patch % Lines
...p/src/cmr/indexer/data/index_set_elasticsearch.clj 3.03% 32 Missing ⚠️
...stic-utils-lib/src/cmr/elastic_utils/es_helper.clj 6.25% 15 Missing ⚠️
...ootstrap-app/src/cmr/bootstrap/data/bulk_index.clj 0.00% 12 Missing ⚠️
...app/src/cmr/indexer/services/index_set_service.clj 62.96% 9 Missing and 1 partial ⚠️
bootstrap-app/src/cmr/bootstrap/api/resharding.clj 10.00% 9 Missing ⚠️
...tils-lib/src/cmr/elastic_utils/es_index_helper.clj 33.33% 4 Missing ⚠️
...p/src/cmr/bootstrap/services/bootstrap_service.clj 0.00% 2 Missing ⚠️
transmit-lib/src/cmr/transmit/indexer.clj 0.00% 2 Missing ⚠️
indexer-app/src/cmr/indexer/indexer_util.clj 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2391      +/-   ##
==========================================
- Coverage   57.92%   57.87%   -0.06%     
==========================================
  Files        1067     1067              
  Lines       73299    73337      +38     
  Branches     2119     2122       +3     
==========================================
- Hits        42462    42447      -15     
- Misses      28860    28910      +50     
- Partials     1977     1980       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@eereiter eereiter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Rabbit has found some issues that need to be addressed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj (1)

148-150: ⚠️ Potential issue | 🟡 Minor

Guard description before string/lower-case to avoid nil failure path.

Line 149 still lowercases description unconditionally. If :task :description is missing, status checks fall into generic exception handling instead of the specific task/index mismatch path.

💡 Proposed fix
-          description (get-in resp [:task :description])
-          index-found-in-description (string/includes? (string/lower-case description) (string/lower-case index))
+          description (get-in resp [:task :description])
+          index-found-in-description (and description
+                                          (string/includes? (string/lower-case description)
+                                                            (string/lower-case index)))
#!/bin/bash
# Verify current guarding around description in get-reindex-task-status
rg -n -C3 'description|get-reindex-task-status|string/lower-case description' elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj` around lines 148 -
150, In get-reindex-task-status, guard the possibly nil description before
calling string/lower-case: locate where description is bound from (get-in resp
[:task :description]) and ensure index-found-in-description is computed only
after checking description (e.g., use when-let or default description to an
empty string) so string/lower-case never receives nil; update the logic that
builds full-status/ index-found-in-description to handle a missing description
path and preserve the existing task/index mismatch flow.
🧹 Nitpick comments (1)
system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj (1)

167-317: Consider extracting repeated dual-write verification into helpers.

The two new tests are comprehensive but very dense. Pulling repeated “ingest → wait → retrieve revision → verify old/new index docs” steps into small helpers would reduce brittleness and make failures easier to diagnose.

Also applies to: 319-470

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj`
around lines 167 - 317, The test reshard-index-with-collection-updates-test
contains repeated sequences (d/ingest -> index/wait-until-indexed ->
search/retrieve-concept -> es-util/get-doc -> get-in revision checks) (seen
around coll1 and coll2 flows) — extract that pattern into small helper functions
(e.g., assert-dual-write-or-update or verify-collection-dual-write) that accept
parameters: concept payload or concept-id, expected revision, original index
name (coll-index-name or resharded-coll-index-name), target index name
(resharded-coll-index-name or new-resharded-coll-index) and elastic-name;
implement helpers to perform the ingest (or pass an already-ingested concept),
wait, retrieve the concept via search/retrieve-concept, fetch docs from
es-util/get-doc for both indexes and assert revision fields, then replace the
duplicated blocks in reshard-index-with-collection-updates-test (and the similar
section referenced at 319-470) with calls to the new helpers to reduce
duplication and clarify failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bootstrap-app/README.md`:
- Line 134: The README line describing task_id contains a typo: replace "hypen"
with "hyphen" in the sentence "task_id = string (elastic task id associated with
the reindexing task of moving original index content to the new index. This is
given in the output of the /reshard/<index>/start api as 'task-id'. Notice the
difference in hypen vs underscore.)" so the note reads "hyphen vs underscore"
and ensure the quoted 'task-id' remains unchanged; edit the same string for
clarity in the task_id description.

In `@indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj`:
- Around line 206-220: The client/get calls that fetch index mapping and
settings (mapping-resp via client/get mapping-url and settings-resp via
client/get settings-url) do not include connection HTTP options, so auth and
connection config from conn are lost; fix by either replacing those calls with
rest/get conn mapping-url and rest/get conn settings-url (the conn-aware
wrapper) or by merging (.http-opts conn) into the options map passed to
client/get (e.g. client/get mapping-url (merge (.http-opts conn)
{:throw-exceptions true}) and similarly for settings-resp) so the
connection-level headers and options are preserved.

In `@indexer-app/src/cmr/indexer/services/index_set_service.clj`:
- Around line 687-696: The catch block unconditionally deletes target-index
which can remove a pre-existing index; change the flow so you only delete an
index you created: have es/create-copy-of-index return or set a local flag
(e.g., created-target? or return a map {:created true}) and assign it before
calling update-index-set, or catch the specific "index already exists" exception
and avoid deletion in that case; then in the catch use that flag
(created-target?) before calling esi-helper/exists? / es/delete-index
(referencing es/create-copy-of-index, update-index-set, esi-helper/exists?,
es/delete-index, indexer-util/context->conn and indexer-util/context->es-store)
to ensure rollback only removes indexes created by this operation.

---

Duplicate comments:
In `@elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj`:
- Around line 148-150: In get-reindex-task-status, guard the possibly nil
description before calling string/lower-case: locate where description is bound
from (get-in resp [:task :description]) and ensure index-found-in-description is
computed only after checking description (e.g., use when-let or default
description to an empty string) so string/lower-case never receives nil; update
the logic that builds full-status/ index-found-in-description to handle a
missing description path and preserve the existing task/index mismatch flow.

---

Nitpick comments:
In `@system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj`:
- Around line 167-317: The test reshard-index-with-collection-updates-test
contains repeated sequences (d/ingest -> index/wait-until-indexed ->
search/retrieve-concept -> es-util/get-doc -> get-in revision checks) (seen
around coll1 and coll2 flows) — extract that pattern into small helper functions
(e.g., assert-dual-write-or-update or verify-collection-dual-write) that accept
parameters: concept payload or concept-id, expected revision, original index
name (coll-index-name or resharded-coll-index-name), target index name
(resharded-coll-index-name or new-resharded-coll-index) and elastic-name;
implement helpers to perform the ingest (or pass an already-ingested concept),
wait, retrieve the concept via search/retrieve-concept, fetch docs from
es-util/get-doc for both indexes and assert revision fields, then replace the
duplicated blocks in reshard-index-with-collection-updates-test (and the similar
section referenced at 319-470) with calls to the new helpers to reduce
duplication and clarify failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42480c7e-543b-4bf8-8cb9-db1df6ff7495

📥 Commits

Reviewing files that changed from the base of the PR and between 00669db and b52ce8f.

📒 Files selected for processing (7)
  • bootstrap-app/README.md
  • bootstrap-app/src/cmr/bootstrap/data/bulk_index.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • indexer-app/README.md
  • indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
  • indexer-app/src/cmr/indexer/services/index_set_service.clj
  • system-int-test/test/cmr/system_int_test/bootstrap/reshard_index_test.clj

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
indexer-app/src/cmr/indexer/services/index_set_service.clj (1)

688-692: ⚠️ Potential issue | 🟠 Major

Reshard start is still non-atomic across ES creation and index-set persistence.

If Line 689 succeeds but Line 692 fails, state diverges and retries can conflict with the already-created target index.

🛠️ Safer flow (create + guarded cleanup only after successful create)
-    ;; Create index directly in ES first to avoid potential mapping mismatch issues between old and new indexes
-    (es/create-copy-of-index context elastic-name index target-index num-shards)
-
-    ;; Update the index-set and all the indexes with changes
-    (update-index-set context elastic-name new-index-set)))
+    (let [conn (indexer-util/context->conn context elastic-name)]
+      (when (esi-helper/exists? conn target-index)
+        (errors/throw-service-error
+         :conflict
+         (format "Cannot start resharding because target index [%s] already exists." target-index)))
+      ;; Create index directly in ES first to avoid potential mapping mismatch issues between old and new indexes
+      (es/create-copy-of-index context elastic-name index target-index num-shards)
+      (try
+        ;; Update the index-set and all the indexes with changes
+        (update-index-set context elastic-name new-index-set)
+        (catch Exception e
+          ;; rollback only for index created in this flow
+          (es/delete-index (indexer-util/context->es-store context elastic-name) target-index)
+          (throw e))))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@indexer-app/src/cmr/indexer/services/index_set_service.clj` around lines 688
- 692, Make the reshard flow atomic by ensuring a created target index is
removed if persisting the new index-set fails: call es/create-copy-of-index as
before, then wrap the subsequent update-index-set call in a try/catch; on
exception call es/delete-index (or equivalent cleanup) for target-index to
remove the partially created ES index, then rethrow the error so the caller sees
the failure. Update the code paths around es/create-copy-of-index and
update-index-set to perform this guarded cleanup (or alternatively persist a
"creating" marker in the index-set before creation and clear it on success) so
state cannot diverge if update-index-set fails.
🧹 Nitpick comments (5)
indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj (2)

220-223: Filter out nil settings to avoid sending nulls to Elasticsearch.

If the original index lacks any of these settings (number_of_replicas, max_result_window, refresh_interval), they'll be nil. When serialized to JSON and sent to ES, explicit null values may cause validation errors or unexpected behavior compared to omitting the keys entirely.

♻️ Suggested refactor using cond->
-            updated-settings {:number_of_shards (str shard-count)
-                              :number_of_replicas (get-in settings [:index :number_of_replicas])
-                              :max_result_window (get-in settings [:index :max_result_window])
-                              :refresh_interval (get-in settings [:index :refresh_interval])}]
+            updated-settings (cond-> {:number_of_shards (str shard-count)}
+                               (get-in settings [:index :number_of_replicas])
+                               (assoc :number_of_replicas (get-in settings [:index :number_of_replicas]))
+                               (get-in settings [:index :max_result_window])
+                               (assoc :max_result_window (get-in settings [:index :max_result_window]))
+                               (get-in settings [:index :refresh_interval])
+                               (assoc :refresh_interval (get-in settings [:index :refresh_interval])))]

Alternatively, filter nils after construction:

updated-settings (into {} (remove (comp nil? val) 
                            {:number_of_shards (str shard-count)
                             :number_of_replicas (get-in settings [:index :number_of_replicas])
                             ...}))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj` around lines
220 - 223, The updated-settings map currently includes keys whose values can be
nil (from get-in on settings) which will send explicit nulls to Elasticsearch;
modify how updated-settings is constructed so nil-valued entries are omitted
before sending—e.g., build the map using updated-settings and then filter out
entries with nil values (operate on the map produced from :number_of_shards,
:number_of_replicas, :max_result_window, :refresh_interval derived from
shard-count and settings) so only present settings are included; ensure the
symbol updated-settings is assigned the filtered result used downstream where
index settings are sent to ES.

206-210: Prefer seq or not-empty over (not (empty? ...)).

The idiomatic Clojure approach is to use seq (returns nil for empty/nil) or not-empty (returns the collection or nil) rather than (not (empty? ...)).

♻️ Suggested refactor
-            mappings (if (not (empty? mapping-resp-body))
+            mappings (if (seq mapping-resp-body)
                        (get-in mapping-resp-body [(keyword orig-index) :mappings])
                        (errors/throw-service-error
                          :internal-error
                          (format "error trying to get mapping of index %s" orig-index)))

             ;; get orig index settings
             settings-resp-body (esi-helper/get-settings conn orig-index)
-            settings (if (not (empty? settings-resp-body))
+            settings (if (seq settings-resp-body)
                        (get-in settings-resp-body [(keyword orig-index) :settings])
                        (errors/throw-service-error
                          :internal-error
                          (format "error trying to get settings of index %s" orig-index)))

Also applies to: 214-218

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj` around lines
206 - 210, Replace uses of (not (empty? ...)) with the idiomatic Clojure
predicate seq (or not-empty) in the index mapping checks; specifically update
the conditional that checks mapping-resp-body in the index mapping assignment
(the binding that defines mappings) and the other similar check around lines
214-218 to use (seq mapping-resp-body) (or (not-empty mapping-resp-body)) so the
condition returns nil for empty/nil collections and reads more idiomatically.
bootstrap-app/README.md (3)

147-148: Provide guidance for FAILED status.

The documentation lists FAILED as a possible status but doesn't explain what operators should do when resharding fails. Consider adding a note that directs operators to the rollback endpoint when status is FAILED, since rollback is only allowed before finalization (as noted in line 181).

📋 Suggested addition

Add guidance after line 148:

Expected reshard status options:
- IN_PROGRESS, COMPLETE, FAILED

**Note**: If the status is `FAILED`, use the rollback endpoint (see "Rollback a Resharding" section below) to revert the index to its original state before attempting to reshard again.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bootstrap-app/README.md` around lines 147 - 148, Update the README's reshard
status list to include operational guidance for FAILED: after the existing
"Expected reshard status options" list (IN_PROGRESS, COMPLETE, FAILED) add a
short note instructing operators to call the rollback endpoint when a resharding
job reports FAILED (referencing the "Rollback a Resharding" section), since
rollback is only allowed before finalization.

114-122: Add language specifier to fenced code block.

The code block should specify bash or shell as the language identifier for proper syntax highlighting and linting compliance.

📝 Proposed fix
-```
+```bash
 curl -i \
   -X POST \
   "http://localhost:3006/reshard/1_small_collections/start?num_shards=50&elastic_name=gran-elastic"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bootstrap-app/README.md` around lines 114 - 122, Update the fenced code block
in the README so it includes a shell language specifier (e.g., ```bash or
```shell) for the curl example; locate the curl snippet shown in the README.md
and change the opening fence from ``` to ```bash so the POST request example and
HTTP response are highlighted and linted correctly.

138-146: Add language specifier to fenced code block.

The code block should specify bash or shell as the language identifier for proper syntax highlighting and linting compliance.

📝 Proposed fix
-```
+```bash
 curl -i \
 	-H "Accept: application/json" \
 	http://localhost:3006/reshard/1_c1234_prov1/status?elastic_name=gran-elastic&task_id=abc:1234
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bootstrap-app/README.md` around lines 138 - 146, Update the fenced code block
that shows the curl example (the triple-backtick block containing the curl -i
... HTTP/1.1 200 OK response) to include a language specifier (e.g., bash or
shell) after the opening ``` so it becomes ```bash; this will enable proper
syntax highlighting and satisfy the linter for the README example.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj`:
- Around line 146-152: The task status builder currently only extracts failures
from (:response :failures) which is nil when Elasticsearch returns a top-level
task error; update the failures binding so it also captures a top-level :error
from resp (e.g. set failures to (or (get-in resp [:response :failures])
(when-let [e (:error resp)] [e]))) and include that in the constructed
full-status map (variables: completed, failures, description,
index-found-in-description, full-status) so downstream logic correctly treats
task-level errors as failures.

In `@indexer-app/src/cmr/indexer/services/index_set_service.clj`:
- Around line 744-756: Update get-reindex-task-status in
elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj to also extract the
top-level :error from the Elasticsearch _tasks response and include it in the
returned map (e.g., return {:completed completed :failures failures :error
error}). Then modify the completion/failure check in
indexer-app/src/cmr/indexer/services/index_set_service.clj (the code that calls
es-helper/get-reindex-task-status and then decides between (:COMPLETE ...) and
(:FAILED ...) before calling update-resharding-status) to treat a task as failed
if either :error is present or :failures is non-empty (i.e., check both keys
rather than only :failures).

---

Duplicate comments:
In `@indexer-app/src/cmr/indexer/services/index_set_service.clj`:
- Around line 688-692: Make the reshard flow atomic by ensuring a created target
index is removed if persisting the new index-set fails: call
es/create-copy-of-index as before, then wrap the subsequent update-index-set
call in a try/catch; on exception call es/delete-index (or equivalent cleanup)
for target-index to remove the partially created ES index, then rethrow the
error so the caller sees the failure. Update the code paths around
es/create-copy-of-index and update-index-set to perform this guarded cleanup (or
alternatively persist a "creating" marker in the index-set before creation and
clear it on success) so state cannot diverge if update-index-set fails.

---

Nitpick comments:
In `@bootstrap-app/README.md`:
- Around line 147-148: Update the README's reshard status list to include
operational guidance for FAILED: after the existing "Expected reshard status
options" list (IN_PROGRESS, COMPLETE, FAILED) add a short note instructing
operators to call the rollback endpoint when a resharding job reports FAILED
(referencing the "Rollback a Resharding" section), since rollback is only
allowed before finalization.
- Around line 114-122: Update the fenced code block in the README so it includes
a shell language specifier (e.g., ```bash or ```shell) for the curl example;
locate the curl snippet shown in the README.md and change the opening fence from
``` to ```bash so the POST request example and HTTP response are highlighted and
linted correctly.
- Around line 138-146: Update the fenced code block that shows the curl example
(the triple-backtick block containing the curl -i ... HTTP/1.1 200 OK response)
to include a language specifier (e.g., bash or shell) after the opening ``` so
it becomes ```bash; this will enable proper syntax highlighting and satisfy the
linter for the README example.

In `@indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj`:
- Around line 220-223: The updated-settings map currently includes keys whose
values can be nil (from get-in on settings) which will send explicit nulls to
Elasticsearch; modify how updated-settings is constructed so nil-valued entries
are omitted before sending—e.g., build the map using updated-settings and then
filter out entries with nil values (operate on the map produced from
:number_of_shards, :number_of_replicas, :max_result_window, :refresh_interval
derived from shard-count and settings) so only present settings are included;
ensure the symbol updated-settings is assigned the filtered result used
downstream where index settings are sent to ES.
- Around line 206-210: Replace uses of (not (empty? ...)) with the idiomatic
Clojure predicate seq (or not-empty) in the index mapping checks; specifically
update the conditional that checks mapping-resp-body in the index mapping
assignment (the binding that defines mappings) and the other similar check
around lines 214-218 to use (seq mapping-resp-body) (or (not-empty
mapping-resp-body)) so the condition returns nil for empty/nil collections and
reads more idiomatically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f9f55dba-7eac-4ff7-992c-d6702a34880a

📥 Commits

Reviewing files that changed from the base of the PR and between b52ce8f and d52e940.

📒 Files selected for processing (5)
  • bootstrap-app/README.md
  • elastic-utils-lib/src/cmr/elastic_utils/es_helper.clj
  • elastic-utils-lib/src/cmr/elastic_utils/es_index_helper.clj
  • indexer-app/src/cmr/indexer/data/index_set_elasticsearch.clj
  • indexer-app/src/cmr/indexer/services/index_set_service.clj
🚧 Files skipped from review as they are similar to previous changes (1)
  • elastic-utils-lib/src/cmr/elastic_utils/es_index_helper.clj

@jmaeng72 jmaeng72 merged commit 8f809ea into master Mar 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants