Skip to content

feat(indexers): add framework for default hooks#140

Merged
lukeroantreeONS merged 9 commits intomainfrom
139-default-hooks
Mar 20, 2026
Merged

feat(indexers): add framework for default hooks#140
lukeroantreeONS merged 9 commits intomainfrom
139-default-hooks

Conversation

@lukeroantreeONS
Copy link
Collaborator

📌 Add framework for flexible, common hooks to be provided as part of the package

✨ Summary

Introduces a hooks submodule within indexers to provide robust, flexible hooks to the user to cover common pre- and post-processing tasks, and provide a base class should an advanced user want to implement custom ones.

Note: this does not remove the ability of a user to define a hook function themselves, without relying on the base class, as is currently the recommended approach.

📜 Changes Introduced

  • feat(indexers): hook_factory defines base classes and shared utilities
  • feat(indexers): default_hooks contains particular flexible pre- or post-processing hooks we offer as part of the package

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code passes linting with Ruff
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

  1. re-build the package with uv build
  2. re-install classifai from dist/classifai<version>.whl with any optional dependencies you want to test with
  3. create a test.py file as described below:
from classifai.indexers import VectorStore
from classifai.indexers.dataclasses import VectorStoreSearchInput
from classifai.indexers.hooks import CapitalisationStandardisingHook
from classifai.vectorisers import HuggingFaceVectoriser

demo_vectoriser = HuggingFaceVectoriser(model_name="sentence-transformers/all-MiniLM-L6-v2")

default_hook_capitalise = CapitalisationStandardisingHook(method="upper")

demo_vectorstore = VectorStore(
    file_name="./data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=demo_vectoriser,
    output_dir="./demo_vdb",
    overwrite=True,
    hooks={"search_preprocess": default_hook_capitalise},
)

query_df = VectorStoreSearchInput({"id": [1, 2], "query": ["apple merchant", "pub landlord"]})

results = demo_vectorstore.search(query_df, n_results=5)

print(results)
  1. run uv run test.py; confirm it runs successfully, and the query_text column in the output is all capitalised.
  2. repeat, changing the method parameter within the CapitalisationStandardisingHook object initialisation

@lukeroantreeONS lukeroantreeONS requested a review from a team as a code owner March 5, 2026 15:54
@lukeroantreeONS lukeroantreeONS linked an issue Mar 5, 2026 that may be closed by this pull request
@github-actions github-actions bot added the enhancement New feature or request label Mar 5, 2026
Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good! I've made a few comments about a few minor changes that might be good but it would also likely be fine as is.

Only one major issue I found was that the default pre-processing capitalisation hook is accepted as a post-processing hook, provided the colname=doc_text (or any legal colname of the SearchOutput df).

Should this be updated somehow so that the vectorstore rejects default hooks of type A being applied at type B hook locations?

The following shows how I produced this:

default_hook_capitalise = CapitalisationStandardisingHook(method="upper", colname="doc_text")

demo_vectorstore = VectorStore(
    file_name="./DEMO/data/fake_soc_dataset.csv",
    data_type="csv",
    vectoriser=demo_vectoriser,
    output_dir="./demo_vdb",
    overwrite=True,
    hooks={"search_postprocess": default_hook_capitalise},
)

Finally, the demo hooks notebook works perfectly well still. Should it be updated to document these changes?

@lukeroantreeONS
Copy link
Collaborator Author

Only one major issue I found was that the default pre-processing capitalisation hook is accepted as a post-processing hook, provided the colname=doc_text (or any legal colname of the SearchOutput df).

Thanks - Yeah, I didn't want to put anything in place to block this as there's no harm in allowing that. Most custom hooks (whether direct functions, or extensions of the Base class) will be able to be applied to either inputs or outputs without restriction, and I couldn't think of a reason to prevent this one being used outside of input sanitisation even if that's what we intend it to be used for.

Have I missed something downstream that allowing this would affect?

@lukeroantreeONS
Copy link
Collaborator Author

lukeroantreeONS commented Mar 13, 2026

Finally, the demo hooks notebook works perfectly well still. Should it be updated to document these changes?

Apologies, forgot to reply to this comment.
Yes - but I'll do that as a last step once we're happy with everything here, to make sure it reflects the final version of this default hooks utility.

frayle-ons
frayle-ons previously approved these changes Mar 16, 2026
Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :), I've applied both pre-made hooks to the test script and they work well.

Happy to approve, but as a final request can we check can I suggest we add some inline comments to the code logic in each of the pre-made hooks? It would be good practice that they are well documented so that users can understand how they operate under the hood.

And do we have later changes to any notebooks such as the hooks notebook, to highlight the changes made in this PR?

…, update documentation and examples for hooks
@lukeroantreeONS
Copy link
Collaborator Author

Looks good :), I've applied both pre-made hooks to the test script and they work well.

Happy to approve, but as a final request can we check can I suggest we add some inline comments to the code logic in each of the pre-made hooks? It would be good practice that they are well documented so that users can understand how they operate under the hood.

And do we have later changes to any notebooks such as the hooks notebook, to highlight the changes made in this PR?

Thanks @frayle-ons .
I've added the in-code descriptions to document the more complex behaviour in the DeduplicationHook, and added in the functuonality we had discussed around allowing lists of hooks be passed for a given task in the hooks dictionary.

Now that we're happy with the implementation, I've updated the DEMO notebook for hooks to show:

  • creating and using custom hooks as functions
  • using default_hooks
  • extending the HookBase abstract class

Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks really good now - just some final cosmetic considerations/requests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same dark mode issue as the spell check svg

"\n",
"## Extending the HookBase class\n",
"\n",
"As well as the flexible hooks we provide for common tasks, we recognise that creating custom flexible callable hooks is not straightforward, due to the strict requirement that they can be called with a single input parameter which must be a `VectorStore` dataclass.\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constrasting earlier comment on this file, I think this section does explain well the positioning of the HookBase class and when it should be used over a function

…aph in demo, remove commented out softmax code
Copy link
Contributor

@frayle-ons frayle-ons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for making the final changes!

@lukeroantreeONS lukeroantreeONS merged commit 9706568 into main Mar 20, 2026
5 checks passed
@lukeroantreeONS lukeroantreeONS deleted the 139-default-hooks branch March 20, 2026 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default Hooks

3 participants