feat(indexers): add framework for default hooks#140
Conversation
…ed as part of the package
frayle-ons
left a comment
There was a problem hiding this comment.
Looks really good! I've made a few comments about a few minor changes that might be good but it would also likely be fine as is.
Only one major issue I found was that the default pre-processing capitalisation hook is accepted as a post-processing hook, provided the colname=doc_text (or any legal colname of the SearchOutput df).
Should this be updated somehow so that the vectorstore rejects default hooks of type A being applied at type B hook locations?
The following shows how I produced this:
default_hook_capitalise = CapitalisationStandardisingHook(method="upper", colname="doc_text")
demo_vectorstore = VectorStore(
file_name="./DEMO/data/fake_soc_dataset.csv",
data_type="csv",
vectoriser=demo_vectoriser,
output_dir="./demo_vdb",
overwrite=True,
hooks={"search_postprocess": default_hook_capitalise},
)
Finally, the demo hooks notebook works perfectly well still. Should it be updated to document these changes?
…ss to remove unneeded internal method
Thanks - Yeah, I didn't want to put anything in place to block this as there's no harm in allowing that. Most custom hooks (whether direct functions, or extensions of the Base class) will be able to be applied to either inputs or outputs without restriction, and I couldn't think of a reason to prevent this one being used outside of input sanitisation even if that's what we intend it to be used for. Have I missed something downstream that allowing this would affect? |
Apologies, forgot to reply to this comment. |
…o keep robust across aggregation methods
frayle-ons
left a comment
There was a problem hiding this comment.
Looks good :), I've applied both pre-made hooks to the test script and they work well.
Happy to approve, but as a final request can we check can I suggest we add some inline comments to the code logic in each of the pre-made hooks? It would be good practice that they are well documented so that users can understand how they operate under the hood.
And do we have later changes to any notebooks such as the hooks notebook, to highlight the changes made in this PR?
…, update documentation and examples for hooks
Thanks @frayle-ons . Now that we're happy with the implementation, I've updated the DEMO notebook for hooks to show:
|
frayle-ons
left a comment
There was a problem hiding this comment.
I think this looks really good now - just some final cosmetic considerations/requests
There was a problem hiding this comment.
same dark mode issue as the spell check svg
| "\n", | ||
| "## Extending the HookBase class\n", | ||
| "\n", | ||
| "As well as the flexible hooks we provide for common tasks, we recognise that creating custom flexible callable hooks is not straightforward, due to the strict requirement that they can be called with a single input parameter which must be a `VectorStore` dataclass.\n", |
There was a problem hiding this comment.
constrasting earlier comment on this file, I think this section does explain well the positioning of the HookBase class and when it should be used over a function
…aph in demo, remove commented out softmax code
frayle-ons
left a comment
There was a problem hiding this comment.
LGTM, thanks for making the final changes!
📌 Add framework for flexible, common hooks to be provided as part of the package
✨ Summary
Introduces a
hookssubmodule withinindexersto provide robust, flexible hooks to the user to cover common pre- and post-processing tasks, and provide a base class should an advanced user want to implement custom ones.Note: this does not remove the ability of a user to define a hook function themselves, without relying on the base class, as is currently the recommended approach.
📜 Changes Introduced
hook_factorydefines base classes and shared utilitiesdefault_hookscontains particular flexible pre- or post-processing hooks we offer as part of the package✅ Checklist
terraform fmt&terraform validate)🔍 How to Test
uv builddist/classifai<version>.whlwith any optional dependencies you want to test withtest.pyfile as described below:uv run test.py; confirm it runs successfully, and thequery_textcolumn in the output is all capitalised.methodparameter within theCapitalisationStandardisingHookobject initialisation