diff --git a/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb b/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb index 5e8ae2e..3465ea1 100644 --- a/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb +++ b/DEMO/custom_preprocessing_and_postprocessing_hooks.ipynb @@ -17,7 +17,7 @@ "\n", "- Remove punctuation from input queries before the VectorStore search process begins,\n", "- Capitalising all text in an input query to the Vectorstore search process,\n", - "- Deduplicate results based on the doc_id column so that duplicate knowledgebase entries are not returned,\n", + "- Deduplicate results based on the doc_label column so that duplicate knowledgebase entries are not returned,\n", "- Prevent users of the package from retrieving certain documents in your vectorstore,\n", "- Removing hate speech from any input text.\n", "\n", @@ -51,7 +51,7 @@ " - Takes in a body of text and searches the vector store for semantically similar knowledgebase samples.\n", "\n", "2. **`reverse_search()`** \n", - " - Takes in document IDs and searches the vector store for entries with those IDs.\n", + " - Takes in document labels and searches the vector store for entries with those labels.\n", "\n", "3. **`embed()`** \n", " - Takes in a body of text and uses the vectoriser model to convert the text into embeddings.\n", @@ -66,7 +66,7 @@ "\n", "This shows that the `VectorStore.search()` method expects:\n", "- An **input dataclass object** with columns `[id, query]`. \n", - "- To output an **output dataclass object** with columns `[query_id, query_text, doc_id, doc_text, rank, score]`.\n", + "- To output an **output dataclass object** with columns `[query_id, query_text, doc_label, doc_text, rank, score]`.\n", "\n", "The use of these dataclasses both helps the user of the package to understand what data needs to be provided to the Vectorstore and how a user should interact with the objects being returned by these VectorStore functions. 
Additionally, this ensures robustness of the package by checking that the correct columns are present in the data before operating on it. \n", "\n", @@ -217,7 +217,7 @@ "source": [ "The below code uses our dataclasses to set up some data to pass to the VectorStore search method, notice that:\n", " * an exclaimation mark in the query (that in some cases we may want to sanitise) is shown in the results. \n", - " * Also the results for the below query should also show several rows with the same ```'doc_id'``` value (because our example data file had multiple entries with the same id label)" + " * Also the results for the below query should show several rows with the same ```'doc_label'``` value (because our example data file had multiple entries with the same label value)" ] }, { @@ -277,8 +277,8 @@ "\n", "def drop_duplicates(input_data: VectorStoreSearchOutput) -> VectorStoreSearchOutput:\n", " # we want to depuplicate the ranking attribute of the pydantic model which is a pandas dataframe\n", - " # specifically we want to drop all but the first occurrence of each unique 'doc_id' value for each subset of query results\n", - " input_data = input_data.drop_duplicates(subset=[\"query_id\", \"doc_id\"], keep=\"first\")\n", + " # specifically we want to drop all but the first occurrence of each unique 'doc_label' value for each subset of query results\n", + " input_data = input_data.drop_duplicates(subset=[\"query_id\", \"doc_label\"], keep=\"first\")\n", "\n", " # BE CAREFUL: drop_duplicates returns an object of type DataFrame, not VectorStoreSearchOutput so we need to convert back to that type after this operation\n", " input_data = VectorStoreSearchOutput(input_data)\n", @@ -380,8 +380,8 @@ "outputs": [], "source": [ "def drop_duplicates_and_reset_rank(input_object: VectorStoreSearchOutput) -> VectorStoreSearchOutput:\n", - " # Remove duplicates based on 'query_id' and 'doc_id'\n", - " input_object = input_object.drop_duplicates(subset=[\"query_id\", \"doc_id\"], 
keep=\"first\")\n", + " # Remove duplicates based on 'query_id' and 'doc_label'\n", + " input_object = input_object.drop_duplicates(subset=[\"query_id\", \"doc_label\"], keep=\"first\")\n", "\n", " # Reset the rank column per query_id using .loc to avoid SettingWithCopyWarning\n", " input_object.loc[:, \"rank\"] = input_object.groupby(\"query_id\").cumcount()\n", @@ -507,7 +507,7 @@ "source": [ "### Injecting Data into our classification results with a hook\n", "\n", - "What if we had some additional context information that we wanted to add in our pipeline. It could be some official taxonomy definitions about our doc_id labels, such as SIC or SOC code definitions.\n", + "What if we had some additional context information that we wanted to add in our pipeline. It could be some official taxonomy definitions about our doc_labels, such as SIC or SOC code definitions.\n", "\n", "We may want to inject this extra information that's not directly stored as metadata in the knowledgebase, so that a downstream component (such as a RAG agent) can use the additional information" ] @@ -551,8 +551,8 @@ "outputs": [], "source": [ "def add_id_definitions(input_data: VectorStoreSearchOutput) -> VectorStoreSearchOutput:\n", - " # Map the 'doc_id' column to the corresponding definitions from the dictionary\n", - " input_data.loc[:, \"id_definition\"] = input_data[\"doc_id\"].map(official_id_definitions)\n", + " # Map the 'doc_label' column to the corresponding definitions from the dictionary\n", + " input_data.loc[:, \"id_definition\"] = input_data[\"doc_label\"].map(official_id_definitions)\n", "\n", " return input_data" ] @@ -661,11 +661,6 @@ " - try adding a new column of data to the reverse search results \n", " - make it so that if the user tries to reverse search for a specific ID that is 'secret' then that row is removed from the input data." 
] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] } ], "metadata": { diff --git a/DEMO/custom_vectoriser.ipynb b/DEMO/custom_vectoriser.ipynb index 41846d9..80cae78 100644 --- a/DEMO/custom_vectoriser.ipynb +++ b/DEMO/custom_vectoriser.ipynb @@ -85,7 +85,7 @@ "outputs": [], "source": [ "# we're going to use scikit learns countvectoriser to create our one hot embeddings - install in the terminal or uncomment the below code\n", - "# !pip install scikit-learn" + "# !pip install scikit-learn OR uv pip install scikit-learn" ] }, { @@ -319,13 +319,6 @@ "we can create our own custom vectoriser such as the one-hot encoding model shown here. \n", "Check out the other DEMO notebooks to see how use the Vectorstore and Vectorisers in other ways and how to deploy your search system over a RestAPI service :)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/DEMO/data/fake_soc_dataset.csv b/DEMO/data/fake_soc_dataset.csv index fbf215c..5459912 100644 --- a/DEMO/data/fake_soc_dataset.csv +++ b/DEMO/data/fake_soc_dataset.csv @@ -1,4 +1,4 @@ -id,text +label,text 101,"Fruit farmer: Grows and harvests fruits such as apples, oranges, and berries." 101,"Vegetable farmer: Cultivates and harvests vegetables like carrots, potatoes, and lettuce." 102,"Dairy farmer: Manages cows for milk production and processes dairy products." 
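Reviewer note, not part of the patch: with the CSV header renamed from `id,text` to `label,text`, any ad-hoc loading code must request the new column name (the indexer now passes `{"label": str, "text": str}` dtypes to polars). A minimal pandas sketch against an inline copy of the demo rows, for illustration only:

```python
import io

import pandas as pd

# Inline copy of the renamed header and first rows of DEMO/data/fake_soc_dataset.csv.
csv_text = """label,text
101,"Fruit farmer: Grows and harvests fruits such as apples, oranges, and berries."
101,"Vegetable farmer: Cultivates and harvests vegetables like carrots, potatoes, and lettuce."
102,"Dairy farmer: Manages cows for milk production and processes dairy products."
"""

# Read the label column as strings, mirroring the {"label": str, "text": str} dtypes
# that _create_vector_store_index now uses.
df = pd.read_csv(io.StringIO(csv_text), dtype={"label": str, "text": str})

# Duplicate labels are expected: several knowledgebase entries can share one label.
print(df["label"].tolist())  # → ['101', '101', '102']
```

Requesting the old `id` column after this rename would raise a missing-column error, which is the failure mode this patch is guarding downstream code against.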
diff --git a/DEMO/data/testdata.csv b/DEMO/data/testdata.csv index 986f723..6ea8954 100644 --- a/DEMO/data/testdata.csv +++ b/DEMO/data/testdata.csv @@ -1,4 +1,4 @@ -id,text,colour,country,language +label,text,colour,country,language 0001,The sun rises in the east.,Orange,India,Hindi 0002,The moon shines at night.,White,USA,English 0003,Rivers flow to the sea.,Blue,Brazil,Portuguese diff --git a/DEMO/general_workflow_demo.ipynb b/DEMO/general_workflow_demo.ipynb index a31eca0..8e80fe0 100644 --- a/DEMO/general_workflow_demo.ipynb +++ b/DEMO/general_workflow_demo.ipynb @@ -218,7 +218,7 @@ "source": [ "from classifai.indexers.dataclasses import VectorStoreReverseSearchInput\n", "\n", - "input_data_2 = VectorStoreReverseSearchInput({\"id\": [\"1\", \"2\"], \"doc_id\": [\"1100\", \"1056\"]})\n", + "input_data_2 = VectorStoreReverseSearchInput({\"id\": [\"1\", \"2\"], \"doc_label\": [\"1100\", \"1056\"]})\n", "\n", "my_vector_store.reverse_search(input_data_2)" ] @@ -227,7 +227,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## With reverse search you can do partial matching!\n", + "### With reverse search you can do partial matching!\n", "use the `partial match` flag to check if the **ids/labels** start with our query id" ] }, @@ -237,7 +237,7 @@ "metadata": {}, "outputs": [], "source": [ - "input_data_3 = VectorStoreReverseSearchInput({\"id\": [\"1\", \"2\"], \"doc_id\": [\"1100\", \"105\"]})\n", + "input_data_3 = VectorStoreReverseSearchInput({\"id\": [\"1\", \"2\"], \"doc_label\": [\"1100\", \"105\"]})\n", "\n", "my_vector_store.reverse_search(input_data_3, partial_match=True)" ] @@ -250,7 +250,7 @@ "source": [ "## use n_results to limit the amount of results per-item\n", "\n", - "my_vector_store.reverse_search(input_data_3, n_results=2, partial_match=True)" + "my_vector_store.reverse_search(input_data_3, max_n_results=2, partial_match=True)" ] }, { @@ -445,7 +445,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + 
"display_name": "classifai", "language": "python", "name": "python3" }, @@ -459,7 +459,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.10" + "version": "3.13.7" } }, "nbformat": 4, diff --git a/src/classifai/indexers/dataclasses.py b/src/classifai/indexers/dataclasses.py index b57cdff..f26560f 100644 --- a/src/classifai/indexers/dataclasses.py +++ b/src/classifai/indexers/dataclasses.py @@ -67,7 +67,7 @@ class VectorStoreSearchOutput(pd.DataFrame): Attributes: query_id (pd.Series): Identifier for the source query. query_text (pd.Series): The original query text. - doc_id (pd.Series): Identifier for the retrieved document. + doc_label (pd.Series): Identifier for the retrieved document. doc_text (pd.Series): The text content of the retrieved document. rank (pd.Series): The ranking position of the result (0-indexed, non-negative). score (pd.Series): The similarity score or relevance metric. @@ -77,7 +77,7 @@ { "query_id": pa.Column(str), "query_text": pa.Column(str), - "doc_id": pa.Column(str), + "doc_label": pa.Column(str), "doc_text": pa.Column(str), "rank": pa.Column(int, pa.Check.ge(0)), "score": pa.Column(float), @@ -117,8 +117,8 @@ def query_text(self) -> pd.Series: return self["query_text"] @property - def doc_id(self) -> pd.Series: - return self["doc_id"] + def doc_label(self) -> pd.Series: + return self["doc_label"] @property def doc_text(self) -> pd.Series: @@ -141,13 +141,13 @@ class VectorStoreReverseSearchInput(pd.DataFrame): Attributes: id (pd.Series): Unique identifier for the reverse search query. - doc_id (pd.Series): The document ID to find similar documents for. + doc_label (pd.Series): The document label to look up in the knowledgebase. 
""" _schema = pa.DataFrameSchema( { "id": pa.Column(str), - "doc_id": pa.Column(str), + "doc_label": pa.Column(str), }, coerce=True, ) @@ -179,8 +179,8 @@ def id(self) -> pd.Series: return self["id"] @property - def text(self) -> pd.Series: - return self["doc_id"] + def doc_label(self) -> pd.Series: + return self["doc_label"] class VectorStoreReverseSearchOutput(pd.DataFrame): @@ -190,16 +190,18 @@ class VectorStoreReverseSearchOutput(pd.DataFrame): containing knowledgebase examples with the same label as in the query. Attributes: - query_id (pd.Series): Identifier for the input label for lookup in the knowledgebase. - doc_id (pd.Series): Identifier for the knowledgebase example retrieved. - doc_text (pd.Series): The text content of the retrieved example. + id (pd.Series): Identifier for the reverse search query. + doc_label (pd.Series): The document label that was looked up in the knowledgebase. + retrieved_doc_label (pd.Series): Label of the retrieved knowledgebase example (matching the queried label). + retrieved_doc_text (pd.Series): The text content of the retrieved example. 
""" _schema = pa.DataFrameSchema( { "id": pa.Column(str), - "doc_id": pa.Column(str), - "doc_text": pa.Column(str), + "doc_label": pa.Column(str), + "retrieved_doc_label": pa.Column(str), + "retrieved_doc_text": pa.Column(str), } ) @@ -226,16 +228,20 @@ def validate(cls, df: pd.DataFrame) -> "VectorStoreReverseSearchOutput": return cls(validated_df) @property - def query_id(self) -> pd.Series: - return self["input_doc_id"] + def id(self) -> pd.Series: + return self["id"] @property - def doc_id(self) -> pd.Series: - return self["retrieved_doc_id"] + def doc_label(self) -> pd.Series: + return self["doc_label"] @property - def doc_text(self) -> pd.Series: - return self["doc_text"] + def retrieved_doc_label(self) -> pd.Series: + return self["retrieved_doc_label"] + + @property + def retrieved_doc_text(self) -> pd.Series: + return self["retrieved_doc_text"] class VectorStoreEmbedInput(pd.DataFrame): diff --git a/src/classifai/indexers/main.py b/src/classifai/indexers/main.py index 14d64ee..ff02992 100644 --- a/src/classifai/indexers/main.py +++ b/src/classifai/indexers/main.py @@ -278,8 +278,8 @@ def _create_vector_store_index(self): # noqa: C901 if self.data_type == "csv": self.vectors = pl.read_csv( self.file_name, - columns=["id", "text", *self.meta_data.keys()], - dtypes=self.meta_data | {"id": str, "text": str}, + columns=["label", "text", *self.meta_data.keys()], + dtypes=self.meta_data | {"label": str, "text": str}, ) self.vectors = self.vectors.with_columns( pl.Series("uuid", [str(uuid.uuid4()) for _ in range(self.vectors.height)]) ) @@ -439,16 +439,16 @@ def reverse_search( # noqa: C901 If using partial matching, matches if document label starts with query label. Args: - query (VectorStoreReverseSearchInput): A `VectorStoreReverseSearchInput` object containing the text query or - list of queries to search for with ids. 
+ query (VectorStoreReverseSearchInput): A `VectorStoreReverseSearchInput` object containing the `doc_labels` to + look up in the `VectorStore` and their corresponding ids. max_n_results (int): [optional] Number of top results to return for each query, set to -1 to return all results. Defaults to 100. - partial_match (bool): [optional] If `True`, the search behaviour is set to return results where the `document_id` - is prefixed by the query. Defaults to `False`. + partial_match (bool): [optional] If `True`, the search behaviour is set to return results where the `doc_label` + is a prefix of a vectorstore entry's label. Defaults to `False`. Returns: (VectorStoreReverseSearchOutput): A `VectorStoreReverseSearchOutput` object containing reverse search - results with columns for `query_id`, `query_text`, `document_id`, `document_text` and any associated metadata columns. + results with columns for `id`, `doc_label`, `retrieved_doc_label`, `retrieved_doc_text` and any associated metadata columns. Raises: `DataValidationError`: Raised if invalid arguments are passed. 
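Reviewer note, not part of the patch: the partial-match behaviour documented above (a query `doc_label` matches every stored label it is a prefix of) can be sketched in plain pandas. The real implementation in the hunk below uses polars joins; this data is hypothetical and for illustration only:

```python
import pandas as pd

# Hypothetical knowledgebase rows, using the renamed output column names.
docs = pd.DataFrame(
    {
        "retrieved_doc_label": ["1100", "1056", "1057"],
        "retrieved_doc_text": ["doc a", "doc b", "doc c"],
    }
)
# Two reverse search queries: "1100" is exact, "105" is a prefix of two labels.
query = pd.DataFrame({"id": ["1", "2"], "doc_label": ["1100", "105"]})

# Cross join, then keep only rows where the stored label starts with the query label.
out = query.merge(docs, how="cross")
out = out[
    out.apply(lambda r: r["retrieved_doc_label"].startswith(r["doc_label"]), axis=1)
]

records = out[["id", "doc_label", "retrieved_doc_label"]].to_dict("records")
```

Query "105" retrieves both "1056" and "1057", which mirrors the `str.starts_with` polars predicate used by `reverse_search` when `partial_match=True`.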
@@ -488,17 +488,25 @@ try: # polars conversion paired_query = pl.DataFrame( - {"id": query.id.astype(str).to_list(), "doc_id": query.doc_id.astype(str).to_list()} + {"id": query.id.astype(str).to_list(), "doc_label": query.doc_label.astype(str).to_list()} + ) + + # rename the vectors dataframe columns to the reverse search output schema before joining + docs = self.vectors.rename({"label": "retrieved_doc_label", "text": "retrieved_doc_text"}).with_columns( + pl.col("retrieved_doc_label").alias("retrieved_doc_label_copy") ) - paired_query = paired_query.rename({"doc_id": "query_docid"}) - docs = self.vectors.rename({"id": "doc_id"}) if partial_match: - out = docs.join_where(paired_query, pl.col("doc_id").str.starts_with(pl.col("query_docid"))) + out = docs.join_where(paired_query, pl.col("retrieved_doc_label").str.starts_with(pl.col("doc_label"))) else: - out = docs.join(paired_query.rename({"query_docid": "doc_id"}), on="doc_id", how="inner") - - out = out.sort(by=["id", "doc_id"], descending=[False, False]) + out = paired_query.join( + docs, + left_on="doc_label", + right_on="retrieved_doc_label", + how="inner", + ).rename({"retrieved_doc_label_copy": "retrieved_doc_label"}) + + out = out.sort(by=["id", "doc_label"], descending=[False, False]) if max_n_results != -1: out = out.group_by("id").head(max_n_results) @@ -506,8 +514,9 @@ final_table = out.select( [ pl.col("id").cast(str), - pl.col("doc_id").cast(str), - pl.col("text").cast(str).alias("doc_text"), + pl.col("doc_label").cast(str), + pl.col("retrieved_doc_label").cast(str), + pl.col("retrieved_doc_text").cast(str), *[pl.col(key) for key in self.meta_data], ] ) @@ -557,7 +566,7 @@ def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> V Returns: (VectorStoreSearchOutput): A `VectorStoreSearchOutput` object containing search results with columns for `query_id`, `query_text`,
associated metadata columns. + `doc_label`, `doc_text`, `rank`, `score`, and any associated metadata columns. Raises: `DataValidationError`: Raised if invalid arguments are passed. @@ -648,12 +657,14 @@ def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> V } ) - ranked_docs = self.vectors[idx_sorted.flatten().tolist()].select(["id", "text", *self.meta_data.keys()]) - merged_df = result_df.hstack(ranked_docs).rename({"id": "doc_id", "text": "doc_text"}) + ranked_docs = self.vectors[idx_sorted.flatten().tolist()].select( + ["label", "text", *self.meta_data.keys()] + ) + merged_df = result_df.hstack(ranked_docs).rename({"label": "doc_label", "text": "doc_text"}) merged_df = merged_df.with_columns( [ - pl.col("doc_id").cast(str), + pl.col("doc_label").cast(str), pl.col("doc_text").cast(str), pl.col("rank").cast(int), pl.col("score").cast(float), @@ -670,7 +681,7 @@ def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> V schema={ "query_id": pl.Utf8, "query_text": pl.Utf8, - "doc_id": pl.Utf8, + "doc_label": pl.Utf8, "doc_text": pl.Utf8, "rank": pl.Int64, "score": pl.Float64, @@ -680,7 +691,7 @@ def search(self, query: VectorStoreSearchInput, n_results=10, batch_size=8) -> V return VectorStoreSearchOutput.from_data(empty.to_dict(as_series=False)) reordered_df = pl.concat(all_results).select( - ["query_id", "query_text", "doc_id", "doc_text", "rank", "score", *self.meta_data.keys()] + ["query_id", "query_text", "doc_label", "doc_text", "rank", "score", *self.meta_data.keys()] ) result_df = VectorStoreSearchOutput.from_data(reordered_df.to_dict(as_series=False)) @@ -820,7 +831,7 @@ def from_filespace(cls, folder_path, vectoriser, hooks: dict | None = None): # context={"folder_path": folder_path, "vectors_path": vectors_path}, ) - required_columns = ["id", "text", "embeddings", "uuid", *deserialized_column_meta_data.keys()] + required_columns = ["label", "text", "embeddings", "uuid", *deserialized_column_meta_data.keys()] 
try: df = pl.read_parquet(vectors_path, columns=required_columns) diff --git a/src/classifai/servers/main.py b/src/classifai/servers/main.py index d31f660..139d2ef 100644 --- a/src/classifai/servers/main.py +++ b/src/classifai/servers/main.py @@ -25,14 +25,15 @@ ) from ..indexers.main import VectorStore from .pydantic_models import ( - ClassifaiData, - EmbeddingsList, - EmbeddingsResponseBody, - ResultsResponseBody, - RevClassifaiData, - RevResultsResponseBody, - convert_dataframe_to_pydantic_response, - convert_dataframe_to_reverse_search_pydantic_response, + EmbedRequestSet, + EmbedResponseBody, + ReverseSearchRequestSet, + ReverseSearchResponseBody, + SearchRequestSet, + SearchResponseBody, + convert_embedding_dataframe_to_pydantic_response, + convert_reverse_search_dataframe_to_pydantic_response, + convert_search_dataframe_to_pydantic_response, ) @@ -173,24 +174,18 @@ def _create_embedding_endpoint(router: APIRouter | FastAPI, endpoint_name: str, """ @router.post(f"/{endpoint_name}/embed", description=f"{endpoint_name} embedding endpoint") - async def embedding_endpoint(data: ClassifaiData) -> EmbeddingsResponseBody: + async def embedding_endpoint(data: EmbedRequestSet) -> EmbedResponseBody: input_ids = [x.id for x in data.entries] - documents = [x.description for x in data.entries] - - input_data = VectorStoreEmbedInput({"id": input_ids, "text": documents}) + input_texts = [x.text for x in data.entries] + # Create the input dataclass object and pass it to the vectorstore to get results. 
+ input_data = VectorStoreEmbedInput({"id": input_ids, "text": input_texts}) output_data = vector_store.embed(input_data) - returnable = [] - for _, row in output_data.iterrows(): - returnable.append( - EmbeddingsList( - idx=row["id"], - description=row["text"], - embedding=row["embedding"].tolist(), # Convert numpy array to list - ) - ) - return EmbeddingsResponseBody(data=returnable) + # post processing of the Vectorstore output object + formatted_result = convert_embedding_dataframe_to_pydantic_response(output_data) + + return formatted_result def _create_search_endpoint(router: APIRouter | FastAPI, endpoint_name: str, vector_store: VectorStore): @@ -208,7 +203,7 @@ def _create_search_endpoint(router: APIRouter | FastAPI, endpoint_name: str, vec @router.post(f"/{endpoint_name}/search", description=f"{endpoint_name} search endpoint") async def search_endpoint( - data: ClassifaiData, + data: SearchRequestSet, n_results: Annotated[ int, Query( @@ -216,15 +211,17 @@ ge=1, # Ensure at least one result is returned ), ] = 10, - ) -> ResultsResponseBody: + ) -> SearchResponseBody: + # Gather the ids and query texts from the request entries. input_ids = [x.id for x in data.entries] - queries = [x.description for x in data.entries] + queries = [x.query for x in data.entries] + # Create the input dataclass object and pass it to the vectorstore to get results. 
input_data = VectorStoreSearchInput({"id": input_ids, "query": queries}) output_data = vector_store.search(query=input_data, n_results=n_results) - ##post processing of the Vectorstore outputobject - formatted_result = convert_dataframe_to_pydantic_response( + # post processing of the Vectorstore output object + formatted_result = convert_search_dataframe_to_pydantic_response( df=output_data, meta_data=vector_store.meta_data, ) @@ -247,7 +244,7 @@ def _create_reverse_search_endpoint(router: APIRouter | FastAPI, endpoint_name: @router.post(f"/{endpoint_name}/reverse_search", description=f"{endpoint_name} reverse query endpoint") def reverse_search_endpoint( - data: RevClassifaiData, + data: ReverseSearchRequestSet, max_n_results: Annotated[ int | Literal[-1], Query(description="The max number of results to return, set to -1 to return all results."), ] = 100, @@ -255,18 +252,21 @@ partial_match: Annotated[ bool, Query(description="Flag to use partial `starts_with` matching for queries") ] = False, - ) -> RevResultsResponseBody: + ) -> ReverseSearchResponseBody: # Enforce the ≥1 rule manually, only when not -1 if max_n_results != -1 and max_n_results < 1: raise HTTPException(422, "max_n_results must be -1 or >= 1") + # Gather the ids and doc_labels from the request entries. input_ids = [x.id for x in data.entries] - queries = [x.code for x in data.entries] + queries = [x.doc_label for x in data.entries] - input_data = VectorStoreReverseSearchInput({"id": input_ids, "doc_id": queries}) + # Create the input dataclass object and pass it to the vectorstore to get results. 
+ input_data = VectorStoreReverseSearchInput({"id": input_ids, "doc_label": queries}) output_data = vector_store.reverse_search(input_data, max_n_results=max_n_results, partial_match=partial_match) - formatted_result = convert_dataframe_to_reverse_search_pydantic_response( + # post processing of the Vectorstore output object + formatted_result = convert_reverse_search_dataframe_to_pydantic_response( df=output_data, meta_data=vector_store.meta_data, ) diff --git a/src/classifai/servers/pydantic_models.py b/src/classifai/servers/pydantic_models.py index 3bb7b75..33266bc 100644 --- a/src/classifai/servers/pydantic_models.py +++ b/src/classifai/servers/pydantic_models.py @@ -5,126 +5,154 @@ from pydantic import BaseModel, Extra, Field -class ClassifaiEntry(BaseModel): - """Atomic model for a single row of input data (i.e. a single query input) , includes 'id' and - 'description' which are expected as str type. +class SearchRequestEntry(BaseModel): + """Atomic model for a single row of VectorStore search method input data (i.e. a single query input), includes 'id' and + 'query'. """ id: str = Field(examples=["1"]) - description: str = Field( - description="User string describing inforation need/query", - examples=["How to ice skate?"], + query: str = Field( + description="User string describing information need/query.", + examples=["Vegetable farmer"], ) -class ClassifaiData(BaseModel): - """Model for a list of many ClassifaiEntry pydantic models, i.e. several queries to be searched in the VectorStore. """ - entries: list[ClassifaiEntry] = Field(description="array of search queries to be searched in the VectorStore") +class SearchRequestSet(BaseModel): + """Model for a list of many SearchRequestEntry pydantic models, i.e. several queries to be searched in the VectorStore. """ + entries: list[SearchRequestEntry] = Field(description="array of search queries to be searched in the VectorStore.") -class ResultEntry(BaseModel): - """Atomic model for a single row of vector store result data (i.e. 
a single vectorstore entry), - includes 'label', 'description', 'score' and 'rank' which are expected as str, str, float and - int types respectively. - """ - - label: str - description: str - score: float - rank: int +class SearchResponseEntry(BaseModel): + """Atomic model for a single row of vector store search result data (i.e. a single vectorstore entry).""" - class Config: # pylint: disable=R0903 - """Sub-class to permit additional extra metadata (e.g., metadata columns from vectorstore - construction). - """ + doc_label: str = Field(description="The vectorstore row label of the relevant result entry.") + doc_text: str = Field(description="The vectorstore row text of the relevant result entry.") + rank: int = Field(description="The rank of the result entry for the given query, with 0 being the most relevant.") + score: float = Field(description="The similarity score of the result entry for the given query.") - extra = Extra.allow + class Config: + extra = Extra.allow # Allow extra keys (e.g., metadata columns) -class ResultsList(BaseModel): - """Model for a list of many ResultEntry pydantic models, representing a ranked list of vector - store search results. +class SearchResponseSet(BaseModel): + """Model for a list of many SearchResponseEntry pydantic models, representing a ranked list of vector + store search results for a provided query. """ - input_id: str - response: list[ResultEntry] + query_id: str = Field(description="The id of the query input for which these are the search results.") + query_text: str = Field(description="The text of the query input for which these are the search results.") + entries: list[SearchResponseEntry] = Field( + description="array of search results for the given query, ranked by relevance to the query." + ) -class ResultsResponseBody(BaseModel): - """Model for set of ranked lists, corresponding to multiple input queries and their own ranked - ResultsLists. 
+class SearchResponseBody(BaseModel): + """Model for the overall search response body, which includes a list of SearchResponseSet objects, + representing the search results for each input query. """ - data: list[ResultsList] + data: list[SearchResponseSet] -class RevClassifaiEntry(BaseModel): - """Atomic model for a single row of reverse search data includes 'id' and 'code' which are expected - as str type. """ +class ReverseSearchRequestEntry(BaseModel): + """Atomic model for a single row of reverse search data, includes 'id' and 'doc_label'.""" id: str = Field(examples=["1"]) - code: str = Field( - examples=["0001"], description="VectorStore row entry 'ID' to be looked up, searched in the 'id'column." - ) + doc_label: str = Field( + examples=["101"], + description="VectorStore row entry label to be looked up, searched in the 'label' column.", + ) -class RevClassifaiData(BaseModel): - """Model for a list of many RevClassifaiEntry pydantic models, i.e. several vectorstore row entry - codes to be looked up in the VectorStore. """ - entries: list[RevClassifaiEntry] = Field(description="array of VectorStore row entry IDs to be retrieved") +class ReverseSearchRequestSet(BaseModel): + """Model for a list of many ReverseSearchRequestEntry pydantic models, i.e. several vectorstore row entry + labels to be looked up in the VectorStore. """ + entries: list[ReverseSearchRequestEntry] = Field(description="array of VectorStore row entry labels to look up.") -class RevResultEntry(BaseModel): - """Atomic model for single reverse search result entry, includes 'label' and 'description' which +class ReverseSearchResponseEntry(BaseModel): + """Atomic model for a single reverse search result entry, includes 'retrieved_doc_label' and 'retrieved_doc_text' which are expected as str types. 
""" - label: str - description: str + retrieved_doc_label: str + retrieved_doc_text: str class Config: extra = Extra.allow # Allow extra keys (e.g., metadata columns) -class RevResultsList(BaseModel): - """Model for a list of many RevResultEntry pydnatic models, representing a list of vector store - entries found matching an input RevClassifaiEntry 'id'. +class ReverseSearchResponseSet(BaseModel): + """Model for a list of many ReverseSearchResponseEntry pydantic models, representing a list of vector store + entries found (partially) matching an input 'doc_label' and corresponding input 'id'. """ - input_id: str - response: list[RevResultEntry] + input_id: str = Field( + description="The id of the vectorstore row entry input for which these are the reverse search results." + ) + doc_label: str = Field( + description="The vectorstore row entry label that was looked up in the reverse search query." + ) + entries: list[ReverseSearchResponseEntry] = Field( + description="array of reverse search results for the given vectorstore row entry, matching (partially) the input doc_label." + ) -class RevResultsResponseBody(BaseModel): - """Model for set of reverse ranked lists, corresponding to multiple input RevClassifaiEntry and - their own RevResultsList. +class ReverseSearchResponseBody(BaseModel): + """Model for the overall reverse search response body, which includes a list of ReverseSearchResponseSet + objects, representing the reverse search results for each input vectorstore row entry 'id'. 
""" - data: list[RevResultsList] + data: list[ReverseSearchResponseSet] -class EmbeddingsList(BaseModel): - """model for set of embeddings lists, for all row entries submmitted.""" +class EmbedRequestEntry(BaseModel): + """Atomic model for a single text string to be embedded with VectorStore embed method with associated 'id'.""" - idx: str - description: str - embedding: list + id: str = Field(description="The id of the text entry to be embedded.", examples=["1"]) + text: str = Field( + description="The text string to be embedded.", examples=["A string to be converted to vector embedding."] + ) -class EmbeddingsResponseBody(BaseModel): - """model for set of list of embeddings, for all row entries submmitted.""" +class EmbedRequestSet(BaseModel): + """Model for a list of many EmbedRequestEntry pydantic models, representing several text strings to be embedded with the VectorStore embed method.""" - data: list[EmbeddingsList] + entries: list[EmbedRequestEntry] = Field( + description="array of text entries to be embedded, with their corresponding text and id" + ) -def convert_dataframe_to_reverse_search_pydantic_response(df: pd.DataFrame, meta_data: dict) -> RevResultsResponseBody: - """Convert a Pandas DataFrame into a JSON object conforming to the RevResultsResponseBody Pydantic +class EmbedResponseEntry(BaseModel): + """Atomic model for a single embedding result entry, includes 'id', 'text' and 'embedding'.""" + + id: str = Field(description="The id of the text entry that was embedded.") + text: str = Field(description="The text string that was embedded.") + embedding: list = Field( + description="The vector embedding result for the input text string, represented as a list of floats." 
+    )
+
+    class Config:
+        extra = Extra.allow  # Allow extra keys (e.g., metadata columns)
+
+
+class EmbedResponseBody(BaseModel):
+    """Model for a set of EmbedResponseEntry pydantic objects, for all row entries submitted to the embed VectorStore method."""
+
+    data: list[EmbedResponseEntry] = Field(
+        description="Array of embedding results, with their corresponding text and id."
+    )
+
+
+def convert_reverse_search_dataframe_to_pydantic_response(
+    df: pd.DataFrame, meta_data: dict
+) -> ReverseSearchResponseBody:
+    """Convert a VectorStoreReverseSearchOutput DataFrame into a JSON object conforming to the ReverseSearchResponseBody Pydantic
     model.

     Args:
@@ -132,7 +160,7 @@ def convert_dataframe_to_reverse_search_pydantic_response(df: pd.DataFrame, meta
         meta_data (dict): dictionary of metadata column names mapping to their types.

     Returns:
-        RevResultsResponseBody: Pydantic model containing the structured response.
+        ReverseSearchResponseBody: Pydantic model containing the API structured result for the reverse search VectorStore method.
""" # identify metadata columns from the DataFrame by checking which columns are in the meta_data dictionary hook_columns = ( @@ -141,8 +169,9 @@ def convert_dataframe_to_reverse_search_pydantic_response(df: pd.DataFrame, meta .difference( { "id", - "doc_id", - "doc_text", + "doc_label", + "retrieved_doc_label", + "retrieved_doc_text", } ) ) @@ -155,7 +184,7 @@ def convert_dataframe_to_reverse_search_pydantic_response(df: pd.DataFrame, meta # Convert group_df to a list of dictionaries rows_as_dicts = group_df.to_dict(orient="records") - # Build the list of RevResultEntry objects for the current group + # Build the list of ReverseSearchResponseEntry objects for the current group response_entries = [] for row in rows_as_dicts: # Extract metadata columns dynamically @@ -164,39 +193,41 @@ def convert_dataframe_to_reverse_search_pydantic_response(df: pd.DataFrame, meta # Find other values - added by hooks - any other per-row columns not in reserved/meta other_values = {k: v for k, v in row.items() if k in hook_columns} - # Create a RevResultEntry object + # Create a ReverseSearchResponseEntry object response_entries.append( - RevResultEntry( - label=row["doc_id"], - description=row["doc_text"], + ReverseSearchResponseEntry( + retrieved_doc_label=row["retrieved_doc_label"], + retrieved_doc_text=row["retrieved_doc_text"], **metadata_values, # Add metadata dynamically **other_values, # Add any extra columns dynamically ) ) - # Create a RevResultsList object for the current `id` + # Create a ReverseSearchResponseSet object for the current `id` and 'doc_label' results_list.append( - RevResultsList( + ReverseSearchResponseSet( input_id=input_id, - response=response_entries, + doc_label=group_df["doc_label"].iloc[0], # Assuming `doc_label` is the same for all rows in the group + entries=response_entries, ) ) - # Create the RevResultsResponseBody object - response_body = RevResultsResponseBody(data=results_list) + # Create the ReverseSearchResponseBody object to be returned + 
response_body = ReverseSearchResponseBody(data=results_list) return response_body -def convert_dataframe_to_pydantic_response(df: pd.DataFrame, meta_data: dict) -> ResultsResponseBody: - """Convert a Pandas DataFrame into a JSON object conforming to the ResultsResponseBody Pydantic model. +def convert_search_dataframe_to_pydantic_response(df: pd.DataFrame, meta_data: dict) -> SearchResponseBody: + """Convert a VectorStoreSearchOutput DataFrame into a JSON object conforming to the SearchResponseBody Pydantic + model. Args: - df (pd.DataFrame): Pandas DataFrame containing query results. + df (pd.DataFrame): Pandas DataFrame containing search results. meta_data (dict): dictionary of metadata column names mapping to their types. Returns: - ResultsResponseBody: Pydantic model containing the structured response. + SearchResponseBody: Pydantic model containing the API structured results for search VectorStore method. """ # identify metadata columns from the DataFrame by checking which columns are in the meta_data dictionary hook_columns = ( @@ -206,7 +237,7 @@ def convert_dataframe_to_pydantic_response(df: pd.DataFrame, meta_data: dict) -> { "query_id", "query_text", - "doc_id", + "doc_label", "doc_text", "score", "rank", @@ -222,7 +253,7 @@ def convert_dataframe_to_pydantic_response(df: pd.DataFrame, meta_data: dict) -> # Convert group_df to a list of dictionaries rows_as_dicts = group_df.to_dict(orient="records") - # Build the list of ResultEntry objects for the current group + # Build the list of SearchResponseEntry objects for the current group response_entries = [] for row in rows_as_dicts: # Extract metadata columns dynamically @@ -231,27 +262,66 @@ def convert_dataframe_to_pydantic_response(df: pd.DataFrame, meta_data: dict) -> # Find other values - added by hooks - any other per-row columns not in reserved/meta other_values = {k: v for k, v in row.items() if k in hook_columns} - # Create a ResultEntry object + # Create a SearchResponseEntry object 
             response_entries.append(
-                ResultEntry(
-                    label=row["doc_id"],
-                    description=row["doc_text"],
-                    score=row["score"],  # Assuming `score` is a column in the DataFrame
+                SearchResponseEntry(
+                    doc_label=row["doc_label"],
+                    doc_text=row["doc_text"],
+                    rank=row["rank"],  # Assuming `rank` is a column in the DataFrame
+                    score=row["score"],  # Assuming `score` is a column in the DataFrame
                     **metadata_values,  # Add metadata dynamically
                     **other_values,  # Add any extra columns dynamically
                 )
             )

-        # Create a ResultsList object for the current query_id
+        # Create a SearchResponseSet object for the current 'query_id' and 'query_text'
         results_list.append(
-            ResultsList(
-                input_id=query_id,  # type: ignore[arg-type]
-                response=response_entries,
+            SearchResponseSet(
+                query_id=query_id,
+                query_text=group_df["query_text"].iloc[0],
+                entries=response_entries,
             )
         )

-    # Create the ResultsResponseBody object
-    response_body = ResultsResponseBody(data=results_list)
+    # Create the SearchResponseBody object to be returned
+    response_body = SearchResponseBody(data=results_list)
+
+    return response_body
+
+
+def convert_embedding_dataframe_to_pydantic_response(df: pd.DataFrame) -> EmbedResponseBody:
+    """Convert a VectorStoreEmbedOutput DataFrame into a JSON object conforming to the EmbedResponseBody Pydantic
+    model.
+
+    Unlike the conversion functions for search and reverse search, this function does not take a meta_data
+    dictionary as an argument: metadata comes from the VectorStore, which is not accessed during the embedding
+    process, so there are no reserved metadata columns to check for. Instead, any extra columns in the DataFrame
+    that are not 'id', 'text' or 'embedding' are identified as "hook" columns, which may have been added by a
+    user with a custom hook attached to the embed method.
+
+    Args:
+        df (pd.DataFrame): Pandas DataFrame containing embedding results.
+
+    Returns:
+        EmbedResponseBody: Pydantic model containing the API structured results for the embed VectorStore method.
+ """ + # identify hook columns from the DataFrame by checking which columns are in the required columns + hook_columns = set(df.columns).difference( + { + "id", + "text", + "embedding", + } + ) + + # Build the list of EmbedResponseEntry objects for the current group + response_entries = [] + for _, row in df.iterrows(): + other_values = {k: v for k, v in row.items() if k in hook_columns} + + # Create an EmbedResponseEntry object + response_entries.append( + EmbedResponseEntry( + id=row["id"], + text=row["text"], + embedding=row["embedding"].tolist(), # Convert numpy array to list for JSON serialization + **other_values, # Add any extra columns dynamically + ) + ) + # Create the EmbedResponseBody object to be returned + response_body = EmbedResponseBody(data=response_entries) return response_body