diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 17dcc3ee..b29917e6 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -21,8 +21,10 @@ jobs: with: python-version: ${{ matrix.python-version }} - - name: Install package - run: pip install . + - name: Install dev-package + run: | + python -m pip install --upgrade pip + pip install -v -e . - name: Run tests run: python -m unittest diff --git a/CHANGELOG.md b/CHANGELOG.md index d1cb63ff..ed25a1c0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +## [0.4.1] - 2021-06-11 + +### Added + +* Added new keyword argument **`tfidf_matrix_dtype`** (the datatype for the tf-idf values of the matrix components). Allowed values are `numpy.float32` and `numpy.float64` (used by the required external package `sparse_dot_topn` version 0.3.1). Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit with less numerical precision than `numpy.float64`.) + +### Changed + +* Changed dependency on `sparse_dot_topn` from version 0.2.9 to 0.3.1. +* Changed the default datatype for cosine similarities from `numpy.float64` to `numpy.float32` to boost computational performance at the expense of numerical precision. +* Changed the default value of the keyword argument `max_n_matches` from 20 to the number of strings in `duplicates` (or `master`, if `duplicates` is not given). +* Changed the warning issued when the condition \[`include_zeroes=True` and `min_similarity` ≤ 0 and `max_n_matches` is not sufficiently high to capture all nonzero-similarity-matches\] is met into an exception.
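The `numpy.float32` trade-off noted in the changelog entries above can be demonstrated in isolation with plain numpy (a standalone sketch; the array of values below is illustrative and not string_grouper data):

```python
import numpy as np

# An illustrative block of 1000 tf-idf-like values (not real string_grouper data).
values64 = np.linspace(0.0, 1.0, 1000, dtype=np.float64)
values32 = values64.astype(np.float32)

# float32 halves the memory footprint of the matrix components...
assert values32.nbytes * 2 == values64.nbytes

# ...at the cost of numerical precision: roughly 6 reliable decimal digits
# versus roughly 15 for float64.
assert np.finfo(np.float32).precision == 6
assert np.finfo(np.float64).precision == 15
```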
+ +### Removed + +* Removed the keyword argument `suppress_warning`. + ## [0.4.0] - 2021-04-11 ### Added diff --git a/README.md b/README.md index 3ddc43c7..1b18c3c9 100644 --- a/README.md +++ b/README.md @@ -134,16 +134,16 @@ All functions are built using a class **`StringGrouper`**. This class can be use All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used: * **`ngram_size`**: The amount of characters in each n-gram. Default is `3`. + * **`tfidf_matrix_dtype`**: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`. Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit with less numerical precision than `numpy.float64`.) * **`regex`**: The regex string used to clean-up the input string. Default is `"[,-./]|\s"`. - * **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is `20`. + * **`max_n_matches`**: The maximum number of matches allowed per string in `master`. Default is the number of strings in `duplicates` (or `master`, if `duplicates` is not given). * **`min_similarity`**: The minimum cosine similarity for two strings to be considered a match. Defaults to `0.8` * **`number_of_processes`**: The number of processes used by the cosine similarity calculation. Defaults to `number of cores on a machine - 1.` * **`ignore_index`**: Determines whether indexes are ignored or not. If `False` (the default), index-columns will appear in the output, otherwise not. (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.) * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`.
(See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.) - * **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md) for a demonstration.) **Warning:** Make sure the kwarg `max_n_matches` is sufficiently high to capture ***all*** nonzero-similarity-matches, otherwise some zero-similarity-matches returned will be false. - * **`suppress_warning`**: when `min_similarity` ≤ 0 and `include_zeroes` is `True`, determines whether or not to suppress the message warning that `max_n_matches` may be too small. Defaults to `False`. + * **`include_zeroes`**: When `min_similarity` ≤ 0, determines whether zero-similarity matches appear in the output. Defaults to `True`. (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).) **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set, it must be sufficiently high to capture ***all*** nonzero-similarity-matches; otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`. To let `string_grouper` automatically use the appropriate value for `max_n_matches`, simply leave this kwarg unset. * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen. Allowed values are `'centroid'` (the default) and `'first'`. See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
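The `max_n_matches` fallback documented above can be sketched as a small helper (hypothetical; it only mirrors the defaulting logic this PR adds to `StringGrouper.__init__` and is not part of the library's API):

```python
from typing import Optional

def resolve_max_n_matches(max_n_matches: Optional[int],
                          master_size: int,
                          duplicates_size: Optional[int] = None) -> int:
    """Mirror the new default: when the kwarg is unset, use the number of
    strings in duplicates (or master, if duplicates is not given)."""
    if max_n_matches is None:
        return master_size if duplicates_size is None else duplicates_size
    return max_n_matches

assert resolve_max_n_matches(None, master_size=10, duplicates_size=4) == 4   # falls back to len(duplicates)
assert resolve_max_n_matches(None, master_size=10) == 10                     # falls back to len(master)
assert resolve_max_n_matches(20, master_size=10, duplicates_size=4) == 20    # explicit kwarg wins
```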
## Examples diff --git a/setup.py b/setup.py index f4b5ecb0..4b7dc00a 100644 --- a/setup.py +++ b/setup.py @@ -25,6 +25,6 @@ , 'scipy' , 'scikit-learn' , 'numpy' - , 'sparse_dot_topn>=0.2.6' + , 'sparse_dot_topn>=0.3.1' ] ) diff --git a/string_grouper/string_grouper.py b/string_grouper/string_grouper.py index 4ff7c7d1..d1612511 100644 --- a/string_grouper/string_grouper.py +++ b/string_grouper/string_grouper.py @@ -4,13 +4,14 @@ import multiprocessing from sklearn.feature_extraction.text import TfidfVectorizer from scipy.sparse.csr import csr_matrix +from scipy.sparse.lil import lil_matrix from scipy.sparse.csgraph import connected_components from typing import Tuple, NamedTuple, List, Optional, Union from sparse_dot_topn import awesome_cossim_topn from functools import wraps -import warnings DEFAULT_NGRAM_SIZE: int = 3 +DEFAULT_TFIDF_MATRIX_DTYPE: type = np.float32 # (only types np.float32 and np.float64 are allowed by sparse_dot_topn) DEFAULT_REGEX: str = r'[,-./]|\s' DEFAULT_MAX_N_MATCHES: int = 20 DEFAULT_MIN_SIMILARITY: float = 0.8 # minimum cosine similarity for an item to be considered a match @@ -21,8 +22,6 @@ # similar string index-columns with corresponding duplicates-index values DEFAULT_INCLUDE_ZEROES: bool = True # when the minimum cosine similarity <=0, determines whether zero-similarity # matches appear in the output -DEFAULT_SUPPRESS_WARNING: bool = False # when the minimum cosine similarity <=0 and zero-similarity matches are -# requested, determines whether or not to suppress the message warning that max_n_matches may be too small GROUP_REP_CENTROID: str = 'centroid' # Option value to select the string in each group with the largest # similarity aggregate as group-representative: GROUP_REP_FIRST: str = 'first' # Option value to select the first string in each group as group-representative: @@ -33,7 +32,8 @@ DEFAULT_ID_NAME: str = 'id' # used to name id-columns in the output of StringGrouper.get_matches LEFT_PREFIX: str = 'left_' # used to prefix 
columns on the left of the output of StringGrouper.get_matches RIGHT_PREFIX: str = 'right_' # used to prefix columns on the right of the output of StringGrouper.get_matches -MOST_SIMILAR_PREFIX: str = 'most_similar_' # used to prefix columns of the output of StringGrouper._get_nearest_matches +MOST_SIMILAR_PREFIX: str = 'most_similar_' # used to prefix columns of the output of +# StringGrouper._get_nearest_matches DEFAULT_MASTER_NAME: str = 'master' # used to name non-index column of the output of StringGrouper.get_nearest_matches DEFAULT_MASTER_ID_NAME: str = f'{DEFAULT_MASTER_NAME}_{DEFAULT_ID_NAME}' # used to name id-column of the output of # StringGrouper.get_nearest_matches @@ -141,7 +141,11 @@ class StringGrouperConfig(NamedTuple): Class with configuration variables. :param ngram_size: int. The amount of characters in each n-gram. Default is 3. - :param regex: str. The regex string used to cleanup the input string. Default is [,-./]|\s. + :param tfidf_matrix_dtype: type. The datatype for the tf-idf values of the matrix components. + Possible values allowed by sparse_dot_topn are np.float32 and np.float64. Default is np.float32. + (Note: np.float32 often leads to faster processing and a smaller memory footprint albeit with less precision + than np.float64.) + :param regex: str. The regex string used to clean up the input string. Default is '[,-./]|\s'. - :param max_n_matches: int. The maximum number of matches allowed per string. Default is 20. + :param max_n_matches: Optional[int]. The maximum number of matches allowed per string. Defaults to the number of strings in duplicates (or master, if duplicates is not given). :param min_similarity: float. The minimum cosine similarity for two strings to be considered a match. Defaults to 0.8. @@ -151,8 +155,6 @@ class StringGrouperConfig(NamedTuple): :param ignore_index: whether or not to exclude string Series index-columns in output. Defaults to False. :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches appear in the output. Defaults to True.
- :param suppress_warning: when min_similarity <=0 and include_zeroes=True, determines whether or not to supress - the message warning that max_n_matches may be too small. Defaults to False. :param replace_na: whether or not to replace NaN values in most similar string index-columns with corresponding duplicates-index values. Defaults to False. :param group_rep: str. The scheme to select the group-representative. Default is 'centroid'. @@ -160,14 +162,14 @@ """ ngram_size: int = DEFAULT_NGRAM_SIZE + tfidf_matrix_dtype: type = DEFAULT_TFIDF_MATRIX_DTYPE regex: str = DEFAULT_REGEX - max_n_matches: int = DEFAULT_MAX_N_MATCHES + max_n_matches: Optional[int] = None min_similarity: float = DEFAULT_MIN_SIMILARITY number_of_processes: int = DEFAULT_N_PROCESSES ignore_case: bool = DEFAULT_IGNORE_CASE ignore_index: bool = DEFAULT_DROP_INDEX include_zeroes: bool = DEFAULT_INCLUDE_ZEROES - suppress_warning: bool = DEFAULT_SUPPRESS_WARNING replace_na: bool = DEFAULT_REPLACE_NA group_rep: str = DEFAULT_GROUP_REP @@ -223,13 +225,23 @@ def __init__(self, master: pd.Series, self._duplicates: pd.Series = duplicates if duplicates is not None else None self._master_id: pd.Series = master_id if master_id is not None else None self._duplicates_id: pd.Series = duplicates_id if duplicates_id is not None else None + self._config: StringGrouperConfig = StringGrouperConfig(**kwargs) + if self._config.max_n_matches is None: + self._max_n_matches = len(self._master) if self._duplicates is None else len(self._duplicates) + else: + self._max_n_matches = self._config.max_n_matches + self._validate_group_rep_specs() + self._validate_tfidf_matrix_dtype() self._validate_replace_na_and_drop() self.is_build = False # indicates if the grouper was fit or not - self._vectorizer = TfidfVectorizer(min_df=1, analyzer=self.n_grams) - # After the StringGrouper is build, _matches_list will contain the indices and similarities of two matches + self._vectorizer = 
TfidfVectorizer(min_df=1, analyzer=self.n_grams, dtype=self._config.tfidf_matrix_dtype) + # After the StringGrouper is built, _matches_list will contain the indices and similarities of the matches self._matches_list: pd.DataFrame = pd.DataFrame() + # _true_max_n_matches will contain the true maximum number of matches over all strings in master if + # self._config.min_similarity <= 0 + self._true_max_n_matches = None def n_grams(self, string: str) -> List[str]: """ @@ -247,13 +259,22 @@ def n_grams(self, string: str) -> List[str]: def fit(self) -> 'StringGrouper': """Builds the _matches list which contains string matches indices and similarity""" master_matrix, duplicate_matrix = self._get_tf_idf_matrices() + # Calculate the matches using the cosine similarity - matches = self._build_matches(master_matrix, duplicate_matrix) + matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix) + if self._duplicates is None: - # the matrix of matches needs to be symmetric!!! (i.e., if A != B and A matches B; then B matches A) - # and each of its diagonal components must be equal to 1 - matches = StringGrouper._symmetrize_matrix_and_fix_diagonal(matches) - # retrieve all matches + # convert to lil format for best efficiency when setting matrix-elements + matches = matches.tolil() + # matrix diagonal elements must be exactly 1 (numerical precision errors introduced by + # floating-point computations in awesome_cossim_topn sometimes lead to unexpected results) + matches = StringGrouper._fix_diagonal(matches) + if self._max_n_matches < self._true_max_n_matches: + # the list of matches must be symmetric! 
(i.e., if A != B and A matches B; then B matches A) + matches = StringGrouper._symmetrize_matrix(matches) + matches = matches.tocsr() + + # build list from matrix self._matches_list = self._get_matches_list(matches) self.is_build = True return self @@ -270,8 +291,7 @@ def dot(self) -> pd.Series: @validate_is_fit def get_matches(self, ignore_index: Optional[bool] = None, - include_zeroes: Optional[bool] = None, - suppress_warning: Optional[bool] = None) -> pd.DataFrame: + include_zeroes: Optional[bool] = None) -> pd.DataFrame: """ Returns a DataFrame with all the matches and their cosine similarity. If optional IDs are used, returned as extra columns with IDs matched to respective data rows @@ -280,8 +300,6 @@ def get_matches(self, self._config.ignore_index. :param include_zeroes: when the minimum cosine similarity <=0, determines whether zero-similarity matches appear in the output. Defaults to self._config.include_zeroes. - :param suppress_warning: when min_similarity <=0 and include_zeroes=True, determines whether or not to suppress - the message warning that max_n_matches may be too small. Defaults to self._config.suppress_warning. 
""" def get_both_sides(master: pd.Series, duplicates: pd.Series, @@ -307,15 +325,13 @@ def prefix_column_names(data: Union[pd.Series, pd.DataFrame], prefix: str): ignore_index = self._config.ignore_index if include_zeroes is None: include_zeroes = self._config.include_zeroes - if suppress_warning is None: - suppress_warning = self._config.suppress_warning if self._config.min_similarity > 0 or not include_zeroes: matches_list = self._matches_list elif include_zeroes: # Here's a fix to a bug pointed out by one GitHub user (@nbcvijanovic): # the fix includes zero-similarity matches that are missing by default # in _matches_list due to our use of sparse matrices - non_matches_list = self._get_non_matches_list(suppress_warning) + non_matches_list = self._get_non_matches_list() matches_list = self._matches_list if non_matches_list.empty else \ pd.concat([self._matches_list, non_matches_list], axis=0, ignore_index=True) @@ -442,19 +458,20 @@ def _build_matches(self, master_matrix: csr_matrix, duplicate_matrix: csr_matrix tf_idf_matrix_1 = master_matrix tf_idf_matrix_2 = duplicate_matrix.transpose() - optional_kwargs = dict() - if self._config.number_of_processes > 1: - optional_kwargs = { - 'use_threads': True, - 'n_jobs': self._config.number_of_processes - } - - return awesome_cossim_topn(tf_idf_matrix_1, tf_idf_matrix_2, - self._config.max_n_matches, - self._config.min_similarity, - **optional_kwargs) + optional_kwargs = { + 'return_best_ntop': True, + 'use_threads': self._config.number_of_processes > 1, + 'n_jobs': self._config.number_of_processes + } + + return awesome_cossim_topn( + tf_idf_matrix_1, tf_idf_matrix_2, + self._max_n_matches, + self._config.min_similarity, + **optional_kwargs + ) - def _get_non_matches_list(self, suppress_warning=False) -> pd.DataFrame: + def _get_non_matches_list(self) -> pd.DataFrame: """Returns a list of all the indices of non-matching pairs (with similarity set to 0)""" m_sz, d_sz = len(self._master), len(self._master if 
self._duplicates is None else self._duplicates) all_pairs = pd.MultiIndex.from_product([range(m_sz), range(d_sz)], names=['master_side', 'dupe_side']) @@ -462,12 +479,12 @@ def _get_non_matches_list(self, suppress_warning=False) -> pd.DataFrame: missing_pairs = all_pairs.difference(matched_pairs) if missing_pairs.empty: return pd.DataFrame() - if (self._config.max_n_matches < d_sz) and not suppress_warning: - warnings.warn(f'WARNING: max_n_matches={self._config.max_n_matches} may be too small!\n' - f'\t\t Some zero-similarity matches returned may be false!\n' - f'\t\t To be absolutely certain all zero-similarity matches are true,\n' - f'\t\t try setting max_n_matches={d_sz} (the length of the Series parameter duplicates).\n' - f'\t\t To suppress this warning, set suppress_warning=True.') + if self._max_n_matches < self._true_max_n_matches: + raise Exception(f'\nERROR: Cannot return zero-similarity matches since \n' + f'\t\t max_n_matches={self._max_n_matches} is too small!\n' + f'\t\t Try setting max_n_matches={self._true_max_n_matches} (the \n' + f'\t\t true maximum number of matches over all strings in master)\n' + f'\t\t or greater, or do not set this kwarg at all.') missing_pairs = missing_pairs.to_frame(index=False) missing_pairs['similarity'] = 0 return missing_pairs @@ -513,8 +530,8 @@ def _get_nearest_matches(self, # For some weird reason, pandas' merge function changes int-datatype columns to float when NaN values # appear within them. So here we change them back to their original datatypes if possible: - if dupes_max_sim[master_id_label].dtype != self._master_id.dtype \ - and self._duplicates_id.dtype == self._master_id.dtype: + if dupes_max_sim[master_id_label].dtype != self._master_id.dtype and \ + self._duplicates_id.dtype == self._master_id.dtype: dupes_max_sim.loc[:, master_id_label] = \ dupes_max_sim.loc[:, master_id_label].astype(self._master_id.dtype) @@ -612,6 +629,13 @@ def _validate_group_rep_specs(self): f"Invalid option value for group_rep. 
The only permitted values are\n {group_rep_options}" ) + def _validate_tfidf_matrix_dtype(self): + dtype_options = (np.float32, np.float64) + if self._config.tfidf_matrix_dtype not in dtype_options: + raise Exception( + f"Invalid option value for tfidf_matrix_dtype. The only permitted values are\n {dtype_options}" + ) + def _validate_replace_na_and_drop(self): if self._config.ignore_index and self._config.replace_na: raise Exception("replace_na can only be set to True when ignore_index=False.") @@ -622,13 +646,16 @@ def _validate_replace_na_and_drop(self): ) @staticmethod - def _symmetrize_matrix_and_fix_diagonal(x_non_symmetric: csr_matrix) -> csr_matrix: - x_symmetric = x_non_symmetric.tolil() - r, c = x_symmetric.nonzero() - x_symmetric[c, r] = x_symmetric[r, c] - r = np.arange(x_symmetric.shape[0]) - x_symmetric[r, r] = 1 - return x_symmetric.tocsr() + def _fix_diagonal(m: lil_matrix) -> lil_matrix: + r = np.arange(m.shape[0]) + m[r, r] = 1 + return m + + @staticmethod + def _symmetrize_matrix(m_symmetric: lil_matrix) -> lil_matrix: + r, c = m_symmetric.nonzero() + m_symmetric[c, r] = m_symmetric[r, c] + return m_symmetric @staticmethod def _get_matches_list(matches: csr_matrix) -> pd.DataFrame: diff --git a/string_grouper/test/test_string_grouper.py b/string_grouper/test/test_string_grouper.py index 02b96dc4..f5f0aac8 100644 --- a/string_grouper/test/test_string_grouper.py +++ b/string_grouper/test/test_string_grouper.py @@ -3,17 +3,15 @@ import numpy as np from scipy.sparse.csr import csr_matrix from string_grouper.string_grouper import DEFAULT_MIN_SIMILARITY, \ - DEFAULT_MAX_N_MATCHES, DEFAULT_REGEX, \ - DEFAULT_NGRAM_SIZE, DEFAULT_N_PROCESSES, DEFAULT_IGNORE_CASE, \ + DEFAULT_REGEX, DEFAULT_NGRAM_SIZE, DEFAULT_N_PROCESSES, DEFAULT_IGNORE_CASE, \ StringGrouperConfig, StringGrouper, StringGrouperNotFitException, \ - match_most_similar, group_similar_strings, match_strings,\ + match_most_similar, group_similar_strings, match_strings, \ 
compute_pairwise_similarities from unittest.mock import patch -import warnings -def mock_symmetrize_matrix(a: csr_matrix) -> csr_matrix: - return a +def mock_symmetrize_matrix(x: csr_matrix) -> csr_matrix: + return x class SimpleExample(object): @@ -97,7 +95,7 @@ def test_config_defaults(self): """Empty initialisation should set default values""" config = StringGrouperConfig() self.assertEqual(config.min_similarity, DEFAULT_MIN_SIMILARITY) - self.assertEqual(config.max_n_matches, DEFAULT_MAX_N_MATCHES) + self.assertEqual(config.max_n_matches, None) self.assertEqual(config.regex, DEFAULT_REGEX) self.assertEqual(config.ngram_size, DEFAULT_NGRAM_SIZE) self.assertEqual(config.number_of_processes, DEFAULT_N_PROCESSES) @@ -135,6 +133,7 @@ def test_compute_pairwise_similarities(self): ], name='similarity' ) + expected_result = expected_result.astype(np.float32) pd.testing.assert_series_equal(expected_result, similarities) def test_compute_pairwise_similarities_data_integrity(self): @@ -202,7 +201,7 @@ def test_match_strings(self, mock_StringGouper): self.assertEqual(df, 'whatever') @patch( - 'string_grouper.string_grouper.StringGrouper._symmetrize_matrix_and_fix_diagonal', + 'string_grouper.string_grouper.StringGrouper._symmetrize_matrix', side_effect=mock_symmetrize_matrix ) def test_match_list_symmetry_without_symmetrize_function(self, mock_symmetrize_matrix_param): @@ -244,17 +243,17 @@ def test_match_list_symmetry_with_symmetrize_function(self): self.assertTrue(intersection.empty or len(upper) == len(upper_prime) == len(intersection)) @patch( - 'string_grouper.string_grouper.StringGrouper._symmetrize_matrix_and_fix_diagonal', + 'string_grouper.string_grouper.StringGrouper._fix_diagonal', side_effect=mock_symmetrize_matrix ) - def test_match_list_diagonal_without_the_fix(self, mock_symmetrize_matrix_param): + def test_match_list_diagonal_without_the_fix(self, mock_fix_diagonal): """test fails whenever _matches_list's number of self-joins is not equal to the number of 
strings""" # This bug is difficult to reproduce -- I mostly encounter it while working with very large datasets; # for small datasets setting max_n_matches=1 reproduces the bug simple_example = SimpleExample() df = simple_example.customers_df['Customer Name'] matches = match_strings(df, max_n_matches=1) - mock_symmetrize_matrix_param.assert_called_once() + mock_fix_diagonal.assert_called_once() num_self_joins = len(matches[matches['left_index'] == matches['right_index']]) num_strings = len(df) self.assertNotEqual(num_self_joins, num_strings) @@ -276,7 +275,7 @@ def test_zero_min_similarity(self): simple_example = SimpleExample() s_master = simple_example.customers_df['Customer Name'] s_dup = simple_example.whatever_series_1 - matches = match_strings(s_master, s_dup, max_n_matches=len(s_master), min_similarity=0) + matches = match_strings(s_master, s_dup, min_similarity=0) pd.testing.assert_frame_equal(simple_example.expected_result_with_zeroes, matches) def test_zero_min_similarity_small_max_n_matches(self): @@ -285,7 +284,6 @@ def test_zero_min_similarity_small_max_n_matches(self): simple_example = SimpleExample() s_master = simple_example.customers_df['Customer Name'] s_dup = simple_example.two_strings - warnings.simplefilter('error', UserWarning) with self.assertRaises(Exception): _ = match_strings(s_master, s_dup, max_n_matches=1, min_similarity=0) @@ -358,7 +356,7 @@ def test_build_matches(self): expected_matches = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]]) - np.testing.assert_array_equal(expected_matches, sg._build_matches(master, dupe).toarray()) + np.testing.assert_array_equal(expected_matches, sg._build_matches(master, dupe)[0].toarray()) def test_build_matches_list(self): """Should create the cosine similarity matrix of two series""" @@ -370,6 +368,7 @@ def test_build_matches_list(self): dupe_side = [0, 1] similarity = [1.0, 1.0] expected_df = pd.DataFrame({'master_side': master, 'dupe_side': dupe_side, 'similarity': similarity}) + 
expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg._matches_list) def test_case_insensitive_build_matches_list(self): @@ -382,6 +381,7 @@ def test_case_insensitive_build_matches_list(self): dupe_side = [0, 1] similarity = [1.0, 1.0] expected_df = pd.DataFrame({'master_side': master, 'dupe_side': dupe_side, 'similarity': similarity}) + expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg._matches_list) def test_get_matches_two_dataframes(self): @@ -396,6 +396,7 @@ def test_get_matches_two_dataframes(self): expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'similarity': similarity, 'right_side': right_side, 'right_index': right_index}) + expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg.get_matches()) def test_get_matches_single(self): @@ -410,6 +411,7 @@ def test_get_matches_single(self): expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'similarity': similarity, 'right_side': right_side, 'right_index': right_index}) + expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg.get_matches()) def test_get_matches_1_series_1_id_series(self): @@ -427,6 +429,7 @@ def test_get_matches_1_series_1_id_series(self): expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'left_id': left_side_id, 'similarity': similarity, 'right_id': right_side_id, 'right_side': right_side, 'right_index': right_index}) + expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg.get_matches()) def 
test_get_matches_2_series_2_id_series(self): @@ -446,6 +449,7 @@ def test_get_matches_2_series_2_id_series(self): expected_df = pd.DataFrame({'left_index': left_index, 'left_side': left_side, 'left_id': left_side_id, 'similarity': similarity, 'right_id': right_side_id, 'right_side': right_side, 'right_index': right_index}) + expected_df.loc[:, 'similarity'] = expected_df.loc[:, 'similarity'].astype(sg._config.tfidf_matrix_dtype) pd.testing.assert_frame_equal(expected_df, sg.get_matches()) def test_get_matches_raises_exception_if_unexpected_options_given(self):
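The two post-processing steps this diff splits out of the old `_symmetrize_matrix_and_fix_diagonal` — `_fix_diagonal` followed by `_symmetrize_matrix` — can be sketched on a small dense array (the real methods operate on a scipy `lil_matrix`; plain numpy and illustrative similarity values are used here only to keep the sketch self-contained):

```python
import numpy as np

# A non-symmetric "matches" matrix, as awesome_cossim_topn can produce when
# max_n_matches truncates the match list (values are illustrative).
m = np.array([[0.9, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.3, 0.0, 1.0]])

# Step 1 (cf. _fix_diagonal): force self-similarities to exactly 1;
# floating-point error in the cosine computation can otherwise leave them
# slightly below 1.
r = np.arange(m.shape[0])
m[r, r] = 1

# Step 2 (cf. _symmetrize_matrix): if A matches B, then B must match A.
rows, cols = m.nonzero()
m[cols, rows] = m[rows, cols]

assert np.array_equal(m, m.T)                  # now symmetric
assert np.array_equal(np.diag(m), np.ones(3))  # unit diagonal
```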