BM25opt

faster BM25 search algorithms in Python

based on https://github.com/dorianbrown/rank_bm25 by Dorian Brown

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0

News:

1.2.0 stopword filter in many languages (en, fr, es, pt, it, de, nl, sv, no, nn, da, ru, fi, hu, ga, id). Thanks, Peter Lindsten!
1.1.0 supports updating the index with new add_documents(), delete_documents() and update_documents() functions, see Example 4

Usage:

Input:

corpus is a list of strings, e.g. [ 'bla bla bla', 'this is document two', ... ]
question is a string, e.g. 'which text contains the word two?'
optional arguments:
- algo : BM25 algorithm, the default is 'okapi'; 'l' and 'plus' available
- tokenizer_function : the default is tokenizer_default, which splits on whitespace, lowercases, and removes common punctuation
- stopwords_filter : the default is None (no filtering), see Example 5 for usage
- idf_algo : default uses the same IDF as rank_bm25; values 'okapi', 'l' and 'plus' can override to fix dorianbrown/rank_bm25#35
- k1, b, epsilon, delta : constants with standard default values, see https://en.wikipedia.org/wiki/Okapi_BM25

Example 1:

This example uses the default tokenizer and the default BM25Okapi algorithm and returns the top 5 highest scoring document ids and scores.

# creating the index
bm25opt_index = BM25opt( corpus )

# search
results = bm25opt_index.topk( question, 5 )
print( 'results[0] id', results[0][0], 'results[0] score', results[0][1], 'results[0] document', corpus[ results[0][0] ] )
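
For intuition, the relationship between a score list and a top-k result can be sketched with the standard library. This is an illustrative sketch, not BM25opt's own code: the helper name topk_from_scores is hypothetical, and the (document id, score) pair layout is inferred from how results[0][0] and results[0][1] are used above.

```python
import heapq

def topk_from_scores(doc_scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    Hypothetical helper for illustration; not part of BM25opt.
    """
    # enumerate() pairs each score with its position in the corpus (the document id)
    return heapq.nlargest(k, enumerate(doc_scores), key=lambda pair: pair[1])

# made-up scores for a four-document corpus
example_scores = [0.0, 1.7, 0.4, 2.3]
print(topk_from_scores(example_scores, 2))  # [(3, 2.3), (1, 1.7)]
```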

Example 2:

This example returns the list of document scores (in the same order as the documents in corpus) and shows algorithm selection and a custom tokenizer function.

bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=some_tokenizer_function )
doc_scores = bm25opt_index.get_scores( question )
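
A custom tokenizer is a function that takes a document string and returns a list of tokens, which is the shape Example 3 relies on when it passes the output of tokenizer_default(document) to rank_bm25. A minimal sketch follows; some_tokenizer_function is the placeholder name from the example above, and the regular expression is an assumption chosen for illustration.

```python
import re

# Assumption for this sketch: a token is a run of letters, digits, or underscores
_TOKEN_RE = re.compile(r"\w+")

def some_tokenizer_function(text):
    """Lowercase the text and return its word tokens as a list of strings."""
    return _TOKEN_RE.findall(text.lower())

print(some_tokenizer_function("This is document two!"))  # ['this', 'is', 'document', 'two']
```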

Example 3: comparison with rank_bm25

This example shows the score list and the similarity with rank_bm25, but NOTE: BM25opt input is not tokenized beforehand.

corpus = [ ... ]
question = '...'
tokenized_corpus = [ tokenizer_default(document) for document in corpus ]
tokenized_question = tokenizer_default( question )

rank_bm25_index = BM25Okapi( tokenized_corpus )
bm25opt_index = BM25opt( corpus, algo='okapi' )

rank_bm25_scores = rank_bm25_index.get_scores( tokenized_question )
bm25opt_scores = bm25opt_index.get_scores( question )
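
To check the similarity, the two score lists can be compared element by element. The helper below is a hypothetical sketch using only the standard library; it assumes both sequences are in corpus order, and the tolerances are arbitrary illustrative values.

```python
import math

def scores_match(scores_a, scores_b, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if both score sequences have equal length and close values.

    Hypothetical helper for illustration; not part of BM25opt or rank_bm25.
    """
    if len(scores_a) != len(scores_b):
        return False
    # float() lets the helper accept plain lists as well as numpy arrays
    return all(
        math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(scores_a, scores_b)
    )

# with the variables from Example 3:
# print( scores_match( rank_bm25_scores, bm25opt_scores ) )
```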

Example 4: updating the index

# creating the index
bm25opt_index = BM25opt( corpus )

# add new documents
bm25opt_index.add_documents( corpus2 ) 

# delete from the index
delete_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to delete from the index
bm25opt_index.delete_documents( delete_ids )

# in-place update changed documents in the index
update_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to change
updated_documents = [ 'first changed document', 'second changed document', ... ]
bm25opt_index.update_documents( update_ids, updated_documents )

Example 5: stopwords filter

# the following languages are available: en, fr, es, pt, it, de, nl, sv, no, nn, da, ru, fi, hu, ga, id
bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=tokenizer_default, stopwords_filter=stop_words_filter('en') )

Notes:

This is an optimized variant of rank_bm25. The key insight is that almost everything can be calculated at index creation time in __init__(), resulting in a dict that maps each word to its list of per-document scores, e.g.

wsmap = {
  'word1': [ word1_doc1_score, word1_doc2_score, ... ],
  'word2': [ word2_doc1_score, word2_doc2_score, ... ],
  ...
}

The query function then just adds up the score lists for each word in the question, e.g.

question = 'word1 word2'
doc_scores = [ wsmap['word1'][0] + wsmap['word2'][0], wsmap['word1'][1] + wsmap['word2'][1], ... ]
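
The precompute-then-sum idea can be sketched in a few lines. This is not BM25opt's implementation: the function names build_wsmap and get_scores_from_wsmap are hypothetical, the scoring uses the Okapi BM25 formula as given on Wikipedia (BM25opt's IDF variants and constants may differ), and the default k1 and b are commonly used values assumed for the sketch.

```python
import math
from collections import Counter

def build_wsmap(tokenized_corpus, k1=1.5, b=0.75):
    """Precompute each word's BM25 score contribution to every document.

    Assumes a non-empty corpus containing at least one token.
    """
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    term_freqs = [Counter(doc) for doc in tokenized_corpus]

    # document frequency: number of documents containing each word
    doc_freq = Counter()
    for tf in term_freqs:
        doc_freq.update(tf.keys())

    wsmap = {}
    for word, df in doc_freq.items():
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        scores = []
        for doc, tf in zip(tokenized_corpus, term_freqs):
            f = tf[word]  # 0 when the word is absent from this document
            norm = k1 * (1.0 - b + b * len(doc) / avgdl)
            scores.append(idf * f * (k1 + 1.0) / (f + norm))
        wsmap[word] = scores
    return wsmap

def get_scores_from_wsmap(wsmap, n_docs, tokenized_question):
    """Query time: add up the precomputed score lists of the question words."""
    doc_scores = [0.0] * n_docs
    for word in tokenized_question:
        # words that never occur in the corpus contribute nothing
        for doc_id, score in enumerate(wsmap.get(word, ())):
            doc_scores[doc_id] += score
    return doc_scores

tokenized = [doc.split() for doc in ["bla bla bla", "this is document two"]]
wsmap = build_wsmap(tokenized)
print(get_scores_from_wsmap(wsmap, len(tokenized), "document two".split()))
```

All term-frequency and length-normalization work happens once in build_wsmap, so a query costs only one list addition per question word.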

Another important change is that inputs are passed un-tokenized and the tokenizer function is registered with the index. This avoids situations where the corpus is tokenized with a different function than the queries later. A simple tokenizer_default() function is provided as a default.
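
The mismatch this design guards against is easy to reproduce. In the sketch below, both tokenizers are hypothetical stand-ins: the corpus is indexed with a lowercasing tokenizer, and a query tokenized without lowercasing fails to find a word that is present in the index.

```python
def lowercase_tokenizer(text):
    """Hypothetical index-time tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()

def plain_tokenizer(text):
    """Hypothetical query-time tokenizer that forgets to lowercase."""
    return text.split()

corpus = ["This is document Two"]
indexed_words = {word for doc in corpus for word in lowercase_tokenizer(doc)}

question = "Two"
matched_same = [w for w in lowercase_tokenizer(question) if w in indexed_words]
matched_other = [w for w in plain_tokenizer(question) if w in indexed_words]

print(matched_same)   # ['two']
print(matched_other)  # []
```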
