jankovicsandras/bm25opt

BM25opt

faster BM25 search algorithms in Python

Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0


News:

  • 1.2.0 stopword filter in many languages (en, fr, es, pt, it, de, nl, sv, no, nn, da, ru, fi, hu, ga, id). Thanks, Peter Lindsten!
  • 1.1.0 supports updating the index with new add_documents(), delete_documents() and update_documents() functions, see Example 4

Usage:

Input:

  • corpus is a list of strings, e.g. [ 'bla bla bla', 'this is document two', ... ]
  • question is a string, e.g. 'which text contains the word two?'
  • optional arguments:
    • algo : BM25 algorithm, the default is 'okapi'; 'l' and 'plus' available
    • tokenizer_function : the default is tokenizer_default, which splits on whitespace, lowercases and removes common punctuation
    • stopwords_filter : the default is None (no filtering), see Example 5 for usage
    • idf_algo : default uses the same IDF as rank_bm25; values 'okapi', 'l' and 'plus' can override to fix dorianbrown/rank_bm25#35
    • k1, b, epsilon, delta : constants with standard default values, see https://en.wikipedia.org/wiki/Okapi_BM25

Example 1:

This example uses the default tokenizer and the default BM25Okapi algorithm and returns the top 5 highest scoring document ids and scores.

# creating the index
bm25opt_index = BM25opt( corpus )

# search
results = bm25opt_index.topk( question, 5 )
print( 'results[0] id', results[0][0], 'results[0] score', results[0][1], 'results[0] document', corpus[ results[0][0] ] )

Example 2:

This example returns the list of document scores (in the same order as the documents in corpus) and shows algorithm selection and a custom tokenizer function.

bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=some_tokenizer_function )
doc_scores = bm25opt_index.get_scores( question )
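
A custom tokenizer can be as small as the sketch below. It assumes that a tokenizer function takes one string and returns a list of token strings, which is what Example 3 suggests for tokenizer_default; the regular expression is only an illustration, not the rule BM25opt uses internally.

```python
import re

def some_tokenizer_function(text):
    """Hypothetical custom tokenizer: lowercase the text, then keep
    runs of letters, digits and underscores as tokens."""
    return re.findall(r"\w+", text.lower())

# Punctuation is dropped and every token is lowercased.
tokens = some_tokenizer_function("Which text contains the word TWO?")
print(tokens)
```

The same function would then be passed as tokenizer_function=some_tokenizer_function, as in the call above.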

Example 3: comparison with rank_bm25

This example computes the score lists with both libraries so they can be compared. NOTE: rank_bm25 expects input that is already tokenized, whereas BM25opt takes the raw strings and tokenizes them itself.

from rank_bm25 import BM25Okapi

corpus = [ ... ]
question = '...'
tokenized_corpus = [ tokenizer_default(document) for document in corpus ]
tokenized_question = tokenizer_default( question )

rank_bm25_index = BM25Okapi( tokenized_corpus )
bm25opt_index = BM25opt( corpus, algo='okapi' )

rank_bm25_scores = rank_bm25_index.get_scores( tokenized_question )
bm25opt_scores = bm25opt_index.get_scores( question )
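
One way to check how close the two score lists are is to compare them element by element. The helpers below are a sketch and not part of either library; the names max_abs_difference and scores_match, the tolerance and the sample numbers are all made up for illustration.

```python
import math

def max_abs_difference(scores_a, scores_b):
    """Largest absolute per-document difference between two score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    return max((abs(a - b) for a, b in zip(scores_a, scores_b)), default=0.0)

def scores_match(scores_a, scores_b, tolerance=1e-9):
    """True if both lists have the same length and agree within the tolerance."""
    return len(scores_a) == len(scores_b) and all(
        math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance)
        for a, b in zip(scores_a, scores_b)
    )

# Placeholder numbers standing in for rank_bm25_scores and bm25opt_scores.
example_a = [0.0, 1.25, 0.5]
example_b = [0.0, 1.25, 0.5000000001]
print(max_abs_difference(example_a, example_b), scores_match(example_a, example_b))
```

rank_bm25 returns a NumPy array from get_scores(); the helpers only iterate over their arguments, so either a list or an array should work.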

Example 4: updating the index

# creating the index
bm25opt_index = BM25opt( corpus )

# add new documents
bm25opt_index.add_documents( corpus2 ) 

# delete from the index
delete_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to delete from the index
bm25opt_index.delete_documents( delete_ids )

# in-place update changed documents in the index
update_ids = [ 1, 3, 5 ] # list of document ids (indices in corpus) to change
updated_documents = [ 'first changed document', 'second changed document', ... ]
bm25opt_index.update_documents( update_ids, updated_documents )

Example 5: stopwords filter

# the following languages are available: en, fr, es, pt, it, de, nl, sv, no, nn, da, ru, fi, hu, ga, id
bm25opt_index = BM25opt( corpus, algo='plus', tokenizer_function=tokenizer_default, stopwords_filter=stop_words_filter('en') )

Notes:

This is an optimized variant of rank_bm25. The key insight is that almost everything can be calculated at index creation time in __init__(), resulting in a dict that maps each word to a list of per-document scores, e.g.

wsmap = {
  'word1': [ word1_doc1_score, word1_doc2_score, ... ],
  'word2': [ word2_doc1_score, word2_doc2_score, ... ],
  ...
}

The query function then only has to add up the score lists of the words in the question, e.g.

question = 'word1 word2'
doc_scores = [ wsmap['word1'][0] + wsmap['word2'][0], wsmap['word1'][1] + wsmap['word2'][1], ... ]
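
The idea can be sketched in a few lines of self-contained Python. This is not the BM25opt source: build_wsmap and get_scores are illustrative names, the tokenizer is a plain str.split, k1=1.5 and b=0.75 are assumed defaults, and the IDF is the common log((N - n + 0.5) / (n + 0.5) + 1) form, which may differ from the IDF that rank_bm25 and BM25opt use. The corpus is assumed to be non-empty.

```python
import math
from collections import Counter

def build_wsmap(corpus, tokenizer=str.split, k1=1.5, b=0.75):
    """Precompute a word -> [score in doc 0, score in doc 1, ...] map."""
    docs = [tokenizer(document) for document in corpus]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    term_freqs = [Counter(doc) for doc in docs]
    # Number of documents that contain each word at least once.
    doc_freq = Counter(word for tf in term_freqs for word in tf)

    wsmap = {}
    for word, n_containing in doc_freq.items():
        # Assumed IDF variant, always positive.
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
        scores = []
        for tf, doc in zip(term_freqs, docs):
            freq = tf[word]  # 0 if the word is not in this document
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            scores.append(idf * freq * (k1 + 1) / (freq + length_norm))
        wsmap[word] = scores
    return wsmap, n_docs

def get_scores(wsmap, n_docs, question, tokenizer=str.split):
    """Query time: add up the precomputed score lists of the question words."""
    totals = [0.0] * n_docs
    for word in tokenizer(question):
        for doc_id, score in enumerate(wsmap.get(word, ())):
            totals[doc_id] += score
    return totals

corpus = ['bla bla bla', 'this is document two']
wsmap, n_docs = build_wsmap(corpus)
doc_scores = get_scores(wsmap, n_docs, 'document two')
print(doc_scores)
```

All logarithms and divisions happen once in build_wsmap; a query only does dict lookups and additions, and words that are not in the index contribute nothing.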

Another important change is that the inputs are not tokenized beforehand and the tokenizer function is registered with the index. This avoids situations where the corpus is tokenized with one function and the queries are later tokenized with a different one. A simple tokenizer_default() function is provided as a default.
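
The small demonstration below shows what can go wrong when the corpus and the queries are tokenized differently. Both tokenizers are hypothetical stand-ins, not functions from BM25opt.

```python
def tokenizer_a(text):
    """Hypothetical tokenizer used for the corpus: lowercase, split on whitespace."""
    return text.lower().split()

def tokenizer_b(text):
    """Hypothetical tokenizer used by mistake for queries: no lowercasing."""
    return text.split()

document = "Okapi BM25 ranking"
query = "Okapi BM25"

indexed_terms = set(tokenizer_a(document))

# Same tokenizer on both sides: both query words are found.
matched_consistent = [t for t in tokenizer_a(query) if t in indexed_terms]

# Different tokenizer for the query: nothing matches, so every score would be 0.
matched_mismatched = [t for t in tokenizer_b(query) if t in indexed_terms]

print(matched_consistent, matched_mismatched)
```

Storing the tokenizer inside the index object rules this mismatch out, because the same function is applied to documents and questions.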
