sentence_transformers.util defines various helpful functions for working with text embeddings.
http_get(url, path)¶
Downloads a URL to a given path on disk
paraphrase_mining(model, sentences: List[str], show_progress_bar=False, batch_size=32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100)¶
Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.
model – SentenceTransformer model for embedding computation
sentences – A list of strings (texts or sentences)
show_progress_bar – Whether to display a progress bar
batch_size – Number of texts that are encoded simultaneously by the model
query_chunk_size – Search for the most similar pairs for #query_chunk_size queries at a time. Decrease to lower the memory footprint (increases run-time).
corpus_chunk_size – Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower the memory footprint (increases run-time).
max_pairs – Maximal number of text pairs returned.
top_k – For each sentence, we retrieve up to top_k other sentences.
Returns a list of triplets in the format [score, id1, id2]
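The core of paraphrase mining can be sketched in a few lines of plain PyTorch: embed all sentences, build the full cosine-similarity matrix, and keep the highest-scoring unique pairs. The sketch below (function name `paraphrase_mining_sketch` and the toy 2-d embeddings are illustrative, not part of the library) omits the chunking that the real function uses to bound memory; it assumes torch is installed:

```python
import torch

def paraphrase_mining_sketch(embeddings: torch.Tensor, top_k: int = 100,
                             max_pairs: int = 500000):
    """Return [score, id1, id2] triplets for the most similar pairs."""
    # Unit-normalize rows so a matrix product yields cosine similarities.
    norm = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    scores = norm @ norm.T
    scores.fill_diagonal_(-1)  # exclude trivial self-pairs

    pairs, seen = [], set()
    top_vals, top_idx = scores.topk(min(top_k, scores.size(1)), dim=1)
    for i in range(scores.size(0)):
        for val, j in zip(top_vals[i].tolist(), top_idx[i].tolist()):
            key = (min(i, j), max(i, j))  # deduplicate (i, j) vs (j, i)
            if key not in seen:
                seen.add(key)
                pairs.append([val, key[0], key[1]])
    pairs.sort(key=lambda p: -p[0])  # highest cosine similarity first
    return pairs[:max_pairs]

# Toy 2-d "embeddings": sentences 0 and 1 point in nearly the same direction.
emb = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = paraphrase_mining_sketch(emb, top_k=2)
```

In practice you would pass the output of model.encode(sentences, convert_to_tensor=True) instead of hand-built vectors; the returned triplets follow the documented [score, id1, id2] format.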
pytorch_cos_sim(a: torch.Tensor, b: torch.Tensor)¶
Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. This function can be used as a faster replacement for 1 - scipy.spatial.distance.cdist(a, b).
Returns a matrix with res[i][j] = cos_sim(a[i], b[j])
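The speed-up over a cdist loop comes from expressing all pairwise similarities as a single matrix product of row-normalized inputs. A minimal sketch of that computation (the name `cos_sim_sketch` is illustrative; torch is assumed):

```python
import torch

def cos_sim_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Row-normalize both inputs; the matrix product then contains every
    # pairwise cosine similarity in one batched, GPU-friendly operation.
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return a_norm @ b_norm.T

a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
b = torch.tensor([[1.0, 1.0]])
res = cos_sim_sketch(a, b)  # res[i][j] = cos_sim(a[i], b[j])
```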
semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 100000, top_k: int = 10)¶
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
query_embeddings – A 2 dimensional tensor with the query embeddings.
corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.
query_chunk_size – Process #query_chunk_size queries simultaneously. Increasing this value increases the speed, but requires more memory.
corpus_chunk_size – Scans the corpus #corpus_chunk_size entries at a time. Increasing this value increases the speed, but requires more memory.
top_k – Retrieve the top k matching entries. Note: if your corpus is larger than query_chunk_size, |Chunks| * top_k entries are returned
Returns a sorted list with decreasing cosine similarity scores. Entries are dictionaries with the keys 'corpus_id' and 'score'
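The retrieval step itself reduces to a similarity matrix plus a top-k selection per query. The sketch below (function name `semantic_search_sketch` and the toy embeddings are illustrative; the real function additionally chunks queries and corpus to bound memory) shows the shape of the result, assuming torch is installed:

```python
import torch

def semantic_search_sketch(query_embeddings, corpus_embeddings, top_k=10):
    # Cosine similarity between every query and every corpus entry.
    q = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
    c = torch.nn.functional.normalize(corpus_embeddings, p=2, dim=1)
    scores = q @ c.T
    top_vals, top_idx = scores.topk(min(top_k, c.size(0)), dim=1)
    # One result list per query, sorted by decreasing score, with the
    # documented {'corpus_id': ..., 'score': ...} entry format.
    return [
        [{"corpus_id": j.item(), "score": v.item()} for v, j in zip(vals, idx)]
        for vals, idx in zip(top_vals, top_idx)
    ]

queries = torch.tensor([[1.0, 0.0]])
corpus = torch.tensor([[0.8, 0.6], [1.0, 0.0], [0.0, 1.0]])
hits = semantic_search_sketch(queries, corpus, top_k=2)
```

With the real API you would compute both tensors via model.encode(..., convert_to_tensor=True) and pass them to semantic_search directly.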