util

sentence_transformers.util defines helper functions for working with text embeddings.

sentence_transformers.util.http_get(url, path)

Downloads a URL to a given path on disk.
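For orientation, the behavior described above can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation: the helper name download_to_path and the temporary ".part" suffix are assumptions made for illustration.

```python
import os
import shutil
import urllib.request


def download_to_path(url: str, path: str) -> None:
    """Hypothetical stand-in for http_get: stream `url` into the file at `path`."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Write to a temporary file first so that an interrupted download
    # does not leave a truncated file at the final location.
    part_path = path + ".part"
    with urllib.request.urlopen(url) as response, open(part_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    os.replace(part_path, path)
```

Writing to a temporary file and renaming it afterwards is a design choice of this sketch; it keeps partially downloaded files from being mistaken for complete ones.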

sentence_transformers.util.paraphrase_mining(model, sentences: List[str], show_progress_bar=False, batch_size=32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100)

Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

Parameters
  • model – SentenceTransformer model for embedding computation

  • sentences – A list of strings (texts or sentences)

  • show_progress_bar – Whether to display a progress bar

  • batch_size – Number of texts that are encoded simultaneously by the model

  • query_chunk_size – Number of sentences for which the most similar pairs are searched at the same time. Decrease to lower the memory footprint (increases run time).

  • corpus_chunk_size – Number of other sentences that a sentence is compared against at the same time. Decrease to lower the memory footprint (increases run time).

  • max_pairs – Maximum number of text pairs returned.

  • top_k – For each sentence, up to top_k other sentences are retrieved.

Returns

A list of triplets in the format [score, id1, id2], where id1 and id2 are indices into sentences.
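The idea behind paraphrase mining can be illustrated with a brute-force sketch in plain Python. It operates on embeddings that have already been computed and ignores the chunking parameters, so it is only suitable for tiny inputs. The names cosine and mine_pairs are hypothetical and not part of the library.

```python
import heapq
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mine_pairs(
    embeddings: Sequence[Sequence[float]],
    max_pairs: int = 500000,
    top_k: int = 100,
) -> List[list]:
    """Hypothetical brute-force miner returning [score, id1, id2] triplets."""
    pairs = {}
    n = len(embeddings)
    for i in range(n):
        # Score sentence i against every other sentence, keep its top_k partners.
        scored = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        for score, j in heapq.nlargest(top_k, scored):
            # Store each pair once, with the smaller index first.
            pairs[(min(i, j), max(i, j))] = score

    # Highest-scoring pairs first, truncated to max_pairs.
    ranked = sorted(([s, a, b] for (a, b), s in pairs.items()), reverse=True)
    return ranked[:max_pairs]


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for score, id1, id2 in mine_pairs(embeddings, top_k=1):
    print(f"{score:.3f}\t{id1}\t{id2}")
```

With real data, the embeddings would come from the model, and the chunk sizes would bound how many comparisons are held in memory at once.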

sentence_transformers.util.pytorch_cos_sim(a: torch.Tensor, b: torch.Tensor)

Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. This function can be used as a faster replacement for 1 - scipy.spatial.distance.cdist(a, b, 'cosine').

Returns

Matrix with res[i][j] = cos_sim(a[i], b[j])
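To make the res[i][j] layout concrete, here is a small pure-Python sketch of one common formulation: normalize every row to unit length, then take all pairwise dot products. It uses nested lists instead of tensors, and the names normalize and cos_sim_matrix are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm else list(row))
    return out


def cos_sim_matrix(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Return res with res[i][j] = cos_sim(a[i], b[j])."""
    a_norm, b_norm = normalize(a), normalize(b)
    # After normalization, cosine similarity reduces to a dot product.
    return [[sum(x * y for x, y in zip(u, v)) for v in b_norm] for u in a_norm]


res = cos_sim_matrix([[1.0, 0.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
print(len(res), len(res[0]))  # rows follow a, columns follow b
```

The result has one row per entry of a and one column per entry of b, so the two inputs may contain different numbers of vectors as long as the vectors have the same dimension.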

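sentence_transformers.util.semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 100000, top_k: int = 10)

This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for information retrieval / semantic search on corpora of up to about 1 million entries.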

Parameters
  • query_embeddings – A 2 dimensional tensor with the query embeddings.

  • corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.

  • query_chunk_size – Number of queries processed simultaneously (default: 100). Increasing this value increases the speed but requires more memory.

  • corpus_chunk_size – Number of corpus entries scanned at a time (default: 100k). Increasing this value increases the speed but requires more memory.

  • top_k – Number of top matching entries to retrieve. Note: if your corpus is larger than corpus_chunk_size, |Chunks|*top_k entries are returned, where |Chunks| is the number of corpus chunks.

Returns

A list with one entry per query. Each entry is a list of dictionaries with the keys ‘corpus_id’ and ‘score’, sorted by decreasing cosine similarity score.
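The chunked scan over the corpus can be sketched in plain Python as follows. This is an illustration of the idea under simplifying assumptions, not the library's implementation: queries are handled one at a time (no query chunking), embeddings are nested lists rather than tensors, and the merged candidates are trimmed to top_k at the end. The name chunked_search is hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def chunked_search(
    query_embeddings: Sequence[Sequence[float]],
    corpus_embeddings: Sequence[Sequence[float]],
    corpus_chunk_size: int = 100000,
    top_k: int = 10,
) -> List[List[Dict[str, float]]]:
    """Hypothetical sketch: one hit list per query, best score first."""
    results = []
    for query in query_embeddings:
        candidates = []
        # Scan the corpus chunk by chunk so only one chunk is scored at a time.
        for start in range(0, len(corpus_embeddings), corpus_chunk_size):
            chunk = corpus_embeddings[start:start + corpus_chunk_size]
            scored = [(_cosine(query, emb), start + offset) for offset, emb in enumerate(chunk)]
            candidates.extend(heapq.nlargest(top_k, scored))
        # Merge the per-chunk candidates and keep the overall best top_k.
        best = heapq.nlargest(top_k, candidates)
        results.append([{"corpus_id": cid, "score": score} for score, cid in best])
    return results


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = chunked_search([[1.0, 0.0]], corpus, corpus_chunk_size=2, top_k=2)
for hit in hits[0]:
    print(hit["corpus_id"], round(hit["score"], 3))
```

Keeping only the top_k candidates of each chunk bounds memory use regardless of corpus size, which is the trade-off that corpus_chunk_size controls.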