util
sentence_transformers.util defines helper functions for working with text embeddings.
- sentence_transformers.util.community_detection(embeddings, threshold=0.75, min_community_size=10, batch_size=1024)
Function for fast community detection. Finds all communities in the embeddings, i.e. groups of embeddings that are close to each other (closer than threshold). Returns only communities that are larger than min_community_size. The communities are returned in decreasing order of size. The first element in each list is the central point of the community.
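The behaviour described above can be illustrated with a small dependency-free sketch. This is not the library's implementation: the helper names `cos` and `community_sketch` are hypothetical, embeddings are plain lists rather than tensors, closeness is assumed to mean cosine similarity, and the size cutoff is treated as inclusive, which is also an assumption.

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def community_sketch(embeddings, threshold=0.75, min_community_size=2):
    """Hypothetical helper: greedy, non-overlapping communities of indices."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # Neighbours of i at or above the threshold, most similar first.
        scored = sorted(
            ((cos(embeddings[i], embeddings[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        members = [i] + [j for score, j in scored if score >= threshold]
        if len(members) >= min_community_size:  # inclusive cutoff (assumption)
            candidates.append(members)

    # Largest candidate communities first, then claim members greedily.
    candidates.sort(key=len, reverse=True)
    assigned, communities = set(), []
    for members in candidates:
        if members[0] in assigned:
            continue  # the centre already belongs to a larger community
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities

embeddings = [
    [1.0, 0.0], [0.95, 0.1], [0.9, 0.2], [0.98, 0.05],  # group near the x-axis
    [0.0, 1.0], [0.1, 0.95], [0.05, 1.0],               # group near the y-axis
    [-1.0, -1.0],                                        # outlier
]
communities = community_sketch(embeddings, threshold=0.9, min_community_size=2)
print(communities)
```

Each returned list holds indices into `embeddings`, with the centre first; the outlier at index 7 does not appear in any community.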
- sentence_transformers.util.cos_sim(a: torch.Tensor, b: torch.Tensor)
Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. Returns a matrix with res[i][j] = cos_sim(a[i], b[j]).
- sentence_transformers.util.dot_score(a: torch.Tensor, b: torch.Tensor)
Computes the dot product dot_prod(a[i], b[j]) for all i and j. Returns a matrix with res[i][j] = dot_prod(a[i], b[j]).
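To make the two score matrices concrete, here is a minimal pure-Python sketch of the values that cos_sim and dot_score are documented to return. It is only an illustration on nested lists, not the library's tensor-based implementation; the names `pairwise_dot` and `pairwise_cos` are hypothetical.

```python
import math

def pairwise_dot(a, b):
    """res[i][j] = dot product of a[i] and b[j]."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

def pairwise_cos(a, b):
    """res[i][j] = cosine similarity of a[i] and b[j]."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    dots = pairwise_dot(a, b)
    return [
        [dots[i][j] / (norm(a[i]) * norm(b[j])) for j in range(len(b))]
        for i in range(len(a))
    ]

a = [[1.0, 0.0], [3.0, 4.0]]              # 2 vectors
b = [[2.0, 0.0], [0.0, 5.0], [3.0, 4.0]]  # 3 vectors

dot = pairwise_dot(a, b)  # 2 x 3 matrix of raw dot products
cos = pairwise_cos(a, b)  # 2 x 3 matrix with values in [-1, 1]
print(dot)
print(cos)
```

The result always has one row per vector in a and one column per vector in b. Unlike the dot product, the cosine score ignores vector length, so the two coincide only when all vectors have unit norm.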
- sentence_transformers.util.http_get(url, path)
Downloads a URL to a given path on disk.
- sentence_transformers.util.paraphrase_mining(model, sentences: List[str], show_progress_bar: bool = False, batch_size: int = 32, *args, **kwargs)
Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.
- Parameters
model – SentenceTransformer model for embedding computation
sentences – A list of strings (texts or sentences)
show_progress_bar – Whether to display a progress bar
batch_size – Number of texts that are encoded simultaneously by the model
query_chunk_size – Number of sentences for which the most similar pairs are searched at the same time. Decrease to lower the memory footprint (increases run time).
corpus_chunk_size – Number of other sentences a sentence is compared against at the same time. Decrease to lower the memory footprint (increases run time).
max_pairs – Maximum number of text pairs returned.
top_k – For each sentence, up to top_k other sentences are retrieved.
score_function – Function for computing scores. By default, cosine similarity.
- Returns
Returns a list of triplets with the format [score, id1, id2]
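The shape of the returned triplets can be illustrated with a brute-force sketch on precomputed embeddings. This is a simplification under stated assumptions, not the library's method: it skips the model, chunking and top_k handling entirely, and the helper name `mine_pairs_sketch` is hypothetical.

```python
import math
from itertools import combinations

def cos(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def mine_pairs_sketch(embeddings, max_pairs):
    """Hypothetical helper: score every pair once, keep the best max_pairs."""
    pairs = [
        [cos(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    # Highest score first, matching the [score, id1, id2] format.
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:max_pairs]

# Stand-ins for sentence embeddings; a real model would produce these.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
pairs = mine_pairs_sketch(embeddings, max_pairs=2)
for score, id1, id2 in pairs:
    print(f"{score:.4f}\t{id1}\t{id2}")
```

Here id1 and id2 are positions in the input list, so with real sentences the texts of a pair would be sentences[id1] and sentences[id2].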
- sentence_transformers.util.semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>)
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
- Parameters
query_embeddings – A 2-dimensional tensor with the query embeddings.
corpus_embeddings – A 2-dimensional tensor with the corpus embeddings.
query_chunk_size – Number of queries processed simultaneously (default 100). Increasing this value increases the speed but requires more memory.
corpus_chunk_size – Number of corpus entries scanned at a time (default 500,000). Increasing this value increases the speed but requires more memory.
top_k – Retrieve top k matching entries.
score_function – Function for computing scores. By default, cosine similarity.
- Returns
Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys 'corpus_id' and 'score', sorted by decreasing score (cosine similarity by default).
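The return structure is easiest to see on a tiny example. The following is a dependency-free sketch that mimics the documented output format with a brute-force scan. It is not the library's chunked implementation; `search_sketch` is a hypothetical name, and its score function compares two single vectors rather than two tensors.

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def search_sketch(query_embeddings, corpus_embeddings, top_k=10, score_function=cos):
    """Hypothetical helper: one list of {'corpus_id', 'score'} hits per query."""
    results = []
    for query in query_embeddings:
        hits = [
            {"corpus_id": idx, "score": score_function(query, doc)}
            for idx, doc in enumerate(corpus_embeddings)
        ]
        # Best match first, then truncate to the top_k entries.
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        results.append(hits[:top_k])
    return results

corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.0, 2.0]]

results = search_sketch(queries, corpus, top_k=2)
for query_id, hits in enumerate(results):
    for hit in hits:
        print(query_id, hit["corpus_id"], round(hit["score"], 4))
```

With the library itself, the corresponding call on 2-dimensional tensors would be util.semantic_search(query_embeddings, corpus_embeddings, top_k=2), and results[i] would hold the hits for the i-th query in the same way.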