sentence_transformers.util defines various helpful functions for working with text embeddings.
http_get(url, path)¶
Downloads a URL to a given path on disk
paraphrase_mining(model, sentences: List[str], show_progress_bar=False, batch_size=32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100)¶
Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.
model – SentenceTransformer model for embedding computation
sentences – A list of strings (texts or sentences)
show_progress_bar – Whether to display a progress bar
batch_size – Number of texts that are encoded simultaneously by the model
query_chunk_size – Search for the most similar pairs for #query_chunk_size queries at a time. Decrease to lower the memory footprint (increases run-time).
corpus_chunk_size – Compare a sentence simultaneously against #corpus_chunk_size other sentences. Decrease to lower the memory footprint (increases run-time).
max_pairs – Maximal number of text pairs returned.
top_k – For each sentence, we retrieve up to top_k other sentences.
Returns a list of triplets in the format [score, id1, id2]
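The core of paraphrase mining can be sketched in a few lines of plain PyTorch: embed all sentences, build the full cosine-similarity matrix, and keep the highest-scoring unique pairs. The sketch below (function name `paraphrase_mining_sketch` and the toy 2-d embeddings are illustrative, not part of the library) omits the chunking that the real function uses to bound memory; it assumes torch is installed:

```python
import torch

def paraphrase_mining_sketch(embeddings: torch.Tensor, top_k: int = 100,
                             max_pairs: int = 500000):
    """Return [score, id1, id2] triplets for the most similar pairs."""
    # Unit-normalize rows so a matrix product yields cosine similarities.
    norm = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    scores = norm @ norm.T
    scores.fill_diagonal_(-1)  # exclude trivial self-pairs

    pairs, seen = [], set()
    top_vals, top_idx = scores.topk(min(top_k, scores.size(1)), dim=1)
    for i in range(scores.size(0)):
        for val, j in zip(top_vals[i].tolist(), top_idx[i].tolist()):
            key = (min(i, j), max(i, j))  # deduplicate (i, j) vs (j, i)
            if key not in seen:
                seen.add(key)
                pairs.append([val, key[0], key[1]])
    pairs.sort(key=lambda p: -p[0])  # highest cosine similarity first
    return pairs[:max_pairs]

# Toy 2-d "embeddings": sentences 0 and 1 point in nearly the same direction.
emb = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = paraphrase_mining_sketch(emb, top_k=2)
```

In practice you would pass the output of model.encode(sentences, convert_to_tensor=True) instead of hand-built vectors; the returned triplets follow the documented [score, id1, id2] format.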
pytorch_cos_sim(a: torch.Tensor, b: torch.Tensor)¶
Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. This function can be used as a faster replacement for 1 - scipy.spatial.distance.cdist(a, b).
Returns a matrix with res[i][j] = cos_sim(a[i], b[j])
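The speed-up over a cdist loop comes from expressing all pairwise similarities as a single matrix product of row-normalized inputs. A minimal sketch of that computation (the name `cos_sim_sketch` is illustrative; torch is assumed):

```python
import torch

def cos_sim_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Row-normalize both inputs; the matrix product then contains every
    # pairwise cosine similarity in one batched, GPU-friendly operation.
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return a_norm @ b_norm.T

a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
b = torch.tensor([[1.0, 1.0]])
res = cos_sim_sketch(a, b)  # res[i][j] = cos_sim(a[i], b[j])
```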
semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 100000, top_k: int = 10)¶
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
query_embeddings – A 2 dimensional tensor with the query embeddings.
corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.
query_chunk_size – Process #query_chunk_size queries simultaneously. Increasing this value increases the speed, but requires more memory.
corpus_chunk_size – Scans the corpus #corpus_chunk_size entries at a time. Increasing this value increases the speed, but requires more memory.
top_k – Retrieve the top k matching entries. Note: if your corpus is larger than query_chunk_size, |Chunks| * top_k entries are returned
Returns a sorted list with decreasing cosine similarity scores. Entries are dictionaries with the keys 'corpus_id' and 'score'
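The retrieval step itself reduces to a similarity matrix plus a top-k selection per query. The sketch below (function name `semantic_search_sketch` and the toy embeddings are illustrative; the real function additionally chunks queries and corpus to bound memory) shows the shape of the result, assuming torch is installed:

```python
import torch

def semantic_search_sketch(query_embeddings, corpus_embeddings, top_k=10):
    # Cosine similarity between every query and every corpus entry.
    q = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
    c = torch.nn.functional.normalize(corpus_embeddings, p=2, dim=1)
    scores = q @ c.T
    top_vals, top_idx = scores.topk(min(top_k, c.size(0)), dim=1)
    # One result list per query, sorted by decreasing score, with the
    # documented {'corpus_id': ..., 'score': ...} entry format.
    return [
        [{"corpus_id": j.item(), "score": v.item()} for v, j in zip(vals, idx)]
        for vals, idx in zip(top_vals, top_idx)
    ]

queries = torch.tensor([[1.0, 0.0]])
corpus = torch.tensor([[0.8, 0.6], [1.0, 0.0], [0.0, 1.0]])
hits = semantic_search_sketch(queries, corpus, top_k=2)
```

With the real API you would compute both tensors via model.encode(..., convert_to_tensor=True) and pass them to semantic_search directly.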