util

sentence_transformers.util defines a variety of helpful functions for working with text embeddings.

Helper Functions

sentence_transformers.util.community_detection(embeddings: Union[torch.Tensor, numpy.ndarray], threshold: float = 0.75, min_community_size: int = 10, batch_size: int = 1024, show_progress_bar: bool = False) → List[List[int]]

Function for Fast Community Detection.

Finds all communities in the embeddings, i.e., groups of embeddings that are closer to one another than threshold. Returns only communities that are at least min_community_size large, in decreasing order of size. The first element in each community is its central point.

Parameters
  • embeddings (torch.Tensor or numpy.ndarray) – The input embeddings.

  • threshold (float) – The threshold for determining if two embeddings are close. Defaults to 0.75.

  • min_community_size (int) – The minimum size of a community to be considered. Defaults to 10.

  • batch_size (int) – The batch size for computing cosine similarity scores. Defaults to 1024.

  • show_progress_bar (bool) – Whether to show a progress bar during computation. Defaults to False.

Returns

A list of communities, where each community is represented as a list of indices.

Return type

List[List[int]]
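
Example

A minimal usage sketch; the model name "all-MiniLM-L6-v2", the sentences, and the threshold are illustrative placeholders:

>>> from sentence_transformers import SentenceTransformer
>>> from sentence_transformers.util import community_detection
>>> model = SentenceTransformer("all-MiniLM-L6-v2")
>>> sentences = [
...     "The weather is lovely today",
...     "It's sunny and warm outside",
...     "He drove his car to work",
...     "She took the bus to the office",
... ]
>>> embeddings = model.encode(sentences, convert_to_tensor=True)
>>> communities = community_detection(embeddings, threshold=0.6, min_community_size=2)
>>> for community in communities:
...     print(f"Community of size {len(community)}; central sentence: {sentences[community[0]]}")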

sentence_transformers.util.http_get(url: str, path: str) → None

Downloads a URL to a given path on disk.

Parameters
  • url (str) – The URL to download.

  • path (str) – The path to save the downloaded file.

Raises

requests.HTTPError – If the HTTP request returns a non-200 status code.

Returns

None
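
Example

A minimal sketch; the URL and file name are placeholders:

>>> from sentence_transformers.util import http_get
>>> http_get("https://example.com/some-file.tsv.gz", "some-file.tsv.gz")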

sentence_transformers.util.is_training_available() → bool

Returns True if the required dependencies for training Sentence Transformer models are installed.
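
Example

A minimal sketch for guarding training code behind this check:

>>> from sentence_transformers.util import is_training_available
>>> if is_training_available():
...     print("Training dependencies are installed")
... else:
...     print("Install the training extras to fine-tune models")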

sentence_transformers.util.normalize_embeddings(embeddings: torch.Tensor) → torch.Tensor

Normalizes the embeddings matrix, so that each sentence embedding has unit length.

Parameters

embeddings (Tensor) – The input embeddings matrix.

Returns

The normalized embeddings matrix.

Return type

Tensor
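
Example

A small sketch with toy vectors; each row is scaled to unit L2 norm:

>>> import torch
>>> from sentence_transformers.util import normalize_embeddings
>>> embeddings = torch.tensor([[3.0, 4.0], [1.0, 0.0]])
>>> normalize_embeddings(embeddings)
tensor([[0.6000, 0.8000],
        [1.0000, 0.0000]])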

sentence_transformers.util.paraphrase_mining(model, sentences: List[str], show_progress_bar: bool = False, batch_size: int = 32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100, score_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>) → List[List[Union[float, int]]]

Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

Parameters
  • model (SentenceTransformer) – SentenceTransformer model for embedding computation

  • sentences (List[str]) – A list of strings (texts or sentences)

  • show_progress_bar (bool, optional) – Whether to show a progress bar. Defaults to False.

  • batch_size (int, optional) – Number of texts that are encoded simultaneously by the model. Defaults to 32.

  • query_chunk_size (int, optional) – Search for the most similar pairs for query_chunk_size sentences at a time. Decrease to lower the memory footprint (increases run-time). Defaults to 5000.

  • corpus_chunk_size (int, optional) – Compare a sentence against corpus_chunk_size other sentences at a time. Decrease to lower the memory footprint (increases run-time). Defaults to 100000.

  • max_pairs (int, optional) – Maximal number of text pairs returned. Defaults to 500000.

  • top_k (int, optional) – For each sentence, we retrieve up to top_k other sentences. Defaults to 100.

  • score_function (Callable[[Tensor, Tensor], Tensor], optional) – Function for computing scores. By default, cosine similarity. Defaults to cos_sim.

Returns

A list of triplets with the format [score, id1, id2]

Return type

List[List[Union[float, int]]]
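
Example

A minimal usage sketch; the model name and sentences are illustrative placeholders:

>>> from sentence_transformers import SentenceTransformer
>>> from sentence_transformers.util import paraphrase_mining
>>> model = SentenceTransformer("all-MiniLM-L6-v2")
>>> sentences = [
...     "The cat sits outside",
...     "A man is playing guitar",
...     "The new movie is awesome",
...     "The feline rests outdoors",
... ]
>>> pairs = paraphrase_mining(model, sentences)
>>> for score, i, j in pairs[:3]:
...     print(f"{score:.4f}: {sentences[i]} <-> {sentences[j]}")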

sentence_transformers.util.semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: Callable[[torch.Tensor, torch.Tensor], torch.Tensor] = <function cos_sim>) → List[List[Dict[str, Union[int, float]]]]

This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 million entries.

Parameters
  • query_embeddings (Tensor) – A 2 dimensional tensor with the query embeddings.

  • corpus_embeddings (Tensor) – A 2 dimensional tensor with the corpus embeddings.

  • query_chunk_size (int, optional) – Process this many queries simultaneously. Increasing this value increases the speed but requires more memory. Defaults to 100.

  • corpus_chunk_size (int, optional) – Scan the corpus in chunks of this size. Increasing this value increases the speed but requires more memory. Defaults to 500000.

  • top_k (int, optional) – Retrieve top k matching entries. Defaults to 10.

  • score_function (Callable[[Tensor, Tensor], Tensor], optional) – Function for computing scores. By default, cosine similarity.

Returns

A list with one entry for each query. Each entry is a list of dictionaries with the keys ‘corpus_id’ and ‘score’, sorted by decreasing cosine similarity scores.

Return type

List[List[Dict[str, Union[int, float]]]]
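
Example

A minimal usage sketch; the model name, corpus, and query are illustrative placeholders:

>>> from sentence_transformers import SentenceTransformer
>>> from sentence_transformers.util import semantic_search
>>> model = SentenceTransformer("all-MiniLM-L6-v2")
>>> corpus = ["Python is a programming language", "It is sunny today", "Berlin is the capital of Germany"]
>>> queries = ["What is the capital of Germany?"]
>>> corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
>>> query_embeddings = model.encode(queries, convert_to_tensor=True)
>>> hits = semantic_search(query_embeddings, corpus_embeddings, top_k=2)
>>> for hit in hits[0]:  # hits for the first (and only) query
...     print(f"{hit['score']:.4f}: {corpus[hit['corpus_id']]}")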

sentence_transformers.util.truncate_embeddings(embeddings: numpy.ndarray, truncate_dim: Optional[int]) → numpy.ndarray
sentence_transformers.util.truncate_embeddings(embeddings: torch.Tensor, truncate_dim: Optional[int]) → torch.Tensor

Truncates the embeddings matrix.

Parameters
  • embeddings (Union[np.ndarray, torch.Tensor]) – Embeddings to truncate.

  • truncate_dim (Optional[int]) – The dimension to truncate sentence embeddings to. None does no truncation.

Example

>>> from sentence_transformers import SentenceTransformer
>>> from sentence_transformers.util import truncate_embeddings
>>> model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")
>>> embeddings = model.encode(["It's so nice outside!", "Today is a beautiful day.", "He drove to work earlier"])
>>> embeddings.shape
(3, 768)
>>> model.similarity(embeddings, embeddings)
tensor([[1.0000, 0.8100, 0.1426],
        [0.8100, 1.0000, 0.2121],
        [0.1426, 0.2121, 1.0000]])
>>> truncated_embeddings = truncate_embeddings(embeddings, 128)
>>> truncated_embeddings.shape
(3, 128)
>>> model.similarity(truncated_embeddings, truncated_embeddings)
tensor([[1.0000, 0.8092, 0.1987],
        [0.8092, 1.0000, 0.2716],
        [0.1987, 0.2716, 1.0000]])

Returns

Truncated embeddings.

Return type

Union[np.ndarray, torch.Tensor]

Similarity Metrics

sentence_transformers.util.cos_sim(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the cosine similarity between two tensors.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Matrix with res[i][j] = cos_sim(a[i], b[j])

Return type

Tensor
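
Example

A small, self-contained sketch with toy vectors; the outputs follow directly from the definition:

>>> import torch
>>> from sentence_transformers.util import cos_sim
>>> a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
>>> b = torch.tensor([[1.0, 0.0]])
>>> cos_sim(a, b)
tensor([[1.],
        [0.]])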

sentence_transformers.util.dot_score(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the dot-product dot_prod(a[i], b[j]) for all i and j.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Matrix with res[i][j] = dot_prod(a[i], b[j])

Return type

Tensor
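
Example

A small sketch with toy vectors; the outputs are plain dot products:

>>> import torch
>>> from sentence_transformers.util import dot_score
>>> a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
>>> b = torch.tensor([[1.0, 1.0]])
>>> dot_score(a, b)
tensor([[3.],
        [7.]])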

sentence_transformers.util.euclidean_sim(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the euclidean similarity (i.e., negative distance) between two tensors.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Matrix with res[i][j] = -euclidean_distance(a[i], b[j])

Return type

Tensor
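
Example

A small sketch with toy vectors; the result is the negated L2 distance:

>>> import torch
>>> from sentence_transformers.util import euclidean_sim
>>> a = torch.tensor([[0.0, 0.0]])
>>> b = torch.tensor([[3.0, 4.0]])
>>> euclidean_sim(a, b)
tensor([[-5.]])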

sentence_transformers.util.manhattan_sim(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the manhattan similarity (i.e., negative distance) between two tensors.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Matrix with res[i][j] = -manhattan_distance(a[i], b[j])

Return type

Tensor
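
Example

A small sketch with toy vectors; the result is the negated L1 distance (|3| + |4| = 7):

>>> import torch
>>> from sentence_transformers.util import manhattan_sim
>>> a = torch.tensor([[0.0, 0.0]])
>>> b = torch.tensor([[3.0, 4.0]])
>>> manhattan_sim(a, b)
tensor([[-7.]])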

sentence_transformers.util.pairwise_cos_sim(a: torch.Tensor, b: torch.Tensor) → torch.Tensor

Computes the pairwise cosine similarity cos_sim(a[i], b[i]).

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Vector with res[i] = cos_sim(a[i], b[i])

Return type

Tensor
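
Example

A small sketch with toy vectors; note the result is a vector of row-wise similarities, not a matrix:

>>> import torch
>>> from sentence_transformers.util import pairwise_cos_sim
>>> a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
>>> b = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
>>> pairwise_cos_sim(a, b)
tensor([1., 0.])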

sentence_transformers.util.pairwise_dot_score(a: torch.Tensor, b: torch.Tensor) → torch.Tensor

Computes the pairwise dot-product dot_prod(a[i], b[i]).

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Vector with res[i] = dot_prod(a[i], b[i])

Return type

Tensor
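
Example

A small sketch with toy vectors; each entry is the dot product of the corresponding rows:

>>> import torch
>>> from sentence_transformers.util import pairwise_dot_score
>>> a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
>>> b = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
>>> pairwise_dot_score(a, b)
tensor([3., 7.])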

sentence_transformers.util.pairwise_euclidean_sim(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the euclidean similarity (i.e., negative distance) between pairs of tensors.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Vector with res[i] = -euclidean_distance(a[i], b[i])

Return type

Tensor
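
Example

A small sketch with toy vectors; the result is the negated row-wise L2 distance:

>>> import torch
>>> from sentence_transformers.util import pairwise_euclidean_sim
>>> a = torch.tensor([[0.0, 0.0]])
>>> b = torch.tensor([[3.0, 4.0]])
>>> pairwise_euclidean_sim(a, b)
tensor([-5.])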

sentence_transformers.util.pairwise_manhattan_sim(a: Union[list, numpy.ndarray, torch.Tensor], b: Union[list, numpy.ndarray, torch.Tensor]) → torch.Tensor

Computes the manhattan similarity (i.e., negative distance) between pairs of tensors.

Parameters
  • a (Union[list, np.ndarray, Tensor]) – The first tensor.

  • b (Union[list, np.ndarray, Tensor]) – The second tensor.

Returns

Vector with res[i] = -manhattan_distance(a[i], b[i])

Return type

Tensor
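
Example

A small sketch with toy vectors; the result is the negated row-wise L1 distance:

>>> import torch
>>> from sentence_transformers.util import pairwise_manhattan_sim
>>> a = torch.tensor([[0.0, 0.0]])
>>> b = torch.tensor([[3.0, 4.0]])
>>> pairwise_manhattan_sim(a, b)
tensor([-7.])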