retrieval

sentence_transformers.util.retrieval.community_detection(embeddings: Tensor | ndarray, threshold: float = 0.75, min_community_size: int = 10, batch_size: int = 1024, show_progress_bar: bool = False) → list[list[int]]

Function for fast community detection.

Finds all communities in the embeddings, i.e. groups of embeddings that are close to one another (closer than threshold). Only communities larger than min_community_size are returned, in decreasing order of size. The first element in each community is its central point.

Parameters:
  • embeddings (torch.Tensor or numpy.ndarray) – The input embeddings.

  • threshold (float) – The threshold for determining if two embeddings are close. Defaults to 0.75.

  • min_community_size (int) – The minimum size of a community to be considered. Defaults to 10.

  • batch_size (int) – The batch size for computing cosine similarity scores. Defaults to 1024.

  • show_progress_bar (bool) – Whether to show a progress bar during computation. Defaults to False.

Returns:

A list of communities, where each community is represented as a list of indices.

Return type:

List[List[int]]
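The behavior described above can be sketched in pure NumPy. This is a hypothetical illustration (community_detection_sketch is not part of the library) that omits the batched similarity computation and re-sorting done by the real implementation:

```python
import numpy as np

def community_detection_sketch(embeddings, threshold=0.75, min_community_size=10):
    # Normalize rows so that a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T

    communities = []
    seen = set()
    # Visit candidate centers with the most neighbors first, so larger
    # communities are emitted before smaller ones.
    neighbor_counts = (sim >= threshold).sum(axis=1)
    for center in np.argsort(-neighbor_counts):
        if center in seen:
            continue
        members = np.flatnonzero(sim[center] >= threshold)
        if len(members) < min_community_size:
            continue
        # The center comes first; drop members already claimed by an
        # earlier (larger) community so communities do not overlap.
        community = [int(center)] + [
            int(m) for m in members if m != center and m not in seen
        ]
        if len(community) < min_community_size:
            continue
        communities.append(community)
        seen.update(community)
    return communities
```

The real community_detection additionally processes the similarity matrix in chunks of batch_size rows to bound memory use on large inputs.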

sentence_transformers.util.retrieval.information_retrieval(*args, **kwargs) → list[list[dict[str, int | float]]]

This function is deprecated. Use semantic_search instead.

sentence_transformers.util.retrieval.paraphrase_mining(model: SentenceTransformer, sentences: list[str], show_progress_bar: bool = False, batch_size: int = 32, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100, score_function: Callable[[Tensor, Tensor], Tensor] = <function cos_sim>, truncate_dim: int | None = None, prompt_name: str | None = None, prompt: str | None = None) → list[list[float | int]]

Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

Parameters:
  • model (SentenceTransformer) – SentenceTransformer model for embedding computation

  • sentences (List[str]) – A list of strings (texts or sentences)

  • show_progress_bar (bool, optional) – Whether to show a progress bar during computation. Defaults to False.

  • batch_size (int, optional) – Number of texts that are encoded simultaneously by the model. Defaults to 32.

  • query_chunk_size (int, optional) – Search for the most similar pairs for query_chunk_size sentences at a time. Decrease to lower the memory footprint (increases run-time). Defaults to 5000.

  • corpus_chunk_size (int, optional) – Compare a sentence against corpus_chunk_size other sentences at a time. Decrease to lower the memory footprint (increases run-time). Defaults to 100000.

  • max_pairs (int, optional) – Maximum number of text pairs returned. Defaults to 500000.

  • top_k (int, optional) – For each sentence, we retrieve up to top_k other sentences. Defaults to 100.

  • score_function (Callable[[Tensor, Tensor], Tensor], optional) – Function for computing scores. By default, cosine similarity. Defaults to cos_sim.

  • truncate_dim (int, optional) – The dimension to truncate sentence embeddings to. If None, the model’s current truncation dimension is used. Defaults to None.

  • prompt_name (Optional[str], optional) –

    The name of a predefined prompt to use when encoding the sentence. It must match a key in the model prompts dictionary, which can be set during model initialization or loaded from the model configuration.

    Ignored if prompt is provided. Defaults to None.

  • prompt (Optional[str], optional) –

    A raw prompt string to prepend directly to the input sentence during encoding.

    For instance, prompt="query: " transforms the sentence "What is the capital of France?" into: "query: What is the capital of France?". Use this to override the prompt logic entirely and supply your own prefix. This takes precedence over prompt_name. Defaults to None.

Returns:

Returns a list of triplets with the format [score, id1, id2]

Return type:

List[List[Union[float, int]]]

sentence_transformers.util.retrieval.paraphrase_mining_embeddings(embeddings: Tensor, query_chunk_size: int = 5000, corpus_chunk_size: int = 100000, max_pairs: int = 500000, top_k: int = 100, score_function: Callable[[Tensor, Tensor], Tensor] = <function cos_sim>) → list[list[float | int]]

Given a tensor of precomputed embeddings, this function performs paraphrase mining. It compares all embeddings against all other embeddings and returns a list of the pairs with the highest cosine similarity scores.

Parameters:
  • embeddings (Tensor) – A tensor with the embeddings

  • query_chunk_size (int) – Search for the most similar pairs for query_chunk_size embeddings at a time. Decrease to lower the memory footprint (increases run-time).

  • corpus_chunk_size (int) – Compare an embedding against corpus_chunk_size other embeddings at a time. Decrease to lower the memory footprint (increases run-time).

  • max_pairs (int) – Maximum number of text pairs returned.

  • top_k (int) – For each embedding, retrieve up to top_k other embeddings.

  • score_function (Callable[[Tensor, Tensor], Tensor]) – Function for computing scores. By default, cosine similarity.

Returns:

Returns a list of triplets with the format [score, id1, id2]

Return type:

List[List[Union[float, int]]]
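The mining step over precomputed embeddings can be sketched in pure NumPy. This is a hypothetical illustration (paraphrase_mining_sketch is not part of the library); it ignores the query/corpus chunking the real implementation uses to bound memory:

```python
import numpy as np

def paraphrase_mining_sketch(embeddings, top_k=100, max_pairs=500000):
    # Normalize rows so that a dot product equals cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs

    pairs = {}
    for i in range(len(emb)):
        # Keep up to top_k most similar partners for each embedding.
        for j in np.argsort(-sim[i])[:top_k]:
            if sim[i, j] == -np.inf:
                continue
            # Store each unordered pair once, with id1 < id2.
            a, b = (int(i), int(j)) if i < j else (int(j), int(i))
            pairs[(a, b)] = float(sim[a, b])

    # Triplets in the documented [score, id1, id2] format,
    # sorted by decreasing score and capped at max_pairs.
    triplets = [[score, a, b] for (a, b), score in pairs.items()]
    triplets.sort(key=lambda t: -t[0])
    return triplets[:max_pairs]
```

Deduplicating via the (id1, id2) key is what keeps each pair from appearing twice, since every pair is found once from each side.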

sentence_transformers.util.retrieval.semantic_search(query_embeddings: Tensor, corpus_embeddings: Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 500000, top_k: int = 10, score_function: Callable[[Tensor, Tensor], Tensor] = <function cos_sim>) → list[list[dict[str, int | float]]]

By default, this function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora of up to about 1 million entries.

Parameters:
  • query_embeddings (Tensor) – A 2 dimensional tensor with the query embeddings. Can be a sparse tensor.

  • corpus_embeddings (Tensor) – A 2 dimensional tensor with the corpus embeddings. Can be a sparse tensor.

  • query_chunk_size (int, optional) – Process query_chunk_size queries simultaneously. Increasing this value increases the speed, but requires more memory. Defaults to 100.

  • corpus_chunk_size (int, optional) – Scan the corpus corpus_chunk_size entries at a time. Increasing this value increases the speed, but requires more memory. Defaults to 500000.

  • top_k (int, optional) – Retrieve top k matching entries. Defaults to 10.

  • score_function (Callable[[Tensor, Tensor], Tensor], optional) – Function for computing scores. By default, cosine similarity.

Returns:

A list with one entry for each query. Each entry is a list of dictionaries with the keys ‘corpus_id’ and ‘score’, sorted by decreasing cosine similarity scores.

Return type:

List[List[Dict[str, Union[int, float]]]]
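The search described above can be sketched in pure NumPy, producing the documented output format (a per-query list of dictionaries with 'corpus_id' and 'score', sorted by decreasing score). This is a hypothetical illustration (semantic_search_sketch is not part of the library) that skips chunking and sparse-tensor support:

```python
import numpy as np

def semantic_search_sketch(query_embeddings, corpus_embeddings, top_k=10):
    # Normalize rows so that a dot product equals cosine similarity.
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    c = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    scores = q @ c.T  # shape: (num_queries, num_corpus)

    results = []
    for row in scores:
        # Indices of the top_k highest-scoring corpus entries, best first.
        top = np.argsort(-row)[:top_k]
        results.append(
            [{"corpus_id": int(i), "score": float(row[i])} for i in top]
        )
    return results
```

The real semantic_search processes queries in chunks of query_chunk_size and the corpus in chunks of corpus_chunk_size, so only a small slice of the full score matrix is ever materialized.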