Search Engines

sentence_transformers.sparse_encoder.search_engines defines helper functions for integrating the sparse embeddings produced by a SparseEncoder with vector databases and search engines.

sentence_transformers.sparse_encoder.search_engines.semantic_search_elasticsearch(query_embeddings_decoded: list[list[tuple[str, float]]], corpus_embeddings_decoded: list[list[tuple[str, float]]] | None = None, corpus_index: tuple[Elasticsearch, str] | None = None, top_k: int = 10, output_index: bool = False, **kwargs: Any) → tuple[list[list[dict[str, int | float]]], float] | tuple[list[list[dict[str, int | float]]], float, tuple[Elasticsearch, str]]

Performs semantic search using sparse embeddings with Elasticsearch.

Parameters:
  • query_embeddings_decoded

    List of query embeddings in format [[("token", value), …], …]. Example: to obtain this format from a SparseEncoder model:

    from sentence_transformers import SparseEncoder

    model = SparseEncoder('my-sparse-model')  # illustrative model name
    query_texts = ["your query text"]
    query_embeddings = model.encode(query_texts)
    # decode() converts the sparse embeddings into [("token", value), ...] lists
    query_embeddings_decoded = model.decode(query_embeddings)
    

  • corpus_embeddings_decoded – List of corpus embeddings in format [[("token", value), …], …]. Only used if corpus_index is None. Can be obtained using the same decode method as for the query embeddings.

  • corpus_index – Tuple of (Elasticsearch, collection_name). If provided, this existing index is used for search.

  • top_k – Number of top results to retrieve

  • output_index – Whether to return the Elasticsearch client and collection name

Returns:

A tuple containing:

  • List of search results in format [[{"corpus_id": int, "score": float}, …], …]

  • Time taken for search

  • (Optional) Tuple of (Elasticsearch, collection_name) if output_index is True
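
A minimal end-to-end sketch of how this function might be used, assuming an Elasticsearch instance is reachable; the model name and corpus are illustrative placeholders, not part of this API:

    from sentence_transformers import SparseEncoder
    from sentence_transformers.sparse_encoder.search_engines import semantic_search_elasticsearch

    model = SparseEncoder("my-sparse-model")  # illustrative model name
    corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    queries = ["What is the capital of France?"]

    # Decode the sparse embeddings into the [[("token", value), ...], ...] format.
    corpus_embeddings_decoded = model.decode(model.encode(corpus))
    query_embeddings_decoded = model.decode(model.encode(queries))

    # First call: the corpus is indexed and, with output_index=True, the
    # (Elasticsearch, index_name) pair is returned for reuse.
    results, search_time, corpus_index = semantic_search_elasticsearch(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded,
        top_k=3,
        output_index=True,
    )
    print(results[0])  # [{"corpus_id": 0, "score": ...}, ...]

    # Subsequent calls can reuse the existing index via corpus_index.
    results, search_time = semantic_search_elasticsearch(
        query_embeddings_decoded,
        corpus_index=corpus_index,
        top_k=3,
    )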

sentence_transformers.sparse_encoder.search_engines.semantic_search_opensearch(query_embeddings_decoded: list[list[tuple[str, float]]], corpus_embeddings_decoded: list[list[tuple[str, float]]] | None = None, corpus_index: tuple[OpenSearch, str] | None = None, top_k: int = 10, output_index: bool = False, **kwargs: Any) → tuple[list[list[dict[str, int | float]]], float] | tuple[list[list[dict[str, int | float]]], float, tuple[OpenSearch, str]]

Performs semantic search using sparse embeddings with OpenSearch.

Parameters:
  • query_embeddings_decoded

    List of query embeddings in format [[("token", value), …], …]. Example: to obtain this format from a SparseEncoder model:

    from sentence_transformers import SparseEncoder

    model = SparseEncoder('my-sparse-model')  # illustrative model name
    query_texts = ["your query text"]
    query_embeddings = model.encode(query_texts)
    # decode() converts the sparse embeddings into [("token", value), ...] lists
    query_embeddings_decoded = model.decode(query_embeddings)
    

  • corpus_embeddings_decoded – List of corpus embeddings in format [[("token", value), …], …]. Only used if corpus_index is None. Can be obtained using the same decode method as for the query embeddings.

  • corpus_index – Tuple of (OpenSearch, collection_name). If provided, this existing index is used for search.

  • top_k – Number of top results to retrieve

  • output_index – Whether to return the OpenSearch client and collection name

  • vocab – The dictionary used to map tokens to token ids

Returns:

A tuple containing:

  • List of search results in format [[{"corpus_id": int, "score": float}, …], …]

  • Time taken for search

  • (Optional) Tuple of (OpenSearch, collection_name) if output_index is True
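
A sketch analogous to the Elasticsearch one, assuming a reachable OpenSearch instance; the model name and corpus are placeholders, and taking the vocab mapping from the model's tokenizer is an assumption rather than a requirement of this API:

    from sentence_transformers import SparseEncoder
    from sentence_transformers.sparse_encoder.search_engines import semantic_search_opensearch

    model = SparseEncoder("my-sparse-model")  # illustrative model name
    corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    queries = ["What is the capital of France?"]

    corpus_embeddings_decoded = model.decode(model.encode(corpus))
    query_embeddings_decoded = model.decode(model.encode(queries))

    # vocab maps tokens to token ids; using the model's tokenizer for this
    # mapping is an assumption here, adapt as needed.
    results, search_time, corpus_index = semantic_search_opensearch(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded,
        vocab=model.tokenizer.get_vocab(),
        top_k=3,
        output_index=True,
    )
    print(results[0])  # [{"corpus_id": 0, "score": ...}, ...]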

sentence_transformers.sparse_encoder.search_engines.semantic_search_qdrant(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor | None = None, corpus_index: tuple[QdrantClient, str] | None = None, top_k: int = 10, output_index: bool = False, **kwargs: Any) → tuple[list[list[dict[str, int | float]]], float] | tuple[list[list[dict[str, int | float]]], float, tuple[QdrantClient, str]]

Performs semantic search using sparse embeddings with Qdrant.

Parameters:
  • query_embeddings – PyTorch COO sparse tensor containing query embeddings

  • corpus_embeddings – PyTorch COO sparse tensor containing corpus embeddings. Only used if corpus_index is None.

  • corpus_index – Tuple of (QdrantClient, collection_name). If provided, this existing index is used for search.

  • top_k – Number of top results to retrieve

  • output_index – Whether to return the Qdrant client and collection name

Returns:

A tuple containing:

  • List of search results in format [[{"corpus_id": int, "score": float}, …], …]

  • Time taken for search

  • (Optional) Tuple of (QdrantClient, collection_name) if output_index is True
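
A sketch for Qdrant, assuming a reachable Qdrant instance and that SparseEncoder.encode returns the PyTorch COO sparse tensors this function expects; the model name and corpus are placeholders:

    from sentence_transformers import SparseEncoder
    from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant

    model = SparseEncoder("my-sparse-model")  # illustrative model name
    corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    queries = ["What is the capital of France?"]

    # No decoding here: the function takes the sparse COO tensors directly.
    corpus_embeddings = model.encode(corpus)
    query_embeddings = model.encode(queries)

    results, search_time, corpus_index = semantic_search_qdrant(
        query_embeddings,
        corpus_embeddings=corpus_embeddings,
        top_k=3,
        output_index=True,
    )
    print(results[0])  # [{"corpus_id": 0, "score": ...}, ...]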

sentence_transformers.sparse_encoder.search_engines.semantic_search_seismic(query_embeddings_decoded: list[list[tuple[str, float]]], corpus_embeddings_decoded: list[list[tuple[str, float]]] | None = None, corpus_index: tuple[SeismicIndex, str] | None = None, top_k: int = 10, output_index: bool = False, index_kwargs: dict[str, Any] | None = None, search_kwargs: dict[str, Any] | None = None) → tuple[list[list[dict[str, int | float]]], float] | tuple[list[list[dict[str, int | float]]], float, tuple[SeismicIndex, str]]

Performs semantic search using sparse embeddings with Seismic.

Parameters:
  • query_embeddings_decoded

    List of query embeddings in format [[("token", value), …], …]. Example: to obtain this format from a SparseEncoder model:

    from sentence_transformers import SparseEncoder

    model = SparseEncoder('my-sparse-model')  # illustrative model name
    query_texts = ["your query text"]
    query_embeddings = model.encode(query_texts)
    # decode() converts the sparse embeddings into [("token", value), ...] lists
    query_embeddings_decoded = model.decode(query_embeddings)
    

  • corpus_embeddings_decoded – List of corpus embeddings in format [[("token", value), …], …]. Only used if corpus_index is None. Can be obtained using the same decode method as for the query embeddings.

  • corpus_index – Tuple of (SeismicIndex, collection_name). If provided, this existing index is used for search.

  • top_k – Number of top results to retrieve

  • output_index – Whether to return the SeismicIndex client and collection name

  • index_kwargs – Additional arguments for SeismicIndex passed to build_from_dataset, such as centroid_fraction, min_cluster_size, summary_energy, nknn, knn_path, batched_indexing, or num_threads.

  • search_kwargs – Additional arguments for SeismicIndex passed to batch_search, such as query_cut, heap_factor, n_knn, sorted, or num_threads. Note: query_cut and heap_factor are set to default values if not provided.

Returns:

A tuple containing:

  • List of search results in format [[{"corpus_id": int, "score": float}, …], …]

  • Time taken for search

  • (Optional) Tuple of (SeismicIndex, collection_name) if output_index is True
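
A sketch for Seismic, with the model name, corpus, and index_kwargs chosen purely for illustration; index_kwargs and search_kwargs are forwarded to build_from_dataset and batch_search as described in the parameter list above:

    from sentence_transformers import SparseEncoder
    from sentence_transformers.sparse_encoder.search_engines import semantic_search_seismic

    model = SparseEncoder("my-sparse-model")  # illustrative model name
    corpus = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
    queries = ["What is the capital of France?"]

    corpus_embeddings_decoded = model.decode(model.encode(corpus))
    query_embeddings_decoded = model.decode(model.encode(queries))

    results, search_time, corpus_index = semantic_search_seismic(
        query_embeddings_decoded,
        corpus_embeddings_decoded=corpus_embeddings_decoded,
        top_k=3,
        index_kwargs={"num_threads": 4},  # illustrative; see the parameter list above
        output_index=True,
    )
    print(results[0])  # [{"corpus_id": 0, "score": ...}, ...]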