Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.
The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.
At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
For small corpora (up to about 100k entries) we can compute the cosine-similarity between the query and all entries in the corpus.
In the following example, we define a small corpus with a few example sentences and compute the embeddings for the corpus as well as for our query.
We then use the util.pytorch_cos_sim() function to compute the cosine similarity between the query and all corpus entries.
For large corpora, sorting all scores would take too much time. Hence, we use torch.topk to only get the top k entries.
For a simple example, see semantic_search.py:
    """
    This is a simple application for sentence embeddings: semantic search

    We have a corpus with various sentences. Then, for a given query sentence,
    we want to find the most similar sentence in this corpus.

    This script outputs for various queries the top 5 most similar sentences in the corpus.
    """
    from sentence_transformers import SentenceTransformer, util
    import torch

    embedder = SentenceTransformer('paraphrase-distilroberta-base-v1')

    # Corpus with example sentences
    corpus = ['A man is eating food.',
              'A man is eating a piece of bread.',
              'The girl is carrying a baby.',
              'A man is riding a horse.',
              'A woman is playing violin.',
              'Two men pushed carts through the woods.',
              'A man is riding a white horse on an enclosed ground.',
              'A monkey is playing drums.',
              'A cheetah is running behind its prey.'
              ]
    corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

    # Query sentences:
    queries = ['A man is eating pasta.',
               'Someone in a gorilla costume is playing a set of drums.',
               'A cheetah chases prey on across a field.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = embedder.encode(query, convert_to_tensor=True)
        # pytorch_cos_sim returns a (1 x corpus size) matrix; take the single row for this query
        cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu()

        # We use torch.topk to find the highest 5 scores
        top_results = torch.topk(cos_scores, k=top_k)

        print("\n\n======================\n\n")
        print("Query:", query)
        print("\nTop 5 most similar sentences in corpus:")

        # top_results[0] holds the scores, top_results[1] the matching corpus indices
        for score, idx in zip(top_results[0], top_results[1]):
            print(corpus[idx], "(Score: %.4f)" % (score))
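To make the scoring step concrete without downloading a model, the following is a minimal sketch in plain Python. The three-dimensional vectors and the names toy_corpus and toy_query are made up for illustration; a real model produces embeddings with hundreds of dimensions. heapq.nlargest stands in for torch.topk here: it picks the k best scores without sorting the full list.

```python
import heapq
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: hand-picked values, not model output)
toy_corpus = {
    "eating": [0.9, 0.1, 0.0],
    "riding": [0.1, 0.9, 0.1],
    "music": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

# Score the query against every corpus entry
scores = [(cos_sim(toy_query, vec), name) for name, vec in toy_corpus.items()]

# Select the 2 best entries without sorting the whole list
top_2 = heapq.nlargest(2, scores)
for score, name in top_2:
    print(name, "(Score: %.4f)" % score)
```

The query vector points mostly in the same direction as the "eating" vector, so that entry is ranked first.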
Instead of implementing semantic search yourself, you can use the util.semantic_search function.
The function accepts the following parameters:
semantic_search(query_embeddings: torch.Tensor, corpus_embeddings: torch.Tensor, query_chunk_size: int = 100, corpus_chunk_size: int = 100000, top_k: int = 10)
This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for Information Retrieval / Semantic Search for corpora up to about 1 Million entries.
query_embeddings – A 2 dimensional tensor with the query embeddings.
corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.
query_chunk_size – Process 100 queries simultaneously. Increasing that value increases the speed, but requires more memory.
corpus_chunk_size – Scans the corpus 100k entries at a time. Increasing that value increases the speed, but requires more memory.
top_k – Retrieve top k matching entries.
Returns a sorted list with decreasing cosine similarity scores. Entries are dictionaries with the keys ‘corpus_id’ and ‘score’.
By default, up to 100 queries are processed in parallel. Further, the corpus is chunked into sets of up to 100k entries. You can increase query_chunk_size and corpus_chunk_size, which leads to increased speed for large corpora, but also increases the memory requirement.
Depending on your real-time requirements, you can use this function for corpora up to 1 Million entries given you have enough memory.
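The chunking idea can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the library's implementation: chunked_search is a hypothetical helper, it works on lists instead of tensors, and it only chunks the corpus (not the queries). It returns hits in the same shape as described above, that is, dictionaries with the keys 'corpus_id' and 'score'.

```python
import heapq
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def chunked_search(query_vecs, corpus_vecs, corpus_chunk_size=2, top_k=2):
    """Hypothetical helper: per query, return the top_k hits sorted by decreasing score."""
    results = [[] for _ in query_vecs]
    # Scan the corpus one chunk at a time
    for start in range(0, len(corpus_vecs), corpus_chunk_size):
        chunk = corpus_vecs[start:start + corpus_chunk_size]
        for q_idx, query in enumerate(query_vecs):
            for offset, vec in enumerate(chunk):
                hit = {"corpus_id": start + offset, "score": cos_sim(query, vec)}
                results[q_idx].append(hit)
            # Keep only the best top_k hits seen so far for this query
            results[q_idx] = heapq.nlargest(
                top_k, results[q_idx], key=lambda h: h["score"]
            )
    return results


# Made-up 2-dimensional vectors for illustration
corpus_vecs = [[1, 0], [0, 1], [1, 1], [-1, 0], [0.9, 0.1]]
query_vecs = [[1, 0], [0, 1]]

hits = chunked_search(query_vecs, corpus_vecs)
for q_idx, query_hits in enumerate(hits):
    print("Query", q_idx, [(h["corpus_id"], round(h["score"], 4)) for h in query_hits])
```

With the library itself, the corresponding call would be util.semantic_search(query_embeddings, corpus_embeddings, top_k=2), where both arguments are 2-dimensional tensors as described in the parameter list.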
Similar Questions Retrieval
semantic_search_quora_pytorch.py shows an example based on the Quora duplicate questions dataset. The user can enter a question, and the code retrieves the most similar questions from the dataset using the util.semantic_search method. As the model, we use distilbert-multilingual-nli-stsb-quora-ranking, which was trained to identify similar questions and supports 50+ languages.
Starting with version 7.3, ElasticSearch introduced the possibility to index dense vectors and to use them for document scoring. Hence, we can use ElasticSearch to index embeddings along with the documents, and we can use the query embeddings to retrieve relevant entries.
An advantage of ElasticSearch is that it is easy to add new documents to an index and that we can also store other data along with our vectors. A disadvantage is the slow performance, as it compares the query embedding with all stored embeddings. This has a linear run-time and might be too slow for large (>100k) corpora.
For further details, see semantic_search_quora_elasticsearch.py.
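As a rough sketch of what the ElasticSearch side can look like, the snippet below only builds the request bodies as Python dictionaries; it does not talk to a server. The field names question and question_vector, the helper build_query, and the dimension 768 are assumptions for illustration (the dimension must match your model's embedding size). The exact script syntax for cosineSimilarity has changed between ElasticSearch releases, so check the documentation for the version you run.

```python
# Assumed index layout: a text field plus a dense_vector field for the embedding
index_mapping = {
    "mappings": {
        "properties": {
            "question": {"type": "text"},
            "question_vector": {"type": "dense_vector", "dims": 768},
        }
    }
}


def build_query(query_vector, size=10):
    """Hypothetical helper: build a script_score query that ranks by cosine similarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Score every document in the index (this is the linear scan)
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative, which ElasticSearch requires
                    "source": "cosineSimilarity(params.queryVector, doc['question_vector']) + 1.0",
                    "params": {"queryVector": query_vector},
                },
            }
        },
    }


body = build_query([0.1, 0.2, 0.3], size=5)
print(body["query"]["script_score"]["script"]["source"])
```

The match_all inner query is what makes this approach linear in the corpus size: every stored vector is scored against the query vector.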
Approximate Nearest Neighbor
Searching a large corpus with millions of embeddings can be time-consuming if exact nearest neighbor search is used (as is done by util.semantic_search).
In that case, Approximate Nearest Neighbor (ANN) can be helpful. Here, the data is partitioned into smaller fractions of similar embeddings. This index can be searched efficiently, and the embeddings with the highest similarity (the nearest neighbors) can be retrieved within milliseconds, even if you have millions of vectors.
However, the results are not necessarily exact: It can happen that some vectors with high similarity are missed. That’s the reason why it is called approximate nearest neighbor.
For all ANN methods, there are usually one or more parameters to tune that determine the recall-speed trade-off. If you want the highest speed, you have a high chance of missing hits. If you want high recall, the search speed decreases.
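The partitioning idea and the recall-speed trade-off can be illustrated with a toy sketch in plain Python. This is a deliberately simplified scheme, not how any particular ANN library works internally: centroids are random corpus vectors, and the hypothetical parameter n_probe controls how many partitions are searched. Probing fewer partitions means fewer comparisons, but true neighbors that sit in skipped partitions are missed.

```python
import math
import random


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_partitions(vectors, n_partitions, rng):
    """Assign every vector to its most similar centroid (centroids are random corpus picks)."""
    centroids = rng.sample(vectors, n_partitions)
    partitions = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        best = max(range(n_partitions), key=lambda c: cos_sim(vec, centroids[c]))
        partitions[best].append(idx)
    return centroids, partitions


def ann_search(query, vectors, centroids, partitions, n_probe, top_k):
    """Score only the vectors in the n_probe partitions whose centroids are closest to the query."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: cos_sim(query, centroids[c]),
        reverse=True,
    )
    candidates = [idx for c in ranked[:n_probe] for idx in partitions[c]]
    candidates.sort(key=lambda idx: cos_sim(query, vectors[idx]), reverse=True)
    return candidates[:top_k]


# Random stand-in data (assumption: Gaussian vectors instead of real embeddings)
rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]
centroids, partitions = build_partitions(vectors, n_partitions=10, rng=rng)

# Probing all 10 partitions is equivalent to an exact search
exact = set(ann_search(query, vectors, centroids, partitions, n_probe=10, top_k=10))

recalls = {}
for n_probe in (1, 3, 10):
    found = set(ann_search(query, vectors, centroids, partitions, n_probe, top_k=10))
    recalls[n_probe] = len(found & exact) / 10
    print("n_probe=%d recall=%.2f" % (n_probe, recalls[n_probe]))
```

Recall can only stay the same or grow as n_probe increases, because each larger setting searches a superset of the previous candidates. Probing all partitions recovers the exact result at the cost of a full scan.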
Three popular libraries for approximate nearest neighbor search are Annoy, FAISS, and hnswlib. Personally, I find hnswlib the most suitable library: It is easy to use, offers great performance, and includes nice features that are important for real applications.
For an example of how to use SentenceTransformers with HNSWLib, see: semantic_search_quora_hnswlib.py
For an example of how to use SentenceTransformers with Annoy, see: semantic_search_quora_annoy.py
For an example of how to use SentenceTransformers with FAISS, see: semantic_search_quora_faiss.py