Evaluation with MTEB
The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark suite for evaluating embedding models across diverse NLP tasks like retrieval, classification, clustering, reranking, and semantic similarity.
This guide walks you through using MTEB with SentenceTransformer models for post-training evaluation. MTEB is not designed for use during training, as evaluating on public benchmarks while training risks overfitting to them. For evaluation during training, please see the Evaluator section in the Training Overview. To fully integrate your model with MTEB, you can follow the Adding a model to the Leaderboard guide from the MTEB Documentation.
Installation
Install MTEB and its dependencies:
pip install mteb
Evaluation
You can evaluate your SentenceTransformer model on individual tasks from the MTEB suite like so:
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example 1: Run a specific single task
tasks = mteb.get_tasks(tasks=["STS22.v2"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")
For the full list of available tasks, you can check the MTEB Tasks documentation.
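You can also browse the available tasks programmatically. A minimal sketch, assuming that calling mteb.get_tasks() without any filters returns the full task registry (as in recent mteb versions) and that each task exposes its metadata via task.metadata:
import mteb

# Retrieve every registered task and inspect a few of them
all_tasks = mteb.get_tasks()
print(f"Number of available tasks: {len(all_tasks)}")
for task in list(all_tasks)[:5]:
    print(task.metadata.name, "-", task.metadata.type)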
You can also filter available MTEB tasks based on task type, domain, language, and more. For example, the following snippet evaluates on English retrieval tasks in the medical domain:
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example 2: Run all English retrieval tasks in the medical domain
tasks = mteb.get_tasks(
    task_types=["Retrieval"],
    domains=["Medical"],
    languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")
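Before running a filtered selection like this, it can help to confirm which tasks actually matched your filters. A small sketch, assuming each task exposes its name via task.metadata.name:
# Inspect the tasks selected by the filters before evaluating
for task in tasks:
    print(task.metadata.name)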
Lastly, it’s often valuable to evaluate on predefined benchmarks. For example, to run all tasks in the MTEB(eng, v2) benchmark:
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example 3: Run the MTEB benchmark for English tasks
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/")
For the full list of supported benchmarks, visit the MTEB Benchmarks documentation.
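You can also list the predefined benchmarks from Python. A small sketch, assuming mteb.get_benchmarks() is available in your installed mteb version and that each benchmark object exposes a name attribute:
import mteb

# List all predefined benchmarks known to the installed mteb version
for benchmark in mteb.get_benchmarks():
    print(benchmark.name)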
Additional Arguments
When running evaluations, you can pass arguments down to model.encode() using the encode_kwargs parameter on evaluation.run(). This allows you to customize how embeddings are generated, such as setting batch_size, truncate_dim, or normalize_embeddings. For example:
...
results = evaluation.run(
    model,
    verbosity=2,
    output_folder="results/",
    encode_kwargs={"batch_size": 64, "normalize_embeddings": True}
)
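The example above covers batch_size and normalize_embeddings; the sketch below illustrates truncate_dim as well. It assumes your model produces embeddings that remain meaningful when truncated (for example, a Matryoshka-trained model) and that your installed sentence-transformers version accepts truncate_dim in encode(); the output folder name is arbitrary.
...
results = evaluation.run(
    model,
    output_folder="results-truncated/",
    # Hypothetical: truncate embeddings to their first 256 dimensions before scoring
    encode_kwargs={"batch_size": 64, "truncate_dim": 256}
)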
Additionally, your SentenceTransformer model may have been configured to use prompts. MTEB will automatically detect and use these prompts if they are defined in your model's configuration. For task-specific or document/query-specific prompts, you should read the MTEB Documentation on Running SentenceTransformer models with prompts.
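As a concrete illustration, the sketch below loads a model and defines prompts at load time using the standard prompts argument of SentenceTransformer. The prompt strings and keys here are assumptions for illustration; the exact prompt names MTEB looks for (for example, query/passage or task-specific names) are described in the MTEB documentation linked above.
from sentence_transformers import SentenceTransformer

# Hypothetical prompts; some models define similar "query"/"passage"
# prompts directly in their saved configuration instead.
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
)
# MTEB detects prompts defined on the model, so e.g. retrieval tasks can
# encode queries and documents with the matching prompt.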
Results Handling
MTEB caches all results to disk, so you can rerun evaluation.run() without needing to redownload datasets or recompute scores.
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["STS17", "STS22.v2"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")
for task_results in results:
    # Print the aggregated main scores for each task
    print(f"{task_results.task_name}: {task_results.get_score():.4f} mean {task_results.task.metadata.main_score}")
"""
STS17: 0.2881 mean cosine_spearman
STS22.v2: 0.4925 mean cosine_spearman
"""
# Or e.g. print the individual scores for each split or subset
print(task_results.only_main_score().to_dict())
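If you evaluate on several tasks, a single summary number can be handy. A small sketch building only on the results objects from above, averaging the main score across tasks:
# Average the main score over all evaluated tasks
average_score = sum(task_results.get_score() for task_results in results) / len(results)
print(f"Average main score across {len(results)} tasks: {average_score:.4f}")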
Leaderboard Submission
To add your model to the MTEB Leaderboard, you will need to follow the Adding a Model guide in the MTEB Documentation. In short, the process consists of these steps:
1. Add your model metadata (name, languages, number of parameters, framework, training datasets, etc.) to the MTEB Repository.
2. Evaluate your model using MTEB on your desired tasks and save the results.
3. Submit your results to the MTEB Results Repository.
Once both pull requests (the model metadata and the results) are merged, your model should appear on the official leaderboard within about a day.