Evaluation with MTEB

The Massive Text Embedding Benchmark (MTEB) is a comprehensive benchmark suite for evaluating embedding models across diverse NLP tasks like retrieval, classification, clustering, reranking, and semantic similarity.

This guide walks you through using MTEB with SentenceTransformer models for post-training evaluation. MTEB is not designed for use during training, as that risks overfitting on public benchmarks. For evaluation during training, please see the Evaluator section in the Training Overview. To fully integrate your model into MTEB, you can follow the Adding a model to the Leaderboard guide from the MTEB Documentation.

Installation

Install MTEB and its dependencies:

pip install mteb

Evaluation

You can evaluate your SentenceTransformer model on individual tasks from the MTEB suite like so:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Example 1: Run a specific single task
tasks = mteb.get_tasks(tasks=["STS22.v2"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")

For the full list of available tasks, you can check the MTEB Tasks documentation.
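
You can also list the registered tasks programmatically. The snippet below is a small sketch; it assumes the metadata.name and metadata.type attributes on each task, which may vary across MTEB versions:

import mteb

# get_tasks() without arguments returns every registered task
all_tasks = mteb.get_tasks()
print(f"Number of available tasks: {len(all_tasks)}")
for task in all_tasks[:5]:
    print(task.metadata.name, task.metadata.type)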

You can also filter available MTEB tasks based on task type, domain, language, and more. For example, the following snippet evaluates on English retrieval tasks in the medical domain:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Example 2: Run all English retrieval tasks in the medical domain
tasks = mteb.get_tasks(
    task_types=["Retrieval"],
    domains=["Medical"],
    languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")

Lastly, it’s often valuable to evaluate on predefined benchmarks. For example, to run all tasks in the MTEB(eng, v2) benchmark:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Example 3: Run the MTEB benchmark for English tasks
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/")

For the full list of supported benchmarks, visit the MTEB Benchmarks documentation.
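
Alternatively, you can enumerate the predefined benchmarks from Python. This is a minimal sketch, assuming the name and tasks attributes on the Benchmark objects returned by mteb.get_benchmarks():

import mteb

# Print every predefined benchmark and how many tasks it contains
for benchmark in mteb.get_benchmarks():
    print(benchmark.name, len(benchmark.tasks))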

Additional Arguments

When running evaluations, you can pass arguments down to model.encode() using the encode_kwargs parameter on evaluation.run(). This allows you to customize how embeddings are generated, such as setting batch_size, truncate_dim, or normalize_embeddings. For example:

...

results = evaluation.run(
    model,
    verbosity=2,
    output_folder="results/",
    encode_kwargs={"batch_size": 64, "normalize_embeddings": True}
)
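
As a fully runnable variant of the snippet above, the sketch below also sets truncate_dim. Note that truncating embeddings is mainly useful for models trained with Matryoshka-style objectives; all-MiniLM-L6-v2 is used here only for illustration, and the exact set of supported encode() arguments depends on your sentence-transformers version:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks(tasks=["STS22.v2"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

# encode_kwargs is forwarded to model.encode() for every task
results = evaluation.run(
    model,
    output_folder="results/",
    encode_kwargs={
        "batch_size": 16,               # e.g. to fit limited GPU memory
        "truncate_dim": 256,            # truncate embeddings to 256 dimensions
        "normalize_embeddings": True,   # L2-normalize the embeddings
    },
)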

Additionally, your SentenceTransformer model may have been configured to use prompts. MTEB will automatically detect and use these prompts if they are defined in your model’s configuration. For task-specific or document/query-specific prompts, you should read the MTEB Documentation on Running SentenceTransformer models with prompts.
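
For instance, if prompts are not already defined in your model's configuration, you can pass them explicitly when loading the model. This is a sketch using intfloat/multilingual-e5-small, which expects "query: " and "passage: " prefixes; the exact prompt names MTEB looks for are described in the documentation linked above:

import mteb
from sentence_transformers import SentenceTransformer

# Define query/passage prompts so they can be applied per input type
model = SentenceTransformer(
    "intfloat/multilingual-e5-small",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
)

tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")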

Results Handling

MTEB caches all results to disk, so you can rerun evaluation.run() without needing to redownload datasets or recompute scores.

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks(tasks=["STS17", "STS22.v2"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/")

for task_results in results:
    # Print the aggregated main scores for each task
    print(f"{task_results.task_name}: {task_results.get_score():.4f} mean {task_results.task.metadata.main_score}")
    """
    STS17: 0.2881 mean cosine_spearman
    STS22.v2: 0.4925 mean cosine_spearman
    """

    # Or e.g. print the individual scores for each split or subset
    print(task_results.only_main_score().to_dict())
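
If you do want to recompute scores, for example after changing encode_kwargs, you can ask MTEB not to reuse the cached results. The overwrite_results flag below is an assumption based on recent MTEB versions; check the signature of evaluation.run() for your installed version:

# Force re-evaluation instead of reusing cached results in output_folder
# (overwrite_results is assumed here; verify it exists in your MTEB version)
results = evaluation.run(
    model,
    output_folder="results/",
    overwrite_results=True,
)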

Leaderboard Submission

To add your model to the MTEB Leaderboard, you will need to follow the Adding a Model guide in the MTEB Documentation.

In short, the process involves the following steps:

  1. Add your model metadata (name, languages, number of parameters, framework, training datasets, etc.) to the MTEB Repository.

  2. Evaluate your model using MTEB on your desired tasks and save the results (see the sketch after this list).

  3. Submit your results to the MTEB Results Repository.
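
For step 2, the evaluation itself looks the same as in the examples above. This is a sketch: "your-username/your-model" is a hypothetical model ID, and recent MTEB versions write the result files under results/<model_name>/<revision>/, which is the layout expected by the results repository (check the MTEB Documentation for the exact requirements):

import mteb
from sentence_transformers import SentenceTransformer

# Hypothetical model ID; replace with your own model on the Hugging Face Hub
model = SentenceTransformer("your-username/your-model")

benchmark = mteb.get_benchmark("MTEB(eng, v2)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/")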

Once both pull requests are merged, your model should appear on the official leaderboard within about a day.