
Text Embeddings Inference

Text Embeddings Inference (TEI) is a comprehensive toolkit for the efficient deployment and serving of open-source text embedding models. It enables high-performance embedding extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5.
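As a quick illustration, the sketch below queries a running TEI server over HTTP. It assumes a server is already listening locally with its port mapped to 8080; the request shape follows TEI's /embed route, which accepts a single input string or a list of them.

```python
# Minimal sketch: request embeddings from a TEI server assumed to be
# listening on 127.0.0.1:8080 (the /embed route is TEI's embedding API).
import requests

response = requests.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": "What is Deep Learning?"},
)
response.raise_for_status()

# The response is a list with one embedding vector per input.
embeddings = response.json()
print(len(embeddings[0]))  # dimensionality of the model's embedding space
```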

TEI offers multiple features tailored to optimize the deployment process and enhance overall performance.

Key Features:

  • Streamlined Deployment: TEI eliminates the need for a model graph compilation step, simplifying deployment.
  • Efficient Resource Utilization: Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
  • Dynamic Batching: TEI incorporates token-based dynamic batching, optimizing resource utilization during inference (see the sketch after this list).
  • Optimized Inference: TEI uses optimized transformers code for inference, leveraging Flash Attention, Candle, and cuBLASLt.
  • Safetensors weight loading: TEI loads Safetensors weights for faster boot times.
  • Production-Ready: TEI supports distributed tracing through OpenTelemetry and exports Prometheus metrics.
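
To make the dynamic batching point concrete, the sketch below fires several embedding requests concurrently from one client. It again assumes a local TEI server on port 8080; requests that arrive close together can be grouped by the server into a single forward pass based on their total token count, rather than being processed one at a time.

```python
# Sketch: concurrent clients whose requests the server can batch together.
# Assumes a TEI server listening on 127.0.0.1:8080.
from concurrent.futures import ThreadPoolExecutor

import requests

def embed(text: str) -> list[float]:
    r = requests.post("http://127.0.0.1:8080/embed", json={"inputs": text})
    r.raise_for_status()
    return r.json()[0]

sentences = [f"Example sentence number {i}" for i in range(32)]

# In-flight requests can be merged by TEI's token-based dynamic batcher.
with ThreadPoolExecutor(max_workers=8) as pool:
    vectors = list(pool.map(embed, sentences))

print(len(vectors), "embeddings of dimension", len(vectors[0]))
```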

Benchmarks

Benchmark for BAAI/bge-base-en-v1.5 on an NVIDIA A10 with a sequence length of 512 tokens:

[Figure: latency and throughput comparison for batch size 1]

[Figure: latency and throughput comparison for batch size 32]
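
As a rough illustration of how such numbers can be reproduced, the sketch below times a single request (latency) and one 32-input request (throughput) against a local TEI endpoint. The server address, input length, and request shapes are assumptions; actual results depend entirely on the hardware, model, and configuration.

```python
# Illustrative benchmark sketch against a TEI server assumed to be
# running on 127.0.0.1:8080; not a substitute for a rigorous benchmark.
import time

import requests

URL = "http://127.0.0.1:8080/embed"
text = "benchmark " * 250  # a long input, roughly on the order of 512 tokens

# Latency, batch size 1: time one single-input request.
start = time.perf_counter()
requests.post(URL, json={"inputs": text}).raise_for_status()
print(f"batch=1 latency: {(time.perf_counter() - start) * 1e3:.1f} ms")

# Throughput, batch size 32: time one request carrying 32 inputs.
start = time.perf_counter()
requests.post(URL, json={"inputs": [text] * 32}).raise_for_status()
elapsed = time.perf_counter() - start
print(f"batch=32 throughput: {32 / elapsed:.1f} inputs/s")
```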

Getting Started:

To start using TEI, check the Quick Tour guide.
