Skip to main content
The standard for evaluating text is human labeling. However, high-quality LLM outputs are becoming cheaper and faster to produce, and human evaluation cannot scale. In this context, evaluating the performance of LLM applications is best tackled by using a LLM. The Phoenix Evals library is designed for simple, fast, and accurate LLM-based evaluations.

Features

Phoenix Evals provides lightweight, composable building blocks for writing and running evaluations on LLM applications. It can be installed completely independently of the arize-phoenix package and is available in both Python and TypeScript versions.
  • Works with your preferred model SDKs via SDK adapters (OpenAI, LiteLLM, LangChain, AI SDK) - Phoenix lets you configure which foundation model you’d like to use as a judge. This includes OpenAI, Anthropic, Gemini, and much more. See Configuring the LLM.
  • Powerful input mapping and binding for working with complex data structures - easily map nested data and complex inputs to evaluator requirements.
  • Several pre-built metrics for common evaluation tasks like hallucination detection - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling. Learn more about pretested templates here. Each eval is pre-tested on a variety of eval models. Find the most up-to-date benchmarks on GitHub.
  • Evaluators are natively instrumented via OpenTelemetry tracing for observability and dataset curation. See Evaluator Traces for an overview.
  • Blazing fast performance - achieve up to 20x speedup with built-in concurrency and batching. Evals run in batches and typically run much faster than calling the APIs directly. See Executors for details on how this works.
  • Tons of convenience features to improve the developer experience!
  • Run evals on your own data - comes with native dataframe and data transformation utilities, making it easy to run evaluations on your own data—whether that’s logs, traces, or datasets downloaded for benchmarking.
  • Built-in Explanations - All Phoenix evaluations include an explanation capability that requires eval models to explain their judgment rationale. This boosts performance and helps you understand and improve your eval.