arize-phoenix
@arize-phoenix.bsky.social
Open-Source AI Observability and Evaluation
app.phoenix.arize.com
Why this matters for on-prem:
🔒 Private CA support—no public cert required
🏠 Data never leaves your network
👥 Leverage existing AD groups for access control
Requires Phoenix 12.20.0+

arize.com/docs/phoeni...
12.10.2025: LDAP Authentication Support - Phoenix
Available in Phoenix 12.20+
arizeai-433a7140.mintlify.app
December 16, 2025 at 6:39 PM
Phoenix Evals now handles:

• string OR message templates
• {var} and {{var}} syntax
• automatic provider-specific transformations
• consistent, reproducible scoring
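
A rough sketch of that normalization in plain Python (this is not the phoenix-evals API itself; render_template and its regex are illustrative assumptions):

import re

def render_template(template, variables):
    """Render a string template or an OpenAI-style message list.
    Accepts both {var} and {{var}} placeholder syntax."""
    pattern = re.compile(r"\{\{(\w+)\}\}|\{(\w+)\}")
    def substitute(text):
        return pattern.sub(lambda m: str(variables[m.group(1) or m.group(2)]), text)
    if isinstance(template, str):
        # String templates become a single user message.
        return [{"role": "user", "content": substitute(template)}]
    # Message templates keep their roles; only the content is rendered.
    return [{**msg, "content": substitute(msg["content"])} for msg in template]

render_template("Question: {question}\nAnswer: {{answer}}",
                {"question": "Is Phoenix open source?", "answer": "Yes"})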

More prompt-engineering context here: arize.com/docs/phoeni...
12.10.2025: Evaluator Message Formats - Phoenix
Available in phoenix-evals 0.22+ (Python) and @arizeai/phoenix-evals 2.0+ (TypeScript)
arizeai-433a7140.mintlify.app
December 11, 2025 at 4:37 AM
Message-separated prompts fix this by isolating:

🔹 Evaluator instructions (system/developer)
from
🔹 Content being judged (user)

This reduces false positives from content filters and stabilizes judge behavior.
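
A minimal before/after sketch of the two shapes (the question and answer values are made up):

question = "How do I reset my password?"
answer = "Click 'Forgot password' on the login page."

# Flat prompt: judge instructions and judged content share one user message.
flat = [{"role": "user",
         "content": f"You evaluate helpfulness.\nQuestion: {question}\nAnswer: {answer}"}]

# Message-separated prompt: instructions stay in system, judged content in user.
separated = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
]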
December 11, 2025 at 4:37 AM
This is critical because providers interpret roles differently:

• OpenAI: system becomes developer for reasoning models
• Anthropic and Gemini: system extracted into a top-level param

If everything is shoved into a single message, you get safety flags, refusals, and polluted evals.
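
Sketching those transformations with the raw provider SDKs (model names are assumptions, and API keys are assumed to be set in the environment; Gemini is analogous via its system_instruction config):

from openai import OpenAI
from anthropic import Anthropic

messages = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: What does Phoenix do?\nAnswer: Open-source AI observability."},
]

# OpenAI reasoning models: the system message is sent under the "developer" role.
OpenAI().chat.completions.create(
    model="o4-mini",  # assumed model name
    messages=[{"role": "developer", "content": messages[0]["content"]}, messages[1]],
)

# Anthropic: the system text moves to a top-level `system` parameter.
Anthropic().messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=1024,
    system=messages[0]["content"],
    messages=[messages[1]],
)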
December 11, 2025 at 4:37 AM
With Phoenix Evals you can now define evaluators using OpenAI-style message lists:

[
  {"role": "system", "content": "You evaluate helpfulness."},
  {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
]

Variables get slotted into the correct role, preserving evaluator intent.
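
Concretely, rendering that template keeps each variable in the role it was written for (a sketch using str.format; the variable values are made up):

evaluator_messages = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: {question}\nAnswer: {answer}"},
]

variables = {"question": "What does Phoenix do?",
             "answer": "Open-source AI observability and evaluation."}

# Only the user message contains placeholders, so the judge's instructions
# in the system message are never touched by the content being judged.
rendered = [{**m, "content": m["content"].format(**variables)}
            for m in evaluator_messages]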
December 11, 2025 at 4:37 AM
Historically, many eval pipelines used single-string prompts. But today’s LLMs are role-aware, and that old pattern can lead to:

• task vs. judge confusion
• content-filter refusals
• inconsistent or drifted scoring
• evaluator instructions being ignored

Message formatting is now a critical piece of eval design.
December 11, 2025 at 4:37 AM
This matters even more in light of OpenAI’s latest model spec update, which emphasizes stricter role semantics across system / developer / user messages:
model-spec.openai.com/2025-10-27....
OpenAI Model Spec
The Model Spec specifies desired behavior for the models underlying OpenAI's products (including our APIs).
model-spec.openai.com
December 11, 2025 at 4:37 AM
If you have any feedback, please let @arizeai @mikeldking and the team know!
December 10, 2025 at 6:20 AM
12.10.2025: Span Notes API - Phoenix
Available in Phoenix 12.20+
arizeai-433a7140.mintlify.app
December 10, 2025 at 6:20 AM
This process has been core to ML error analysis for decades. @GergelyOrosz and @HamelHusain wrote an excellent deep-dive on how teams are applying it to LLM development: newsletter.pragmaticengineer.com/p/evals
A pragmatic guide to LLM evals for devs
Evals are a new toolset for any and all AI engineers – and software engineers should also know about them. Move from guesswork to a systematic engineering process for improving AI quality.
newsletter.pragmaticengineer.com
December 10, 2025 at 6:20 AM
What you get: a data-driven priority list of your system's real failure modes, not what a generic benchmark thinks might be wrong.
December 10, 2025 at 6:20 AM
The workflow:

1. Review 100+ diverse traces
2. Write descriptive notes on anything that feels wrong
3. Group similar notes into themes (axial coding)
4. Count the frequency of each theme
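
A toy sketch of steps 3 and 4 (the extra note and the theme labels are made up for illustration):

from collections import Counter

# Open-ended notes written while reviewing traces (step 2).
notes = [
    "Asked for confirmation twice",
    "Asked the user to confirm the same detail again",
    "Kept trying to solve instead of handing off",
    "Missed opportunity to re-engage a price-sensitive user",
]

# Axial coding (step 3): group similar notes under a broader theme.
# In practice this mapping is a human judgment call, not a lookup table.
theme_of = {
    "Asked for confirmation twice": "redundant confirmation",
    "Asked the user to confirm the same detail again": "redundant confirmation",
    "Kept trying to solve instead of handing off": "missed handoff",
    "Missed opportunity to re-engage a price-sensitive user": "missed re-engagement",
}

# Step 4: count theme frequency to get a data-driven priority list.
print(Counter(theme_of[n] for n in notes).most_common())
# [('redundant confirmation', 2), ('missed handoff', 1), ('missed re-engagement', 1)]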
December 10, 2025 at 6:20 AM
"Asked for confirmation twice"
"Kept trying to solve instead of handing off"
These descriptive observations are more valuable than generic scores—they tell you what's actually breaking.
December 10, 2025 at 6:20 AM
The idea: instead of starting with predefined categories like "hallucination" or "helpfulness," you review your traces and write open-ended notes about what you observe. Let the failure modes emerge from your data.
"Missed opportunity to re-engage a price-sensitive user"
December 10, 2025 at 6:20 AM
Open coding is a technique from qualitative research that's becoming essential for LLM evaluation.
December 10, 2025 at 6:20 AM