app.phoenix.arize.com
www.deeplearning.ai/short-course...
🔒 Private CA support—no public cert required
🏠 Data never leaves your network
👥 Leverage existing AD groups for access control
Requires Phoenix 12.20.0+
arize.com/docs/phoeni...
• string OR message templates
• {var} and {{var}} syntax
• automatic provider-specific transformations
• consistent, reproducible scoring
More prompt-engineering context here: arize.com/docs/phoeni...
Separate:
🔹 Evaluator instructions (system/developer)
from
🔹 Content being judged (user)
This reduces false positives from content filters and stabilizes judge behavior.
• OpenAI: system becomes developer for reasoning models
• Anthropic and Gemini: system extracted into a top-level param
If everything is shoved into a message, you get safety flags, refusals, and polluted evals.
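A minimal sketch of what those two transformations could look like, assuming a plain list of role/content dicts as input. The function names and the `system` output key are hypothetical, chosen for illustration; this is not Phoenix's implementation, and the real top-level parameter name varies by provider.

```python
from typing import Any

Message = dict[str, str]


def to_openai(messages: list[Message], reasoning_model: bool) -> dict[str, Any]:
    """Rename the system role to developer when targeting a reasoning model."""
    converted = []
    for m in messages:
        role = m["role"]
        if reasoning_model and role == "system":
            role = "developer"
        converted.append({"role": role, "content": m["content"]})
    return {"messages": converted}


def to_top_level_system(messages: list[Message]) -> dict[str, Any]:
    """Pull system messages out of the list into one top-level field.

    The key name "system" is an assumption for this sketch.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    return {"system": "\n\n".join(system_parts), "messages": rest}


template = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: {question}\nAnswer: {answer}"},
]

print(to_openai(template, reasoning_model=True))
print(to_top_level_system(template))
```

Either way the evaluator instructions end up in the slot the provider reserves for them, and only the content being judged stays in the user message.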
[
{"role": "system", "content": "You evaluate helpfulness."},
{"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
]
Variables get slotted into the correct role, preserving evaluator intent.
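To make the slotting concrete, here is a rough sketch of rendering a message template like the one above. `render_messages` is a hypothetical helper, not a Phoenix function, and it only handles the single-brace `{var}` syntax.

```python
import re

# Matches single-brace placeholders such as {question}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_messages(template: list[dict], variables: dict) -> list[dict]:
    """Fill placeholders in each message while leaving its role untouched."""

    def fill(text: str) -> str:
        # One pass over the template text: substituted values are never
        # re-scanned, so braces inside the judged content stay literal.
        return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), text)

    return [{"role": m["role"], "content": fill(m["content"])} for m in template]


template = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: {question}\nAnswer: {answer}"},
]

rendered = render_messages(
    template, {"question": "What is 2+2?", "answer": "Use {question} as a hint"}
)
for message in rendered:
    print(message["role"], "->", message["content"])
```

A missing variable raises `KeyError` in this sketch, which is usually preferable to silently sending an unfilled placeholder to the judge.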
• task vs. judge confusion
• content-filter refusals
• inconsistent or drifted scoring
• evaluator instructions being ignored
Message formatting is now a critical piece
model-spec.openai.com/2025-10-27....
Review 100+ diverse traces
Write descriptive notes on anything that feels wrong
Group similar notes into themes (axial coding)
Count the frequency of each theme
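The last two steps can be sketched in a few lines of Python. The theme labels and the third note are made up for illustration; in practice a reviewer assigns the themes by hand after reading the traces.

```python
from collections import Counter

# (note, theme) pairs. Themes are hypothetical labels; the third note is
# invented purely to show two notes landing in the same theme.
coded_notes = [
    ("Kept trying to solve instead of handing off", "missed handoff"),
    ("Missed opportunity to re-engage a price-sensitive user", "missed re-engagement"),
    ("Did not escalate after repeated failures", "missed handoff"),
]

# Count how often each theme shows up across the reviewed traces
theme_counts = Counter(theme for _, theme in coded_notes)

for theme, count in theme_counts.most_common():
    print(f"{theme}: {count}")
```

Sorting by frequency shows which failure mode to fix first.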
"Kept trying to solve instead of handing off"
These descriptive observations are more valuable than generic scores—they tell you what's actually breaking.
"Missed opportunity to re-engage a price-sensitive user"