arize-phoenix
@arize-phoenix.bsky.social
Open-Source AI Observability and Evaluation
app.phoenix.arize.com
Why this matters for on-prem:
🔒 Private CA support—no public cert required
🏠 Data never leaves your network
👥 Leverage existing AD groups for access control
Requires Phoenix 12.20.0+

arize.com/docs/phoeni...
12.10.2025: LDAP Authentication Support - Phoenix
Available in Phoenix 12.20+
arizeai-433a7140.mintlify.app
December 16, 2025 at 6:39 PM
Phoenix Evals now handles:

• string OR message templates
• {var} and {{var}} syntax
• automatic provider-specific transformations
• consistent, reproducible scoring
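
A rough sketch of that normalization in plain Python (this is not the phoenix-evals API itself; render_template and its regex are illustrative assumptions):

import re

def render_template(template, variables):
    """Render a string template or an OpenAI-style message list.
    Accepts both {var} and {{var}} placeholder syntax."""
    pattern = re.compile(r"\{\{(\w+)\}\}|\{(\w+)\}")
    def substitute(text):
        return pattern.sub(lambda m: str(variables[m.group(1) or m.group(2)]), text)
    if isinstance(template, str):
        # String templates become a single user message.
        return [{"role": "user", "content": substitute(template)}]
    # Message templates keep their roles; only the content is rendered.
    return [{**msg, "content": substitute(msg["content"])} for msg in template]

render_template("Question: {question}\nAnswer: {{answer}}",
                {"question": "Is Phoenix open source?", "answer": "Yes"})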

More prompt-engineering context here: arize.com/docs/phoeni...
12.10.2025: Evaluator Message Formats - Phoenix
Available in phoenix-evals 0.22+ (Python) and @arizeai/phoenix-evals 2.0+ (TypeScript)
arizeai-433a7140.mintlify.app
December 11, 2025 at 4:37 AM
Message-separated prompts fix this by isolating:

🔹 Evaluator instructions (system/developer)
from
🔹 Content being judged (user)

This reduces false positives from content filters and stabilizes judge behavior.
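
A minimal before/after sketch of the two shapes (the question and answer values are made up):

question = "How do I reset my password?"
answer = "Click 'Forgot password' on the login page."

# Flat prompt: judge instructions and judged content share one user message.
flat = [{"role": "user",
         "content": f"You evaluate helpfulness.\nQuestion: {question}\nAnswer: {answer}"}]

# Message-separated prompt: instructions stay in system, judged content in user.
separated = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
]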
December 11, 2025 at 4:37 AM
This is critical because providers interpret roles differently:

• OpenAI: system becomes developer for reasoning models
• Anthropic and Gemini: system extracted into a top-level param

If everything is shoved into a single message, you get safety flags, refusals, and polluted evals.
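
Sketching those transformations with the raw provider SDKs (model names are assumptions, and API keys are assumed to be set in the environment; Gemini is analogous via its system_instruction config):

from openai import OpenAI
from anthropic import Anthropic

messages = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: What does Phoenix do?\nAnswer: Open-source AI observability."},
]

# OpenAI reasoning models: the system message is sent under the "developer" role.
OpenAI().chat.completions.create(
    model="o4-mini",  # assumed model name
    messages=[{"role": "developer", "content": messages[0]["content"]}, messages[1]],
)

# Anthropic: the system text moves to a top-level `system` parameter.
Anthropic().messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=1024,
    system=messages[0]["content"],
    messages=[messages[1]],
)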
December 11, 2025 at 4:37 AM
With Phoenix Evals you can now define evaluators using OpenAI-style message lists:

[
  {"role": "system", "content": "You evaluate helpfulness."},
  {"role": "user", "content": "Question: {question}\nAnswer: {answer}"}
]

Variables get slotted into the correct role, preserving evaluator intent.
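
Concretely, rendering that template keeps each variable in the role it was written for (a sketch using str.format; the variable values are made up):

evaluator_messages = [
    {"role": "system", "content": "You evaluate helpfulness."},
    {"role": "user", "content": "Question: {question}\nAnswer: {answer}"},
]

variables = {"question": "What does Phoenix do?",
             "answer": "Open-source AI observability and evaluation."}

# Only the user message contains placeholders, so the judge's instructions
# in the system message are never touched by the content being judged.
rendered = [{**m, "content": m["content"].format(**variables)}
            for m in evaluator_messages]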
December 11, 2025 at 4:37 AM
Historically, many eval pipelines used single-string prompts. But today’s LLMs are role-aware, and that old pattern can lead to:

• task vs. judge confusion
• content-filter refusals
• inconsistent or drifted scoring
• evaluator instructions being ignored

Message formatting is now a critical piece of eval design.
December 11, 2025 at 4:37 AM
This matters even more in light of OpenAI’s latest model spec update, which emphasizes stricter role semantics across system / developer / user messages:
model-spec.openai.com/2025-10-27....
OpenAI Model Spec
The Model Spec specifies desired behavior for the models underlying OpenAI's products (including our APIs).
model-spec.openai.com
December 11, 2025 at 4:37 AM
If you have any feedback, please let @arizeai @mikeldking and the team know!
December 10, 2025 at 6:20 AM
12.10.2025: Span Notes API - Phoenix
Available in Phoenix 12.20+
arizeai-433a7140.mintlify.app
December 10, 2025 at 6:20 AM
This process has been core to ML error analysis for decades. @GergelyOrosz and @HamelHusain wrote an excellent deep-dive on how teams are applying it to LLM development: newsletter.pragmaticengineer.com/p/evals
A pragmatic guide to LLM evals for devs
Evals are a new toolset for any and all AI engineers – and software engineers should also know about them. Move from guesswork to a systematic engineering process for improving AI quality.
newsletter.pragmaticengineer.com
December 10, 2025 at 6:20 AM
What you get: a data-driven priority list of your system's real failure modes, not what a generic benchmark thinks might be wrong.
December 10, 2025 at 6:20 AM
The workflow:

1. Review 100+ diverse traces
2. Write descriptive notes on anything that feels wrong
3. Group similar notes into themes (axial coding)
4. Count the frequency of each theme
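
A toy sketch of steps 3 and 4 (the extra note and the theme labels are made up for illustration):

from collections import Counter

# Open-ended notes written while reviewing traces (step 2).
notes = [
    "Asked for confirmation twice",
    "Asked the user to confirm the same detail again",
    "Kept trying to solve instead of handing off",
    "Missed opportunity to re-engage a price-sensitive user",
]

# Axial coding (step 3): group similar notes under a broader theme.
# In practice this mapping is a human judgment call, not a lookup table.
theme_of = {
    "Asked for confirmation twice": "redundant confirmation",
    "Asked the user to confirm the same detail again": "redundant confirmation",
    "Kept trying to solve instead of handing off": "missed handoff",
    "Missed opportunity to re-engage a price-sensitive user": "missed re-engagement",
}

# Step 4: count theme frequency to get a data-driven priority list.
print(Counter(theme_of[n] for n in notes).most_common())
# [('redundant confirmation', 2), ('missed handoff', 1), ('missed re-engagement', 1)]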
December 10, 2025 at 6:20 AM
"Asked for confirmation twice"
"Kept trying to solve instead of handing off"
These descriptive observations are more valuable than generic scores—they tell you what's actually breaking.
December 10, 2025 at 6:20 AM
The idea: instead of starting with predefined categories like "hallucination" or "helpfulness," you review your traces and write open-ended notes about what you observe. Let the failure modes emerge from your data.
"Missed opportunity to re-engage a price-sensitive user"
December 10, 2025 at 6:20 AM
Open coding is a technique from qualitative research that's becoming essential for LLM evaluation.
December 10, 2025 at 6:20 AM