[IMPROVE] AI Evals #490

## Title


## Background


Set of representative questions:

https://github.com/CodeForPhilly/balancer-main/issues/345#issuecomment-3433329904

https://github.com/CodeForPhilly/balancer-main/issues/411#issuecomment-3712677508

https://github.com/CodeForPhilly/balancer-main/tree/develop/evaluation


## Current State





## Acceptance Criteria
- [ ] 

## Approach


Start with [error analysis](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed), not infrastructure. Spend 30 minutes manually reviewing 20-50 LLM outputs whenever you make significant changes. Use one [domain expert](https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs) who understands your users as your quality decision maker (a “[benevolent dictator](https://hamel.dev/blog/posts/evals-faq/#q-how-many-people-should-annotate-my-llm-outputs)”).
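The manual review loop above can be supported with a very small script. The following is a minimal sketch, not the project's existing evaluation code: the function names, the `question`/`answer` keys, and the `pass_fail`/`failure_note` columns are all hypothetical and would need to match however outputs are actually logged. It samples a review-sized batch, produces a CSV for the domain expert to annotate, and then counts the failure notes so the most common problems surface first.

```python
import csv
import io
import random
from collections import Counter

# Hypothetical column layout for the review sheet (an assumption, not a project convention).
REVIEW_FIELDS = ["question", "answer", "pass_fail", "failure_note"]


def sample_for_review(rows, k=30, seed=0):
    """Pick up to k logged outputs uniformly at random (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def to_review_csv(rows):
    """Render sampled rows as CSV text with empty columns for the reviewer to fill in."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=REVIEW_FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "question": row["question"],
            "answer": row["answer"],
            "pass_fail": "",
            "failure_note": "",
        })
    return buf.getvalue()


def tally_failures(csv_text):
    """Count the reviewer's failure notes among rows marked 'fail'."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return Counter(
        r["failure_note"].strip().lower()
        for r in reader
        if r["pass_fail"].strip().lower() == "fail"
    )
```

A reviewer would save the output of `to_review_csv(sample_for_review(rows))` to a file, fill in the two empty columns in a spreadsheet, and pass the annotated text to `tally_failures`. Free-text notes are deliberately kept instead of a fixed category list, on the assumption that failure categories should emerge from the review instead of being decided up front.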

The OpenAI API dashboard reports total duration and cost metrics.



## References


## Risks and Rollback


## Screenshots / Recordings


## Related PR
