Add LLM-as-judge rubric

2026-06-10 17:30:51 +00:00 · 2026-06-10 17:30:51 +00:00 · bc0edbf14e
parent 34d6806bd5
commit bc0edbf14e
1 changed file with 77 additions and 0 deletions
--- a/evaluation/llm-as-judge.md
+++ b/evaluation/llm-as-judge.md
@@ -0,0 +1,77 @@
+---
+title: "LLM-as-Judge Evaluation Rubric"
+domain: llm-engineering
+persona: "Prompt Engineer"
+persona_background: >
+  Specialist prompt engineer with deep expertise in few-shot learning, chain-of-thought, and instruction tuning.
+persona_style: "iterative, example-driven, references benchmark results"
+models: [gpt-4, claude-3-5]
+keywords: [LLM-as-judge, evaluation, rubric, benchmark, quality-scoring]
+task: "Use an LLM to score another LLM's output against a structured rubric."
+validated: true
+version: 1.0.0
+author: promptadmin
+source_repositories:
+  - https://github.com/promptslab/awesome-prompt-engineering
+  - https://github.com/corralm/awesome-prompting
+---
+
+# LLM-as-Judge Evaluation Rubric
+
+## Persona
+
+> You are a **Prompt Engineer**: a specialist with deep expertise in few-shot learning, chain-of-thought prompting, and instruction tuning.
+> Your communication style is iterative and example-driven, and you reference benchmark results.
+
+## Task
+
+Use an LLM to score another LLM's output against a structured rubric.
+
+## Prompt
+
+```
+You are an expert evaluator assessing LLM outputs. You must be rigorous, consistent, and unbiased.
+
+Task given to the evaluated model:
+{original_task}
+
+Model output to evaluate:
+{model_output}
+
+Evaluate on the following dimensions (score each from 1 = worst to 5 = best, with evidence):
+
+1. **Accuracy** — Is the information factually correct?
+   Score: /5 | Evidence: [quote specific supporting or refuting evidence]
+
+2. **Completeness** — Does it address all aspects of the task?
+   Score: /5 | Missing: [list any missing elements]
+
+3. **Coherence** — Is the reasoning logical and well-structured?
+   Score: /5 | Issues: [note any logical gaps]
+
+4. **Helpfulness** — Would this genuinely help the intended user?
+   Score: /5 | Rationale: [brief justification]
+
+5. **Conciseness** — Is it appropriately concise without losing quality?
+   Score: /5 | Issues: [note any padding or repetition]
+
+TOTAL: /25
+VERDICT: Excellent (21-25) / Good (16-20) / Adequate (11-15) / Poor (5-10)
+
+One-line summary for model comparison:
+```
+
+## Notes
+
+Based on the LLM-as-judge evaluation methodology of MT-Bench and Chatbot Arena. Reference: the LLM-as-judge survey listed in promptslab/Awesome-Prompt-Engineering.
+
+## Compatibility
+
+| Model | Tested | Notes |
+|-------|--------|-------|
+| gpt-4 | ✅ | |
+| claude-3-5 | ✅ | |
+
+## Keywords
+
+`LLM-as-judge` `evaluation` `rubric` `benchmark` `quality-scoring`
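
A minimal Python sketch of how the rubric prompt added in this commit might be wired up: fill the `{original_task}` and `{model_output}` placeholders, parse the five per-dimension scores from the judge's reply, and recompute the total and verdict band locally rather than trusting the judge's arithmetic. The function names, the `SAMPLE_REPLY` text, and the assumption that the judge keeps the template layout (dimension name on one line, `Score: N/5` on the next) are illustrative and not part of the commit. No model API is called here.

```python
import re

# Dimension names as they appear in the rubric prompt.
DIMENSIONS = ["Accuracy", "Completeness", "Coherence", "Helpfulness", "Conciseness"]


def build_prompt(template: str, original_task: str, model_output: str) -> str:
    """Fill the two placeholders in a single pass.

    A single regex pass (instead of str.format) keeps braces inside the
    inserted text, such as JSON in the model output, from being reinterpreted.
    """
    values = {"original_task": original_task, "model_output": model_output}
    return re.sub(
        r"\{(original_task|model_output)\}",
        lambda m: values[m.group(1)],
        template,
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract one 1-5 score per dimension from the judge's reply.

    Assumes the judge preserved the template layout: the dimension name on
    one line and "Score: N/5" at the start of the following line.
    """
    scores = {}
    for name in DIMENSIONS:
        match = re.search(rf"{name}\b[^\n]*\n\s*Score:\s*([1-5])\s*/\s*5", reply)
        if match is None:
            raise ValueError(f"no score found for {name}")
        scores[name] = int(match.group(1))
    return scores


def verdict(total: int) -> str:
    """Map a total out of 25 to the verdict bands listed in the prompt."""
    if total >= 21:
        return "Excellent"
    if total >= 16:
        return "Good"
    if total >= 11:
        return "Adequate"
    return "Poor"


# Hypothetical judge reply, written by hand for illustration only.
SAMPLE_REPLY = """\
1. **Accuracy**
   Score: 4/5 | Evidence: "released in 2019" matches the task context
2. **Completeness**
   Score: 5/5 | Missing: none
3. **Coherence**
   Score: 4/5 | Issues: one unexplained jump between steps 2 and 3
4. **Helpfulness**
   Score: 3/5 | Rationale: correct but assumes expert background
5. **Conciseness**
   Score: 4/5 | Issues: repeats the conclusion twice
"""

scores = parse_scores(SAMPLE_REPLY)
total = sum(scores.values())  # recomputed locally, not read from a TOTAL line
print(total, verdict(total))  # -> 20 Good
```

Recomputing the total from the parsed scores guards against a judge whose `TOTAL` line disagrees with its own per-dimension scores; a stricter version could also compare the two and flag mismatches.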