LLM-As-A-Judge Evaluation using Azure OpenAI
What is LLM-as-a-Judge?
LLM-as-a-Judge is an evaluation technique for assessing the quality of text outputs from any LLM-powered product, including chatbots, Q&A systems, and agents. It uses a large language model (LLM) with an evaluation prompt to rate generated text against criteria you define.
LLM-as-a-Judge is not a single metric but a flexible technique for approximating human judgment. LLM judge evaluations are specific to your application. The success of this method depends on three key things:
- The prompt used for evaluation — how well it describes the quality checks for the LLM (see the prompt sketch after this list).
- The model — a more capable LLM makes a better judge.
- The complexity of the task — simpler tasks are easier to judge, while more complex queries need more advanced prompts.
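The prompt is usually the biggest lever. Below is a minimal sketch of such an evaluation prompt, using the four categories described later in this article; the exact wording and the build_judge_prompt helper are illustrative assumptions, not a fixed template:

JUDGE_PROMPT_TEMPLATE = """You are an impartial evaluator.
You will be given a QUESTION, the CONTEXT it should be answered from, and an ANSWER.
Classify the ANSWER as one of: Acceptable, Not Acceptable, Ask Again, Acceptable with Minor Issues.
Then give a one-sentence reason for your classification.

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}
"""

def build_judge_prompt(question, context, answer):
    # Fill the template with the items to be judged.
    return JUDGE_PROMPT_TEMPLATE.format(question=question, context=context, answer=answer)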
How It Works for Continuous Monitoring
Continuous monitoring is essential for maintaining a knowledge-centric chatbot’s accuracy, relevance, and responsiveness. Our custom FastAPI-based LLM-as-a-Judge streamlines this process and makes it highly effective. Here’s a step-by-step overview of how it works:
1. User Query and Context Retrieval
The process begins when a user asks a question. The system retrieves relevant context using an AI-powered search mechanism. This context, combined with the user’s question, forms the basis for generating an accurate response.
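As a rough sketch of this retrieval step (the index name, field name, endpoint, and key below are placeholders, and Azure AI Search via the azure-search-documents package is only one way to implement the search mechanism):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder endpoint, key, and index name; substitute your own resources.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="knowledge-base",
    credential=AzureKeyCredential("<search-api-key>"),
)

def retrieve_context(user_question, top_k=3):
    # Return the top-k matching chunks joined into a single context string.
    results = search_client.search(search_text=user_question, top=top_k)
    # The "content" field name is an assumption about the index schema.
    return "\n".join(doc["content"] for doc in results)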
2. Generating the Response
Once the context is retrieved, it is passed together with the user’s question to the LLM (large language model). The LLM processes this input and generates a response grounded in the provided context, aiming to deliver the most accurate and informative answer.
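A sketch of this generation step with the Azure OpenAI Python client; the deployment name, API version, and system prompt here are assumptions for illustration:

import os
from openai import AzureOpenAI

# Endpoint, key, and API version come from your Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def generate_answer(question, context):
    # Ask the chat model to answer strictly from the retrieved context.
    response = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nQUESTION: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content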
3. Calling LLM-as-a-Judge for Evaluation
After the LLM generates a response, it undergoes an evaluation to ensure quality and correctness. This is where the custom FastAPI-based LLM-as-a-Judge API comes in. The evaluation process involves passing the original user question, the retrieved context, and the LLM-generated response to the LLM-as-a-Judge API.
The LLM-as-a-Judge then evaluates the response based on specific criteria and assigns it to one of the following categories:
- Acceptable: The answer is fully aligned with the context and directly drawn from available information.
- Not Acceptable: The answer contains incorrect or irrelevant information that does not align with the given context.
- Ask Again: The answer is unclear or requires additional clarification to be properly understood.
- Acceptable with Minor Issues: The response is generally correct but may contain slight inaccuracies or incomplete information.
The LLM-as-a-Judge also provides reasoning behind each classification, allowing for a transparent understanding of the response quality.
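From the chatbot’s side, this is a single HTTP call into the judge service with those three fields. The route, payload shape, and judge_response helper below are assumptions about how the API contract might look:

import requests

# The URL and field names are assumptions about the FastAPI service's contract.
JUDGE_API_URL = "http://localhost:8000/evaluate"

def judge_response(question, context, answer):
    # Send the question, context, and generated answer to the LLM-as-a-Judge API.
    payload = {"question": question, "context": context, "answer": answer}
    resp = requests.post(JUDGE_API_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # Expected shape: {"evaluation": "<category>", "reason": "<explanation>"}
    return resp.json()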
4. Custom Classification Rules and Implementation
The custom classification logic helps ensure that every chatbot response meets high-quality standards. For instance, an example classification rule might be:
- If the context explicitly mentions that “Paris is the capital of France” and the answer reflects that correctly, the response is deemed Acceptable.
- If the answer deviates from the context or provides inaccurate information, it would fall under the Not Acceptable or Ask Again categories.
The FastAPI-based LLM-as-a-Judge API uses these custom rules to evaluate and provide a reason for each classification, allowing for detailed insights into each response’s quality.
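A minimal sketch of how such an endpoint could be exposed with FastAPI; the route, request and response models, and the llm_evaluation module name are assumptions, and the LLM_Evaluation class is the one shown in the code example below:

from fastapi import FastAPI
from pydantic import BaseModel

# Assumes the LLM_Evaluation class from the code example below is importable;
# the module name is a placeholder.
from llm_evaluation import LLM_Evaluation

app = FastAPI()

class EvaluationRequest(BaseModel):
    question: str
    context: str
    answer: str

class EvaluationResult(BaseModel):
    evaluation: str  # one of the four categories
    reason: str

@app.post("/evaluate", response_model=EvaluationResult)
def evaluate(request: EvaluationRequest) -> EvaluationResult:
    # Delegate to the evaluator; config and auth are omitted here for brevity.
    evaluator = LLM_Evaluation(config=None, auth=None)
    result = evaluator.evaluate_response(request.question, request.context, request.answer)
    return EvaluationResult(**result)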
5. Final Response and Continuous Improvement
Based on the evaluation result from the LLM-as-a-Judge API, the system determines whether the response is suitable for the user. If the response is classified as Not Acceptable or Ask Again, it may be flagged for further review or revision before being presented to the user.
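The routing decision itself is a small piece of logic. Here is a sketch, under the assumption that flagged responses go to a review queue or a clarification step rather than straight to the user:

def route_response(answer, judgement):
    # Decide what to do with a generated answer based on the judge's verdict.
    evaluation = judgement["evaluation"]
    if evaluation in ("Acceptable", "Acceptable with Minor Issues"):
        # Safe to show to the user; minor issues can still be logged for later analysis.
        return {"action": "deliver", "answer": answer}
    if evaluation == "Ask Again":
        # Ask for clarification instead of answering.
        return {"action": "clarify", "answer": None}
    # "Not Acceptable": hold the answer back and flag it for review or regeneration.
    return {"action": "flag_for_review", "answer": None}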
Custom Classification Rules
Code Example with Expected Output:
class LLM_Evaluation:
    def __init__(self, config, auth):
        self.config = config
        self.auth = auth

    def evaluate_response(self, question, context, answer):
        """
        You are an AI assistant. You will be given a QUESTION, CONTEXT, and an ANSWER.
        The CONTEXT is composed of various source pieces, each identified by a source number at the beginning.
        Your task is to evaluate how well the ANSWER aligns with the QUESTION and CONTEXT based on the following evaluation criteria:
        1. **Acceptable**: The ANSWER is fully aligned with and directly drawn from the information given in the CONTEXT.
        2. **Not Acceptable**: The ANSWER is not aligned with the CONTEXT or contains information not present in the CONTEXT.
        3. **Ask Again**: The ANSWER is unclear, incomplete, or requires further clarification to match the CONTEXT.
        4. **Acceptable with Minor Issues**: The ANSWER is generally acceptable but may have slight inaccuracies or incomplete information.
        Read and understand the CONTEXT thoroughly to ensure accurate evaluation.
        """
        # Inputs to be evaluated, taken from the method arguments.
        example_task_input = {
            "QUESTION": question,
            "CONTEXT": context,
            "ANSWER": answer
        }
        # Example evaluation logic: simple keyword checks stand in for the judge LLM call.
        if "Paris" in example_task_input["ANSWER"] and "capital of France" in example_task_input["CONTEXT"]:
            evaluation = "Acceptable"
            reason = "The answer is fully aligned with and directly drawn from the context."
        elif "Paris" not in example_task_input["ANSWER"]:
            evaluation = "Not Acceptable"
            reason = "The answer contradicts the context. Paris is the correct capital of France."
        else:
            evaluation = "Ask Again"
            reason = "The answer is unclear or requires further clarification."
        return {"evaluation": evaluation, "reason": reason}

# Example of how the system would work for a given task:
evaluation_result = LLM_Evaluation(config=None, auth=None).evaluate_response(
    question="What is the capital of France?",
    context="Paris is the capital of France, known for its art, fashion, and culture.",
    answer="The capital of France is Paris."
)
print(evaluation_result)
# Expected output:
# {'evaluation': 'Acceptable', 'reason': 'The answer is fully aligned with and directly drawn from the context.'}