Comprehensive Evaluation Rubric

Rating scale: 0 Stars (Critically Flawed), 1 Star (Highly Unsatisfactory), 2 Stars (Unsatisfactory), 3 Stars (Satisfactory), 4 Stars (Excellent), 5 Stars (Exceptional / Perfect)

Overall Satisfaction
0 Stars: The interaction is non-functional, provides harmful information, or completely fails to address the user's need.
1 Star: The interaction is a struggle, fails to meet the user's primary goal, and leads to extreme frustration.
2 Stars: The user's goal is only partially met, and the process is inefficient or frustrating. The user is largely dissatisfied.
3 Stars: The assistant meets the basic requirements of the user's request, but the interaction has noticeable flaws or inefficiencies.
4 Stars: The user's goal is met effectively and efficiently with only minor, negligible imperfections. The experience is very positive.
5 Stars: The user's needs are fully met in a seamless, highly efficient, and pleasant manner. The user would strongly prefer this assistant over any alternative.

Naturalness
0 Stars: The assistant's language is incoherent, nonsensical, or completely unrelated to the conversation.
1 Star: The language is consistently robotic and very difficult to understand due to poor phrasing or grammatical structure.
2 Stars: The language is frequently unnatural and stilted. The robotic phrasing makes the conversation awkward to follow.
3 Stars: The language is mostly natural but has occasional awkward or robotic phrasing that breaks the conversational flow.
4 Stars: The assistant's language is fluid and resembles a human's. Any artificiality is very subtle and does not detract from the conversation.
5 Stars: The assistant's language is indistinguishable from that of a thoughtful and articulate human expert. The tone and flow are perfectly natural.

Grounding Sources
0 Stars: The provided sources have zero relevance to the user's questions, making them completely useless.
1 Star: Almost none of the user's questions can be answered by the sources. The documents are largely irrelevant to the query.
2 Stars: Less than half of the user's questions can be answered by the sources. The documents are mostly inadequate.
3 Stars: More than half of the user's questions can be answered by the sources, but there are significant gaps.
4 Stars: All major questions can be answered by the sources. A minor sub-question might not be covered.
5 Stars: Every single question and sub-question posed by the user can be fully and comprehensively answered using the provided reference documents.

Redundancy
0 Stars: The assistant is stuck in a loop or every response is a useless rehash of the previous one.
1 Star: The conversation is bloated with constant repetition, making it frustrating and difficult to extract information.
2 Stars: The assistant frequently repeats itself or provides overly specified information, harming conversational efficiency.
3 Stars: The assistant occasionally repeats information or uses slightly verbose phrasing, but it's not a major issue.
4 Stars: The conversation is almost entirely free of redundancy, with perhaps one minor, isolated instance of repetition.
5 Stars: The conversation is perfectly streamlined. Every utterance is purposeful, adds new value, and contains no unnecessary repetition.

Conciseness
0 Stars: Provides massive, unusable walls of text that completely disregard the need for brevity.
1 Star: Nearly all responses are far too long and rambling, burying key information and making them difficult to use.
2 Stars: Many responses (more than half) are too long and contain unnecessary information, requiring effort to parse.
3 Stars: Responses are generally concise, but some (less than half) could have been shorter without losing meaning.
4 Stars: All responses are consistently concise and well-judged in length, with almost no wasted words.
5 Stars: Every response is perfectly tailored in length, delivering information as compactly as possible without sacrificing clarity or a natural tone.

Efficiency
0 Stars: The conversation makes no progress, goes in circles, and completely fails to address the user's goal.
1 Star: The conversation is highly inefficient, taking an extremely high number of turns with very little progress.
2 Stars: The conversation takes significantly more turns than necessary or ends prematurely, failing to resolve the query properly.
3 Stars: The goal is reached, but the conversation takes a few more turns than ideal due to minor misunderstandings or meandering.
4 Stars: The goal is met in a very reasonable number of turns. The interaction feels quick and is very close to the optimal path.
5 Stars: The user's goal is achieved in the absolute minimum number of conversational turns possible, with the assistant anticipating needs to prevent back-and-forth.

Functional Correctness
0 Stars: Code is non-functional, won't compile/run, or produces catastrophic errors.
1 Star: Code has fundamental logical errors and fails on the most basic test cases.
2 Stars: Code works for the primary "happy path" but fails on most other valid inputs or common edge cases.
3 Stars: Code is mostly functional but has noticeable bugs or fails on some important edge cases.
4 Stars: The code is fully functional, produces the correct output, and passes all common and edge-case tests.
5 Stars: The code is flawlessly functional. The logic is not only correct but also elegant, simple, and demonstrably robust.

Efficiency & Optimization
0 Stars: Code is extremely inefficient; it hangs, times out, or consumes excessive resources on trivial inputs.
1 Star: The code uses a grossly inefficient (e.g., brute-force) algorithm where a far superior standard alternative exists.
2 Stars: The code works, but its performance is sub-optimal. It's noticeably slow or memory-intensive for realistic inputs.
3 Stars: The implementation has acceptable performance for typical use cases but isn't highly optimized.
4 Stars: The code uses appropriate algorithms and data structures, resulting in strong performance and resource management.
5 Stars: The code is optimally efficient, demonstrating a deep understanding of performance with best-in-class speed and resource usage.

Readability & Maintainability
0 Stars: Code is obfuscated and impossible for a human to understand (e.g., terrible naming, no structure).
1 Star: Code is extremely difficult to follow due to cryptic variable names, lack of comments, and confusing logic.
2 Stars: Code is hard to read and requires significant effort to understand. It lacks sufficient comments or consistent formatting.
3 Stars: The code is readable, but a developer needs to study it to understand its logic and flow.
4 Stars: The code is clean and well-structured with good variable names and helpful comments. It's easy for another developer to follow.
5 Stars: The code is exceptionally clear and self-documenting. Its structure and logic are so intuitive that it's instantly understandable.

Security & Robustness
0 Stars: Code contains critical, obvious security vulnerabilities (e.g., SQL injection) and makes no attempt to handle errors.
1 Star: The code is highly vulnerable and brittle. It has clear security flaws and crashes on any invalid or unexpected input.
2 Stars: The code is brittle, lacking basic input validation and error handling. It may have subtle security issues.
3 Stars: The code handles some common errors but is not robust against a wider range of unexpected inputs and lacks awareness of security best practices.
4 Stars: The code is robust and secure. It includes proper error handling, validates inputs, and follows standard security practices to prevent common vulnerabilities.
5 Stars: The code is exceptionally robust, gracefully handling a wide range of edge cases and potential failure modes, while adhering to the highest security standards.
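
Illustrative example for Security & Robustness. The Python sketch below is one possible way to picture the two ends of that scale: the SQL injection flaw that the 0-star description names, and the input validation plus standard practice (a parameterized query) that the 4-star description asks for. It is not part of the rubric itself. The table, the function names, and the sample data are hypothetical, and sqlite3 is assumed only because it ships with Python.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "viewer")],
    )
    return conn


def find_role_unsafe(conn: sqlite3.Connection, name: str) -> list:
    """Low end of the scale: user input is pasted into the SQL text,
    so crafted input can change the meaning of the query."""
    query = f"SELECT role FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_role_safe(conn: sqlite3.Connection, name: str) -> list:
    """High end of the scale: input is validated, then passed as a bound
    parameter, so it is only ever treated as data."""
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    rows = conn.execute("SELECT role FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]
```

Called with the input `x' OR '1'='1`, the unsafe version returns every role in the table, while the safe version returns an empty list because no user has that literal name.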