| Overall Satisfaction | The interaction is non-functional, provides harmful information, or completely fails to address the user's need. | The interaction is a struggle, fails to meet the user's primary goal, and leads to extreme frustration. | The user's goal is only partially met, and the process is inefficient or frustrating. The user is largely dissatisfied. | The assistant meets the basic requirements of the user's request, but the interaction has noticeable flaws or inefficiencies. | The user's goal is met effectively and efficiently with only minor, negligible imperfections. The experience is very positive. | The user's needs are fully met in a seamless, highly efficient, and pleasant manner. The user would strongly prefer this assistant over any alternative. |
| Naturalness | The assistant's language is incoherent, nonsensical, or completely unrelated to the conversation. | The language is consistently robotic and very difficult to understand due to poor phrasing or grammatical structure. | The language is frequently unnatural and stilted. The robotic phrasing makes the conversation awkward to follow. | The language is mostly natural but has occasional awkward or robotic phrasing that breaks the conversational flow. | The assistant's language is fluid and resembles a human's. Any artificiality is very subtle and does not detract from the conversation. | The assistant's language is indistinguishable from that of a thoughtful and articulate human expert. The tone and flow are perfectly natural. |
| Grounding Sources | The provided sources have zero relevance to the user's questions, making them completely useless. | Almost none of the user's questions can be answered by the sources. The documents are largely irrelevant to the query. | Less than half of the user's questions can be answered by the sources. The documents are mostly inadequate. | More than half of the user's questions can be answered by the sources, but there are significant gaps. | All major questions can be answered by the sources. A minor sub-question might not be covered. | Every single question and sub-question posed by the user can be fully and comprehensively answered using the provided reference documents. |
| Redundancy | The assistant is stuck in a loop or every response is a useless rehash of the previous one. | The conversation is bloated with constant repetition, making it frustrating and difficult to extract information. | The assistant frequently repeats itself or provides overly detailed information, harming conversational efficiency. | The assistant occasionally repeats information or uses slightly verbose phrasing, but it's not a major issue. | The conversation is almost entirely free of redundancy, with perhaps one minor, isolated instance of repetition. | The conversation is perfectly streamlined. Every utterance is purposeful, adds new value, and contains no unnecessary repetition. |
| Conciseness | Provides massive, unusable walls of text that completely disregard the need for brevity. | Nearly all responses are far too long and rambling, burying key information and making them difficult to use. | Many responses (more than half) are too long and contain unnecessary information, requiring effort to parse. | Responses are generally concise, but some (less than half) could have been shorter without losing meaning. | All responses are consistently concise and well-judged in length, with almost no wasted words. | Every response is perfectly tailored in length, delivering information as compactly as possible without sacrificing clarity or a natural tone. |
| Efficiency | The conversation makes no progress, goes in circles, and completely fails to address the user's goal. | The conversation is highly inefficient, taking an extremely high number of turns with very little progress. | The conversation takes significantly more turns than necessary or ends prematurely, failing to resolve the query properly. | The goal is reached, but the conversation takes a few more turns than ideal due to minor misunderstandings or meandering. | The goal is met in a very reasonable number of turns. The interaction feels quick and is very close to the optimal path. | The user's goal is achieved in the absolute minimum number of conversational turns possible, with the assistant anticipating needs to prevent back-and-forth. |
| Functional Correctness | Code is non-functional, won't compile/run, or produces catastrophic errors. | Code has fundamental logical errors and fails on the most basic test cases. | Code works for the primary "happy path" but fails on most other valid inputs or common edge cases. | Code is mostly functional but has noticeable bugs or fails on some important edge cases. | The code is fully functional, produces the correct output, and passes all common and edge-case tests. | The code is flawlessly functional. The logic is not only correct but also elegant, simple, and demonstrably robust. |
| Efficiency & Optimization | Code is extremely inefficient; it hangs, times out, or consumes excessive resources on trivial inputs. | The code uses a grossly inefficient (e.g., brute-force) algorithm where a far superior standard alternative exists. | The code works, but its performance is sub-optimal. It's noticeably slow or memory-intensive for realistic inputs. | The implementation has acceptable performance for typical use cases but isn't highly optimized. | The code uses appropriate algorithms and data structures, resulting in strong performance and resource management. | The code is optimally efficient, demonstrating a deep understanding of performance with best-in-class speed and resource usage. |
| Readability & Maintainability | Code is obfuscated and impossible for a human to understand (e.g., terrible naming, no structure). | Code is extremely difficult to follow due to cryptic variable names, lack of comments, and confusing logic. | Code is hard to read and requires significant effort to understand. It lacks sufficient comments or consistent formatting. | The code is readable, but a developer needs to study it to understand its logic and flow. | The code is clean and well-structured with good variable names and helpful comments. It's easy for another developer to follow. | The code is exceptionally clear and self-documenting. Its structure and logic are so intuitive that it's instantly understandable. |
| Security & Robustness | Code contains critical, obvious security vulnerabilities (e.g., SQL injection) and makes no attempt to handle errors. | The code is highly vulnerable and brittle. It has clear security flaws and crashes on any invalid or unexpected input. | The code is brittle, lacking basic input validation and error handling. It may have subtle security issues. | The code handles some common errors but is not robust against a wider range of unexpected inputs and lacks awareness of security best practices. | The code is robust and secure. It includes proper error handling, validates inputs, and follows standard security practices to prevent common vulnerabilities. | The code is exceptionally robust, gracefully handling a wide range of edge cases and potential failure modes, while adhering to the highest security standards. |
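
A score sheet filled in against these criteria could be checked mechanically. The following is a minimal sketch under several assumptions that the rubric itself does not state: the six levels are numbered 1 (worst) to 6 (best), the criteria split into a conversation group and a code group, and a plain mean is used as a summary. The names `validate_scores` and `summarize` are hypothetical helpers introduced only for illustration.

```python
from statistics import mean

# Criterion names are copied from the rubric; the grouping is an assumption.
CONVERSATION_CRITERIA = [
    "Overall Satisfaction", "Naturalness", "Grounding Sources",
    "Redundancy", "Conciseness", "Efficiency",
]
CODE_CRITERIA = [
    "Functional Correctness", "Efficiency & Optimization",
    "Readability & Maintainability", "Security & Robustness",
]

# Assumed numbering: the rubric lists six levels per criterion,
# but does not say whether they are labelled 1-6, 0-5, or otherwise.
MIN_SCORE, MAX_SCORE = 1, 6


def validate_scores(scores, criteria):
    """Raise ValueError if a criterion is missing, unknown, or out of range."""
    missing = [c for c in criteria if c not in scores]
    unknown = [c for c in scores if c not in criteria]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    for name, value in scores.items():
        if type(value) is not int or not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(
                f"{name}: expected an integer in "
                f"{MIN_SCORE}..{MAX_SCORE}, got {value!r}"
            )


def summarize(scores, criteria):
    """Return the mean score and the lowest-scoring criterion (hypothetical summary)."""
    validate_scores(scores, criteria)
    return {
        "mean": mean(scores.values()),
        "lowest": min(scores, key=scores.get),
    }


if __name__ == "__main__":
    sheet = {name: 5 for name in CODE_CRITERIA}
    sheet["Security & Robustness"] = 2
    print(summarize(sheet, CODE_CRITERIA))
```

Reporting the lowest-scoring criterion alongside the mean keeps a single weak dimension (for example, a security flaw) from being hidden by strong scores elsewhere; whether any aggregate is appropriate at all is a choice left to whoever applies the rubric.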