
Build an Evaluation Dataset in under 20 Minutes

In my prior post, “Don’t Buy AI on ‘Vibes’”, I talked about the importance of building evaluation datasets when buying or building AI tools. In this post, I demonstrate how to build an evaluation dataset (often called an “eval”) in under 20 minutes.

Task? Expedite legal and compliance review of third-party AI tools.

Why? Many legal teams are inundated with requests to approve third-party AI tools or features. Approval is a time-consuming, multi-step process. It’s tempting to throw the entire task at an AI tool and hope for a perfect answer, but that “leap of faith” often leads to unreliable results.

In this week’s Fairly AI blog, I demonstrate an iterative approach to building an eval that assesses AI tool performance. This simple approach helps you create or buy a more accurate and trustworthy AI agent in the long run.

Here’s a 5-step process you can follow:

  1. Scope the Task: Instead of tackling the entire process, focus on one critical step. For example, my eval answers just 12 critical questions based on a vendor’s online terms.

  2. Draft a Starter Prompt: Define the AI’s role, task, and required input/output. Draft it collaboratively with an AI using its “canvas” feature. See “PROMPT” below.

  3. Generate Answers with Multiple AIs: Run your prompt through at least three AI models. I use an OpenAI custom GPT, Claude Opus 4.1, and Google AI Studio. This gives me a diverse set of answers to compare. You’ll likely notice many inconsistencies, which is why the next step is crucial.

  4. Create Your “Golden Set” Eval: Pick one AI’s output as your baseline. Use an AI to spot discrepancies among the answers, but always perform a final human review to verify everything. This corrected version becomes your reliable evaluation dataset. See “SAMPLE EVAL” below.

  5. Refine Your Prompt: Fold the corrections from your human review back into the prompt so that it becomes more detailed and accurate.

This iterative cycle of testing and refining is the fastest way to build an AI agent that you can rely on.
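To make steps 3 and 4 more concrete, here is a minimal sketch of how you might flag the questions where the models disagree, so that human review starts with those rows. The model labels, the `find_discrepancies` helper, and the sample answers are all hypothetical; they are not real vendor results, and in the video I do this comparison with an AI rather than with code.

```python
def find_discrepancies(answers_by_model):
    """Return the questions on which the models do not all agree.

    answers_by_model maps a model label to {question_number: "Yes" or "No"}.
    The result maps each disputed question number to every model's answer.
    """
    # Collect every question number that any model answered.
    questions = sorted(set().union(*(a.keys() for a in answers_by_model.values())))
    flagged = {}
    for q in questions:
        # A model that skipped a question is recorded as "Missing".
        votes = {model: answers.get(q, "Missing")
                 for model, answers in answers_by_model.items()}
        if len(set(votes.values())) > 1:
            flagged[q] = votes
    return flagged


# Hypothetical answers to the first three questions only.
answers = {
    "model_a": {1: "No", 2: "No", 3: "No"},
    "model_b": {1: "No", 2: "Yes", 3: "No"},
    "model_c": {1: "No", 2: "No", 3: "No"},
}

for q, votes in find_discrepancies(answers).items():
    print(f"Question {q} needs human review: {votes}")
```

Agreement among the models does not make an answer correct, so the final human review in step 4 still covers every row, not just the flagged ones.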

Watch the full video to see this process in action!

****

PROMPT

Role and Task

You are a privacy, product and compliance expert. Your task is to search and review the online Terms of Service and related policies for a specified AI product or feature and answer the most important questions that help evaluate compliance, privacy, security and IP risks of the AI product or feature.

Input

AI product or feature name (e.g., “OpenAI Business Terms”, “Gemini free users”, “Gamma.AI” or “Figma Make”)

Output

Output the result in a table format with columns: Question | Yes/No | Excerpt from Vendor Online Documents | Source URL | Pass/Fail

Use the following 12 questions:

  1. Will user inputs or customer data be used to train AI models? [No is Pass, Yes is Fail]

  2. Will customer data be used to improve products and services (note that the answer should be no if customer data is used to provide products and services only)? [No is Pass, Yes is Fail]

  3. Will the vendor claim ownership or rights over AI-generated outputs? [No is Pass, Yes is Fail]

  4. Does the vendor provide any security commitments? [Yes is Pass, No is Fail]

  5. Does the vendor have ISO 27001 or SOC certification? [Yes is Pass, No is Fail]

  6. Does the vendor permit export or sharing of customer data with third parties? [N/A for Pass/Fail]

  7. Does the vendor provide a data processing addendum or enterprise terms for business or enterprise customers? [Yes is Pass, No is Fail]

  8. Does the vendor agree to a confidentiality clause in this agreement? [Yes is Pass, No is Fail]

  9. Does the vendor allow use of the AI product for commercial purposes? [Yes is Pass, No is Fail]

  10. Does the tool allow automated AI decision-making or emotion recognition? [No is Pass, Yes is Fail]

  11. Does the vendor specify how user data is stored or retained? [Yes is Pass, No is Fail]

  12. Does the vendor require user consent to adverse changes in terms? [Yes is Pass, No is Fail]
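If you want to double-check the Pass/Fail column an AI produces, the bracketed rules above can be written down as a small lookup. This is only a sketch: the `PASS_ANSWER` table mirrors the brackets in the 12 questions, while the `grade` helper and its name are my own assumptions.

```python
# The answer that counts as a Pass for each question, copied from the
# bracketed rules in the prompt. None marks Question 6, which is
# informational and has no Pass/Fail outcome.
PASS_ANSWER = {
    1: "No", 2: "No", 3: "No", 4: "Yes", 5: "Yes", 6: None,
    7: "Yes", 8: "Yes", 9: "Yes", 10: "No", 11: "Yes", 12: "Yes",
}


def grade(question, answer):
    """Turn a Yes/No answer into Pass, Fail, or N/A for one question."""
    expected = PASS_ANSWER[question]
    if expected is None:
        return "N/A"
    if answer not in ("Yes", "No"):
        raise ValueError(f"Question {question}: expected Yes or No, got {answer!r}")
    return "Pass" if answer == expected else "Fail"


print(grade(1, "No"))    # training on customer data: No is a Pass
print(grade(5, "No"))    # no ISO 27001 or SOC certification: Fail
print(grade(6, "Yes"))   # data sharing question: N/A
```

A mismatch between this lookup and the Pass/Fail cell in a model’s table is a cheap signal that the model misread the rule for that question.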

Additional Instructions Based on Human Validation

The answer to Question 4 should include any security commitments made in the agreement itself, such as maintaining reasonable security measures.

The answer to Question 6 should include a list of subprocessors.

The answer to Question 12 should be No/Fail if the customer’s termination right in the event of adverse term changes does not give the customer a remedy or refund.

SAMPLE EVAL

Check out this Google Sheet for the sample eval.
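Once you have a golden set, you can use it to score any new model or prompt version. The sketch below assumes the sheet has been exported to CSV with `question` and `answer` columns; those column names, the inline sample rows, and the `load_answers` and `score` helpers are hypothetical and are not taken from the actual sheet.

```python
import csv
import io

# Stand-in for an exported golden set; the rows are illustrative only.
GOLDEN_CSV = """question,answer
1,No
2,No
3,No
4,Yes
"""


def load_answers(csv_text):
    """Read {question_number: answer} from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["question"]): row["answer"].strip() for row in reader}


def score(golden, candidate):
    """Return (share of golden answers matched, list of mismatched questions)."""
    if not golden:
        raise ValueError("The golden set is empty")
    misses = [q for q in sorted(golden) if candidate.get(q) != golden[q]]
    accuracy = (len(golden) - len(misses)) / len(golden)
    return accuracy, misses


golden = load_answers(GOLDEN_CSV)
candidate = {1: "No", 2: "Yes", 3: "No", 4: "Yes"}  # hypothetical model output
accuracy, misses = score(golden, candidate)
print(f"Agreement with golden set: {accuracy:.0%}; review questions {misses}")
```

Tracking this number across prompt revisions shows whether each refinement actually moved the model closer to the human-verified answers.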
