10 Questions to Ask Before Hiring an AI Development Company

Summary

Before hiring an AI development company, ask 10 key questions: (1) Can you show a live system you built? (2) How do you evaluate AI accuracy before launch? (3) What happens when the AI is wrong? (4) Who owns the models and data after the project? (5) How do you handle scope changes? (6) What are the ongoing costs after launch? (7) How do you handle my data and compliance? (8) What does your discovery process look like? (9) Who specifically will work on my project? (10) What does success look like at 30, 60, and 90 days? The most important question is #2 — evaluation methodology separates companies that ship reliable systems from those that ship demos.

Key Takeaways

  • The most important question is how they evaluate AI accuracy before shipping — this separates companies that ship reliable systems from those that ship polished demos.

  • Ask for a live system, not a prototype. Any vendor can demo a prototype. Production systems with real users are what matter.

  • Get the ongoing cost estimate in writing before signing. LLM API fees, hosting, and maintenance are real and recurring — many quotes exclude them.

  • IP ownership must be explicit in the contract. If you can't find a clause that says you own all code, models, and data, ask for one.

  • Vendors who skip discovery (sending a quote after a 30-minute call) are guessing at your requirements. That guess becomes your risk.

Every AI development company looks good in the pitch. The decks are polished. The demos are slick. The case studies are vague but confident.

Then the project starts. Six weeks in, you realize the "AI system" they showed you was a prototype running on curated test data. The actual system — the one handling your real data — behaves differently. And nobody defined what "works" meant before you signed the contract.

This is not rare. It is the most common failure pattern in AI vendor relationships. The vendor oversells. The buyer doesn't know what to ask. The project drifts. Costs spiral. The relationship ends badly.

This post gives you the questions that separate companies that ship reliable AI systems from those that ship impressive demos. These are the questions to ask before you sign anything — not after.

If you haven't yet decided whether to use a company or a freelancer, read AI development company vs. freelancer first. This post assumes you've made that call and are now evaluating which company to hire.

TL;DR

The 10 questions to ask before hiring an AI development company:

  1. Can you show me a working AI system you built for a similar business problem?
  2. How do you evaluate AI accuracy before shipping to real users?
  3. What happens when the AI is wrong? How do you handle edge cases and errors?
  4. Who owns the AI models and data after the project ends?
  5. What is your process for handling changes to scope mid-project?
  6. What are the ongoing costs after launch?
  7. How do you handle my data? What are the privacy and compliance implications?
  8. What does your discovery process look like before you write a line of code?
  9. Who specifically will work on my project, and how much of their time?
  10. What does success look like at 30, 60, and 90 days?

The single most important question is #2. If they can't describe a specific evaluation methodology, they're shipping guesses.

Why demos are not evidence

An AI demo is a controlled environment. The vendor picks the inputs. They know the edge cases to avoid. The system performs well because it has to — you're watching.

Production is different. Real users submit unexpected inputs. The data distribution shifts. Edge cases compound. The system that looked flawless in the demo starts producing wrong answers, confusing outputs, or silent failures no one notices for weeks.

The questions below are designed to surface this gap before you commit budget.


Question 1: Can you show me a working AI system you built for a similar business problem?

Why it matters

Any team can build a prototype. Prototypes run on cleaned, curated data in a controlled environment. Production systems run on messy real-world data, handle high concurrency, fail gracefully, and keep working at 2am when no one is watching.

Asking for a production reference — not just a demo — tells you immediately whether the company has shipped real work or only delivered impressive presentations.

"Similar business problem" is important. An AI document classifier built for a legal firm is meaningfully different from one built for an insurance carrier. The underlying logic, the compliance requirements, and the edge cases are all different. You want to see work in territory close to yours.

What a good answer looks like

They show you a live URL. They give you a video walkthrough of a real deployment with real users. Or they offer to put you in touch directly with a client from a similar industry. They describe the challenges they hit, how they solved them, and what the system does in production today.

What a red flag looks like

"We built something similar but we can't share it due to NDA." Said with no other evidence — no references, no case study metrics, no architecture overview. NDAs are real, but they don't prevent a company from sharing outcomes, describing the system design, or offering a client reference. If the only evidence they can offer is a claim they can't substantiate, that should stop you.


Question 2: How do you evaluate AI accuracy before shipping to real users?

Why it matters

This is the single most important question on this list. The answer tells you more about a company's engineering maturity than any portfolio piece.

AI systems can hallucinate. They can give confident wrong answers. They can work well on 80% of inputs and fail badly on the 20% that matter most to your business. Without a formal evaluation methodology — a structured process for measuring accuracy before launch — you have no idea what you're getting.

Companies that cannot describe a specific evaluation process are shipping by feel. They test a few prompts, it seems to work, and they call it ready. That is how AI projects end up in production with silent failure modes no one noticed until a customer complained.

What a good answer looks like

They describe a specific methodology: curated test sets representative of real production data, precision and recall measurements for classification tasks, human review panels for generative outputs, A/B testing against a baseline, confidence thresholds that trigger fallback handling when the model is uncertain. They can tell you what "good enough" means for your use case — a specific number, not a feeling.
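
To make that concrete, here is a minimal sketch of what a pre-launch evaluation could look like for a classification task. Everything in it is illustrative: the test-set format, the classify() function, and the target thresholds are assumptions, not any particular vendor's process.

```python
# Minimal evaluation sketch (illustrative only). The test-set format, the
# classify() function, and the target thresholds are hypothetical placeholders.

def evaluate(test_set, classify, positive_label):
    """Score a classifier against a labeled, held-out test set."""
    tp = fp = fn = 0
    for case in test_set:
        predicted = classify(case["text"])
        actual = case["label"]
        if predicted == positive_label and actual == positive_label:
            tp += 1          # true positive: flagged and correct
        elif predicted == positive_label:
            fp += 1          # false positive: flagged but wrong
        elif actual == positive_label:
            fn += 1          # false negative: missed a real case
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

# "Good enough" is a number agreed before launch, not a feeling.
TARGETS = {"precision": 0.95, "recall": 0.90}

def ready_to_ship(metrics, targets=TARGETS):
    return all(metrics[name] >= bar for name, bar in targets.items())
```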

What a red flag looks like

"We test it with a few prompts and make sure it works." Or "our models are very accurate." Accuracy claims without a measurement methodology are not evidence. They are sales language.


Question 3: What happens when the AI is wrong? How do you handle edge cases and errors?

Why it matters

Every AI system will fail on some inputs. The question is not whether it fails — it will. The question is how failure is handled. Does it fail gracefully, routing to a human or returning a clear "I don't know"? Or does it fail silently, producing a wrong answer with the same confidence as a right one?

The vendor's answer to this question reveals how they think about production-grade systems. Companies with real production experience design for failure from the start. Companies that have only built demos treat failure as someone else's problem.

What a good answer looks like

They describe human-in-the-loop handoffs for low-confidence outputs. They describe fallback responses that trigger when the model is uncertain. They have logging and alerting so failure cases are visible. They have a process for reviewing failure cases and retraining or adjusting the system. They treat failure as a first-class engineering concern, not an afterthought.
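
As a rough illustration of the pattern, the sketch below answers directly only when the model is confident and otherwise routes the case to a human queue. The model interface, the review queue, and the 0.7 threshold are all placeholders; the point is that uncertainty triggers an explicit, visible fallback rather than a silent wrong answer.

```python
import logging

logger = logging.getLogger("ai_router")

CONFIDENCE_THRESHOLD = 0.7   # placeholder; the real value comes out of evaluation
FALLBACK_MESSAGE = "I'm not certain about this one. A member of our team will follow up."

def handle_query(query, model, review_queue):
    """Answer directly when confident; route to a human when not."""
    answer, confidence = model.predict(query)   # hypothetical model API returning (text, score)

    if confidence >= CONFIDENCE_THRESHOLD:
        logger.info("answered directly (confidence=%.2f)", confidence)
        return answer

    # Low confidence: fail visibly, not silently.
    logger.warning("routed to human review (confidence=%.2f)", confidence)
    review_queue.append({"query": query, "draft": answer, "confidence": confidence})
    return FALLBACK_MESSAGE
```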

What a red flag looks like

"It's very accurate, so that won't be a problem." This answer tells you two things: they haven't thought about failure modes, and they're going to be surprised by them in production.


Question 4: Who owns the AI models and data after the project ends?

Why it matters

IP ownership determines your options after the project is done. If the vendor retains ownership of the models — or the training data, or the fine-tuned weights — you cannot switch vendors without losing your investment. You cannot modify the system without their involvement. You're locked in.

This is not a hypothetical risk. It is a common structure in AI vendor contracts. "Licensing" language is often used to retain vendor control over core assets. If you don't read for it explicitly, you may not notice until you try to leave.

What a good answer looks like

You own everything. All code, all models, all training data, all fine-tuned weights. The contract states this explicitly. Transfer of IP is clean and unconditional. The vendor may retain the right to use learnings for other work (common in consulting), but you own the specific system built for you.

What a red flag looks like

Vague language about "licensing" the model. Answers that say "you have full rights to use the system" without specifying ownership. No clear answer at all. If the contract doesn't contain an explicit IP transfer clause, ask for one before signing.


Question 5: What is your process for handling changes to scope mid-project?

Why it matters

AI projects always surface new requirements once the system starts running. A data pipeline you didn't know you needed. An edge case that requires a model adjustment. A stakeholder who wants a new feature added in week six. How the vendor handles this determines whether your costs stay predictable or spiral without warning.

"We're flexible" is not a process. It is a setup for unexpected invoices.

What a good answer looks like

They describe a formal change request process. Any new requirement is documented and scoped before work begins. You receive a written estimate of cost and timeline impact. You approve or decline. Work proceeds only after approval. Nothing gets done without your explicit sign-off on the scope change.

What a red flag looks like

"We're flexible — we'll handle it." This sounds reassuring. It isn't. Informal flexibility means scope grows informally, costs grow without warning, and you get an invoice at the end of the project that doesn't match your original agreement. "We'll handle it" translates to "we'll bill you for it at a premium without discussing it first."


Question 6: What are the ongoing costs after launch?

Why it matters

The build cost is what you sign for. The operational cost is what you pay for years afterward. LLM API calls, cloud hosting, monitoring infrastructure, maintenance engineering, model retraining — these are real, recurring costs that many proposals leave out entirely.

A system processing 50,000 queries per month through a commercial LLM API can cost $3,000–$8,000 per month in API fees alone, before you add hosting and maintenance. If the vendor hasn't estimated this for you, you don't have a complete picture of what you're committing to.
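
A back-of-envelope model makes that figure easy to sanity-check. The token counts and per-token prices below are assumptions chosen only for illustration; swap in your own expected usage and your provider's current pricing.

```python
# Back-of-envelope estimate (illustrative only). Token counts and per-token
# prices are assumptions; substitute your own usage profile and provider pricing.

QUERIES_PER_MONTH = 50_000
INPUT_TOKENS_PER_QUERY = 4_000    # prompt plus retrieved context (assumed)
OUTPUT_TOKENS_PER_QUERY = 1_000   # generated answer (assumed)

PRICE_PER_1K_INPUT = 0.01         # USD, hypothetical commercial API rate
PRICE_PER_1K_OUTPUT = 0.03        # USD, hypothetical commercial API rate

cost_per_query = (
    INPUT_TOKENS_PER_QUERY / 1_000 * PRICE_PER_1K_INPUT
    + OUTPUT_TOKENS_PER_QUERY / 1_000 * PRICE_PER_1K_OUTPUT
)
monthly_api_cost = cost_per_query * QUERIES_PER_MONTH

print(f"~${cost_per_query:.2f} per query, ~${monthly_api_cost:,.0f}/month in API fees alone")
# With these assumptions: ~$0.07 per query, ~$3,500/month, before hosting and maintenance.
```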

What a good answer looks like

They give you an itemized estimate of post-launch operational costs. LLM API fees estimated based on your expected usage volume. Cloud infrastructure costs for hosting, storage, and compute. A maintenance contract or estimate for ongoing engineering support. They're transparent that these costs exist and help you model them before you sign.

What a red flag looks like

The quote only covers the build. No mention of what happens after launch. You have to ask specifically, and the answer is vague. This either means they haven't thought about it or they're keeping it out of the conversation because it would make the engagement look more expensive.


Question 7: How do you handle my data? What are the privacy and compliance implications?

Why it matters

Your customer records, financial data, or health information may pass through the AI system. That data will be sent somewhere — to an LLM API, to a cloud provider, to a model training pipeline. Where it goes, how long it's retained, and who can access it are not details. They are compliance requirements.

If your business is subject to HIPAA, GDPR, SOC 2, or any other regulatory framework, your vendor's data handling practices directly affect your compliance posture. "We use OpenAI's API" is not an answer to a compliance question.

What a good answer looks like

They explain the full data flow: where your data is processed, where it's stored, how long it's retained, and who has access. They identify which LLM providers they use and confirm those providers do not retain or train on your data (many enterprise API tiers offer this guarantee). For regulated industries, they describe how they handle HIPAA business associate agreements, GDPR data processing agreements, or SOC 2 requirements. They have done this before and can show evidence.

What a red flag looks like

"We use OpenAI's API" with no further discussion of data handling. Or "we follow best practices" with no specifics. If the vendor can't describe exactly where your data goes and how it's protected, they haven't thought it through — and your compliance team will notice.


Question 8: What does your discovery process look like before you write a line of code?

Why it matters

The biggest predictor of AI project success is whether the vendor understands your problem before building. This sounds obvious. It isn't standard.

Discovery costs time and money. Vendors who skip it save you a small amount upfront and charge you a large amount later through scope changes built on wrong assumptions. The first three weeks of a poorly scoped project determine whether the next nine weeks are productive or corrective.

Vendors who send you a detailed project quote after a 30-minute introductory call have not done discovery. They have guessed at your requirements and wrapped a price around the guess. That guess becomes your risk.

What a good answer looks like

They describe a formal discovery phase of 2–4 weeks. The phase covers requirements definition with key stakeholders, a data audit (what data exists, what format it's in, what quality it is), integration mapping (what systems the AI connects to), and success criteria definition (what does "working" mean, measured how). This phase is billed separately from the build, typically $5,000–$20,000. The output is a scoped plan both parties agree on before build work begins.

What a red flag looks like

They send you a full project quote after a 30-minute call. Or they describe a brief "scoping session" that happens in the first week of the build. Discovery that happens after the contract is signed is not discovery — it's finding out what you actually need after you've already committed to a budget that doesn't account for it.


Question 9: Who specifically will work on my project, and how much of their time?

Why it matters

Bait-and-switch is real in the vendor world. The senior engineers behind the case studies you admired are the ones who sell the engagement. Then the work is handed off to a junior team. You find out in week three, when you ask a technical question and the person on your call can't answer it.

You need to know exactly who will work on your project — their names, their experience, their allocation percentage — before you sign. "Our team" is not a team. It is a placeholder that could mean anything.

What a good answer looks like

They name the lead engineer and the technical architect. They describe the team composition — who is responsible for what. They specify what "dedicated" means in their model (full-time, part-time, percentage of time). They're willing to let you interview or at least speak with the people who will actually do the work. If they can't commit to specific names, they at least describe the seniority level and experience requirements for the team.

What a red flag looks like

"Our team" with no names. "We staff projects based on availability at kickoff." An unwillingness to let you talk to the actual engineers before signing. These are signals that the team who delivers your project may bear no resemblance to the people who sold it.


Question 10: What does success look like at 30, 60, and 90 days?

Why it matters

Vague success criteria guarantee disappointment. If neither party has defined what "done" means at specific points in the project, you will spend weeks in disagreement about whether things are on track. Vendors without milestone definitions benefit from ambiguity — it's harder to hold them accountable.

Concrete milestones also expose whether the vendor understands your project. A company that can define what the system will do at 30, 60, and 90 days has thought through the build sequence. A company that can't has not.

What a good answer looks like

They define specific, measurable milestones. At 30 days: the agent handles one specific use case at a defined accuracy level, with a test set and evaluation results to prove it. At 60 days: two or three integrations live in a staging environment, edge case documentation complete, client stakeholder review conducted. At 90 days: production deployment with real user traffic, monitoring active, a post-launch support plan in place. Milestones are agreed in writing before work starts.

What a red flag looks like

"We'll have something to show you in a few months." Or milestones defined only as deliverables without success criteria ("we'll deliver the model" — but not defining what it needs to do or how well it needs to do it). Vague milestones are designed to be unchallenged. They protect the vendor, not you.


The single most important question

If you can only ask one question, ask question 2: How do you evaluate AI accuracy before shipping to real users?

This question is the sharpest signal of engineering maturity. It separates companies that build reliable systems from those that ship polished demos. It reveals whether the team thinks in terms of measurable outcomes or gut feel.

A company that can describe a specific evaluation methodology — test sets, precision and recall, human review panels, confidence thresholds, A/B testing — has built production AI before. A company that answers with "we make sure it works" has not.

Everything else on this list matters. But this question is the fastest way to tell whether you're talking to a company that ships AI systems or a company that ships AI presentations.


How to run a 60-minute vendor evaluation call

You have three vendors shortlisted. You have 60 minutes with each. Here's how to use the time.

Minutes 0–10: Context setting. Give them your use case. Be specific about the data, the users, the business outcome you're trying to drive. A company that's done this before will start asking sharp clarifying questions. A company that hasn't will nod and take notes.

Minutes 10–30: Ask questions 1, 2, and 3. These are the highest-signal questions. Show me a live system. How do you measure accuracy? What happens when it's wrong? Listen carefully. Take notes. This section tells you more than any proposal document.

Minutes 30–45: Ask questions 4, 6, 7, and 9. IP ownership, ongoing costs, data handling, team composition. These are the commercial and operational questions that trip people up if they haven't thought them through. Vague answers here are data.

Minutes 45–55: Ask questions 8 and 10. Discovery process and milestone definition. Ask them to describe what week three of the project looks like. What would success at 30 days mean specifically? Can they answer this in concrete terms?

Minutes 55–60: Ask them what they need from you. A good vendor will have specific asks — access to data, key stakeholder availability, existing system documentation. A vendor who has nothing to ask hasn't thought about what the project requires.

After all three calls: compare your notes question by question, not overall impressions. Impressions are influenced by how well the salesperson presents. Answers to specific questions are harder to fake.


Vendor evaluation scorecard

Use this table after each vendor call. Score each answer 1–5. Total the scores. Compare across vendors.

Question                                             | Vendor A | Vendor B | Vendor C
1. Can you show a live production system?            |    /5    |    /5    |    /5
2. How do you evaluate AI accuracy before launch?    |    /5    |    /5    |    /5
3. How do you handle edge cases and errors?          |    /5    |    /5    |    /5
4. Who owns models and data after the project?       |    /5    |    /5    |    /5
5. How do you handle scope changes mid-project?      |    /5    |    /5    |    /5
6. What are the ongoing costs after launch?          |    /5    |    /5    |    /5
7. How do you handle data privacy and compliance?    |    /5    |    /5    |    /5
8. What does your discovery process look like?       |    /5    |    /5    |    /5
9. Who specifically will work on my project?         |    /5    |    /5    |    /5
10. What does success look like at 30/60/90 days?    |    /5    |    /5    |    /5
Total                                                |   /50    |   /50    |   /50

Scoring guide:

  • 5 — Specific, evidence-based answer with examples

  • 4 — Clear answer with some specifics

  • 3 — Reasonable answer but lacking detail

  • 2 — Vague or generic answer

  • 1 — Red flag answer or no answer

A vendor scoring below 35 is not ready for a production AI engagement. A vendor scoring 40+ on these questions has the maturity to deliver.


How RaftLabs answers these questions

We get asked these questions. Here's how we answer each one honestly.

Question 1 — Live production systems. We have shipped 100+ products across healthcare, fintech, commerce, and professional services. We put clients in contact with reference customers directly. We share live deployments where clients permit.

Question 2 — Evaluation methodology. We define evaluation criteria during discovery before building begins. For classification tasks, we build labeled test sets and measure precision, recall, and F1. For generative tasks, we define rubrics and use human review panels. We set confidence thresholds that route low-certainty outputs to human review. We don't ship until the system meets the agreed bar.

Question 3 — Error handling. We design for failure from the start. Every AI system we build includes human-in-the-loop routing for low-confidence outputs, fallback responses, and monitoring that surfaces failure cases in real time. We include a failure review cycle in the first 30 days post-launch.

Question 4 — IP ownership. You own everything. All code, models, training data, and fine-tuned weights transfer to you at project completion. Our contracts say so explicitly.

Question 5 — Scope changes. Any scope change is documented, estimated, and approved in writing before work begins. We use a formal change request process. Nothing gets done without your sign-off.

Question 6 — Ongoing costs. Every proposal we deliver includes an itemized estimate of post-launch operational costs — LLM API fees, infrastructure, and maintenance. We model this based on your expected usage volume before you sign.

Question 7 — Data handling. We map data flows in discovery. We use enterprise API tiers that do not retain or train on your data. For HIPAA clients, we sign BAAs. For GDPR clients, we define data processing agreements. We document every data touchpoint.

Question 8 — Discovery process. Every project starts with a 2–4 week discovery phase. We audit your data, map your integrations, define success criteria, and produce a scoped build plan. Discovery is billed separately. We don't start building until both parties have agreed on what "done" means.

Question 9 — Who works on your project. We name the lead engineer and architect before you sign. We define team composition and allocation. We make the team available to you for a conversation before kickoff. The people who sell the project are the people who stay on it after kickoff.

Question 10 — 30/60/90-day success. We define milestone criteria in writing during discovery. At 30 days, the system handles a specific use case at a specific accuracy level. At 60 days, integrations are live in staging. At 90 days, the system is in production with monitoring active.

If you're evaluating development companies and want to compare us directly, start a conversation here. Bring your use case. We'll give you specific answers, not a sales presentation.


Hiring the wrong AI development company is expensive. Not just in money — in time, trust, and the opportunity cost of a project that delivers six months late and half of what was promised. These 10 questions don't guarantee the right choice. But they make the wrong choice much harder to hide.

Ask them all. Take notes. Score the answers. Then make your decision based on evidence, not demos.

Frequently Asked Questions

What should I ask before hiring an AI development company?

Ask for proof of production systems (not prototypes), a clear evaluation methodology for AI accuracy, a named team composition (not just "our team"), explicit IP ownership terms in the contract, and a milestone-based success definition for 30/60/90 days. The most revealing question is how they test AI accuracy before launch — companies that cannot describe a specific methodology are shipping guesses.

What are the red flags when evaluating an AI development company?

Five red flags — (1) no live reference system, only prototypes or NDAs as excuses, (2) no evaluation methodology ("we test it and make sure it works"), (3) no mention of ongoing costs in the quote, (4) vague IP ownership language, (5) a quote delivered after a 30-minute call with no discovery process.

Who owns the AI models and data after the project ends?

This should be specified in the contract. You should own all code, models, and training data. If the contract says the vendor licenses you the model (rather than transferring ownership), you cannot switch vendors or modify the system without their involvement. Ask for this clause explicitly before signing.

How long should the discovery phase take, and what should it cover?

A proper discovery phase takes 2–4 weeks and covers requirements definition, data audit, integration mapping, and success criteria definition. It is typically billed separately from the build at $5,000–$20,000. Vendors who skip discovery and send you a full-project quote after a 30-minute call are making assumptions about your requirements — and you will pay for those assumptions in scope changes.

What should success look like at 30, 60, and 90 days?

At 30 days — one specific use case working at a defined accuracy level, with a test set and evaluation results. At 60 days — 2–3 integrations live in a staging environment with edge case documentation. At 90 days — production deployment with real user traffic, monitoring active, and a post-launch support plan in place. Milestones should be specific, measurable, and agreed in writing before work starts.