Cleanlab makes AI agents reliable. Detect issues, fix root causes, and apply guardrails for safe, accurate performance.
cleanlab.ai · San Francisco · Joined October 2021
We're thrilled to join forces with @joinHandshake, where we'll be able to scale our team's pioneering work to inflect change with the world's leading AI labs. Hear more from our CEO and Co-founder, @cgnorthcutt, to learn about our next chapter.
News: @joinHandshake acquires @CleanlabAI!
This "ten-year-old job marketplace" has quietly become a top human data lab for AI--building an AI research org, acquiring top AI talent, and advancing Cleanlab tech and research to lead data foundations for frontier AI.
Achieving 20%+ improvement in structured extraction tasks using @DSPyOSS and GEPA
Building on a blog post from @CleanlabAI I wanted to see how quickly I could optimize a structured extraction task with DSPy + GEPA
In about 3 hours (mostly me getting in the way of Claude Code):
- +22 percentage points over vanilla structured outputs
- Ran 4 experiments in total
- ~$3 total cost
I tested 5 approaches incrementally:
• OpenAI Baseline: 32.1% exact match
• DSPy Baseline: 39.8%
• DSPy + BAML: 42.7%
• DSPy + GEPA: 53.8%
• DSPy + BAML + GEPA: 54.4%
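The scores above can be compared with a few lines of arithmetic. The sketch below is only an illustration: the `scores` dictionary copies the exact-match figures quoted in the thread, and the helper name `gains` is made up for this example. It separates the percentage-point gain (the "+22" figure) from the relative gain over the baseline, which are easy to confuse.

```python
# Exact-match scores quoted in the thread above (percent).
scores = {
    "OpenAI Baseline": 32.1,
    "DSPy Baseline": 39.8,
    "DSPy + BAML": 42.7,
    "DSPy + GEPA": 53.8,
    "DSPy + BAML + GEPA": 54.4,
}

baseline = scores["OpenAI Baseline"]


def gains(score, base):
    """Return (percentage-point gain, relative gain in percent) over base.

    `gains` is a hypothetical helper for this illustration only.
    """
    points = score - base
    relative = 100.0 * points / base
    return round(points, 1), round(relative, 1)


for name, score in scores.items():
    points, relative = gains(score, baseline)
    print(f"{name}: +{points} points, +{relative}% relative")
```

Under this reading, the best configuration is about 22 percentage points above the OpenAI baseline, which is a considerably larger jump when expressed relative to the baseline score.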
For anyone who cares about structured output benchmarks as much as I do, here's an early Christmas present 🎁 ! Pretty well thought out from the folks @CleanlabAI.
Seems like I'll def be using it to compare LLMs using BAML and DSPy!
github.com/cleanlab/struc…
Where Did $37B in Enterprise AI Spending Go?
$19B → Applications (51%)
$18B → Infrastructure (49%)
Our report includes a snapshot of the Enterprise AI ecosystem, mapped across departmental, vertical AI, and infrastructure.
Although coding captures more than half of departmental AI spend at $4 billion, the technology is gaining traction across many enterprise departments: IT operations tools ($700M), marketing platforms ($660M), customer success tools ($630M).
AI-native startups are rapidly emerging across every job function, capturing a meaningful share of the $7.3B spent on departmental AI in 2025. mnlo.vc/enterprise-ai-…
Which LLM is better for Structured Outputs / Data Extraction: Gemini-3-Pro or GPT-5?
We ran popular benchmarks, but found their "ground truth" is full of errors.
To enable reliable benchmarking, we've open-sourced 4 new Structured Outputs benchmarks with *verified* ground-truth
@karanjagtiani04 One example could be:
if there is an ambiguous context shift and the agent's original LLM message wrongly assumes something about the context, this can be auto-detected via a low trust score and the auto-revised message can be a follow-up question to clarify instead of assuming
We discovered how to cut the failure rate of any AI agent on Tau²-Bench, the #1 benchmark for customer service AI.
Agents often fail in multi-turn, tool-use tasks due to a single bad LLM output (reasoning slip, hallucinated fact, misunderstanding, wrong tool call, etc). We introduce an automated LLM trust scoring + message revision pipeline that mitigates this brittleness and keeps agents on the rails.
Benchmarks show that our approach remains effective across all Tau²-Bench domains (Telecom, Retail, Airline) and different LLMs -- cutting agent failure rates by up to 50%.
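The general shape of a "score, then revise if untrusted" step might look like the sketch below. This is not Cleanlab's API or the pipeline from the post: `guard_message`, `GuardedMessage`, `toy_score`, `toy_revise`, and the 0.7 threshold are all hypothetical stand-ins. In a real agent the scorer and reviser would be model calls; here they are toy functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardedMessage:
    text: str           # message that will actually be sent
    trust_score: float  # score of the message that is sent
    revised: bool       # True if the draft was replaced


def guard_message(
    draft: str,
    score_fn: Callable[[str], float],
    revise_fn: Callable[[str], str],
    threshold: float = 0.7,  # arbitrary cutoff chosen for illustration
) -> GuardedMessage:
    """Pass the draft through if it scores well, otherwise revise it once."""
    score = score_fn(draft)
    if score >= threshold:
        return GuardedMessage(draft, score, False)
    revision = revise_fn(draft)
    return GuardedMessage(revision, score_fn(revision), True)


def toy_score(text: str) -> float:
    # Toy stand-in for a trust scorer: penalize unverified assumptions.
    return 0.2 if "i assume" in text.lower() else 0.9


def toy_revise(text: str) -> str:
    # Toy stand-in for a reviser: ask a clarifying question instead.
    return "Could you confirm which booking you mean before I make changes?"


result = guard_message(
    "I assume you mean your latest booking, so I cancelled it.",
    toy_score,
    toy_revise,
)
print(result)
```

The design choice illustrated here is that a low-scoring message is never sent as-is: it is swapped for a revision, such as a follow-up question, before the agent continues the conversation.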
🚀 New from Cleanlab: Expert Guidance
AI agents running multi-step workflows can fail in tiny, trust-breaking ways.
Expert Guidance lets teams fix these behaviors with simple human feedback, instantly.
✈️In one airline workflow: 76% → 90% after only 13 guidance entries.
The “Year of the Agent” just got pushed back.
Out of 1,837 enterprise leaders, most are struggling with stack churn + reliability.
⚙️ 70% rebuild every 90 days
😬 Less than 35% are happy with their infrastructure
🤖 Most “agents” still aren’t really acting yet
🚧 Even the best AI models still hallucinate.
OpenAI’s recent paper on Why Language Models Hallucinate shows why this problem persists, especially in domain-specific settings.
For teams implementing guardrails, we put together a short walkthrough: youtu.be/i_6fjKgboFg?si…
AI pilots prove intelligence, but AI in production demands reliability.
The best teams separate their stack early: 🧠 Core = how AI thinks 🛡️ Reliability = how it stays safe
That’s how prototypes become products.
👉cleanlab.ai/blog/emerging-…
AI agents won’t replace humans. Their real power comes when humans guide them.
We just added Expert Answers to our platform:
👩🏫 SMEs fix AI mistakes right away
🔁 Fixes are reused across future queries
📈 Accuracy improves, “IDK” drops 10x
Full blog: cleanlab.ai/blog/expert-an…
Launching an AI agent without human oversight is basically launching a rocket without mission control 🚀
Cool for a few minutes… until something breaks.
🕹️ It’s not the rocket that makes the mission succeed. It’s the control center.
cleanlab.ai/blog/managing-…
📍 Live at @AIconference 2025 in San Francisco!
Tomorrow, @cgnorthcutt is sharing practical strategies for building trustworthy customer-facing AI systems, and our team is around all day to connect.
👋 Stop by and geek out with us!
Most AI pilots in financial services never make it to production.
The reason is simple: they can’t be trusted.
Today, Cleanlab + @CorridorAI are fixing that by combining governance with real-time remediation so AI is finally safe to deploy at scale.
🔗 businesswire.com/news/home/2025…