A closed lab can't audit itself.

We build open-source benchmarks, datasets, and infrastructure for evaluating what autonomous AI systems can actually do.

Whether agents can deceive, collude, self-improve, or game their own benchmarks is too important a question to answer behind closed doors. This work needs contributors, not gatekeepers.

We build in the open so that anyone can use what we ship, find what we missed, and push the work further than we could alone.

Benchmarks

Can AI agents do real ML research? Can they deceive, collude, or game evaluations? We build the benchmarks that answer these questions with evidence, not speculation.

Datasets

1.1M enriched papers. 129K research repositories. 778K code functions. The raw material for studying how AI systems interact with real scientific work.

Infrastructure

Runtimes for structured agent workloads. Orchestration topologies for recursive improvement loops. The scaffolding to run experiments at scale.

Epsilon: Infrastructure for Structured Agent Workloads

An open-source runtime for structured agent workloads with seven orchestration topologies, ZeroMQ-backed task brokering, deterministic...

agents / orchestration / infrastructure
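
To make the task-brokering piece concrete, here is a minimal sketch of the ZeroMQ PUSH/PULL pattern that brokering like this typically builds on. It illustrates the general pattern only, not Epsilon's actual API; the port, the payload shape, and the single-process layout are assumptions made for the example.

    import zmq

    # A broker binds a PUSH socket and fans tasks out; each worker connects
    # a PULL socket and receives queued tasks round-robin. Both ends live in
    # one process here purely so the sketch runs standalone.
    context = zmq.Context()

    broker = context.socket(zmq.PUSH)
    broker.bind("tcp://127.0.0.1:5557")    # port chosen arbitrarily

    worker = context.socket(zmq.PULL)      # would normally be a separate process
    worker.connect("tcp://127.0.0.1:5557")

    # Each message is a self-contained, JSON-serializable task description.
    for task_id in range(3):
        broker.send_json({"task_id": task_id, "payload": "run-eval"})

    for _ in range(3):
        task = worker.recv_json()
        print(f"worker got task {task['task_id']}")

    context.destroy()

In a real deployment the broker and workers run as separate processes, and PUSH distributes queued tasks across however many workers are connected, which is what makes the pattern a natural fit for fanning out agent workloads.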

S2ORC CS Enriched: 1.1 Million Computer Science Papers with Structured Metadata

A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured extraction...

datasets / scientific-papers / machine-learning

Study Failure: AI-Driven GPU Kernel Optimization

A retrospective on 131,520 GPU kernel optimization attempts that were invalidated when agents were found to be substituting high-level...

gpu / optimization / machine-learning

Learning to Rank Architectures: A Small Model That Guides Neural Architecture Search

A tiny recursive reasoning model trained to rank architectures by predicted performance achieves 8-10x sample efficiency over random...

nas / architecture-search / machine-learning
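
As a sketch of the learning-to-rank idea behind this result (a toy illustration, not the recursive reasoning model from the post; the feature encoding, network shape, and synthetic training pairs are all invented for the example):

    import torch
    import torch.nn as nn

    # Toy ranker: map a fixed-length encoding of an architecture to a scalar
    # score, trained so that better-performing architectures score higher.
    class ArchRanker(nn.Module):
        def __init__(self, feat_dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1)
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    ranker = ArchRanker()
    opt = torch.optim.Adam(ranker.parameters(), lr=1e-3)
    loss_fn = nn.MarginRankingLoss(margin=0.1)

    # Synthetic training pairs: label 1 means architecture a beat
    # architecture b in a past evaluation, -1 means the reverse.
    feats_a = torch.randn(64, 16)
    feats_b = torch.randn(64, 16)
    labels = (torch.rand(64) > 0.5).float() * 2 - 1

    for _ in range(200):
        opt.zero_grad()
        loss_fn(ranker(feats_a), ranker(feats_b), labels).backward()
        opt.step()

    # At search time, score a large candidate pool and spend the evaluation
    # budget only on the top-ranked architectures instead of random samples.
    with torch.no_grad():
        top_k = torch.topk(ranker(torch.randn(1000, 16)), k=10).indices

A pairwise loss only asks the model to order candidates correctly, not to predict absolute accuracy, which is an easier target when real evaluations are scarce; that gap is where sample-efficiency gains over random search come from.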

ARIA Benchmark: How Much Machine Learning Do AI Models Actually Know?

A suite of five closed-book benchmarks probing the ML knowledge that frontier language models have internalized during training.

agent-evaluation / benchmarks / python

ArXiv Research Code Dataset: 129K Research Repositories

A collection of 4.7 million code files from 129K research repositories linked to arXiv computer science papers.

datasets / research-code / python

ArXivDLInstruct: 778K Research Code Functions for Instruction Tuning

A dataset of 778,152 functions extracted from arXiv-linked research code, each paired with instruction prompts, for training...

datasets / research-code / python

DeltaMLBench: Can AI Agents Improve on Published ML Research?

A benchmark of 50 tasks drawn from real Papers With Code repositories where agents must achieve measurable improvement over published baselines.

agent-evaluation / benchmarks / python

Teaching Models to Bluff: Measuring Deception, Belief, and Coordination in LLM Secret Hitler

An implementation of five LLM agents playing the social-deduction game Secret Hitler, instrumented with structured logging to quantify deception, belief...

ai-research / agi / recursive-improvement
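
To make "structured logging" concrete, a per-statement record for this kind of study might look like the following. Every field name here is hypothetical; the project defines its own schema.

    from dataclasses import dataclass

    # Hypothetical per-statement record: deception can be quantified as the
    # gap between what an agent privately believes and what it publicly claims.
    @dataclass
    class StatementLog:
        turn: int
        speaker: str
        public_claim: str                  # what the agent said to the table
        private_belief: dict[str, float]   # speaker's role probabilities per player
        true_role: str                     # speaker's actual hidden role

With records like this, a deception score falls out of comparing public_claim against private_belief, and belief accuracy out of comparing private_belief against the hidden roles.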

ML Research Benchmark: Can AI Agents Do Real ML Research?

A benchmark suite of seven competition-level ML challenges for evaluating whether AI agents can perform genuine research iteration beyond...

agent-evaluation / benchmarks / python

Algorithmic Research Group builds open-source tools and infrastructure for AI security research. Benchmarks for evaluating autonomous agents. Datasets for studying how models fail. Runtimes for executing agent workloads at scale.

We publish everything we build. The field moves faster when researchers can build on each other's work instead of rebuilding the same tooling behind closed doors.