
ARIA Benchmarks

Matthew Kenney

Updated: Sep 3, 2024




We're excited to introduce the ARIA Benchmarks (AI Research Intelligence Assessment), a set of natural language closed-book benchmarks designed to evaluate the internal machine learning knowledge of state-of-the-art AI models.


Introduction to ARIA Benchmarks

The ARIA Benchmarks focus specifically on a model's ability to recall and apply machine learning concepts without access to external resources.


The ARIA Benchmark Tasks

Our benchmark suite consists of five distinct tasks, each designed to probe a different aspect of machine learning knowledge. The Dataset Modality QA task assesses a model's familiarity with common datasets used in machine learning research and their characteristics by asking it to predict the modality of a given dataset. In the Model Modality QA task, the AI is presented with a model name and must determine its primary modality or application area, evaluating the AI's understanding of various machine learning architectures and their typical use cases.


The Odd Model Out task challenges the AI to identify which machine learning concept doesn't belong in a given list, assessing its ability to recognize nuanced differences between ML approaches. For the PWC Metrics task, given a specific paper and model name, the AI must predict which metrics were used to evaluate the model's performance, testing its knowledge of the evaluation metrics appropriate to different ML tasks and domains. Finally, the PWC Metrics:Result task builds on the previous one, requiring the AI not only to identify the relevant metrics but also to recall the specific performance figures reported in the paper; this is perhaps the most challenging task, since it demands detailed knowledge of state-of-the-art results across ML subfields.
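To make the item format concrete, here is a small illustrative sketch of what an Odd Model Out question might look like. The specific models listed and the field names are our own illustration, not items taken from the released benchmark.

```python
# Hypothetical Odd Model Out item, for illustration only (not from the released benchmark).
odd_model_out_example = {
    "question": "Predict which model least fits with the others: "
                "ResNet-50, EfficientNet-B0, ViT-B/16, BERT-base",
    "answer": "BERT-base",  # the only language model among image classifiers
}
print(odd_model_out_example["question"])
```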



Methodology: Building the ARIA Benchmarks


Data Source: Papers With Code

To create these benchmarks, we leveraged datasets provided by Papers With Code, a platform that connects machine learning research papers with their associated code implementations. This rich source of information includes research papers from various ML domains, model architectures and implementations, dataset details, and performance metrics and results.
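As a rough illustration, the sketch below shows one way such a dump could be loaded once downloaded locally. The file name and the "name"/"modalities" fields are assumptions about the dump's schema rather than a documented API; adjust them to match the actual files.

```python
import gzip
import json

# Sketch only: read a locally downloaded Papers With Code datasets dump and
# collect dataset -> modality pairs. The file name and the "name"/"modalities"
# fields are assumed, not documented here.
with gzip.open("datasets.json.gz", "rt", encoding="utf-8") as f:
    datasets = json.load(f)

dataset_modalities = {
    entry["name"]: entry["modalities"]
    for entry in datasets
    if entry.get("modalities")
}
print(f"{len(dataset_modalities)} datasets with modality labels")
```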


Data Preprocessing and Benchmark Design

Transforming the Papers With Code data into a standardized benchmark required several key steps. We developed algorithms to automatically generate natural language questions from the structured data. For the multiple-choice tasks, we curated fixed sets of answer choices: Audio, Computer Vision, Graphs, Natural Language Processing, Reinforcement Learning, and Sequential for Model Modality QA, and Audio, Graphs, Images, Tabular, Texts, and Videos for Dataset Modality QA. Each generated question and answer set was validated for accuracy and relevance, and we balanced the representation of ML subfields and difficulty levels across the benchmark.
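A minimal sketch of how a structured (dataset, modality) record could be turned into a multiple-choice item is shown below; the helper name and item layout are illustrative, not the exact generation pipeline. Only the answer-choice list comes from the description above.

```python
# Illustrative item construction for Dataset Modality QA; the helper and the
# dictionary layout are a sketch, not the project's actual pipeline.
DATASET_MODALITY_CHOICES = ["Audio", "Graphs", "Images", "Tabular", "Texts", "Videos"]

def make_dataset_modality_item(dataset_name: str, modality: str) -> dict:
    if modality not in DATASET_MODALITY_CHOICES:
        raise ValueError(f"unexpected modality: {modality}")
    return {
        "question": ("Given the following machine learning dataset, "
                     f"predict the modality of the dataset: {dataset_name}"),
        "choices": DATASET_MODALITY_CHOICES,
        "answer": modality,
    }

items = [
    make_dataset_modality_item("LibriSpeech", "Audio"),
    make_dataset_modality_item("CIFAR-10", "Images"),
]
```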


Prompt Design

For the Model Modality QA task, we used prompts like "Given the following machine learning model, predict the modality of the model: {{model}}". The Dataset Modality QA task used similar language: "Given the following machine learning dataset, predict the modality of the dataset: {{dataset}}". For the Odd Model Out task, we asked models to "Predict which model least fits with the others: {{list_of_models}}". The PWC Metrics task used prompts such as "What metrics were used to measure the {{model}} in the {{paper_title}} on the {{dataset}} dataset?", while the PWC Metrics:Result task extended this to "What were the metrics and results used to measure the {{model}} in the {{paper_title}} on the {{dataset}} dataset?"
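The templates above use double-brace placeholders. A minimal rendering helper, written as a sketch rather than the project's actual code, might look like this:

```python
# Sketch of filling the prompt templates quoted above; the render() helper is
# illustrative rather than the project's implementation.
PROMPTS = {
    "model_modality": "Given the following machine learning model, predict the modality of the model: {{model}}",
    "dataset_modality": "Given the following machine learning dataset, predict the modality of the dataset: {{dataset}}",
    "odd_model_out": "Predict which model least fits with the others: {{list_of_models}}",
    "pwc_metrics": "What metrics were used to measure the {{model}} in the {{paper_title}} on the {{dataset}} dataset?",
    "pwc_metrics_result": "What were the metrics and results used to measure the {{model}} in the {{paper_title}} on the {{dataset}} dataset?",
}

def render(template: str, **fields: str) -> str:
    # Replace each {{key}} placeholder with its supplied value.
    for key, value in fields.items():
        template = template.replace("{{" + key + "}}", value)
    return template

print(render(PROMPTS["odd_model_out"], list_of_models="ResNet-50, ViT-B/16, BERT-base"))
```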








The Models We Evaluated

We tested a wide range of state-of-the-art large language models, including both proprietary and open-source options. Among the proprietary models, we evaluated the GPT-4 suite (including the 'o' version), GPT-3.5-turbo-0125, the Claude 3 models (Opus, Sonnet, and Haiku), and Gemini Pro. On the open-source side, we tested Mistral-7B (v0.1 and v0.3), Intel/neural-chat-7b-v3-1, openchat_3.5, zephyr-7b-beta, Meta-Llama-3-8B-Instruct, and Phi-3-medium-4k-instruct. This diverse selection allows us to compare performance across different model sizes, architectures, and training approaches.


Key Findings and Results

Our study revealed insights into the machine learning knowledge embedded within these frontier AI models. Among the top performers, GPT-4o demonstrated superior performance in three out of the five benchmarks, showcasing its broad and deep understanding of ML concepts. Claude Opus followed closely behind, particularly excelling in the Dataset Modality QA task.


Looking at task-specific highlights, we saw varying levels of performance across different models and tasks. In the Dataset Modality QA task, Claude Opus achieved the highest accuracy at 71.9%, while Mistral-7B-v0.1 had the lowest at 40.05%. The Model Modality QA task saw generally strong performance, with GPT-4o leading at 85.3% accuracy and even some open-source models exceeding 70%. The Odd Model Out task proved challenging for most models: GPT-4o led at 56.2% accuracy, while many models struggled to surpass 30%. The PWC Metrics and PWC Metrics:Result tasks highlighted the difficulty of recalling specific details from research papers; GPT-4o and Claude Opus consistently outperformed other models, but accuracy was generally lower than on the other tasks.


In terms of performance across model types, large proprietary models like GPT-4 and Claude 3 consistently outperformed smaller open-source models. However, some open-source models showed impressive results on certain tasks, demonstrating the rapid progress in publicly available AI.








Accuracy of the proprietary models on each ARIA task (proportion correct):

| Task | GPT-4o | GPT-4 | GPT-3.5-Turbo | Claude-Opus | Claude-Sonnet | Claude-Haiku | Gemini-Pro |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset Modality QA | 0.685 | 0.62 | 0.477 | 0.719 | 0.699 | 0.716 | 0.458 |
| Model Modality QA | 0.853 | 0.82 | 0.731 | 0.798 | 0.748 | 0.788 | 0.756 |
| Odd Model Out Benchmark | 0.562 | 0.456 | 0.354 | 0.451 | 0.369 | 0.307 | 0.371 |
| PWC Metrics 1000 | 0.53 | 0.466 | 0.392 | 0.497 | 0.422 | 0.273 | 0.373 |
| PWC Metrics:Result 1000 | 0.025 | 0.03 | 0.085 | 0.065 | 0.02 | 0.025 | 0.055 |




Implications and Future Directions

The ARIA Benchmarks provide insights into the current state of AI's machine learning knowledge. The results demonstrate that large language models have indeed internalized a significant amount of machine learning knowledge during their training process. While some models excelled across all tasks, others showed strengths in specific areas, suggesting potential benefits in developing both generalist and specialist AI models for different applications. These benchmarks could be used to identify gaps in AI models' understanding of machine learning concepts, potentially informing curriculum development for both AI systems and human learners.


Reproducibility and Open Science

To foster transparency and encourage further research, we've made our benchmark creation scripts and evaluation framework publicly available. Researchers can access these resources to reproduce our results, evaluate new models on the ARIA Benchmarks, and extend the benchmarks with additional tasks or datasets. We've provided detailed instructions for using the inspect framework (https://inspect.ai-safety-institute.org.uk/) to run the benchmarks, ensuring consistency in evaluation across different research groups.
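As a rough sketch of what an inspect task looks like (not the repository's actual task definitions, and subject to version differences in the inspect API), a single Dataset Modality QA item could be wrapped roughly like this:

```python
# Sketch of an inspect task for one Dataset Modality QA item. This is not the
# released ARIA code; inspect's API details may differ between versions.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice

@task
def dataset_modality_qa():
    samples = [
        Sample(
            input=("Given the following machine learning dataset, "
                   "predict the modality of the dataset: LibriSpeech"),
            choices=["Audio", "Graphs", "Images", "Tabular", "Texts", "Videos"],
            target="A",  # letter of the correct choice ("Audio")
        )
    ]
    return Task(dataset=samples, solver=multiple_choice(), scorer=choice())
```

With a file like this saved as, say, aria_sketch.py, it could be run with something like `inspect eval aria_sketch.py --model openai/gpt-4o`, assuming API credentials are configured; the names here are hypothetical.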


Limitations and Future Work

While the ARIA Benchmarks represent a step forward in AI evaluation, we acknowledge several limitations. The closed-book nature of the benchmarks, while testing embedded knowledge, may not fully reflect a model's ability to reason or apply knowledge in real-world scenarios. Our focus on prominent research might miss emerging or niche areas of ML research. The multiple-choice format, while allowing for standardized evaluation, may miss nuances in model decision-making or reasoning processes. Additionally, the underlying Papers With Code data may contain biases in terms of research focus or publication patterns, which could influence benchmark results.


Future work on the ARIA Benchmarks could include expanding the range of tasks to cover more diverse aspects of machine learning, developing open-ended question formats to better assess reasoning capabilities, creating multilingual versions of the benchmarks to evaluate models across different languages, and incorporating time-based elements to assess models' awareness of recent ML developments.


Conclusion

The ARIA Benchmarks represent a step forward in our ability to assess and compare the machine learning knowledge embedded within large language models. By providing a standardized, reproducible framework for evaluation, we hope to contribute to the ongoing conversation about AI capabilities, limitations, and future directions.

We invite the AI research community to engage with these benchmarks, reproduce our results, and contribute to their ongoing evolution.
