
The ArXiv Research Code Dataset

Matthew Kenney

Updated: Sep 4, 2024




We introduce the ArXiv Research Code Dataset, a collection of code extracted from repositories linked to computer science papers published on ArXiv. By concentrating on research code used in computer science publications, the dataset offers insight into the programming languages, libraries, and coding patterns employed in computational research. It comprises 4,716,175 code files from 129,232 unique repositories, representing a portion of publicly available, high-quality computer science research code. The dataset can aid in the development and training of large language models (LLMs), retrieval-augmented generation, code completion, and agentic research applications.



Dataset Creation Methodology

The ArXiv Research Code Dataset was created using a multi-stage process. Initially, we extracted metadata and full text content from computer science ArXiv papers, identifying those with associated GitHub repository URLs. These URLs were then used to download the corresponding repositories. We developed a library (to be released shortly) to process the downloaded repositories, extracting individual code files while focusing on common research-oriented programming languages. We process each code file to derive metrics including file length, average line length, and maximum line length. The resulting dataset comprises these code files along with their associated metadata, providing a snapshot of code used in research across CS ArXiv. 
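As a rough illustration of the URL-identification step, here is a minimal sketch that pulls GitHub repository references out of a paper's text. The regular expression and the function name extract_github_repos are assumptions made for this example. They are not taken from our processing library.

```python
import re

# Hypothetical pattern: matches http(s)://github.com/<owner>/<repo> and stops
# at the first character that cannot appear in an owner or repository name.
GITHUB_URL = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def extract_github_repos(text: str) -> list[str]:
    """Return unique 'owner/repo' slugs in order of first appearance."""
    seen: dict[str, None] = {}
    for owner, repo in GITHUB_URL.findall(text):
        # Drop sentence-ending periods and a trailing '.git' suffix,
        # neither of which is part of the repository name.
        repo = repo.rstrip(".")
        if repo.endswith(".git"):
            repo = repo[:-4]
        if repo:
            seen.setdefault(f"{owner}/{repo}", None)
    return list(seen)


paper_text = (
    "Code is available at https://github.com/foo/bar. "
    "A mirror lives at http://github.com/foo/bar.git and the baseline at "
    "https://github.com/baz/qux-1/tree/main."
)
repos = extract_github_repos(paper_text)
```

Normalizing to owner/repo slugs before downloading avoids fetching the same repository twice when a paper links to it in several forms.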


Dataset Composition and Metadata

The dataset includes the code files along with metadata for each file. This metadata encompasses file length, average line length, and maximum line length, providing context about code structure and complexity. The inclusion of full code content enables detailed analysis of coding practices, patterns, and styles prevalent in academic research.

The dataset is structured with the following columns:

  • repo: The repository name

  • file: The file path within the repository

  • code: The actual content of the file

  • file_length: The number of lines in the file

  • avg_line_length: The average length of lines in the file

  • max_line_length: The maximum line length in the file

  • extension_type: The file extension
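To make the columns concrete, the sketch below builds one record from a repository name, a file path, and the file's text. The helper names file_metrics and make_record are hypothetical. Measuring line length in characters and storing the extension without its leading dot are also assumptions, since the column descriptions above do not fix either detail.

```python
from pathlib import PurePosixPath


def file_metrics(code: str) -> dict:
    """Compute the three length metrics for one file's text."""
    lines = code.splitlines()
    lengths = [len(line) for line in lines]  # assumption: length in characters
    return {
        "file_length": len(lines),
        "avg_line_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_line_length": max(lengths, default=0),
    }


def make_record(repo: str, path: str, code: str) -> dict:
    """Assemble one row with the seven columns listed above."""
    record = {"repo": repo, "file": path, "code": code}
    record.update(file_metrics(code))
    # Assumption: the extension is stored without the leading dot.
    record["extension_type"] = PurePosixPath(path).suffix.lstrip(".")
    return record


example = make_record("demo-repo", "src/train.py", "a\nbcd\n")
```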


Dataset Statistics

The ArXiv Research Code Dataset comprises a total of 4,716,175 individual code files, sourced from 129,232 unique repositories. The dataset's size, totaling 21.6 GB, reflects the substantial volume of code associated with scientific publications on ArXiv.

An analysis of the dataset's language composition shows the range of programming languages used in academic research. Python is the most prevalent language, accounting for 827,135 files or 17.54% of the dataset. Among individually named languages, it is followed by C/C++ with 743,207 files (15.76%) and Java with 615,191 files (13.04%), with HTML (7.62%) and C (6.41%) rounding out the top five. Files in the aggregated "Other" category account for a further 15.90%.


Language       Number of Files   Percentage
Python         827135            17.54%
Other          749865            15.90%
C/C++          743207            15.76%
Java           615191            13.04%
HTML           359375            7.62%
C              302533            6.41%
Markdown       201196            4.27%
Objective-C    170582            3.62%
C++            162715            3.45%
YAML           142877            3.03%
Go             125270            2.66%
Shell          88581             1.88%
TypeScript     50907             1.08%
Ruby           34739             0.74%
R              25311             0.54%
Rust           24026             0.51%
Scala          23478             0.50%
CSS            16999             0.36%
PHP            15979             0.34%
Perl           12308             0.26%
SQL            8956              0.19%
Lua            6776              0.14%
C#             4022              0.09%
Swift          3986              0.08%
JavaScript     161               0.00%
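The percentages above are each language's file count divided by the total number of files. The short sketch below reproduces that arithmetic from the reported counts and confirms that the per-language counts add up to the dataset total of 4,716,175. The variable and function names are illustrative only.

```python
# File counts per language, copied from the figures reported above.
LANGUAGE_COUNTS = {
    "Python": 827135, "Other": 749865, "C/C++": 743207, "Java": 615191,
    "HTML": 359375, "C": 302533, "Markdown": 201196, "Objective-C": 170582,
    "C++": 162715, "YAML": 142877, "Go": 125270, "Shell": 88581,
    "TypeScript": 50907, "Ruby": 34739, "R": 25311, "Rust": 24026,
    "Scala": 23478, "CSS": 16999, "PHP": 15979, "Perl": 12308,
    "SQL": 8956, "Lua": 6776, "C#": 4022, "Swift": 3986, "JavaScript": 161,
}

total_files = sum(LANGUAGE_COUNTS.values())


def share(count: int, total: int) -> float:
    """Percentage of the dataset, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {lang: share(n, total_files) for lang, n in LANGUAGE_COUNTS.items()}
for lang, pct in shares.items():
    print(f"{lang:<12} {LANGUAGE_COUNTS[lang]:>8} {pct:6.2f}%")
```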









The dataset also captures important metadata about the repositories. Notably, 32,547 repositories contain Python files, indicating the widespread use of Python in research contexts. The average file depth of 5.61 suggests a moderate level of complexity in project structures. 



Analysis of Python Code

Given the prevalence of Python in scientific computing, we conducted a focused analysis on the Python subset of the ArXiv Research Code dataset. This subset represents approximately 17% of the full dataset. The average file depth of 4.07 suggests a moderate level of project structure complexity in Python-based research code.


Examining the distribution of files across repositories, we find that the repository with the highest file count, "catboost," contains 22,994 files, followed by "Serverless_Variance" with 20,898 files and "Sym-NCO" with 18,194 files. This wide range in repository sizes, from thousands of files to single-digit counts, reflects the diversity of Python-based research projects, from large-scale collaborative efforts to smaller, focused studies.


Analysis of library usage provides insights into the tools and frameworks prevalent in academic research. NumPy stands out as the most widely used library, appearing in 30.38% of all Python files, underscoring its fundamental role in scientific computing. PyTorch follows at 19.80%, indicating a strong preference for this deep learning framework in research contexts. The presence of pandas (4.33%), TensorFlow (3.85%), and matplotlib (1.52%) in the top five most used libraries further highlights the importance of data manipulation, machine learning, and data visualization in contemporary research.
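One plausible way to measure library usage is to parse each file and collect the top-level names it imports. The sketch below does this with Python's standard ast module. It is an illustration under assumptions, not a description of our exact pipeline: the TRACKED mapping, the function names, and the choice to count unparseable files as using no libraries are all hypothetical.

```python
import ast
from collections import Counter

# Hypothetical mapping from top-level import names to display names.
TRACKED = {
    "numpy": "NumPy",
    "torch": "PyTorch",
    "pandas": "pandas",
    "tensorflow": "TensorFlow",
    "matplotlib": "matplotlib",
}


def imported_libraries(source: str) -> set[str]:
    """Return the tracked libraries imported anywhere in one file."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return set()  # assumption: unparseable files count as no usage
    roots = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots.add(node.module.split(".")[0])  # skip relative imports
    return {TRACKED[name] for name in roots if name in TRACKED}


def usage_percentages(files: list[str]) -> dict[str, float]:
    """Percentage of the given files that import each tracked library."""
    if not files:
        return {}
    counts = Counter()
    for source in files:
        counts.update(imported_libraries(source))
    return {lib: round(100 * n / len(files), 2) for lib, n in counts.items()}


sample_files = [
    "import numpy as np\nimport torch.nn as nn\n",
    "from matplotlib import pyplot as plt\n",
    "print('no imports here')\n",
    "def broken(:\n",
]
usage = usage_percentages(sample_files)
```

Each file contributes at most once per library, so a file with several NumPy imports still counts as a single NumPy file.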


To understand the structural characteristics of Python research code, we computed several metrics across the dataset. On average, each file contains 7.64 import statements, 8.29 function definitions, and 1.34 class definitions. The mean number of loops per file is 2.92, with list comprehensions appearing on average 1.04 times per file. These figures suggest a tendency towards modular, function-oriented code in research settings. The average of 0.46 try-except blocks per file indicates a moderate level of error handling, while the presence of 0.37 lambda functions per file on average suggests some use of functional programming paradigms.
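Counts like these can be approximated by walking each file's syntax tree. The following sketch uses the standard ast module and is only one way to do it. The grouping of node types in NODE_LABELS (for example, treating both for and while as loops) is an assumption of this example and may differ from the definitions behind the reported averages.

```python
import ast

# Assumed grouping of syntax-tree node types into the reported categories.
NODE_LABELS = {
    ast.Import: "imports",
    ast.ImportFrom: "imports",
    ast.FunctionDef: "functions",
    ast.AsyncFunctionDef: "functions",
    ast.ClassDef: "classes",
    ast.For: "loops",
    ast.AsyncFor: "loops",
    ast.While: "loops",
    ast.ListComp: "list_comprehensions",
    ast.Try: "try_except",
    ast.Lambda: "lambdas",
}


def structure_counts(source: str) -> dict[str, int]:
    """Count structural constructs in one parseable Python file."""
    counts = {label: 0 for label in set(NODE_LABELS.values())}
    for node in ast.walk(ast.parse(source)):
        label = NODE_LABELS.get(type(node))
        if label is not None:
            counts[label] += 1
    counts["lines_of_code"] = len(source.splitlines())
    return counts


sample = "\n".join([
    "import os",
    "from sys import path",
    "",
    "class A:",
    "    def f(self):",
    "        for i in range(3):",
    "            pass",
    "        return [x for x in range(3)]",
    "",
    "g = lambda x: x",
    "try:",
    "    pass",
    "except Exception:",
    "    pass",
])
counts = structure_counts(sample)
```

Averaging each entry over all valid files would give per-file means of the kind reported above.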


The complexity and scale of Python research code are evident in the lines of code metric, averaging 220.16 lines per file. However, the maximum value of 682,678 lines demonstrates the existence of extremely large, complex scripts or modules in some research projects.

Focusing on potential indicators of machine learning and deep learning research, we found that 23.95% of Python files use at least one ML/DL library. 






Library        Number of Files   Percentage
NumPy          417793            30.38%
PyTorch        272330            19.80%
Pandas         59505             4.33%
TensorFlow     52918             3.85%
Matplotlib     20844             1.52%
SciPy          16143             1.17%
Scikit Learn   6005              0.44%
Keras          3773              0.27%
NLTK           2970              0.22%
Spacy          1362              0.10%
XGBoost        890               0.06%
Gensim         638               0.05%
LightGBM       324               0.02%



Metric                     Average
Import Statement Count     7.64
Function Count             8.29
Class Count                1.34
For-loop Count             2.92
List Comprehension Count   1.04
Try-Except Count           0.46
Lambda Count               0.37
Global Variable Count      0.07
Lines of Code              220.16


  • Total number of repositories with Python files: 23874

  • Overall average cyclomatic complexity: 23.88

  • Overall maximum cyclomatic complexity: 22401

  • Overall minimum cyclomatic complexity: 1
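For readers unfamiliar with the metric, cyclomatic complexity is commonly approximated as one plus the number of decision points in the code. The sketch below applies that approximation at the file level with the standard ast module. The set of node types in DECISION_NODES is an assumption of this example, and dedicated tools may count some constructs differently, so its output should not be expected to match the figures above exactly.

```python
import ast

# Assumed set of constructs that each add one independent path.
DECISION_NODES = (
    ast.If,
    ast.IfExp,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.Assert,
    ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate file-level complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


straight_line = "x = 1\ny = x + 1\n"
branching = "\n".join([
    "def f(a, b):",
    "    if a and b:",
    "        return 1",
    "    for i in range(3):",
    "        pass",
    "    return 0",
])
```

Under this definition a file with no branches scores 1, which is consistent with the minimum reported above.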


Python File Validity Results:

  • Total Python files: 1415924

  • Valid Python files: 1375548

  • Invalid Python files: 40376


Percentage of valid Python files: 97.15%
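A simple way to run this kind of validity check is to test whether each file parses. The sketch below assumes that "valid" means the file parses under the Python 3 interpreter running the check. That definition and the function names are assumptions of this example.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the text parses as Python 3 under the running interpreter."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError, RecursionError, MemoryError):
        return False
    return True


def validity_summary(sources: list[str]) -> dict:
    """Count valid and invalid files and the valid percentage."""
    total = len(sources)
    valid = sum(is_valid_python(s) for s in sources)
    percent = round(100 * valid / total, 2) if total else 0.0
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "percent_valid": percent,
    }


summary = validity_summary(["x = 1", "def f(:", "print 'hello'"])

# The reported percentage follows from the counts above.
reported_percent = round(100 * 1375548 / 1415924, 2)
```

Under this assumption, Python 2 code such as the print statement in the sample is counted as invalid.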


Top 10 repositories by average cyclomatic complexity:

Repo: psitip

  • Average Complexity: 6457.16

  • Max Complexity: 10627

  • Min Complexity: 53

  • Python File Count: 19


Repo: mona

  • Average Complexity: 2574.00

  • Max Complexity: 2574

  • Min Complexity: 2574

  • Python File Count: 1


Repo: AutoMunge

  • Average Complexity: 2253.00

  • Max Complexity: 6757

  • Min Complexity: 1

  • Python File Count: 3


Repo: angrist

  • Average Complexity: 1707.00

  • Max Complexity: 1707

  • Min Complexity: 1707

  • Python File Count: 1


Repo: sma-mir

  • Average Complexity: 1313.10

  • Max Complexity: 22401

  • Min Complexity: 19

  • Python File Count: 21


Repo: HIBPool

  • Average Complexity: 1240.00

  • Max Complexity: 1240

  • Min Complexity: 1240

  • Python File Count: 1


Repo: DeepNC

  • Average Complexity: 1087.86

  • Max Complexity: 14753

  • Min Complexity: 1

  • Python File Count: 28


Repo: GUIcandid

  • Average Complexity: 841.00

  • Max Complexity: 841

  • Min Complexity: 841

  • Python File Count: 1


Repo: p4-codel

  • Average Complexity: 837.78

  • Max Complexity: 7161

  • Min Complexity: 1

  • Python File Count: 9


Repo: Lenia

  • Average Complexity: 811.67

  • Max Complexity: 1334

  • Min Complexity: 26

  • Python File Count: 6
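A ranking like the one above can be produced by grouping per-file complexity values by repository. The sketch below shows one way to do that with the standard library. The input format, a list of (repo, complexity) pairs, and the function name are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def repo_complexity_summary(rows, top_n=10):
    """Rank repositories by average per-file complexity, highest first.

    rows: iterable of (repo, complexity) pairs, one per Python file.
    """
    by_repo = defaultdict(list)
    for repo, complexity in rows:
        by_repo[repo].append(complexity)
    summary = [
        {
            "repo": repo,
            "average": round(mean(values), 2),
            "max": max(values),
            "min": min(values),
            "python_file_count": len(values),
        }
        for repo, values in by_repo.items()
    ]
    summary.sort(key=lambda row: row["average"], reverse=True)
    return summary[:top_n]


rows = [("a", 10), ("a", 20), ("b", 100), ("c", 1)]
ranking = repo_complexity_summary(rows, top_n=2)
```

Because the ranking uses a per-repository average, a repository with a single very complex file can place highly. Several entries above, such as mona and angrist, contain one Python file, so their average, maximum, and minimum coincide.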


Limitations and Future Work


While the ArXiv Research Code Dataset offers valuable insights into computational practices in academic research, it is important to acknowledge several limitations of our current approach.


A constraint of our dataset stems from its reliance on ArXiv as the source of research papers. This focus inevitably introduces a bias towards fields that commonly use ArXiv for preprint distribution. Consequently, we miss research code associated with papers published elsewhere. Future work could address this limitation by expanding the scope to include code repositories linked to papers from a broader range of preprint servers and journals.


Our GitHub-centric approach to code collection presents another potential bias. By focusing exclusively on GitHub repositories, we may overlook code hosted on alternative platforms or in private repositories. This limitation could skew our understanding of coding practices, particularly in fields or regions where other code hosting solutions are prevalent. Expanding our data collection to encompass other popular code hosting platforms would mitigate this bias and offer a more holistic view of research code practices.

The current dataset provides a static snapshot of research code at a specific point in time, which limits our ability to capture the dynamic nature of code development in academic research.


Conclusion

The ArXiv Research Code Dataset provides a resource for studying computational practices in academic research. By offering access to a collection of research-oriented code linked to scientific publications, we have created a dataset for examining the role of software in computer science research.


Our analysis provides a first-pass insight into research code, including the distribution of programming languages, structural patterns, and the use of libraries and frameworks across computer science disciplines.


This dataset addresses a gap in existing code collections by focusing specifically on code associated with academic research. We believe that this dataset can be used across a number of applications including LLM pretraining, retrieval augmented generation, code completion, and agentic research applications. Please feel free to provide feedback at info@algorithmicresearchgroup.com


©2024 Algorithmic Research Group. All Rights Reserved.
