The ArXiv Research Code Dataset

Matthew Kenney
Sep 3, 2024

Updated: Sep 4, 2024

We introduce the ArXiv Research Code Dataset, a collection of code extracted from repositories linked to computer science papers published on ArXiv. By concentrating on research code used in computer science publications, the dataset offers insight into the programming languages, libraries, and coding patterns employed in computational research. It comprises 4,716,175 code files from 129,232 unique repositories, representing a portion of publicly available, high-quality computer science research code. The dataset can aid in the development and training of large language models (LLMs) and in applications such as retrieval-augmented generation, code completion, and agentic research.

Dataset Creation Methodology

The ArXiv Research Code Dataset was created using a multi-stage process. First, we extracted metadata and full-text content from computer science ArXiv papers and identified those with associated GitHub repository URLs. These URLs were then used to download the corresponding repositories. We developed a library (to be released shortly) to process the downloaded repositories, extracting individual code files while focusing on common research-oriented programming languages. We then processed each code file to derive metrics including file length, average line length, and maximum line length. The resulting dataset comprises these code files along with their associated metadata, providing a snapshot of code used in research across CS ArXiv.
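The URL identification step might look something like the following. This is a minimal sketch, not the library mentioned above (which has not been released): the regular expression and the function name `extract_github_repos` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches github.com/<owner>/<repo>, with or without a scheme.
GITHUB_URL = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def extract_github_repos(text: str) -> list[str]:
    """Return unique 'owner/repo' slugs in order of first appearance."""
    seen = {}
    for owner, repo in GITHUB_URL.findall(text):
        repo = repo.rstrip(".")       # drop a sentence-ending period
        if repo.endswith(".git"):     # normalise clone URLs
            repo = repo[:-4]
        if repo:
            seen.setdefault(f"{owner}/{repo}", None)
    return list(seen)
```

Deduplicating on the normalised `owner/repo` slug means a paper that cites the same repository several times, or in both browser and clone form, yields a single download target.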

Dataset Composition and Metadata

The dataset includes the code files along with metadata for each file. This metadata encompasses file length, average line length, and maximum line length, providing context about code structure and complexity. The inclusion of full code content enables detailed analysis of coding practices, patterns, and styles prevalent in academic research.

The dataset is structured with the following columns:

repo: The repository name
file: The file path within the repository
code: The actual content of the file
file_length: The number of lines in the file
avg_line_length: The average length of lines in the file
max_line_length: The maximum line length in the file
extension_type: The file extension
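As an illustration of how these columns could be derived for a single file, here is a small sketch. It assumes line lengths are measured in characters without the trailing newline and that the extension is stored without its leading dot; the post does not specify either convention, and `file_metrics` is a hypothetical helper name.

```python
import os


def file_metrics(path: str, code: str) -> dict:
    """Build one dataset-style row (minus 'repo') for a file's path and content."""
    lines = code.splitlines()
    lengths = [len(line) for line in lines]  # characters per line, newline excluded
    return {
        "file": path,
        "code": code,
        "file_length": len(lines),
        "avg_line_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_line_length": max(lengths, default=0),
        # Assumption: extension stored without the leading dot.
        "extension_type": os.path.splitext(path)[1].lstrip("."),
    }
```

For example, `file_metrics("src/main.py", "import os\n\nprint(os.getcwd())\n")` reports three lines, an average line length of 9.0, and a maximum of 18.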

Dataset Statistics

The ArXiv Research Code Dataset comprises a total of 4,716,175 individual code files, sourced from 129,232 unique repositories. The dataset's size, totaling 21.6 GB, reflects the substantial volume of code associated with scientific publications on ArXiv.

An analysis of the dataset's language composition reveals the diversity of programming languages used in academic research. Among named languages, Python is the most prevalent, accounting for 827,135 files (17.54% of the dataset), followed by C/C++ with 743,207 files (15.76%) and Java with 615,191 files (13.04%). HTML and C round out the top five at 7.62% and 6.41% of the files, respectively. The "Other" category accounts for a further 749,865 files (15.90%).

Language	Number of Files	Percentage
Python	827135	17.54%
Other	749865	15.90%
C/C++	743207	15.76%
Java	615191	13.04%
HTML	359375	7.62%
C	302533	6.41%
Markdown	201196	4.27%
Objective-C	170582	3.62%
C++	162715	3.45%
YAML	142877	3.03%
Go	125270	2.66%
Shell	88581	1.88%
TypeScript	50907	1.08%
Ruby	34739	0.74%
R	25311	0.54%
Rust	24026	0.51%
Scala	23478	0.50%
CSS	16999	0.36%
PHP	15979	0.34%
Perl	12308	0.26%
SQL	8956	0.19%
Lua	6776	0.14%
C#	4022	0.09%
Swift	3986	0.08%
JavaScript	161	0.00%
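A breakdown like the one above can be produced by tallying the extension_type column. The sketch below is illustrative only: the extension-to-language mapping is a small hypothetical subset, not the mapping used to build the table.

```python
from collections import Counter

# Hypothetical, deliberately incomplete mapping; unknown extensions fall into "Other".
EXTENSION_TO_LANGUAGE = {
    "py": "Python",
    "java": "Java",
    "html": "HTML",
    "md": "Markdown",
    "go": "Go",
}


def language_shares(extensions):
    """Return (language, file_count, percent) tuples, most common first."""
    counts = Counter(
        EXTENSION_TO_LANGUAGE.get(ext.lower(), "Other") for ext in extensions
    )
    total = sum(counts.values())
    return [(lang, n, round(100 * n / total, 2)) for lang, n in counts.most_common()]
```

The percentages in the table follow the same arithmetic: 827,135 Python files out of 4,716,175 total is 17.54%.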

The dataset also captures important metadata about the repositories. Notably, 32,547 repositories contain Python files, indicating the widespread use of Python in research contexts. The average file depth of 5.61 suggests a moderate level of complexity in project structures.
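The post does not define file depth precisely. One plausible reading, used in this sketch as an explicit assumption, is the number of directories between the repository root and the file, computed from the repository-relative file path.

```python
from pathlib import PurePosixPath


def file_depth(path: str) -> int:
    """Directories above the file, for a repository-relative path (assumed definition)."""
    return len(PurePosixPath(path).parts) - 1


def average_depth(paths) -> float:
    """Mean depth over an iterable of repository-relative paths."""
    depths = [file_depth(p) for p in paths]
    return sum(depths) / len(depths) if depths else 0.0
```

Under this definition a top-level `README.md` has depth 0 and `src/models/net.py` has depth 2. A different convention, such as counting the file itself, would shift every value by one.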

Analysis of Python Code

Given the prevalence of Python in scientific computing, we conducted a focused analysis of the Python subset of the ArXiv Research Code Dataset. This subset represents approximately 17% of the full dataset. The average file depth of 4.07 suggests a moderate level of project structure complexity in Python-based research code.

Examining the distribution of files across repositories, we find that the repository with the highest file count, "catboost," contains 22,994 files, followed by "Serverless_Variance" with 20,898 files and "Sym-NCO" with 18,194 files. This wide range in repository sizes, from thousands of files to single-digit counts, reflects the diversity of Python-based research projects, from large-scale collaborative efforts to smaller, focused studies.

Analysis of library usage provides insights into the tools and frameworks prevalent in academic research. NumPy stands out as the most widely used library, appearing in 30.38% of all Python files, underscoring its fundamental role in scientific computing. PyTorch follows at 19.80%, indicating a strong preference for this deep learning framework in research contexts. The presence of pandas (4.33%), TensorFlow (3.85%), and matplotlib (1.52%) in the top five most used libraries further highlights the importance of data manipulation, machine learning, and data visualization in contemporary research.
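One way to measure this kind of per-file library usage is to parse each file and collect the top-level module names it imports. The sketch below uses Python's standard `ast` module; it is an assumption about how such a count could be made, not a description of the analysis actually run. Note that libraries are matched by import name, so PyTorch appears as `torch` and scikit-learn as `sklearn`.

```python
import ast
from collections import Counter


def imported_top_level_modules(source: str) -> set[str]:
    """Top-level names of absolutely imported modules in one Python file."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import utils".
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


def library_usage(sources, libraries: set[str]) -> dict[str, float]:
    """Percent of parseable files importing each library (by import name)."""
    counts, parsed = Counter(), 0
    for source in sources:
        try:
            modules = imported_top_level_modules(source)
        except (SyntaxError, ValueError):
            continue  # assumption: unparseable files are left out of the denominator
        parsed += 1
        counts.update(modules & libraries)
    if parsed == 0:
        return {lib: 0.0 for lib in libraries}
    return {lib: round(100 * counts[lib] / parsed, 2) for lib in libraries}
```

Counting each library at most once per file matches the "percentage of files" framing, as opposed to counting every import statement.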

To understand the structural characteristics of Python research code, we computed several metrics across the dataset. On average, each file contains 7.64 import statements, 8.29 function definitions, and 1.34 class definitions. The mean number of for loops per file is 2.92, with list comprehensions appearing on average 1.04 times per file. These figures suggest a tendency towards modular, function-oriented code in research settings. The average of 0.46 try-except blocks per file indicates a moderate level of error handling, while the presence of 0.37 lambda functions per file on average suggests some use of functional programming paradigms.
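Per-file counts of this kind can be gathered by walking a file's syntax tree. The following is a rough sketch using the standard `ast` module; the label names and the choice of which node types to count are assumptions, and the original analysis may have counted differently.

```python
import ast
from collections import Counter

# Assumed mapping from syntax-tree node types to the metrics discussed above.
NODE_LABELS = {
    ast.Import: "imports",
    ast.ImportFrom: "imports",
    ast.FunctionDef: "functions",
    ast.AsyncFunctionDef: "functions",
    ast.ClassDef: "classes",
    ast.For: "for_loops",
    ast.AsyncFor: "for_loops",
    ast.ListComp: "list_comprehensions",
    ast.Try: "try_blocks",
    ast.Lambda: "lambdas",
}


def structural_counts(source: str) -> Counter:
    """Count selected constructs in one Python file, including nested ones."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        label = NODE_LABELS.get(type(node))
        if label:
            counts[label] += 1
    return counts
```

Because `ast.walk` visits every node, methods inside classes and loops inside functions are included. A statement such as `import a, b` counts as one import statement here, not two.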

The complexity and scale of Python research code are evident in the lines of code metric, averaging 220.16 lines per file. However, the maximum value of 682,678 lines demonstrates the existence of extremely large, complex scripts or modules in some research projects.

Focusing on potential indicators of machine learning and deep learning research, we found that 23.95% of Python files use at least one ML/DL library.

Library	Number of Files	Percentage
NumPy	417793	30.38%
PyTorch	272330	19.80%
pandas	59505	4.33%
TensorFlow	52918	3.85%
Matplotlib	20844	1.52%
SciPy	16143	1.17%
scikit-learn	6005	0.44%
Keras	3773	0.27%
NLTK	2970	0.22%
spaCy	1362	0.10%
XGBoost	890	0.06%
Gensim	638	0.05%
LightGBM	324	0.02%

Metric	Average
Import Statement Count	7.64
Function Count	8.29
Class Count	1.34
For-loop Count	2.92
List Comprehension Count	1.04
Try-Except Count	0.46
Lambda Count	0.37
Global Variable Count	0.07
Lines of Code	220.16

Total number of repositories with Python files: 23874
Overall average cyclomatic complexity: 23.88
Overall maximum cyclomatic complexity: 22401
Overall minimum cyclomatic complexity: 1
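The post does not say which tool produced these complexity figures. As a rough illustration of the metric itself, the sketch below computes a simplified file-level cyclomatic complexity: one, plus one for each branching construct. The set of constructs counted is an assumption, and dedicated tools apply their own conventions, so the numbers will not necessarily match those reported here.

```python
import ast

# Assumed set of branching constructs; real tools differ in the details.
DECISION_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
)


def cyclomatic_complexity(source: str) -> int:
    """Simplified file-level cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            complexity += len(node.values) - 1
    return complexity
```

A file with no branches scores 1, which is consistent with the reported minimum. Very large values can arise when a single file holds thousands of branches, for example in generated code.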

Python File Validity Results:

Total Python files: 1415924
Valid Python files: 1375548
Invalid Python files: 40376

Percentage of valid Python files: 97.15%
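The validity criterion is not spelled out in the post. A common approach, assumed in this sketch, is to treat a file as valid if it parses under the running Python 3 interpreter; under that assumption, Python 2 sources that use statements such as `print 'hi'` would be counted as invalid.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the source parses with the current interpreter (assumed criterion)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError, RecursionError, MemoryError):
        return False
    return True


def validity_summary(sources) -> dict:
    """Total, valid, and invalid counts plus the percentage of valid files."""
    total = valid = 0
    for source in sources:
        total += 1
        valid += is_valid_python(source)
    percent = round(100 * valid / total, 2) if total else 0.0
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "percent_valid": percent,
    }
```

The reported percentage is consistent with the counts above: 1,375,548 valid files out of 1,415,924 is 97.15%.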

Top 10 repositories by average cyclomatic complexity:

Repo	Average Complexity	Max Complexity	Min Complexity	Python File Count
psitip	6457.16	10627	53	19
mona	2574.00	2574	2574	1
AutoMunge	2253.00	6757	1	3
angrist	1707.00	1707	1707	1
sma-mir	1313.10	22401	19	21
HIBPool	1240.00	1240	1240	1
DeepNC	1087.86	14753	1	28
GUIcandid	841.00	841	841	1
p4-codel	837.78	7161	1	9
Lenia	811.67	1334	26	6
Limitations and Future Work

While the ArXiv Research Code Dataset offers valuable insights into computational practices in academic research, it is important to acknowledge several limitations of our current approach.

A constraint of our dataset stems from its reliance on ArXiv as the source of research papers. This focus inevitably introduces a bias towards fields that commonly use ArXiv for preprint distribution. Consequently, we miss research code associated with papers published elsewhere. Future work could address this limitation by expanding the scope to include code repositories linked to papers from a broader range of preprint servers and journals.

Our GitHub-centric approach to code collection presents another potential bias. By focusing exclusively on GitHub repositories, we may overlook code hosted on alternative platforms or in private repositories. This limitation could skew our understanding of coding practices, particularly in fields or regions where other code hosting solutions are prevalent. Expanding our data collection to encompass other popular code hosting platforms would mitigate this bias and offer a more holistic view of research code practices.

The current dataset provides a static snapshot of research code at a specific point in time, which limits our ability to capture the dynamic nature of code development in academic research.

Conclusion

The ArXiv Research Code Dataset provides a resource for studying computational practices in academic research. By offering access to a collection of research-oriented code linked to scientific publications, we have created a dataset for examining the role of software in computer science research.

Our analysis provides a first-pass insight into research code, including the distribution of programming languages, structural patterns, and the use of libraries and frameworks across computer science disciplines.

This dataset addresses a gap in existing code collections by focusing specifically on code associated with academic research. We believe that this dataset can be used across a number of applications including LLM pretraining, retrieval augmented generation, code completion, and agentic research applications. Please feel free to provide feedback at info@algorithmicresearchgroup.com

Advancing AI Together

We value the power of collaboration and are actively seeking partnerships with academic institutions, AI research labs, and individual researchers to drive innovation together.

Algorithmic Research Group
5540 Centerview Dr Ste 204 PMB 296182 Raleigh, NC, 27606 US

©2024 Algorithmic Research Group. All Rights Reserved.
