We're excited to announce the release of ArXivDLInstruct, a new open-source dataset designed for instruction tuning on Python research code. The dataset addresses the scarcity of high-quality, deep-learning-specific datasets for pretraining and fine-tuning language models on code generation tasks.
ArXivDLInstruct comprises 778,152 functions extracted from research code published on arXiv. Each entry in the dataset includes a detailed prompt for generating the function, a short description of the function, and the actual function code. With a total size of 2.26 GB of prompt-response pairs, ArXivDLInstruct offers a rich resource for various applications, including instruction tuning of language models, Retrieval-Augmented Generation (RAG), code completion tasks, and development of R&D coding agents.
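To get a feel for the data, here is a minimal loading-and-inspection sketch using the Hugging Face `datasets` library. The `train` split name is an assumption; printing the column names will reveal the dataset's actual schema.

```python
# Minimal sketch: load ArXivDLInstruct and inspect one entry.
# Assumes the `datasets` library is installed (pip install datasets)
# and that the dataset exposes a "train" split.
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ArXivDLInstruct", split="train")

print(ds.column_names)  # show the actual field names in the dataset
entry = ds[0]
for key, value in entry.items():
    print(f"{key}: {str(value)[:200]}")  # truncate long fields for readability
```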

The ArXivDLInstruct dataset was created through a multi-step pipeline. We began by parsing GitHub links from all arXiv papers, then downloaded the linked repositories and parsed them into functions and classes. Next, we filtered for usage of machine learning and deep learning libraries. Finally, we generated prompts from the ground-truth code using GPT-4o-mini.
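For readers curious about the filtering step, below is a hedged sketch of one common way to detect machine learning library usage via Python's `ast` module. It illustrates the general technique rather than reproducing our pipeline's exact filter, and the library list is purely illustrative.

```python
# Illustrative sketch of an import-based ML/DL filter, not the pipeline's
# actual implementation. It flags Python source that imports common
# deep learning libraries.
import ast

ML_LIBRARIES = {"torch", "tensorflow", "jax", "keras", "sklearn"}

def uses_ml_library(source: str) -> bool:
    """Return True if the source imports any known ML/DL library."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # skip files that don't parse as valid Python
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Match the top-level package of "import torch.nn as nn" etc.
            if any(alias.name.split(".")[0] in ML_LIBRARIES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in ML_LIBRARIES:
                return True
    return False

print(uses_ml_library("import torch\n"))  # True
print(uses_ml_library("import json\n"))   # False
```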

The ArXivDLInstruct dataset is now available on Hugging Face at https://huggingface.co/datasets/AlgorithmicResearchGroup/ArXivDLInstruct. Researchers interested in the intermediate datasets from our pipeline can find them at https://huggingface.co/AlgorithmicResearchGroup.
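To illustrate the instruction-tuning use case, the sketch below formats entries into simple prompt-response training text. The field names (`prompt`, `function`) and the template are assumptions; a real fine-tuning setup would use the target model's own chat template.

```python
# Minimal sketch of turning dataset entries into instruction-tuning text.
# The field names ("prompt", "function") and the template are assumptions;
# a real setup should use the target model's chat template instead.
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ArXivDLInstruct", split="train")

def to_training_text(entry: dict) -> str:
    """Concatenate the instruction and ground-truth code into one string."""
    return (
        f"### Instruction:\n{entry['prompt']}\n\n"
        f"### Response:\n{entry['function']}"
    )

# Preview the first few formatted examples
for entry in ds.select(range(3)):
    print(to_training_text(entry)[:300], "\n---")
```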
We believe ArXivDLInstruct will be a valuable resource for the AI and machine learning community, particularly for those working on code generation and understanding in the context of scientific research. We encourage researchers and developers to explore the dataset and leverage it in their work.
We look forward to seeing the innovative applications and advancements that this dataset will enable in the field of AI-assisted coding and research.