We're excited to announce the release of ArXivDLInstruct, a new open-source dataset designed for instruction tuning on Python research code. The dataset addresses the scarcity of high-quality, deep-learning-specific datasets for pretraining and fine-tuning language models on code generation tasks.
ArXivDLInstruct comprises 778,152 functions extracted from research code published on arXiv. Each entry in the dataset includes a detailed prompt for generating the function, a short description of the function, and the actual function code. With a total size of 2.26 GB of prompt-response pairs, ArXivDLInstruct offers a rich resource for various applications, including instruction tuning of language models, Retrieval-Augmented Generation (RAG), code completion tasks, and development of R&D coding agents.
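To get a feel for the data, here is a minimal loading-and-inspection sketch using the Hugging Face `datasets` library. The `train` split name is an assumption; printing the column names will reveal the dataset's actual schema.

```python
# Minimal sketch: load ArXivDLInstruct and inspect one entry.
# Assumes the `datasets` library is installed (pip install datasets)
# and that the dataset exposes a "train" split.
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ArXivDLInstruct", split="train")

print(ds.column_names)  # show the actual field names in the dataset
entry = ds[0]
for key, value in entry.items():
    print(f"{key}: {str(value)[:200]}")  # truncate long fields for readability
```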

The ArXivDLInstruct dataset was created through a multi-step pipeline. We began by parsing GitHub links from all arXiv papers, then downloaded the linked repositories and parsed them into functions and classes. Next, we filtered for usage of machine learning and deep learning libraries. Finally, we generated prompts from the ground-truth code using GPT-4o-mini.
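For readers curious about the filtering step, below is a hedged sketch of one common way to detect machine learning library usage via Python's `ast` module. It illustrates the general technique rather than reproducing our pipeline's exact filter, and the library list is purely illustrative.

```python
# Illustrative sketch of an import-based ML/DL filter, not the pipeline's
# actual implementation. It flags Python source that imports common
# deep learning libraries.
import ast

ML_LIBRARIES = {"torch", "tensorflow", "jax", "keras", "sklearn"}

def uses_ml_library(source: str) -> bool:
    """Return True if the source imports any known ML/DL library."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # skip files that don't parse as valid Python
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Match the top-level package of "import torch.nn as nn" etc.
            if any(alias.name.split(".")[0] in ML_LIBRARIES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in ML_LIBRARIES:
                return True
    return False

print(uses_ml_library("import torch\n"))  # True
print(uses_ml_library("import json\n"))   # False
```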

The ArXivDLInstruct dataset is now available on Hugging Face at https://huggingface.co/datasets/AlgorithmicResearchGroup/ArXivDLInstruct. Researchers interested in the intermediate datasets from our pipeline can find them at https://huggingface.co/AlgorithmicResearchGroup.
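To illustrate the instruction-tuning use case, the sketch below formats entries into simple prompt-response training text. The field names (`prompt`, `function`) and the template are assumptions; a real fine-tuning setup would use the target model's own chat template.

```python
# Minimal sketch of turning dataset entries into instruction-tuning text.
# The field names ("prompt", "function") and the template are assumptions;
# a real setup should use the target model's chat template instead.
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ArXivDLInstruct", split="train")

def to_training_text(entry: dict) -> str:
    """Concatenate the instruction and ground-truth code into one string."""
    return (
        f"### Instruction:\n{entry['prompt']}\n\n"
        f"### Response:\n{entry['function']}"
    )

# Preview the first few formatted examples
for entry in ds.select(range(3)):
    print(to_training_text(entry)[:300], "\n---")
```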
We believe ArXivDLInstruct will be a valuable resource for the AI and machine learning community, particularly for those working on code generation and understanding in the context of scientific research. We encourage researchers and developers to explore the dataset and leverage it in their work.
We look forward to seeing the innovative applications and advancements that this dataset will enable in the field of AI-assisted coding and research.