Papers with Code partners with arXiv for open-access machine learning datasets

More transparency for faster scientific progress

May 17, 2021


Two groups focused on sharing free and open scientific research are partnering on artificial intelligence.

Machine learning articles published on the paper repository arXiv will now feature a Code & Data tab from Papers with Code, which links to datasets that are used or introduced in the paper.

The announcement follows a similar collaboration last October, when arXiv added Papers with Code's community code support.

More data on datasets

Pronounced "archive," arXiv began in 1991, in the early days of the Internet, as a central repository for Los Alamos National Laboratory research. It was developed by Paul Ginsparg.

It continued to grow, launching on the World Wide Web in 1993, and leaving the confines of the Department of Energy in 2001.

The site has helped lead a revolution in open access to scientific research, which has historically been kept behind paywalls, with the money going to a few large corporations rather than to the researchers themselves.

For his contribution to science, Ginsparg was awarded a MacArthur Fellowship, also known as a genius grant.

The site does not peer-review papers, although there are moderators who maintain a level of order. Many of the papers are also submitted to scientific journals, where they are peer-reviewed.

Fellow open science group Papers with Code is focused specifically on machine learning, although it has begun to allow the broader scientific community to share code on its site as AI becomes a part of every field.

The organization enables researchers to share the code or data used in a paper - hence the name. In 2020, the group claimed that just 15 percent of AI research papers published their code.

Now, both groups have combined their respective strengths to allow arXiv papers to include relevant links to code and datasets.

"This makes it much easier to track dataset usage across the community and quickly find other papers using the same dataset," Papers with Code co-founder and Facebook AI employee Robert Stojnic said.

"From Papers with Code you can discover other papers using the same dataset, track usage over time, compare models and find similar datasets."

Datasets are a core facet of machine learning, with a model's functionality defined by the data it is trained on. Models also reflect the quality and size of their datasets.

"An indexed map of datasets accelerates progress by bringing transparency to results and usage," Stojnic said.

"These insights shape future dataset development: when more challenging datasets are required to evaluate models, or when existing datasets become saturated in usage."

"Members of our community want to contribute tools that enhance the arXiv experience, and we value that kind of community engagement," arXiv executive director Eleonora Presani said.

Along with Stojnic, the core Papers with Code team is part of Facebook AI, but the group claims that it is independent and that no data is shared with the social media and advertising titan.

About the Author(s)

Sebastian Moss
