We are excited to share that our paper, "Git is for Data," was accepted to CIDR 2023! CIDR is the premier conference for practical data systems research, and Yucheng presented this work in Amsterdam this past week.
In this post, I want to share the paper's abstract and conclusions, along with some of its key figures and results. I encourage you to read the full paper for context, background, and details.
Dataset management is one of the greatest challenges to the application of machine learning (ML) in industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility.
In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated in existing pipelines, facilitating adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub: a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.
At first glance, we integrate with Git in a manner comparable to Git LFS. However, the core differentiation is the holistic set of tooling XetHub provides to fully support the needs of ML datasets by fully embracing the use of software engineering practices for data.
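For readers unfamiliar with how Git LFS hooks into Git: it replaces each large file in the repository with a small text pointer and stores the actual bytes elsewhere. The sketch below builds such a pointer in Python, following the pointer layout published in the Git LFS specification. It only illustrates the Git LFS side of the comparison; the function names are made up for this post, and nothing here describes XetHub's own format or implementation.

```python
import hashlib

# Version line defined by the Git LFS pointer specification.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Return the text of a Git LFS-style pointer for the given blob.

    Hypothetical helper for illustration; the real Git LFS client
    produces this text through a Git clean filter.
    """
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer into its key/value fields (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


if __name__ == "__main__":
    blob = b"x" * 1_000_000  # stand-in for a large dataset file
    pointer = make_lfs_pointer(blob)
    print(pointer)
    print(f"pointer is {len(pointer)} bytes for a {len(blob)}-byte blob")
```

Git itself only ever versions the small pointer, which is why this style of integration keeps the familiar Git workflow while the large content lives in separate storage.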
We believe that with the right architecture design, pre-existing systems for source control can be extended to fully support the dataset use case, addressing a significant fraction of dataset management needs while minimizing cognitive friction.
The significance is the observation that the needs around dataset management are not unique, and have been addressed by source code management tools. What is unique is only the scale at which it happens. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source control, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.
Again, read the full paper here and get started with XetHub today!