February ’20 DVC❤️Heartbeat

Every month we share news, findings, interesting reads, community takeaways, and everything else along the way. Look here for updates about DVC, our journey as a startup, projects by our users and big ideas about best practices in ML and data science.

Elle O'Brien
February 10, 2020 • 3 min read

Just in time for Valentine's Day, here's a seasonally relevant DVC pipeline.

Welcome to the February Heartbeat! This month's featured image is a DVC pipeline created by one of our users, which we think resembles a valentine. Here are some more highlights from our team and our community:

News

Our team is growing! In early January, DVC gained two new folks: engineer Saugat Pachhai and data scientist Elle O'Brien. Saugat, based in Nepal, will be contributing to core DVC. Elle (that's me!), currently in San Francisco, will be leading data science projects and outreach with DVC.

We're gearing up for a spring full of talks about DVC projects, including new up-and-coming features for data cataloging and continuous integration. Here are just a few events that have been added to our schedule:

- Elle O'Brien was recently accepted to give a keynote at Women in Data Science San Diego on May 9. The talk is called "Packaging data and machine learning models for sharing."

- Elle will also be speaking at Div Ops, a new online conference about (you guessed it) DevOps, on March 27.

Look out for more conference announcements soon on our brand-new community page! We've just launched a new hub for sharing events, goings-on, and ways to contribute to DVC.

From the community

Our users continue to put awesome things on the internet. Like this AI blogger who isn't afraid to wear his heart on his sleeve.

My favorite data science tool is DVC - Data Version Control

by Musa Atlıhan

medium.com


Musa Atlıhan writes:

From my experience, whether it is a real-world data science project or it is a data science competition, there are two major key components for success. Those components are API simplicity and reproducible pipelines. Since data science means experimenting a lot in a limited time frame, first, we need machine learning tools with simplicity and second, we need reliable/reproducible machine learning pipelines. Thanks to tools like Keras, LightGBM, and fastai we already have simple yet powerful tools for rapid model development. And thanks to DVC, we are building large projects with reproducible pipelines very easily.

It's cool how Musa puts DVC in context with libraries for model building. In a way, the libraries that have made it easier than ever to iterate through different model architectures have increased the need for reproducibility in proportion.

Meanwhile in Germany, superusers Marcel Mikl and Bert Besser wrote another seriously comprehensive article about DVC for Codecentric. Marcel and Bert walk readers through the steps to build a custom machine learning training pipeline with remote computing resources like GCP and AWS. It's an excellent guide to configuring model training with attention to automation and collaboration. We give them 🦉🦉🦉🦉🦉 out of 5.

Remote training with GitLab-CI and DVC

by Marcel Mikl and Bert Besser

blog.codecentric.de

Here are a few more stories on our radar:

AI Singapore shares their method for AI development and deployment. This blog about how Agile informs their processes for continuous integration and delivery includes data versioning.
Toucan AI dispenses advice for ML engineers. This blog for practitioners discusses questions like, "When to work on ML vs. the processes that surround ML". It covers how DVC is used for model versioning in the exploration stage of ML.
DVC at the University. A recent pre-print from natural language processing researchers at Université Laval explains how DVC facilitated dataset access for collaborators.

"In our case, the original dataset takes up to 6 Gigabytes. The previous way of retrieving the dataset over the network with a standard 20 Mbits/sec internet connexion took up to an hour to complete (including uncompressing the data). Using DVC reduced the retrieval time of the dataset to 3 minutes over the network with the same internet connexion."
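For the curious, here's a quick back-of-the-envelope check on those numbers. It's only a sketch: we're assuming decimal units (1 gigabyte = 8,000 megabits) and a link that sustains its full nominal speed, neither of which the pre-print states, and the function name is our own.

```python
# Rough sanity check of the retrieval times quoted above.
# Assumptions (ours, not the authors'): decimal units, so 1 GB = 8,000 Mbit,
# and a connection that sustains its nominal 20 Mbit/s the whole time.

DATASET_GB = 6          # dataset size from the quote
LINK_MBIT_PER_S = 20    # connection speed from the quote
OLD_MINUTES = 60        # "up to an hour", including uncompressing
DVC_MINUTES = 3         # retrieval time reported with DVC


def ideal_transfer_minutes(size_gb, link_mbit_per_s):
    """Minutes needed to move size_gb gigabytes, ignoring all overhead."""
    return size_gb * 8000 / link_mbit_per_s / 60


raw = ideal_transfer_minutes(DATASET_GB, LINK_MBIT_PER_S)
speedup = OLD_MINUTES / DVC_MINUTES

print(f"Ideal raw transfer: {raw:.0f} minutes")
print(f"Reported speedup: about {speedup:.0f}x")
```

Under those assumptions, the raw transfer alone accounts for roughly 40 minutes of the original hour, and the reported improvement works out to about 20x.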

Thanks for sharing; this is a lovely result. Oh, and last…
DVC is a job requirement! We celebrated a small milestone when we stumbled across a listing for a data engineer to support R&D at Elvie, a maker of tech for women's health (pretty neat mission). The decorations on the job posting are ours 😎