March '20 DVC❤️Heartbeat

Every month we share news, findings, interesting reads, community takeaways, and everything else along the way.

Look here for updates about DVC, our journey as a startup, projects by our users and big ideas about best practices in ML and data science.

Welcome to the March Heartbeat! Here are some highlights from our team and community this past month:

News

DVC is STILL growing! In February, Senior Software Engineer Guro Bokum joined DVC. He’s previously contributed to the core DVC code base and brings several years of full-stack engineering expertise to the team. Welcome, Guro!

New feature alert. We’ve received many requests for monorepo support in DVC. As of DVC release 0.87.0, users can version data science projects within a monorepo! The new dvc init --subdir functionality is designed to allow multiple DVC repositories within a single Git repository. Don’t forget to upgrade and check out the latest docs.

From the community

First, there’s an intriguing discussion evolving in the DVC repo about how machine learning hyperparameters (such as learning rate, number of layers in a deep neural network, etc.) can be tracked. Right now, hyperparameters are tracked as source code (i.e., with Git). Could we use some kind of abstraction to separate hyperparameters from source code in a DVC-managed project? Read on and feel free to jump into this discussion, largely helmed by software developer and DVC contributor Helge Munk Jacobsen.
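To make the question concrete, here is a minimal sketch of one possible separation. It is not DVC functionality or the approach proposed in the discussion: hyperparameters live in a small JSON file that the training script reads at run time. The file name `params.json`, the helper `load_params`, the stand-in `train` function, and the parameter names and values are all hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the names and values are illustrative only.
DEFAULT_PARAMS = {"learning_rate": 0.01, "num_layers": 3}


def load_params(path):
    """Read hyperparameters from a JSON file, falling back to the defaults."""
    params = dict(DEFAULT_PARAMS)
    params.update(json.loads(Path(path).read_text()))
    return params


def train(params):
    """Stand-in for a real training routine; it only echoes its settings."""
    return "training {num_layers} layers at lr={learning_rate}".format(**params)


# Write an example parameters file to a temporary directory and use it.
with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "params.json"
    params_file.write_text(json.dumps({"learning_rate": 0.001}))
    params = load_params(params_file)
    summary = train(params)

print(summary)
```

With this layout, changing a hyperparameter means editing only the parameters file, so the training code itself stays untouched between experiments.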

Another discussion we appreciated happened on Twitter:

Thanks, @cyberomin!

Elsewhere on the internet, DVC made the cut in a much-shared blog, Five Interesting Data Engineering Projects by Dmitry Ryaboy (VP of Engineering at biotech startup Zymergen, and formerly Twitter). Dmitry wrote:

To be honest, I’m a bit of a skeptic on “git for data” and various automated data / workflow versioning schemes: various approaches I’ve seen in the past were either too partial to be useful, or required too drastic a change in how data scientists worked to get a realistic chance at adoption. So I ignored, or even explicitly avoided, checking DVC out as the buzz grew. I’ve finally checked it out and… it looks like maybe this has legs? Metrics tied to branches / versions are a great feature. Tying the idea of git-like branches to training multiple models makes the value prop clear. The implementation, using Git for code and datafile index storage, while leveraging scalable data stores for data, and trying to reduce overall storage cost by being clever about reuse, looks sane. A lot of what they have to say in https://dvc.org/doc/understanding-dvc rings true.

Check out the full blog here:

Five Interesting Data Engineering Projects

There’s been a lot of activity in the data engineering world lately, and a ton of really interesting projects and ideas have come on the scene in the past few years. This post is an introduction to (just) five that I think a data engineer who wants to stay current needs to know about.

One of the areas that DVC is growing into is continuous integration and continuous deployment (CI/CD), a part of the nascent field of MLOps. Naturally, we were thrilled to discover that CI/CD with DVC is taught in a new Packt book, “Learn Python by Building Data Science Applications” by David Katz and Philipp Kats.

In the authors’ words, the goal of this book is to teach data scientists and engineers “not only how to implement Python in data science projects, but also how to maintain and design them to meet high programming standards.” Needless to say, we are considering starting a book club. Grab a copy here:

Learn Python by Building Data Science Applications

Understand the constructs of the Python programming language and use them to build data science projects

Last year in Mexico, DVC contributor Ramón Valles gave a talk about reproducible machine learning workflows at Data Day Monterrey—and a video of his presentation is now online! In this Spanish-language talk, Ramón gives a thorough look at DVC, particularly building pipelines for reproducible ML.

Experimentación ágil de machine learning con DVC

Data Day Monterrey '19

Finally, DVC data scientist Elle (that’s me!) released a new public dataset of posts from the Reddit forum r/AmItheAsshole, and reported some preliminary analyses. We’re inviting anyone and everyone to play with the data, make some hypotheses and share their findings. Check it out here:

AITA for making this? A public dataset of Reddit posts about moral dilemmas

Delve into an open natural language dataset of posts about moral dilemmas from r/AmItheAsshole. Use this dataset for whatever you want. Here's how to get it and start playing.

That’s all for now—thanks for reading, and be in touch on our GitHub, Twitter, and Discord channel.
