March ’19 DVC❤️Heartbeat

Svetlana Grinchenko
March 05, 2019 • 3 min read

This is the very first issue of the DVC❤️Heartbeat. Every month we will be sharing our news, findings, interesting reads, community takeaways, and everything along the way.

Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.

News and links

We read a ton of articles and posts every day, and here are a few that caught our eye. They are well written, offer a different perspective, and are definitely worth checking out.

Data science is different now by Vicki Boykis

veekaybee.github.io

What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the skills that data scientists need moving forward are less visualization and statistics-based, and more in line with traditional computer science curricula.

Data Versioning by Emily F. Gorcenski

emilygorcenski.com

I want to explore how the degrees of freedom in versioning machine learning systems poses a unique challenge. I’ll identify four key axes on which machine learning systems have a notion of version, along with some brief recommendations for how to simplify this a bit.

Reproducibility in Machine Learning by Pascal Fecht

blog.mi.hdm-stuttgart.de

…the objective of this post is not to philosophize about the dangers and dark sides of AI. In fact, this post aims to work out common challenges in reproducibility for machine learning and shows programming differences to other areas of Computer Science. Secondly, we will see practices and workflows to create a higher grade of reproducibility in machine learning algorithms.

Discord gems

There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.

We will be sifting through the issues and discussions and sharing the most interesting takeaways.

Q: Edit and define DVC files manually, in a Makefile style

There is no separate guide for that, but it is very straightforward. See the DVC file format description for what a DVC-file looks like inside. All that dvc add or dvc run does is compute the md5 fields in it. You can write your own DVC-file and then run dvc repro, which will run the command (if any) and compute all the needed checksums. Read more.
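As a rough sketch of the manual approach, the snippet below writes a stage file by hand. The file name train.dvc, the script train.py, and the paths data.csv and model.pkl are all hypothetical, and the layout assumes the cmd/deps/outs fields of the DVC-file format in use at the time of writing. The md5 fields are left out on purpose.

```shell
# Write a DVC-file by hand (hypothetical names; no md5 fields yet).
cat > train.dvc <<'EOF'
cmd: python train.py data.csv model.pkl
deps:
- path: train.py
- path: data.csv
outs:
- path: model.pkl
EOF

# Inspect what was written before handing it over to DVC.
cat train.dvc
```

Running dvc repro train.dvc in an initialized DVC project should then execute the command and record the checksums in the file.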

Q: Best practices to define the code dependencies

There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage: there are a lot of imports going on, and the training code is distributed across many modules. Read more.

Q: Azure data lake support

DVC officially supports only regular Azure blob storage. Gen1 Data Lake should be accessible through the same interface, so configuring a regular azure remote for DVC should work. It seems that Gen2 Data Lake has the blob API disabled. If you know more details about the difference between Gen1 and Gen2, feel free to join our community and share this knowledge.

Q: What licence is DVC released under?

Apache 2.0. One of the most common and permissive OSS licences.

Q: Setting up an S3-compatible remote

(LocalStack, Wasabi)

$ dvc remote add upstream s3://my-bucket
$ dvc remote modify upstream region REGION_NAME
$ dvc remote modify upstream endpointurl <url>

For more details, find and click the S3 API-compatible storage section on this page.

Q: Why does DVC create and update the `.gitignore` file?

DVC adds the data files it tracks to `.gitignore` so that you don’t accidentally add them to git as well. You can open the file in any editor you like and see your data files listed there.
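One way to check this yourself is to ask git which rule ignores a given file. The sketch below does not call DVC at all: it simulates the `.gitignore` entry by hand in a throwaway repository. The directory name gitignore-demo, the file name data.xml, and the exact pattern written to `.gitignore` are assumptions made for illustration.

```shell
# Throwaway repository in a subdirectory (hypothetical name).
mkdir -p gitignore-demo
git init -q gitignore-demo

# Stand-in for what `dvc add data.xml` does to .gitignore (assumed pattern).
echo "sample" > gitignore-demo/data.xml
echo "/data.xml" >> gitignore-demo/.gitignore

# Ask git which rule, if any, ignores the data file.
git -C gitignore-demo check-ignore -v data.xml
```

In a real project, the same git check-ignore call can be pointed at any file tracked by DVC to confirm that git will leave it alone.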

Q: Managing data and pipelines with DVC on HDFS

With DVC, you can connect data sources on HDFS to the pipeline in your local project by simply specifying them as external dependencies. For example, let’s say your script process.cmd works on an input file on HDFS and then downloads the result to your local workspace. With DVC, it could look something like this:

$ dvc run -d hdfs://example.com/home/shared/input \
          -d process.cmd \
          -o output process.cmd