This is the very first issue of the DVC❤️Heartbeat. Every month we will be sharing our news, findings, interesting reads, community takeaways, and everything along the way.
Some of those are related to our brainchild DVC and its journey. The others are a collection of exciting stories and ideas centered around ML best practices and workflow.
We read a ton of articles and posts every day, and here are a few that caught our eye. Well written, offering a different perspective, and definitely worth checking out.
What is becoming clear is that, in the late stage of the hype cycle, data science is asymptotically moving closer to engineering, and the skills that data scientists need moving forward are less visualization and statistics-based, and more in line with traditional computer science curricula.
I want to explore how the degrees of freedom in versioning machine learning systems pose a unique challenge. I’ll identify four key axes on which machine learning systems have a notion of version, along with some brief recommendations for how to simplify this a bit.
…the objective of this post is not to philosophize about the dangers and dark sides of AI. In fact, this post aims to work out common challenges in reproducibility for machine learning and shows programming differences to other areas of Computer Science. Secondly, we will see practices and workflows to create a higher grade of reproducibility in machine learning algorithms.
There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.
We will be sifting through the issues and discussions to share the most interesting takeaways.
There is no separate guide for that, but it is very straightforward. See the DVC file format description for how a DVC-file looks inside in general. All `dvc run` does is compute the `md5` fields in it, that is all. You could write your DVC-file by hand and then run `dvc repro`, which will run the command (if any) and compute all the needed checksums. Read more.
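To make this concrete, here is a minimal sketch of a hand-written DVC-file, with hypothetical file names; the `md5` fields are left out so that `dvc repro` computes them:

```yaml
cmd: python train.py
deps:
- path: data.csv
- path: train.py
outs:
- path: model.pkl
```

Running `dvc repro` on a file like this would execute `python train.py` and fill in the checksums for each dependency and output.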
There’s a ton of code in that project, and it’s very non-trivial to define the code dependencies for my training stage: there are a lot of imports going on, and the training code is distributed across many modules. Read more.
DVC officially supports only regular Azure blob storage. Gen1 Data Lake should be accessible through the same interface, so configuring a regular azure remote for DVC should work. It seems that Gen2 Data Lake has the blob API disabled. If you know more details about the differences between Gen1 and Gen2, feel free to join our community and share this knowledge.
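For reference, configuring a regular azure remote looks roughly like this (a sketch; the container name and connection string are placeholders, and depending on your DVC version the credentials may instead be supplied via the `AZURE_STORAGE_CONNECTION_STRING` environment variable):

```
$ dvc remote add -d myremote azure://my-container/path
$ dvc remote modify myremote connection_string "<connection-string>"
```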
Apache 2.0. One of the most common and permissive OSS licenses.
```
$ dvc remote add upstream s3://my-bucket
$ dvc remote modify upstream region REGION_NAME
$ dvc remote modify upstream endpointurl <url>
```
Find and click the S3 API compatible storage on this page.
It adds the data files there that are tracked by DVC, so that you don’t accidentally add them to Git as well. You can open it with a file editor of your liking and see your data files listed there.
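For example, after tracking a hypothetical file with `dvc add data.csv`, the `.gitignore` that DVC maintains would contain an entry along these lines:

```
/data.csv
```

Git then ignores the data file itself, while the small `data.csv.dvc` file is committed in its place.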
With DVC, you can connect data sources on HDFS with the pipeline in your local project by simply specifying them as external dependencies. For example, let’s say your script `process.cmd` works on an input file on HDFS and then downloads a result to your local workspace. With DVC it could look something like this:
```
$ dvc run -d hdfs://example.com/home/shared/input \
          -d process.cmd \
          -o output \
          process.cmd
```
If you have any questions, concerns, or ideas, let us know here and our stellar team will get back to you in no time.