Autumn is a great season for new beginnings and there is so much we love about it this year. Here are some of the highlights:
Co-hosting our first ever meetup! Our own Dmitry Petrov partnered with Dan Fischetti from Standard Cognition to discuss open-source tools for version controlling machine learning models and experiments. The recording is available here.
Seeing Dmitry Petrov being really happy one day:
.@martinfowler's books and his website were always the source of programming wisdom 💎 His Refactoring book is the first book I recommend to developers.— Dmitry Petrov (@FullStackML) September 5, 2019
Now they write about ML lifecycle and automation. I’m especially excited because they use @DVCorg that we’ve created. https://t.co/HwswZqjOsb
We at DVC.org are so happy every time we discover an article featuring DVC or addressing one of the burning ML issues we are trying to solve. Here are some of the links that caught our eye this past month:
As Machine Learning techniques continue to evolve and perform more complex tasks, so is evolving our knowledge of how to manage and deliver such applications to production. By bringing and extending the principles and practices from Continuous Delivery, we can better manage the risks of releasing changes to Machine Learning applications in a safe and reliable way.
So, the first question is clear: how to choose the optimal hardware for neural networks? Secondly, assuming that we have the appropriate infrastructure, how to build the machine learning ecosystem to train our models efficiently and not die trying? At Signaturit, we have the solution ;)
My talk will focus on Version Control Systems (VCS) for big-data projects. With the advent of Machine Learning (ML), development teams find it increasingly difficult to manage and collaborate on projects that deal with huge amounts of data and ML models apart from just source code.
Seeing a need for reproducibility in deep learning experiments, Lukas founded Weights & Biases. In this episode we discuss his experiment tracking tool, how it works, the components that make it unique in the ML marketplace and the open, collaborative culture that Lukas promotes. Listen to Lukas delve into how he got his start in deep learning experiments, what his experiment tracking used to look like, the current Weights & Biases business success strategy, and what his team is working on today.
There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered across the channels and hard to track down.
We sift through the questions and discussions to share the most interesting takeaways with you.
I ran a dvc run step, and realised I forgot to declare an output file. Is there a way to add an output file without rerunning the (computationally expensive) step/stage?
If you’ve already run it, you can just open the created DVC-file with an editor and add an entry to the outs field. After that, run dvc commit my.dvc and it will save the checksums and data without re-running your command.
dvc run --no-exec followed by dvc commit would also work instead of modifying the DVC-file by hand.
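As a minimal sketch (the stage file my.dvc and the file forgotten.txt are hypothetical names for illustration):

```shell
# my.dvc after hand-editing: declare the missing file under "outs"
#
#   cmd: python train.py
#   deps:
#   - path: data.csv
#   outs:
#   - path: model.pkl
#   - path: forgotten.txt    # newly added entry
#
# Then save checksums and cache the data without re-running the command:
dvc commit my.dvc
```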
Any file that is under DVC control (e.g. added with dvc add, or an output in dvc run -o) can be made a metric file with dvc metrics add file. Alternatively, the command dvc run -M file makes file a metric without caching it. It means dvc metrics show can be used while file is still versioned by Git.
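For example (the script and metrics file names here are hypothetical):

```shell
# Produce a metrics file without caching it, so Git keeps versioning it:
dvc run -M metrics.json python evaluate.py

# Display metric values across the project:
dvc metrics show
```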
There are two options — use the AZURE_STORAGE_CONNECTION_STRING environment variable, or use the --local flag, which will put the setting into .dvc/config.local. That file is added to .gitignore, so you don’t track it with Git and won’t expose secrets.
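A sketch of both options (the remote name myremote is hypothetical; keep your real connection string out of Git):

```shell
# Option 1: environment variable, never written to any file
export AZURE_STORAGE_CONNECTION_STRING="<your-connection-string>"

# Option 2: store it in .dvc/config.local, which is gitignored
dvc remote modify --local myremote connection_string "<your-connection-string>"
```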
Yes, you are probably looking for external dependencies and outputs. This is the link to the documentation to start.
Using NAS (e.g. NFS) is a very common scenario for DVC. In short, you use dvc cache dir to set up an external cache. Set the cache type to use symlinks and enable protected mode. We are preparing a document on how to set up NFS as a shared cache, but the same approach should apply to any NAS.
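A minimal sketch of that setup, assuming the NAS is mounted at a hypothetical path /mnt/nas:

```shell
# Point the cache at a directory on the NAS:
dvc cache dir /mnt/nas/dvc-cache

# Link files from the cache into the workspace instead of copying:
dvc config cache.type symlink

# Make linked files read-only so the shared cache is not corrupted:
dvc config cache.protected true
```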
Yes, it will! Here is some clarification. With those settings, dvc add data will move data to your cache and then create a hardlink from your cache to your workspace.
Unless your cache directory and your workspace are on different file systems, the move should be instant. Please find more information here.
DVC uses a lock file to prevent running two commands at the same time. The lock file is under the .dvc directory. If no DVC commands are running and you are still getting this error, it’s safe to remove the lock file manually to resolve the issue.
When using DVC, in most cases we assume that your data will be somewhere under the project root. There is an option to use so-called external dependencies, which is for data that is usually too big to be stored under your project root, but if you operate on data of some reasonable size, I would recommend starting with putting data somewhere under the project root. Remotes are usually places where you store your data, but it is DVC’s task to move your data around. If you want to keep your current setup, where the data lives in a different place than your project, you will need to refer to it with full paths. So, for example:
Say your project root is /home/gabriel/myproject and you have initialized DVC and Git repositories there. You have featurize.py in your project dir and want to use your data to produce some features, and then train.py to train a model. You would run:
```shell
$ dvc run -d /research_data/myproject/videos \
          -o /research_data/myproject/features \
          python featurize.py
```
to tell DVC that you use /research_data/myproject/videos to featurize, and produce output to your features dir. Note that your code should be aware of those paths; they can be hardcoded inside featurize.py. The point of dvc run is just to tell DVC which artifacts belong to the currently defined step of the ML pipeline.
When I use the du command to check how much space my DVC project consumes, I see that it duplicates/copies data. It’s very space- and time-consuming to copy large data files; is there a way to avoid that? It takes too long to add large files to DVC.
Yes! You don’t have to copy files with DVC. There are two reasons why du can show that it takes double the space to store data under DVC control. First, du can be inaccurate when the underlying file system supports reflinks (XFS on Linux, APFS on Mac, etc.). This is actually the best scenario, since no copying is happening and no changes are required to any DVC settings. The second case means that copy semantics are used by default. That can be turned off by setting the cache type to hardlink. Please read more on this here.
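A quick sketch of both checks (the data directory name is hypothetical):

```shell
# On reflink file systems, du may double-count even though no copy exists:
du -sh data .dvc/cache

# Otherwise, switch the cache to hardlinks so dvc add links instead of copying:
dvc config cache.type hardlink
```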
Just removing the corresponding DVC-file and running dvc gc after that should be enough. It’ll stop tracking the data file and clean the local cache that might still contain it. Note! Don’t forget to run dvc unprotect first if you use an advanced DVC setup with symlinks or hardlinks (the cache.type config option is not default). If dvc gc behavior is not granular enough, you can manually find the file by its checksum (listed in the DVC-file) in .dvc/cache and in remote storage. Learn here how they are organized.
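A minimal sketch of those steps, assuming the tracked file is called data.csv (a hypothetical name):

```shell
# Only needed if cache.type uses symlinks or hardlinks:
dvc unprotect data.csv

# Stop tracking the file by deleting its DVC-file:
rm data.csv.dvc

# Remove now-unreferenced data from the local cache:
dvc gc
```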
Yes. In this sense DVC is not very different from using bare S3, SSH, or any other storage where you can go and just delete data. DVC adds a bit of overhead to locate the specific file to delete, but otherwise it’s all the same: you will be able to delete any file you want. See more details here on how you can retrospectively edit directories under DVC control.