The past few months have been so busy and full of great events! We love how involved our community is and can’t wait to share more with you:
Devsprints participants on our Discord channel
`brew install dvc`!
Here are some of the great pieces of content around DVC and ML ops that we discovered in October and November:
…building your ML system has a great advantage — it is tailored to your needs. It has all features that are needed in your ML system and can be as complex as you wish. This tutorial is for readers who are familiar with ML and would like to learn how to build ML web services.
In this article, I want to show 3 powerful tools to simplify and scale up machine learning development within an organization by making it easy to track, reproduce, manage, and deploy models.
We do believe that Data Science is a field that can become even more mature by using best practices in project development, and that Conda, Git, DVC, and JupyterLab are key components of this new approach.
DVC is a powerful tool and we covered only the fundamentals of it.
There are lots of hidden gems in our Discord community discussions. Sometimes they are scattered all over the channels and hard to track down.
We are sifting through the issues and discussions and sharing the most interesting takeaways with you.
When you run `dvc import`, you get the state of the data in the original repo at that moment in time, right? The overall state of that repo (e.g. the Git commit id (hash)) is not preserved upon import, right?
On the contrary, DVC relies on the Git commit id (hash) to determine the state of the data as well as the code. The Git commit id (hash) is saved in the DVC-file upon import; the data itself is copied/downloaded into the DVC repo cache, but will not be pushed to the remote (DVC does not create duplicates). There is a command to advance/update it when needed: `dvc update`. The Git commit hash is saved to provide reproducibility: even if the source repo `HEAD` has changed, your import stays the same until you run `dvc update` or redo the import.
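To make the mechanics concrete, here is a sketch of what an import stage records (field names follow recent DVC versions; the repo URL, paths, and revision below are placeholders, not from the question):

```yaml
# DVC-file created by `dvc import` (sketch; checksums omitted)
deps:
- path: data/foo.csv            # path inside the source repo
  repo:
    url: https://github.com/user/source-proj  # hypothetical source repo
    rev_lock: <git-commit-hash>               # the saved Git commit id
outs:
- path: foo.csv                 # the imported copy in this repo
```

`dvc update` re-resolves the saved revision against the source repo's current `HEAD`; until then, the pinned commit keeps the import reproducible.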
Yes, in this sense DVC is not very different from using bare S3, SSH, or any other storage where you can go and just delete data. DVC adds a bit of overhead to locate a specific file to delete, but otherwise it's all the same: you will be able to delete any file you want. Read more details in this discussion.
`foo.png.dvc`: is there a command that will show the remote URL, something like `dvc get-remote-url foo.png.dvc`, which will return e.g. the Azure URL to download?
There is no special command for that, but if you are using Python, you could use our API, which is specifically designed for that:

from dvc.api import get_url

url = get_url(path, repo="https://github.com/user/proj", rev="mybranch")

so you could as well use this from the CLI as a wrapper command.
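A wrapper along these lines could look like the sketch below. Note that the command name `dvc-get-url` and its flags are hypothetical (DVC ships no such command); only `dvc.api.get_url` itself comes from DVC:

```python
# Hypothetical CLI wrapper around dvc.api.get_url.
import argparse

def build_parser():
    """CLI: dvc-get-url PATH [--repo REPO] [--rev REV]"""
    p = argparse.ArgumentParser(prog="dvc-get-url",
                                description="Print the remote storage URL "
                                            "for a DVC-tracked path.")
    p.add_argument("path", help="tracked path, e.g. foo.png")
    p.add_argument("--repo", default=".", help="repo URL or local path")
    p.add_argument("--rev", help="Git revision (branch, tag, or commit)")
    return p

def resolve_url(argv):
    args = build_parser().parse_args(argv)
    # Imported lazily so argument parsing works without DVC installed.
    from dvc.api import get_url
    return get_url(args.path, repo=args.repo, rev=args.rev)

# Example (requires an actual DVC repo to resolve against):
# print(resolve_url(["foo.png",
#                    "--repo", "https://github.com/user/proj",
#                    "--rev", "mybranch"]))
```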
Short answer: no (as of the date of publishing this Heartbeat issue). Good news: it should be very easy to add, so we would welcome a contribution :) Azure has a connection argument for AD, and quick googling shows this library, which is probably what's needed.
When installing with the plain `.pkg` it is a bit tricky to uninstall, so we usually recommend using things like `brew cask` instead if you really need the binary package. To uninstall the package, try running these commands:

$ sudo rm -rf /usr/local/bin/dvc
$ sudo rm -rf /usr/local/lib/dvc
$ sudo pkgutil --forget com.iterative.dvc
Yes, you should use the `--local` or `--global` config options to set the user per project or per machine without sharing (committing) them to Git:

$ dvc remote modify myremote --local user myuser
$ dvc remote modify myremote --global user myuser
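For context: with `--local`, the value is written to `.dvc/config.local`, which DVC keeps out of Git. A sketch of the resulting file (the remote name and user follow the example above):

```ini
# .dvc/config.local -- not committed to Git
['remote "myremote"']
    user = myuser
```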
A simple environment variable like this:

$ export AWS_CA_BUNDLE=/path/to/cert/cert.crt
$ dvc push

should do the trick for now; we plan to fix the `ca_bundle` option soon.
I ran `dvc repro` and I'm happy with the result. However, I realized that I didn't specify a dependency which I needed (and which is obviously used in the computation). Can I somehow fix it? I'd like to add the dependency to the stage file without rerunning/reproducing the stage: that shouldn't be needed, since this additional dependency hasn't changed.
You would need to edit the DVC-file: add the new dependency in the `deps` section, then run `dvc commit file.dvc` to save the changes w/o running the pipeline again. See an example.
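As an illustration (the file and paths below are hypothetical, not from the question), the edited `deps` section might look like this before running `dvc commit`:

```yaml
deps:
- path: src/train.py      # existing dependency
  md5: <checksum>
- path: config.json       # newly added dependency;
                          # `dvc commit` computes its checksum
```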
`dvc push -r upstream`, as opposed to `dvc push` (mind no additional arguments).
You can mark a remote as the "default" one:

$ dvc remote add -d remote /path/to/my/main/remote

and `dvc push` (as well as other commands like `dvc pull`) will know to use the default remote.
No, at least not at the time of publishing this. You could use a phony output though: e.g. make stage A output some dummy file and make stage B depend on it. Please consider creating or upvoting a relevant issue on our GitHub if you'd like this to be implemented.