March '20 Community Gems

Look here every month for great discussions and technical Q&A's from our users and core development team.

Discord gems

Here are some Q&A’s from our Discord channel that we think are worth sharing.

Q: I have several simulations organized with Git tags. I know I can compare the metrics with dvc metrics diff [a_rev] [b_rev], substituting hashes, branches, or tags for [a_rev] and [b_rev]. But what if I wanted to see the metrics for a list of tags?

DVC has a built in function for this! You can use dvc metrics show with the -T option:

$ dvc metrics show -T

to list the metrics for all tagged experiments.

Also, we have a couple of relevant discussions going on in our GitHub repo about handling experiments and hyperparameter tuning. Feel free to join the discussion and let us know what kind of support would help you most.

Q: Is there a recommended way to save metadata about the data in a .dvc file? In particular, I’d like to save summary statistics (e.g., mean, minimum, and maximum) about my data.

One simple way to keep metadata in a .dvc file is by using the meta field. Each meta entry is a key:value pair (for example, name: Jean-Luc). The meta field can be manually added or written programmatically, but note that if the .dvc file is overwritten (perhaps by dvc run, dvc add, or dvc import) these values will not be preserved. You can read more about this in our docs.

Another approach would be to track the statistics of your dataset in a metric file, just as you might track performance metrics of a model. For a tutorial on using DVC metrics please see our docs.

Q: My team has been using DVC in production. When we upgraded from DVC version 0.71.0, we started getting an error message: ERROR: unexpected error - /my-folder is not a git repository. What’s going on?

This is a consequence of new support we’ve added for monorepos with the dvc init --subdir functionality (see more here), which lets there be multiple DVC projects within a single Git repository. Now, if a DVC repository doesn’t contain a .git directory, DVC expects the no_scm flag to be present in .dvc/config and raises an error if not. For example, one of our users reported this when using DVC to pull files into a Docker container that didn’t have Git initialized (for more about using DVC without Git, see our docs).

You can fix this by running dvc config core.no_scm true (you could include this command in the script that creates Docker images). Alternately, you could include .git in your Docker container, but this is not advisable for all situations.

We are currently working to add graceful error-handling for this particular issue so stay tuned.

Q: Is there a way to force the pipeline to rerun, even if its dependencies haven’t changed?

Yes, dvc repro has a flag that should help here. You can use the -f or --force flag to reproduce the pipeline even when no changes in the dependencies (for example, a training datset tracked by DVC) have been found. So if you had a hypoethetical DVC pipeline whose final process was deploy.dvc, you could run dvc repro -f deploy.dvc to rerun the whole pipeline.

Q: What’s the best way to oragnize DVC repositories if I have several training datasets shared by several projects? Some projects use only one dataset while other use several. Can one project have .dvc files corresponding to different remotes?

Yes, one project directory can contain datasets from several different DVC remotes. Specifically, DVC has functions dvc import and dvc get that emulate the experience of using a package manager for grabbing datasets from external sources. You can use dvc import or dvc get to access any number of datasets that are dependencies in a given project. For more on this, see our tutorial on data registries.

Q: What are the risks of using DVC on confidential data?

DVC doesn’t collect any information about your data (or code, or models, for that matter). You may have noticed that DVC collects Anonymized Usage Analytics, which users may opt out of. The data we collect is extremely limited and anonymized, as it is collected mainly for the purpose of prioritizing bugs and feature development based on DVC usage. For example, we collect info about your operating system, DVC version, and installation method (the complete list of collected features is here).

Many of our users work with sensitive or private data, and we’ve developed DVC with such scenarios in mind from day one.

Q: Can you suggest a reference architecture for using DVC as part of MLOps?

Increasingly, DVC is being used not to just to version and manage machine learning projects, but as part of MLOps, practices for combining data science and software engineering. As MLOps is a fairly new discipline, standards and references aren’t yet solidified. So while there isn’t (yet) a standard recipe for using DVC in MLOps projects, we can point you to a few architectures we like, and which have been reported in sufficient detail to recreate.

First, DVC can be used to detect events (such as dataset changes) in a CI/CD system that traditional version control systems might not be able to. An excellent and thorough blog by Danilo Sato et al. explores using DVC in this way, as part of a CI/CD system that retrains a model automatically when changes in the dataset are detected.

Second, DVC can be used to support model training on cloud GPUs, particularly as a tool for pushing and pulling files (such as datasets and trained models) between cloud computing instances, DVC repositories, and other environments. This architecture was the subject of a recent blog by Marcel Mikl and Bert Besser. Their report describes the cloud computing setup and continuous integration pipeline quite well.

If you develop your own architecture for using DVC in MLOps, please keep us posted. We’ll be eager to learn from your experience. Also, keep an eye on our blog in the next few months. We’re rolling out some new tools with a focus on MLOps!

Subscribe for updates. We won't spam you.