Recently, we wrote about the smart ways in which you can use Machine Learning to improve your company. We also introduced the concept of MLOps, which we are adopting at 10Clouds. In a nutshell, it's a collaborative approach used by data scientists and Ops teams, with the ultimate goal of streamlining processes and providing deeper, more consistent, and more useful insights from ML.
Are you looking to introduce MLOps processes at your company? If so, you might find the below tools helpful.
Data Version Control (DVC)
DVC, or Data Version Control, is an open source project for experiment management in Machine Learning projects. It's easy for developers to get to grips with, because its logic and commands are based on Git. It is storage agnostic, which means that the datasets you use in your projects may reside on Amazon S3, MS Azure Blob Storage or other vendors' services (even Google Drive). You're also able to download any version of the dataset with a single command.
The data versions are stored inside your Git repository, but the data itself is stored in any location of your choice. You're also able to configure data processing pipelines which can be rerun at any time with a single command. Configuration files make it easy to check what actually happens in a pipeline and amend it as necessary. DVC also tracks your model metrics.
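For instance, once a dataset is tracked with DVC, you can pull any registered version of it straight into your code through DVC's Python API. Here is a minimal sketch, where the repository URL, file path and tag are hypothetical placeholders:

```python
import dvc.api

# Read a specific version of a DVC-tracked dataset.
# "repo" and the file path are hypothetical; "rev" can be any Git
# revision (tag, branch or commit) at which the data was versioned.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/your-org/your-project",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```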
Although most data science code is written in Python, DVC, like Git, is language and framework agnostic, so you can use it with whatever tech stack you choose. DVC can also be integrated with CI/CD processes and with tools such as Hive or Apache Spark, among others.
MLflow
MLflow is another open source tool, this one from Databricks. It is useful across the whole ML project lifecycle, from managing experiments through deployment and reproducibility of work to a central model registry. It consists of four components:
MLflow Tracking - This tracks various information about experiments, e.g. code version, start and end times, the source file and entry point that were run, parameters and metrics as key-value pairs, and output artifacts in any format, e.g. PNGs or pickled models (see the sketch after this list).
MLflow Projects - This allows you to define the parameters and characteristics of the project, so that every data scientist or machine learning engineer is able to run it in exactly the same way. All of this is done with handy configuration files (in YAML).
MLflow Models - This is a standard format for packaging models so that they can easily be run with a variety of downstream tools, from serving via a REST API to batch inference on Apache Spark.
Model Registry - This is a crucial tool for model management within the project. It provides APIs and a UI for storing models together with their version history and managing them through the whole model lifecycle; a short example is sketched further below.
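To make the Tracking component concrete, here is a minimal sketch of logging a run with MLflow's Python API; the experiment name, parameter values and artifact file are hypothetical:

```python
import mlflow

# Runs are grouped under experiments; with no tracking server configured,
# everything is stored locally under ./mlruns.
mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    # Parameters are logged as key-value pairs.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)

    # ... train and evaluate the model here ...

    # Metrics can be logged once or repeatedly to build up a history.
    mlflow.log_metric("val_accuracy", 0.87)

    # Any output file (plots, reports, pickled models) can be stored as an artifact.
    mlflow.log_artifact("confusion_matrix.png")  # hypothetical local file
```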
MLflow is easy to use thanks to the clear, browser-based UI for each of its tools. It also has a plugin system, so you can integrate it with your own infrastructure in various ways.
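And here is a minimal sketch of the Models and Model Registry components working together. It assumes a toy scikit-learn model and a tracking server with a database-backed store (which the registry requires); the registered model name is hypothetical:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model just to have something to package.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

with mlflow.start_run():
    # Saves the model in the MLflow Models format (so downstream tools can
    # serve it or run batch inference) and, if the tracking server supports
    # the registry, registers it under the given name.
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="demo-classifier",  # hypothetical registry name
    )
```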
Start considering MLOps now, and reap the benefits later
If you've been hesitant about implementing MLOps at your company, start taking the first steps today by exploring some of the tools above. While setting up the infrastructure might take some time, you'll soon see the benefits in smoother delivery and deeper, more useful insights from your data scientists.