Databricks (Open-Source Tools) or Azure ML Workspace (Azure): How to Select the Best MLOps Platform
Authors: Luis Galo and Dr. Lídia Almazán Torres, Data Scientists at Lantek
In a previous article we explained what Machine Learning Operations (MLOps) is. This article addresses the differences between two widely used environments for Machine Learning (ML): Databricks and the Azure ML workspace. The first is built on open-source technology and is not MLOps-oriented; the second is proprietary to Azure and was designed from the start as an MLOps platform.
In any project, the question arises as to how to manage the models and which tools to use to control the entire software lifecycle and facilitate both its management and the work of the developers. Although there is a tendency to use open-source tools, there are also proprietary but free tools that greatly facilitate the development of ML applications, an example of which is Azure ML SDK.
Below we compare two of the most widely used environments for developing machine learning systems and conclude with the best solution for our use cases.
Azure Databricks and Azure Machine Learning
What is Azure Databricks? Databricks is a unified analytics platform that enables teams to collaboratively perform everything from data analytics to data science using Apache Spark.
What is Azure Machine Learning? Azure Machine Learning Service (AMLS) is Microsoft’s proprietary solution for supporting the machine learning lifecycle in Azure. AMLS continuously gains new features that keep the MLOps process up to date. Currently, you can use the Python SDK to interact with the service, or use the Designer to build simple models visually.
The main differences between the two environments lie in:
Model Lifecycle Monitoring
AMLS includes functionality to keep track of datasets, experiments, pipelines, models, and endpoints. In addition, you can easily provision virtual machines, training clusters, and inference targets directly from the web portal (and from the SDK), and users can install virtually any Python package via Conda (an open-source package and environment management system).
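As a rough sketch of what this tracking looks like with the azureml-core Python SDK (the experiment name and metric are illustrative, and `Workspace.from_config()` assumes a `config.json` downloaded from the portal):

```python
def log_training_run():
    # azureml-core SDK; the import sits inside the function so this sketch
    # only needs the package (and Azure credentials) when actually invoked.
    from azureml.core import Workspace, Experiment

    ws = Workspace.from_config()                  # reads config.json from the portal
    exp = Experiment(workspace=ws, name="demo-experiment")
    run = exp.start_logging()
    run.log("accuracy", 0.92)                     # appears in the AMLS studio UI
    run.complete()
```

Every value logged this way is attached to the run and browsable alongside the datasets, models, and endpoints of the workspace.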
Databricks can use MLflow as a tracking system, but it is a separate package and requires maintenance time. Although it is an open-source package that comes pre-installed and enabled in recent Databricks ML runtimes, it is a limited MLOps approach: it is restricted to model lifecycle control, and it monitors neither model state nor data.
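A minimal MLflow tracking sketch, with illustrative parameter and metric names:

```python
def track_with_mlflow():
    # mlflow comes pre-installed on Databricks ML runtimes; the import is
    # inside the function so the sketch loads without the package installed.
    import mlflow

    # On Databricks the tracking URI is configured automatically; elsewhere
    # you would point it at a local file store or a tracking server.
    with mlflow.start_run():
        mlflow.log_param("max_depth", 5)
        mlflow.log_metric("rmse", 0.31)
```

This covers experiment and model lifecycle tracking, but, as noted above, nothing here observes the deployed model's state or the data flowing into it.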
Data Processing and Scale
Azure ML is a better choice when training with data of limited size (small and medium) on a single machine. Although Azure ML provides compute clusters, developers must hand-code how data is distributed among the nodes, which makes it a poor choice for big data.
Azure Databricks is designed to handle data distributed across multiple nodes and in bulk, as long as the Databricks and Spark libraries are used. This is advantageous when the size of the data is huge.
Databricks supports many file formats natively, so querying and cleaning huge datasets is easy. In Azure ML, by contrast, this has to be handled with custom Python code, where cleaning and writing to storage are done by hand.
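For instance, a cleaning step over a large Parquet dataset needs no custom parsing code on Databricks (paths and column names here are hypothetical):

```python
def clean_parts_table(spark):
    # spark: an active SparkSession (provided automatically on Databricks).
    df = spark.read.parquet("/mnt/data/parts")    # native Parquet read
    cleaned = df.dropna(subset=["thickness"]).filter(df.thickness > 0)
    cleaned.write.mode("overwrite").parquet("/mnt/data/parts_clean")
```

Reading, filtering, and writing back all execute distributed across the cluster; the equivalent in plain Azure ML would be hand-written Python against the storage APIs.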
Both platforms can distribute training. Databricks provides adapted ML algorithms that operate on partitions of the data, coordinated from the driver (master) node. Although this can be done in AMLS, the data-flow management and the algorithms must be programmed by hand.
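A sketch of such an adapted algorithm using Spark's MLlib (feature and label column names are illustrative):

```python
def train_distributed(df):
    # df: a Spark DataFrame with feature columns and a "label" column.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")
    # fit() distributes the work: the driver coordinates while executors
    # process their partitions of the data.
    return lr.fit(assembler.transform(df))
```

The developer never writes the partitioning logic; MLlib's estimators do the coordination that, in AMLS, would have to be programmed explicitly.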
Production Deployment of Models
AMLS offers very simple deployments: with just a couple of clicks (or a few lines of SDK code) you can deploy a trained model to an Azure Container Instance or an Azure Kubernetes Service-based container with a callable URI. This can be secured with API keys or tokens, and it can be further tailored by configuring a custom Docker image for specific use cases.
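Those "few lines of SDK code" look roughly like the following sketch with azureml-core (model name, service name, and the `score.py`/`env.yml` files are illustrative and would exist in a real project):

```python
def deploy_to_aci():
    # azureml-core SDK; imports inside the function so the sketch loads
    # without the package installed.
    from azureml.core import Workspace, Environment
    from azureml.core.model import Model, InferenceConfig
    from azureml.core.webservice import AciWebservice

    ws = Workspace.from_config()
    model = Model(ws, name="demo-model")                   # previously registered model
    env = Environment.from_conda_specification("demo-env", "env.yml")
    inference_config = InferenceConfig(entry_script="score.py", environment=env)
    aci_config = AciWebservice.deploy_configuration(
        cpu_cores=1, memory_gb=1, auth_enabled=True)       # key-based auth
    service = Model.deploy(ws, "demo-service", [model],
                           inference_config, aci_config)
    service.wait_for_deployment(show_output=True)
    return service.scoring_uri                             # the callable URI
```

Swapping the ACI configuration for an AKS one is the main change needed for a production-grade, autoscaling endpoint.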
As for Databricks, at least in Azure, there are three options: pull the model out of Spark and operationalize it using AMLS, use Microsoft ML for Apache Spark (MMLSpark) to create a distributed web service within your Spark cluster, or use the MLflow API to deploy the model.
MMLSpark is a Microsoft package that allows Spark to handle a dataflow workload from an API endpoint. This process is ideal for complex data that needs to be processed before being run through a trained model.
Integration into CI/CD Pipelines
AMLS is easy to integrate into CI/CD pipelines, as it allows unit and integration tests to run on simple systems without the need for special environments.
Databricks, however, requires a connection to the cluster to run integration tests, which complicates the management of test environments as well as debugging in the developer's local IDE, making it somewhat more cumbersome for developers.
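This is what makes plain-Python pipelines pleasant in CI: a preprocessing step can be unit-tested anywhere, with no cluster connection. A minimal sketch (the pipeline function is hypothetical):

```python
import pandas as pd

def drop_invalid_rows(df):
    # Hypothetical preprocessing step from an ML pipeline: discard rows
    # with a missing label before training.
    return df.dropna(subset=["label"])

def test_drop_invalid_rows():
    df = pd.DataFrame({"label": [1.0, None, 0.0]})
    assert len(drop_invalid_rows(df)) == 2

test_drop_invalid_rows()
```

The same test for a Spark-based step would need either a live cluster or a local Spark session in the CI environment.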
Learning Curve
Databricks has a steeper learning curve than Azure ML; it is therefore preferred when data and algorithms are big enough to need to be distributed across a cluster.
Azure ML is easier for small and medium-sized models, especially for experienced Python developers, since everything is programmed in Python locally and the environment can be launched in a pipeline without any problem.
Databricks, on the other hand, requires the use of its own libraries (specifically designed for Spark) and knowledge of how the cluster works in order to take full advantage of its capacity and distribute computation across the cluster.
Conclusions
Generally speaking, if the dataset is small to medium, AMLS is the best choice. However, if the data is huge (it does not fit in the machine's RAM for processing or training), then Databricks is the option to choose. We can assure you that when the data fits on a single, suitably sized machine as a pandas DataFrame, using Azure Databricks would be excessive.
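To make that single-machine threshold concrete, here is a runnable check of how much memory a modest pandas DataFrame actually uses (column names and sizes are illustrative):

```python
import pandas as pd

# Illustrative single-machine dataset: 100,000 rows of part measurements.
df = pd.DataFrame({"part_id": range(100_000),
                   "thickness_mm": [2.0] * 100_000})
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.1f} MB")  # a few MB: comfortably single-machine territory
```

Even millions of rows of numeric data often fit in a few gigabytes of RAM, well within what AMLS handles on one VM without any cluster.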
For very large data, the right choice may be a combination of both: big-data processing in Databricks and training in AMLS, so that lifecycle management is handled by Azure ML and bulk data processing by Databricks. In addition, AMLS frequently updates its MLOps features, so the entire MLOps process can be kept up to date at the pace of Azure ML releases.
Lastly, it is possible to interact with AMLS through open-source tools such as MLflow, whose use is much simpler. Still, MLflow on its own is a limited tool because it only gives access to certain open-source libraries. Azure ML, on the other hand, despite being proprietary, has more freedom to combine other open-source libraries, which can turn out to be more resourceful.