Is There a Future for Hadoop and HDFS in the Azure Cloud?
It's an interesting question, and one we hear more often as we take clients down the data modernization path with a "cloud first" approach. For context, the client who most recently asked it is in the middle of a data modernization initiative, combined with a cloud migration to Microsoft Azure.
As we thought through their query, it led us to a couple of questions of our own:
1. Are you worried about cloud/vendor lock-in?
2. What drives this concern (why should you care)?
This blog post will address the first question. We will answer by showcasing how MS Azure builds flexibility into its cloud offerings to dispel these concerns. We will cover the second question in a future blog post.
As is typical of technology, there is a race to commoditization that is germane to this discussion. Further, with regard to the Azure cloud, there are tradeoffs among total control, ease of use, and increased security (more on that in a future post). In the end, you must choose between the following two options:
If vendor lock-in (that is, depending on a single cloud vendor's solution) is not a concern, then take advantage of the commoditization play and embrace Platform-as-a-Service (PaaS) offerings and the benefits that they bring to the table.
In contrast, if vendor lock-in is a concern, then take advantage of Infrastructure-as-a-Service (IaaS) and install your preferred flavor of Hadoop.
Deciding which of these is right for your business requires an understanding of Microsoft's Hadoop offerings:
Figure 1: Ease of use increases as PaaS usage increases
AZURE MARKETPLACE
The Azure Marketplace offering is IaaS based. As its name suggests, a consumer can load their preferred version of Hadoop onto virtual machines (VMs) that they control. Each VM has dedicated storage, with a conventional Hadoop landscape built around it. The IaaS nature of Marketplace provides complete compatibility with on-premises installations but does not take advantage of the features that MS Azure offers. Additionally, an IaaS-focused solution requires significantly more administration, from both a Hadoop perspective and an IT Delivery/Operations/Security perspective.
AZURE HDINSIGHT
The Azure HDInsight offering is a hybrid PaaS offering in Azure that provides a Hortonworks distribution as a service. It covers almost the entire Hortonworks stack, including Ambari, Spark, Storm, and Kafka, as well as Ranger integrated with Active Directory. HDInsight allows you to scale elastically and takes advantage of the separation between compute and storage. The compute nodes can be destroyed (saving significant cost) and then spun back up and reconnected to the storage as needed. This capability can even be implemented in SSIS and Azure Data Factory as part of a larger data pipeline. In this model, HDFS is not a concern, since Azure storage covers that capability and the HDInsight cluster leverages it. This PaaS solution requires minimal Hadoop administration and even less IT Delivery/Operations/Security support.
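To make the separation of compute and storage concrete, here is a minimal Python sketch of how data in Azure Blob storage is addressed from a Hadoop-compatible cluster through a WASB URI. The URI names the storage account and container rather than any cluster node, which is why the data outlives the compute nodes. The helper name wasb_uri and the container, account, and path values are hypothetical and exist only for illustration.

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build a WASB URI pointing at a blob path in an Azure Storage account.

    The URI identifies the storage account and container, not a cluster,
    so it stays the same when a cluster is destroyed and later recreated.
    """
    scheme = "wasbs" if secure else "wasb"  # wasbs uses TLS
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, account, and path, for illustration only.
uri = wasb_uri("clickstream", "examplestorage", "/raw/events.csv")
print(uri)
```

A job running on a freshly provisioned cluster would read from the same URI as a job on the cluster that was torn down earlier, provided the new cluster is attached to the same storage account.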
AZURE DATA LAKE ANALYTICS
The Azure Data Lake Analytics offering is Microsoft's Big Data-as-a-Service offering: a PaaS, and the most flexible of the MS Azure offerings. There are no clusters or storage to configure. It is truly a "plug and play" solution. Jobs are created in Azure Data Lake Analytics to transform data that has already landed in Azure Blob storage or Azure Data Lake Store. In this scenario, there is no Hadoop or IT Delivery administration required at all.
In a nutshell, the future of Hadoop and HDFS in the cloud is already here. By walking through the various MS Azure flavors, we hope to have dispelled any concerns about cloud/vendor lock-in. We have also covered the considerations (e.g., control vs. ease of use) for companies in the process of selecting a Hadoop offering. We hope this information has been helpful in charting where Hadoop is heading. It is encouraging to know that the technologies themselves have become commodities and are offered without the need for deep technology expertise.
Our next blog post in this series will focus on the second question that was raised: "What drives this concern (why should you care)?"