Azure HDInsight is a cloud-based service for open-source analytics. It is an easy, fast, and cost-effective way to process massive amounts of data, and it covers many use-case scenarios such as extract, transform, and load (ETL), data warehousing, machine learning, and IoT.
The main benefit of using HDInsight for machine learning is access to a memory-based processing framework. HDInsight helps developers process and analyze big data and build solutions using popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and Microsoft Machine Learning Server [1].
Set up clusters in HDInsight
The first step is to set up an HDInsight cluster in Azure. Log in to your Azure account and create an HDInsight component. As you can see in the figure below, there are different modules for HDInsight, such as HDInsight Spark monitoring and HDInsight Interactive Query monitoring. Among those, select the HDInsight analytics option.
When you create an HDInsight cluster, you need to follow some steps to set it up and size it. In the first step, you set a name for the cluster, the subscription, and the cluster type. There are different cluster types, such as Spark, Hadoop, Kafka, and ML Services.
In the next step, you specify the cluster size and review the summary.
Creating the HDInsight cluster takes a couple of minutes. After the component is created, open its main page and, in the Overview section, select Cluster Dashboard.
Next, select Jupyter Notebook, and on the new page, choose the New option.
As you can see in the figure below, there are different kernel options, such as PySpark, PySpark3, and Spark. The PySpark and PySpark3 kernels run Python code, while the Spark kernel runs Scala. After creating a new Spark notebook, you need to log in with the username and password that you provided when creating the HDInsight component. The Jupyter environment is a notebook interface, much like the Azure Databricks environment: you write code in a cell and run the whole cell to see the result.
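To check that everything works, you can run a minimal first cell in the Spark (Scala) kernel. This is just a sketch that creates a small DataFrame; the spark session is already predefined in every HDInsight notebook, so no setup is needed:
// The Spark session is predefined in HDInsight Jupyter notebooks.
// Create a tiny DataFrame and display it to verify the cluster responds.
val test = spark.range(0, 5).toDF("number")
test.show()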
You can also fetch data from other Azure components, such as Azure Data Lake Store Gen1. To do that, run the code below (in Databricks, the same code works).
// OAuth2 service-to-service authentication settings for Azure Data Lake Store Gen1.
// The client id, credential, and refresh URL are the ones registered for your
// Azure Active Directory application.
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "a1824181-e20c-4952-894f-6f53670672dd")
spark.conf.set("dfs.adls.oauth2.credential", "iRzOkcyahiomc5AKobyVxFdDVF/mEbS3mqN1moehG0w=")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/0b414bdb-2159-4b16-ad13-b2d54a1781da/oauth2/token")
// Read the Titanic CSV file from the Data Lake Store, keeping the header row.
val df = spark.read.option("header", "true").csv("adl://adlsbook.azuredatalakestore.net/titanic.csv")
// Keep only the columns needed for the analysis, and rename Sex to Gender.
val specificColumnsDf = df.select("Survived", "Pclass", "Sex", "Age")
val renamedColumnsDF = specificColumnsDf.withColumnRenamed("Sex", "Gender")
// Register the DataFrame as a temporary view so it can be queried with Spark SQL.
renamedColumnsDF.createOrReplaceTempView("some_name")
renamedColumnsDF.show()
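Once the view is registered, you can query it with Spark SQL. As a small sketch, the query below computes the survival rate per passenger class from the some_name view created above; the explicit cast is needed because the CSV columns are read as strings:
// Query the temporary view: average survival rate per passenger class.
val survivalByClass = spark.sql(
  "SELECT Pclass, AVG(CAST(Survived AS DOUBLE)) AS survival_rate FROM some_name GROUP BY Pclass ORDER BY Pclass")
survivalByClass.show()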
You can also do machine learning in the Jupyter Notebook Spark environment. For an example, follow the tutorial in reference number 4 [4].
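As a rough illustration of what such a machine learning step could look like, here is a hedged sketch that trains a logistic regression model on the renamedColumnsDF DataFrame prepared above. The feature preparation (casting, indexing, assembling) is an assumption about how one might handle the Titanic columns, not the method of the referenced tutorial:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// The CSV columns are read as strings, so cast the numeric ones first
// and drop rows with missing values (Age is often empty in the Titanic data).
val typed = renamedColumnsDF
  .selectExpr("CAST(Survived AS DOUBLE) AS label",
              "CAST(Pclass AS DOUBLE) AS Pclass",
              "Gender",
              "CAST(Age AS DOUBLE) AS Age")
  .na.drop()

// Encode the Gender string as a numeric index and assemble the feature vector.
val genderIndexer = new StringIndexer().setInputCol("Gender").setOutputCol("GenderIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "GenderIndex", "Age"))
  .setOutputCol("features")
val lr = new LogisticRegression()

// Train on 80% of the rows and show predictions for the remaining 20%.
val Array(train, test) = typed.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new Pipeline().setStages(Array(genderIndexer, assembler, lr)).fit(train)
model.transform(test).select("label", "prediction").show(10)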
[1] https://docs.microsoft.com/en-us/azure/hdinsight/
[2] https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql