Databricks is an analytics service based on the open-source Apache Spark project and is widely used for ingesting and processing large amounts of data. In February 2018, Databricks was integrated with Azure; this integration gives data science and data engineering teams a fast, easy, and collaborative Spark-based platform on Azure [1]. Azure Databricks is a new platform for big data analytics and machine learning, and its notebooks let data engineers, data scientists, and business analysts work together. In this post and the next one, I will give an overview of Azure Databricks, show the environment, and explain concisely how to use it for data science.
Databricks Environment
First, create Azure Databricks in the Azure environment: log into one of your Azure accounts and create an Azure Databricks service.
To access Azure Databricks, click on Launch Workspace.
As you can see in the picture below, the Azure Databricks environment has different components; the main ones are the Workspace and Clusters. The first step is to create a cluster. Clusters in Databricks provide a unified platform for ETL (extract, transform, and load), stream analytics, and machine learning. There are two cluster types: interactive and job. Interactive clusters are used to analyze data collaboratively with notebooks, whereas job clusters are used to run fast and robust automated workloads via the API.
The Clusters page may list both cluster types, and each cluster can have a different number of nodes. To get started, you need to create a cluster: click on the Create Cluster option and, on the create cluster page, enter information such as the cluster name, the runtime version (the default is fine), the Python version, the minimum and maximum number of workers, and so forth.
To use the cluster, wait until its status changes to Running (picture below). With an interactive cluster in place, we can create a notebook, write code in it, and get results quickly.
To create a notebook, click on the Workspace option and create a new notebook.
When creating a new notebook, you specify which cluster the notebook is attached to and what its main language is (Python, Scala, R, or SQL). In this example, R has been selected as the default language. However, you can still write in the other languages inside the notebook by putting %scala, %python, %sql, or %r at the start of a cell.
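For instance, here is a minimal sketch of a cell that runs Python inside an R notebook (the cell content itself is only illustrative):

%python
# Because the cell starts with %python, Databricks runs it as Python
# even though the notebook's default language is R.
print("Hello from a Python cell")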
By default, the notebook provides a place to write code. As you can see in the figure below, there is an editor cell named Cmd 1 in which you write code and run it. In this example, there is only one cell and the primary language is R. We use the mpg dataset that ships with the ggplot2 package by writing the code below.
library(ggplot2)
display(mpg)
The display command shows the dataset in Databricks. To run the code, click on the arrow on the right side of the cell and choose Run Cell. After the code runs, the result appears at the end of the cell as a table.
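As a small illustrative follow-up (the aggregation below is my own hedged sketch, not part of the original example), display works on any R data frame, not just the raw dataset:

library(ggplot2)   # mpg ships with ggplot2

# Average highway fuel economy per vehicle class
avg_hwy <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# display renders the data frame as a table at the end of the cell
display(avg_hwy)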
To show a chart instead, click on the chart icon at the bottom of the cell.
You can also change which fields are shown in the chart by clicking on Chart Options.
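If you prefer to define the chart in code rather than through Chart Options, a ggplot2 call such as the sketch below should also work in an R notebook (whether the plot renders inline may depend on the workspace and cluster configuration):

library(ggplot2)

# Scatter plot of engine displacement against highway fuel economy,
# coloured by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")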
This is a very simple use case of Databricks for running R scripts. In the next post I will show how to get data from Azure Data Lake Store, clean it with Scala, apply machine learning to it, and finally show the results in Power BI.
I am always happy to see and learn more about BI-related topics.