File Management in Azure Data Lake Store (ADLS) using R Studio


In this post, I am going to share my experiment with how to do file management in ADLS from the R Studio environment.

So how does it work? We are able to manage ADLS from the R Studio environment using R scripts: without opening the Azure portal, we can manage files in ADLS and bring data from ADLS into R Studio to practice machine learning. Then, once we are sure our code is good enough, we can use U-SQL to embed the R code inside the ADLS environment. In this post, I am first going to show how to access ADLS files from the R Studio environment for file management and machine learning practice. In the next post on ADLS, I will show how, after you have tested your code, you can embed the R scripts in U-SQL (the query language we have for ADLS).

 

What do we need to start?

To do this, you need the items below:

1. An Azure subscription

2. An Azure Data Lake Store account

3. An Azure Active Directory application (for service-to-service authentication)

4. An authorization token from the Azure Active Directory application

You should have the following information:

Client_id (application id), Tenant_id, Client_secret, and the OAuth 2.0 token endpoint.
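For convenience, these values can be kept in R variables; the values below are placeholders, not real credentials:

# Placeholder values -- replace them with the details of your own
# Azure Active Directory application registration
client_id      <- "<Application_ID>"   # AAD application (client) id
client_secret  <- "<Client_Secret>"    # AAD application key
tenant_id      <- "<Tenant_ID>"        # AAD directory (tenant) id
token_endpoint <- paste0("https://login.windows.net/", tenant_id, "/oauth2/token")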

To start in R Studio, you need to install and load the packages below:

install.packages("httr")
install.packages("jsonlite")
install.packages("curl")
library(httr)
library(jsonlite)
library(curl)

Now I need to request an OAuth token from Azure Active Directory using the R code below:

# Create a curl handle and attach the OAuth 2.0 form fields;
# replace Application_ID and Client_Secret with your own values
h <- new_handle()

handle_setform(h,
               "grant_type"    = "client_credentials",
               "resource"      = "https://management.core.windows.net/",
               "client_id"     = "Application_ID",
               "client_secret" = "Client_Secret"
)

# Request the token from the tenant's OAuth 2.0 token endpoint
req <- curl_fetch_memory("https://login.windows.net/3977e63c-42bc-4e42-9204-905502b6be1e/oauth2/token", handle = h)

# Parse the JSON response into an R list
res <- fromJSON(rawToChar(req$content))
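As a quick sanity check (the field names below are the standard ones returned by the AAD token endpoint), you can confirm that a token actually came back:

# res is a list parsed from the JSON response; on success it includes
# token_type ("Bearer") and access_token, which are used in later requests
if (is.null(res$access_token)) stop("Token request failed: ", rawToChar(req$content))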

After setting up the connection, I am first going to explore the folders that I have in ADLS.

(Screenshot: the folders in the root of my ADLS account, shown in the Azure portal.)

 

List Folders

Now I am going to use R scripts inside R Studio to list the folders and files in my ADLS, as below:

# List the files and folders in the ADLS root (WebHDFS LISTSTATUS operation);
# replace <datalake name> with the name of your Data Lake Store account
r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS",
               add_headers(Authorization = paste(res$token_type, res$access_token)))

# Pretty-print the JSON response
jsonlite::toJSON(jsonlite::fromJSON(content(r, "text")), pretty = TRUE)

After running the above code, I get the following output in the console:


{
  "FileStatuses": {
    "FileStatus": [
      {
        "length": 0,
        "pathSuffix": "",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1506460300176,
        "modificationTime": 1506638643156,
        "replication": 0,
        "permission": "777",
        "owner": " ",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "Output",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1506984327086,
        "modificationTime": 1506985146391,
        "replication": 0,
        "permission": "770",
        "owner": "",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "Samples",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1506982289547,
        "modificationTime": 1506982289547,
        "replication": 0,
        "permission": "770",
        "owner": "",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "catalog",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1499921514558,
        "modificationTime": 1499921514583,
        "replication": 0,
        "permission": "771",
        "owner": "",
        "group": ""
      },
      {
        "length": 34,
        "pathSuffix": "data.csv",
        "type": "FILE",
        "blockSize": 268435456,
        "accessTime": 1499923394187,
        "modificationTime": 1499923394293,
        "replication": 1,
        "permission": "770",
        "owner": "",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "mytempdir",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1507562677951,
        "modificationTime": 1507564478898,
        "replication": 0,
        "permission": "770",
        "owner": "",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "system",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1499921515548,
        "modificationTime": 1506982329842,
        "replication": 0,
        "permission": "770",
        "owner": "",
        "group": ""
      },
      {
        "length": 0,
        "pathSuffix": "usqlext",
        "type": "DIRECTORY",
        "blockSize": 0,
        "accessTime": 1506982293606,
        "modificationTime": 1507591046474,
        "replication": 0,
        "permission": "770",
        "owner": "",
        "group": ""
      }
    ]
  }
}

 

This shows all the available folders and files in the root of my ADLS account.
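If you prefer a tabular view over raw JSON, the listing can be turned into a data frame; a small sketch, assuming the FileStatuses/FileStatus structure shown above:

# fromJSON() simplifies the array of file objects into a data frame
listing <- jsonlite::fromJSON(content(r, "text"))$FileStatuses$FileStatus
listing[, c("pathSuffix", "type", "length", "permission")]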

Create Folders

Imagine I am going to create a folder inside my ADLS storage. To do that, I write the code below:

# Create a new folder called mytempdir (WebHDFS MKDIRS operation)
r <- httr::PUT("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/?op=MKDIRS",
               add_headers(Authorization = paste(res$token_type, res$access_token)))

content(r, "text")  # returns {"boolean": true} on success

So if I check ADLS in the Azure portal, I will find this folder:

(Screenshot: the new mytempdir folder shown in the Azure portal.)
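Before a file can be read back, it has to exist in ADLS. Uploading works through the same pattern; below is a minimal sketch using the WebHDFS CREATE operation (the local path C:/PBIEm/iris.csv is just an example):

# Upload a local CSV into the mytempdir folder (WebHDFS CREATE operation)
r <- httr::PUT("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=CREATE&write=true&overwrite=true",
               body = upload_file("C:/PBIEm/iris.csv"),
               add_headers(Authorization = paste(res$token_type, res$access_token)))
httr::status_code(r)  # expect 201 (Created) on success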

 

Read Data

Now imagine that we are going to read a file from ADLS and load it into R Studio, or even copy it to the local PC. The first step is to access the directory using a GET request, going through the mytempdir folder to reach the iris.csv file in it.

library(httr)

# Read the iris.csv file from the mytempdir folder (WebHDFS OPEN operation)
r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=OPEN&read=true",
               add_headers(Authorization = paste(res$token_type, res$access_token)))


If I just want to work with the data in R Studio without downloading it, I can use the code below:

Dataforiris <- content(r)

All the data is now loaded into the variable “Dataforiris”.
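Note that content() guesses how to parse the body from the response headers; if ADLS returns the file as a raw byte stream rather than text, you may need to parse it explicitly, along these lines:

# Parse the response body as CSV explicitly, in case content(r) returns raw bytes
Dataforiris <- read.csv(text = httr::content(r, "text"))
head(Dataforiris)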

However, if you want a local copy on disk instead, you can use the code below:

# Write the raw response body to a local file, then read it back
writeBin(content(r, "raw"), "C:/PBIEm/iris.csv")

irisDownloaded <- read.csv("C:/PBIEm/iris.csv")

head(irisDownloaded)

I have loaded the data onto my local C drive, where it is now accessible:

(Screenshot: the head(irisDownloaded) output in R Studio.)

This practice used a small dataset, which took less than one second to load from ADLS. I also tried it on a dataset with 64 million records; loading it from ADLS into R Studio took about 70 seconds. However, if you want to run R inside ADLS itself, you have to use R inside U-SQL, which I am going to talk about in the next post!
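If you want to measure this for your own data, wrapping the request in system.time() gives a rough estimate:

# Time the round trip from ADLS to R Studio
system.time(
  r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=OPEN&read=true",
                 add_headers(Authorization = paste(res$token_type, res$access_token)))
)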

Reference: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/03/14/using-r-to-perform-filesystem-operations-on-azure-data-lake-store/

 

