In this post, I am going to share my experiment with doing file management in Azure Data Lake Store (ADLS) from the R Studio environment.
So how does it work? We can manage ADLS from the R Studio environment using R scripts: without going through the portal, we are able to manage the store and bring data from ADLS into R Studio for machine learning practice. Then, once we are sure our code is good enough, we can use U-SQL inside ADLS to embed the R code in the ADLS environment. In this post, I am first going to show how to access ADLS files from the R Studio environment for file management and for practising machine learning there. In the next post on ADLS, I will show how, after you have tested your code from R Studio, you can embed it inside ADLS using R scripts in U-SQL (the language we have in ADLS).
What do we need to start?
To do this, you need the following items:
1. An Azure subscription
2. Create an Azure Data Lake Store Account
3. Create an Azure Active Directory application (for service-to-service authentication)
4. An authorization token from the Azure Active Directory application
You should have the following information:
Client_id (application ID), Tenant_id, Client_secret, and the OAuth 2.0 token endpoint.
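To keep the later snippets readable, you can also store these values in R variables first. This is only a minimal sketch with placeholder values, not real credentials; the snippets below paste the values directly into the calls, but you could substitute these variables instead.

client_id      <- "<Application_ID>"   # application (client) ID of the AAD application
client_secret  <- "<Client_Secret>"    # key generated for the AAD application
tenant_id      <- "<Tenant_ID>"        # Azure Active Directory tenant ID
adls_name      <- "<datalake name>"    # name of the Azure Data Lake Store account
token_endpoint <- paste0("https://login.windows.net/", tenant_id, "/oauth2/token")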
To start in R Studio, you need to install and load the following packages:
install.packages("httr") install.packages("jsonlite") install.packages("curl") library(httr) library(jsonlite) library(curl)
Now I need to request an access token from Azure Active Directory using the R code below:
h <- new_handle()
handle_setform(h,
  "grant_type" = "client_credentials",
  "resource" = "https://management.core.windows.net/",
  "client_id" = "Application_ID",
  "client_secret" = "Client_Secret"
)
req <- curl_fetch_memory("https://login.windows.net/3977e63c-42bc-4e42-9204-905502b6be1e/oauth2/token", handle = h)
res <- fromJSON(rawToChar(req$content))
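Before moving on, it is worth checking that the token request actually succeeded. A quick sketch, using the req and res objects created above:

req$status_code                  # 200 means the token request succeeded
names(res)                       # should include "token_type", "access_token", "expires_in"
substr(res$access_token, 1, 20)  # peek at the start of the token without printing it all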
After setting up the connection, I am first going to explore the folders that I have in ADLS. These are the folders that exist in my store. Now, I am going to use R scripts inside R Studio to list the same folders and files, as below:
r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS",
               add_headers(Authorization = paste(res$token_type, res$access_token)))
jsonlite::toJSON(jsonlite::fromJSON(content(r, "text")), pretty = TRUE)
After running the above code, I got the following output in the console:
List Folders
{ "FileStatuses": { "FileStatus": [ { "length": 0, "pathSuffix": "", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1506460300176, "modificationTime": 1506638643156, "replication": 0, "permission": "777", "owner": " ", "group": "" }, { "length": 0, "pathSuffix": "Output", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1506984327086, "modificationTime": 1506985146391, "replication": 0, "permission": "770", "owner": "", "group": "" }, { "length": 0, "pathSuffix": "Samples", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1506982289547, "modificationTime": 1506982289547, "replication": 0, "permission": "770", "owner": "", "group": "" }, { "length": 0, "pathSuffix": "catalog", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1499921514558, "modificationTime": 1499921514583, "replication": 0, "permission": "771", "owner": "", "group": "" }, { "length": 34, "pathSuffix": "data.csv", "type": "FILE", "blockSize": 268435456, "accessTime": 1499923394187, "modificationTime": 1499923394293, "replication": 1, "permission": "770", "owner": "", "group": "" }, { "length": 0, "pathSuffix": "mytempdir", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1507562677951, "modificationTime": 1507564478898, "replication": 0, "permission": "770", "owner": "", "group": "" }, { "length": 0, "pathSuffix": "system", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1499921515548, "modificationTime": 1506982329842, "replication": 0, "permission": "770", "owner": "", "group": "" }, { "length": 0, "pathSuffix": "usqlext", "type": "DIRECTORY", "blockSize": 0, "accessTime": 1506982293606, "modificationTime": 1507591046474, "replication": 0, "permission": "770", "owner": "", "group": "" } ] } }
So it shows me all the folders and files available in the root of my ADLS.
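If you prefer to work with this listing as a data frame rather than raw JSON, you can flatten it. A minimal sketch, reusing the response r from the LISTSTATUS call above:

listing <- jsonlite::fromJSON(content(r, "text"))$FileStatuses$FileStatus
listing[, c("pathSuffix", "type", "length")]   # just the name, type and size of each entry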
Create Folders
Imagine I am going to create a folder inside my ADLS storage. To do that, I write the code below:
r <- httr::PUT("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/?op=MKDIRS",
               add_headers(Authorization = paste(res$token_type, res$access_token)))
content(r, "text")
So if I check the ADLS, I can see this new folder.
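The read example in the next section assumes that a file named iris.csv already exists in the mytempdir folder. If you need to put it there from R first, a hedged sketch using the WebHDFS CREATE operation might look like this (the local path is only an example):

write.csv(iris, "C:/PBIEm/iris.csv", row.names = FALSE)   # write the built-in iris dataset to a local CSV
r <- httr::PUT("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=CREATE&write=true&overwrite=true",
               body = httr::upload_file("C:/PBIEm/iris.csv"),
               add_headers(Authorization = paste(res$token_type, res$access_token)))
httr::status_code(r)   # 201 is expected when the file has been created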
Read Data
Now imagine that we are going to read a file from ADLS and load it into R Studio, or even copy it onto the local PC. So, as the first step, I am going to access the file using the GET function, going through the mytempdir folder and reaching the iris.csv file in that folder.
library(httr)
r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=OPEN&read=true",
               add_headers(Authorization = paste(res$token_type, res$access_token)))
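It is worth confirming that the request succeeded before trying to use the response; a small sketch:

httr::stop_for_status(r)   # throws an error if the file could not be opened
httr::status_code(r)       # 200 means the file content was returned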
So, if I just need to load it for working in R Studio, without downloading it, I can use the code below:
Dataforiris <- content(r)
So all the data will be loaded into the variable “Dataforiris”.
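Depending on the content type the service reports, content(r) may return raw bytes or an already parsed object. If you want to be explicit about parsing the CSV text yourself, a minimal sketch:

irisData <- read.csv(text = content(r, "text"))   # read the response body as text and parse it as CSV
head(irisData)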
However, you may be interested in keeping a local copy on disk; in that case you can use the code below:
writeBin(content(r, "raw"), "C:/PBIEm/iris.csv")   # write the raw response bytes to a local file
irisDownloaded <- read.csv("C:/PBIEm/iris.csv")
head(irisDownloaded)
I have just downloaded the data into a folder on my local C drive, where it is now accessible.
This practice was for a small dataset that took less than one second to load from ADLS. However, I also tried it for a dataset with 64 million records, and the loading process from ADLS to R Studio took about 70 seconds. If you are interested in working with R directly inside ADLS instead, you have to use R inside U-SQL, which I am going to talk about in the next post!
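If you want to measure this for your own datasets, wrapping the download in system.time() gives a rough number; a small sketch reusing the same iris.csv call as above:

timing <- system.time({
  r <- httr::GET("https://<datalake name>.azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv?op=OPEN&read=true",
                 add_headers(Authorization = paste(res$token_type, res$access_token)))
  Dataforiris <- content(r)
})
timing["elapsed"]   # elapsed seconds for the download and parse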
Reference: https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/03/14/using-r-to-perform-filesystem-operations-on-azure-data-lake-store/
How do you access the R Scripting Window you see examples of in your post?
Hi Giorgio, I wrote the code inside R Studio, and from R Studio I used R scripts to access the files in the ADLS environment, so all the code should be run inside R Studio.
I find the current “R inside U-SQL” workflow cumbersome. Is there a possibility to do all the data processing from within R, similar to what we’d do with the Azure DW, i.e., emit dplyr queries that get translated in the background to whatever U-SQL + R is needed?
Good suggestion, I will check it. If I find a solution, I will write a post. Thanks for the suggestion!