In a previous post I explained what Azure Data Factory is, along with the tools and requirements for this service. In this post I want to go through a simple demo of Data Factory, so you get an idea of how a Data Factory project is built, developed, and scheduled to run. You may see some components of Azure Data Factory in this post that you don't fully understand, but don't worry, I'll go through them in future posts.
To recap the previous post: Azure Data Factory is a Microsoft Azure service that ingests data from data sources, applies compute operations to the data, and loads it into a destination. The main purpose of Data Factory is data ingestion, and that is the big difference between this service and ETL tools such as SSIS (I'll go through the differences between Data Factory and SSIS in a separate blog post). With Azure Data Factory you can:
- Access data sources such as on-premises SQL Server, SQL Azure, and Azure Blob storage
- Apply data transformations through Hive, Pig, and C#
- Monitor the data pipeline, validation, and execution of scheduled jobs
- Load data into desired destinations such as on-premises SQL Server, SQL Azure, and Azure Blob storage
- And last but not least: this is a cloud-based service
In this post (and a few later ones) I'll walk through a simple demo (Hello World!) for Data Factory. In this demo we want to extract data from some CSV files (on Azure Blob Storage) and load it intact (without applying any transformation) into a SQL Server database destination.
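To make the idea of a copy without transformation concrete, here is a minimal local sketch in Python. It is not Data Factory code: an in-memory string stands in for a CSV file on Azure Blob Storage, and an in-memory SQLite database stands in for the SQL Server destination. The function name, table name, column names, and sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on Azure Blob Storage.
SAMPLE_CSV = "id,name\n1,Alice\n2,Bob\n"


def copy_csv_to_table(csv_text, connection, table):
    """Load every CSV row into `table` unchanged and return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first line holds the column names
    columns = ", ".join('"{}"'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    rows = list(reader)
    # Rows are inserted exactly as read: no filtering, no type conversion.
    connection.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), rows
    )
    connection.commit()
    return len(rows)


# SQLite in memory stands in for the SQL Server destination (an assumption).
destination = sqlite3.connect(":memory:")
copied = copy_csv_to_table(SAMPLE_CSV, destination, "people")
loaded = destination.execute('SELECT id, name FROM "people"').fetchall()
```

The point of the sketch is only that the destination ends up with the same rows the source had. In Data Factory the same job is described declaratively and run by the service, rather than written as a script.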
Step 0: Azure Subscription
To work with Azure Data Factory you first need an Azure subscription. A one-month trial subscription is available with $250 worth of credit nowadays. You need to log in to the Azure Portal with your username and password. Please note that you should use the new Azure Portal (in Preview mode nowadays), as the old Azure management portal doesn't support Data Factory. When you log in to the new Azure Portal (https://portal.azure.com/) you will see a screen like this:
If this is the first time you have logged in to the Azure portal, take a minute to familiarize yourself with the first screen. Fortunately, it is designed to be user friendly. There is a place for billing information, a link to the old Azure portal, portal settings, and some dashboards for service health, plus links to help and to browse services. The left pane is also self-explanatory and doesn't require further description.
Step 1: New Azure Data Factory
Click on New, and then under Data Analytics, click on Data Factory. The [Preview] label in this image means the service is still in preview and is not a final release yet.
When you create a new Data Factory you have to fill in a few settings:
– Name of the Factory: the name should contain only letters, digits, or hyphens. No spaces are allowed.
Name the data factory for this example: RADACAD-Simple-Copy
– Resource Group Name: a resource group is a container for a collection of Azure resources. For example, you might use multiple Azure services for one application, so creating a resource group for that application helps you manage all of its Azure services in one place.
I would like to have one resource group for all the Azure Data Factory examples in this blog, so I create a new resource group and name it: RADACAD-ADF
– Subscription Name: if you have multiple subscriptions you can choose one of them; otherwise your only subscription will be selected by default.
– Region Name
Note: Pin to Startboard will put this data factory on the first page of your Microsoft Azure Portal. As we will work with this example over several posts, leave this box checked.
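If you want to check a candidate factory name before typing it into the portal, the naming rule described above can be sketched in a few lines of Python. This only encodes the rule as stated in this post (letters, digits, and hyphens, no spaces); the portal may enforce additional constraints that are not covered here, and the function name is my own.

```python
import re

# Assumed rule, taken from the description in this post:
# one or more letters, digits, or hyphens, and nothing else.
NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_factory_name(name):
    """Return True if `name` uses only letters, digits, and hyphens."""
    return NAME_PATTERN.fullmatch(name) is not None


ok = is_valid_factory_name("RADACAD-Simple-Copy")
with_spaces = is_valid_factory_name("RADACAD Simple Copy")
```

Here `ok` is `True` for the name used in this example, while `with_spaces` is `False` because of the spaces.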
Click on Create. You will be redirected to the start page of the Azure portal, with an icon showing that the data factory is being created.
When creation completes successfully (or even if it fails for some reason) you will see a message in the Notifications section.
Step 2: Opening Data Factory
After the data factory is created you will be redirected to the RADACAD-Simple-Copy data factory home page. On this page you will see the name of the factory with a summary of its components. A data factory usually contains Linked Services, Datasets, and Pipelines. As this data factory has just been created, it shows zero components in all sections.
- Linked Services are connections to data sources or destinations; they might be on Azure or on-premises.
- Datasets are the views of data that we work with throughout the data ingestion process; datasets are usually connected to Linked Services.
- Pipelines: a pipeline defines the activities that are applied to the source dataset and transfer it to the destination dataset. A pipeline may contain multiple datasets and multiple activities.
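The relationship between these three components can be pictured with a small Python model. This is only a sketch of how the pieces refer to each other, not the format Data Factory itself uses to define them; all class names, field names, and sample values below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedService:
    """A connection to a data source or destination."""
    name: str
    location: str  # "azure" or "on-premises"


@dataclass
class Dataset:
    """A view of data, connected to a linked service."""
    name: str
    linked_service: LinkedService


@dataclass
class Activity:
    """One operation that moves data from a source to a destination dataset."""
    name: str
    source: Dataset
    destination: Dataset


@dataclass
class Pipeline:
    """A named group of activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def datasets(self):
        """Return the names of all datasets the pipeline touches, without duplicates."""
        names = []
        for activity in self.activities:
            for dataset in (activity.source, activity.destination):
                if dataset.name not in names:
                    names.append(dataset.name)
        return names


# Sample values are assumptions; the SQL Server location is a guess for illustration.
blob = LinkedService("BlobStore", "azure")
sql = LinkedService("SqlServer", "on-premises")
csv_files = Dataset("CsvFiles", blob)
sql_table = Dataset("SqlTable", sql)
pipeline = Pipeline("SimpleCopy", [Activity("CopyCsvToSql", csv_files, sql_table)])
```

Reading it bottom up: a pipeline holds activities, each activity points at a source and a destination dataset, and each dataset points at the linked service that knows how to reach the actual store.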
There is an Azure Data Factory Editor for creating and editing components (Linked Services, Datasets, and Pipelines). You can open it by clicking on Author and Deploy under Summary. We will go through this in the next post. The Data Factory that we created in this example is only a container so far; in the next post we will add linked services to it.