Transform data using a Mapping Data Flow in Azure Data Factory

In the previous articles, Copy data between Azure data stores using Azure Data Factory and Copy data from On-premises
data store to an Azure data store using Azure Data Factory
, we saw how we can use the Azure Data Factory to copy data
between different data stores located in an on-premises machine or in the cloud.

In this article, we will show how we can use the Azure Data Factory to transform a specific data set.

Data Transformation Overview

Azure Data Factory supports various data transformation activities. These activities include:

  • Mapping data flow activity: Visually designed data transformation that allows you to design a graphical data
    transformation logic without the need to be an expert developer. The mapping data flow will be executed as an
    activity within the Azure Data Factory pipeline on an ADF fully managed scaled-out Spark cluster
  • Wrangling data flow activity: A code-free data preparation activity that integrates with Power Query Online in order
    to make the Power Query M functions available for data wrangling using spark execution
  • HDInsight Hive activity: Allows you to run Hive queries on your own or on-demand HDInsight cluster
  • HDInsight Pig activity: Allows you to run Pig queries on your own or on-demand HDInsight cluster
  • HDInsight MapReduce activity: Allows you to run MapReduce programs on your own or on-demand HDInsight cluster
  • HDInsight Streaming activity: Allows you to run Hadoop Streaming programs on your own or on-demand HDInsight
    cluster
  • HDInsight Spark activity: Allows you to run Spark programs on your own HDInsight cluster
  • Machine Learning activities: Allows you to use a published Azure Machine Learning web service for predictive
    analytics
  • Stored procedure activity: Allows you to execute a stored procedure in the Azure relational data platforms such as
    Azure SQL DB and Azure SQL Data Warehouse
  • Data Lake Analytics U-SQL activity: Allows you to run a U-SQL script on an Azure Data Lake Analytics cluster
  • Databricks Notebook activity: Allows you to run a Databricks notebook in your Azure Databricks workspace
  • Databricks Jar activity: Allows you to run a Spark Jar in your Azure Databricks cluster
  • Databricks Python activity: Allows you to run a Python file in your Azure Databricks cluster
  • Custom activity: Allows you to define your own data transformation logic in Azure Data Factory

Compute environments

Azure Data Factory supports two compute environments to execute the transform activities. The On-demand compute environment, on which the computing environment is fully managed by the Data
factory, where the cluster will be created to execute the transform activity and removed automatically when the activity is completed. The second approach is Bring Your Own environment, where the compute
environment is managed by both, you and the Data factory.

Transform Data with Mapping Data Flows

Mapping Data Flows activity can be created individually or within an Azure Data Factory pipeline. In this demo, and in
order to test the Data Flow activity execution, we will create a new pipeline and create a Data Flow activity to be
executed inside that pipeline.

First, you need to open the Azure Data Factory using the Azure portal, then click on Author & Monitor option. From the opened Data Factory, click on the Author button then click on the
plus sign to add a New pipeline, as shown below:

New Pipeline

From the Pipeline design window, provide a unique name for the pipeline and drag then drop the Data Flow activity to
the design space and click on the Data flow activity, as below:

Add Data Flow

In the New Mapping dataflow window, choose to add a Data flow type, then click OK, as shown below:

Mapping Data Flow type

The displayed Mapping Data Flow authoring canvas consists of two main parts: the graph that displays the
transformation stream, and the configuration panel that shows the settings specific to the currently selected
transformation.

The purpose of this Data Flow activity is to read data from an Azure SQL Database table and calculate the average
value of the users’ age then save the result to another Azure SQL Database table.

So, the first step is to specify a name for the source stream and the dataset that points to the source data. In this
demo, we will use the dataset that points to the Azure SQL Database table, as shown below:

Data Flow source

Opening the selected dataset, you can reconfigure the table name and validate the source data store connection and
data, as shown below:

Preview the Source data

Under the Source options, you can specify whether to read data from the source table or provide a query to filter the
source data, in addition to the ability to override the default read batch size and the isolation level of the read
transaction, as shown below:

Source Options

Also, you can choose the proper data partitioning option that can help to optimize the data reading and transformation
process, as below:

Optimize data read

Under the Inspect tab, you can check the metadata of the data stream that you’re transforming. This
includes the column counts, the columns changed, the columns added, data types, the column order, and column
references, as shown below:

Data Inspect

Now the data source is ready. We need to add a new stream to calculate the average value of the source data. To
achieve that, click on the plus sign in the graph area and choose to add Aggregate stream, as shown
below:

Add Aggregate

In the Aggregate stream settings, provide a meaningful name for the stream, change the stream from Group By to
Aggregates function, and provide the name of the column that will be used in the aggregate function
then click to open the expression builder, as shown below:

Aggregate settings

From the opened visual expression builder, specify the expression of the Average aggregate function that will be used
to calculate the average value of the users’ age, as shown below:

Build Aggregate function

With the source data and the aggregate function streams configured successfully, we will add a sink to store the
transform function result. This can be achieved by clicking on the plus button and choose to add a Sink stream, as
shown below:

Add sink

From the Sink stream settings, providing an indicative name for that stream and configure the dataset to point to the
same Azure SQL Database, but to store the aggregate function result in another table, as shown below:

Configure the sink dataset

Data Flow activity is configured successfully now. Before publishing the pipeline changes, validate the pipeline then
run it using the Debug mode, where the Data Flow Activity will be executed manually under that pipeline, as shown
below:

Debug pipeline

When the pipeline execution is completed, review the output tab and verify that it is executed successfully with no
issue. As you are still in the debug mode, you can fix any issue with the data sources, sinks and the transform
function within the Data Flow activity, as shown below:

Pipeline executed successfully

After verifying the pipeline components and execution result, you can easily click on the Publish All button to
publish the pipeline to the production environment. You can see from the snapshot below that the created pipeline, the
datasets and the Data Flow will be published to the current Azure Data factory, where you can easily create a trigger
and schedule it to be orchestrated based on the business requirements, as shown below:

Publish the pipeline

Take into consideration that, after executing the pipeline, you can click on the Sink stream and click on the Data
Preview option to validate the transform function result, which is the average value for the users’ ages, as shown
below:

Sink Data Preview

The same result can be verified by connecting to the Azure SQL Database using the SQL Server Management Studio (SSMS)
tool and browse for the new table then run the SELECT statement below, and you can see that this value stored in the
table matches the value verified from the sink data preview:

Azure SQL DB result

Conclusion

In this article, we went through a simple demo of how to use the Mapping Data Flow activity to transform your data and
store the result to be used later.

In the next article, we will show how to run an SSIS package in Azure Data factory. Stay tuned!

Table of contents

Ahmad Yaseen
Latest posts by Ahmad Yaseen (see all)

Author: admin

Leave a Reply

Your email address will not be published.