Ingest Data with Dataflows Gen2 in Microsoft Fabric

Microsoft Fabric presents a comprehensive platform for data engineering, integration, and analytics. A fundamental stage in any end-to-end analytics journey is data ingestion, and that’s where Dataflows Gen2 shine. These powerful tools facilitate the ingestion and transformation of data from diverse sources, ultimately landing the cleansed data into a designated destination.

Dataflows Gen2 seamlessly integrate into data pipelines for further data movement and can also serve as a valuable data source for Power BI.

Real-World Scenario

Let’s consider a retail company with a global presence. As a data engineer, your task is to prepare and transform data originating from various stores into a format suitable for analysis and reporting. The business requires a semantic model that consolidates these disparate data sources.

Dataflows Gen2 come to the rescue, enabling you to ensure data consistency through preparation and then stage the data in the desired destination. Moreover, they promote reusability and simplify data updates. Without dataflows, you’d face the tedious and error-prone task of manually extracting and transforming data from every source.

Key Takeaways

  • Gain a comprehensive understanding of Dataflows Gen2 and their role in Microsoft Fabric.
  • Learn how to ingest and transform data from diverse sources using Dataflows Gen2.
  • Explore how to land cleansed data into a target destination for further analysis or reporting.
  • Discover the benefits of Dataflows Gen2 in terms of reusability and data update management.

Understand Dataflows Gen2 in Microsoft Fabric

In our retail scenario, you need to develop a standardized semantic model that empowers your business users to access and analyze data effectively. Dataflows Gen2 are your key to achieving this. By utilizing Dataflows Gen2, you can connect to diverse data sources, meticulously prep and transform the data, and then land the refined data directly into your Lakehouse or employ a data pipeline for other destinations, ensuring seamless access for your business.

What is a Dataflow?

Dataflows are essentially cloud-based ETL (Extract, Transform, Load) tools designed to construct and execute scalable data transformation processes.

Dataflows Gen2, specifically, enable you to:

  • Extract data from a wide array of sources.
  • Transform data using a rich collection of transformation operations through the Power Query Online visual interface.
  • Load transformed data into a new table, incorporate it into a Data Pipeline, or offer it as a curated data source for data analysts.

At its core, a dataflow encapsulates all the necessary transformations, streamlining data preparation and enabling efficient loading into various destinations.
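To make that concrete, here is a minimal sketch of what a single dataflow query looks like in Power Query M, the language that underpins every dataflow. The source URL and column names are hypothetical placeholders, not part of the Learn module:

```m
// A minimal dataflow query: extract, then transform, ready to load.
let
    // Extract: read a CSV file from a web source (placeholder URL)
    Source = Csv.Document(
        Web.Contents("https://example.com/sales.csv"),
        [Delimiter = ",", Encoding = 65001]
    ),
    // Promote the first row to column headers
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Transform: set column data types
    ChangedTypes = Table.TransformColumnTypes(
        PromotedHeaders,
        {{"OrderDate", type date}, {"Amount", type number}}
    ),
    // Transform: keep only completed orders
    FilteredRows = Table.SelectRows(ChangedTypes, each [Status] = "Completed")
in
    FilteredRows
```

The query's output can then be written to a data destination, or left staged for a pipeline to pick up.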

How to Use Dataflows Gen2

Traditionally, data engineers invest significant time in extracting, transforming, and loading data into a format suitable for downstream analytics. Dataflows Gen2 aim to alleviate this burden by offering a user-friendly and reusable approach to ETL tasks using Power Query Online.

Let’s consider the alternatives:

  • Data Pipeline Only: With this approach, you copy data and then employ your preferred coding language for the extraction, transformation, and loading steps.
  • Dataflow Gen2 + Data Pipeline: Here, you first create a Dataflow Gen2 to handle the extraction and transformation, and then optionally load the data into a Lakehouse or other destinations. This creates a curated semantic model readily available for business consumption.

Adding a data destination to your dataflow is optional, as the dataflow preserves all transformation steps. To perform further tasks or load data to a different destination post-transformation, create a Data Pipeline and incorporate the Dataflow Gen2 activity into your orchestration.

Another approach is to use a Data Pipeline and Dataflow Gen2 for an ELT (Extract, Load, Transform) process. In this sequence, you’d utilize a Pipeline to extract and load the data into your chosen destination, such as the Lakehouse. Subsequently, you’d create a Dataflow Gen2 to connect to the Lakehouse data, enabling cleansing and transformation. This Dataflow can then be offered as a curated semantic model for data analysts to build reports upon.
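As a rough sketch, the transform step of that ELT pattern might be an M query that reads a table the pipeline has already landed in the Lakehouse. The workspace, lakehouse, and table names below are hypothetical, and the exact navigation record fields are generated by the Lakehouse connector when you select a table in Power Query Online:

```m
// The "T" step of an ELT flow: a Dataflow Gen2 query reading a table
// that a pipeline already loaded into the Lakehouse.
// All names are placeholders; the connector generates the actual
// navigation steps when you pick the table.
let
    Source = Lakehouse.Contents(null),
    Workspace = Source{[workspaceName = "RetailAnalytics"]}[Data],
    Store = Workspace{[lakehouseName = "StoresLakehouse"]}[Data],
    RawSales = Store{[Id = "raw_sales", ItemKind = "Table"]}[Data],
    // Cleanse: drop duplicate orders and standardize a column name
    Deduplicated = Table.Distinct(RawSales, {"OrderID"}),
    Renamed = Table.RenameColumns(Deduplicated, {{"store_id", "StoreID"}})
in
    Renamed
```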

Dataflows can also be horizontally partitioned: once you create a global dataflow, data analysts can build on it to create specialized semantic models tailored to specific needs.

Dataflows promote reusable ETL logic, eliminating the necessity to create multiple connections to your data source. They provide a wide variety of transformations and can be executed manually, on a refresh schedule, or as part of a Data Pipeline orchestration.

Benefits and Limitations

While there are multiple paths to ETL or ELT data in Microsoft Fabric, let’s weigh the advantages and limitations of using Dataflows Gen2.

Benefits:

  • Extend data: Enrich data with consistent elements like a standard date dimension table (see the sketch after this list).
  • Self-service enablement: Grant self-service users access to a specific subset of the data warehouse.
  • Performance optimization: Extract data once for reuse, reducing refresh times for slower sources.
  • Simplified data source complexity: Expose only dataflows to larger analyst groups.
  • Data consistency and quality: Allow users to clean and transform data before loading.
  • Simplified data integration: Provide a low-code interface for ingesting data from diverse sources.
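
As one example of the "extend data" benefit, a standard date dimension can be built entirely in M and reused across dataflows. This is a minimal sketch; the date range is chosen arbitrarily:

```m
// A reusable date dimension query built entirely in M.
let
    StartDate = #date(2020, 1, 1),
    EndDate = #date(2025, 12, 31),
    // Generate one row per day in the range
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    Dates = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
    // Add the usual calendar attributes
    WithYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    WithMonth = Table.AddColumn(WithYear, "Month", each Date.Month([Date]), Int64.Type),
    WithMonthName = Table.AddColumn(WithMonth, "MonthName", each Date.MonthName([Date]), type text)
in
    WithMonthName
```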

Limitations:

  • Not a data warehouse replacement: Dataflows complement, but don’t replace, a data warehouse.
  • Row-level security not supported: Fine-grained row-level security is currently unavailable.
  • Fabric capacity workspace required: A Fabric capacity workspace is necessary to utilize Dataflows Gen2.

By carefully considering these factors, you can make informed decisions about incorporating Dataflows Gen2 into your data integration strategy within Microsoft Fabric.


Explore Dataflows Gen2 in Microsoft Fabric

Within Microsoft Fabric, you have the flexibility to create a Dataflow Gen2 in various locations, including the Data Factory workload, Power BI workspace, or directly within the lakehouse. Given our focus on data ingestion, let’s delve into the Data Factory workload experience. Dataflows Gen2 harness the power of Power Query Online to visualize transformations. Let’s take a closer look at the interface:

Dataflows Gen2 interface in Microsoft Fabric
  1. Power Query Ribbon: Dataflows Gen2 support a wide range of data source connectors, encompassing common sources like cloud and on-premises relational databases, Excel or flat files, SharePoint, Salesforce, Spark, and naturally, Fabric Lakehouses. You’ll find an extensive array of data transformations at your disposal, including:
    • Filter and Sort rows
    • Pivot and Unpivot
    • Merge and Append queries
    • Split and Conditional split
    • Replace values and Remove duplicates
    • Add, Rename, Reorder, or Delete columns
    • Rank and percentage calculations
    • Top N and Bottom N
  2. Queries Pane: This pane displays your various data sources, now referred to as queries. Options like renaming, duplicating, referencing, and enabling staging are available for managing your queries.
  3. Diagram View: The Diagram View provides a visual representation of how your data sources are connected and the transformations applied, aiding in understanding the dataflow’s structure.
  4. Data Preview Pane: This pane showcases a subset of your data, allowing you to visualize the impact of your transformations. You can interact with the preview by dragging and dropping columns to rearrange them or right-clicking on columns to apply filters or make modifications.
  5. Query Settings Pane: The Query Settings pane primarily features Applied Steps. Each transformation you perform is linked to a step, some of which are automatically applied upon connecting to the data source. The complexity of your transformations will determine the number of applied steps for each query.

While this visual interface is invaluable, you can also access the underlying M code through the Advanced Editor for more granular control.
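For instance, a query with a handful of applied steps might look like this in the Advanced Editor, where each `let` binding corresponds to one step in the Query Settings pane. The server, database, and column names here are hypothetical:

```m
// What Applied Steps look like as M in the Advanced Editor.
let
    Source = Sql.Database("contoso.database.windows.net", "RetailDB"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // "Filtered rows" step
    FilteredRows = Table.SelectRows(Orders, each [Amount] > 0),
    // "Sorted rows" step
    SortedRows = Table.Sort(FilteredRows, {{"OrderDate", Order.Descending}}),
    // "Removed duplicates" step
    RemovedDuplicates = Table.Distinct(SortedRows, {"OrderID"}),
    // "Renamed columns" step
    RenamedColumns = Table.RenameColumns(RemovedDuplicates, {{"cust_id", "CustomerID"}})
in
    RenamedColumns
```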

Within the Query Settings pane, you’ll find the Data Destination field, where you can designate the Lakehouse as the destination for your transformed data.

By familiarizing yourself with the Dataflows Gen2 interface, you’ll be well-equipped to navigate and utilize its powerful features for data ingestion and transformation within Microsoft Fabric.

This blog post is based on information and concepts derived from the Microsoft Learn module titled “Ingest Data with Dataflows Gen2 in Microsoft Fabric.” The original content can be found here:
https://learn.microsoft.com/en-us/training/modules/use-dataflow-gen-2-fabric/

