Data pipelines serve as the backbone of efficient data movement and transformation. They define a sequence of activities that work together to orchestrate an end-to-end process, typically involving data extraction from various sources, transformation into a suitable format, and loading into a target destination. Pipelines are indispensable for automating Extract, Transform, and Load (ETL) processes, efficiently channeling transactional data from operational data stores into analytical data stores like lakehouses or data warehouses.
Familiarity for Azure Data Factory Users
If you’ve worked with Azure Data Factory, you’ll find data pipelines in Microsoft Fabric to be a natural extension of your existing knowledge. They share the same architectural foundation, utilizing connected activities to define a process that encompasses a wide range of data processing tasks and control flow logic. You have the flexibility to run pipelines interactively within the Microsoft Fabric user interface or schedule them for automated execution.
Key Takeaways
- Data pipelines streamline data movement and transformation processes.
- They are pivotal in automating ETL workflows, essential for populating analytical data stores.
- Microsoft Fabric’s data pipelines mirror the architecture and capabilities of Azure Data Factory, providing a familiar experience for existing users.
- You can execute pipelines interactively or schedule them for unattended runs.
Pipelines: The Core of Data Orchestration
Pipelines in Microsoft Fabric encapsulate a series of activities that perform data movement and processing tasks. You can leverage a pipeline to define data transfer and transformation activities, and then orchestrate these activities through control flow activities that manage branching, looping, and other essential processing logic. The intuitive graphical pipeline canvas in the Fabric user interface empowers you to build intricate pipelines with minimal or even no coding required.
Core Pipeline Concepts
Before diving into building pipelines in Microsoft Fabric, let’s review a few fundamental concepts.
Activities: The Building Blocks
Activities represent the executable tasks within a pipeline. You can establish a flow of activities by connecting them sequentially. The outcome of an activity—success, failure, or completion—can be used to direct the flow to the subsequent activity in the sequence.
There are two primary categories of activities in a pipeline:
- Data Transformation Activities: These activities encapsulate data transfer operations. This includes straightforward Copy Data activities that extract data from a source and load it into a destination, as well as more sophisticated Data Flow activities that encapsulate dataflows (Gen2) to apply transformations during data transfer. Other notable data transformation activities include Notebook activities for running Spark notebooks, Stored Procedure activities for executing SQL code, Delete Data activities for removing existing data, and more.
- Control Flow Activities: These activities enable you to implement loops and conditional branching, and to manage variable and parameter values. The extensive range of control flow activities empowers you to craft complex pipeline logic to orchestrate data ingestion and transformation workflows, as the sketch after this list illustrates.
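To make the idea of chained activities concrete, here is a minimal, illustrative sketch of how a pipeline and its dependency conditions might be expressed as a JSON-style definition, written as a Python dictionary. The structure follows the Azure Data Factory schema; the exact schema Fabric uses may differ, and the activity names and type strings are hypothetical.

```python
# A minimal sketch of chained pipeline activities expressed as a JSON-style
# definition (shown as a Python dict). The layout follows the Azure Data
# Factory schema; Fabric's exact schema may differ, and all names and
# activity type strings here are illustrative assumptions.
pipeline_definition = {
    "name": "ingest_sales_data",
    "activities": [
        {
            "name": "DeleteOldFiles",
            "type": "Delete",  # Delete Data activity: clear previously ingested files
        },
        {
            "name": "CopySalesData",
            "type": "Copy",    # Copy Data activity: extract from the source, load into the lakehouse
            "dependsOn": [
                # Run only if the delete step succeeded.
                {"activity": "DeleteOldFiles", "dependencyConditions": ["Succeeded"]}
            ],
        },
        {
            "name": "TransformSalesData",
            "type": "Notebook",  # Notebook activity: run Spark code against the ingested file
            "dependsOn": [
                {"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}
            ],
        },
    ],
}
```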
Tip: For comprehensive details about the complete set of pipeline activities offered in Microsoft Fabric, consult the “Activity overview” section in the Microsoft Fabric documentation.
https://learn.microsoft.com/en-us/fabric/data-factory/activity-overview
Parameters: Enhancing Reusability
Pipelines can be parameterized, allowing you to provide specific values that are used each time a pipeline is executed. For instance, you might want a pipeline to save ingested data in a folder, but you need the flexibility to specify a different folder name with each pipeline run.
Utilizing parameters boosts the reusability of your pipelines, enabling you to create adaptable data ingestion and transformation processes.
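As a sketch of what this looks like in practice, the JSON-style fragment below (again written as a Python dictionary) declares a folderName parameter and references it with the @pipeline().parameters expression syntax used by Azure Data Factory; the same syntax is assumed to apply here, and the parameter name, default value, and sink path are hypothetical.

```python
# Illustrative sketch of a parameterized pipeline. The
# @pipeline().parameters.<name> expression syntax comes from Azure Data
# Factory and is assumed to behave the same way here; the parameter name,
# default value, and sink path are hypothetical.
parameterized_pipeline = {
    "name": "ingest_to_folder",
    "parameters": {
        "folderName": {"type": "String", "defaultValue": "landing"},
    },
    "activities": [
        {
            "name": "CopyToFolder",
            "type": "Copy",
            "sink": {
                # Resolved at run time from the value supplied for this run.
                "folderPath": "@pipeline().parameters.folderName",
            },
        },
    ],
}
```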
Pipeline Runs: Tracking Execution
Every time a pipeline is executed, a data pipeline run is initiated. You can trigger runs on-demand within the Fabric user interface or schedule them to start at a specified frequency. The unique run ID associated with each execution allows you to review run details, confirm successful completion, and investigate the specific settings used.
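For automation scenarios, runs can also be triggered programmatically. The sketch below starts an on-demand pipeline run through the Fabric REST API’s job scheduler endpoint; treat the endpoint and authentication details as assumptions to verify against the current Fabric REST API documentation, and note that the workspace ID, pipeline item ID, and token are placeholders.

```python
# A minimal sketch of triggering an on-demand pipeline run through the
# Fabric REST API job scheduler endpoint. Verify the endpoint and any
# required payload against the current Fabric REST API documentation.
import requests

WORKSPACE_ID = "<workspace-guid>"      # hypothetical placeholder
PIPELINE_ITEM_ID = "<pipeline-guid>"   # hypothetical placeholder
TOKEN = "<aad-access-token>"           # obtain via Microsoft Entra ID (e.g., MSAL) in practice

url = (
    f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
    f"/items/{PIPELINE_ITEM_ID}/jobs/instances?jobType=Pipeline"
)

response = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
response.raise_for_status()

# The Location header points at the job instance created for this run,
# which can be polled to confirm completion and review the settings used.
print("Run accepted:", response.headers.get("Location"))
```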
By understanding these core concepts, you’re well on your way to harnessing the power of pipelines in Microsoft Fabric to efficiently manage your data workflows.
Copying Data: A Pipeline Essential
The Copy Data activity is a cornerstone of data pipelines, often forming the core of data ingestion processes. Many pipelines consist solely of a single Copy Data activity that efficiently transfers data from an external source into a lakehouse file or table.
Moreover, you can combine the Copy Data activity with other activities to create a repeatable data ingestion process. For instance, you might use a Delete Data activity to remove existing data, followed by a Copy Data activity to replace it with a file containing data from an external source. Subsequently, a Notebook activity could run Spark code to transform the data in the file and load it into a table, showcasing the power of chaining activities for a streamlined workflow.
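To illustrate the final step of such a chain, here is a minimal PySpark sketch of the kind of logic the Notebook activity might run: read the file the Copy Data activity produced, clean it up, and load it into a lakehouse table. The file path, column names, and table name are hypothetical.

```python
# A minimal PySpark sketch of notebook logic that could follow a Copy Data
# step: read the ingested file, apply simple transformations, and load the
# result into a lakehouse table. Paths, columns, and the table name are
# illustrative placeholders; `spark` is the SparkSession that Fabric
# notebooks provide automatically.
from pyspark.sql import functions as F

raw = spark.read.csv("Files/landing/sales.csv", header=True, inferSchema=True)

cleaned = (
    raw.dropDuplicates()
       .withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("amount") > 0)  # discard obviously invalid rows
)

# Overwrite the table so the pipeline run is repeatable.
cleaned.write.mode("overwrite").saveAsTable("sales_orders")
```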
The Copy Data Tool: Your Configuration Guide
Adding a Copy Data activity to a pipeline initiates a graphical tool that walks you through configuring the data source and destination for the copy operation. This tool supports a wide range of source connections, enabling you to ingest data from most commonly used sources.
Copy Data Activity Settings: Fine-Tuning the Process
Once you’ve added a Copy Data activity to your pipeline, you can select it on the pipeline canvas and access its settings in the pane below. These settings allow you to fine-tune aspects like data mapping, column selection, and error handling.
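As an example of the kind of mapping these settings capture, the fragment below sketches source-to-sink column mappings in the tabular translator format used by Azure Data Factory’s Copy activity; the Fabric settings pane expresses the same intent graphically, and the column names here are hypothetical.

```python
# Illustrative sketch of Copy Data column mapping, expressed in the Azure
# Data Factory "TabularTranslator" format; whether Fabric stores mappings
# in exactly this shape is an assumption, and the column names are made up.
copy_settings = {
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": "OrderID"},   "sink": {"name": "order_id"}},
            {"source": {"name": "OrderDate"}, "sink": {"name": "order_date"}},
            {"source": {"name": "Amount"},    "sink": {"name": "amount"}},
        ],
    }
}
```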
When to Use the Copy Data Activity
Opt for the Copy Data activity when you need to copy data directly between a supported source and destination without applying any transformations. It’s also suitable when you want to import raw data and apply transformations later in the pipeline using other activities.
Alternatives for Transformation and Merging
If you need to apply transformations during data ingestion or merge data from multiple sources, consider utilizing a Data Flow activity to execute a dataflow (Gen2). You can leverage the Power Query user interface to define a dataflow (Gen2) that incorporates multiple transformation steps and seamlessly integrate it into your pipeline.
Tip: To delve into using Dataflow (Gen2) for data ingestion in Microsoft Fabric, consider exploring the “Ingest Data with Dataflows Gen2 in Microsoft Fabric” module.
https://learn.microsoft.com/en-us/training/modules/use-dataflow-gen-2-fabric/
By understanding the Copy Data activity and its appropriate use cases, you can make informed decisions about how to structure your data ingestion pipelines effectively. In the upcoming sections, we’ll dive into other essential activities and control flow mechanisms that contribute to the power and flexibility of Microsoft Fabric pipelines.
Streamlining Common Scenarios
While you have the freedom to define pipelines from any combination of activities to tailor your data ingestion and transformation processes, Microsoft Fabric offers a collection of predefined pipeline templates for frequently encountered scenarios. These templates provide a starting point that you can customize to align with your specific requirements, saving you time and effort in pipeline development.
Creating a Pipeline from a Template
To use a template, select the “Choose a task to start” tile when creating a new pipeline. This opens a gallery of pipeline templates. Choose the template that best suits your needs, then edit the pipeline on the canvas to tailor it to your specific data processing objectives.
Key Takeaway
Pipeline templates in Microsoft Fabric accelerate the creation of common data pipelines, providing a convenient starting point for customization and adaptation. This feature enhances efficiency and productivity, allowing you to focus on refining the pipeline to meet your specific use case.
This blog post is based on information and concepts derived from the Microsoft Learn module titled “Use Data Factory pipelines in Microsoft Fabric.” The original content can be found here:
https://learn.microsoft.com/en-us/training/modules/use-data-factory-pipelines-fabric/.
