Data Warehouses in Microsoft Fabric

Relational data warehouses have long been the central pillar of enterprise business intelligence (BI) solutions. While specific implementations might differ, a common pattern based on a denormalized, multidimensional schema has become the standard design for these data warehouses.

Microsoft Fabric’s Modern Data Warehouse

Microsoft Fabric’s data warehouse represents a contemporary evolution of the traditional data warehouse concept. It centralizes and organizes data from disparate departments, systems, and databases, consolidating them into a single, unified view for analysis and reporting. Fabric’s data warehouse offers full SQL semantics, empowering users to insert, update, and delete data within its tables.

What sets Fabric’s data warehouse apart is its foundation on the Lakehouse, which employs the Delta format and is readily queryable using SQL. This design caters to the needs of the entire data team, not just data engineers.

Overcoming Traditional Challenges

Traditional data warehouses often encounter challenges like:

  • Data Silos: Data scattered across various systems, hindering a holistic view.
  • Complex ETL Processes: Time-consuming and error-prone data extraction, transformation, and loading.
  • Limited Collaboration: Difficulty for different teams to work together on the data warehouse.
  • Performance Bottlenecks: Query performance issues, especially with large datasets.

Fabric’s data warehouse experience directly addresses these hurdles. It fosters collaboration among data engineers, analysts, and data scientists, enabling them to create and query a data warehouse tailored to their specific needs.


The Modern Data Warehouse Process

Building a modern data warehouse typically involves the following stages:

  1. Data Ingestion: Moving data from source systems into the data warehouse.
  2. Data Storage: Storing the data in a format optimized for analytics.
  3. Data Processing: Transforming the data into a format ready for consumption by analytical tools.
  4. Data Analysis and Delivery: Analyzing the data to extract insights and presenting those insights to the business.

Microsoft Fabric empowers data engineers and analysts to accomplish all these stages within a single tool, offering both low-code and traditional experiences.

Fabric’s Data Warehouse Experience

Fabric’s data warehouse is a relational data warehouse that supports the complete range of transactional T-SQL capabilities you’d expect from an enterprise-grade solution. It’s fully managed, scalable, and highly available, designed to store and query data within the Lakehouse. You have full control over creating tables, loading, transforming, and querying data using either the Fabric portal or T-SQL commands. This flexibility allows you to use SQL for data analysis or leverage Spark for data processing and machine learning model creation.

Furthermore, Fabric’s data warehouse fosters collaboration between data engineers and data analysts by providing a shared workspace. Data engineers can construct a relational layer on top of the Lakehouse data, while analysts can seamlessly explore and analyze this data using T-SQL and Power BI.

Designing a Data Warehouse

Like any relational database, Fabric’s data warehouse relies on tables to store data for future analysis. Typically, these tables are organized using a schema optimized for multidimensional modeling. In this approach, numerical data associated with events (e.g., sales orders) is grouped based on various attributes (e.g., date, customer, store). This structure allows for analyzing metrics like the total amount paid for sales orders on a specific date or at a particular store.
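To make this concrete, here is a minimal query sketch against such a schema. The table and column names (FactSalesOrder, DimDate, DimStore, TotalPaid, and so on) are illustrative assumptions, not names prescribed by Fabric:

-- Total amount paid for sales orders, grouped by date and store
-- (table and column names are hypothetical)
SELECT d.CalendarDate,
       s.StoreName,
       SUM(f.TotalPaid) AS TotalSalesAmount
FROM dbo.FactSalesOrder AS f
JOIN dbo.DimDate AS d ON f.OrderDateKey = d.DateKey
JOIN dbo.DimStore AS s ON f.StoreKey = s.StoreKey
GROUP BY d.CalendarDate, s.StoreName
ORDER BY d.CalendarDate, s.StoreName;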

Tables in a Data Warehouse

Tables in a data warehouse are strategically structured to support efficient and effective analysis of large datasets. This organization, often called dimensional modeling, involves categorizing tables into fact tables and dimension tables.

  • Fact Tables: These tables store the numerical data you aim to analyze. They usually contain a large number of rows and serve as the primary data source for analysis. An example would be a fact table holding the total amount paid for sales orders, categorized by date or store.
  • Dimension Tables: These tables provide descriptive context for the data in fact tables. They typically have fewer rows and enrich the fact table data with additional information. For instance, a dimension table might contain details about customers who placed sales orders.

In addition to attribute columns, dimension tables include unique key columns to identify each row. It’s common for a dimension table to have two key columns:

  • Surrogate Key: A unique identifier for each row, often an auto-generated integer.
  • Alternate Key: A natural or business key from the source system, like a product code or customer ID, aiding in traceability.

Both surrogate and alternate keys are crucial in a data warehouse, serving distinct purposes of maintaining data consistency and traceability, respectively.
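As a minimal sketch (the table and column names are illustrative assumptions), a customer dimension might define both keys like this:

-- Hypothetical customer dimension with a surrogate key and an alternate (business) key
CREATE TABLE dbo.DimCustomer
(
    CustomerKey    INT          NOT NULL,  -- surrogate key, generated during the load process
    CustomerAltKey VARCHAR(20)  NOT NULL,  -- alternate key from the source system (e.g., a customer ID)
    CustomerName   VARCHAR(100) NOT NULL,
    EmailAddress   VARCHAR(100) NULL,
    City           VARCHAR(50)  NULL
);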

Special Types of Dimension Tables

  • Time Dimensions: These tables provide temporal context for events, enabling aggregation over time intervals. A time dimension might include columns for the year, quarter, month, and day of a sales order (see the sketch after this list).
  • Slowly Changing Dimensions: These tables track changes to dimension attributes over time, such as customer address or product price changes. They’re vital for analyzing data evolution and ensuring accuracy for informed decision-making.
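For example, a date dimension along these lines could be defined as follows. This is only a sketch; the exact columns depend on your reporting needs:

-- Hypothetical date dimension providing temporal context at several grains
CREATE TABLE dbo.DimDate
(
    DateKey         INT         NOT NULL,  -- e.g., 20240131
    CalendarDate    DATE        NOT NULL,
    DayOfMonth      INT         NOT NULL,
    MonthOfYear     INT         NOT NULL,
    MonthName       VARCHAR(12) NOT NULL,
    CalendarQuarter INT         NOT NULL,
    CalendarYear    INT         NOT NULL
);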

Data Warehouse Schema Designs

  • Star Schema: A common design where a fact table is directly related to dimension tables, facilitating straightforward querying and analysis.
  • Snowflake Schema: A more normalized design where some dimension tables are further broken down, potentially improving data integrity but increasing query complexity.

For more information, see "What is a star schema?" at https://learn.microsoft.com/en-us/power-bi/guidance/star-schema

If dimensions have many hierarchical levels, or if some attributes are shared across multiple dimensions, it might make sense to use a snowflake schema instead. For example:

In a snowflake version of the schema, the DimProduct table is split up (normalized) to create separate dimension tables for product categories and suppliers.

  • Each row in the DimProduct table contains key values for the corresponding rows in the DimCategory and DimSupplier tables.

A DimGeography table has also been added, containing information about where customers and stores are located.

  • Each row in the DimCustomer and DimStore tables contains a key value for the corresponding row in the DimGeography table.
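Querying a snowflake schema involves an extra join through the normalized dimensions. Here is a hedged sketch, reusing the dimension names above with an assumed fact table (FactSales) and assumed column names:

-- Total sales by product category in the snowflake schema
-- (FactSales and the column names are assumptions)
SELECT c.CategoryName,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
JOIN dbo.DimCategory AS c ON p.CategoryKey = c.CategoryKey
GROUP BY c.CategoryName;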

The Lakehouse and Data Warehouse Connection

Fabric’s Lakehouse functions as a collection of files, folders, tables, and shortcuts, effectively acting as a database on top of a data lake. It’s utilized by both the Spark engine and SQL engine for big data processing and offers ACID transaction support when using open-source Delta formatted tables.

The data warehouse experience in Fabric seamlessly bridges the gap between the lake view of the Lakehouse (catering to data engineering and Apache Spark) and the SQL-centric experiences typically associated with traditional data warehouses. The Lakehouse enables you to read tables and leverage the SQL analytics endpoint, while the data warehouse empowers you to manipulate the data.

Within the data warehouse experience, you can:

  • Model data using tables and views (see the view sketch after this list)
  • Execute T-SQL queries across both the data warehouse and the Lakehouse
  • Perform DML (Data Manipulation Language) operations on data within the data warehouse
  • Use the warehouse as a foundation for reporting layers such as Power BI
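For instance, modeling with a view might look like the following sketch (the underlying table and column names are assumptions carried over from the earlier examples):

-- Hypothetical view presenting fact data with friendly dimension attributes
CREATE VIEW dbo.vSalesByStore
AS
SELECT s.StoreName,
       SUM(f.TotalPaid) AS TotalSalesAmount
FROM dbo.FactSalesOrder AS f
JOIN dbo.DimStore AS s ON f.StoreKey = s.StoreKey
GROUP BY s.StoreName;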

With a grasp of the fundamental architectural principles of relational data warehouse schemas, let’s explore how to create a data warehouse in Fabric.

Describing a Data Warehouse in Fabric

The data warehouse experience in Fabric empowers you to construct a relational layer atop the physical data residing in the Lakehouse, making it accessible to analysis and reporting tools. You can create your data warehouse directly in Fabric from the create hub or within a workspace. Once you have an empty warehouse, you can populate it with various objects.

After creating your warehouse, you can use T-SQL directly within the Fabric interface to create tables.
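For example, the dbo.Region table that the COPY INTO statement later in this post loads into could be created with a statement along these lines (the column definitions are assumptions, since the module doesn't show them):

-- Hypothetical definition for the Region table used in the COPY INTO example below
CREATE TABLE dbo.Region
(
    RegionID   INT         NOT NULL,
    RegionName VARCHAR(50) NOT NULL
);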

Ingesting Data into Your Data Warehouse

Fabric offers multiple avenues for ingesting data into a data warehouse, including Pipelines, Dataflows, cross-database querying, and the COPY INTO command. Once ingested, the data becomes available for analysis by diverse business groups, who can leverage features like cross-database querying and sharing to access it.

Creating Tables

To create a table in the data warehouse, you have two options:

  1. Use SQL Server Management Studio (SSMS) or another SQL client to connect to the data warehouse and execute a CREATE TABLE statement.
  2. Create tables directly within the Fabric UI.

You can also copy data from an external location into a data warehouse table using the COPY INTO syntax.

COPY INTO dbo.Region
FROM 'https://mystorageaccountxxx.blob.core.windows.net/private/Region.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (
        IDENTITY = 'Shared Access Signature',
        SECRET = 'xxx'
    ),
    FIRSTROW = 2
);
GO

This statement loads data from a CSV file stored in Azure Blob Storage into a table called "Region" in the Fabric data warehouse, authenticating with a shared access signature and skipping the header row (FIRSTROW = 2).

Table Considerations

After table creation, the data loading process is crucial. A common practice involves using staging tables. In Fabric, you can employ T-SQL commands to load data from files into these staging tables within the data warehouse.

Staging tables act as temporary tables for data cleansing, transformation, and validation. They also facilitate loading data from multiple sources into a single destination table.

Data loading is typically performed as a periodic batch process, where inserts and updates to the data warehouse are coordinated to occur at regular intervals (e.g., daily, weekly, or monthly).

A typical data warehouse load process follows these steps:

  1. Ingest new data into a data lake, applying pre-load cleansing or transformations as needed.
  2. Load data from files into staging tables in the relational data warehouse.
  3. Load dimension tables from the dimension data in staging tables, updating or inserting rows and generating surrogate keys (see the sketch after these steps).
  4. Load fact tables from the fact data in staging tables, referencing appropriate surrogate keys for related dimensions.
  5. Perform post-load optimization by updating indexes and table distribution statistics.
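To make step 3 concrete, here is a minimal sketch that inserts new customers from a staging table and generates surrogate keys with ROW_NUMBER. The staging and dimension table names are assumptions, and a real load would also need update logic for rows that changed:

-- Insert new customers from staging, generating surrogate keys
-- (dbo.StageCustomer and dbo.DimCustomer are hypothetical names)
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerAltKey, CustomerName, EmailAddress, City)
SELECT COALESCE(mx.MaxKey, 0) + ROW_NUMBER() OVER (ORDER BY stg.CustomerID) AS CustomerKey,
       stg.CustomerID,
       stg.CustomerName,
       stg.EmailAddress,
       stg.City
FROM dbo.StageCustomer AS stg
CROSS JOIN (SELECT MAX(CustomerKey) AS MaxKey FROM dbo.DimCustomer) AS mx
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.DimCustomer AS dim
    WHERE dim.CustomerAltKey = stg.CustomerID
);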

Cross-Database Querying

If you have tables within the lakehouse that you want to query from your data warehouse (without modifying them), Fabric’s data warehouse eliminates the need for data copying. You can directly query lakehouse data from the data warehouse using cross-database querying.
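As a sketch (the warehouse, lakehouse, table, and column names are all assumptions), a cross-database query simply uses three-part naming to reference a lakehouse table alongside warehouse tables:

-- Join a warehouse fact table to a lakehouse dimension table without copying data
-- (MyWarehouse and MyLakehouse are hypothetical item names)
SELECT p.ProductName,
       SUM(f.SalesAmount) AS TotalSales
FROM MyWarehouse.dbo.FactSales AS f
JOIN MyLakehouse.dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY p.ProductName;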

Important Note:

Working with tables in the Fabric data warehouse currently has some limitations. Refer to the Fabric documentation for further details.
https://learn.microsoft.com/en-us/fabric/data-warehouse/tables

This blog post is based on information and concepts derived from the Microsoft Learn module titled “Get started with data warehouses in Microsoft Fabric.” The original content can be found here:
https://learn.microsoft.com/en-us/training/modules/get-started-data-warehouse/

