Hi,
I'm thinking about how to model a large customer registry in Fabric for a banking scenario.
The customer data source is composed of several tables: a base table, some tables with a 1-to-1 relationship to the base table, and other tables with a 1-to-many relationship. For example, the source tables include an addresses table and a notes table: a customer can have several addresses, and there can be several notes for a customer.
Moreover, there are some domain tables, e.g. customer type.
My idea is to merge all the data with 1-to-1 relationships into a header customer table and to create a separate table for each 1-to-many relationship (e.g. the addresses table), but my customer has in mind a single large table with all customer data, including addresses and notes: such a table could have several rows per customer (a cartesian product), because my customer recalls a big data solution that handled the customer registry as a single large table.
I'd prefer a data model based on a star schema, and I know little about modelling big data as opposed to a data warehouse.
So, do you have any suggestions for an appropriate data model? Are there any best practices for big data modelling in Fabric?
Hi @pmscorca
Thanks for using Fabric Community.
Here's how you can approach the customer registry data model in Microsoft Fabric, considering both approaches and best practices:
- Understanding the needs: start from how the data will be consumed before choosing a model.
- Star schema: a dimensional model with fact and dimension tables, well suited to analytical consumption.
- Hybrid approach: a header customer table for the 1-to-1 data plus a separate table for each 1-to-many relationship. This is a common and effective strategy.

For more information on best practices for big data modeling in Microsoft Fabric, you can refer to these links:
https://community.fabric.microsoft.com/t5/Desktop/Best-data-modeling-approach/td-p/2927540
https://community.fabric.microsoft.com/t5/Desktop/data-modelling/td-p/1797121
https://learn.microsoft.com/en-us/fabric/data-warehouse/data-modeling-defining-relationships
Best practices for lifecycle management in Fabric - Microsoft Fabric | Microsoft Learn
Hope this helps. Please let me know if you have any further questions.
Hi, many thanks for your reply.
I'll try to provide further details about the question.
I need to model the customer registry data in a banking scenario, namely in an enterprise organization, as a data product. So, I'm thinking of creating a dedicated workspace in Fabric to manage the customer registry.
The scope of the customer registry (at the source) comprises some basic data, some data with 1-to-1 relationships, some data with 1-to-many relationships (e.g. customer addresses and customer notes), and some domain data (e.g. customer types) with 1-to-1 relationships.
The source is a structured, relational system.
I need to import CSV files, one file for each source table.
I could combine the basic data and the customer data with 1-to-1 relationships into a flat header customer table without any problems, as in the sketch below.
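As a minimal sketch of that 1-to-1 consolidation in a Fabric notebook (PySpark), assuming hypothetical file names, column names, and a customer_id key:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One CSV file per source table, landed in the lakehouse Files area
# (file and column names are hypothetical)
base = spark.read.option("header", True).csv("Files/customer_base.csv")
kyc = spark.read.option("header", True).csv("Files/customer_kyc.csv")      # 1-to-1 table
ctype = spark.read.option("header", True).csv("Files/customer_types.csv")  # domain table

# 1-to-1 joins preserve the base table's row count,
# so flattening them into a single header table is safe
header = (base
          .join(kyc, on="customer_id", how="left")
          .join(ctype, on="customer_type_id", how="left"))

# Persist as a Delta table so Power BI and other workspaces can consume it
header.write.format("delta").mode("overwrite").saveAsTable("customer_header")
```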
Then I could manage the domain tables, maintaining the 1-to-1 relationships, as a dimensional data model without any issues.
The focus is on how to manage the customer data with 1-to-many relationships to the header data, considering that these data have no dependencies between them: a customer address doesn't depend on a customer note, and vice versa.
My personal idea is to keep these data as separate tables with 1-to-many relationships to the header customer data, but I should also consider an alternative data model.
The goal is to implement a data product consumed not only by Power BI but also, and especially, by other workspaces related to other banking data.
Now, could I create a single flat customer table combining the header data with the addresses, the notes, and the other 1-to-many data?
Following this approach, if a customer has 3 addresses and 4 notes, should this single flat table contain 7 records (3 rows with the header data plus an address and nulls in the note columns, and 4 rows with the header data plus a note and nulls in the address columns) or 12 records (a cartesian product combining the header data with every address/note pair, e.g. address1 and note1, address1 and note2, address1 and note3, and so on)?
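Continuing the earlier sketch (again with hypothetical names), the two shapes would be built like this; the union variant yields 3 + 4 = 7 rows for that customer, the cartesian variant 3 × 4 = 12:

```python
addr_df = spark.read.option("header", True).csv("Files/customer_addresses.csv")
notes_df = spark.read.option("header", True).csv("Files/customer_notes.csv")

# Option 1: one branch per 1-to-many table, stacked; each branch leaves the
# other branch's columns null. A customer with 3 addresses and 4 notes
# contributes 3 + 4 = 7 rows.
flat_union = (header.join(addr_df, "customer_id")
              .unionByName(header.join(notes_df, "customer_id"),
                           allowMissingColumns=True))

# Option 2: joining both 1-to-many tables on the same key produces the
# per-customer cartesian product: the same customer contributes 3 * 4 = 12 rows,
# and the count multiplies with every additional 1-to-many table.
flat_cartesian = (header
                  .join(addr_df, "customer_id")
                  .join(notes_df, "customer_id"))
```

The union shape at least doesn't multiply rows, but both shapes repeat every header attribute on every row, which is one reason keeping the 1-to-many data in separate tables is usually preferred.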
The basic customer data has roughly 1-2 million rows; the addresses and notes could have more.
Implementing a single flat customer table could therefore mean managing big data.
So, what would be an appropriate data model for such a big data scenario?
An E-R model, a dimensional model, a combination of the two, or some other data model suited to big data?
I repeat that the goal is to implement a customer data product consumed not only by Power BI, but also, and especially, by other workspaces related to other banking data. The customer data isn't primarily intended for data analysis.
Thanks
Hi @pmscorca
I would suggest creating a support ticket for the above case, as our support team can suggest the best solution based on your requirements. Creating a support ticket allows our team to gather all the necessary details about your specific situation. This ensures we can provide the most accurate and tailored solution for you.
Please reach out to our support team: https://support.fabric.microsoft.com/support
and provide the details of the support ticket here for further tracking.
Thanks
Hi @pmscorca
We haven't heard from you since the last response and were just checking back to see if your query got resolved. If not, please reply with more details and we will try to help.
Thanks
Hi, I think it's enough to reply by leveraging some key concepts.
Having a customer registry with core/header data plus other data in 1-to-many relationships should be a recurring pattern, manageable in a DWH with a dimensional model based on a star schema; but how does it translate to big data modelling?
Fabric allows you to create lakehouses and warehouses based on Spark Delta tables: this format is highly performant and has a columnar data organization, perfectly suitable for big data.
Moreover, Fabric allows you to partition tables, e.g. by a date, so a dimension could be handled differently than as an SCD, as sketched below.
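As a small illustration of the partitioning point (table and column names are hypothetical), a Delta table in a Fabric lakehouse can be written partitioned by a date column:

```python
from pyspark.sql import functions as F

# Stamp each load with a snapshot date and partition the Delta table by it,
# so history accumulates per partition instead of via SCD rows
snapshot = header.withColumn("snapshot_date", F.current_date())

(snapshot.write
 .format("delta")
 .mode("append")
 .partitionBy("snapshot_date")
 .saveAsTable("customer_registry"))
```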
Does the columnar organization of Delta tables suffer, in a big data scenario, from having many null values?
Fabric's features make it possible to implement a big data solution, but handling big data may require going beyond the well-known data modeling approaches (E-R model, dimensional model, and so on).
In these terms, is it possible to have an initial reply to my issue? Thanks