filipenovaims20
Regular Visitor

Splitting a single CSV file dataset into different tables inside Fabric

Hi everyone,

 

I'm a new user of Microsoft Fabric, and I'm also studying and applying Kimball's four-step dimensional modelling process.

 

I have a single CSV source file, with a number of different columns, which describes food inspections across Chicago from 2010 up to now. I have organized my data source into 5 dimensions, then created a Lakehouse where I loaded this single dataset, and now I'm working on the transformations required for each dimension by using Data Flows in a staging-area warehouse pipeline created for this specific purpose, which is one of the requirements of the project.

 

For each of the dimensions I have considerable transformations to carry out, and I intend to do this in separate Data Flows, each created specifically for and applied to one of the dimensions mentioned.

 

What would be the best practice you could advise me on?

 

1. Should I apply a transformation to, for instance, the inspection date dimension through a Data Flow, then stop, clone it, apply the next transformation for the location dimension, stop it, and repeat this process for the rest of the dimensions defined, simply by cloning the previously created Data Flow each time?

 

2. Or is it possible to first create a Data Flow that just splits the data into different tables in my Lakehouse, and then apply the rest of the Data Flows without needing to clone the previous one?

 

Note: for this project I cannot split the data manually into separate CSV files and then load them into the Lakehouse; all the required transformations are supposed to be done inside Fabric.

 

Thank you in advance and let me know if you have questions.

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @filipenovaims20 ,

Thanks for using Fabric Community.

1. Cloning Data Flows:

While cloning existing Data Flows can be convenient, it can lead to code duplication and maintenance challenges. Instead, I recommend these options:

  • Modular Data Flows: Design reusable Data Flow components for common transformations (e.g., date formatting, address cleaning). These can be used across different dimensions, reducing redundancy.
  • Parameterization: Instead of cloning, modify the same Data Flow with different parameters (e.g., column names, transformation rules) for each dimension. This ensures flexibility and centralizes changes; see the sketch after this list.
  • Chained Data Flows: If transformations are truly unique, consider separate Data Flows chained together. This improves maintainability by keeping each dimension's logic isolated.
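
As an illustration of the parameterization idea, here is a minimal PySpark sketch of a reusable dimension builder, assuming you move this step into a Fabric notebook rather than Dataflow Gen2; the table name and column names are illustrative, not taken from your dataset:

    # Minimal sketch, assuming a Fabric notebook where the `spark` session is predefined.
    from pyspark.sql import DataFrame, functions as F

    def build_dimension(staging: DataFrame, columns: list, key_name: str) -> DataFrame:
        """Reusable dimension builder: keep only the relevant columns,
        de-duplicate, and add a surrogate key. The same logic serves every
        dimension by passing different parameters instead of cloning it."""
        return (staging
                .select(*columns)
                .dropDuplicates()
                .withColumn(key_name, F.monotonically_increasing_id()))

    # Hypothetical staging table and column names, purely for illustration.
    staging = spark.read.table("staging_food_inspections")
    dim_location = build_dimension(staging, ["address", "city", "state", "zip"], "location_key")
    dim_facility = build_dimension(staging, ["dba_name", "facility_type"], "facility_key")

The same function is called once per dimension with different parameters, so any change to the shared logic only has to be made in one place.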

2. Data Splitting:

Yes, splitting data into separate tables before individual Data Flow transformations is a good practice. This avoids unnecessary processing on irrelevant data for each dimension.

  • Pre-split Data Flow: Create a dedicated Data Flow to split the original data based on dimension-specific criteria. This pre-processed data can then be fed into separate Data Flows for each dimension, improving efficiency.
  • Lakehouse Partitioning: Utilize Microsoft Fabric's lakehouse partitioning feature to group relevant data together based on dimension identifiers. This allows Data Flows to efficiently access only the required data for each dimension.
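
For the pre-split and partitioning suggestions, a minimal notebook-based sketch could look like the following, assuming the CSV has been uploaded to the Lakehouse Files area; the path, column lists, table names, and date format are assumptions for illustration only:

    # Minimal sketch, assuming a Fabric notebook with the Lakehouse attached
    # (the `spark` session is predefined there). Path, column names, table
    # names, and the date format are assumptions, not taken from the dataset.
    from pyspark.sql import functions as F

    raw = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("Files/food_inspections/Food_Inspections.csv"))

    # Rename columns to Delta-friendly names (no spaces or '#') before saving.
    clean = raw.select([F.col(c).alias(c.lower().replace(" ", "_").replace("#", "num"))
                        for c in raw.columns])

    # Pre-split: keep only the columns each dimension needs, de-duplicate,
    # and land each subset as its own table so later Data Flows read less data.
    dimension_columns = {
        "dim_inspection_date": ["inspection_date"],
        "dim_location": ["address", "city", "state", "zip", "latitude", "longitude"],
        "dim_facility": ["dba_name", "aka_name", "facility_type", "license_num"],
    }
    for table_name, cols in dimension_columns.items():
        clean.select(*cols).dropDuplicates().write.mode("overwrite").saveAsTable(table_name)

    # Optional partitioning: group the staging copy by inspection year so each
    # downstream step scans only the folders it needs (date format is assumed).
    (clean.withColumn("inspection_year",
                      F.year(F.to_date("inspection_date", "MM/dd/yyyy")))
          .write.mode("overwrite")
          .partitionBy("inspection_year")
          .saveAsTable("staging_food_inspections"))

Each Data Flow (or notebook) that builds a dimension can then read its own small table instead of rescanning the full CSV.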

Additional Tips:

  • Documentation: Clearly document your chosen approach for transformation and data organization to improve project clarity and avoid future confusion.
  • Testing and Monitoring: Regularly test and monitor your Data Flows to ensure they are operating correctly and efficiently.

Remember: There is no one-size-fits-all solution, and the best approach will depend on your specific data, transformations, and preferences.

I hope this helps! Feel free to ask any further questions you may have.


9 REPLIES
filipenovaims20
Regular Visitor

Hi @Anonymous, thank you for your support.

 

I ended up splitting the data source into several files based on the dimension criteria I defined, did some pre-cleaning for duplicates and null values, and it worked when building the Data Flows and loading into the warehouses.

 

I didn't have time for the partitioning solution you mentioned, but I found it interesting and recommend others give it a try. I will certainly follow it in the near future. If I have doubts I'll let you know.

 

Thank you for your incredible tips, and please consider my question resolved!

Anonymous
Not applicable

Hi @filipenovaims20 ,

Glad to know that the query got resolved. Please continue using Fabric Community for your further queries.


@Anonymous, regarding 2. Pre-split Data Flow: while in the Staging Area Warehouse, I create a Data Flow that splits out, for instance, the Date dimension by deleting all the columns that aren't part of it and keeping those that are, then do the same for the rest of the dimensions, add a wait activity at the end, and only then start applying the transformations to each dimension. Am I mistaken?

 

Below is a picture of the process described:

filipenovaims20_0-1703606065208.png

 

Anonymous
Not applicable

It may change based on your data and requirements. If the approach you are following works for you, please go ahead with your logic.

Hi @Anonymous, many thanks for the help.

 

I tried to apply the "Lakehouse Partitioning" solution and found a very interesting article about this here: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/efficient-data-partitioning-with-microsoft-fabric-best-practices/ba-p/3890097

 

Unfortunately I got stuck at part 4, where I got these errors:

 

filipenovaims20_0-1703601167508.png

Could you please help me identify where I went wrong?

Anonymous
Not applicable

Hi @filipenovaims20 ,

It looks like the CSV file is not located at the specified path. Could you please verify your path and try again?
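
If it helps, one way to double-check the path from a Fabric notebook with the Lakehouse attached is to list the Files area and then attempt the read with the exact path returned; the folder and file names below are placeholders, not your actual path:

    # Assumes a Fabric notebook with the Lakehouse attached, where `spark` and
    # the notebook utilities are available by default.
    from notebookutils import mssparkutils

    # List what actually exists under Files/ to confirm folder and file names.
    for item in mssparkutils.fs.ls("Files/"):
        print(item.name, item.isDir)

    # Then retry the read with the exact path reported by the listing.
    df = spark.read.option("header", True).csv("Files/food_inspections/Food_Inspections.csv")
    df.printSchema()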

Anonymous
Not applicable

Hi @filipenovaims20 ,

We haven't heard from you since the last response and wanted to check back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

