Hello guys,
Being new to data engineering, I would like to seek advice on the best and most efficient way of doing data validation. For example, when adopting the Medallion approach and processing Bronze to Silver, we need to clean the data. Say we are moving Purchase Order transaction data; we should check that:
1. Vendor ID is valid, i.e. it already exists in the Vendor Master
2. Product ID is valid, i.e. it already exists in the Product Master
3. Currency is valid
4. All the other information is valid.
From a data engineering perspective, what should we do during these checks? Should we flag the record, i.e. add an additional "valid" field for each column that must be checked against a master? This flag could be carried into Silver so that an invalid field is easy to identify, while the data itself is still available; we have just clearly marked it. Is that the right approach?
Logging the invalid records is the obvious step, but I have not found it effective on its own. What should we actually do about the log? Does someone need to maintain the log table, i.e. look into it and take the necessary action first? Is the import held, or are the invalid rows left behind until they are reprocessed and receive the valid flag?
If the "checked" id is not exist in master, is it good to have some automation to retrieved it from master data pipeline ?
I would appreciate any advice you can share. Thanks,
Hi @VoltesDev ,
Thanks for posting your question in the Microsoft Fabric Community forum.
When adopting the Medallion architecture and moving data from Bronze to Silver, data validation is a critical step to ensure data quality and reliability. Below is a step-by-step approach to efficiently handle validation, including the use of column profiling.
Column profiling in Microsoft Fabric Data Flows is a helpful tool to get a clear picture of your data and spot major issues before setting up validation rules.
For example, the column quality indicators in the data preview show the proportion of Valid, Error, and Empty values for each column.
You can then apply validation in the Bronze-to-Silver transformation using methods such as flagging and logging invalid records.
Flagging records as valid or invalid is a solid approach. It gives you:
Flexibility: The data pipeline keeps moving without halting the entire process.
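To make the flag-and-continue idea concrete, below is a minimal PySpark sketch, assuming a Fabric notebook where `spark` is already available; the table names (bronze_purchase_orders, vendor_master, silver_purchase_orders, silver_purchase_orders_invalid) and the vendor_id column are hypothetical placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical Bronze source and master table; adjust names to your lakehouse.
po = spark.read.table("bronze_purchase_orders")
vendors = (
    spark.read.table("vendor_master")
         .select("vendor_id")
         .dropDuplicates()                      # one row per vendor id
         .withColumn("vendor_exists", F.lit(True))
)

# Left join against the master and derive a record-level validity flag.
flagged = (
    po.join(vendors, on="vendor_id", how="left")
      .withColumn("is_valid", F.coalesce(F.col("vendor_exists"), F.lit(False)))
      .drop("vendor_exists")
)

# Valid rows continue into Silver; invalid rows are kept (not dropped) for later review.
flagged.filter("is_valid").write.mode("append").saveAsTable("silver_purchase_orders")
flagged.filter("NOT is_valid").write.mode("append").saveAsTable("silver_purchase_orders_invalid")
```

The point of the split write is that bad rows never block the load: they land in a side table with the same columns, so nothing is lost while the clean data keeps flowing.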
Enhancements to This Approach:
Column-Level Flags: Instead of just flagging the entire record as "valid" or "invalid," you can add separate flags for each key column (e.g., valid_vendor_id, valid_product_id). This makes it easier to pinpoint issues and take the necessary action; see the sketch after this list.
Valid State Metadata: You can create a structured metadata table to track the validation status of individual records or columns. This helps avoid repeatedly checking the same data. It's like having a clear to-do list for your data!
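As a sketch of the column-level flag idea (same hypothetical tables as above; product_master, product_id, and the currency list are additional assumptions), each reference check gets its own flag and the record-level flag is derived from them, so the flags travel with the row into Silver:

```python
from pyspark.sql import functions as F

po = spark.read.table("bronze_purchase_orders")
vendors = spark.read.table("vendor_master").select("vendor_id").dropDuplicates().withColumn("vendor_hit", F.lit(True))
products = spark.read.table("product_master").select("product_id").dropDuplicates().withColumn("product_hit", F.lit(True))
valid_currencies = ["USD", "EUR", "GBP"]  # assumption: a small fixed list; could also come from a reference table

checked = (
    po.join(vendors, "vendor_id", "left")
      .join(products, "product_id", "left")
      .withColumn("valid_vendor_id", F.coalesce(F.col("vendor_hit"), F.lit(False)))
      .withColumn("valid_product_id", F.coalesce(F.col("product_hit"), F.lit(False)))
      .withColumn("valid_currency", F.coalesce(F.col("currency").isin(valid_currencies), F.lit(False)))
      .withColumn("is_valid", F.col("valid_vendor_id") & F.col("valid_product_id") & F.col("valid_currency"))
      .drop("vendor_hit", "product_hit")
)

# All rows land in Silver with their per-column flags, so invalid fields are clearly identified but not lost.
checked.write.mode("append").saveAsTable("silver_purchase_orders")
```

The same per-column flags can also be summarised into a separate metadata table if you prefer to keep the Silver table narrow.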
Instead of relying on someone to manually check the log table and take action, you can automate the process:
Automated Reprocessing: When missing or corrected data (like a Vendor ID) becomes available, set up an automatic process to revalidate flagged records and mark them as "valid" (a sketch follows this list).
Hold or Skip Invalid Rows: Instead of stopping the entire import process, you can temporarily flag and skip invalid rows, letting the valid data continue. The flagged rows can then be reprocessed once the issues are resolved.
Automated Notifications: Set up automatic alerts to notify the right teams when there’s invalid data that needs attention, so they can fix it quickly and keep things moving smoothly.
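Here is a sketch of that reprocessing step, using the same hypothetical table names as above: on a schedule, re-check the quarantined rows against the refreshed master and promote the ones that have become valid.

```python
from pyspark.sql import functions as F

# Re-check quarantined rows against the latest Vendor Master.
quarantine = spark.read.table("silver_purchase_orders_invalid")
vendors = spark.read.table("vendor_master").select("vendor_id").dropDuplicates().withColumn("vendor_hit", F.lit(True))

rechecked = (
    quarantine.join(vendors, on="vendor_id", how="left")
              .withColumn("is_valid", F.coalesce(F.col("vendor_hit"), F.lit(False)))
              .drop("vendor_hit")
).cache()
rechecked.count()  # materialize before overwriting the table it was read from

# Promote rows that are now valid; the rest stay in quarantine for the next scheduled run.
rechecked.filter("is_valid").write.mode("append").saveAsTable("silver_purchase_orders")
rechecked.filter("NOT is_valid").write.mode("overwrite").saveAsTable("silver_purchase_orders_invalid")
```

Scheduling this as a Fabric pipeline or notebook activity after each master data refresh gives you the automated revalidation without anyone having to babysit the log table.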
For more insights into enhancing your data validation practices, refer to Semantic Link - Data Validation Using Great Expectations.
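For reference, a minimal, illustrative sketch of a rule-based check with Great Expectations, using its classic pandas wrapper (ge.from_pandas); newer Great Expectations releases expose a different fluent API, and the sample data below is made up, so treat this as the shape of the idea rather than the exact code from that article:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical sample of Purchase Order rows pulled into pandas for rule-based checks.
po = pd.DataFrame({
    "vendor_id": ["V001", "V002", None],
    "currency": ["USD", "XXX", "EUR"],
})

# Wrap the DataFrame so expectation methods become available (classic/legacy GX API).
ge_po = ge.from_pandas(po)

not_null = ge_po.expect_column_values_to_not_be_null("vendor_id")
in_set = ge_po.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])

# Each result reports whether the expectation passed; the exact result shape varies by GX version.
print(not_null.success, in_set.success)
```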
Hi @VoltesDev ,
We just wanted to check in again regarding your issue. If you’ve found a solution, marking the reply as the solution and leaving a kudos would be greatly appreciated—it helps the community and others with similar questions.
If you’re still facing challenges or have further questions, please let us know—we’re here to help and would love to assist you in resolving this.
Looking forward to hearing back from you!
Best regards,
Vinay.
Hi @VoltesDev ,
We hope you're doing well! We just wanted to follow up since we haven't heard back from you. If your issue is all sorted, it'd be great if you could mark the reply as the solution and maybe drop a kudos; it helps others too. But if you still need assistance, just let us know; we're happy to help!
Best,
Vinay.
Hi @VoltesDev ,
We haven’t heard back since our last response and wanted to check if your query has been resolved. If not, please feel free to reach out for further assistance. If it has been resolved, kindly mark the helpful reply as the solution to make it easier for others to find. A kudos would also be greatly appreciated!
Thank you.