Hello guys,
Being new to data engineering, I would like to seek advice on the best and most efficient way of doing data validation. For example, when adopting the Medallion approach and processing Bronze to Silver, we need to clean the data. Say we are moving Purchase Order transaction data; we should check that:
1. Vendor ID is valid, i.e. it already exists in the Vendor Master
2. Product ID is valid, i.e. it already exists in the Product Master
3. Currency is valid
4. All the other information is valid.
From a data engineering perspective, what should we do during these checks? Should we flag the record, i.e. add an additional "valid" field for each column that must be checked against a master? This flag could be carried into Silver so that an invalid field is easy to identify, while the data itself is still available; we have just clearly marked it. Is that the right approach?
Logging the invalid records is the obvious step, but I have not found it effective on its own. What should we actually do about the log? Does someone need to maintain the log table, i.e. look into it and take the necessary action first? Is the import held, or are the invalid rows left behind until they are reprocessed and receive the valid flag?
If the "checked" id is not exist in master, is it good to have some automation to retrieved it from master data pipeline ?
I would appreciate any advice you can share. Thanks,
Hi @VoltesDev ,
Thanks for posting your question in the Microsoft Fabric Community forum.
When adopting the Medallion architecture and moving data from Bronze to Silver, data validation is a critical step to ensure data quality and reliability. Below is a step-by-step approach to efficiently handle validation, including the use of column profiling.
Column profiling in Microsoft Fabric Data Flows is a helpful tool to get a clear picture of your data and spot major issues before setting up validation rules.
For example, the column quality indicators in the data preview show the proportion of Valid, Error, and Empty values for each column.
You can then apply validation in the Bronze-to-Silver transformation using methods such as flagging and logging invalid records.
Flagging records as valid or invalid is a solid approach. It gives you:
Flexibility: The data pipeline keeps moving without halting the entire process.
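To make the flag-and-continue idea concrete, below is a minimal PySpark sketch, assuming a Fabric notebook where `spark` is already available; the table names (bronze_purchase_orders, vendor_master, silver_purchase_orders, silver_purchase_orders_invalid) and the vendor_id column are hypothetical placeholders:

```python
from pyspark.sql import functions as F

# Hypothetical Bronze source and master table; adjust names to your lakehouse.
po = spark.read.table("bronze_purchase_orders")
vendors = (
    spark.read.table("vendor_master")
         .select("vendor_id")
         .dropDuplicates()                      # one row per vendor id
         .withColumn("vendor_exists", F.lit(True))
)

# Left join against the master and derive a record-level validity flag.
flagged = (
    po.join(vendors, on="vendor_id", how="left")
      .withColumn("is_valid", F.coalesce(F.col("vendor_exists"), F.lit(False)))
      .drop("vendor_exists")
)

# Valid rows continue into Silver; invalid rows are kept (not dropped) for later review.
flagged.filter("is_valid").write.mode("append").saveAsTable("silver_purchase_orders")
flagged.filter("NOT is_valid").write.mode("append").saveAsTable("silver_purchase_orders_invalid")
```

The point of the split write is that bad rows never block the load: they land in a side table with the same columns, so nothing is lost while the clean data keeps flowing.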
Enhancements to This Approach:
Column-Level Flags: Instead of just flagging the entire record as "valid" or "invalid," you can add separate flags for each key column (e.g., valid_vendor_id, valid_product_id). This makes it easier to pinpoint issues and take the necessary action; see the sketch after this list.
Valid State Metadata: You can create a structured metadata table to track the validation status of individual records or columns. This helps avoid repeatedly checking the same data. It's like having a clear to-do list for your data!
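As a sketch of the column-level flag idea (same hypothetical tables as above; product_master, product_id, and the currency list are additional assumptions), each reference check gets its own flag and the record-level flag is derived from them, so the flags travel with the row into Silver:

```python
from pyspark.sql import functions as F

po = spark.read.table("bronze_purchase_orders")
vendors = spark.read.table("vendor_master").select("vendor_id").dropDuplicates().withColumn("vendor_hit", F.lit(True))
products = spark.read.table("product_master").select("product_id").dropDuplicates().withColumn("product_hit", F.lit(True))
valid_currencies = ["USD", "EUR", "GBP"]  # assumption: a small fixed list; could also come from a reference table

checked = (
    po.join(vendors, "vendor_id", "left")
      .join(products, "product_id", "left")
      .withColumn("valid_vendor_id", F.coalesce(F.col("vendor_hit"), F.lit(False)))
      .withColumn("valid_product_id", F.coalesce(F.col("product_hit"), F.lit(False)))
      .withColumn("valid_currency", F.coalesce(F.col("currency").isin(valid_currencies), F.lit(False)))
      .withColumn("is_valid", F.col("valid_vendor_id") & F.col("valid_product_id") & F.col("valid_currency"))
      .drop("vendor_hit", "product_hit")
)

# All rows land in Silver with their per-column flags, so invalid fields are clearly identified but not lost.
checked.write.mode("append").saveAsTable("silver_purchase_orders")
```

The same per-column flags can also be summarised into a separate metadata table if you prefer to keep the Silver table narrow.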
Instead of relying on someone to manually check the log table and take action, you can automate the process:
Automated Reprocessing: When missing or corrected data (like a Vendor ID) becomes available, set up an automatic process to revalidate flagged records and mark them as "valid" (a sketch follows this list).
Hold or Skip Invalid Rows: Instead of stopping the entire import process, you can temporarily flag and skip invalid rows, letting the valid data continue. The flagged rows can then be reprocessed once the issues are resolved.
Automated Notifications: Set up automatic alerts to notify the right teams when there’s invalid data that needs attention, so they can fix it quickly and keep things moving smoothly.
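Here is a sketch of that reprocessing step, using the same hypothetical table names as above: on a schedule, re-check the quarantined rows against the refreshed master and promote the ones that have become valid.

```python
from pyspark.sql import functions as F

# Re-check quarantined rows against the latest Vendor Master.
quarantine = spark.read.table("silver_purchase_orders_invalid")
vendors = spark.read.table("vendor_master").select("vendor_id").dropDuplicates().withColumn("vendor_hit", F.lit(True))

rechecked = (
    quarantine.join(vendors, on="vendor_id", how="left")
              .withColumn("is_valid", F.coalesce(F.col("vendor_hit"), F.lit(False)))
              .drop("vendor_hit")
).cache()
rechecked.count()  # materialize before overwriting the table it was read from

# Promote rows that are now valid; the rest stay in quarantine for the next scheduled run.
rechecked.filter("is_valid").write.mode("append").saveAsTable("silver_purchase_orders")
rechecked.filter("NOT is_valid").write.mode("overwrite").saveAsTable("silver_purchase_orders_invalid")
```

Scheduling this as a Fabric pipeline or notebook activity after each master data refresh gives you the automated revalidation without anyone having to babysit the log table.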
For more insights into enhancing your data validation practices, refer to Semantic Link - Data Validation Using Great Expectations.
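For reference, a minimal, illustrative sketch of a rule-based check with Great Expectations, using its classic pandas wrapper (ge.from_pandas); newer Great Expectations releases expose a different fluent API, and the sample data below is made up, so treat this as the shape of the idea rather than the exact code from that article:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical sample of Purchase Order rows pulled into pandas for rule-based checks.
po = pd.DataFrame({
    "vendor_id": ["V001", "V002", None],
    "currency": ["USD", "XXX", "EUR"],
})

# Wrap the DataFrame so expectation methods become available (classic/legacy GX API).
ge_po = ge.from_pandas(po)

not_null = ge_po.expect_column_values_to_not_be_null("vendor_id")
in_set = ge_po.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"])

# Each result reports whether the expectation passed; the exact result shape varies by GX version.
print(not_null.success, in_set.success)
```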
Hi @VoltesDev ,
We just wanted to check in again regarding your issue. If you’ve found a solution, marking the reply as the solution and leaving a kudos would be greatly appreciated—it helps the community and others with similar questions.
If you’re still facing challenges or have further questions, please let us know—we’re here to help and would love to assist you in resolving this.
Looking forward to hearing back from you!
Best regards,
Vinay.
Hi @VoltesDev ,
We hope you're doing well! We just wanted to follow up since we haven't heard back from you. If your issue is all sorted, it'd be great if you could mark the reply as the solution and maybe drop a kudos; it helps others too. But if you still need assistance, just let us know; we're happy to help!
Best,
Vinay.
Hi @VoltesDev ,
We haven’t heard back since our last response and wanted to check if your query has been resolved. If not, please feel free to reach out for further assistance. If it has been resolved, kindly mark the helpful reply as the solution to make it easier for others to find. A kudos would also be greatly appreciated!
Thank you.