Hi all,
We have a medallion architecture in Fabric (raw, bronze, silver, gold and consumption layers).
Every time a data quality rule runs in Purview and finds records with errors (records not compliant with the DQ rule), it adds those records to a lakehouse in Fabric (to the Files section of the lakehouse).
This happens in near real time: the rule runs in Purview and, if it finds non-compliant records, it writes them (via the built-in Purview Fabric connector) as parquet files into the lakehouse folder.
We could have, for example, 1K rules running at the same time, though in practice it will probably be far fewer than that.
I need to move this data across the layers until it reaches silver, and I would like to have it available there as fast as I can.
I was thinking of creating a OneLake event trigger that fires every time there is a change in the root folder of this lakehouse, but I am not sure whether I should do that. Alternatively, I could set up a pipeline to bring the information into bronze every 15 minutes, for example, but there must be better options...
The goal is to fetch the information from the parquet files (once they are added to the lakehouse Files section) and add it to a table in bronze in the form of JSON payloads.
What do you think is ideal for this situation, knowing that I would like to have the information in bronze ASAP?
Thanks,
pedro
Hi @fabricpribeiro,
May I check whether this issue has been resolved? If not, please feel free to contact us with any further questions. Your update will be valuable to the community and may assist others with similar concerns.
Thank you.
Hi @fabricpribeiro ,
Could you please confirm if your issue has been resolved using the suggested approach? This will help other community members facing similar scenarios.
Thank you.
Hi @fabricpribeiro,
Thanks for raising this—let me explain it in a simpler way.
When Purview writes data quality results into the Lakehouse, it often creates many small Parquet files. This is expected, but too many small files can slow down queries and increase processing overhead.
To handle this properly, we should not process each file one by one as it arrives. Instead, we should process the data in batches at regular intervals using a pipeline or Spark job.
This approach helps in three ways:
Reduces the number of file operations
Improves overall read and query performance
Keeps the system stable even when multiple DQ rules run at the same time
Also, running regular maintenance like OPTIMIZE on Delta tables helps merge small files and keeps performance healthy.
So in short: ingest quickly, but process in batches + optimize tables regularly for best performance.
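To give a feel for why merging small files matters, here is a rough back-of-the-envelope sketch. It is not how OPTIMIZE works internally; the 128 MB target size and the example file sizes are assumptions picked purely for illustration.

```python
import math

def files_after_compaction(file_sizes_kb, target_kb=128 * 1024):
    """Estimate how many files remain if small files are merged up to target_kb each.

    target_kb is an illustrative assumption, not a Fabric or Delta default.
    """
    if not file_sizes_kb:
        return 0
    return max(1, math.ceil(sum(file_sizes_kb) / target_kb))

# Hypothetical burst: 500 DQ result files of 200 KB each.
small_files_kb = [200] * 500
print(len(small_files_kb), "->", files_after_compaction(small_files_kb))  # 500 -> 1
```

A reader of the table then opens one file instead of 500, which is where the read-time saving comes from.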
Thanks.
To explain the window consolidation:
If you trigger a pipeline run for every single file that lands in OneLake, and you have many files arriving, Fabric will try to run dozens of pipelines at the same time. This will quickly overload your Fabric capacity, hit concurrency limits, and consume too many Capacity Units.
To avoid this, you should group the processing:
Option 1: You can run a Spark Structured Streaming notebook that runs continuously. You configure it to check the folder and write in batches at a set interval, like every two minutes. Spark will automatically aggregate all files that arrived during those two minutes and write them in a single batch, which is very efficient.
Option 2: Instead of using any event triggers, you can simply schedule a standard Fabric pipeline to run on a timer, like once every five minutes. Each time it runs, it picks up all new files, processes them in bulk, and stops. This keeps your pipeline runs low and predictable.
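To make the difference concrete, here is a small simulation in plain Python (no Fabric APIs involved). The arrival times are made up, and the two-minute window is just the example interval from Option 1.

```python
from collections import defaultdict

def runs_per_file(arrival_seconds):
    """One run per file: the number of runs equals the number of files."""
    return len(arrival_seconds)

def runs_per_window(arrival_seconds, window_seconds=120):
    """Group arrivals into fixed windows; one run per non-empty window."""
    windows = defaultdict(list)
    for t in arrival_seconds:
        windows[t // window_seconds].append(t)
    return len(windows)

# Hypothetical file arrival times, in seconds since some start point.
arrivals = [0, 5, 7, 30, 118, 121, 130, 400]
print(runs_per_file(arrivals))    # 8
print(runs_per_window(arrivals))  # 3
```

Eight files cost eight runs when triggered one by one, but only three runs when grouped, and the gap widens as bursts get bigger.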
Let me know if this helps!
Hi Pedro,
Since DQ writes in frequent batches, it creates many tiny files that hurt query performance. The best practice is to let your ingestion write normally, and run OPTIMIZE on the table periodically using a scheduled notebook (rather than running it inline on every single write).
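V-Order has been enabled by default in Fabric Spark, although I believe the default can differ by workspace and runtime, so it is worth checking the V-Order setting in your environment. When it is on, files written from a notebook are automatically optimized for Power BI reads.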
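A scheduled maintenance notebook could look roughly like the sketch below. The function name and the table name are hypothetical; the only thing it relies on is that the notebook's Spark session accepts a Delta OPTIMIZE statement through spark.sql.

```python
def optimize_tables(spark, tables):
    """Run OPTIMIZE once per table and return the statements that were issued.

    `spark` is the session a Fabric notebook already provides;
    the table names are whatever your Bronze/Silver tables are called.
    """
    statements = [f"OPTIMIZE {table}" for table in tables]
    for statement in statements:
        spark.sql(statement)
    return statements

# In a scheduled notebook (table name is a placeholder):
# optimize_tables(spark, ["bronze.dq_errors_raw"])
```

Keeping this in its own notebook on its own schedule means the ingestion job never waits on compaction.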
Hi @fabricpribeiro,
Honestly, I think your intuition is already going in the right direction.
What you’re describing is less a classic ETL problem and more an event-driven ingestion pattern inside Fabric. Since Purview is dropping parquet files directly into the Lakehouse /Files area in near real time, using OneLake events is definitely aligned with the direction Microsoft Fabric is pushing today for reactive architectures.
That said, I would probably be careful with a pure:
> 1 file arrival = 1 pipeline execution
pattern if you expect the volume of DQ events to grow significantly over time.
Why?
Because once you start having many small parquet files, bursts of concurrent DQ rule executions and hundreds of simultaneous triggers, you can quickly run into orchestration overhead, many tiny Spark executions, Delta small-files problems and unnecessary CU consumption.
So personally, I think the sweet spot here is usually a hybrid event-driven micro-batch approach.
For example :
That still gives you :
From there, the medallion flow becomes very clean :
I would also strongly recommend planning early for Delta compaction, OPTIMIZE / V-Order and small-files mitigation, because DQ systems tend to generate many small parquet files over time.
If later the latency requirements become extremely aggressive (sub-minute / streaming-like workloads), then it may be worth looking at Eventstream, Spark Structured Streaming or a more continuous ingestion architecture. But for most enterprise DQ monitoring scenarios, a small event-driven micro-batch window is usually the best balance between freshness, scalability, reliability and CU efficiency.
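One way to picture the event-driven micro-batch idea is the sketch below: the first file event schedules a run a short time later, and every event that arrives before that run is simply absorbed by it. The class and method names are hypothetical, and this is plain Python showing the logic only, not a Fabric trigger API.

```python
class MicroBatchScheduler:
    """Schedules at most one run per window, however many file events arrive."""

    def __init__(self, window_seconds=120):
        self.window_seconds = window_seconds
        self._pending_run_at = None

    def on_file_event(self, event_time):
        """Return the time of a newly scheduled run, or None if a pending run covers this file."""
        if self._pending_run_at is not None and event_time < self._pending_run_at:
            return None
        self._pending_run_at = event_time + self.window_seconds
        return self._pending_run_at

# Hypothetical event times in seconds.
scheduler = MicroBatchScheduler(window_seconds=120)
print([scheduler.on_file_event(t) for t in [0, 10, 60, 125, 130, 300]])
# [120, None, None, 245, None, 420]
```

Six events produce three runs, and no file waits longer than the window length before a run picks it up.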
Useful references from Microsoft Fabric docs :
Hope this helps a bit.
Really interesting use case, by the way. This is exactly the kind of scenario where Fabric starts behaving more like a real event-driven data platform than a traditional BI stack.
Feel free to share which architecture you finally go with.
Can you also please explain this part in more detail, so that I can understand it? "
"
Hi @fabricpribeiro ,
Thank you for reaching out to the Microsoft Community Forum and please consider the below points:
In Fabric lakehouses, frequent ingestion (such as DQ outputs) can lead to a large number of small Parquet files, which can negatively impact performance.
Microsoft recommends maintaining Delta tables by:
Fabric supports different ingestion approaches depending on latency needs:
Event‑driven approaches can be used for low‑latency scenarios, while batch pipelines are typically used to process data at regular intervals based on scheduling or orchestration requirements.
For your reference:
SQL analytics endpoint performance considerations
Feel free to reach out to us if you have any further questions.
This does not reply to my questions
Hi @fabricpribeiro,
Thank you for your feedback. Let me clarify this more directly in the context of your scenario.
When Purview generates DQ results, it writes many small Parquet files into the Lakehouse. Microsoft documentation explains that frequent ingestion can lead to a large number of small files, which can negatively impact performance and therefore require regular Delta table maintenance, such as running Optimize operations and maintaining tables to manage file counts. In practical terms, more files mean more overhead during reads, so optimizing helps reduce the number of files and improves query efficiency.
Regarding “detect activity quickly and process in short windows,” Fabric supports two ingestion approaches depending on latency needs: real-time ingestion using Eventstreams and batch ingestion using pipelines.
In your scenario, this means instead of processing each file individually as it arrives, you process all new files together at regular intervals using a pipeline or Spark job. This approach reduces execution overhead, improves scalability when multiple DQ rules run concurrently, and still allows data to reach the Bronze layer quickly while maintaining stable performance.
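As a rough illustration of "process all new files together", each scheduled run can remember a watermark (the latest modified time it has already handled) and only pick up files newer than that. The function name and the file listing below are hypothetical; how you list the files and where you store the watermark depends on your setup.

```python
def select_new_files(files, last_watermark):
    """files: iterable of (path, modified_ts) pairs.

    Returns the paths modified after last_watermark (oldest first)
    and the watermark to store for the next run.
    """
    new_files = sorted((f for f in files if f[1] > last_watermark), key=lambda f: f[1])
    new_watermark = new_files[-1][1] if new_files else last_watermark
    return [path for path, _ in new_files], new_watermark

# Hypothetical listing of the Files folder: (path, modified time in epoch seconds).
listing = [("dq/a.parquet", 100), ("dq/b.parquet", 160), ("dq/c.parquet", 190)]
paths, watermark = select_new_files(listing, last_watermark=100)
print(paths, watermark)  # ['dq/b.parquet', 'dq/c.parquet'] 190
```

One caveat: with a strict "newer than" comparison, a file that lands later with exactly the same timestamp as the stored watermark would be skipped, so a checkpoint-based stream is the safer option when that can happen.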
Thanks, @sabledattatray, @oussamahaimoud & @Tamanchu for sharing valuable insights.
Thanks for your reply. Can you please explain this part in a bit more detail: "Delta compaction, OPTIMIZE / V-Order and small-files mitigation, because DQ systems tend to generate many small parquet files over time"
Hi @fabricpribeiro,
Hope you're doing well!
As far as I know, when Purview finds bad records, it drops parquet files into your RAW lakehouse almost instantly. Instead of polling every 15 minutes, you can set up a OneLake storage event that fires the moment a new file lands. A single Spark Structured Streaming job then picks the file up, wraps each row into a JSON payload, and writes it straight into your Bronze Delta table. From there, a second lightweight stream reads the Delta change feed and pushes clean, deduplicated data into Silver.
Result: bad records go from Purview to Silver in under 3 minutes, you never poll, and one stream handles all your rules simultaneously no matter how many fire at once.
Here are the official documentation links that you can check after for more information:
https://learn.microsoft.com/en-us/purview/unified-catalog-data-quality-fabric-lakehouse
https://learn.microsoft.com/en-us/fabric/governance/microsoft-purview-fabric
https://learn.microsoft.com/en-us/fabric/data-factory/pipeline-storage-event-triggers
https://learn.microsoft.com/en-us/fabric/real-time-hub/fabric-events-overview
https://learn.microsoft.com/en-us/fabric/real-time-hub/tutorial-build-event-driven-data-pipelines
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-streaming-data
https://learn.microsoft.com/en-us/fabric/data-engineering/get-started-streaming
https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/
https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake
https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed
Hope this helps. Feel free to ask me questions if needed, and don’t forget to Accept as Solution if this guidance worked for you. That motivates me to keep helping.
Best regards,
Oussama (Data Consultant & Fabric's Expert)
Did my response help you? Clicking Kudos is a small gesture that goes a long way, it encourages contributors and helps the community thrive!
✅ Did I answer your question? Please mark my post as a Solution, it helps others find the answer faster.
Senior Data & BI Consultant · Microsoft Fabric & Power BI Specialist
Thank you very much for your reply. Can you please explain this part in a bit more detail: "then a single Spark Structured Streaming job picks it up, wraps each row into a JSON payload, and writes it straight into your Bronze Delta table."
You're welcome @fabricpribeiro!
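The Spark Structured Streaming job watches the RAW lakehouse folder with a file stream source. On Databricks this would be Auto Loader (the cloudFiles format), but as far as I know cloudFiles is a Databricks feature that is not available in Fabric Spark, so in Fabric the same pattern uses Spark's built-in parquet file source. When a new parquet file lands, the stream picks it up on its next trigger, reads its rows, and converts each row into a single JSON string using to_json(struct("*")), meaning all columns get wrapped into one JSON payload column. That payload, plus a few metadata columns (timestamp, rule ID, file path), gets appended as a new row into your Bronze Delta table. The stream tracks processed files in its checkpoint, so it does not reprocess the same file even if hundreds of rules fire simultaneously.
In short: parquet file arrives → stream detects it → each row becomes a JSON string → appended to Bronze Delta table. Done.
Here's minimal code that does that (it assumes the Purview output has a rule_id column):
from pyspark.sql.functions import col, current_timestamp, struct, to_json

source_path = "abfss://.../raw-lakehouse/Files/purview-dq/"

# Streaming file sources need an explicit schema; borrow it from a
# one-off batch read of the files already present in the folder.
schema = spark.read.parquet(source_path).schema

df = (spark.readStream
      .format("parquet")
      .schema(schema)
      .load(source_path))

query = (df.select(
            to_json(struct("*")).alias("payload"),            # whole row as one JSON string
            current_timestamp().alias("ingestion_ts"),
            col("rule_id"),                                   # assumed column in the DQ output
            col("_metadata.file_path").alias("source_file"))  # hidden file metadata column
         .writeStream
         .format("delta")
         .option("checkpointLocation", "abfss://.../checkpoints/bronze")
         .trigger(availableNow=True)  # process everything new, then stop
         .toTable("bronze.dq_errors_raw"))
query.awaitTermination()
That's it! One job, one stream, handles all your rules at once. Note that with availableNow=True the job drains whatever is new and then stops, so it needs to be started by the OneLake event or by a schedule; for a continuously running stream you would use a processing-time trigger instead.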
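If it helps to see what a Bronze row ends up looking like, here is the same row-to-payload idea in plain Python, outside Spark. The function name, the example row and the file path are all made up; only the column names (payload, ingestion_ts, rule_id, source_file) follow the code above.

```python
import json
from datetime import datetime, timezone

def to_bronze_record(row, source_file, ingestion_ts=None):
    """Wrap one source row (a dict) into a Bronze record with a JSON payload column."""
    ts = ingestion_ts or datetime.now(timezone.utc)
    return {
        "payload": json.dumps(row, default=str),  # all columns as one JSON string
        "ingestion_ts": ts.isoformat(),
        "rule_id": row.get("rule_id"),            # None if the column is missing
        "source_file": source_file,
    }

# Hypothetical failed record written by a DQ rule.
row = {"rule_id": "R-001", "customer_id": 42, "email": None}
record = to_bronze_record(row, "Files/purview-dq/part-0001.parquet")
print(record["payload"])
```

Because the payload is just a string, Bronze does not need to know the schema of each rule's output; Silver can parse the JSON later.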