Hi,
I'm currently working on a POC where data from multiple sources lands in a Lakehouse folder. The requirement is to automatically pick up each file as soon as it lands, process it, and push the data to EventHub.
We initially considered using Data Activator for this, but it doesn't support passing parameters to downstream jobs. This poses a risk, especially when multiple files arrive simultaneously, as it could lead to conflicts or incorrect processing.
Additionally, we are dealing with files that can range from a single record to millions of records, which adds another layer of complexity.
Given these challenges, what would be the best approach to handle this scenario efficiently and reliably? Any suggestions would be greatly appreciated.
Thanks in advance!
Thank you for reaching out to the Microsoft Fabric Community Forum.
A reliable approach combines Spark Structured Streaming for near-real-time file detection and processing, a Fabric Pipeline for orchestration, and Eventstream to deliver the data to Azure Event Hub. This works around Data Activator's parameter limitation, handles simultaneous file arrivals safely (the streaming checkpoint tracks exactly which files have been processed), and scales from single-record files to files with millions of records.
To set up the Lakehouse and landing folder, create a Lakehouse in your Fabric workspace, define a folder structure in the Files section for the incoming source files (for example, Files/landing/), and create a Delta table in the Tables section to temporarily store processed data before routing it to Event Hub.
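If you prefer to pre-create that staging table rather than letting the stream create it on first write, a minimal sketch is below; the table name staging_events and its columns are assumptions that the later sketches reuse.

```python
# Hypothetical staging table; the name and columns are assumptions reused in the later sketches.
spark.sql("""
    CREATE TABLE IF NOT EXISTS staging_events (
        id STRING,
        payload STRING,
        event_time TIMESTAMP
    ) USING DELTA
""")
```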
To detect and process files in near real time with Spark Structured Streaming, create a new Notebook in Fabric's Data Engineering workload and use PySpark to set up a streaming job that monitors the Files/landing/ folder.
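A minimal PySpark sketch of that streaming job, assuming JSON files with a known schema land under Files/landing/ and the Notebook is attached to the Lakehouse (adjust the format, schema, and table name to your data):

```python
# Minimal sketch: stream newly landed files from the Lakehouse into a staging Delta table.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([                          # streaming file sources require an explicit schema
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

raw_stream = (
    spark.readStream
        .format("json")                        # or "csv" / "parquet", depending on the source files
        .schema(schema)
        .load("Files/landing/")                # relative Lakehouse path; each new file is picked up once
)

# Append every micro-batch to the staging table. The checkpoint records which files have been
# processed, so files arriving at the same time are handled without conflicts or double-processing.
query = (
    raw_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "Files/checkpoints/landing_stream")
        .toTable("staging_events")             # hypothetical table name in the Tables section
)
```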
Next, create an Eventstream item in your Fabric workspace, add a Lakehouse source that reads new records from the Delta table, and connect your Azure Event Hub instance as the destination, mapping the Delta table columns to the Event Hub message format.
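Eventstream itself is configured through the UI, so no code is needed for that step. If you ever need an alternative that bypasses Eventstream and pushes micro-batches from the Notebook straight to Event Hub, a hedged sketch using the azure-eventhub SDK could look like this (the connection string and hub name are placeholders, and collect() is only acceptable for modest batch sizes):

```python
# Alternative (not required when using Eventstream): push each micro-batch directly to Event Hub.
from azure.eventhub import EventHubProducerClient, EventData

EVENTHUB_CONN_STR = "<event-hub-namespace-connection-string>"  # placeholder
EVENTHUB_NAME = "<event-hub-name>"                             # placeholder

def send_batch_to_eventhub(batch_df, batch_id):
    # Serialize each Delta row to one JSON message for Event Hub.
    messages = batch_df.toJSON().collect()
    if not messages:
        return
    producer = EventHubProducerClient.from_connection_string(
        EVENTHUB_CONN_STR, eventhub_name=EVENTHUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        for msg in messages:
            try:
                batch.add(EventData(msg))
            except ValueError:               # current batch hit its size limit: flush and start a new one
                producer.send_batch(batch)
                batch = producer.create_batch()
                batch.add(EventData(msg))
        producer.send_batch(batch)           # flush whatever remains

# Attach to the staging table from the earlier sketch and forward new rows as they arrive.
(spark.readStream.table("staging_events")
    .writeStream
    .option("checkpointLocation", "Files/checkpoints/eventhub_push")
    .foreachBatch(send_batch_to_eventhub)
    .start())
```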
Then orchestrate with a Fabric Pipeline: create a pipeline in the Data Factory workload, add a Notebook activity that starts the Spark Structured Streaming job, and schedule the pipeline so the streaming job is restarted if it ever stops, or run it on a set schedule if micro-batch latency is acceptable (Eventstream runs continuously once it is configured).
Finally, tune for file size variability: cap how many files each micro-batch picks up and adjust the trigger interval so bursts of small files are batched together while multi-million-record files get their own micro-batch, then monitor the streaming job in the Fabric Monitoring Hub and the Event Hub metrics in the Azure portal.
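As an illustration of those tuning knobs, the earlier streaming read and write can be adjusted like this; the specific values are assumptions to test against your actual file sizes and latency requirements:

```python
# Illustrative tuning only: adjust the values against your real file sizes and latency needs.
raw_stream = (
    spark.readStream
        .format("json")
        .schema(schema)                           # same schema as the earlier sketch
        .option("maxFilesPerTrigger", 5)          # cap files per micro-batch so one huge file
                                                  # gets its own batch instead of blocking others
        .load("Files/landing/")
)

query = (
    raw_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "Files/checkpoints/landing_stream")
        .trigger(processingTime="30 seconds")     # batch tiny files together instead of firing per file
        .toTable("staging_events")
)
```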
If this response resolves your query, please mark it as the Accepted Solution to assist other community members. A Kudos is also appreciated if you found the response helpful.
Thank You!
Hi @ananthkrishna99
I hope this information is helpful. Please let me know if you have any further questions or if you'd like to discuss this further. If this answers your question, please Accept it as a solution and give it a 'Kudos' so others can find it easily.
Thank you.
Hi @ananthkrishna99
I wanted to check if you had the opportunity to review the information provided. Please feel free to contact us if you have any further questions. If my response has addressed your query, please accept it as a solution and give a 'Kudos' so other members can easily find it.
Thank you.
May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.
Thank you.