_augustine_
Regular Visitor

Spark Write Stream in One Lake

Hi,

I am trying to create a streaming data pipeline with a single source folder in ADLS Gen2 that receives multiple files belonging to different Oracle tables. In the Fabric workspace, I have a Spark streaming notebook that reads the files from this single source folder in ADLS Gen2, and I want to write them to separate folders per table name in OneLake within the Fabric workspace. How can we achieve this streaming requirement without creating a separate stream for each table?


5 REPLIES
Anonymous
Not applicable

Hi @_augustine_ ,

Thanks for using Fabric Community.
I am just sharing an idea and a generalised approach based on my understanding.

Here's how you can achieve your streaming requirement in Microsoft Fabric without creating multiple streams for each table:

 

1. Pipeline with Notebook Activity:

  • Create a pipeline in Microsoft Fabric with a Notebook Activity. This activity will execute the notebook you design to read data from ADLS Gen2 and save it to OneLake.

 

2. Event-Driven Trigger:

  • Configure the pipeline to be triggered by an event. You have two options:
    • ADLS Gen2 Event: If your ADLS Gen2 storage account supports eventing, you can set up an event trigger that fires whenever a new file appears in the source folder. This will automatically initiate the pipeline execution.
    • Schedule Trigger (Alternative): As an alternative, you can set a scheduled trigger for the pipeline to run periodically (e.g., every minute). While not truly event-driven, this ensures the pipeline checks for new data at regular intervals.

 

3. Notebook Logic:

  • Within your notebook (see the sketch after this list):
    • Read the files currently in the ADLS Gen2 folder using spark.read.format("parquet") (or whichever format your files arrive in).
    • Extract the table name from the file name using string logic such as split or a regular expression.
    • Partition the data by the extracted table name using df.write.partitionBy("table_name").
    • Write the partitioned data to the OneLake folders in Parquet (or Delta) format.
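
A minimal PySpark sketch of that notebook logic (my own illustration, a generalised idea rather than tested code). It assumes the incoming files are Parquet, that the table name is the prefix of each file name before the first underscore (for example CUSTOMERS_20240101.parquet), and that the abfss:// paths are placeholders for your storage account, workspace and lakehouse:

from pyspark.sql import functions as F

# Placeholder paths - replace with your storage account, workspace and lakehouse names
source_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/landing/"
onelake_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/tables/"

# 'spark' is the SparkSession that Fabric notebooks provide by default.
# Read every file currently sitting in the landing folder (one batch per pipeline run).
df = spark.read.format("parquet").load(source_path)

# Derive the table name from each row's source file,
# assuming a naming convention like TABLENAME_YYYYMMDD.parquet
df = (
    df.withColumn("source_file", F.input_file_name())
      .withColumn("table_name",
                  F.regexp_extract(F.element_at(F.split("source_file", "/"), -1), r"^([^_]+)_", 1))
      .drop("source_file")
)

# A single write, fanned out into one folder per table under OneLake
(
    df.write.mode("append")
      .partitionBy("table_name")
      .format("parquet")
      .save(onelake_path)
)

Because partitionBy("table_name") creates one sub-folder per distinct value (for example .../tables/table_name=CUSTOMERS/), every table lands in its own folder without a separate write per table.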

 

4. Benefits:

  • Event-Driven Efficiency: The pipeline triggers only when new data arrives (if using ADLS Gen2 events), optimizing resource usage.
  • Modular Design: The pipeline and notebook maintain separation of concerns, making the code more maintainable.
  • Scalability: This approach scales well as you add more tables. New tables will be automatically handled by the notebook logic.

 

By following these steps, you can create a single pipeline in Microsoft Fabric that efficiently reads data from a single ADLS Gen2 folder, extracts table names, partitions the data, and writes it to separate folders in OneLake based on the table name.

This is just an idea for your scenario; I hope it is helpful. Please do let me know in case of further queries.


Anonymous
Not applicable

Hi @_augustine_ ,

We haven't heard from you since the last response and were just checking back to see if your query was answered.
If not, please reply with more details and we will try to help.

Thanks

Anonymous
Not applicable

Hi @_augustine_ ,

We haven't heard from you since the last response and were just checking back to see if your query was answered.
If not, please reply with more details and we will try to help.

Thanks

Hi,

Thanks a lot for the solution. I am following the approach above, but for the notebook part I want to use a Spark streaming notebook, with an option to stop the streaming notebook when there are no files left in ADLS Gen2.

Example:
Multiple files arrive at certain intervals in ADLS Gen2, which triggers the pipeline and the Spark streaming notebook. When there are no more files, the pipeline stops, which also stops the Spark streaming notebook.

In the solution you provided, files that arrive after the pipeline has started running won't be read by the notebook. Right?
 

Anonymous
Not applicable

Hi @_augustine_ ,

Glad to know that you got some insights.
Yes, your understanding is right. The notebook can read the files that are present before execution, but not the ones that arrive after the pipeline execution has started.
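
If you do want the notebook itself to be a streaming one that drains whatever files are available and then stops on its own, one option is Structured Streaming's availableNow trigger (Spark 3.3 and later). Below is only a rough sketch of that idea, not tested code; the paths, the schema-inference step and the file-naming convention (table name as the prefix before the first underscore) are assumptions you would adapt:

from pyspark.sql import functions as F

# Placeholder paths - replace with your storage account, workspace and lakehouse names
source_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/landing/"
onelake_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/tables/"
checkpoint_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/_checkpoints/landing_stream"

# File-source streams need an explicit schema; here it is inferred once from the files already present
schema = spark.read.parquet(source_path).schema

stream_df = (
    spark.readStream.format("parquet")
         .schema(schema)
         .load(source_path)
         .withColumn("source_file", F.input_file_name())
         .withColumn("table_name",
                     F.regexp_extract(F.element_at(F.split("source_file", "/"), -1), r"^([^_]+)_", 1))
         .drop("source_file")
)

query = (
    stream_df.writeStream.format("parquet")
             .partitionBy("table_name")
             .option("checkpointLocation", checkpoint_path)  # lets the next run skip files already processed
             .trigger(availableNow=True)                     # process everything currently available, then stop
             .start(onelake_path)
)

query.awaitTermination()  # returns once the backlog is drained, so the notebook (and the pipeline run) can finish

With trigger(availableNow=True) the query ends by itself once nothing is left to read, so the notebook and the pipeline run can finish, and the checkpoint lets the next triggered run pick up only the files that arrived in the meantime.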

You can also refer to this: Get started with streaming data in lakehouse - Microsoft Fabric | Microsoft Learn

I hope this might help you.
Thank you
