_augustine_
Regular Visitor

Spark Write Stream in One Lake

Hi,

I am trying to create a streaming data pipeline with a single source folder in ADLS Gen2 that receives multiple files belonging to different Oracle tables. In the Fabric workspace, I have a Spark streaming notebook that reads the files from this single source folder in ADLS Gen2, and I want to write them to separate folders per table name in OneLake within the Fabric workspace. How can we achieve this streaming requirement without creating a separate stream for each table?


5 REPLIES
Anonymous
Not applicable

Hi @_augustine_ ,

Thanks for using Fabric Community.
I am just sharing an idea and a generalised approach based on my understanding.

Here's how you can achieve your streaming requirement in Microsoft Fabric without creating multiple streams for each table:

 

1. Pipeline with Notebook Activity:

  • Create a pipeline in Microsoft Fabric with a Notebook Activity. This activity will execute the notebook you design to read data from ADLS Gen2 and save it to OneLake.

 

2. Event-Driven Trigger:

  • Configure the pipeline to be triggered by an event. You have two options:
    • ADLS Gen2 Event: If your ADLS Gen2 storage account supports eventing, you can set up an event trigger that fires whenever a new file appears in the source folder. This will automatically initiate the pipeline execution.
    • Schedule Trigger (Alternative): As an alternative, you can set a scheduled trigger for the pipeline to run periodically (e.g., every minute). While not truly event-driven, this ensures the pipeline checks for new data at regular intervals.

 

3. Notebook Logic:

  • Within your notebook (see the sketch after this list):
    • Read the files currently in the ADLS Gen2 folder using spark.read.format("parquet") (or whichever format your files arrive in).
    • Extract the table name from the file name using string logic such as split or a regular expression.
    • Partition the data by the extracted table name using df.write.partitionBy("table_name").
    • Write the partitioned data to the OneLake folders in Parquet (or Delta) format.
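
A minimal PySpark sketch of that notebook logic (my own illustration, a generalised idea rather than tested code). It assumes the incoming files are Parquet, that the table name is the prefix of each file name before the first underscore (for example CUSTOMERS_20240101.parquet), and that the abfss:// paths are placeholders for your storage account, workspace and lakehouse:

from pyspark.sql import functions as F

# Placeholder paths - replace with your storage account, workspace and lakehouse names
source_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/landing/"
onelake_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/tables/"

# 'spark' is the SparkSession that Fabric notebooks provide by default.
# Read every file currently sitting in the landing folder (one batch per pipeline run).
df = spark.read.format("parquet").load(source_path)

# Derive the table name from each row's source file,
# assuming a naming convention like TABLENAME_YYYYMMDD.parquet
df = (
    df.withColumn("source_file", F.input_file_name())
      .withColumn("table_name",
                  F.regexp_extract(F.element_at(F.split("source_file", "/"), -1), r"^([^_]+)_", 1))
      .drop("source_file")
)

# A single write, fanned out into one folder per table under OneLake
(
    df.write.mode("append")
      .partitionBy("table_name")
      .format("parquet")
      .save(onelake_path)
)

Because partitionBy("table_name") creates one sub-folder per distinct value (for example .../tables/table_name=CUSTOMERS/), every table lands in its own folder without a separate write per table.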

 

4. Benefits:

  • Event-Driven Efficiency: The pipeline triggers only when new data arrives (if using ADLS Gen2 events), optimizing resource usage.
  • Modular Design: The pipeline and notebook maintain separation of concerns, making the code more maintainable.
  • Scalability: This approach scales well as you add more tables. New tables will be automatically handled by the notebook logic.

 

By following these steps, you can create a single pipeline in Microsoft Fabric that efficiently reads data from a single ADLS Gen2 folder, extracts table names, partitions the data, and writes it to separate folders in OneLake based on the table name.

This is just an idea for your scenario; I hope it is helpful. Please do let me know in case of further queries.


Anonymous
Not applicable

Hi @_augustine_ ,

We haven't heard from you since the last response and were just checking back to see if your query was answered.
If not, please reply with more details and we will try to help.

Thanks

Anonymous
Not applicable

Hi @_augustine_ ,

We haven't heard from you since the last response and were just checking back to see if your query was answered.
If not, please reply with more details and we will try to help.

Thanks

Hi,

Thanks a lot for the solution. I am following the approach above, but for the notebook part I want to use a Spark streaming notebook, with an option to stop the streaming notebook when there are no files left in ADLS Gen2.

Example:
Multiple files arrive at certain intervals in ADLS Gen2, which triggers the pipeline and the Spark streaming notebook. When there are no more files, the pipeline stops, which also stops the Spark streaming notebook.

In the solution you provided, files that arrive after the pipeline has started running won't be read by the notebook. Right?
 

Anonymous
Not applicable

Hi @_augustine_ ,

Glad to know that you got some insights.
Yes, your understanding is right. The notebook can read the files that are present before execution, but not the ones that arrive after the pipeline execution has started.
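
If you do want the notebook itself to be a streaming one that drains whatever files are available and then stops on its own, one option is Structured Streaming's availableNow trigger (Spark 3.3 and later). Below is only a rough sketch of that idea, not tested code; the paths, the schema-inference step and the file-naming convention (table name as the prefix before the first underscore) are assumptions you would adapt:

from pyspark.sql import functions as F

# Placeholder paths - replace with your storage account, workspace and lakehouse names
source_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/landing/"
onelake_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/tables/"
checkpoint_path = "abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/YourLakehouse.Lakehouse/Files/_checkpoints/landing_stream"

# File-source streams need an explicit schema; here it is inferred once from the files already present
schema = spark.read.parquet(source_path).schema

stream_df = (
    spark.readStream.format("parquet")
         .schema(schema)
         .load(source_path)
         .withColumn("source_file", F.input_file_name())
         .withColumn("table_name",
                     F.regexp_extract(F.element_at(F.split("source_file", "/"), -1), r"^([^_]+)_", 1))
         .drop("source_file")
)

query = (
    stream_df.writeStream.format("parquet")
             .partitionBy("table_name")
             .option("checkpointLocation", checkpoint_path)  # lets the next run skip files already processed
             .trigger(availableNow=True)                     # process everything currently available, then stop
             .start(onelake_path)
)

query.awaitTermination()  # returns once the backlog is drained, so the notebook (and the pipeline run) can finish

With trigger(availableNow=True) the query ends by itself once nothing is left to read, so the notebook and the pipeline run can finish, and the checkpoint lets the next triggered run pick up only the files that arrived in the meantime.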

You can also refer to this: Get started with streaming data in lakehouse - Microsoft Fabric | Microsoft Learn

I hope this might help you.
Thank you
