In my current project I have a requirement where multiple files can arrive, but I have to read only 10 files and load them to the warehouse using Spark notebooks. Please suggest the best approach and Spark code to implement this.
Problem:
In this task, many files can arrive in the source folder, but only 10 files should be processed at a time.
Proposed Solution:
To handle this, the Spark notebook first checks which files are available, then picks only the first 10 files and loads them into the warehouse. After the data is successfully loaded, those files are marked as processed or moved to a separate folder, so they are not picked up again.
This approach keeps the process controlled, efficient, and safe, and ensures that one large batch of files doesn’t overload the system. For scenarios where files arrive continuously, Auto Loader (a Databricks-specific feature) can be used to automatically process a fixed number of files per run.
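Batch Version (sketch)
As a rough illustration of the flow above (list the files, pick only 10, load them, then move them so they are not re-processed), here is a minimal PySpark sketch. The folder paths, table name, and CSV format are placeholders to adapt; it appends to a staging table, and the final hop into the Warehouse can go through a connector or pipeline as mentioned elsewhere in this thread.
from notebookutils import mssparkutils

SOURCE_PATH = "Files/landing/"        # placeholder: incoming-files folder
PROCESSED_PATH = "Files/processed/"   # placeholder: archive folder for handled files
TARGET_TABLE = "staging_sales"        # placeholder: target staging table

# 1) List the available files and keep only the first 10 (sorted by name)
files = sorted(f.path for f in mssparkutils.fs.ls(SOURCE_PATH) if f.isFile)[:10]

if files:
    # 2) Load the selected files and append them to the staging table
    df = spark.read.option("header", "true").csv(files)
    df.write.mode("append").saveAsTable(TARGET_TABLE)

    # 3) Move each processed file so it is not picked up on the next run
    for src in files:
        mssparkutils.fs.mv(src, PROCESSED_PATH + src.split("/")[-1])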
Streaming Version (Auto Loader)
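Assuming SOURCE_PATH and TARGET_TABLE are defined first, for example with placeholder values like:
SOURCE_PATH = "Files/landing/"   # placeholder: folder the stream watches
TARGET_TABLE = "sales_staging"   # placeholder: target Delta table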
spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.maxFilesPerTrigger", 10) \
    .load(SOURCE_PATH) \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/chk/sales") \
    .toTable(TARGET_TABLE)
Alternatively, you can use a Pipeline Copy activity or Dataflow Gen2, which are more UI-friendly options.
If this helped, ✓ Mark as Solution | Kudos appreciated
Hi @Rahul_6610,
Checking in to see if your issue has been resolved. Let us know if you still need any assistance.
Thank you.
Hi @Rahul_6610,
Have you had a chance to review the solutions shared by @Murtaza_Ghafoor, @stoic-harsh, @deborshi_nag, and @smeetsh? If the issue persists, feel free to reply so we can help further.
Thank you.
As @stoic-harsh already mentioned, you can do this with a simple Copy data activity. This will allow you to read a whole range of file types into a lakehouse; from there you can copy the data, with any ETL you would like, into the warehouse. There is no need to use a notebook.
A couple of questions, though:
1: You mention you only need 10 files read. How are these 10 files defined? Are they fixed filenames, is it only today's date, etc.? The key here is that you need a way to pinpoint these files, with either a wildcard or a file list, for instance. Fabric will read a specific file, a wildcard file definition, or a file list. You can use dynamic expressions to define the filename as well.
2: What is the file type? Fabric can read a bunch of file types; make sure yours is one of them.
Cheers
Hans
Hello @Rahul_6610
You have 2 options:
1. Use a Spark notebook
Use Spark to read the 10 files, then use the Synapse/Fabric warehouse connector to write directly to your Warehouse. You can use a regex to select the 10 files.
import re
from notebookutils import mssparkutils
SOURCE_DIR = "abfss://container@storage.dfs.core.windows.net/data/"
# List contents
entries = mssparkutils.fs.ls(SOURCE_DIR)
# Regex: only parquet files with YYYYMMDD at end (change as needed)
pattern = re.compile(r".*_(\d{8})\.parquet$")
# Filter
matches = [f.path for f in entries if pattern.match(f.name)]
# Choose only 10
files_to_load = sorted(matches)[:10]
# Read only the selected files (note the * to unpack the list of paths)
df = spark.read.parquet(*files_to_load)
df.show()
# Write to the warehouse: either via JDBC (df.write.format("jdbc")...) or,
# in Fabric, via the Warehouse Spark connector (df.write.synapsesql(...)),
# depending on what is available in your environment.
2. Use TSQL with OPENROWSET
INSERT INTO dbo.MyEntity (Col1, Col2, _ingest_time_utc, _source_path)
SELECT
r.Col1,
r.Col2,
SYSUTCDATETIME() AS _ingest_time_utc,
r.$path AS _source_path
FROM OPENROWSET(
BULK 'https://onelake.dfs.fabric.microsoft.com/<workspace-guid>/<lakehouse-guid>/Files/myfolder/*.parquet',
FORMAT = 'PARQUET'
) AS r;
I would recommend you use T-SQL and OPENROWSET over a Spark notebook.
Hi @Rahul_6610,
You can certainly use Spark notebooks. But if your requirement is simple ingestion, I would recommend using either a Pipeline (Copy activity) or Dataflow Gen2 for a low-code ingestion experience. They are fairly easy to use and support a wide range of source connections.
Once you are more comfortable, or if you need finer-grained control (e.g., over file selection), using a notebook makes sense. Just provide a bit more context, such as the location of your files (ADLS, local, or another database) and how the target files can be identified (file name pattern, file type, etc.), and one of us can share simplified code.
Hi @Rahul_6610,
I'd start by getting the files into a Lakehouse. You can do this externally using the ADLS Gen2 APIs, or use a Fabric Copy job or a Pipeline with a Copy activity. You may also be able to use a notebook to copy the files now that notebooks can access connections, but I have not tested that.
Once they are in the lakehouse, you can use notebooks to interact with them and do whatever ETL needs to be done, and load them to a table in the lakehouse.
The last step is to use a copy activity or copy job to move the tables into the warehouse.
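If it helps, here is a rough sketch of that middle step (a notebook reading the landed files from the Lakehouse Files area and saving them as a Lakehouse table); the folder, the parquet format, and the table name are just placeholder assumptions:
# Read the files that were copied into the Lakehouse Files area (parquet assumed here)
df = spark.read.parquet("Files/landed/orders/")          # placeholder folder

# ...apply whatever ETL/transformations are needed...

# Save the result as a Lakehouse table, ready for the copy activity into the Warehouse
df.write.mode("overwrite").saveAsTable("orders_staged")  # placeholder table name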
Proud to be a Super User!