pmscorca
Kudo Kingpin

Ordering csv files to import historical data in a lakehouse

Hi,

I need to import historical data about some file groups in a lakehouse.

Each file group has a specific prefix saved in a configuration table in a warehouse.

Each file group has a name composed as filegroupprefix_nn_fromdate_mm_todate, where nn and mm are two-digit counters (e.g. 01, 02, 03, 04, ..., 10, 11, 12, 13 and so on) and fromdate and todate are dates in yyyymmdd format.

For a specific file group, e.g. filegroup1prefix, I could have these csv files to import:

  • filegroup1prefix_01_20250101_01_20250201
  • filegroup1prefix_01_20250101_02_20250202
  • filegroup1prefix_01_20250101_03_20250203
  • filegroup1prefix_01_20250101_04_20250204
  • etc. etc.

Each file covers about two months of data: the filegroup1prefix_01_20250101_01_20250201 csv file contains data between 20250101 (January) and 20250201 (February).
Moreover, the data of filegroup1prefix_01_20250101_02_20250202 has to overwrite the data of filegroup1prefix_01_20250101_01_20250201, filegroup1prefix_01_20250101_03_20250203 has to overwrite filegroup1prefix_01_20250101_02_20250202, and so on.
So, in order to avoid any mistakes, I need to process the csv files in file name order.
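For illustration: since nn and mm are zero-padded and the dates are in yyyymmdd format, the required order can be recovered from the names themselves. A minimal Python sketch (the file names below are hypothetical examples following the pattern above):

```python
import re

# Hypothetical file names following the <prefix>_<nn>_<fromdate>_<mm>_<todate> pattern
files = [
    "filegroup1prefix_01_20250101_03_20250203.csv",
    "filegroup1prefix_01_20250101_01_20250201.csv",
    "filegroup1prefix_01_20250101_02_20250202.csv",
]

def sort_key(name):
    # Extract nn, fromdate, mm, todate so the sort follows the naming scheme
    nn, fromdate, mm, todate = re.search(r"_(\d{2})_(\d{8})_(\d{2})_(\d{8})", name).groups()
    return (int(nn), fromdate, int(mm), todate)

ordered = sorted(files, key=sort_key)
print(ordered)
```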
I'm thinking of implementing a pipeline that runs:

  1. a lookup against the configuration table in the warehouse to get the file group prefixes,
  2. a ForEach to save the specific prefix in a variable, then a Get Metadata against the on-premises folder to get the names of all csv files, then a Filter to select the csv files belonging to the specific file group; but I cannot order these file names saved in an array.

Now, any suggestions to solve such an issue, please? Thanks

I've tried to search for a specific pipeline expression, something similar to "@sort", without success.

 

Update: while thinking about a good solution, I've created this idea -->
Feature to order an array into a pipeline - Microsoft Fabric Community
Please vote for it, thanks.

1 ACCEPTED SOLUTION
v-pnaroju-msft
Community Support

Hi pmscorca,

Please find the sample code and screenshots attached below, which may help resolve the issue:
1. Input files: (screenshot in the original post)

2. Output table: (screenshot in the original post)

3. Sample code:

 

from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: Path to files in Lakehouse
files_path = "Files/bronze/filegroup1prefix/"

# Step 2: List CSV files
files_df = (
    spark.read.format("binaryFile")
    .load(files_path)
    .select("path")
    .filter(F.col("path").endswith(".csv"))
)

# Step 3: Extract the running sequence number (mm) from the file name,
# which follows the <prefix>_<nn>_<fromdate>_<mm>_<todate> pattern
files_df = files_df.withColumn(
    "seq",
    F.regexp_extract(
        "path",
        r"_\d{2}_\d{8}_(\d{2})_\d{8}",
        1
    ).cast("int")
)

# Step 4: Sort files in correct order
ordered_files = files_df.orderBy("seq")

# Step 5: Sequential ingestion; each overwrite replaces the whole table,
# so the newest file's data wins
for row in ordered_files.collect():
    df = (
        spark.read
        .option("header", True)
        .option("dateFormat", "MM/dd/yyyy")
        .csv(row.path)
    )

    df.write \
      .format("delta") \
      .mode("overwrite") \
      .saveAsTable("sales_final")

print("Ordered ingestion completed successfully.")
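One caveat worth noting (an observation on the sketch above, not part of the original answer): because every iteration writes with mode("overwrite"), each file replaces the entire sales_final table, so after the loop only the last file's rows remain. If that newest-wins behavior is exactly what is required, loading just the last file in sorted order is equivalent. A pure-Python illustration with hypothetical file names:

```python
# With a full-table overwrite per file, the last file in sorted order "wins".
files = [
    "filegroup1prefix_01_20250101_01_20250201.csv",
    "filegroup1prefix_01_20250101_03_20250203.csv",
    "filegroup1prefix_01_20250101_02_20250202.csv",
]

# Zero-padded counters and yyyymmdd dates make plain lexicographic order correct
winner = sorted(files)[-1]
print(winner)  # filegroup1prefix_01_20250101_03_20250203.csv
```

If instead each file should replace only the date range it covers, an append or Delta merge pattern would be needed rather than a full overwrite.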

We hope the information provided is helpful. If you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.

View solution in original post

12 REPLIES
v-pnaroju-msft
Community Support

Hi pmscorca,

We would like to follow up and see whether the details we shared have resolved your problem. If you need any more assistance, please feel free to connect with the Microsoft Fabric community.

Thank you.


v-pnaroju-msft
Community Support

Hi pmscorca,

Thank you for the follow-up.

Based on my understanding, to ensure correct overwrite behavior you should list, sort, and ingest the files sequentially within a Fabric notebook. This ensures deterministic execution and is the supported approach when file order matters. Pipelines should be used only to orchestrate the execution of the notebook.

We hope this information helps to resolve the issue. Should you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.

Ok, could you provide a sample notebook to accomplish such a task, please? Thanks

v-pnaroju-msft
Community Support

Hi pmscorca,

Thank you for the follow-up. We have already upvoted the suggestion on the forum.

Based on my understanding, using a notebook to sort file names and then passing the resulting array to a pipeline ForEach is not a reliable pattern in Fabric. Pipelines do not guarantee ordered execution across activity boundaries. When file order is critical, both sorting and ingestion should be performed within the notebook, and pipelines should be used only to orchestrate the execution of the notebook. This is the supported and most robust approach in Fabric.

We hope this information helps resolve the issue. If you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.




Hi and many thanks for your reply.

Any suggestions on how to perform such an ingestion within a notebook? Thanks

lbendlin
Super User

Not really. You are trying to prevent parallelism from happening, and you are reducing the robustness of your ingestion process. Typical "fighting the API" pattern.

v-pnaroju-msft
Community Support

Thank you, @lbendlin and @tayloramy, for your responses.

Hi pmscorca,

We appreciate your inquiry on the Microsoft Fabric Community Forum.

Based on my understanding, in Fabric Data Pipelines, arrays returned by Get Metadata or Lookup activities cannot be sorted, and the execution order of ForEach is non-deterministic. Consequently, pipelines cannot guarantee filename based processing order, which may result in older CSV files overwriting newer data when overwrite logic is applied.

One workaround is to follow the approach suggested by @tayloramy.

Your idea to support array sorting within pipelines is valid and addresses the gap. Until such a capability is introduced, notebook based ordered ingestion is a feasible workaround.

Additionally, please refer to the links below:
How to use notebooks - Microsoft Fabric | Microsoft Learn
Overview of Fabric deployment pipelines - Microsoft Fabric | Microsoft Learn
The Microsoft Fabric deployment pipelines process - Microsoft Fabric | Microsoft Learn

We hope the information provided helps to resolve the issue. Should you have any further queries, please feel free to contact the Microsoft Fabric community.

Thank you.

Hi, thanks for your reply, but I hope you'll vote for my idea, please.

I'd like to use a notebook only to order the file names saved in an array, and then use the ForEach activity to iterate over the array from the first ordered file name to the last one: is that the right approach? Thanks
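A sketch of that idea, assuming a Fabric notebook: sort the names in Python, then return the ordered array to the calling pipeline as the notebook exit value. The file names below are hypothetical, and the exit call is shown as a comment because it only works inside a Fabric notebook session:

```python
import json

# Hypothetical unordered file names, e.g. passed in from a Get Metadata activity
names = [
    "filegroup1prefix_01_20250101_02_20250202.csv",
    "filegroup1prefix_01_20250101_01_20250201.csv",
]

# Zero-padded counters and yyyymmdd dates mean lexicographic order is chronological
ordered = sorted(names)
result = json.dumps(ordered)

# Inside a Fabric notebook, return the ordered array to the calling pipeline:
# notebookutils.notebook.exit(result)
print(result)
```

The pipeline would then need a sequential ForEach over the returned array for the order to be preserved, which is the concern raised in the replies above.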

tayloramy
Super User

Hi @pmscorca

 

The way I would approach this is by landing the files in a lakehouse, and then using a notebook to load them, so you can ensure things load in the order you want (by file name, and not the system modified time).

 

Then you load the data into a lakehouse table, and from there you can move it into the warehouse. 

 

 





lbendlin
Super User

Not clear why the order of processing would be important here. What kind of mistakes are you anticipating?

Because old files could overwrite more recent files.
