Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Did you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now

Reply
SuganADF
New Member

Filter CSV dataset based on a column

Hi,

I am a newbie to ADF. i have CSV dataset referring from SFTP holding 14 million records. Currently, it has copyactivity which copies to CSV dataset(unzipped) from source zipped format dataset! Then dataflow , which refers this CSV and loads to AzureSQL (upsert operation) . The pipeline fails at times. 

Its an ask to filter read data only from 8 days backdated based on a column "DateColumn" present in the CSV file.

I added this condition(filter) in dataflow (Source ->filter ->sink) where Filter takes expression as DateColumn>= toString(addDays(currentDate(), -8)).

The pipeline doesnt haveconsistent success! any other method would help since the data would be growing for years .

 

1 ACCEPTED SOLUTION
Tamanchu
Continued Contributor
Continued Contributor

Hi @SuganADF ,

 

The core issue is that SFTP as a source doesn't support query pushdown there's no way to push the date filter to the SFTP side, so the full 14M rows are always read before filtering. This explains both the inconsistency (timeouts on large reads) and the slow performance.

Here are the practical approaches depending on your situation:

Option 1 Files are split by date on SFTP (recommended)
If your SFTP has daily/weekly files in a folder structure like /data/YYYY/MM/DD/, use a wildcard file path in Copy Activity to only read the last 8 days:

/data/2026/04/*/*.csv ← dynamically generated using @adddays(utcnow(),-8)

This avoids loading the full file entirely.

Option 2 Single large file, staging approach
Load the full CSV into a Lakehouse table first via Copy Activity (faster than Dataflow for raw ingestion), then query with a Notebook or Warehouse:

df = spark.read.format("delta").load("Tables/staging_csv")
filtered = df.filter(df["DateColumn"] >= date_8_days_ago)
filtered.write.format("delta").mode("overwrite").save("Tables/output")

This decouples the slow SFTP read from the filter logic and is much more reliable.

Option 3 Optimize Dataflow Gen2
If you must keep Dataflow, enable incremental refresh on the sink and make sure the filter uses a folding-compatible expression (some expressions break query folding and force a full scan). Check the "Query folding indicators" in Power Query editor.

For 14M+ rows from SFTP, Options 1 or 2 will be significantly more stable than filtering inside Dataflow.

 

I hope you'll find this response helpful. 

View solution in original post

4 REPLIES 4
Tamanchu
Continued Contributor
Continued Contributor

Hi @SuganADF ,

 

The core issue is that SFTP as a source doesn't support query pushdown there's no way to push the date filter to the SFTP side, so the full 14M rows are always read before filtering. This explains both the inconsistency (timeouts on large reads) and the slow performance.

Here are the practical approaches depending on your situation:

Option 1 Files are split by date on SFTP (recommended)
If your SFTP has daily/weekly files in a folder structure like /data/YYYY/MM/DD/, use a wildcard file path in Copy Activity to only read the last 8 days:

/data/2026/04/*/*.csv ← dynamically generated using @adddays(utcnow(),-8)

This avoids loading the full file entirely.

Option 2 Single large file, staging approach
Load the full CSV into a Lakehouse table first via Copy Activity (faster than Dataflow for raw ingestion), then query with a Notebook or Warehouse:

df = spark.read.format("delta").load("Tables/staging_csv")
filtered = df.filter(df["DateColumn"] >= date_8_days_ago)
filtered.write.format("delta").mode("overwrite").save("Tables/output")

This decouples the slow SFTP read from the filter logic and is much more reliable.

Option 3 Optimize Dataflow Gen2
If you must keep Dataflow, enable incremental refresh on the sink and make sure the filter uses a folding-compatible expression (some expressions break query folding and force a full scan). Check the "Query folding indicators" in Power Query editor.

For 14M+ rows from SFTP, Options 1 or 2 will be significantly more stable than filtering inside Dataflow.

 

I hope you'll find this response helpful. 

Hi @SuganADF ,

I hope the above details help you fix the issue. If you still have any questions or need more help, feel free to reach out. We’re always here to support you.

Best Regards, 
Community Support Team

Hi @SuganADF ,

I hope the above details help you fix the issue. If you still have any questions or need more help, feel free to reach out. We’re always here to support you.

Best Regards, 
Community Support Team

v-menakakota
Community Support
Community Support

Hi @SuganADF ,
Thanks for reaching out to the Microsoft fabric community forum. 

 

The issue is likely because the filter is applied in Data Flow after reading all 14M+ records, which impacts performance and causes failures.

Try to filter data earlier instead of in Data Flow. If possible, process only recent files (last 8 days) using folder/date-based filtering. Consider converting CSV to Parquet/Delta for better performance.

 

If I misunderstand your needs or you still have problems on it, please feel free to let us know. 

Best Regards, 
Community Support Team

Helpful resources

Announcements
April Fabric Update Carousel

Fabric Monthly Update - April 2026

Check out the April 2026 Fabric update to learn about new features.

Fabric SQL PBI Data Days

Data Days 2026 coming soon!

Sign up to receive a private message when registration opens and key events begin.

New to Fabric survey Carousel

New to Fabric Survey

If you have recently started exploring Fabric, we'd love to hear how it's going. Your feedback can help with product improvements.