This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. We're covering it all. You won't want to miss it.
Learn moreDid you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now
Hi,
I am a newbie to ADF. i have CSV dataset referring from SFTP holding 14 million records. Currently, it has copyactivity which copies to CSV dataset(unzipped) from source zipped format dataset! Then dataflow , which refers this CSV and loads to AzureSQL (upsert operation) . The pipeline fails at times.
Its an ask to filter read data only from 8 days backdated based on a column "DateColumn" present in the CSV file.
I added this condition(filter) in dataflow (Source ->filter ->sink) where Filter takes expression as DateColumn>= toString(addDays(currentDate(), -8)).
The pipeline doesnt haveconsistent success! any other method would help since the data would be growing for years .
Solved! Go to Solution.
Hi @SuganADF ,
The core issue is that SFTP as a source doesn't support query pushdown there's no way to push the date filter to the SFTP side, so the full 14M rows are always read before filtering. This explains both the inconsistency (timeouts on large reads) and the slow performance.
Here are the practical approaches depending on your situation:
Option 1 Files are split by date on SFTP (recommended)
If your SFTP has daily/weekly files in a folder structure like /data/YYYY/MM/DD/, use a wildcard file path in Copy Activity to only read the last 8 days:
/data/2026/04/*/*.csv ← dynamically generated using @adddays(utcnow(),-8)
This avoids loading the full file entirely.
Option 2 Single large file, staging approach
Load the full CSV into a Lakehouse table first via Copy Activity (faster than Dataflow for raw ingestion), then query with a Notebook or Warehouse:
df = spark.read.format("delta").load("Tables/staging_csv")
filtered = df.filter(df["DateColumn"] >= date_8_days_ago)
filtered.write.format("delta").mode("overwrite").save("Tables/output")
This decouples the slow SFTP read from the filter logic and is much more reliable.
Option 3 Optimize Dataflow Gen2
If you must keep Dataflow, enable incremental refresh on the sink and make sure the filter uses a folding-compatible expression (some expressions break query folding and force a full scan). Check the "Query folding indicators" in Power Query editor.
For 14M+ rows from SFTP, Options 1 or 2 will be significantly more stable than filtering inside Dataflow.
I hope you'll find this response helpful.
Hi @SuganADF ,
The core issue is that SFTP as a source doesn't support query pushdown there's no way to push the date filter to the SFTP side, so the full 14M rows are always read before filtering. This explains both the inconsistency (timeouts on large reads) and the slow performance.
Here are the practical approaches depending on your situation:
Option 1 Files are split by date on SFTP (recommended)
If your SFTP has daily/weekly files in a folder structure like /data/YYYY/MM/DD/, use a wildcard file path in Copy Activity to only read the last 8 days:
/data/2026/04/*/*.csv ← dynamically generated using @adddays(utcnow(),-8)
This avoids loading the full file entirely.
Option 2 Single large file, staging approach
Load the full CSV into a Lakehouse table first via Copy Activity (faster than Dataflow for raw ingestion), then query with a Notebook or Warehouse:
df = spark.read.format("delta").load("Tables/staging_csv")
filtered = df.filter(df["DateColumn"] >= date_8_days_ago)
filtered.write.format("delta").mode("overwrite").save("Tables/output")
This decouples the slow SFTP read from the filter logic and is much more reliable.
Option 3 Optimize Dataflow Gen2
If you must keep Dataflow, enable incremental refresh on the sink and make sure the filter uses a folding-compatible expression (some expressions break query folding and force a full scan). Check the "Query folding indicators" in Power Query editor.
For 14M+ rows from SFTP, Options 1 or 2 will be significantly more stable than filtering inside Dataflow.
I hope you'll find this response helpful.
Hi @SuganADF ,
I hope the above details help you fix the issue. If you still have any questions or need more help, feel free to reach out. We’re always here to support you.
Best Regards,
Community Support Team
Hi @SuganADF ,
I hope the above details help you fix the issue. If you still have any questions or need more help, feel free to reach out. We’re always here to support you.
Best Regards,
Community Support Team
Hi @SuganADF ,
Thanks for reaching out to the Microsoft fabric community forum.
The issue is likely because the filter is applied in Data Flow after reading all 14M+ records, which impacts performance and causes failures.
Try to filter data earlier instead of in Data Flow. If possible, process only recent files (last 8 days) using folder/date-based filtering. Consider converting CSV to Parquet/Delta for better performance.
If I misunderstand your needs or you still have problems on it, please feel free to let us know.
Best Regards,
Community Support Team
Check out the April 2026 Fabric update to learn about new features.
Sign up to receive a private message when registration opens and key events begin.
| User | Count |
|---|---|
| 5 | |
| 3 | |
| 2 | |
| 1 | |
| 1 |