PAVAN_111
New Member

How to validate CSV blank lines (before & after header) in Microsoft Fabric Data Pipelines?

Hi Community,

 

I have a requirement in Microsoft Fabric to validate CSV files (comma/pipe delimited) stored in a Lakehouse.

 

We need to detect blank lines in two scenarios:

 

Blank lines before the header
The first non-blank line should be considered the header. If any blank lines exist above the header, the file should be marked invalid.

 

Blank lines after the header
Any blank rows in the data section should also make the file invalid.

 

Additional constraints:

 

The solution must be implemented within Fabric only (no Azure Data Factory).

 

How can this be achieved in a Fabric Data Pipeline (without Dataflow Gen2)?

 

Files should remain in the same folder; we only need an output response/log showing which files are valid or invalid.

Your guidance means a lot and keeps learners like me motivated.

Thank you.

2 ACCEPTED SOLUTIONS
Ugk161610
Continued Contributor

Hi @PAVAN_111 ,

 

This is a good “quality gate” use case, and you can do it fully inside Fabric with a pipeline plus a notebook, without Dataflow Gen2 or ADF, and without touching the original files.

 

The simplest pattern is: let the pipeline call a notebook that reads each CSV as plain text, checks the lines, and then writes a small log table saying “valid / invalid / reason” for each file. The files stay exactly where they are.

 

Inside the notebook you don’t use spark.read.csv for the check, because that will happily ignore blank lines. Instead, you read the file as text and apply your two rules manually. For example (PySpark in a Fabric notebook):

 

from datetime import datetime

folder = "/lakehouse/default/Files/your-folder"  # or a path passed in from the pipeline

# List all CSV files under the folder
files_df = spark.read.format("binaryFile").load(folder + "/*.csv").select("path")
file_paths = [r.path for r in files_df.collect()]

results = []

for path in file_paths:
    # Read the file as text, line by line
    lines = spark.read.text(path).rdd.map(lambda r: r.value).collect()

    # Strip whitespace from each line, keeping the original index
    stripped = [(i, (line or "").strip()) for i, line in enumerate(lines)]

    # The first non-blank line is the header
    header_idx = next((i for i, v in stripped if v != ""), None)
    if header_idx is None:
        results.append((path, False, "File is empty or only blank lines"))
        continue

    # Rule 1: blank line(s) above the header
    if any(v == "" for i, v in stripped[:header_idx]):
        results.append((path, False, "Blank line(s) before header"))
        continue

    # Rule 2: blank line(s) after the header
    if any(v == "" for i, v in stripped[header_idx + 1:]):
        results.append((path, False, "Blank line(s) after header"))
    else:
        results.append((path, True, "Valid"))

# Write results into a small Delta table in the same Lakehouse
log_df = spark.createDataFrame(
    [(p, ok, reason, datetime.utcnow()) for p, ok, reason in results],
    ["file_path", "is_valid", "reason", "checked_utc"]
)
log_df.write.mode("append").saveAsTable("csv_validation_log")
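The same two rules can also be expressed as a small pure-Python helper, handy for unit-testing the logic before wiring it into the notebook. This is a sketch; the function name classify_csv_lines is illustrative, not part of any Fabric API:

```python
def classify_csv_lines(lines):
    """Apply the two blank-line rules to a list of raw lines.

    Returns (is_valid, reason), mirroring the notebook logic:
    - any blank line before the first non-blank (header) line -> invalid
    - any blank line after the header -> invalid
    """
    stripped = [(line or "").strip() for line in lines]
    header_idx = next((i for i, v in enumerate(stripped) if v != ""), None)
    if header_idx is None:
        return False, "File is empty or only blank lines"
    if header_idx > 0:  # every line before the first non-blank line is blank
        return False, "Blank line(s) before header"
    if any(v == "" for v in stripped[header_idx + 1:]):
        return False, "Blank line(s) after header"
    return True, "Valid"

print(classify_csv_lines(["id|name", "1|a", "2|b"]))  # (True, 'Valid')
print(classify_csv_lines(["", "id|name", "1|a"]))     # (False, 'Blank line(s) before header')
print(classify_csv_lines(["id|name", "", "1|a"]))     # (False, 'Blank line(s) after header')
```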

 

Your pipeline just has a notebook activity that passes the folder path (and maybe delimiter info) to this notebook. After it runs, you can query the csv_validation_log table to see exactly which files are valid or invalid and why, without moving or modifying the source files.

 

This stays 100% inside Fabric, uses only a pipeline and a notebook, and gives you a clear yes/no plus reason for every CSV in the folder.

– Gopi Krishna


v-ssriganesh
Community Support

Hi @PAVAN_111,

Thank you for posting your query in the Microsoft Fabric Community Forum, and thanks to @tayloramy & @Ugk161610 for sharing valuable insights.

 

Could you please confirm if your query has been resolved by the provided solutions? This would be helpful for other members who may encounter similar issues.

 

Thank you for being part of the Microsoft Fabric Community.


5 REPLIES

tayloramy
Community Champion

Hi @PAVAN_111

 

@Ugk161610 is correct; the best way to implement this is with a notebook.

If you must only use data pipelines, then you cannot achieve the desired result today. 

 

What is the limitation for not using a notebook? 

 

If you found this helpful, consider giving some Kudos. If I answered your question or solved your problem, mark this post as the solution. 

PAVAN_111
New Member

Hi,

Thanks for the suggestion!
Using a notebook is definitely a possible option, but in my case the requirement is a bit stricter:

I need to implement the solution without Dataflow Gen2 and without using a Notebook.
The design should use Fabric Data Pipeline activities only.

 

Ugk161610
Continued Contributor

Hi @PAVAN_111 ,

 

Thanks for the clarification — this changes the answer.

 

If you must use ONLY Fabric Data Pipeline activities (no notebook, no Dataflow Gen2), then this validation cannot be implemented fully in Fabric today. Pipeline activities do not read files line-by-line and they do not expose file content in a way that allows you to detect blank rows before or after the header.

 

Copy activity, Lookup, Get Metadata, and other pipeline steps only work at the file or dataset level, not at the row or line level. They can check file existence, size, and schema, but they cannot inspect raw file content for formatting rules like:

 

– detecting blank rows
– locating the first non-blank header
– validating rows inside the file

 

There is no built-in pipeline activity that can scan the contents of a CSV and apply row-level validation logic.

 

So with your current restriction:

  1. You can check file presence.
  2. You can route files based on name/size.
  3. You cannot validate blank lines.
  4. You cannot inspect file contents.
  5. You cannot identify the header position.
  6. You cannot flag row-level issues.

The only Fabric tools that support this kind of validation are:
• Notebooks (Spark / Python)
• Dataflow Gen2
• External services (Logic Apps, Azure Functions, etc.)

 

If none of those are allowed, then this requirement is not solvable in Fabric Pipelines alone at the moment. The only correct design is to introduce one processing step that can actually read file content.

 

If you want a workaround using pipelines only, the closest you can get is:
copy the file into a staging area and mark it “unvalidated” until a downstream process validates it — but the validation itself cannot happen inside the pipeline.

 

So the honest answer is:

  This requirement is not supported using Fabric pipelines alone today.

 

If the design can be adjusted, the smallest change would be allowing one lightweight notebook for validation only.

– Gopi Krishna

