Solved: Pipeline Copy activity CSV to Parquet issue

A_monged · ‎12-17-2024

I have a CSV file in my Lakehouse. When I use the Copy activity to move it and change the format to Parquet, I see columns with null values, although the CSV file is configured correctly (with escape and quote characters).

I tried both loading the file to a table in the Lakehouse and reading it as Parquet using Spark, and in both cases it shows the null columns.

A_monged · ‎12-18-2024

I found the issue to be that many rows with null values were intentionally added to the CSV, and after partitioning, the nulls were the first to be seen.
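A quick way to confirm this kind of finding is to count the blank rows in the raw CSV rather than judging from the first rows of a preview. The following is a minimal sketch using Python's standard csv module; the function name count_empty_rows and the sample data are made up for illustration, and it assumes the file has a header row and uses a comma delimiter with " as the quote character.

```python
import csv
import io


def count_empty_rows(csv_text, delimiter=",", quotechar='"'):
    """Return (total_rows, empty_rows) for the data rows of a CSV string.

    A row counts as empty when every field is blank after stripping
    whitespace; such rows typically show up as all-null rows after conversion.
    """
    reader = csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    next(reader, None)  # skip the header row
    total = 0
    empty = 0
    for row in reader:
        total += 1
        if all(field.strip() == "" for field in row):
            empty += 1
    return total, empty


# Hypothetical sample: two real rows and two blank rows
sample = "id,name\n1,Ann\n,\n,\n2,Bob\n"
print(count_empty_rows(sample))  # (4, 2)
```

If the empty-row count is large, null rows in the converted output are expected, and the order in which they appear in a preview says little about the quality of the conversion.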

Anonymous · ‎12-17-2024

Hi @A_monged ,

Thanks for the reply from lbendlin .

I used PySpark in a notebook to convert a CSV file in the Lakehouse to a Parquet file:

# Start (or reuse) a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV_to_Parquet").getOrCreate()

# Load the CSV file, treating the first row as column names
df = spark.read.format("csv").option("header", "true").load("Files/orders/2019.csv")
df.show()

# Write the DataFrame out in Parquet format
parquet_file_path = "Files/test.parquet"
df.write.parquet(parquet_file_path)

This works fine for me, and the resulting file does not contain null values.

The nulls may appear because the schema applied when reading the CSV does not match the data. A CSV file carries no type information, so a value that cannot be parsed as the column's expected type can end up as null in the Parquet output.
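One way to test the type-mismatch hypothesis before converting is to scan the raw CSV for values that would not parse as the type you expect. This is only a rough sketch with Python's standard csv module: the function name find_type_mismatches and the column names are hypothetical, and Python's int and float parsing rules are an approximation, not the rules Spark or the Copy activity apply.

```python
import csv
import io


def find_type_mismatches(csv_text, expected_types):
    """Return (record_number, column, value) for values that fail their cast.

    expected_types maps a column name to a callable such as int or float.
    Blank values are skipped: they are missing data, not type mismatches.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    mismatches = []
    # The header is record 1, so data records start at 2
    for record_number, row in enumerate(reader, start=2):
        for column, cast in expected_types.items():
            value = row.get(column)
            if value is None or value.strip() == "":
                continue
            try:
                cast(value)
            except ValueError:
                mismatches.append((record_number, column, value))
    return mismatches


# Hypothetical sample with one bad quantity and one bad price
sample = "order_id,quantity,price\n1,2,9.99\n2,two,4.50\n3,1,n/a\n"
print(find_type_mismatches(sample, {"quantity": int, "price": float}))
# [(3, 'quantity', 'two'), (4, 'price', 'n/a')]
```

An empty result suggests the nulls come from somewhere else, for example from rows that are blank in the source file.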

Optionally, you can use the same method as I did to convert the CSV file to a Parquet file.

If you want to save it as a table you can use the following syntax:

df.write.mode("overwrite").saveAsTable("parquetTestTable")

If you have any other questions please feel free to contact me.

Best Regards,
Yang
Community Support Team

If any post helps, please consider accepting it as the solution so that other members can find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!

A_monged · ‎12-18-2024

I tried your approach but still got the same null columns. Attached are the table output and sample rows of the raw CSV file.

lbendlin · ‎12-17-2024

Does your CSV file contain quoted row delimiters?

A_monged · ‎12-18-2024

The file uses a comma as the delimiter and " as the quote character.

lbendlin · ‎12-18-2024

So commas are quoted. But what about linefeeds in your data? Are they quoted?
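To answer this without reading the file by eye, a small check can list the records whose quoted fields contain a line break. This is a minimal sketch using Python's standard csv module; the function name records_with_embedded_newlines and the sample data are illustrative only. If it reports any records, note that Spark's CSV reader has a multiLine option that is off by default, so such records may be split across rows unless it is enabled.

```python
import csv
import io


def records_with_embedded_newlines(csv_text, delimiter=",", quotechar='"'):
    """Return 1-based record numbers (header = 1) whose fields contain a line break."""
    reader = csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    hits = []
    for record_number, row in enumerate(reader, start=1):
        # A quoted field keeps its line break, so it shows up inside the value
        if any("\n" in field or "\r" in field for field in row):
            hits.append(record_number)
    return hits


# Hypothetical sample: the second record has a quoted linefeed
sample = 'id,comment\n1,"line one\nline two"\n2,plain\n'
print(records_with_embedded_newlines(sample))  # [2]
```

An empty list means no field spans more than one physical line, which rules out quoted linefeeds as the cause.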

Anonymous · ‎12-22-2024

Hi @A_monged ,

Thanks for the reply from lbendlin .

To deal with the null values, one efficient option is Dataflow Gen2. In Dataflow Gen2 you can clean the data, for example by removing columns that contain null values, and then set the destination to the Lakehouse so that only data without null values is written there. This helps keep the loaded data clean and consistent.

For more information on using Dataflow Gen2, you can refer to these official documents:

mslearn-fabric

Create your first Microsoft Fabric dataflow - Microsoft Fabric | Microsoft Learn

If you have any other questions please feel free to contact me.

Best Regards,
Yang
Community Support Team

A_monged · ‎12-18-2024

There are no linefeeds in my data
