Kesahli
Frequent Visitor

Error reading parquet files: Illegal Parquet type: INT64 (TIME(NANOS,true))

Hi All, 

I have been getting the following error when reading some parquet files in a PySpark notebook.

 

Illegal Parquet type: INT64 (TIME(NANOS,true))

 

The parquet files are loaded by a copy activity in a pipeline, contained in a forEach loop, so it's not easy to pull them out and manually map them to, say, a string for later conversion.

 

I have done a bit of searching and it seems this was a known Spark issue some time ago that was apparently rectified in Spark 3.2 ([SPARK-40819] Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of aut...)

 

I have tried running the below in the first cell of the notebook.

spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
 
Additionally, I have tried setting the above in a Spark environment and assigning that environment to the notebook.
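These settings can also be supplied before the Spark session starts via the %%configure cell magic in a Fabric notebook, which may matter for options that are only read at session start. A minimal sketch, assuming it is run as the first cell before the session is active (only the first setting shown; the rest follow the same pattern):

%%configure
{
    "conf": {
        "spark.sql.legacy.parquet.nanosAsLong": "true"
    }
}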
 
Any other suggestions or help would be appreciated. 
 
Cheers.

1 ACCEPTED SOLUTION
Kesahli
Frequent Visitor

Just an update...

 

I have successfully read the offending files using pandas with the fastparquet engine (after setting up a new environment to load that library).

Once read into a pandas DataFrame, I convert it to a Spark DataFrame so the rest of the notebook can continue without refactoring. I found I do still need to run the spark.conf.set() calls above in order to write to Delta tables (which are obviously parquet underneath).

Not elegant, but it's a workaround unless anyone has something else?
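For anyone wanting the rough shape of it, a minimal sketch of the read-convert-write pattern described above. The file path and table name are placeholders, fastparquet must be installed in the notebook environment, and spark is the session object Fabric notebooks provide:

import pandas as pd

# Read the offending file with fastparquet, which tolerates the
# INT64 (TIME(NANOS,true)) column that Spark's parquet reader rejects.
pdf = pd.read_parquet("/lakehouse/default/Files/raw/example.parquet", engine="fastparquet")

# Convert to a Spark DataFrame so the rest of the notebook is unchanged.
sdf = spark.createDataFrame(pdf)

# The spark.conf.set() calls above still need to run before this write.
sdf.write.format("delta").mode("overwrite").saveAsTable("example_table")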


2 REPLIES 2

Anonymous
Not applicable

Hi @Kesahli,

I'm glad to hear you found a workaround. Would you mind sharing the code here? I think it will help others who face a similar scenario.

Regards,

Xiaoxin Sheng
