JVL2
Frequent Visitor

Error saving dataframe to delta due to Illegal unquoted character

Hi,

 

I encountered a strange issue. I read daily parquet files and save them as Delta files; this process has been running for over a year without issue. Today, while saving the Delta file, I received this error: "An error occurred while calling o5092.save. : com.fasterxml.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 29)): has to be escaped using backslash to be included in string value". The data contains some strange characters, but this has always been the case and never caused any issues. I first thought the issue was with today's parquet file, but when I tried reading older files (which used to work), I got the same error. I did not change anything in the code or the PySpark settings.

 

Does anybody have an idea what is causing this behaviour? (I would rather not filter out the problematic columns in the bronze layer.) Thank you!

1 ACCEPTED SOLUTION
v-lgarikapat
Community Support

Hi @JVL2 ,

Thanks for reaching out to the Microsoft Fabric community forum.
@burakkaragoz Thank you for your prompt response.
 

Suggested Additions:

  1. Note about performance (UDF vs built-in functions)

UDFs are convenient, but they can be slower than Spark’s native functions.

Add this note to help users scale better:

Performance Tip: If you're working with large datasets, consider replacing the UDF with Spark's built-in regexp_replace function. It's faster and leverages Catalyst optimizations.

Example (optional to include):

from pyspark.sql.functions import regexp_replace, col

# Strip ASCII control characters from every string column
for c, t in df.dtypes:
    if t == 'string':
        df = df.withColumn(c, regexp_replace(col(c), r'[\x00-\x1F\x7F]', ''))

  2. Optional: Scan for control characters before cleaning

Give users a way to identify where the problem exists before cleaning everything.

Add this snippet as an optional diagnostic:

from pyspark.sql.functions import col

# Identify rows with control characters
control_char_pattern = r'[\x00-\x1F\x7F]'
for col_name, dtype in df.dtypes:
    if dtype == 'string':
        count = df.filter(col(col_name).rlike(control_char_pattern)).count()
        if count > 0:
            print(f"Column '{col_name}' has {count} rows with control characters.")

  3. Note about downstream impact

Mention that even if Spark accepts bad characters now, downstream tools (like Power BI, APIs, consumers reading Delta/Parquet) might break.

 It's a good practice to sanitize these characters at the bronze/silver layer to avoid silent corruption or downstream ingestion errors.

 

  4. Mention reproducibility

It helps to mention that users should log their Spark/Delta version, since future changes in JSON behavior could cause similar issues again.

 Tip: Log your Spark and Delta versions (spark.version, delta.__version__) to help with future debugging if behaviors change again.
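For example, a minimal sketch of such version logging in a notebook cell (assuming the delta-spark package is importable; the importlib.metadata fallback is an assumption for environments where delta.__version__ is not exposed):

import importlib.metadata

# Record the runtime versions alongside the pipeline run
print(f"Spark version: {spark.version}")

try:
    import delta
    print(f"Delta Lake version: {delta.__version__}")
except (ImportError, AttributeError):
    # Fallback: read the installed package metadata instead
    print(f"delta-spark package: {importlib.metadata.version('delta-spark')}")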

 

If this post helped resolve your issue, please consider giving it Kudos and marking it as the Accepted Solution. This not only acknowledges the support provided but also helps other community members find relevant solutions more easily.

We appreciate your engagement and thank you for being an active part of the community.

Best Regards,

Lakshmi Narayana.




3 REPLIES
JVL2
Frequent Visitor

Thank you for the feedback! I will indeed add some logging of the Delta and PySpark versions used, and avoiding UDFs for performance reasons is a good tip; I did not know that. Thanks!

burakkaragoz
Community Champion

Hi @JVL2 ,

 

We ran into a similar issue recently. In our case, it was caused by control characters (like ASCII 29) sneaking into string fields. Even if they were there before, something might have changed in the underlying Spark/Delta version or in how JSON serialization is handled.

Here’s what helped us:


1. Clean control characters before writing
We added a small UDF to strip out non-printable characters:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def clean_str(s):
    # Remove ASCII control characters (0x00-0x1F and 0x7F)
    if s:
        return re.sub(r'[\x00-\x1F\x7F]', '', s)
    return s

clean_udf = udf(clean_str, StringType())

df_clean = df.select([clean_udf(c).alias(c) if t == 'string' else c for c, t in df.dtypes])

2. Check Spark/Delta version
If you recently updated your runtime (even silently), the JSON parser behavior might have changed. Worth checking.

3. Try writing with mode='append' or overwriteSchema
Sometimes schema evolution or write mode triggers weird serialization paths.
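For instance, a minimal sketch of those two write paths (the table path is a placeholder; note that overwriteSchema only takes effect together with overwrite mode):

# Append new rows without touching the existing schema
df_clean.write.format("delta").mode("append").save("Tables/my_table")  # placeholder path

# Or overwrite the data and allow the schema to be replaced
(df_clean.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # only honored with overwrite mode
    .save("Tables/my_table"))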


Let me know if you want a quick script to scan for control chars in your dataframe.

If my response resolved your query, kindly mark it as the Accepted Solution to assist others. Additionally, I would be grateful for a 'Kudos' if you found my response helpful.
