We've found some 'interesting' and potentially buggy behaviour when writing Delta tables in PySpark with nullable = False columns in the schema - specifically, if you write a DataFrame to a table and then re-read the table, the column comes back as nullable = True.
Use Spark SQL to create the table, however, and all is well.
Has anyone else seen this? And apart from using ALTER TABLE statements, has anyone worked around it?
Sample code below.
Using PySpark - shows the issue:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema with an explicitly non-nullable 'id' column
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])

df = spark.createDataFrame([], schema)
print(df.schema)

# Write to a Delta table, then read it back
df.write.format('delta').mode('overwrite').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)
This returns the following - note the schema difference:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])
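As a sanity check on where the nullability is lost, you can read the Delta transaction log directly and look at the schemaString recorded in the metaData action. A minimal sketch, assuming the table's first commit file is 00000000000000000000.json:

import json

# Read the first Delta commit and extract the schema recorded at write time
log = spark.read.json('Tables/pyspark_version/_delta_log/00000000000000000000.json')
meta = log.where('metaData is not null').select('metaData.schemaString').first()
print(json.loads(meta['schemaString'])['fields'])

If the log already shows "nullable": true for id, the constraint is being dropped during the write itself rather than on the subsequent read.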
Using Spark SQL - no issue:
create_statement = '''
CREATE TABLE Test.sql_version (
    id integer NOT NULL,
    name varchar(20)
)
'''
spark.sql(create_statement)
df2 = spark.read.format('delta').load('Tables/sql_version')
print(df2.schema)
This returns the following - note that False is preserved for the nullable field:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
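Our current fallback follows directly from this: create the table with Spark SQL so the NOT NULL constraint sticks, then load the data with the DataFrame writer in append mode. A sketch (Test.hybrid_version is just an illustrative name):

# Hybrid workaround sketch: SQL DDL keeps the NOT NULL constraint,
# PySpark handles the data load
spark.sql('''
    CREATE TABLE IF NOT EXISTS Test.hybrid_version (
        id integer NOT NULL,
        name varchar(20)
    )
''')

df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], schema)
df.write.format('delta').mode('append').saveAsTable('Test.hybrid_version')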
Hello @spencer_sa -
The reason PySpark ends up with nullable = true, even though you defined the schema with nullable = false, is how PySpark handles schema inference during the write operation in overwrite mode. To keep the nullable property as intended, you need to explicitly enforce the schema, which prevents PySpark from inferring a new schema from the data.
You can explicitly enforce the schema when writing the DataFrame like this:
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')
If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.
Proud to be a Microsoft Fabric Super User
Addressing both of your points, I've amended the code and am still seeing the schema change issue.
Specifically, I've re-added the sample data I removed before posting the first time, and I've added the overwriteSchema option to the write. (I also deleted the existing table to start from a clean slate.)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])

# Now with added data
data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
df = spark.createDataFrame(data, schema)
print(df.schema)

# And now overwriting the schema - theoretically
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)
This results in:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])
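For now we're falling back to the ALTER TABLE route mentioned in my original post. Per the Delta Lake docs, the statement below should re-apply the constraint (Delta validates existing rows first); I haven't verified the exact syntax on Fabric, and the table name assumes the lakehouse registered the path as pyspark_version:

# Hedged sketch: re-apply the NOT NULL constraint after the write drops it
spark.sql('ALTER TABLE pyspark_version ALTER COLUMN id SET NOT NULL')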
Hello @spencer_sa - I've done some more research on this. Based on my findings, while PySpark does let you specify the schema, when writing a Delta table to a lakehouse this way the nullability specifications in the schema are not preserved - all columns are set to nullable so the table is optimized for schema-on-read operations. Since the parameters are available and lead the user to think the schema can be explicitly specified, I think it would be good to get feedback from Microsoft so they can confirm.
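One avenue that may be worth testing is the Delta Lake builder API, which declares the table metadata (including nullability) before any data is written. I haven't confirmed whether Fabric preserves the constraint this way, and Test.builder_version is just an illustrative name:

from delta.tables import DeltaTable

# Sketch using the DeltaTableBuilder API: define columns with explicit
# nullability up front, then append data with the DataFrame writer
DeltaTable.createOrReplace(spark) \
    .tableName('Test.builder_version') \
    .addColumn('id', 'INT', nullable=False) \
    .addColumn('name', 'STRING') \
    .execute()

df.write.format('delta').mode('append').saveAsTable('Test.builder_version')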
If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.
Proud to be a Microsoft Fabric Super User
Already got a ticket in for this - can't get to it 'til I'm back from annual leave.
I probably found the same articles you have - I just generated some sample code to reproduce it.
Hello, did you solve the issue? I don't see any clear approach.
Thanks for confirming. I've raised it as a 'report a problem'.
Hi @spencer_sa,
Did you receive any response from the dev team? If they shared any root cause for this, would you mind sharing it here? I think it will help other users who face a similar scenario.
Regards,
Xiaoxin Sheng
Hi @spencer_sa,
I tried your code on my side and I can reproduce this scenario. It seems the 'write' Delta table operation modifies the nullable property of the schema's fields.
I checked the documentation but found no mention of this; perhaps you can report this issue to the dev team.
Regards,
Xiaoxin Sheng