spencer_sa
Super User

Nullable = False in schema not correctly preserved / written using pyspark

We've found some 'interesting' and potentially buggy behaviour when writing Delta tables in pyspark with nullable = False columns in the schema - specifically, if you write the dataframe to a table and then re-read the table, the column comes back as nullable = True.
Use SparkSQL to create the table, however, and all is well.
Has anyone else seen this?  And apart from using ALTER TABLE statements, has anyone found a workaround?
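
For reference, the ALTER TABLE fallback looks roughly like this - an untested sketch that assumes the table written below already exists as 'pyspark_version', and the exact SET NOT NULL syntax may vary by Delta/Spark runtime version:

# Rough, untested sketch: re-apply the NOT NULL constraint after the write.
# Assumes the path 'Tables/pyspark_version' surfaces as the table 'pyspark_version'.
spark.sql('ALTER TABLE pyspark_version ALTER COLUMN id SET NOT NULL')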

Sample code below;
Using pyspark - shows issue

 

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema with an explicitly non-nullable 'id' column
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
df = spark.createDataFrame([], schema)
print(df.schema)

# Write the (empty) dataframe as a Delta table, then read it straight back
df.write.format('delta').mode('overwrite').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)

 

returns the following - note the schema difference;

 

StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])

 


Using SparkSQL - no issue

 

create_statement = '''
CREATE TABLE Test.sql_version
(id integer NOT NULL,
name varchar(20))
'''
spark.sql(create_statement)
df2 = spark.read.format('delta').load('Tables/sql_version')
print(df2.schema)

 

returns the following - note the presence of False for the nullable field;

 

StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])

 

 


8 REPLIES
jennratten
Super User

Hello @spencer_sa -

The reason the PySpark result comes back with nullable = true, even though you defined the schema with nullable = false, is how PySpark handles the schema during the write operation in overwrite mode. To ensure the nullable property remains as intended, you need to explicitly enforce the schema, which prevents PySpark from inferring a new schema based on the data.

  • When you write a dataframe to a Delta table using PySpark with overwrite mode, PySpark infers the schema from the data being written. If the dataframe is empty or the schema is not explicitly enforced, PySpark may default to making all columns nullable when the table is read back.
  • When you create a table using SparkSQL, the schema is explicitly defined and enforced, so the nullable property is preserved as specified.

You can explicitly enforce the schema when writing the dataframe using PySpark like this:

 

df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')

 

 

If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.

Proud to be a Microsoft Fabric Super User

Addressing both of your points, I've amended the code and am still seeing the schema change issue.
Specifically, I've re-added the sample data I removed before posting this the first time, and I've added the overwriteSchema option to the write.  (I also deleted the existing table to start from a clean slate.)

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
# Now with added data
data = [{'id': 1, 'name': 'Alice'},{'id': 2, 'name': 'Bob'}]
df = spark.createDataFrame(data,schema)
print(df.schema)
# And now overwriting the schema - theoretically
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)

results in;

StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])

Hello @spencer_sa - I've done some more research on this.  Based on my findings, while PySpark does let you specify a schema, I believe that when a delta table is written to a lakehouse in this way the nullability specifications in the schema are not preserved - all columns are set to nullable so that the table is optimized for schema-on-read operations.  Since the parameters are available and lead the user to think the schema can be explicitly specified, I think it would be good to get feedback from Microsoft so they can confirm.
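
As a possible workaround (an untested sketch - the table name 'sql_first_version' is just an example), you could create the table with Spark SQL so the NOT NULL constraint is defined up front, and then load data from PySpark in append mode so the existing table schema is kept:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the table (and its NOT NULL constraint) with Spark SQL first
spark.sql('''
CREATE TABLE IF NOT EXISTS Test.sql_first_version
(id integer NOT NULL,
name varchar(20))
''')

# Append from PySpark - the existing table schema and constraint are kept
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])
data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
df = spark.createDataFrame(data, schema)
df.write.format('delta').mode('append').saveAsTable('Test.sql_first_version')

print(spark.read.table('Test.sql_first_version').schema)

Whether the constraint survives this way would still need confirming on a Fabric lakehouse.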

If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.

Proud to be a Microsoft Fabric Super User

Already got a ticket in for this - can't get to it 'til I'm back from annual leave.
I probably found the same articles you have - I just generated some sample code to reproduce it.

Hello, did you solve the issue? I don't see any clear approach.

spencer_sa
Super User

Thanks for confirming.  I've raised it as a 'report a problem'.

Anonymous
Not applicable

Hi @spencer_sa,

Have you received any response from the dev team? If they shared any root cause for this, would you mind posting it here? I think it will help other users who face a similar scenario.

Regards,

Xiaoxin Sheng

Anonymous
Not applicable

Hi @spencer_sa,

I tried your code on my side and I can reproduce this scenario. It seems the Delta table write operation modifies the nullable property of the schema fields.

I've checked the documentation but couldn't find any mention of this behaviour, so perhaps you can report the issue to the dev team.
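
For anyone wanting to double-check what actually gets persisted, a quick way to inspect the nullable flags after the write (paths assumed to match the original repro):

# Inspect the persisted schema's nullable flags after the write
df2 = spark.read.format('delta').load('Tables/pyspark_version')
for field in df2.schema.fields:
    print(field.name, 'nullable =', field.nullable)
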
Regards,

Xiaoxin Sheng
