We've found some 'interesting' and potentially buggy behaviour when writing Delta tables in PySpark with nullable = False columns in the schema - specifically, if you write a DataFrame to a table and then re-read the table, the column comes back as nullable = True.
Use Spark SQL to create the table, however, and all is well.
Has anyone else seen this? And apart from using ALTER TABLE statements, has anyone worked around it?
Sample code below.
Using PySpark - shows the issue:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema with an explicitly non-nullable 'id' column
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])

df = spark.createDataFrame([], schema)
print(df.schema)

# Write to a Delta table, then read it back
df.write.format('delta').mode('overwrite').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)
This returns the following - note the schema difference:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])
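As a sanity check on where the nullability is lost, you can read the Delta transaction log directly and look at the schemaString recorded in the metaData action. A minimal sketch, assuming the table's first commit file is 00000000000000000000.json:

import json

# Read the first Delta commit and extract the schema recorded at write time
log = spark.read.json('Tables/pyspark_version/_delta_log/00000000000000000000.json')
meta = log.where('metaData is not null').select('metaData.schemaString').first()
print(json.loads(meta['schemaString'])['fields'])

If the log already shows "nullable": true for id, the constraint is being dropped during the write itself rather than on the subsequent read.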
Using Spark SQL - no issue:
create_statement = '''
CREATE TABLE Test.sql_version (
    id integer NOT NULL,
    name varchar(20)
)
'''
spark.sql(create_statement)
df2 = spark.read.format('delta').load('Tables/sql_version')
print(df2.schema)
This returns the following - note that False is preserved for the nullable field:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
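Our current fallback follows directly from this: create the table with Spark SQL so the NOT NULL constraint sticks, then load the data with the DataFrame writer in append mode. A sketch (Test.hybrid_version is just an illustrative name):

# Hybrid workaround sketch: SQL DDL keeps the NOT NULL constraint,
# PySpark handles the data load
spark.sql('''
    CREATE TABLE IF NOT EXISTS Test.hybrid_version (
        id integer NOT NULL,
        name varchar(20)
    )
''')

df = spark.createDataFrame([(1, 'Alice'), (2, 'Bob')], schema)
df.write.format('delta').mode('append').saveAsTable('Test.hybrid_version')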
Hello @spencer_sa -
The reason PySpark ends up with nullable = true, even though you defined the schema with nullable = false, is how PySpark handles schema inference during the write operation in overwrite mode. To keep the nullable property as intended, you need to explicitly enforce the schema, which prevents PySpark from inferring a new schema from the data.
You can explicitly enforce the schema when writing the DataFrame like this:
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')
If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.
Proud to be a Microsoft Fabric Super User
Addressing both of your points, I've amended the code and am still seeing the schema change issue.
Specifically, I've re-added the sample data I removed before posting the first time, and I've added the overwriteSchema option to the write. (I also deleted the existing table to start from a clean slate.)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True)
])

# Now with added data
data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
df = spark.createDataFrame(data, schema)
print(df.schema)

# And now overwriting the schema - theoretically
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save('Tables/pyspark_version')
df2 = spark.read.format('delta').load('Tables/pyspark_version')
print(df2.schema)
This results in:
StructType([StructField('id', IntegerType(), False), StructField('name', StringType(), True)])
StructType([StructField('id', IntegerType(), True), StructField('name', StringType(), True)])
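For now we're falling back to the ALTER TABLE route mentioned in my original post. Per the Delta Lake docs, the statement below should re-apply the constraint (Delta validates existing rows first); I haven't verified the exact syntax on Fabric, and the table name assumes the lakehouse registered the path as pyspark_version:

# Hedged sketch: re-apply the NOT NULL constraint after the write drops it
spark.sql('ALTER TABLE pyspark_version ALTER COLUMN id SET NOT NULL')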
Hello @spencer_sa - I've done some more research on this. Based on my findings, while PySpark does let you specify the schema, when writing a Delta table to a lakehouse this way the nullability specifications in the schema are not preserved - all columns are set to nullable so the table is optimized for schema-on-read operations. Since the parameters are available and lead the user to think the schema can be explicitly specified, I think it would be good to get feedback from Microsoft so they can confirm.
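One avenue that may be worth testing is the Delta Lake builder API, which declares the table metadata (including nullability) before any data is written. I haven't confirmed whether Fabric preserves the constraint this way, and Test.builder_version is just an illustrative name:

from delta.tables import DeltaTable

# Sketch using the DeltaTableBuilder API: define columns with explicit
# nullability up front, then append data with the DataFrame writer
DeltaTable.createOrReplace(spark) \
    .tableName('Test.builder_version') \
    .addColumn('id', 'INT', nullable=False) \
    .addColumn('name', 'STRING') \
    .execute()

df.write.format('delta').mode('append').saveAsTable('Test.builder_version')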
If this post helps to answer your questions, please consider marking it as a solution so others can find it more quickly when faced with a similar challenge.
Proud to be a Microsoft Fabric Super User
Already got a ticket in for this - can't get to it 'til I'm back from annual leave.
I probably found the same articles you have - I just generated some sample code to reproduce it.
Hello, did you solve the issue? I don't see any clear approach.
Thanks for confirming. I've raised it as a 'report a problem'.
Hi @spencer_sa,
Did you receive any response from the dev team? If they shared any root cause for this, would you mind sharing it here? I think it will help other users who face a similar scenario.
Regards,
Xiaoxin Sheng
Hi @spencer_sa,
I tried your code on my side and I can reproduce this scenario. It seems the 'write' Delta table operation modifies the nullable property of the schema's fields.
I checked the documentation but found no mention of this; perhaps you can report this issue to the dev team.
Regards,
Xiaoxin Sheng