Re: Discrepancy in spark.read.schema in Databricks...

anawast · ‎09-02-2024

spark.read.json offers two functionalities:

1. You can impose a schema on top of spark.read.json using spark.read.schema(schema).json(df). This ensures that only records that fit the schema are produced as rows in the resulting DataFrame.

2. You can send the records that don't fit the schema to a badRecordsPath using spark.read.option("badRecordsPath", path).schema(schema).json(df).

While this works as described above in Databricks, the same doesn't apply to Fabric.

As you can see in the code below, I have created a DataFrame of JSON records on which I want to impose a schema. The second record has a string for a value that should be of IntegerType.

1. In Databricks I get only 2 records in validRecordsTemp, as expected, and the bad record is written to the defined path.

2. In Fabric, however, I get all 3 records, with NULL as the value for col1 in record 2.
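The Fabric output described above resembles what open-source Apache Spark's default PERMISSIVE parse mode produces: a field that fails conversion becomes NULL and the row is kept. The sketch below sets the parse mode explicitly so both outcomes can be compared on any Spark runtime. It rests on the assumption that badRecordsPath is a Databricks-specific option that other runtimes may ignore, which this thread does not confirm. The names jsonDS, permissive, and dropped are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// In a notebook `spark` already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().appName("ParseModes").master("local[*]").getOrCreate()
import spark.implicits._

// Same three records as in the post; the second has a string in col1.
val jsonDS = Seq(
  """{"col1":2,"col2":3}""",
  """{"col1":"failure","col2":3}""",
  """{"col1":2,"col2":3}"""
).toDS()

val schema = StructType(Seq(
  StructField("col1", IntegerType),
  StructField("col2", IntegerType)
))

// PERMISSIVE (Spark's default): malformed fields are set to null, the row is kept.
val permissive = spark.read
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .json(jsonDS)

// DROPMALFORMED: rows that do not fit the schema are dropped.
val dropped = spark.read
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .json(jsonDS)

permissive.show()
dropped.show()
```

If the PERMISSIVE result matches what Fabric returns, setting the mode to DROPMALFORMED is one way to get only the records that fit the schema, although the rejected records are then discarded rather than written anywhere.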

Databricks result: (screenshot not available)

Fabric result: (screenshot not available)

Example Code:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._ // already in scope in notebooks; needed for toDF and as[String]

// Unique folder for the records that fail the schema check
val basePath = "DqSchemacheck"
val uuid = java.util.UUID.randomUUID.toString
val baseDqFolder = "/tmp/" + basePath + "/" + uuid

// Three JSON records; the second has a string where col1 should be an integer
val df = Seq(
  "{\"col1\":2,\"col2\":3}",
  "{\"col1\":\"failure\",\"col2\":3}",
  "{\"col1\":2,\"col2\":3}"
).toDF("body")

// Convert to Dataset[String]; as[String] avoids the [ ] wrapper that Row.toString adds
val dfStringDS = df.select(col("body")).as[String]

val schema = StructType(Seq(StructField("col1", IntegerType), StructField("col2", IntegerType)))

// Impose the schema and route non-conforming records to baseDqFolder
val validRecordsTemp = spark.read.option("badRecordsPath", baseDqFolder).schema(schema).json(dfStringDS)

display(validRecordsTemp)

Anonymous · ‎09-02-2024

Hi @anawast,

I suspect this is related to the internal processing: the invalid value is recognized and converted to a default value. If you do not want such records in the result, you may need to filter them out before they are loaded into the data frame.
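One way to do that filtering with standard Spark options is to capture the raw text of each non-conforming record in a corrupt-record column and split on it. This is a minimal sketch, not a confirmed Fabric recommendation: the `_corrupt_record` column name is Spark's default, while badRecordsDir, validRecords, and badRecords are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// In a notebook `spark` already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().appName("DqSchemaCheck").master("local[*]").getOrCreate()
import spark.implicits._

val jsonDS = Seq(
  """{"col1":2,"col2":3}""",
  """{"col1":"failure","col2":3}""",
  """{"col1":2,"col2":3}"""
).toDS()

// The target schema plus a string column that receives the raw text
// of any record that cannot be parsed against the schema.
val schemaWithCorrupt = StructType(Seq(
  StructField("col1", IntegerType),
  StructField("col2", IntegerType),
  StructField("_corrupt_record", StringType)
))

// Cache so both filters below reuse one parse of the input.
val parsed = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .json(jsonDS)
  .cache()

// Records that fit the schema have no corrupt-record text.
val validRecords = parsed.filter($"_corrupt_record".isNull).drop("_corrupt_record")

// Records that failed keep their original JSON text.
val badRecords = parsed.filter($"_corrupt_record".isNotNull).select("_corrupt_record")

// Hypothetical output folder, standing in for badRecordsPath.
val badRecordsDir = "/tmp/DqSchemacheck/bad_records"
badRecords.write.mode("overwrite").text(badRecordsDir)

validRecords.show()
```

Unlike badRecordsPath, this writes only the raw record text; any extra metadata about the failure would have to be added as additional columns before writing.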

Regards,

Xiaoxin Sheng

Discrepancy in spark.read.schema in Databricks and Azure Fabric.
