I'm using a medallion architecture where one pipeline ingests data into the Bronze layer as parquet files and another pipeline validates data types and copies the parquet files into the Silver layer as a table.
Right now I'm quite confused about how to manage data type checks in the pipeline. It seems simple, but I'm not sure how I should be doing data type checks going from a parquet file (no data types), to a PySpark DataFrame, and then to a Fabric table.
My Silver pipeline uses a PySpark notebook to read the parquet file into a PySpark DataFrame and then casts columns to specific PySpark data types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html). However, once the PySpark DataFrame is written to a Fabric table, the data types become Fabric data types (https://learn.microsoft.com/en-us/fabric/data-warehouse/data-types). So far I've been casting the columns to PySpark data types based on what they will convert to when they reach Fabric.
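For illustration, the current Silver notebook flow looks roughly like this (paths, table, and column names are just placeholders):

    from pyspark.sql.functions import col
    from pyspark.sql.types import IntegerType, StringType, TimestampType

    # 'spark' is the session predefined in a Fabric notebook
    df = spark.read.parquet("Files/bronze/my_dataset")

    # Cast to the PySpark types that should map to the Fabric types I want
    df = (df
          .withColumn("customer_id", col("customer_id").cast(IntegerType()))
          .withColumn("customer_name", col("customer_name").cast(StringType()))
          .withColumn("created_at", col("created_at").cast(TimestampType())))

    # Write to the Silver table; the columns then surface with Fabric data types
    df.write.format("delta").mode("overwrite").saveAsTable("silver_my_dataset")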
Is this the best practice? The data type checks would rely on the conversions between PySpark and Fabric, which could change. They also rely on the Fabric data type the column had when it was first copied into Silver, so if I intentionally change a data type in the notebook due to business rules, that change would carry through into the Fabric table.
Is there a better way to manage data types for Silver in Fabric?
Hi @BriefStop,
A few ways to handle type conversion:
Notebook:
1. Keep a mapping sheet of the data types between PySpark and Fabric.
2. Create the table up front with the proper data types required in Fabric.
3. Define your schema and read the parquet files, casting only where required, or read the parquet files with schema inference (see the sketch after this list).
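A minimal sketch of options 2 and 3 in a notebook, assuming a Fabric lakehouse (table, path, and column names are illustrative):

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DecimalType, TimestampType
    from pyspark.sql.functions import col

    # Option 3: define the schema explicitly instead of inferring it
    expected_schema = StructType([
        StructField("order_id", IntegerType(), False),
        StructField("customer_name", StringType(), True),
        StructField("amount", DecimalType(18, 2), True),
        StructField("order_ts", TimestampType(), True),
    ])

    df = spark.read.schema(expected_schema).parquet("Files/bronze/orders")

    # Cast only where the target type differs from the source type
    df = df.withColumn("amount", col("amount").cast(DecimalType(19, 4)))

    # Option 2: the Silver Delta table was created beforehand with the required types,
    # so appended data must conform to that schema
    df.write.format("delta").mode("append").saveAsTable("silver_orders")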
Copy Activity:
1. If no transformation is required, you can use a Copy activity; under Mapping you can see the type conversion settings for the datetime data type.
2. For other types, if you want to convert, use Import schemas and change the data type for the destination table.
Regards,
Srisakthi
Hi @BriefStop
Here are some practices that may help you:
Maintain a mapping table between PySpark data types and Fabric data types. This can be used as a reference for conversion to ensure that you are converting to the right type.
Suppose you have data type mappings like the following (illustrative, based on common Spark-to-Fabric conversions; confirm against the Fabric data types documentation):
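    # Illustrative PySpark-to-Fabric type mapping; verify against the Fabric data types documentation
    TYPE_MAPPING = {
        "StringType": "varchar",
        "IntegerType": "int",
        "LongType": "bigint",
        "DoubleType": "float",
        "BooleanType": "bit",
        "DateType": "date",
        "TimestampType": "datetime2",
        "DecimalType(p,s)": "decimal(p,s)",
    }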
Before writing data to the Fabric table, implement validation functions in the PySpark notebook to check that the DataFrame columns match the expected data types based on the mapping. This can help you catch any inconsistencies early.
You can write a simple validation function to check the column data type in the DataFrame. For example,
    from pyspark.sql.types import StringType, IntegerType, FloatType

    def validate_data_types(df):
        # Expected PySpark types per column (adjust to your own schema)
        expected_types = {
            'name': StringType(),
            'age': IntegerType(),
            'salary': FloatType()
        }
        for column, expected_type in expected_types.items():
            actual_type = df.schema[column].dataType
            if actual_type != expected_type:
                raise ValueError(f"Column '{column}' has type '{actual_type}' but expected '{expected_type}'")

    validate_data_types(my_dataframe)
If you need to support schema evolution in Fabric, you can use the following code to handle possible schema changes,
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("DataTypeExample").getOrCreate()

    df = spark.read.parquet("path/to/parquet")

    # Intentional business-rule cast: store age as a string
    df = df.withColumn("age", df["age"].cast(StringType()))

    # Fabric lakehouse tables are Delta tables; overwriteSchema lets the type change be applied
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("path/to/fabric_table")
You can use a simple script to automate data quality checks and ensure that data meets expectations before being written to Fabric tables,
    def check_data_quality(df):
        # Example rule: 'age' must not contain nulls before writing to Silver
        if df.filter(df.age.isNull()).count() > 0:
            raise ValueError("Data quality check failed: 'age' column contains null values.")

    check_data_quality(my_dataframe)
These examples may help you understand how to manage data types in the Silver pipeline. Ensuring validation and mapping at each step can help you reduce potential errors and improve data quality.
Regards,
Nono Chen
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.