davidding
Resolver I

Issue in Fabric when loading data into a notebook with different column counts

I have a blob container with multiple CSV files. All of the files contain 17 standard columns, but a few of them have 1-3 extra columns. A shortcut called "demo" is created in a Fabric Lakehouse under Files. When I use spark.read, it loads all the data into a DataFrame, but the extra columns really mess up the data, as they get loaded under the wrong column names.

df = spark.read.format("csv").option("header", "true").load("Files/demo/*.csv")


Does PySpark have a way of recognising and matching the column names when loading multiple CSV files in a folder?
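For reference, a rough way to see which files carry the extra columns might be the sketch below (it assumes the mssparkutils helper that Fabric notebooks provide, and that each file's first line is its header):

# Sketch: print each CSV's header line so the files with extra columns stand out.
# Assumes mssparkutils (built into Fabric notebooks); reads each file as plain
# text and takes its first line as the header.
for f in mssparkutils.fs.ls("Files/demo"):
    if f.name.endswith(".csv"):
        header = spark.read.text(f.path).limit(1).collect()[0][0]
        print(f.name, "->", header)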

3 REPLIES
GilbertQ
Super User

Hi @davidding 

Here is an example I got from ChatGPT explaining how to handle columns of different sizes. The code below is from ChatGPT:

from pyspark.sql.functions import lit

def read_and_align_csv(file_path, base_schema=None):
    # Read the CSV file
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    
    if base_schema is not None:
        # Get the columns that are in the base schema but not in the current DataFrame
        for col in base_schema:
            if col not in df.columns:
                # Add the missing column with null values
                df = df.withColumn(col, lit(None).cast(base_schema[col]))
        
        # Record columns that are in the current DataFrame but not yet in the base schema
        for col in df.columns:
            if col not in base_schema:
                # Remember the new column's type so later files can be aligned to it
                base_schema[col] = df.schema[col].dataType

    else:
        # Initialize base schema if it's the first DataFrame
        base_schema = {col: df.schema[col].dataType for col in df.columns}

    return df, base_schema

# List of CSV file paths
csv_files = ["path/to/csv1.csv", "path/to/csv2.csv", "path/to/csv3.csv"]

# Initialize base schema
base_schema = None
dataframes = []

# Read and align each CSV file
for file_path in csv_files:
    df, base_schema = read_and_align_csv(file_path, base_schema)
    dataframes.append(df)

# Combine all DataFrames; allowMissingColumns (Spark 3.1+) fills with nulls any
# columns that an earlier file didn't have, so the union doesn't fail
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.unionByName(df, allowMissingColumns=True)

# Show combined DataFrame
combined_df.show()
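As a possible refinement: on Spark 3.1 and later (Fabric notebooks run 3.3+), unionByName can fill in the missing columns by itself, so a shorter sketch of the same idea, reusing the placeholder paths above, could be:

import functools

# Sketch: read each file separately, then union by column name, letting Spark
# fill columns a file doesn't have with nulls (requires Spark 3.1+)
csv_files = ["path/to/csv1.csv", "path/to/csv2.csv", "path/to/csv3.csv"]
dfs = [spark.read.csv(p, header=True, inferSchema=True) for p in csv_files]

combined_df = functools.reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True), dfs
)
combined_df.show()

One thing to watch with either version: inferSchema runs per file, so the same column can be inferred with different types across files and break the union; supplying an explicit schema, or casting everything to string first, avoids that.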


davidding
Resolver I

Thanks @GilbertQ! It's hard to choose between ChatGPT and Claude Sonnet, so I am sharing the love 🙂

But I am hoping there might be a more efficient bulk-loading method that will handle the extra columns. I'm also considering changing the storage to JSON going forward; at the very least, JSON seems to handle the bulk loading much more effectively than the CSV read method.
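Something like the sketch below is what I have in mind, where Files/demo_json is a hypothetical shortcut holding the re-exported JSON Lines files:

# Sketch: with JSON Lines, Spark matches fields by key rather than by position,
# so files with extra keys just produce extra columns (null for the other files).
# "Files/demo_json" is a hypothetical shortcut for the re-exported files.
df = spark.read.json("Files/demo_json/*.json")
df.show()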

GilbertQ
Super User

Hi @davidding

Yeah, 100%: JSON would probably be the best way forward, or else the approach I suggested earlier!
