I have a blob container with multiple CSV files. All of the files contain 17 standard columns, but a few of them have 1-3 extra columns. A shortcut called "demo" is created in a Fabric Lakehouse under Files. When I use spark.read, it loads all the data into a DataFrame, but the extra columns really mess up the data, as values get loaded under the wrong column names.
Does PySpark have a way of recognising and matching column names when loading multiple CSV files from a folder?
Hi @davidding
Here is an example I got from ChatGPT explaining how to handle files whose columns differ.
from pyspark.sql.functions import lit

def read_and_align_csv(file_path, base_schema=None):
    # Read the CSV file
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    if base_schema is not None:
        # Add any columns that are in the base schema but missing from this file, as nulls
        for col in base_schema:
            if col not in df.columns:
                df = df.withColumn(col, lit(None).cast(base_schema[col]))
        # Record any new columns this file introduces in the base schema
        for col in df.columns:
            if col not in base_schema:
                base_schema[col] = df.schema[col].dataType
    else:
        # Initialize the base schema from the first DataFrame
        base_schema = {col: df.schema[col].dataType for col in df.columns}
    return df, base_schema

# List of CSV file paths
csv_files = ["path/to/csv1.csv", "path/to/csv2.csv", "path/to/csv3.csv"]

# Initialize base schema
base_schema = None
dataframes = []

# Read and align each CSV file
for file_path in csv_files:
    df, base_schema = read_and_align_csv(file_path, base_schema)
    dataframes.append(df)

# Combine all DataFrames. allowMissingColumns (Spark 3.1+) null-fills columns
# that earlier files lacked, so columns discovered later don't break the union.
combined_df = dataframes[0]
for df in dataframes[1:]:
    combined_df = combined_df.unionByName(df, allowMissingColumns=True)

# Show the combined DataFrame
combined_df.show()
Thanks @GilbertQ! It's hard to choose between ChatGPT and Claude Sonnet, so I am sharing the love 🙂
But I am hoping there might be a more efficient bulk-loading method that handles the extra columns. I'm also considering switching the storage to JSON going forward; at least it seems to handle bulk loading much more effectively than reading CSVs does.