Hello,
Problem: Unable to ingest from Azure Blob Storage -> Lakehouse
I have a list of Parquet files in a multi-folder structure in Azure Blob Storage.
I understand that the data pipeline Copy activity does not support Parquet complex types when loading into lakehouse tables. Ref:
But I am trying to ingest them as files only, then process them and store the flattened structure in lakehouse tables.
However, I still get the error; it appears that the types are checked when reading from the source, irrespective of whether a Lakehouse table or file is chosen as the destination.
What is an appropriate way to ingest the Parquet files with complex types?
Error that I receive:
ErrorCode=UnsupportedParquetComplexType,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=,Source=,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Parquet complex types of STRUCT, LIST and MAP are not supported.
Sample Azure folder structure:
1. Container name: sampletelemery
2. Folders: year=25/month=01/day=01
There are a bunch of files in the day folder.
If I use a Dataflow, then I get the storage blob metadata in columns, where the Name column contains the Parquet file name. Do I then read just those columns and process the files one by one? Is that the way to go?
Any directions will be useful. Thanks.
Hello @surbhinijhara
Have you tried creating a notebook and then invoking it, or scheduling it as required?
One way to bring Parquet files that contain structures like LIST, MAP, or STRUCT into a lakehouse is to use a notebook with Spark, rather than a pipeline activity that enforces type checks. You can read these files, flatten or transform their complex columns, then write them into the lakehouse. For a multi-folder structure such as year=25/month=01/day=01, you can specify wildcards in your Spark read path.
Give it a try:
from pyspark.sql.functions import col, explode
# Read the Parquet files for one day's partition from the storage account
df = spark.read.parquet("abfss://sampletelemery@<yourstorageaccount>.dfs.core.windows.net/year=25/month=01/day=01/*.parquet")
# Example of flattening a nested array column
df_flat = df.withColumn("exploded_items", explode(col("someArrayColumn")))
# Continue transformations as needed...
# Write as a Delta table to the Tables section
df_flat.write.mode("overwrite").format("delta").saveAsTable("your_lakehouse_table")
# Or write as Parquet files to the Files section
df_flat.write.mode("overwrite").format("parquet").save("Files/my_parquet_folder")
Hope this helps.
Thanks
Thanks, @nilendraFabric !
So essentially it means that I cannot use Data Factory here - i.e. neither a data pipeline nor a dataflow. Instead, I write custom code in a notebook to load the data.
Below is the piece of code that I have used (similar to yours, just that I need to read from blob and not from a data lake). I did not need to use the explode function, and I could still load the raw data into lakehouse tables as well as files. Can you comment if that is fine, or do you have another input? Thanks again.
It looks great. Try to use Key Vault for secret storage.
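For example, a rough sketch of that pattern (assuming mssparkutils is available in the notebook; the vault URL, secret name, storage account, and table name below are placeholders):
# Fetch the storage account key from Azure Key Vault instead of hard-coding it
storage_key = mssparkutils.credentials.getSecret(
    "https://<your-key-vault>.vault.azure.net/", "<storage-key-secret-name>"
)
# Let the Spark session authenticate to the storage account with that key
spark.conf.set(
    "fs.azure.account.key.<yourstorageaccount>.dfs.core.windows.net", storage_key
)
# Read the raw Parquet files and load them into a lakehouse table as-is
df = spark.read.parquet(
    "abfss://sampletelemery@<yourstorageaccount>.dfs.core.windows.net/year=25/month=01/day=01/*.parquet"
)
df.write.mode("overwrite").format("delta").saveAsTable("raw_sample_telemetry")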