Hello,
Problem: Unable to ingest from Azure Blob Storage -> Lakehouse
I have a list of Parquet files in a multi-folder structure in Azure Blob Storage.
I understand that the data pipeline Copy activity does not support Parquet complex types when loading into lakehouse tables. Ref:
But I am trying to ingest them as files only, then process them and store the flattened structure in lakehouse tables.
However, I still get the error; it appears that the types are checked when reading from the source, irrespective of whether a Lakehouse table or file is chosen as the destination.
What is an appropriate way to ingest the Parquet files with complex types?
Error that I receive:
ErrorCode=UnsupportedParquetComplexType,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=,Source=,''Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Parquet complex types of STRUCT, LIST and MAP are not supported.
Sample Azure folder structure:
1. Container name: sampletelemery
2. Folders: year=25/month=01/day=01
There are a bunch of files in the day folder.
If I use a Dataflow, then I get the storage blob metadata in columns, where the Name column contains the Parquet file name. Do I then read just those columns and process the files one by one? Is that the way to go?
Any directions will be useful. Thanks.
Hello @surbhinijhara
Have you tried creating a notebook and then invoking it, or scheduling it as required?
One way to bring Parquet files that contain structures like LIST, MAP, or STRUCT into a lakehouse is to use a notebook with Spark, rather than a pipeline activity that enforces type checks. You can read these files, flatten or transform their complex columns, then write them into the lakehouse. For a multi-folder structure such as year=25/month=01/day=01, you can specify wildcards in your Spark read path.
Give it a try:
from pyspark.sql.functions import col, explode
# Read the Parquet files for one day's partition from the storage account
df = spark.read.parquet("abfss://sampletelemery@<yourstorageaccount>.dfs.core.windows.net/year=25/month=01/day=01/*.parquet")
# Example of flattening a nested array column
df_flat = df.withColumn("exploded_items", explode(col("someArrayColumn")))
# Continue transformations as needed...
# Write as a Delta table to the Tables section
df_flat.write.mode("overwrite").format("delta").saveAsTable("your_lakehouse_table")
# Or write as Parquet files to the Files section
df_flat.write.mode("overwrite").format("parquet").save("Files/my_parquet_folder")
Hope this helps.
Thanks
Thanks, @nilendraFabric !
So essentially it means that I cannot use Data Factory here - i.e. neither a data pipeline nor a dataflow. Instead, I write custom code in a notebook to load the data.
Below is the piece of code that I have used (similar to yours, just that I need to read from blob and not from a data lake). I did not need to use the explode function, and I could still load the raw data into lakehouse tables as well as files. Can you comment if that is fine, or do you have another input? Thanks again.
It looks great. Try to use Key Vault for secret storage.
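For example, a rough sketch of that pattern (assuming mssparkutils is available in the notebook; the vault URL, secret name, storage account, and table name below are placeholders):
# Fetch the storage account key from Azure Key Vault instead of hard-coding it
storage_key = mssparkutils.credentials.getSecret(
    "https://<your-key-vault>.vault.azure.net/", "<storage-key-secret-name>"
)
# Let the Spark session authenticate to the storage account with that key
spark.conf.set(
    "fs.azure.account.key.<yourstorageaccount>.dfs.core.windows.net", storage_key
)
# Read the raw Parquet files and load them into a lakehouse table as-is
df = spark.read.parquet(
    "abfss://sampletelemery@<yourstorageaccount>.dfs.core.windows.net/year=25/month=01/day=01/*.parquet"
)
df.write.mode("overwrite").format("delta").saveAsTable("raw_sample_telemetry")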