
TePe
Helper IV

Reading partitioned Parquet Files with Fabric Pipelines

Hi,

I'm trying to set up a Fabric pipeline to consume the Azure Open Datasets NYC Taxi & Limousine yellow taxi data (NYC Taxi and Limousine yellow dataset - Azure Open Datasets | Microsoft Learn). For that, I created a connection to the blob storage account named azureopendatastorage, as in the samples. The NYC Yellow Cab data is partitioned, so I created a source in my copy activity like this:

TePe_0-1687469233077.png

When I click on "Preview Data" I see a correct sample and also the partitions are detected correctly, I see additional fields in the dataset reflecting them. 
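For reference, the source settings in the screenshot correspond roughly to this copy activity JSON. This is a sketch: the property names follow the standard copy activity schema, and the partitionRootPath value is my guess at what's configured, not confirmed output.

```json
{
  "source": {
    "type": "ParquetSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": true,
      "enablePartitionDiscovery": true,
      "partitionRootPath": "nyctlc/yellow"
    }
  }
}
```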

However, when I now run the pipeline, I receive the following error:

TePe_1-1687469349346.png

Error details are:

{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=PartitionDiscoveryWithInvalidFolderPath,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Source file path 'nyctlc/yellow/puYear=2010/puMonth=8/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-25.c000.snappy.parquet' invalid when processing partition discovery. Please check the folder pathes.,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}

 

I checked the data using a PySpark notebook in Synapse and it works fine...

 

Any idea what I should check?


Thanks,

Thomas

TePe
Helper IV

Hi,

 

Sorry for the late reply (I was off for a few days), and thanks for your feedback...

 

I saw the option for getting the NYC Taxi data as a sample, but that's too easy for me 😉 and I wanted to use the "big" dataset containing >50 GB of data...

 

I tried what you suggested. This removes puYear and puMonth from the dataset, so I can't use them anymore for partitioning the result in OneLake. So I skipped partitioning for now...

 

Indeed, this was now successful:

TePe_0-1688043819058.png

Interesting to see that data compression seems to be higher at the destination... 

 

The question remains why my approach didn't work... Might be a bug, I think...


Thanks,

Thomas

ajarora
Microsoft Employee

A couple of suggestions:

1. There is a much easier option for ingesting NYC taxi data from Open Datasets instead of creating your own connection: in the copy activity, go to Source, choose the third option, "Sample datasets", and then choose the NYC Taxi option.

 

ajarora_0-1687506986922.png

2. However, the way you are doing it should have also worked. Can you disable the partition discovery option and retry? Another way to load the entire dataset is via the below configuration:

Set "recursive" to false
Set "wildcardFolderPath" to "yellow/puYear=*/puMonth=*"
Set "wildcardFileName" to "*.parquet"
Set "enablePartitionDiscovery" to false
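The four settings above map onto the copy activity source JSON roughly like this (a sketch; the storeSettings wrapper and type names are assumed from the standard copy activity schema rather than quoted from a working pipeline):

```json
{
  "source": {
    "type": "ParquetSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": false,
      "wildcardFolderPath": "yellow/puYear=*/puMonth=*",
      "wildcardFileName": "*.parquet",
      "enablePartitionDiscovery": false
    }
  }
}
```

Note that with partition discovery disabled, puYear and puMonth will no longer appear as columns in the copied data; the wildcard folder path only selects which partition folders to read.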
