
TePe
Helper IV

Reading partitioned Parquet Files with Fabric Pipelines

Hi,

I'm trying to set up a Fabric pipeline to consume the Azure Open Datasets NYC Taxi & Limousine yellow taxi data (NYC Taxi and Limousine yellow dataset - Azure Open Datasets | Microsoft Learn). For that, I created a connection to the blob storage account named azureopendatastorage, as in the samples. The NYC Yellow Cab data is partitioned, so I created a source in my copy activity like this:

TePe_0-1687469233077.png

When I click on "Preview Data" I see a correct sample and also the partitions are detected correctly, I see additional fields in the dataset reflecting them. 
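For reference, the source settings in the screenshot correspond roughly to this copy activity JSON. This is a sketch: the property names follow the standard copy activity schema, and the partitionRootPath value is my guess at what's configured, not confirmed output.

```json
{
  "source": {
    "type": "ParquetSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": true,
      "enablePartitionDiscovery": true,
      "partitionRootPath": "nyctlc/yellow"
    }
  }
}
```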

However, when I now run the pipeline, I receive the following error:

TePe_1-1687469349346.png

Error details are:

{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=PartitionDiscoveryWithInvalidFolderPath,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Source file path 'nyctlc/yellow/puYear=2010/puMonth=8/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-25.c000.snappy.parquet' invalid when processing partition discovery. Please check the folder pathes.,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}

 

I checked the data using a PySpark notebook in Synapse and it works fine...

 

Any idea what I should check?


Thanks,

Thomas

TePe
Helper IV

Hi,

 

Sorry for the late reply (I was off for a few days), and thanks for your feedback...

 

I saw the option for getting the NYC Taxi data as a sample, but that's too easy for me 😉 and I wanted to use the "big" dataset containing >50 GB of data...

 

I tried what you suggested. This removes puYear and puMonth from the dataset, so I can't use them anymore for partitioning the result in OneLake. So I skipped partitioning for now...

 

Indeed, this was now successful:

TePe_0-1688043819058.png

Interesting to see that data compression seems to be higher at the destination... 

 

The question remains why my approach didn't work... Might be a bug, I think...


Thanks,

Thomas

ajarora
Microsoft Employee

A couple of suggestions:

1. There is a much easier option for ingesting NYC taxi data from Open Datasets instead of creating your own connection: in the copy activity, go to Source, choose the third option, "Sample datasets", and then choose the NYC Taxi option.

 

ajarora_0-1687506986922.png

2. However, the way you are doing it should have also worked. Can you disable the partition discovery option and retry? Another way to load the entire dataset is via the below configuration:

Set "recursive" to false
Set "wildcardFolderPath" to "yellow/puYear=*/puMonth=*"
Set "wildcardFileName" to "*.parquet"
Set "enablePartitionDiscovery" to false
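The four settings above map onto the copy activity source JSON roughly like this (a sketch; the storeSettings wrapper and type names are assumed from the standard copy activity schema rather than quoted from a working pipeline):

```json
{
  "source": {
    "type": "ParquetSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": false,
      "wildcardFolderPath": "yellow/puYear=*/puMonth=*",
      "wildcardFileName": "*.parquet",
      "enablePartitionDiscovery": false
    }
  }
}
```

Note that with partition discovery disabled, puYear and puMonth will no longer appear as columns in the copied data; the wildcard folder path only selects which partition folders to read.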
