March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Hi,
I'm trying to set up a Fabric pipeline to consume the Azure Open Datasets NYC Taxi & Limousine yellow dataset (NYC Taxi and Limousine yellow dataset - Azure Open Datasets | Microsoft Learn). For that I created a connection to the blob storage account named azureopendatastorage, as in the samples. The NYC Yellow Cab data is partitioned, so I created a source in my copy activity like this:
When I click "Preview Data" I see a correct sample, and the partitions are also detected correctly: I see additional fields in the dataset reflecting them.
However, when I run the pipeline I receive the following error:
Error details are:
{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=PartitionDiscoveryWithInvalidFolderPath,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Source file path 'nyctlc/yellow/puYear=2010/puMonth=8/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-25.c000.snappy.parquet' invalid when processing partition discovery. Please check the folder pathes.,Source=Microsoft.DataTransfer.ClientLibrary,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}
I checked the data using a PySpark notebook in Synapse and it works fine...
Any idea what I should check?
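For context, partition discovery derives the puYear/puMonth columns from the Hive-style key=value folder names in the blob path, which is why they showed up as extra fields in the preview. A minimal pure-Python sketch of that parsing (an illustration of the convention, not the Fabric implementation), using the file path quoted in the error above:

```python
import re
from pathlib import PurePosixPath

def hive_partitions(path: str) -> dict:
    """Collect key=value partition folders from a blob path,
    the way Hive-style partition discovery derives columns."""
    found = {}
    for segment in PurePosixPath(path).parts:
        match = re.fullmatch(r"([^=/]+)=([^=/]+)", segment)
        if match:
            found[match.group(1)] = match.group(2)
    return found

# The source file path quoted in the error message:
path = ("nyctlc/yellow/puYear=2010/puMonth=8/"
        "part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11"
        "-b9d4-fa74bfbd47bc-426339-25.c000.snappy.parquet")
print(hive_partitions(path))  # -> {'puYear': '2010', 'puMonth': '8'}
```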
Thanks,
Thomas
Hi,
sorry for the late reply (I was off for a few days) and thanks for your feedback...
I saw the option to get the NYC Taxi data as a sample dataset, but that's too easy for me 😉 and I wanted to use the "big" dataset containing >50 GB of data...
I tried what you suggested. This removes puYear and puMonth from the dataset, so I can't use them anymore to partition the result in OneLake. I skipped partitioning for now...
Indeed, this was now successful:
Interesting to see that data compression seems to be higher at the destination...
The question remains why my approach didn't work... It might be a bug, I think...
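One way to keep partitioning the OneLake output even after the puYear/puMonth folder columns are dropped is to re-derive them from the pickup timestamp. A minimal sketch (the column name tpepPickupDateTime is an assumption based on the Azure Open Datasets yellow taxi schema):

```python
from datetime import datetime

def partition_keys(pickup_ts: str):
    """Re-derive (puYear, puMonth) from a pickup timestamp string,
    e.g. a value from the assumed tpepPickupDateTime column."""
    ts = datetime.fromisoformat(pickup_ts)
    return ts.year, ts.month

print(partition_keys("2010-08-15T13:45:00"))  # -> (2010, 8)
```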
Thanks,
Thomas
Couple of suggestions:
1. There is a much easier option for ingesting the NYC taxi data from Open Datasets instead of creating your own connection: in the copy activity, go to Source, choose the third option, "Sample datasets", and then choose the NYC Taxi option.
2. However, the way you are doing it should have also worked. Can you disable the partition discovery option and retry? Another way to load the entire dataset is via the configuration below:
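For reference, a copy-activity source with partition discovery scoped by an explicit partition root might look like the sketch below. This is an assumption based on the documented Azure Data Factory copy-source settings (enablePartitionDiscovery, partitionRootPath, wildcardFolderPath), not the exact configuration referenced above; the paths are illustrative:

```json
{
  "type": "ParquetSource",
  "storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFolderPath": "yellow/puYear=*/puMonth=*",
    "wildcardFileName": "*.parquet",
    "enablePartitionDiscovery": true,
    "partitionRootPath": "yellow"
  }
}
```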