Hello people,
I have a requirement to load data from a CSV file in the Lakehouse into a table in the Warehouse. To do that, I created a structure similar to the one below in the Warehouse.
Hi pavannarani,
When data is loaded into Lakehouse tables, it is stored in the Delta Lake format, which is based on Parquet for optimised storage and querying. Utilising a standardised format such as Parquet ensures that all engines within the Fabric ecosystem can access and process the same dataset without the need for data duplication.
Parquet is a columnar storage format that is highly efficient for analytical queries; however, it may interpret data types differently from traditional CSV files. This discrepancy might be causing the DeviceID to be interpreted as a string instead of an integer, thereby leading to the compatibility error during the loading process.
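For example, you can confirm how the Lakehouse has typed the column from a notebook. The table name below is a placeholder; in a Fabric notebook the `spark` session is already available:

```python
# Hypothetical table name -- substitute the Lakehouse table you created from the CSV.
spark.table("devices_raw").printSchema()
# If DeviceID is reported as 'string', the underlying Parquet files store it as
# BYTE_ARRAY / UTF8, which cannot be loaded into an INT column in the Warehouse.
```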
Instead of employing staging tables, it is recommended to utilise Dataflows (Gen2) to transform the data during the loading process. A Dataflow can be designed to read data from the Lakehouse, perform the necessary transformations, and load it directly into the Warehouse without the need for intermediate staging.
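If you prefer a notebook over a Dataflow, the same type correction can be sketched as follows; the table names are hypothetical, and the cleaned table still needs to be copied into the Warehouse (for example through a pipeline Copy activity):

```python
from pyspark.sql.functions import col

# Hypothetical names: 'devices_raw' is the table created from the CSV,
# 'devices_typed' holds the same data with types matching the Warehouse DDL.
typed = spark.table("devices_raw").withColumn("DeviceID", col("DeviceID").cast("int"))

typed.write.mode("overwrite").format("delta").saveAsTable("devices_typed")
# DeviceID is now stored as a Parquet INT32, so a subsequent load into the
# Warehouse INT column should no longer raise the type-mismatch error.
```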
Microsoft frequently updates its services, and it is possible that future releases will offer more seamless integration and improved options for loading data between the Lakehouse and Warehouse.
We also recommend providing feedback to Microsoft regarding your specific requirements. Microsoft actively considers user feedback while designing its product roadmap. You may submit your feature requests or feedback through the following link: Home
If you find the response helpful, we kindly request you to mark it as the accepted solution and provide kudos. This will greatly assist other community members encountering similar challenges.
Thank you.
@v-pnaroju-msft Thank you so much for your response.
I am glad there is a way to load data into the Warehouse. But let's say I have raw data in the Lakehouse as CSV files. To load them into the Warehouse, I need to perform an extra step to load them into staging tables. I am sure this approach will work, but I want to know whether the Lakehouse behaves like this by default: whatever file we put in the Lakehouse, does it treat it as a Parquet file? If so, is there a reason why it works this way?
Is there any other way to load data directly into the Warehouse without the staging step? I thought of using notebooks, but even with notebooks we can't load data into the Warehouse; and even if we load into Lakehouse tables, the structure will be different for every CSV file, so it isn't possible to automate the solution.
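(As an aside, differing structures per file do not necessarily block automation in a notebook: Spark can infer a schema for each CSV and create one Lakehouse table per file. A rough sketch only, assuming a hypothetical Files/raw folder:)

```python
import os
from notebookutils import mssparkutils  # preinstalled in Fabric notebooks

# Folder path is hypothetical -- point it at wherever the raw CSV files land.
csv_files = [f.name for f in mssparkutils.fs.ls("Files/raw") if f.name.endswith(".csv")]

for name in csv_files:
    table_name = os.path.splitext(name)[0]          # one table per CSV file
    (spark.read
          .option("header", "true")
          .option("inferSchema", "true")            # numeric columns become int/double
          .csv(f"Files/raw/{name}")
          .write.mode("overwrite")
          .format("delta")
          .saveAsTable(table_name))
```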
Is this planned in the future roadmap?
Hi @pavannarani,
We appreciate your query through the Microsoft Fabric Community Forum.
Based on the details shared, the error message indicates that the DeviceID column of type INT is incompatible with the external data type Parquet physical type: BYTE_ARRAY, logical type: UTF8. This issue arises because the Parquet format utilised in Lakehouse is interpreting the DeviceID as a string (UTF8), which is not directly compatible with the INT type in the Warehouse table.
Please find below the steps to troubleshoot the issue:
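As one illustrative approach (a sketch only, with hypothetical file, schema, and table names), the CSV can be re-loaded into the Lakehouse with an explicit schema so that DeviceID is stored as a Parquet INT32 rather than BYTE_ARRAY/UTF8, after which loading into the Warehouse INT column should succeed:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema and paths -- adjust to match your Warehouse table definition.
schema = StructType([
    StructField("DeviceID", IntegerType(), True),
    StructField("DeviceName", StringType(), True),
])

(spark.read
      .option("header", "true")
      .schema(schema)                    # force DeviceID to int instead of string
      .csv("Files/raw/devices.csv")
      .write.mode("overwrite")
      .format("delta")
      .saveAsTable("devices_staging"))   # the Warehouse load should now succeed
```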
If you find the response helpful, we kindly request you to mark it as the accepted solution and provide kudos. This will greatly assist other community members encountering similar challenges.
Thank you.