Hello,
I have created a star schema using the TPCH10 benchmark dataset in two separate Power BI datasets: one uses import mode and one uses direct lake mode. The model is below.
Fact_orders has ~60M records and Fact_orders[l_extendedprice] is a decimal column. Two separate thin reports have been created in PBI Desktop, one connecting to each model in the PBI service. When Fact_orders[l_extendedprice] is summed via a card visual in a report based on the import mode dataset, the visual renders successfully. When the same visual is created in a report based on the Direct Lake dataset, the visual fails to render with the following message:
"Unexpected parquet exception occurred. Class: 'ParquetException' Status: 'Unexpected end of stream'"
The same error appears in other scenarios when using the Direct Lake mode dataset. For example, if a card visual is used to count the rows in the fact_orders table, the visual successfully renders the correct value (~60M). However, if a filter is added to the report using dim_customer[c_mktsegment], the visual fails to render with the same error. Below is a screenshot of the error when the filter query is evaluated in TE3.
Please let me know if you need additional information.
Thanks!
This issue has been resolved. My dim and fact tables had skewed data which was causing the issue (somehow). Running OPTIMIZE on each of the tables and then refreshing my PBI dataset allowed the measures to calculate successfully.
One thing that is unclear to me: I explicitly enabled OptimizeWrite (and it's enabled by default https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparks...) so why were explicit OPTIMIZE commands required as well? I assumed OptimizeWrite would result in...optimized writes...
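For anyone hitting the same thing, the compaction step was roughly the following (run from a Fabric Spark notebook; the table list is illustrative and based on the model above, so adjust it to your lakehouse):

```python
# Compact the Delta tables behind the Direct Lake dataset, then refresh the
# Power BI dataset from the service. Table names are taken from the model above.
tables = ["fact_orders", "dim_customer"]  # add the remaining dimension tables as needed

for t in tables:
    # OPTIMIZE rewrites skewed/small files into larger, evenly sized ones
    # (V-Ordered on Fabric). OptimizeWrite only shapes files as they are
    # written, so files that already exist still need an explicit OPTIMIZE.
    spark.sql(f"OPTIMIZE {t}")
```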
There is a known bug for this "Unexpected end of stream" error - it can happen in some relatively uncommon Parquet column layouts. A fix for this should be rolling out around next week (hopefully by Sep 8th). Please give it a try after that...
Thanks for the info. I was able to resolve the issue today by removing two high cardinality comment columns. Neither are involved in the query that fails, but their removal solved the issue. They are both strings.
Do you think they meet the criteria that the bug will fix?
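For context, dropping them looked roughly like this (a sketch only; the table and column names are assumptions based on the TPC-H comment fields, so substitute the actual columns):

```python
# Drop the two high-cardinality string columns and rewrite the Delta table.
df = spark.read.table("fact_orders")
df = df.drop("l_comment", "o_comment")  # assumed column names (TPC-H comment fields)

(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")  # allow the schema to shrink
   .saveAsTable("fact_orders"))
```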
I wouldn't have expected string columns to cause this - I think it was related to specific encodings. But it may be that there is a consequence for other columns that results from removing these two columns.
Note that this issue occurs during an analysis phase, so it wouldn't matter which columns the query accesses.
Here is the notebook code if you want to try to reproduce...the data is publicly available. Let me know if there's a better way to share the notebook; I just copy/pasted each cell into the file. The bottom of the file also has some info about the custom PBI dataset model.
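Not the full notebook, but the gist of the load cells was along these lines (a sketch only; the landing path and Parquet source format are assumptions):

```python
# Read the TPC-H SF10 source files from the lakehouse Files area and save them
# as Delta tables. The landing folder and source format are assumptions.
src = "Files/tpch10"

orders   = spark.read.parquet(f"{src}/orders")
lineitem = spark.read.parquet(f"{src}/lineitem")
customer = spark.read.parquet(f"{src}/customer")

# Fact table at line-item grain (~60M rows at SF10), with customer as a dimension
fact_orders = lineitem.join(orders, lineitem["l_orderkey"] == orders["o_orderkey"])

fact_orders.write.format("delta").mode("overwrite").saveAsTable("fact_orders")
customer.write.format("delta").mode("overwrite").saveAsTable("dim_customer")
```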
I fixed the issue by changing the Databricks runtime version to 11.3 LTS, with Spark version 3.3.0, which is the same version that Fabric uses. After that I deleted the folder with the delta table in the storage account. Finally, I recreated the delta table and the issue was resolved. Hopefully MS comes up with a better solution.
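Roughly, the recreate step was along these lines (a sketch; the target path and source are placeholders):

```python
# Run on a Databricks 11.3 LTS (Spark 3.3.0) cluster. After deleting the old
# Delta folder in the storage account, rewrite the table so the Parquet/Delta
# files come from the matching runtime. Path and source are placeholders.
target_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/delta/my_table"

df = spark.table("source_data")  # placeholder for wherever the data is rebuilt from
(df.write
   .format("delta")
   .mode("overwrite")
   .save(target_path))
```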
Got this error as well when testing out Fabric's new Direct Lake PBI connection mode.
I load a table from Azure Data Lake Gen2 via a shortcut to a lakehouse in Microsoft Fabric, then create a dataset based on this lakehouse. The table I'm having trouble with is the largest in my model and has 75,000,000 rows.
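A quick way to sanity-check the shortcut-backed table from a Fabric notebook before building the dataset on it (the table name here is a placeholder):

```python
# Count rows of the shortcut-backed Delta table from a Fabric notebook;
# the table name is a placeholder.
rows = spark.read.table("big_table").count()
print(f"{rows:,} rows")  # ~75,000,000 expected here
```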
No update from my end. The docs do mention that some column types are not supported but I haven't found further detail.
Hello,
Do you have any update on this topic? I'm having the same issue with Direct Lake.
Thanks
We are having the same issue. I've created a table in Databricks and tried to bring it in via OneLake to Fabric, but I'm getting this same error. I can see the table just fine when looking at it from the Lakehouse view, but as soon as I try to build a report on it, this error appears.
Same situation for me. I created a delta table in Databricks, and while the SQL endpoint can read the data properly, pulling it with DirectLake/Power BI gives this exception. It appears to be related to specific data types--either floats/decimals or timestamps.
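If it helps with narrowing it down, listing the suspect column types from a notebook is quick (the table name is a placeholder):

```python
# List the decimal and timestamp columns of the Delta table to see which
# fields might be involved. Table name is a placeholder.
df = spark.read.table("my_table")
suspects = [(f.name, f.dataType.simpleString())
            for f in df.schema.fields
            if f.dataType.simpleString().startswith(("decimal", "timestamp"))]
print(suspects)
```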
I was wondering if you were able to resolve this, j_hoolachan. I am having the same issue and reported error for visuals using decimal columns in some tables. It is occurring on a Direct Lake mode query against the default dataset for a lakehouse.
Thanks