I've been having an issue optimizing my scripts in Fabric notebooks. I've been trying to control V-Order in my data loading processes. According to the documentation and its examples, it should be possible to control Parquet V-Order at the DataFrame level using the parquet.vorder.enabled option, so that when V-Order is disabled at the session level I can still enable it for an individual write with the additional syntax. I am following this article.
However, my attempts have been unsuccessful. I've written a script that writes data to Delta format with V-Order alternately enabled and disabled, but within a single session both writes seem to run with V-Order disabled.
Here are the relevant sections of the script:
We've also looked through the delta logs for both outputs, in the Tables folder and in the Files folder, but found no trace of the VOrder tag, which is puzzling.
This behaviour contradicts the following example from the official documentation, which suggests this should work when V-Order is set to false at the session level:
df_source.write\
  .format("delta")\
  .mode("overwrite")\
  .option("replaceWhere","start_date >= '2017-01-01' AND end_date <= '2017-01-31'")\
  .option("parquet.vorder.enabled","true")\
  .saveAsTable("myschema.mytable")
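The experiment described above (session-level V-Order off, write-level V-Order toggled per write) can be sketched roughly as follows. This is only an illustration, not the script from the original post. The session key `spark.sql.parquet.vorder.enabled` is assumed from the linked article as it read at the time of this thread and should be checked against the current documentation; the helper names `set_session_vorder` and `write_with_vorder` are hypothetical.

```python
# Assumption: session-level key as named in the linked article at the time
# of this thread; verify it against the current Fabric documentation.
SESSION_VORDER_KEY = "spark.sql.parquet.vorder.enabled"

# Write-level option name taken from the documentation example above
# (note: no trailing space inside the key).
WRITE_VORDER_OPTION = "parquet.vorder.enabled"


def set_session_vorder(spark, enabled):
    """Hypothetical helper: turn V-Order on or off for the whole Spark session."""
    spark.conf.set(SESSION_VORDER_KEY, "true" if enabled else "false")


def write_with_vorder(df, table_name, enabled):
    """Hypothetical helper: overwrite a Delta table, requesting V-Order
    for this single write only."""
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .option(WRITE_VORDER_OPTION, "true" if enabled else "false")
        .saveAsTable(table_name)
    )
```

In a notebook this would be used as `set_session_vorder(spark, False)`, followed by `write_with_vorder(df_source, "myschema.mytable_on", True)` and `write_with_vorder(df_source, "myschema.mytable_off", False)` (both table names are placeholders), so that the two resulting tables can be compared.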
Given the confusion, I thought I could use some community guidance. Are there any additional configurations or prerequisites for disabling V-Order at the session level and controlling it manually when writing, or is there anything else I may have missed?
@DennesTorres Thanks for confirming.
I am talking to someone from MS who is now looking into this.
@DennesTorres To me it seems like a bug. When tables are not optimized at the session level but are optimized at the DataFrame write level (by adding "parquet.vorder.enabled","true"), there is still no sign in the metadata or logs that the table has been optimized, not even at the time of its first creation.
I checked the _delta_log files, the metadata, and the table properties.
None of the three methods shows any V-Order. I am not sure whether any other method is available; if not, it sounds like a bug to me.
Hi,
I tried your script. It will help me a lot for other purposes, and I can confirm your result: if the optimization doesn't happen at the session level, there is no sign of any metadata pointing to the optimization.
To be absolutely sure about the result, I ran a test at the file level as well. The original article mentions that only the parquet files affected by the write operation would be optimized.
But even at the file level, there is no metadata pointing to any optimization.
The script I tried, a variation of yours, is located below. What's the link to the issue you registered?
Hi,
Bug or missing feature, I don't know. When the optimization is set at the session level, it's strange to me that 'V-ORDER' appears only as a TAG, a property not intended to convey something as important as the optimization format of the file (I think so; what are your thoughts?).
When optimizing at the write level, the optimization is designed to affect specific parquet files, so it's not included in the table's TAG, and since it was not included in the metadata anyway, we are left with no way to confirm the optimization. There is some logic to it, but it's still a gap.
We end up with many questions: Why does it only appear in the TAG?
How can we identify the optimization on individual parquet files?
What about OptimizeWrite, for example, which appears nowhere?
Kind Regards,
Dennes
@DennesTorres Based on this official doc, I can say it is indeed an optimization at the parquet file level:
https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparks...
When I turn it on at the session level, I see it applied in the parquet files' metadata using pyarrow (code above).
We also see the tag at the delta log level; my guess is that the tag in the delta log is there for a reason.
The accurate way, as I understand it, is to check at the parquet metadata level. However, in my issue, when I turn it off at the session level and try to control it at the DataFrame write level, I do not see the expected behavior in the parquet metadata.
@DennesTorres I checked the metadata of the parquet files using pyarrow, printing the dataset-level schema for the table with the function below. I still don't see any metadata related to V-Order, and I'm not sure what I am missing. Can you test this on your end and see whether you get the same behaviour?
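A per-file footer check of the kind described here could look roughly like the sketch below. It is not the function from this post. It assumes that any V-Order marker would show up as a footer key/value entry whose key contains "vorder" (the exact key name is not confirmed anywhere in this thread), and the names `vorder_entries` and `vorder_by_file` are hypothetical. Checking file by file also helps with the mixed-table question, since each parquet file is reported separately.

```python
from pathlib import Path


def vorder_entries(key_value_metadata):
    """Return the footer key/value pairs whose key mentions 'vorder'.

    Assumption: a V-Order marker, if present, is a key containing
    'vorder' (case-insensitive). The real key name is not confirmed here.
    """
    found = {}
    for key, value in (key_value_metadata or {}).items():
        k = key.decode("utf-8", "replace") if isinstance(key, bytes) else str(key)
        v = value.decode("utf-8", "replace") if isinstance(value, bytes) else str(value)
        if "vorder" in k.lower():
            found[k] = v
    return found


def vorder_by_file(table_dir):
    """Map each data file under table_dir to its V-Order-like footer entries."""
    # Imported lazily so the pure helper above works without pyarrow installed.
    import pyarrow.parquet as pq

    result = {}
    for path in sorted(Path(table_dir).rglob("*.parquet")):
        # Skip Delta checkpoint files, which are also parquet.
        if "_delta_log" in path.parts:
            continue
        footer = pq.read_metadata(str(path))  # pyarrow FileMetaData
        result[path.name] = vorder_entries(footer.metadata)
    return result
```

With a default lakehouse attached, `table_dir` would typically be something like `/lakehouse/default/Tables/<table name>`; an empty dict for a file means no key containing "vorder" was found in that file's footer.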
Hi,
I will check.
But this may be related to my previous comment: on tables optimized by session-level configuration, the V-ORDER optimization appears only as a TAG; it doesn't appear as metadata, and I don't know exactly why. Would this mean the table is not optimized, or do we need a different way to identify that it is optimized?
When trying the optimization on the write, instead of at the session level, the TAG doesn't appear either.
Kind Regards,
Dennes
Hi,
About this:
"But despite this, both instances in the script seem to be executing with V-Order disabled for both of them in a single session"
How do you know the v-order was disabled?
"We've also looked through our delta logs for both files in Tables and in Files folder, but there are no traces of the Vorder tag to be found, which is puzzling and contrary."
Could you give more details about this?
I have been working on a related challenge: How to identify if an existing table was created with v-order enabled or not?
Here is a thread about my investigation: https://community.fabric.microsoft.com/t5/General-Discussion/How-to-list-Table-Properties/m-p/337130...
Another one which may or may not be related: https://community.fabric.microsoft.com/t5/Issues/Workspace-level-boolean-spark-configurations-appear...
Kind Regards,
Dennes
@DennesTorres
I am manually checking the _delta_log files in OneLake explorer to see whether the VOrder tag is present and set to true.
You can also write Spark code to check the delta log details from a notebook.
Here is my code to read the metadata of the table:
tablebasepath="your path to the table "
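A minimal sketch of such a delta log check, using only the Python standard library, is shown below. It is an illustration rather than the script from this post. It assumes the table's `_delta_log` folder is reachable as a local path (for example through the default lakehouse mount) and that the tag key contains "vorder"; the function name `find_vorder_tags` is hypothetical.

```python
import json
from pathlib import Path


def find_vorder_tags(delta_log_dir):
    """Scan the JSON commit files in a _delta_log folder and return a list of
    (file name, action type, tag key, tag value) for every tag whose key
    mentions 'vorder' (case-insensitive)."""
    hits = []
    for log_file in sorted(Path(delta_log_dir).glob("*.json")):
        with open(log_file, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                # Each line is one action, e.g. {"commitInfo": {...}} or {"add": {...}}.
                action = json.loads(line)
                for action_type, body in action.items():
                    if not isinstance(body, dict):
                        continue
                    tags = body.get("tags")
                    if not isinstance(tags, dict):
                        continue
                    for key, value in tags.items():
                        if "vorder" in key.lower():
                            hits.append((log_file.name, action_type, key, value))
    return hits
```

For example, `find_vorder_tags(tablebasepath + "/_delta_log")` returns an empty list when no commit carries such a tag, which matches what was observed for the write-level option in this thread.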
Hi,
I tested your script. All the tables I have were marked with V-ORDER, but what caught my attention was that V-ORDER was only a TAG; it was not in the metadata. Is this correct?
I managed to turn off V-ORDER optimization at the session level and the TAG disappeared. But after that I reproduced your results: trying to enable V-Order for one specific write operation doesn't bring the TAG back.
One detail in the article you are using is the mention that the write configuration affects only the parquet files involved in the operation, not the entire table. Could this result in a table with mixed parquet files, some with the optimization and some without? Could this explain why V-Order doesn't appear at the table level?