fredforest
New Member

Data types in Delta

In the olden days of data warehousing, a developer/modeler would be careful to optimize data types (especially on large fact tables) for a variety of reasons.

I'm wondering if there's still a reason to care about this in a Fabric world.

So... any reasons to keep learning about data type optimization or should we just drop this whole topic altogether?

1 ACCEPTED SOLUTION

Well, I did offer some advice as to why we still want to optimise data types. The engine will provision the resources necessary to process data operations based on data types and precision. The larger the data type and precision, the more resources are used for processing. If you keep your data type sizes to a minimum, the engine will only provision what it needs.

 

Also, if we define our tables with data type sizes in Delta, the engine will enforce those sizes (e.g. you cannot declare a string as 100 characters and then insert 200 characters into it).
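
To make that enforcement concrete, here is a minimal sketch of what it looks like from a Fabric notebook, assuming a ready `spark` session; the table and column names are made up for illustration:

```python
# Minimal sketch: declared string lengths are enforced on write (Spark 3.1+ / Delta).
# Assumes a Fabric notebook with a `spark` session; table/column names are hypothetical.

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_id   INT,
        product_code VARCHAR(10)   -- declared maximum length
    ) USING DELTA
""")

# Fits within VARCHAR(10): accepted.
spark.sql("INSERT INTO dim_product VALUES (1, 'ABC-001')")

# 21 characters into a VARCHAR(10) column: rejected with an error rather
# than silently truncated or widened.
try:
    spark.sql("INSERT INTO dim_product VALUES (2, 'THIS-CODE-IS-TOO-LONG')")
except Exception as err:
    print(f"Insert rejected: {err}")
```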


8 REPLIES
v-cboorla-msft
Microsoft Employee

Hi @fredforest 

 

Glad that your query got resolved.

Please continue using Fabric Community for any help regarding your queries.

v-cboorla-msft
Microsoft Employee

Hi @fredforest 

 

We haven’t heard back from you since the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it can be helpful to others.
Otherwise, we will follow up with more details and try to help.


Thanks

Hi,

 

No, besides @AndyDDC, who shared the same question I had, I have sadly not received any further information on the topic.

Hi @fredforest 

Adding to @AndyDDC's reply: Microsoft Fabric is a unified platform that allows you to store, process, and analyze data using various compute engines, such as Power BI, SQL, Spark, and others. All these engines can access the same data stored in a central lakehouse, which uses the Delta Lake table format. Delta Lake is an open-source storage layer that supports ACID transactions, schema enforcement, and time travel on top of Parquet files.
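
As a small, hedged illustration of the time-travel capability mentioned above (assuming a Fabric notebook with a `spark` session and an existing Delta table, here called `sales` purely as an example):

```python
# Sketch of Delta time travel from a Fabric notebook. The `sales` table name
# is hypothetical; any existing Delta table in the lakehouse would do.

# Each write is a versioned commit; DESCRIBE HISTORY lists them.
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Read the table as it looked at an earlier version (time travel).
earlier = spark.sql("SELECT * FROM sales VERSION AS OF 0")
earlier.show()
```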

When you create a table in Microsoft Fabric, you can specify the data types for each column using T-SQL syntax. However, not all T-SQL data types are supported by Parquet, and some of them may have different precisions or representations. Therefore, Microsoft Fabric will map the T-SQL data types to the corresponding Parquet types under the hood and apply some optimizations to improve the performance and compression of the data.
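
The paragraph above describes the T-SQL (warehouse) side, but the same idea can be checked from a Fabric notebook: declare deliberately narrow types and inspect the schema the Delta table actually records. This is only a sketch under assumed names:

```python
# Sketch: declare narrow, explicit column types and confirm what Delta records.
# Assumes a Fabric notebook with a `spark` session; names are hypothetical.

spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales_typed (
        sale_id    BIGINT,
        store_id   SMALLINT,        -- 2-byte integer instead of a default 8-byte one
        qty        INT,
        unit_price DECIMAL(9, 2),   -- fixed precision; small decimals encode compactly in Parquet
        sale_date  DATE
    ) USING DELTA
""")

# DESCRIBE shows the logical types stored in the Delta table; these are mapped
# to Parquet physical types when the files are written.
spark.sql("DESCRIBE TABLE fact_sales_typed").show(truncate=False)
```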

While Microsoft Fabric offers significant optimizations, data type optimization remains a valuable practice for storage efficiency, query performance, data integrity, and compatibility. Understanding these implications will enable you to make informed decisions in your data warehouse design and ensure optimal performance and data quality within Microsoft Fabric.

For more details, please refer to: Link

I hope this helps. Please do let us know if you have any further questions.

Hi @fredforest 

 

We haven’t heard back from you since the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it can be helpful to others.
Otherwise, we will follow up with more details and try to help.


Thanks

Hi @fredforest 


We haven’t heard back from you since the last response and wanted to check whether you have found a resolution yet.
If you have, please share it with the community, as it can be helpful to others.
If you have any questions relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.


Thanks


AndyDDC
Super User

That's a great question @fredforest. Personally, I'm going to keep optimising data types even in this new world of Delta and Parquet. One of the best practices I worked with in Synapse serverless (which the lakehouse endpoint and warehouse endpoint are built from) was optimising data types. The reason was that the data types of the columns used in a workload directly impacted the amount of cluster resources given to process that workload: the larger the data type, the more resources were provided. Also, when joining data across different Delta tables, you still can't beat integers.
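
To illustrate that last point, here is a hedged sketch of the usual surrogate-key pattern in PySpark (all table and column names are invented): give the dimension a compact integer key and join the fact data on that key instead of a wide string.

```python
# Sketch: integer surrogate keys keep joins across Delta tables cheap.
# Assumes a Fabric notebook with a `spark` session; all names are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

customers = spark.table("raw_customers")  # wide natural key, e.g. customer_code STRING

# Assign a compact integer surrogate key (simplistic: single-partition window,
# fine for a small dimension).
dim_customer = customers.withColumn(
    "customer_sk", F.row_number().over(Window.orderBy("customer_code"))
)
dim_customer.write.format("delta").mode("overwrite").saveAsTable("dim_customer")

# Fact rows carry the integer key, so downstream joins compare 4-byte integers
# instead of long strings.
fact_sales = (
    spark.table("staging_sales")
    .join(dim_customer.select("customer_code", "customer_sk"), "customer_code", "left")
    .drop("customer_code")
)
fact_sales.write.format("delta").mode("append").saveAsTable("fact_sales")
```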
