I am looking for advice on how best to model the data in the scenario outlined below. The core of this problem is balancing scalability with simplicity.
Consider a scenario with three tables (this is a simplified use case). The first is an object table containing three columns: a unique object id, a name, and an object type. The second is a feature value table containing object id, feature id, and feature value. The third and final table is a feature table containing a unique feature id and a feature name.
In terms of scale, a customer may typically have 100,000s of objects with 1,000s of features; some customers much more and some much less. The data is likely extremely sparse: an object may have only 40 or 50 feature values, but these will be spread across the many different features.
I am sure this row type structure of data storage has a name but it eludes me at the moment.
The problem is how to analyze the data by specific features. Consider as an example the features activity type, launch date, color and weight. I want to use the structure above to report on all objects of object type footwear and create a heat map by activity type with drill-down to color, then review this by product at different points in time, such as before 2024 or after 2023.
The challenge is how to report on the featureValues table, whose values could be of type date, integer, float, list of values, text, etc. For a given feature all values will be of the same type, but the table contains 1,000s of features of various types for 100,000s of objects, so it can run to millions of rows.
Obviously I would wish to avoid creating separate tables for each feature, as this has scalability and manageability issues.
What is the best approach to modeling this data so that business analysts can work with it easily in Power BI? The actual data source is not Excel but JSON loaded into my MS Fabric Lakehouse and transformed using PySpark.
Thoughts and pointers on how to do this are appreciated. I have a very limited scope for remodeling the data to simplify the problem if needed.
Sample data below
object
| id | name | type |
|---|---|---|
| A0001 | Sneaker 1 | footwear |
| A0002 | Sneaker 2 | footwear |
| A0003 | Sneaker 3 | footwear |
| A0004 | Sneaker 4 | footwear |
| A0101 | Ski Glove 1 | gloves |
| A0102 | Ski Glove 2 | gloves |
featureValues
| id | featureId | value |
|---|---|---|
| A0001 | FID00001 | 2024-01-01 |
| A0001 | FID00002 | Tan |
| A0001 | FID00003 | Running |
| A0001 | FID00004 | 900 |
| A0002 | FID00001 | 2024-01-31 |
| A0002 | FID00002 | Blue |
| A0002 | FID00003 | Hiking |
| A0002 | FID00004 | 1200 |
| A0003 | FID00001 | 2023-12-21 |
| A0003 | FID00002 | White |
| A0003 | FID00003 | Tennis |
| A0003 | FID00004 | 950 |
| A0004 | FID00001 | 2023-03-01 |
| A0004 | FID00002 | Blue |
| A0004 | FID00003 | Running |
| A0004 | FID00004 | 775 |
| A0101 | FID00001 | 2024-01-01 |
| A0101 | FID00003 | Skiing |
| A0101 | FID00004 | 267 |
| A0102 | FID00001 | 2023-10-01 |
| A0102 | FID00003 | Sailing |
| A0102 | FID00004 | 201 |
featureNames
| featureId | name |
|---|---|
| FID00001 | Launch Date |
| FID00002 | Color |
| FID00003 | Activity |
| FID00004 | Weight (g) |
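To make the intended report concrete, here is a minimal pandas sketch of the analysis described above (heat map of activity by color, sliced by launch date). It assumes the featureValues rows have already been pivoted into a wide table and joined to the object table; the column names and the wide layout are assumptions for illustration, using the sample data.

```python
import pandas as pd

# Hypothetical wide table: the sample featureValues pivoted into columns
# and joined to the object table (layout assumed for illustration).
wide = pd.DataFrame(
    {
        "id": ["A0001", "A0002", "A0003", "A0004"],
        "type": ["footwear", "footwear", "footwear", "footwear"],
        "launch_date": pd.to_datetime(
            ["2024-01-01", "2024-01-31", "2023-12-21", "2023-03-01"]
        ),
        "color": ["Tan", "Blue", "White", "Blue"],
        "activity": ["Running", "Hiking", "Tennis", "Running"],
    }
)

# Slice footwear launched before 2024, then count activity x color
# to get the matrix behind a heat map.
before_2024 = wide[(wide["type"] == "footwear") & (wide["launch_date"] < "2024-01-01")]
heatmap = pd.crosstab(before_2024["activity"], before_2024["color"])
print(heatmap)
```

The same slice-then-aggregate shape is what a Power BI matrix visual would compute once the data is wide, which is why the modeling question below matters.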
Hi @Anonymous ,
Thanks for lbendlin's reply.
Based on your description, your data fits an EAV (entity-attribute-value) model. Since a feature value can be of various types (date, integer, float, text, etc.), it is possible to store all the values as strings and convert them to the appropriate type when analyzing. Make sure to index object_id and feature_id in the feature value table to speed up queries. Import the converted data into Power BI; you can use DirectQuery if the dataset is too large.
The EAV data model: when to use it (inviqa.com)
Best regards,
Albert He
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly
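A minimal sketch of the "store as strings, convert when analyzing" idea from the reply above, in pandas. The feature-type registry and the helper function are hypothetical names introduced for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical registry: each featureId maps to one declared type.
feature_types = {"FID00001": "date", "FID00004": "int"}

# EAV rows with every value held as a string.
eav = pd.DataFrame(
    [("A0001", "FID00001", "2024-01-01"), ("A0001", "FID00004", "900")],
    columns=["id", "featureId", "value"],
)

def cast_feature(df: pd.DataFrame, feature_id: str) -> pd.Series:
    """Return the string values for one feature, cast to its declared type."""
    values = df.loc[df["featureId"] == feature_id, "value"]
    kind = feature_types.get(feature_id, "str")
    if kind == "date":
        return pd.to_datetime(values)
    if kind == "int":
        return pd.to_numeric(values).astype("Int64")
    return values

launch_dates = cast_feature(eav, "FID00001")
weights = cast_feature(eav, "FID00004")
```

Because all values for a given feature share one type (as stated in the question), a per-feature cast like this is safe; the registry would live alongside the featureNames table.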
I guess as an obvious follow-up question: what is the best approach to transforming EAV data for analysis in Power BI?
My inclination right now is to avoid doing this at the SQL level due to the overhead of dealing with the high volumes of data and updates.
I am considering whether it could be best to do this using dataframes and pyspark.
Any thoughts, opinions and/or references on this?
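For reference, the core transform is transposing the key/value rows back into columns. A minimal pandas sketch (sample values taken from the data above; everything else is illustrative):

```python
import pandas as pd

# EAV rows, all values held as strings.
eav = pd.DataFrame(
    [
        ("A0001", "FID00002", "Tan"),
        ("A0001", "FID00003", "Running"),
        ("A0002", "FID00002", "Blue"),
        ("A0002", "FID00003", "Hiking"),
    ],
    columns=["id", "featureId", "value"],
)

# One row per object, one column per feature; objects missing
# a feature end up with NaN in that column (preserving sparsity).
wide = eav.pivot(index="id", columns="featureId", values="value").reset_index()
```

In PySpark the equivalent transpose would be roughly `df.groupBy("id").pivot("featureId").agg(F.first("value"))`, though with thousands of distinct features the resulting column count is worth testing at scale.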
dataframes are kinda the opposite of key/value pairs. I would think that in-memory columnar data stores are better suited at handling that kind of data, especially when it is presorted (resulting in substantial compression). The question will be how much the presentation layer (and I include the Vertipaq engine here) can help mitigate the additional administrative effort to transpose the key/values back into columns and rows.
This is a great reference and helps me dig deeper to implement a solution using EAV models.
> I am sure this row type structure of data storage has a name
It's called "the dimension table with (too) high cardinality primary key"
Power BI struggles with any relationship where the key column has a cardinality over 50000. The issue is that the query (each query) will enumerate all values in the key column.
I am not aware of a remedy. Maybe Qlik with its associative model is a better fit.