How to improve data-model from 1 large table to mu...

vPCvP · ‎11-16-2023

Hello,

I made a table to demonstrate my problem, original_table:

date	client	box_id	product_id
16-11-2023	james	fruit_box_01	orange
16-11-2023	james	fruit_box_01	pineapple
16-11-2023	james	fruit_box_01	pear
16-11-2023	james	fruit_box_01	banana
16-11-2023	james	fruit_box_01	avocado
16-11-2023	james	fruit_box_02	apple
16-11-2023	james	fruit_box_02	orange
16-11-2023	james	fruit_box_02	banana
16-11-2023	james		mango
16-11-2023	james		grapes
16-11-2023	peter	fruit_box_02	apple
16-11-2023	peter	fruit_box_02	orange
16-11-2023	peter	fruit_box_02	banana
16-11-2023	peter		mango
16-11-2023	peter		avocado
16-11-2023	peter		banana
16-11-2023	robert		pineapple
16-11-2023	robert		pear
16-11-2023	robert		banana
16-11-2023	robert		mango
16-11-2023	john	fruit_box_01	orange
16-11-2023	john	fruit_box_01	pineapple
16-11-2023	john	fruit_box_01	pear
16-11-2023	john	fruit_box_01	banana
16-11-2023	john	fruit_box_01	avocado
16-11-2023	john		mango
16-11-2023	frank		banana
16-11-2023	frank		orange

The main problem with this data, is that it becomes very large due to the fact that a lot of people can have a box_id assigned, which can contain a lot of products. So let's say 2000 clients get assigned a box that contains 200 products, I will end up with 400000 rows and that's just for one box and I have a lot of boxes/clients. So in my opinion a lot of unnecessary repetition.

So to solve this, I wanted to split this table into multiple tables, so I can just assign a client his boxes and his separate products. I tried doing this by splitting the table into two separate tables, tablesplit1:

date	client	collection_id
16-11-2023	james	fruit_box_01
16-11-2023	james	fruit_box_02
16-11-2023	james	mango
16-11-2023	james	grapes
16-11-2023	peter	fruit_box_02
16-11-2023	peter	mango
16-11-2023	peter	avocado
16-11-2023	peter	banana
16-11-2023	robert	pineapple
16-11-2023	robert	pear
16-11-2023	robert	banana
16-11-2023	robert	mango
16-11-2023	john	fruit_box_01
16-11-2023	john	mango
16-11-2023	frank	banana
16-11-2023	frank	orange

and tablesplit2:

date	collection_id	box_id	product_id
16-11-2023	fruit_box_01	fruit_box_01	orange
16-11-2023	fruit_box_01	fruit_box_01	pineapple
16-11-2023	fruit_box_01	fruit_box_01	pear
16-11-2023	fruit_box_01	fruit_box_01	banana
16-11-2023	fruit_box_01	fruit_box_01	avocado
16-11-2023	fruit_box_02	fruit_box_02	apple
16-11-2023	fruit_box_02	fruit_box_02	orange
16-11-2023	fruit_box_02	fruit_box_02	banana
16-11-2023	mango		mango
16-11-2023	grapes		grapes
16-11-2023	avocado		avocado
16-11-2023	banana		banana
16-11-2023	pineapple		pineapple
16-11-2023	pear		pear
16-11-2023	orange		orange

The logic behind collection_id is that when there is a box_id then it will be equal to box_id, else it will be equal to the product_id. This logic get's applied in both part 1 and 2, part 1 assigns the client to the collection_id and part 2 assigns the collection_id to the box_id and the product_id.

This however, creates a new problem, because this will create a many-to-many relationship between these tables. Because multiple clients can have multiple collection_id's and vice versa.

So I tried solving this with a bridge table, following this example:
Joining Many to Many with a Bridge in Power BI | Seer Interactive

But that didn't allow me to recreate the original_table in Power BI. I want to create a table in Power BI like this that shows per client the products and the boxes (if applicable). Since using a bridge only allows you to connect to one side and the other side can only be summarizations.

So then it's back to the original many-to-many construction, but combining this with a date table confuses me even more. So in general would like some guidance on how to proceed from here with the following in mind:

I will get a weekly export and need to be able to show changes per week.
I will need to be able to filter on client, box_id and product_id and it needs to filter both ways.
I must be able to recreate the original_table in a table in Power BI

Any help is greatly appreciated, and if I need to explain anything else, please let me know.

SmartSmith · ‎11-18-2023

It seems like you're trying to design a database schema for your data to avoid redundancy and improve efficiency. Based on your description, you've already attempted to split the data into two tables but are facing challenges due to the many-to-many relationship.

Here's a suggested approach based on your requirements:

Tables:

Calendar Table:

Date
Week
Month
Year
...

Clients Table:

ClientID (Primary Key)
ClientName

Products Table:

ProductID (Primary Key)
ProductName

Boxes Table:

BoxID (Primary Key)
BoxName

Collection Table:

CollectionID (Primary Key)
BoxID (Foreign Key referencing Boxes Table)
ProductID (Foreign Key referencing Products Table)
DateID (Foreign Key referencing Calendar Table)
ClientID (Foreign Key referencing Clients Table)

Relationships:

Clients to Collection: One-to-Many (A client can have multiple collections over time)
Products to Collection: One-to-Many (A product can be part of multiple collections)
Boxes to Collection: One-to-Many (A box can contain multiple products in a collection)
Calendar to Collection: One-to-Many (A collection can belong to a specific date)

Fact Table:

Collection Fact Table:

This table will contain any metrics or measures you want to track for each collection, such as the quantity of products.

Queries:

To recreate the original table in Power BI, you can use relationships to create visualizations that display data per client, box, and product.
To show changes per week, use the Calendar table and filter based on the date.

Considerations:

Ensure that you have unique primary keys for each table.
Use surrogate keys (auto-incremented integers) for primary keys to improve efficiency.
The fact table can include measures like product quantity, allowing you to aggregate data.

Remember that normalization is about efficiently organizing data and minimizing redundancy. The schema above allows you to represent the relationships between clients, boxes, products, and dates while maintaining the ability to analyze data over time and by different dimensions. Adjust the schema based on specific business requirements and the nature of your data. Astro luna hoy is a best game.

lbendlin · ‎11-18-2023

Please read about Normalization. From your sample data you can create the following dimensions:

- Calendar

- Clients

- Products

- Boxes

What's missing is a fact column - do you only want to count the rows?

How to improve data-model from 1 large table to multiple smaller ones correctly

Helpful resources

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025

How to Get Your Question Answered Quickly

Become a Certified Power BI Data Analyst!

How to improve data-model from 1 large table to multiple smaller ones correctly

Helpful resources

Power BI Monthly Update - June 2025

Fabric Community Update - June 2025

How to Get Your Question Answered Quickly