Anonymous
Not applicable

Visualizing time series data that only contain changes

Hello,

 

I have a time series dataset representing product stock in a store. The columns are:

  • date (type:date)
  • product_id (type:string)
  • in_stock (type:integer)

However, for most combinations of dates and products there is no change in stock. So instead of having a row in the dataset for every combination of date and product, there is only a row present if there has been a change in the stock level.

 

As an example, a dataset of this structure for dates 2022-03-01 to 2022-03-12 might look like:

date        product_id  in_stock
2022-03-01  A           5
2022-03-10  A           10
2022-03-05  B           2
2022-03-12  B           8

 

I would like to:

  • Visualize time series of individual products.
  • Visualize aggregations, e.g. total stock of a set of products over time.

 

The problem is that if I, for instance, filter on product A, the standard visualization interpolates linearly between a stock of 5 products on 2022-03-01 and a stock of 10 products on 2022-03-10. This is not what I want. The stock was 5 up to and including 2022-03-09, before it rose to 10 on 2022-03-10.
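To illustrate the behavior I want (a minimal matplotlib sketch outside Power BI, with the example values for product A hard-coded):

from datetime import date
import matplotlib.pyplot as plt

# "steps-post" holds each value until the next change, instead of
# interpolating linearly between the two change rows.
dates = [date(2022, 3, 1), date(2022, 3, 10)]
stock = [5, 10]
plt.plot(dates, stock, drawstyle="steps-post", marker="o")
plt.title("Product A stock as a step function")
plt.show()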

 

For the aggregation case there is also an issue: if I try to aggregate over a set of products to compute the sum, it only sums the rows that are present (the dates on which a stock change occurred). In the example above, the sum for 2022-03-05 will be 2, but in reality the sum is 2 + 5 = 7. For 2022-03-11 the sum will be 0, but in reality it is 2 + 10 = 12.

 

What I have tried

  • Expanding the dataset to include every combination of date and product. This increases the number of rows by roughly 100x, leading to a table with billions of rows. Even with indexes on the table in the database, this is too slow to work with.
  • Using Python to create a custom visual. I can query the dataset, process it in Python in linear time to create "dummy" observations between actual rows (or do aggregations), and then show it using matplotlib (roughly the processing sketched below). But this functionality does not let me retrieve more than 1 million rows, so while it works for some cases it does not work for all.
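For reference, the Python processing I mean is roughly the following (a minimal pandas sketch on the example data above; the real code works on the query result rather than a hard-coded frame):

import pandas as pd

# Change rows as in the example dataset.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-03-01", "2022-03-10", "2022-03-05", "2022-03-12"]),
        "product_id": ["A", "A", "B", "B"],
        "in_stock": [5, 10, 2, 8],
    }
)

all_dates = pd.date_range(df["date"].min(), df["date"].max(), freq="D")

# One dense step series per product: the last known stock level is
# carried forward to every date, creating the "dummy" observations.
dense = {
    pid: grp.set_index("date")["in_stock"].reindex(all_dates).ffill()
    for pid, grp in df.groupby("product_id")
}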

Help very much appreciated. If you have any questions please ask and I will explain more details.

3 REPLIES
Anonymous
Not applicable

Hi @Anonymous ,

 

According to my understanding, you want to get the total in_stock for each date, and the dates should be continuous.

If so, you first need to create a new table containing every date between the min and max date of the original table, crossed with every product_id:

New Table = 
var _dates = CALENDAR ( MIN ( 'Table'[date] ), MAX ( 'Table'[date] ) )    // every date between min and max
var _products = VALUES ( 'Table'[product_id] )                            // every distinct product
return CROSSJOIN ( _dates, _products )

Then, to find the correct stock for each day, add a calculated column to the new table:

In_Stock = 
var _date = CALCULATE ( MAX ( 'Table'[date] ), FILTER ( 'Table', 'Table'[product_id] = 'New Table'[product_id] && 'Table'[date] <= 'New Table'[Date] ) )    // latest change on or before this row's date
return LOOKUPVALUE ( 'Table'[in_stock], 'Table'[product_id], 'New Table'[product_id], 'Table'[date], _date )

[screenshot: the resulting 'New Table' with a continuous date column and the filled In_Stock values]

 

Best Regards,
Eyelyn Qin
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

Anonymous
Not applicable

Thank you so much for your time and effort. Unfortunately, your solution is essentially one of the things that I've already tried (see the "What I've tried" section in the post). It does not solve the problem.

 

Your solution does the right thing, but the computational complexity of explicitly forming the cross join makes it infeasible. In reality the dataset is very sparse: most products do not change on most dates. The sparsity factor is about 1000x, so computing the cross join creates a dataset with 1000x as many rows as the original. This is infeasible.

 

The total stock can be found in O(number_of_rows) time. It's enough to see each row once and keep track of one counter for each product ID, in a language such as Python. But the computational complexity of the cross join is O(number_of_dates * number_of_products), which is 1000x larger than O(number_of_rows) in my actual data set. I wrote 100x in the original post, but 1000x is more accurate. I don't know of any SQL-like way to perform the computation without a cross join though.
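A minimal sketch of that single pass (plain Python, with the example rows hard-coded):

from collections import defaultdict

# Change rows (date, product_id, in_stock), as in the example dataset.
rows = [
    ("2022-03-01", "A", 5),
    ("2022-03-05", "B", 2),
    ("2022-03-10", "A", 10),
    ("2022-03-12", "B", 8),
]

current = defaultdict(int)   # last known stock level per product
total = 0                    # running total across all products
total_by_date = {}           # date -> total stock from that date onward

for d, pid, in_stock in sorted(rows):   # one pass over the sorted change rows
    total += in_stock - current[pid]    # apply only this product's delta
    current[pid] = in_stock
    total_by_date[d] = total

# total_by_date == {"2022-03-01": 5, "2022-03-05": 7, "2022-03-10": 12, "2022-03-12": 18}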

Anonymous
Not applicable

Does anyone have any input on this? Any help would be greatly appreciated.
