Hello Community,
I have run into a Power BI dataflow issue. My source table is in BigQuery; it gets updated daily and holds data for the last 90 days on a rolling basis.
What I would like to accomplish is to provide this data to another team in the org using dataflows.
Can't use a direct connection to BQ, for access reasons.
Can't use a dataset, as the team has its own dataset with other sources.
When I create the dataflow and run the first refresh, it fails with a "response too large to return" error.
I tried filtering to 30 days and that worked, but even if I create three different dataflows I can't think of a way to schedule this, as I am manually adding a "last 30 days" filter on one and hard-coding the actual dates for the earlier days on the other two to accumulate to 90.
I have also made the table as lean as possible, and it can't be aggregated (the requirement is for that level of granularity).
I am considering bringing in just a tiny amount of data (30 days) and then running an incremental load (loading the data piece by piece until I get to 90 days), but I am yet to try that out. Before I begin, I just wanted to find out whether I have any other options or different approaches.
Thanks
Solved! Go to Solution.
@Sam_Jain Sorry, it's very hard to tell what is going on. What error message do you get in Gen2? I'm wondering whether it is a transient error or whether it is failing at the BQ end.
Can't use a direct connection to BQ, for access reasons.
Challenge that assumption. Tell them that 650 GB is too much.
If they don't budge, use Incremental Refresh. Start VERY small, with a day or so. See how many rows you get. Recommended partition size is around 8 to 20 million rows.
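For reference, here is a minimal sketch of what an incremental-refresh-ready query might look like in Power Query M. The project, dataset, table, and column names (my-project, analytics, daily_events, EventDate) are placeholders, and the navigation steps are shown schematically; use whatever steps the BigQuery connector generates for your table. RangeStart and RangeEnd are the reserved datetime parameters used by incremental refresh (for a dataflow, the service can create both the parameters and this filter for you when you configure incremental refresh on the entity).

```
let
    // Connect to BigQuery and navigate to the source table (placeholder names).
    Source   = GoogleBigQuery.Database(),
    RawTable = Source{[Name = "my-project"]}[Data]{[Name = "analytics"]}[Data]{[Name = "daily_events"]}[Data],

    // Filter the datetime column with the incremental-refresh parameters.
    // Keeping the filter this simple lets it fold back to BigQuery as a WHERE clause,
    // so each partition pulls only its own slice of rows.
    Windowed = Table.SelectRows(RawTable, each [EventDate] >= RangeStart and [EventDate] < RangeEnd)
in
    Windowed
```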
Yeah, so the 650 GB is for the entire dataset on BQ, which is roughly a year's worth of data. The proposed solution is to have two datasets: one rolled up monthly, which I was able to successfully create and expose through a dataflow, and a second one that is a 90-day rolling window at the daily level. For that one I created a stored procedure on BQ that deletes the oldest day's data and adds the newest day's data from the source table, scheduled as a daily run to maintain the 90-day rolling window.
The challenge started when I created a dataflow off of it (about 200 GB on BQ), which kept failing on me, so I cut the columns roughly in half, from 50-odd down to 30. That still didn't work.
So I then filtered by date in the Power Query editor (roughly as in the sketch below), and that worked! But now it only has 30 days of data and I'm not sure how to proceed from here.
Also, the dataset on BQ is partitioned by day for efficiency.
From what you are suggesting, it looks like I have done the first bit, getting data for 30 days, and now I should use incremental refresh for every 15-30 days until I have all 90? Is that the right approach?
Thanks.
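For illustration, the manual "last 30 days" filter described above might look roughly like this in M. The names are the same placeholders as in the earlier sketch, and DateTime.LocalNow() is what makes the window roll with each refresh; this is the filter that incremental refresh later replaces.

```
let
    // Placeholder navigation to the BigQuery table.
    Source   = GoogleBigQuery.Database(),
    RawTable = Source{[Name = "my-project"]}[Data]{[Name = "analytics"]}[Data]{[Name = "daily_events"]}[Data],

    // Rolling 30-day cutoff relative to the refresh time.
    Cutoff   = Date.AddDays(DateTime.LocalNow(), -30),

    // Keep only the last 30 days; [EventDate] is assumed to be a datetime column.
    Last30   = Table.SelectRows(RawTable, each [EventDate] >= Cutoff)
in
    Last30
```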
Use the biggest partition size you can get away with. Your choices are Day, Month, Quarter, and Year.
I would start with Day and check how many rows you get per partition. If not too many you can switch to Month.
Please help me understand this step. My aim was to get data for the past 90 days, so the period would be roughly 9 April to 9 July. The current dataflow has data for the last 30 days, meaning 9 June to 9 July.
What am I supposed to enter for the incremental refresh, given that the refresh period can't be longer than the store period?
And even if I somehow manage to bring in all of the past 90 days, I still have to figure out a way to keep that as a rolling 90 days.
Thanks, and I appreciate all your help.
Set it to refresh 1 day and store 10 days. Then evaluate the refresh log to see how many rows are in each partition. If there are substantially fewer than 8M rows per day then you can change your settings to 1 day and 3 months.
You will get the rolling 90 days automatically (plus some extra days to complete the months, as the actual month partitions will include not three but four months). You could change your setting to 1 day and 90 days but that would mean you would have 90 partitions. Only do that if necessary.
Okay, I just ran the refresh set to store 1 day and refresh the last 3 months. Also, from the dataset info on BQ, there is an average of about 4M rows per day.
So this should potentially have all the data for the past 90 days and more. I'm just waiting for the run to finish.
Once this is successful, I'll have to change the settings to store x months and refresh 1 day, so that the latest day's data gets added, right?
My only concern: new data keeps getting added with the daily refresh, but there is also historic data present which, as days pass, will add up to much more than 90 days. Will I run into storage/processing issues at that point?
Is there a way to automatically prune historic data older than 90 days?
Thanks!
Will I run into storage/processing issues at that point?
Is there a way to automatically prune historic data older than 90 days?
keep "x" at 3 and the Power BI service will do the pruning for you automatically.
I ran the refresh and it took about 45 minutes. I checked the log and it looks like rows were processed. I kept the incremental refresh at store 3 months and refresh 1 day.
I then tried connecting to the dataflow to verify, and it's empty! I checked the query editor and the two parameters, RangeStart and RangeEnd, have been created; one date is historic and the other is in the future, so I assume they were preset by the incremental refresh.
Should I get rid of the filter I initially had on the date column to keep the last 30 days? Or does the incremental refresh supersede that and take into account days from the latest date available?
Where did I go wrong? Attaching the log for reference. Thanks again!
Yes, it looks like your day partitions processed fine. Do not add your own filters for the range, and do not use any transforms that break query folding.
Sorry, was that a yes to getting rid of the date filter I had initially (last 30 days)?
Because right now, when I connect to the dataflow, it's empty; it shows nothing.
Just want to make sure.
The only transformations I have on there are that date filter and one where I remove those 20-odd columns.
Oh, actually one more: I had to change the data type on the date column from date to date/time in order to use it for incremental refresh.
Sorry, was that a yes to getting rid of the date filter I had initially (last 30 days)?
yes
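To make that concrete, here is a minimal sketch of what the remaining query might look like once the manual 30-day filter is removed, using the same placeholder names as the earlier sketches: only the column removal, the date-to-datetime type change, and the RangeStart/RangeEnd window are left, all of which should fold back to BigQuery.

```
let
    // Placeholder navigation to the BigQuery table.
    Source   = GoogleBigQuery.Database(),
    RawTable = Source{[Name = "my-project"]}[Data]{[Name = "analytics"]}[Data]{[Name = "daily_events"]}[Data],

    // Drop the ~20 columns the downstream team does not need (hypothetical names).
    Slimmed  = Table.RemoveColumns(RawTable, {"debug_payload", "raw_json"}),

    // Promote the date column to datetime so it can be compared with RangeStart/RangeEnd.
    Typed    = Table.TransformColumnTypes(Slimmed, {{"EventDate", type datetime}}),

    // The only date filter left: the incremental-refresh window. No manual 30-day filter.
    Windowed = Table.SelectRows(Typed, each [EventDate] >= RangeStart and [EventDate] < RangeEnd)
in
    Windowed
```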
Hello @lbendlin, back to this issue.
So, I was able to successfully load the dataflow and have it run on a 90-day refresh.
Now, coming to connecting this dataflow to a Power BI report: import does not work for a dataset that large, and for DirectQuery to the dataflow I turned on the enhanced compute engine, but it still wouldn't give me the option to use DirectQuery when connecting.
Please help with how I can use this dataflow now.
Thanks.
Alternatively, at this point I am open to exploring a data model option, but I suspect the dataset is too huge even for that.
@lbendlin @parry2k @SaiTejaTalasila
Please explain "import does not work" - are you getting an error message?
Direct Query on a dataflow = datamart = lipstick on a pig (Azure SQL db being the lipstick and the dataflow being the pig). Possible, sure, but a travesty nevertheless.
You can consider switching the incremental refresh to the dataset instead and removing the dataflow from the equation completely.
You can consider dataflow Gen2 and store the results as Delta Lake in Fabric.
So, I think the dataflow may work. I connected the Power BI report to the dataflow and had it imported and refreshed; the .pbix sits at about 4.5 GB.
However, the only issue now is that the data flowing in is not the last 90 days.
Ideally it should run from 05/14/2024 to 08/12/2024, and that's exactly how it is in the source as well as the dataflow. But the moment I bring it into the report, the data starts at 05/01/2024 and ends at 07/08/2024.
I suspect it's because I kept the incremental refresh at 3 months and 1 day and should switch to 90 days and 1 day.
I can't think of any other reasons why the data is falling short.
There are no other filters applied on the report, the dataflow has the auto-applied parameters for incremental refresh, and the source has data for the last 90 days.
uncheck the "only complete months" box.
Not sure I follow. Can you please tell me the exact step?
Also, when I look at the dataflow in the query editor, it loads the right amount of data.
That looks ok. How many partitions do you see?
I don't know of a direct way to find that, but looking at the refresh history, it looks like it ran every day for about 10-20 seconds; attaching the details from the last run.
Do you think the data isn't even making its way into the dataflow?
Not sure if I mentioned this before, but the source data runs on a 2-day delay (including today), meaning today I only have data in that table up to 08/12/2024.
And since I kept the incremental refresh at "refresh rows from the past 1 day", it will not get any data, because that data just does not exist yet.