Hello,
I am new to this community and I have recently started learning PowerBI Desktop & Service. I am struggling right now because I have to download 3 static, archived, heavy .csv datasets covering the last 13 years. These three should be completed with their respective 3 API calls, which deliver new data each day, so this new data has to be appended to each static dataset. My problem is that I now have too many datasets:
- 3 static datasets, 2010-2023 (mid-December)
- 3 queries/functions with API calls to get the updated data for the last day, which has to be appended to the 3 static datasets 2010-2023 (mid-December)
- 3 new appended datasets (appended as new) built from the 3 static datasets + the 3 API-call queries
So now I have 9 queries just to get the 3 final datasets which should be used in my model.
Besides these 9 queries, I still have 11 datasets which come from API calls each time.
1) Is this a good approach, or is there a simpler way to reduce all those queries in a dataflow?
2) Also, each refresh seems to download the heavy static files again. I need it to refresh only the new data for the last day, but I can't find an option to do that. I turned on the 'enhanced compute engine' and incremental refresh, but it made no difference.
3) Also, how can we put that into Datamarts?
Thank you a lot for your help.
Hi @zhanna27,
First of all, if you are using a Dataset and have all queries built in PowerBI Desktop, then this is not the optimal solution from a performance point of view, as it works much slower than Dataflows. In one of my first projects with large data sources, I created the queries in PowerBI Desktop and published them all to the Dataset, and the refresh took 1-2 hours. Then I transferred these queries to a Dataflow and the refresh took 5-10 minutes, so this is how I design all my PowerBI solutions now.
Secondly, I think incremental refresh may not work for your needs, which you already suggested it doesn't.
I had to build something similar: I had historical data in .csv files that didn't change, and I needed to append additional new data that was updated daily.
The best solution for me was, again, to use Dataflows in the PowerBI Service. You just need to design them in a specific way, so you don't waste time refreshing something that doesn't need to be refreshed.
So, to start with the archived data: create a new Dataflow where you load all the data which doesn't change. Depending on where you store these .csv files and how many of them you have, you can actually do one query plus a function to import all of them. But if you have just 3 files (based on your initial post), you can also load them in 3 queries and then append them in another query to get a single output table (unless the 3 you mentioned really are different datasets). For any query which is not required for further processing, make sure you disable data loading (right-click the query in the Dataflow and untick 'Enable load'). This makes the query 'hidden', so you can't see it when importing data from this Dataflow somewhere else, and it also speeds up the refresh.
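If it helps, here is a minimal sketch of what that final append query could look like in Power Query M. The query names Static2010_2023_A/B/C and the "Date" column are just placeholders for your own load-disabled queries and columns:

```
// Hypothetical final query in the 'archive' Dataflow (Dataflow #1).
// Static2010_2023_A/B/C are assumed to be three load-disabled queries
// that each read one .csv file and return the same column layout.
let
    // Stack the three static tables into one output table
    Combined = Table.Combine({Static2010_2023_A, Static2010_2023_B, Static2010_2023_C}),
    // Re-assert the key column types after the append ("Date" is a placeholder name)
    Typed = Table.TransformColumnTypes(Combined, {{"Date", type date}})
in
    Typed
```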
Next, save this new Dataflow and refresh. Now you have all the data loaded and ready for the next step.
Now, create a new Dataflow where you repeat the same process as in the previous step, but this time you load the live data which does change. Then save the Dataflow and set up a scheduled refresh, e.g. once a day or however many times a day you need it. This way, your archived data is just loaded into the first Dataflow with no scheduled refresh (you can still refresh it manually if the data ever changes), and the new data has a scheduled refresh which will update it for you automatically.
Now, the third step is to combine these two Dataflows. You can either create these queries in Dataflow #2, which is already refreshed on a schedule, or create a new Dataflow #3. Either way, this step involves creating queries that load the static data from Dataflow #1 and the dynamic data from Dataflow #2 and appending the two. If you choose to create Dataflow #3, then you also set up its scheduled refresh; however, this must be at least 1 hr after the scheduled refresh of Dataflow #2, to give Dataflow #2 time for a full refresh and load of the latest data. So it might be easier to just add these queries to Dataflow #2.
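A minimal sketch of that append step, assuming the linked/loaded table from Dataflow #1 is called ArchiveData and the API query in Dataflow #2 is called DailyData (both names, and the key columns used for de-duplication, are placeholders):

```
// Hypothetical final output query in Dataflow #2 (or #3).
let
    // Stack the archived history and the freshly loaded daily rows
    Appended = Table.Combine({ArchiveData, DailyData}),
    // Optional: drop rows that happen to exist in both sources
    // ("Date" and "Id" are placeholder key columns)
    Deduplicated = Table.Distinct(Appended, {"Date", "Id"})
in
    Deduplicated
```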
Now you have all the data processed, so you move to PowerBI Desktop, where you create the queries that load the data from the final output query in Dataflow #2 or #3. All the transformations should already be done in the Dataflow, so here you just load the data and do not add any steps.
When you publish your report to the PowerBI Service, set up the scheduled refresh of the report's Dataset (Semantic model in the latest PowerBI) and make sure that this is, again, at least 1 hr after the refresh of the final queries (Dataflow #2 or #3).
Hi @Anonymous, thank you so much for your detailed feedback. Yes, the problem you've described is exactly mine. It's a good idea to split those dataflows in order to speed up the loading of the dynamic data and leave the static data aside. You gave great tips, for which I thank you ;-). My problem is now different. My dynamic data, which is supposed to update the archived dataset every day, only shows the data uploaded for the current day. The next day this data is lost; it never actually gets saved into a dataset. And because we call a dataflow which itself gets fixed data from my desktop, the dataset is not saving the updated information on my desktop. A Datamart doesn't seem to keep this data either. Do you have any suggestions? Also, the following step would be to merge all those datasets into one so that I can apply RLS safely and be sure that only the necessary information is seen by another user of this datamart. But the merge process is not possible with all these big datasets. Thank you a lot.
Hi @zhanna27
I am glad you find my tips helpful.
Now, to your new issue. I am not sure if I understand it correctly - I think what you are saying is that the dynamic data you receive in a .csv file contains only data for the current day. But what happens to the previous files from yesterday, the day before, etc.? Is the file replaced with a new file every day, so you don't keep the previous days in separate files?

If you don't keep the files from the previous days, then you can't really achieve this in PowerBI: whenever the Dataflow is refreshed, it will only use the files you currently have, i.e. it can't get data which is no longer in the data source. So, in this case, you first need to sort out the files so the previous days are kept in a folder and you only add a new file to the folder every day. This way, you keep all the 'dynamic' data files in one folder, and when the Dataflow is refreshed it can read data from all files within the folder.

Depending on how you get the new file every day, there are different ways to sort this out. In my case, I receive the new file via email every day, then a simple PowerAutomate flow takes the attachment from the email and saves it in a SharePoint folder for me (if the file name is the same every day, you can add the date to the end of the file name when saving to the folder). This way, the process doesn't involve any manual handling of the data; it is done automatically even if I am away. Once the file is uploaded, the scheduled Dataflow is refreshed and the report is updated with the latest data. So, the key is to keep all the files for the 'dynamic' data in a single folder. Then, every month or couple of months, you can move the files into the 'archived' data folder (remembering to refresh the archive Dataflow to pick them up), and the 'dynamic' data folder is empty again so your daily refreshed Dataflow stays nice and quick.
One thing you mentioned is that you have the data on your desktop. Although it may work, I would strongly recommend moving the data into SharePoint (I hope you can create your own site within your company or be given access to an existing SharePoint site). You would need two folders - one for the archived data files and one for the dynamic data files - so the Dataflows can load all the files from these folders. Which brings me to the next point: how do you load the data from the files? Is it a simple load from Excel, or do you use the load from folder (SharePoint Folder) connector? The best method in your case is to load from a SharePoint Folder: you provide the link to the main SharePoint site, filter down to the folder you want, and once you see all the files within the folder, the easiest approach is to create a simple function to extract the data from each file and then combine everything into one single table. Also, I always use the formula Excel.Workbook([Content]) in a new column to get a list of all elements within each workbook; then you can filter to the item you want to extract, e.g. Sheet1 or Table_name, apply a function to extract/transform the data, and finally expand the output into a single table.
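A minimal sketch of that pattern, assuming the files live in a hypothetical /DynamicData/ folder on a made-up SharePoint site and that you want Sheet1 from each workbook (for .csv files you would use Csv.Document([Content]) instead of Excel.Workbook):

```
// Hypothetical sketch of loading all daily files from a SharePoint folder.
// The site URL, folder name and sheet name are placeholders.
let
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/MyTeamSite", [ApiVersion = 15]),
    // Keep only the files in the 'dynamic' data folder
    Filtered = Table.SelectRows(Source, each Text.Contains([Folder Path], "/DynamicData/")),
    // Parse each workbook; for .csv files use Csv.Document([Content]) instead
    Parsed = Table.AddColumn(Filtered, "Workbook", each Excel.Workbook([Content])),
    Expanded = Table.ExpandTableColumn(Parsed, "Workbook", {"Name", "Data"}, {"Item", "Data"}),
    // Keep only the sheet/table you actually need from each file
    Sheet1Only = Table.SelectRows(Expanded, each [Item] = "Sheet1"),
    // Stack the per-file tables into one output table
    Combined = Table.Combine(Sheet1Only[Data])
in
    Combined
```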
Dear @Anonymous,
thank you for your quick reply. I appreciate your tips. Again, the solution you propose - storing the dynamic data as files in my SharePoint folder - is a good one. Indeed, I haven't stored this updated data anywhere, because I thought it could be appended directly to a dataflow dataset, and then the updated dataflow dataset could be used by the datamart directly and saved in the datamart each day (then the next day, it would load the archived data, delete the already present archived rows with a 'remove duplicates' step, and add just the new date). I understand now that a dataflow doesn't save the data, but I thought datamarts did. So my update function below is supposed to update and append new data to the archived dataset each day, but only in dataflows:
Hi @zhanna27, ok, sorry, I missed the part in the original post where you said that the dynamic data is loaded from an API rather than CSV files. So, is it an API call to a database or something else? Can you query more data through the API than just a single day? I can see in your function that you are querying the data based on two dates, so can't you just query the data from where the static data ends (mid-Dec 23)? If there is too much data for the API to handle in one go, you could probably split it into multiple queries and load, e.g., a month at a time. Also, if the data is coming from a database, can't you just use the DB directly (e.g. the SQL connector in a Dataflow, so you can use a normal SQL query to import all the data you need)? This would probably be the simplest solution, as an SQL query should handle millions of rows of data, so you could probably do it in a single query.
I assume that, from time to time, the data which you are currently loading dynamically is dumped to the .csv files (e.g. every year?), in which case you would just need to amend the date range of the dynamic data query.
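A rough sketch of that idea in M, with a completely hypothetical endpoint, parameter names and JSON shape (replace them with whatever your real API function uses); it calls the API once per month from where the static data ends and stacks the results:

```
// Hypothetical month-by-month API load starting where the static data ends.
let
    // One call for a given start/end date, returned as a table
    GetRange = (startDate as date, endDate as date) as table =>
        let
            Response = Web.Contents(
                "https://api.example.com/data",   // placeholder endpoint
                [Query = [from = Date.ToText(startDate, "yyyy-MM-dd"),
                          to   = Date.ToText(endDate, "yyyy-MM-dd")]]),
            Json = Json.Document(Response),
            // Assumes the API returns a JSON array of flat records
            AsTable = Table.FromRecords(Json)
        in
            AsTable,

    // Month starts from December 2023 up to the current month
    Today = Date.From(DateTime.LocalNow()),
    MonthCount = (Date.Year(Today) - 2023) * 12 + Date.Month(Today) - 11,
    MonthStarts = List.Transform({0 .. MonthCount - 1}, each Date.AddMonths(#date(2023, 12, 1), _)),

    // One API call per month, then stack everything into a single table
    Chunks = List.Transform(MonthStarts, each GetRange(_, Date.AddDays(Date.AddMonths(_, 1), -1))),
    Combined = Table.Combine(Chunks)
in
    Combined
```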
As for the storage options, I don't think there is anything like that in PowerBI itself. I am not sure why specifically you are using Datamarts. In my experience, Dataflows load data faster, and I don't find Datamarts very useful apart from one single feature: they give you an SQL endpoint so you can query the data in the Datamart.
You could use a database to store the data, but first you would still need to extract the data from your API. There are a couple of ways you could automate this, depending on what tools you have access to. One example would be PowerAutomate, which can run on a schedule every day, export the data from the API and either upload it to a database, Datamart or SharePoint list, or just save it as a .csv or .xlsx file from which you can load the data into your 'dynamic' dataflow. However, there are limits on the amount of data you can handle with PowerAutomate, so if you have millions of rows every day this may not work. Another, more advanced solution would be Azure Data Factory (if you have access to Azure). In ADF, you can create a pipeline which will run every day, export the data from the API and load it into, for example, an Azure SQL database.
Hi @Anonymous,
thank you for your suggestions and your feedback. I don't have any storage/database for my current data (archived .csv + daily API updates) except my desktop, which is connected to OneDrive and SharePoint with my company account, or temporary storage in the PowerBI Service. One person suggested that I save the static dataset into the semantic model of a datamart and update this dataset daily with new data, which would be appended and would not replace the earlier updated data. I don't know how to do that, and I am not sure that a datamart can store a document and just append and save the new information, because each new day it would take the same old source and add the new data. I was asked to find a solution that stores the data in a semantic model in the PowerBI Service rather than saving it on my computer, etc. I have read about PowerAutomate and I suppose it could work, because I don't have a lot of data on a daily basis (7K rows), but if a datamart/semantic model can be refreshed each day and append the new data to a final dataset, then we don't need PowerAutomate or an ADF pipeline. What do you think?
Thank you for your feedback. Unfortunately the merge with the huge datasets is not working ;-(. I will have to find another solution.
1) Give Incremental Refresh another try. It is designed to handle your scenario.
2) ditto
3) Datamarts are on the way out. If you really must, use dataflows (but even those don't seem to be needed in your scenario)
Hi Ibendlin,
Happy New Year, and thank you for your reply. I have only just seen it. I have to manage this dataset in the PowerBI Service (Dataflow or Datamart) in order to share it with others, and RLS will be needed later. I just have problems merging the huge datasets. I have several tables, one with 15M rows and others around 2M. I need to merge them so that we can apply RLS easily and reliably. Maybe you can suggest another way? I just cannot merge because of the heavy datasets; I need to do an outer join but it fails. When you say that neither dataflows nor datamarts are required in my scenario, what do you mean? What tool should I use in this case? Also, incremental refresh requires a specific date for saving data, and I am afraid that some data will not be refreshed and may be lost. I have a function which makes an API call to get the data for day-2 (for example, on the 4th of January it gets the data for the 2nd of January), so I don't understand what I should specify in the dataflow for incremental refresh. Thank you for your help.
I would propose you use datasets. Yes, incremental refresh uses date ranges, but you can fake these by mapping years to the 'first day of the year', for example.
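For reference, the pattern incremental refresh expects is a filter on two DateTime parameters named RangeStart and RangeEnd; a minimal sketch, assuming a hypothetical source query called DailyApiData with a Date column:

```
// Minimal sketch of the filter step incremental refresh relies on.
// RangeStart and RangeEnd must exist as DateTime parameters; Power BI
// substitutes the partition boundaries into them at refresh time.
let
    Source = DailyApiData,   // placeholder for your real source query
    // Keep only the rows that fall inside the current partition's window
    Filtered = Table.SelectRows(
        Source,
        each DateTime.From([Date]) >= RangeStart and DateTime.From([Date]) < RangeEnd)
in
    Filtered
```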
Power BI is a reporting platform; it is not a data store. You will want to store your data further upstream (for example, in that Azure MySQL database you mentioned). The most appropriate architecture depends on your business scenario and your growth projections (scalability).
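Once the data lives upstream, the Power Query side becomes a single, simple load; a sketch assuming a placeholder Azure SQL server, database and table (the same idea applies to MySQL via the MySQL.Database connector):

```
// Hypothetical load from an upstream Azure SQL database.
// Server, database, table and column names are placeholders.
let
    Source = Sql.Database(
        "myserver.database.windows.net",
        "EnergyData",
        [Query = "SELECT * FROM dbo.DailyMeasurements WHERE MeasureDate >= '2010-01-01'"]),
    // Re-assert the date type on the placeholder key column
    Typed = Table.TransformColumnTypes(Source, {{"MeasureDate", type date}})
in
    Typed
```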