ceindev
New Member

Architecture question with external parquet files

Hello,

 

I am looking for advice on which Power BI components to use, given the following:

 

We have Spark jobs running in our own compute that process data. The Spark jobs write the processed dataframes to Azure Data Lake Gen2 storage, segmented into a different storage account + container for each project.

 

For example,

project1 -> Spark job writes dataframe1, dataframe2, etc. to storage account 1, container = project1

project2 -> Spark job writes dataframe1, dataframe2, etc. to storage account 2, container = project2
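
For illustration, a minimal PySpark sketch of that write pattern (the storage account, container, and folder names here are hypothetical, and ADLS Gen2 auth configuration is omitted):

# Minimal sketch: each project's Spark job writes its processed dataframes as
# parquet into that project's own storage account + container (hypothetical
# names; ADLS Gen2 credentials/config omitted).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("project1-job").getOrCreate()

df1 = spark.read.parquet("path/to/raw/input")  # stand-in for the real processing

# abfss://<container>@<storage account>.dfs.core.windows.net/<folder>
df1.write.mode("overwrite").parquet(
    "abfss://project1@storageaccount1.dfs.core.windows.net/dataframe1"
)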

 

We have a set of Power BI reports that share a semantic model. What I want to do is as follows:

 

Using the Power BI REST API...

for each project:

    provision workspace with name = project name

    import our master set of Power BI files into the workspace

    update the semantic model to point to the project's storage account + container's parquet files
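
A hedged Python sketch of that loop against the documented Groups and Imports REST endpoints (token acquisition omitted; names and paths are placeholders):

# Hedged sketch of the provisioning loop (names, paths, and the token are
# placeholders). Uses documented endpoints: Groups - Create Group and
# Imports - Post Import In Group.
import requests

BASE = "https://api.powerbi.com/v1.0/myorg"
access_token = "<AAD access token with Power BI API scopes>"
headers = {"Authorization": f"Bearer {access_token}"}

for project in ["project1", "project2"]:
    # Provision a workspace named after the project.
    resp = requests.post(f"{BASE}/groups", headers=headers, json={"name": project})
    resp.raise_for_status()
    group_id = resp.json()["id"]

    # Import the master set of Power BI files (.pbix) into the workspace.
    with open("master.pbix", "rb") as f:
        resp = requests.post(
            f"{BASE}/groups/{group_id}/imports",
            headers=headers,
            params={"datasetDisplayName": "master.pbix"},
            files={"file": f},
        )
    resp.raise_for_status()

    # Repointing the imported semantic model at this project's storage is the
    # open question; see the UpdateParameters sketch later in the thread.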

 

 

We have most of this working, with the exception of updating the semantic model. Here are my questions:

1) Given that the source data is parquet and that these files can be large, what is the best Power BI tech to use here? I see mentions of using dataflows, but I also see that lakehouses add query performance options.

 

2) Is there a clean way to script setting the data source to the appropriate storage location, as outlined above?

 

 

 

 

5 REPLIES
lbendlin
Super User

1)  You already have Parquet.  Consume as is.  No need to convert into anything else.

Thank you for this information.  Can you expand on this?

 

- What is the proper place to define the data source definitions? Is it a dataset, semantic model, etc.?

 

- This is M code I wrote to handle Spark multi-partitioned parquet files (is there a built-in way already?). How can I incorporate this into the Power BI REST API, and which API endpoint do I use (datasource, dataset, dataflow, etc.)?

 

let
    // List the files in the project's ADLS Gen2 container
    Source = AzureStorage.DataLake("https://<storage account>.dfs.core.windows.net/<container for project>"),
    // Keep only the parquet part-files
    #"Filtered rows" = Table.SelectRows(Source, each ([Extension] = ".parquet")),
    // Select the row whose Folder Path matches and take its file content
    Navigation = #"Filtered rows"{[#"Folder Path" = "xxx"]}[Content],
    // Read the parquet binary into a table
    #"Imported Parquet" = Parquet.Document(Navigation)
in
    #"Imported Parquet"

 

"dataset"  is the legacy name for "semantic model". That's the place to define your relationships. Data Source definitions happen before that, in the Power Query phase.

 

Please elaborate on your "incorporate this into the Power BI REST API" comment: what are you trying to accomplish?

Thank you for the additional information. Now I understand that I should update the dataset to point to the different project data that exists in different Azure Data Lake Gen2 storage accounts/containers.

 

So the step I would appreciate more details on is how to do this using the REST API:

 

update the semantic model to point to the project's storage account + container's parquet files

 

 

This documentation appears to say that the proper call is Update Datasources In Group:

https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/update-datasources-in-group

 

Are there any examples of how this should look in order to change a table using the M query I previously posted?
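
For what it's worth, a common pattern (not confirmed in this thread) is to replace the hardcoded URL in the M query with a dataset parameter and drive it through Datasets - Update Parameters In Group instead. A hedged Python sketch, assuming a model parameter named StorageUrl (hypothetical) and placeholder IDs:

# Hedged sketch: repoint a semantic model by updating an M parameter, then
# refresh so the change takes effect in the data. Assumes the model's query
# reads its source URL from a parameter named "StorageUrl" (hypothetical)
# rather than a hardcoded literal.
import requests

BASE = "https://api.powerbi.com/v1.0/myorg"
access_token = "<AAD access token>"
group_id = "<workspace id>"
dataset_id = "<semantic model id>"
headers = {"Authorization": f"Bearer {access_token}"}

body = {
    "updateDetails": [
        {
            "name": "StorageUrl",
            "newValue": "https://storageaccount2.dfs.core.windows.net/project2",
        }
    ]
}
resp = requests.post(
    f"{BASE}/groups/{group_id}/datasets/{dataset_id}/Default.UpdateParameters",
    headers=headers,
    json=body,
)
resp.raise_for_status()

# Parameter values are only picked up on the next refresh
# (Datasets - Refresh Dataset In Group).
resp = requests.post(
    f"{BASE}/groups/{group_id}/datasets/{dataset_id}/refreshes",
    headers=headers,
)
resp.raise_for_status()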

 

Lastly, I just want to confirm that using parquet files (which may be large) will not cause performance issues. I have seen other posts and documentation suggesting other Power BI components (adding compute, a lakehouse, etc.).

I recommend you watch this space:  Optimizations — Delta Lake Documentation

 

Z-Ordering seems to be the latest craze.
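
If the jobs ever move from plain parquet to Delta tables, the optimize step might look like this (a hedged sketch, reusing the SparkSession from the earlier snippet; requires a Delta Lake runtime that supports OPTIMIZE/ZORDER, and the path and column are hypothetical):

# Hedged sketch: compact a Delta table and Z-order it by a commonly filtered
# column. Only applies if the data is written as Delta rather than plain
# parquet; requires a runtime supporting OPTIMIZE (e.g. Databricks or a
# recent OSS Delta Lake).
spark.sql("""
    OPTIMIZE delta.`abfss://project1@storageaccount1.dfs.core.windows.net/dataframe1`
    ZORDER BY (some_filter_column)
""")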
