FabioBeka
Frequent Visitor

How can I reference a (stochastic) query output only without re-executing the original query?

I have a huge problem and I can't seem to find a solution.

 

I'm implementing some machine learning algorithms through a Python script inside a Power Query query in order to analyze a CSV of data.

I need to chain the output (alone) of this query (the original query) to another query (the derived query) for further analysis, and I need both the original and derived queries as tables in a report.

 

The problem is that the original query's script uses some stochastic functions (random sampling and stochastic ML models), so the results of the query change at every execution and are never the same.

 

When I try to reference another query on the result of the script, the two queries don't have the same results. This is probably because the derived query doesn't simply copy the values of the original query but re-executes it, and that prevents me from using them both inside a report, since they present two different sets of results.

 

Is there a way to reference only the output of a query, or to treat a query as a view or static table, and avoid re-executing the whole set of instructions and scripts inside the original query?

 

An easy example that anybody can replicate: here I generate a random dataframe with this Python code:

 

import numpy as np
import pandas as pd

# 3x3 frame of random integers in [0, 10) - different on every execution
X = pd.DataFrame(np.random.randint(0, 10, (3, 3)))

 

As you can see, if I reference the scripted query, the derived table is different because the reference re-executes the script instead of simply copying the values.
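(As an aside, if exact reproducibility is acceptable for the use case, the nondeterminism itself can be removed by seeding the generator, so re-execution becomes harmless. A minimal sketch of the same repro with NumPy's `default_rng`:)

```python
import numpy as np
import pandas as pd

# Sketch: a fixed seed makes the "random" frame identical on every
# execution, so re-running the script yields the same table.
X = pd.DataFrame(np.random.default_rng(42).integers(0, 10, (3, 3)))
X_again = pd.DataFrame(np.random.default_rng(42).integers(0, 10, (3, 3)))

assert X.equals(X_again)  # same values on every run
```

This doesn't help when the point is to draw a fresh sample, but for debugging the original/derived mismatch it makes both queries agree.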

FabioBeka_0-1608298169812.png

 

 

FabioBeka_1-1608298262147.png

 

10 REPLIES
AlB
Super User

@PhilipTreacy  @FabioBeka 

Yesterday I saw that if we use @PhilipTreacy's solution with the [dataset= .... ] argument in Python.Execute, it stops working and the generated numbers are new in every step within the same query. Obviously we don't need that argument here, but I was rather surprised at the big change. Any thoughts?

let
    Initial = #table({"Col1"},{{1}}),
    Source = Python.Execute("import numpy as np#(lf)import pandas as pd#(lf)#(lf)X=pd.DataFrame(np.random.randint(0,10,(3,3)))", [dataset=#"Initial"]),
    X1 = Source{[Name="X"]}[Value],
    #"Changed Type" = Table.TransformColumnTypes(X1,{{"0", Int64.Type}, {"1", Int64.Type}, {"2", Int64.Type}}),
    Py = Table.FromColumns(Table.ToColumns(#"Changed Type"))
in
    Py

 

Please mark the question solved when done and consider giving a thumbs up if posts are helpful.

Contact me privately for support with any larger-scale BI needs, tutoring, etc.

Cheers 

 


FabioBeka
Frequent Visitor

I've been using Power BI for only 3 weeks, so I'm far from being an expert, but my guess is that when we load data through Power Query it probably has some built-in system that updates the source, whereas when we do it from Python that goes out of its reach.

 

Because at this point the only reason I can see for re-executing the query is to reload the data source; maybe with Python we've bypassed that.

Still, in the first example I generated the data inside the script, and it re-executed the whole thing anyway.

PhilipTreacy
Super User

Hi @FabioBeka 

So reading the csv in with Python and then using this fixes the issue?

Py = Table.FromColumns(Table.ToColumns(#"Changed Type"))

Regards

Phil



Did I answer your question? Then please mark my post as the solution.
If I helped you, click on the Thumbs Up to give Kudos.


Blog :: YouTube Channel :: Connect on Linkedin


Proud to be a Super User!


If I read the csv inside the Python script, it actually works even without the final step:

 

 

    Py = Table.FromColumns(Table.ToColumns(#"Changed Type"))

 

 

Your example, though, made me think that the problem was probably in the dataset argument of the script:

 

 

let
    Source = "",
    #"Run Python script" = Python.Execute("# 'dataset' holds the input data for this script#(lf)import numpy as np#(lf)import pandas as pd#(lf)#(lf)df=pd.read_csv(r""C:\...\file.csv"",sep=';')#(lf)X=df.sample(n=100)"),
    X = #"Run Python script"{[Name="X"]}[Value],
    #"Changed Type" = Table.TransformColumnTypes(X,{{"CustomerID", type text}, {"Assignments", Int64.Type}, {"Recency", Int64.Type}, {"Frequency", Int64.Type}, {"Monetary", Int64.Type}})
in
    #"Changed Type"

 

 

 

With this code I can apparently reference the same random sample from different secondary queries.

 

FabioBeka_0-1608629604413.png

FabioBeka_1-1608629664528.png

 

PhilipTreacy
Super User

Hi @AlB 

Not sure, maybe it's to do with lazy evaluation? After the copy of the Python output table is created and used as the query result, PBI/PQ figures it doesn't need to refer to any previous steps to provide that data, so the Python code isn't re-executed.

Cheers

Phil





AlB
Super User

@PhilipTreacy 

That's interesting. Any idea why it works doing it like that?

 


 

PhilipTreacy
Super User

Hi @FabioBeka 

This worked for me. Create a copy of the Python results table and use that copy as the query result.

 

let
    Source = Python.Execute("import numpy as np#(lf)import pandas as pd#(lf)#(lf)X=pd.DataFrame(np.random.randint(0,10,(3,3)))"),
    X1 = Source{[Name="X"]}[Value],
    #"Changed Type" = Table.TransformColumnTypes(X1,{{"0", Int64.Type}, {"1", Int64.Type}, {"2", Int64.Type}}),
    Py = Table.FromColumns(Table.ToColumns(#"Changed Type"))
in
    Py

 

You can then reference the output of this query without re-executing the Python script.

Regards

Phil




Thank you very much, this partially solved the problem; it does indeed work in this example case.

 

But then I tried to apply it to the real case, in which I load a csv of data through Power Query, so the Source step reads a csv and the Python script step has a "[dataset=Source]" argument. In this case the problem persists even if I uncheck the "Enable load" option: it re-executes the original query, and all the stochastic functions give different results.


Source = Csv.Document(File.Contents("C:\....\file.csv"),[Delimiter=",", Columns=9, Encoding=1252, QuoteStyle=QuoteStyle.None]),

...

#"Run Python script" = Python.Execute("# 'dataset' holds the input data for this script#(lf)#(lf)import pandas as pd#(lf)import numpy as np#(lf)#(lf)DF=dataset.sample(n=100)",[dataset=#"Changed Type"]),

 

 

But I also noticed that I can bypass this by reading the csv directly inside the Python script (pd.read_csv()) instead of setting it as the source through Power Query, leaving the Source blank (Source="") and having no [dataset=Source] argument in the Python script.

 

Source = "",
#"Run Python script" = Python.Execute("# 'dataset' holds the input data for this script#(lf)import numpy as np#(lf)import pandas as pd#(lf)#(lf)df=pd.read_csv(r""C:\...\file.csv"",sep=';')#(lf)X=df.sample(n=100)"),
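(For reference, the in-script sampling itself can also be made deterministic in plain pandas, which removes the symptom even when the script is re-executed. A sketch with stand-in CSV content — the real file's columns will differ — where `random_state` is the parameter that pins the sample:)

```python
import io

import pandas as pd

# Stand-in for the real CSV (hypothetical columns and content).
csv_text = "CustomerID;Monetary\n" + "\n".join(f"c{i};{i}" for i in range(500))
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# A fixed random_state pins the sample: even if the script runs again,
# the same 100 rows come back, so derived queries stay in sync.
X = df.sample(n=100, random_state=0)
assert X.equals(df.sample(n=100, random_state=0))
```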

 

mahoneypat
Employee
Employee

I haven't tried this suggestion, but here is one other configuration to explore. You could generate both of your tables within the same query (i.e., as two separate variables, then in the last step hold them in Table form in separate columns or rows). Disable load on that table, but then reference it in two new queries where you can separately expand each table. If you uncheck parallel loading and refresh all, you may get your desired result.

 

Regards,

Pat





Did I answer your question? Mark my post as a solution! Kudos are also appreciated!

To learn more about Power BI, follow me on Twitter or subscribe on YouTube.


@mahoneypa HoosierBI on YouTube


AlB
Super User

Hi @FabioBeka 

Interesting. I'm afraid it won't work with the reference to the original query. The Python script will get executed again. Check this out: https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evalu...

You could:

1. Execute the Python script outside of PQ, instead of having PQ invoke it directly, and just read the static values into PQ; or

2.  Work with both tables (and do all the processing) in the same query. Something like:

let
    Source = "",
    #"Run Python script" = Python.Execute("import numpy as np#(lf)import pandas as pd#(lf)#(lf)X=pd.DataFrame(np.random.randint(0,10,(3,3)))#(lf)"),
    #"Expanded Value" = Table.ExpandTableColumn(#"Run Python script", "Value", {"0", "1", "2"}, {"Col0", "Col1", "Col2"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Value",{"Name"}),
    Table1_ = #"Removed Columns",
    Table2_ = Table1_ 
in
    Table2_

You can work on Table1_ and Table2_ from the last step on, within the same query.
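(Option 1 above can be sketched as a small standalone script; the output filename is hypothetical. The stochastic step runs once outside Power Query, and Power Query then reads the file as an ordinary static CSV source:)

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Run the stochastic step once, outside Power Query, and persist the
# result. Power Query then treats the file as a plain, static CSV.
X = pd.DataFrame(np.random.randint(0, 10, (3, 3)))
out_path = os.path.join(tempfile.gettempdir(), "ml_output.csv")  # hypothetical path
X.to_csv(out_path, index=False)

# Every query reading this file sees the same frozen values.
frozen = pd.read_csv(out_path)
assert (frozen.values == X.values).all()
```

Any number of queries can then reference the frozen file without ever triggering the script.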

Perhaps these knowledgeable people have other ideas @ImkeF  @edhans  @Jimmy801  @mahoneypat @PhilipTreacy 

 


 

 
