Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.

Reply
Luis_RBDO
Frequent Visitor

Connect to file with dynamic link embbed in a webpage

Hello everyone,

 

I have a complex challenge to solve.

 

I need to extract a list of companies, available in a file linked in public agency webpage.

Instead of manually downloading the file from the website and importing to Power BI, I was looking for a way to automatically retrieve it from the webpage using power query.

This was possible by right clicking the file link in the webpage, copying the address and connecting to it using the power query web connector.

 

The problem is that, when this agency updates the file once a year, the file address changes. They add the update date to the link, example https://www.agency_webpage.pt/list-of-companies_22_02_2024.pdf. They also don't update it on the same day and month each year, so next year the link could be https://www.agency_webpage.pt/list-of-companies_18_01_2025.pdf, making it impossible to dynamically change the date on the file address using the web connector parameter [RelativePath], like:

 

Source = Pdf.Tables(Web.Contents("https://www.agency_webpage.pt/list-of-companies_22_02_",
[RelativePath = Number.ToText(Date.Year(DateTime.Date(DateTime.LocalNow())))]),
[Implementation="1.3"])

 

I created a query to scrap the file link from the webpage html code (fetching it from a href line code) using the Web.BrowserContents connector, that returns a string with the file link, #"Fetch File Address Query". Then I created a second query to extract the file, using the Web.Contents connector, with the agency website address as the first parameter and referenced the #"Fetch File Address Query" query in the parameter [RelativePath], like so:

Source = Pdf.Tables(Web.Contents("https://www.agency_webpage.pt/",[RelativePath = #"Fetch File Address Query"]), [Implementation="1.3"]),

 

This solution returned the error: "Formula.Firewall: Query references other queries, so it may not directly access a data source.".

 

I've tried changing the #"Fetch File Address Query"  to a function and this way it worked on the Desktop. I'm able to apply and manually refresh the report in Power BI desktop. But after publishing to Power BI Service, it's impossible to manually refresh or schedule refresh in the service because it returns an error "Scheduled refresh is disabled because at least one data source is missing credentials." and the option to manage the credentials is greyed out:

Luis_RBDO_1-1709727163953.png

 

After extensive research, I didn't find any other way to create a query that could connect to a file with a dynamic address, located in a webpage, that worked properly in power bi service.

 

Has anyone had a similar situation and was able to find a solution?

 

8 REPLIES 8
lbendlin
Super User
Super User

That's a very common error pattern.  your RelativePath must include the 

list-of-companies_22_02_

bit, or whatever the new name is. If you want to supply the Relative Path dynamically then that must happen inside the same query partition.

 

 

I've created a function to fetch from the website and return as a string the bit that changes every year . When I call the function on the Relative Path, it works on Power BI Desktop but not on the service. The "Data source credentials" section greys out, making it impossible to configure.

You are trying to automate something that changes once a year?

It's not necessary in this case, it's a challenge to test what is possible to do with querying web sources to use in  similar situations in the future.

Make sure that the base URL returns valid HTML, or skip the connection test.

Thank you for your help @lbendlin but unfortunately it seems that this is something that it's not possible to do.

 

The base URL returns the webpage HTML source code, I'm not sure if this qualifies as valid but I have no issue scraping the file link on PBI desktop. It's not possible to check the box "Skip Connection Test" because the section "Data source credentials" is greyed out.


After further research, it seems that queries that use the "Web.BrowserContents" connector cannot be used for scheduled refresh. On the other hand, the connector "Web.Contents" works for other sources that I've implemented but it does not work in this case because it doesn't return the HTML webpage source code.

This issue as no solution.

Your original post did not mention Web.BrowserContents.  Can you not see the required data when using Web.Contents?

I mentioned it in the third paragraph of the original post.

Luis_RBDO_0-1710247379147.png

But it's easier to explain with real examples:

 

The link to the pdf file is provided in the the body of website: https://www.ani.pt/pt/financiamento/incentivos-fiscais/selo-id-reconhecimento-de-idoneidade-na-pr%C3...

 

Luis_RBDO_3-1710249315316.png

 

 

First, I scrap the website to obtain the file link with the following function, using the connector "Web.BrowserContents":

 

 

let
    File_link = () => let
    #"HTML Code" = Web.BrowserContents("https://www.ani.pt/pt/financiamento/incentivos-fiscais/selo-id-reconhecimento-de-idoneidade-na-pr%C3%A1tica-de-atividades-de-id/"),
    #"Split Text" = Text.Split(#"HTML Code", "href="""),
    #"Converted to Table" = Table.FromList(#"Split Text", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Filtered = Table.SelectRows(#"Converted to Table", each Text.Contains([Column1], "lista-entidades-reconhecidas")),
    #"Extracted Text Before Delimiter" = Table.TransformColumns(Filtered, {{"Column1", each Text.BeforeDelimiter(_, """"), type text}}),
    #"Clean Address" = Table.TransformColumns(#"Extracted Text Before Delimiter", {{"Column1", each Text.TrimStart(_,"/"), type text}}),
    Source = #"Clean Address"{0}[Column1]
    in
        Source
in
    File_link

 

 

Returning the dynamic component of the file link as a string: 

Luis_RBDO_2-1710248458407.png

 

Then, on a second query, I use the connector "Web.Contents" and supply it the static part of the link, "https://www.ani.pt/", and invoke the function in the relative path argument, to add the dynamic part:

Source = Pdf.Tables(Web.Contents("https://www.ani.pt/",[RelativePath = #"Fetch File Address Query"]), [Implementation="1.3"]),

 

I didn't manage to find a way to access the file link from the website body directly using "Web.Contents". But, if I copy/paste the file link and supply it as a static address it works and allows to schedule refresh. This solution is not dynamic and we will need to manually change the link every time there is a newer file published on the website, but since it's only once a year, it's perfectly acceptable.

Helpful resources

Announcements
Microsoft Fabric Learn Together

Microsoft Fabric Learn Together

Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors