Extract URLS from HTML code into table

Hi folks,

I'm hoping someone might be able to help. I have a bunch of SharePoint intranet pages, and I'm attempting to pull out all the URLs from the text, buttons, etc. into a table. The purpose will be to create a report that shows which pages have links in them, and where they point to.

 

I have the HTML code for each page in a report; now I need to create some rules in a DAX measure to pull out the URLs from each page. The challenge is that different page widgets format their HTML differently, so a number of different rules will be needed.

 

So the rules need to extract all of the text between the start and end points below and store the results in a table. The rules would be:

 

Rule | Capture string where the beginning of the text starts with | Stop capturing string when the first instance of the text below occurs
1 | href="" | ""
2 | "http | "
3 | href=""http | &quot
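As a rough illustration of what a single start/stop rule could look like in Power Query M, here is a minimal sketch. The helper name `CaptureBetween` and the sample string are hypothetical and are not taken from the linked file; the sketch assumes the start marker is `href="` and the stop marker is the next double quote.

```powerquery
let
    // Hypothetical helper: apply one start/stop rule to a block of HTML.
    // Returns the text after each occurrence of startText, cut at the first stopText.
    CaptureBetween = (html as text, startText as text, stopText as text) as list =>
        let
            // Every piece after the first begins immediately after a startText match
            pieces = List.Skip(Text.Split(html, startText), 1),
            // Keep only the part of each piece that precedes the stop marker
            captured = List.Transform(pieces, each Text.BeforeDelimiter(_, stopText))
        in
            captured,

    // Made-up sample, not taken from the linked file
    sample = "<a href=""https://www.example.com/a"">A</a> <a href=""https://www.example.com/b"">B</a>",

    // Start at href=" and stop at the next double quote
    result = CaptureBetween(sample, "href=""", """")
in
    result
```

Each further rule would be another call to the same helper with a different pair of markers, and the resulting lists could be combined with `List.Combine`.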

 

Here is some sample code:

https://1drv.ms/u/s!AvfpkO5b74akg6li3CyRiH-e-KE8Zg?e=GRcW4A 

 

This sample code is in a column in a table called 'Authoring Canvas Content'. There is also a column called 'Name' that contains the page URL. Here's a sample:

 

'Authoring Canvas Content' | Name
<code above> | https://intranet/sites/mysite/mypage.aspx

 

In terms of outputs, here's what I need to see:

Page | URL
https://intranet/sites/mysite/mypage.aspx | https://www.google1.com
https://intranet/sites/mysite/mypage.aspx | https://www.google1.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google1.6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google2.com
https://intranet/sites/mysite/mypage.aspx | https://www.google3.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.6.com

 

Hopefully that makes sense - any help would be really appreciated 🙂

1 ACCEPTED SOLUTION
lbendlin
Super User

Here is one version of an implementation

let
    // Load the sample file, one text line per row
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Users\xxx\Downloads\HTML_code_sample.txt"), null, null, 1252)}),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // Split on "https" so that each row after the first starts right after a URL's scheme
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Promoted Headers", {{"Authoring Canvas Content", Splitter.SplitTextByDelimiter("https", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Authoring Canvas Content"),
    // Turn everything that can end a URL (&quot;, space, <) into ">"
    #"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter","&quot;",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value"," ",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    #"Replaced Value4" = Table.ReplaceValue(#"Replaced Value1","<",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    // Keep only the text before the first ">"
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Replaced Value4", "Authoring Canvas Content", Splitter.SplitTextByEachDelimiter({">"}, QuoteStyle.Csv, false), {"Authoring Canvas Content.1", "Authoring Canvas Content.2"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Split Column by Delimiter1",{"Authoring Canvas Content.1"}),
    // Decode the colon entity and restore the "https" prefix removed by the split
    #"Replaced Value2" = Table.ReplaceValue(#"Removed Other Columns","&#58;",":",Replacer.ReplaceText,{"Authoring Canvas Content.1"}),
    #"Replaced Value3" = Table.ReplaceValue(#"Replaced Value2","://","https://",Replacer.ReplaceText,{"Authoring Canvas Content.1"}),
    // Drop the leading piece that precedes the first URL, then de-duplicate
    #"Removed Top Rows" = Table.Skip(#"Replaced Value3",1),
    #"Removed Duplicates" = Table.Distinct(#"Removed Top Rows")
in
    #"Removed Duplicates"
How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".
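If the page URL needs to stay next to each extracted link, as in the Page/URL output requested in the question, one possible variation is sketched below. This is only a sketch under assumptions: the inline `Pages` table is a made-up stand-in for the real data, `ExtractUrls` is a hypothetical helper name, and it swaps the column-splitting steps above for a per-row function that handles both `http` and `https` links.

```powerquery
let
    // Made-up stand-in for the real table; column names follow the original post
    Pages = #table(
        type table [Name = text, #"Authoring Canvas Content" = text],
        {{"https://intranet/sites/mysite/mypage.aspx",
          "<a href=""https://www.example.com/a"">A</a> &quot;https&#58;//www.example.com/b&quot;"}}
    ),

    // Hypothetical helper: list the distinct http/https URLs found in one HTML string
    ExtractUrls = (html as text) as list =>
        let
            // Decode the two entities that the query above also handles
            decoded = Text.Replace(Text.Replace(html, "&#58;", ":"), "&quot;", """"),
            // Each piece after the first starts right after an "http"
            pieces = List.Skip(Text.Split(decoded, "http"), 1),
            // Keep pieces that continue as a URL scheme
            candidates = List.Select(pieces, each Text.StartsWith(_, "://") or Text.StartsWith(_, "s://")),
            // Cut at the first quote, space or angle bracket, then restore the prefix
            urls = List.Transform(candidates, each "http" & Text.SplitAny(_, """ <>"){0})
        in
            List.Distinct(urls),

    // One list of URLs per page, then one row per URL
    WithUrls = Table.AddColumn(Pages, "URL", each ExtractUrls([Authoring Canvas Content])),
    Expanded = Table.ExpandListColumn(WithUrls, "URL"),
    Result = Table.RenameColumns(Table.SelectColumns(Expanded, {"Name", "URL"}), {{"Name", "Page"}})
in
    Result
```

To try it against real data, replace the `Pages` step with a reference to the query that holds the 'Authoring Canvas Content' and 'Name' columns. Other stop characters or entities that appear in the real HTML would need to be added to the helper.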

 


2 REPLIES

Apologies for the delay with my reply @lbendlin - this is great, thank you so much for your time 🙂
