Extract URLs from HTML code into a table

Hi folks,

I'm hoping someone might be able to help. I have a bunch of SharePoint intranet pages, and I'm attempting to pull out all the URLs from the text, buttons, etc. into a table. The purpose will be to create a report that shows which pages have links in them, and where they point to.

 

I have the HTML code for each page in a report, and now I need to create some rules in a DAX measure to pull out the URLs from each page. The challenge is that different page widgets structure their HTML differently, so a number of different rules will be needed.

 

So the rules need to extract all of the text between the start and end points below, and store it in a table. The rules would be:

 

Rule | Capture string where the beginning of the text starts with | Stop capturing string when the first instance of the text below occurs
1 | href=""""
2 | "http"
3 | href=""http&quot

 

Here is some sample code:

https://1drv.ms/u/s!AvfpkO5b74akg6li3CyRiH-e-KE8Zg?e=GRcW4A 

 

This sample code is in a column in a table called 'Authoring Canvas Content'. There is also a column called 'Name' that contains the page URL. Here's a sample:

 

'Authoring Canvas Content' | Name
<code above> | https://intranet/sites/mysite/mypage.aspx

 

In terms of outputs, here's what I need to see:

Page | URL
https://intranet/sites/mysite/mypage.aspx | https://www.google1.com
https://intranet/sites/mysite/mypage.aspx | https://www.google1.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google1.6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google2.com
https://intranet/sites/mysite/mypage.aspx | https://www.google3.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google4.6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.5.com
https://intranet/sites/mysite/mypage.aspx | https://www.google6.6.com

 

Hopefully that makes sense - any help would be really appreciated 🙂

1 ACCEPTED SOLUTION
lbendlin
Super User

Here is one version of an implementation

let
    // Read the sample file line by line (code page 1252) into a single-column table
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Users\xxx\Downloads\HTML_code_sample.txt"), null, null, 1252)}),
    // The first line of the file holds the column name
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    // Split the HTML on every "https" and expand the pieces into separate rows
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Promoted Headers", {{"Authoring Canvas Content", Splitter.SplitTextByDelimiter("https", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Authoring Canvas Content"),
    // Turn &quot;, spaces and "<" into ">" so a single delimiter marks the end of a URL
    #"Replaced Value" = Table.ReplaceValue(#"Split Column by Delimiter","&quot;",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    #"Replaced Value1" = Table.ReplaceValue(#"Replaced Value"," ",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    #"Replaced Value4" = Table.ReplaceValue(#"Replaced Value1","<",">",Replacer.ReplaceText,{"Authoring Canvas Content"}),
    // Split at the first ">" and keep only the part before it
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Replaced Value4", "Authoring Canvas Content", Splitter.SplitTextByEachDelimiter({">"}, QuoteStyle.Csv, false), {"Authoring Canvas Content.1", "Authoring Canvas Content.2"}),
    #"Removed Other Columns" = Table.SelectColumns(#"Split Column by Delimiter1",{"Authoring Canvas Content.1"}),
    // Decode the HTML entity for ":" and put back the "https" prefix consumed by the first split
    #"Replaced Value2" = Table.ReplaceValue(#"Removed Other Columns","&#58;",":",Replacer.ReplaceText,{"Authoring Canvas Content.1"}),
    #"Replaced Value3" = Table.ReplaceValue(#"Replaced Value2","://","https://",Replacer.ReplaceText,{"Authoring Canvas Content.1"}),
    // Drop the first row (the text before the first "https"), then remove duplicate URLs
    #"Removed Top Rows" = Table.Skip(#"Replaced Value3",1),
    #"Removed Duplicates" = Table.Distinct(#"Removed Top Rows")
in
    #"Removed Duplicates"
How to use this code: Create a new Blank Query. Click on "Advanced Editor". Replace the code in the window with the code provided here. Click "Done".
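
The query above returns a single column of URLs, while the question asks for a Page / URL pair per link. A minimal sketch of a variant that keeps the page column might look like the following. It assumes the source table has the two columns named in the question ('Name' and 'Authoring Canvas Content'). The inline sample table, the Terminators list and the ExtractUrls helper are hypothetical names introduced here for illustration, not part of the original solution. Like the query above, it only looks for "https" links.

```powerquery
let
    // Hypothetical inline sample standing in for the real table from the question
    Source = #table(
        type table [Name = text, #"Authoring Canvas Content" = text],
        {{"https://intranet/sites/mysite/mypage.aspx",
          "<a href=""https://www.google1.com"">One</a> <div data-link=""&quot;https&#58;//www.google2.com&quot;""></div>"}}
    ),
    // Assumption: a URL ends at the first of these, mirroring the replacements in the query above
    Terminators = {"&quot;", """", " ", "<", ">"},
    // Hypothetical helper: returns the distinct https URLs found in one HTML string
    ExtractUrls = (html as nullable text) as list =>
        if html = null then {} else
        let
            // Decode the entity for ":" so "https&#58;//" becomes "https://"
            Decoded = Text.Replace(html, "&#58;", ":"),
            // Everything after each "https://"; the first piece precedes any URL, so skip it
            Parts = List.Skip(Text.Split(Decoded, "https://")),
            // Keep the text up to the first terminator and restore the prefix
            Urls = List.Transform(Parts, each "https://" & Splitter.SplitTextByAnyDelimiter(Terminators, QuoteStyle.None)(_){0})
        in
            List.Distinct(Urls),
    // One list of URLs per page, then one row per URL
    AddUrls = Table.AddColumn(Source, "URL", each ExtractUrls([Authoring Canvas Content]), type list),
    Expanded = Table.ExpandListColumn(AddUrls, "URL"),
    Result = Table.RenameColumns(Table.SelectColumns(Expanded, {"Name", "URL"}), {{"Name", "Page"}})
in
    Result
```

To try it against real data, replace the Source step with a reference to the actual table. Because the extraction runs per row, each URL stays paired with the page it came from, and duplicates are removed per page rather than across the whole table.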


Apologies for the delay with my reply @lbendlin - this is great, thank you so much for your time 🙂
