Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

July 7 - July 17 | Round 2 of the Power BI Dataviz World Championships. Don't miss your chance! Learn more

Reply
seedrs91
Frequent Visitor

Very tricky text extraction

Hi all, I have two columns below where there are multiple “tenants” in one column and multiple “areas” in another. I want to extract all the “SA Majeste” and “PWGSC” entities from the Tenant column such that in the new first column, I get:

 

  • SA MAJESTE LA REINE ( 81001133)
  • SA MAJESTE LA REINE (81001086

 

etc. and their corresponding areas—formatted as “(XX,XXX.00)”—in the new second column:

 

  • 6,075
  • 7,453

 

The SA Majeste or PWGSC entities are not always in second place. They could be in third, fourth or nth place. Thus, there is a pattern for extraction, but I had no luck figuring out the M code for this extraction. Unfortunately, all the data comes from a PDF online, so I have no control over how the data are structured…I’d have normalized the data to make the table analysis-friendly otherwise.

 

seedrs91_0-1689190600254.jpeg

4 REPLIES 4
smpa01
Community Champion
Community Champion

@seedrs91 is there any other data format avaialble other than PDF for the download? such as XML/HTML?


========================
Did I answer your question? Mark my post as a solution!
Proud to be a Super User
My Custom Visualization Projects
• Plotting Live Sound: Live Sound
• Beautiful News: Women in Parliament, Energy Mix, Shrinking Armies
• Visual Capitalist: Working Hrs
• Others: Easing Graph, Animated Calendar
MayViz Submissions
• Week 1: View
• Week 2: View
• Week 3: View
• Week 4: View
========================

Unfortunately not, It is only available in PDF.

smpa01
Community Champion
Community Champion

Like you said there is no apparent pattern to this and trying to come up with a pattern and then translate that into regex and then using js within power query seems like a difficult task for this (from performance standpoint too).

There is one thing that comes to my mind; if use node.js; there is a package called pdf2json. So you can create a node project and utilize pdf2json to convert that to a json first and then pass the json to power bi. I am not 100% sure though if it would still give you what you need.

 

Or if you have direct access to data vendor, you can ask them to format the data with desired delimiter between tenants and corresponding fields before making it avaialble to clients. I did exactly this in similar situation like you at past.


========================
Did I answer your question? Mark my post as a solution!
Proud to be a Super User
My Custom Visualization Projects
• Plotting Live Sound: Live Sound
• Beautiful News: Women in Parliament, Energy Mix, Shrinking Armies
• Visual Capitalist: Working Hrs
• Others: Easing Graph, Animated Calendar
MayViz Submissions
• Week 1: View
• Week 2: View
• Week 3: View
• Week 4: View
========================

Sorry, I'm still a beginner so may I ask what is it that pdf2json does that helps with my goal?

 


@smpa01 wrote:

 

Or if you have direct access to data vendor, you can ask them to format the data with desired delimiter between tenants and corresponding fields before making it avaialble to clients. I did exactly this in similar situation like you at past.


Apologies but this won't be possible. This is a bid whereby the vendor doesn't provide preferntial treatment to any bidder.

Helpful resources

Announcements
Fabric Data Days is here Carousel

Data Days 2026

Don't miss out on Data Days, June 15 through August 7. Learn Fabric, Power BI, SQL, AI and more.

Power BI DataViz World Championships carousel

Power BI DataViz World Championships - June 2026

A new Power BI DataViz World Championship is coming this June! Don't miss out on submitting your entry.