Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Enhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends September 15. Request your voucher.

Reply
_V_
Helper I
Helper I

Dont get the start done - transform multiple pdf from a file

Hey guys

 

So I watched some tutorials and read some Microsoft answer how to start with the data import via PQ.

When I selcted the file which contains the pdf´s, I click on transform,  followed by clicking the down pointing double arrow at the first colum "content" to get to the combine window.

 

Now I got the problem, that the suggested "Tables" dont fit the way I need different text passages to be extracted.
I got the option to choose the complete pdf with "Page001" - but what to do then?

How can I select the passages I need?

 

 

The main project I want to clear is, to get a self refreshing diagram, based on the "everytime a file is created" synchronisation from onedrive. Maybe someone can share some light here please, thanks a lot.

1 ACCEPTED SOLUTION

Hi @_V_ ,

Please share more details about the issue, as it might help in resolving it accurately.

Thank You.

View solution in original post

21 REPLIES 21
V-yubandi-msft
Community Support
Community Support

Hi @_V_ ,

We haven’t received a response from you in some time. To help address the issue, could you share some sample data when possible?

 

Thank you.

V-yubandi-msft
Community Support
Community Support

Hi @_V_ ,

If your issue is resolved, that’s good to hear. But if you're still facing any difficulties or running into new challenges, feel free to let me know.

 

Thank You.

Hey @V-yubandi-msft 

 

Like said, it resolves into errors, so problem not solved.

Hi @_V_ ,

Thank you for your response. Sorry to hear you’re still encountering errors. Could you provide a bit more information about the issue? For instance.

1. Does the error occur at a specific step? Or

2. Does it appear after you paste the M code?

 

With these details, I can troubleshoot more effectively or update the logic if needed.

 

Thank you for your patience let me know so I can assist further.

 

Regards,
Yugandhar.

Hi @_V_ ,

Please share more details about the issue, as it might help in resolving it accurately.

Thank You.

jaineshp
Memorable Member
Memorable Member

Hey @_V_ ,

Perfect! I can see exactly what you need from the weigh slip. Looking at your Zimmermann Recycling document, I can identify the key fields:

Target Data Points:

  • Material Type: "Bauschutt" (from the material description area)
  • Net Weight: "6.47 t" (Netto field)
  • Delivery Type: You'll need to determine this from context - likely from customer vs. internal codes or specific material descriptions

Extraction Strategy: Since this is a structured document, here's the approach:

  1. After expanding Content → Page001, use these Power Query functions:

    Text.BetweenDelimiters([Page001], "Material:", "Netto") // for material type
    Text.BetweenDelimiters([Page001], "Netto", "t") // for weight

  2. For Delivery Type detection, you might need:

    if Text.Contains([Material], "Bauschutt") then "Delivered"
    else if Text.Contains([Material], "Recycling") then "Picked up"

  3. Clean the data with Text.Trim() and Text.Replace() to remove extra spaces/characters

The German text actually helps - terms like "Bauschutt", "Beton", "Asphalt" are consistent identifiers you can use for classification.

Best Regards,
Jainesh Poojara | Power BI Developer

Jainesh,

 

I know what youre talking about, but I really dont get it done since Im completely new to this.

The problem with the delivery type is a field above; its either "ab Werk" which means taken of or its "Anlieferung" which would be "brought to us". Anyways, I guess this could be resolved later with the diagram, since we do not offer the materials we take for sale. So a materialtype can only be on or the other.

 

Ive taken some pics to show you my nooby moves, could you please have a look again?

I really appreciate your time and efforts. 

After choosing the folder, I hit transform dataAfter choosing the folder, I hit transform dataThen I use the double arrow next to contentThen I use the double arrow next to contentand now I have to approch the filters?and now I have to approch the filters?Then I choose the page, since it has all informationsThen I choose the page, since it has all informations

jaineshp
Memorable Member
Memorable Member

Hey @_V_ ,

Looking at your screenshots, you're on the right track! Here are the next steps to clean up your data:

After selecting Page001:

  1. Remove unwanted columns - Right-click on columns you don't need (like the empty ones) and select "Remove"
  2. Promote headers - Your first row likely contains column names. Go to Transform tab → "Use First Row as Headers"
  3. Filter out nulls/blanks - Click the dropdown arrow on each important column and uncheck blank/null values
  4. Handle the delivery type issue - Since "ab Werk" materials aren't for sale, you can:
    • Filter out those rows now, OR
    • Keep them and handle it later in your data model
  5. Check data types - Make sure date columns show as dates, numbers as numbers, etc. Power Query usually auto-detects but double-check
  6. Close & Apply when your data looks clean

Quick tip: Don't worry about the delivery type complexity now - you can always add a calculated column later to mark sellable vs non-sellable materials.

Your approach is correct, just need these cleanup steps before loading the data!

Best Regards,
Jainesh Poojara | Power BI Developer

Jainesh hey

 

I dont get it as I need it.

If I choose to delete all null-values in a column, it automatically deletes every following row of other columns; so I can only do this on sone columns which do not affect my needed values.

 

Also, even if every weighing pdf is built the same way, I get the net weight in different columns for instance. This is really stressing and I can see, why paid apps exist to be honest....

 

Is there a possibility that you have a look into this if I send you a couple of those files? Just to see if a pro has the same problem

jaineshp
Memorable Member
Memorable Member

Hey @_V_,

I completely understand your frustration! You've hit two very common Power Query challenges that even pros deal with regularly. Let me help you solve both issues:

Issue 1: Null value filtering deleting entire rows

This happens because Power Query removes the entire row when ANY selected column has nulls. Here's how to fix it:

Better approach:

  1. Instead of filtering out nulls, replace them first
  2. Right-click the column → "Replace Values"
  3. Leave "Value to Find" empty, put "0" or "N/A" in "Replace With"
  4. THEN filter if needed

Or use conditional filtering:

  • Click column dropdown → "Text Filters" → "Does Not Equal" → "(blank)"

Issue 2: Net weight appearing in different columns

This is the classic "inconsistent PDF structure" problem. Here's the pro approach:

Solution - Unpivot and merge:

  1. Select ALL the columns that might contain net weight
  2. Transform tab → "Unpivot Columns"
  3. This creates two columns: "Attribute" and "Value"
  4. Filter the "Attribute" column to only show weight-related column names
  5. The "Value" column now has all your weights in one place!


Did it work? ✔ Give a Kudo • Mark as Solution – help others too!

Best regards,
Jainesh Poojara / Power BI Developer

Hi @_V_ ,

Thank you for providing the context and screenshots they're very helpful. I also appreciate @jaineshp  support and valuable insights throughout this discussion. You're on the right track, and it's common to face some challenges with PDFs in Power Query. Even if the layout looks consistent, small formatting changes can cause issues.

Here are a few suggestions:

  1. Since your PDFs have similar structures but slight column differences, Extract Text can offer better control over the data you capture.
  2. Filtering out blanks early may remove rows you'll need later. Try filtering specific columns or using conditions to keep important data.
  3. If your files are mostly similar, using the Folder connector in Power Query lets you apply a consistent parsing approach to all PDFs, saving time and improving reliability.

Helpful Reference: Power Query Exact Online Premium connector - Power Query | Microsoft Learn


Hope this helps, It covers both table and text extraction.

 

Regards,

Yugandhar.

Hey good morning Yugandhar

 

Ok, but I dont know where to set the option "extract text" instead of parts/tables;

Ill have a look into your link.

Thanks so far

 

Edit: Does this contain a paid plugin? Is there no way to get this done without paying extra stuff....?

 

Edit2: Thanks again for your big help Jainesh

I still dont get it done; sometimes the net weight isnt even listed in the standard Page layout.

I can imagine, merging values into one column makes it even harder, since the source of the weight is always in a different row.

Meaning: I get then 1 column with all the weights and I wont be able to tell which one is net, which is tara and so on. Im done with this, Ill have a look who has the cheapest scraper app

Hello @_V_ ,

You're absolutely right Power Query doesn’t label a button as Extract Text. What we mean is when you select Page001 during the Combine Files step, you're effectively importing the entire page's raw text which then allows you to apply text functions like Text.BetweenDelimiters() to extract exactly what you need.

 

And just to clarify you don’t need any paid plugin to get this working. Everything can be done using Power Query in Excel or Power BI Desktop with built in features.

 

Thank You.

Like said before, its missing some net weight data and even if everything would be there, its too complicated to me to accomplish.

I completely understand your concern. Just to add to what we’ve discussed so far, here are a couple of community threads that are quite similar to your situation

Solved: Import Full PDF into Power BI instead of a few tab... - Microsoft Fabric Community

Solved: Multiple PDFs Different Tables - Microsoft Fabric Community

 

These threads may not exactly match your case, but they touch on the same challenges with PDF structure, data inconsistency, and parsing. It might be helpful to check them out for some additional insights.

 

Thanks for your patience.

Those links do not help me.

Thanks

How do I REPLACE a cells value, if the cell above meets certain conditions?

 

Since the textfilters deletes whole rows, I need the replace funktion.
So what I need is:

image.png

 

If a cell is equal "Tara (t)" then replace it with "/" and the content of the cell below with "/"

 

Can you tell me that please?

We've successfully applied the transformation as requested:

  1. Replaced "Tara (t)" with "/"

  2. Also replaced the value immediately following "Tara (t)" with "/"

FYI:

Vyubandimsft_0-1754043687901.png

Please use this M Code in the Advanced Editor.

let
    
    Source = Table.FromRows({
        {"Gross weight"},
        {"Tara (t)"},
        {"13,88"},
        {"Net weight"},
        {"Tara (t)"},
        {"14,32"},
        {"Other"}
    }, {"A"}),
  
    AddIndex = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),

    MarkTara = Table.AddColumn(AddIndex, "Replace", each 
        if [A] = "Tara (t)" then 1 else 0),

    MarkNext = Table.AddColumn(MarkTara, "ReplaceNext", each 
        try MarkTara[Replace]{[Index]-1} otherwise 0),
   
    Combined = Table.AddColumn(MarkNext, "Final", each 
        if [Replace] = 1 or [ReplaceNext] = 1 then "/" else [A]),

    Output = Table.SelectColumns(Combined, {"Final"})
in
    Output

 

Regards,
Yugandhar.

 

Hello @_V_ ,

Everything appears to be in order now. If you need any more help, just let us know.

 

Thanks.

jaineshp
Memorable Member
Memorable Member

Hi @_V_ ,

You're on the right track. After expanding "Content" and seeing "Page001", you'll need to use text functions to extract specific passages:

  • Use Text.BetweenDelimiters() or Text.Range() to pull specific sections
  • Try Text.Contains() to find the sections you need first
  • Consider using "Extract Text" instead of "Tables" when combining if your PDFs have consistent formatting

For auto-refresh: Once extraction works, set up scheduled refresh in Power BI/Excel and your OneDrive sync should handle the file updates automatically.

What specific text sections are you trying to extract?That would help suggest the right text functions to use.

Best Regards,
Jainesh Poojara | Power BI Developer

Helpful resources

Announcements
August Power BI Update Carousel

Power BI Monthly Update - August 2025

Check out the August 2025 Power BI update to learn about new features.

August 2025 community update carousel

Fabric Community Update - August 2025

Find out what's new and trending in the Fabric community.

Top Solution Authors