transfer messy table from PDF extraction to tidy t...

dogbural · ‎09-11-2025

Hi,

I have some invoices in PDF and would like to see if the data can be tabulised in the way I want.

So, I bring PDF files to Power query and would like to tidy the appended table up.

Unfortunately, the values are not always aligned in the same columns.

Amonst the messy table, I need only values for the Customer PO#, PO Item #, Part #, and Quantity as well as tax invoice number as listed in "Result" tab.

I have already highlighted the values I would like to transfer in yellow in "Invoice" Tab, raw data extracted from PDF.

https://limewire.com/d/UfvBI#k4d0bWtA0w

Is it something we can do using Power query?

v-veshwara-msft

Hi @dogbural ,

Thanks for posting in Microsoft Fabric Community.

Thanks @DataNinja777 for outlining the general approach using Power Query - that’s exactly the right direction.

I created a sample Power BI file where the invoice, PO, line, and quantity details are extracted into a structured table. You can download it and click Transform Data to review the applied steps and final output. Please tweak the steps as needed to fit your source data.

Hope this helps. Please reach out for further assistance.

Thank you.

DataNinja777 · ‎09-12-2025

Hi @dogbural ,

Yes, you can absolutely do this using Power Query. It's the perfect tool for transforming messy, semi-structured data like the kind extracted from PDF invoices into a clean, usable table.

The general strategy is to create a new, dedicated column for each piece of information you need: the Tax Invoice number, Customer PO#, PO Item #, Part #, and Quantity. You will use Power Query's functions to scan your existing columns, find the rows that contain the labels for your data (like "Customer PO#"), and then pull the corresponding values into your new, clean columns.

You can start this process by adding a custom column for each field. Go to the Add Column tab and select Custom Column. To get the Tax Invoice number, you can use a formula like the one below. This M code checks if a cell in Column1 contains the text "Tax Invoice" and, if it does, it returns the value from the adjacent column, Column2. You will need to adjust these column names to match your specific table.

if Text.Contains([Column1], "Tax Invoice") then [Column2] else null

You will then repeat this process for the other header-level information. For example, to extract the Customer PO number, you would create another custom column using a similar formula, again adjusting the column names as needed.

if Text.Contains([Column1], "Customer PO#") then [Column2] else null

After creating all the new columns for your data, you will notice that the values only appear on single rows, with the rest of the cells being null. To fix this, you need to carry those values down to all the relevant line items. Select the new columns you just created, navigate to the Transform tab, click on the Fill button, and then select Down. This action will populate the null cells with the value from the first populated cell above, correctly associating each line item with its parent invoice and PO number.

The final step is to clean up the table. Now that all your required data is organized in its own columns, you can remove the rows you no longer need. A good way to do this is to filter the "Part #" or "Quantity" column to show only the rows that are not null. This will effectively remove all the original header and miscellaneous text rows. Once that's done, you can simply right-click and remove the original, messy columns from the PDF import, leaving you with the perfectly structured table you need.

Best regards,

transfer messy table from PDF extraction to tidy table

Helpful resources

Power BI Monthly Update - September 2025

Fabric Community Update - August 2025

FabCon is coming to Atlanta

transfer messy table from PDF extraction to tidy table

Helpful resources

Power BI Monthly Update - September 2025

Fabric Community Update - August 2025