Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get inspired! Check out the entries from the Power BI DataViz World Championships preliminary rounds and give kudos to your favorites. View the vizzies.

Reply
CHHSSSE
New Member

SharePoint Library extract table from pdf page for all docs

When accessing data from a library I would like to get from 

 

ContentName
Binaryfile1.pdf
Binaryfile2.pdf

 

to at least 

 

NamePageCol1Col2
file1.pdfPage1Key1Value1
file1.pdfPage1Key2Value2
file2.pdfPage1Key1Value2

 

I pressed the double down arrows on Binary Column and it built 

Helper Queries with Parameters.

Any help to better form my question or documentation I could read would be helpful.

1 ACCEPTED SOLUTION
jgeddes
Super User
Super User

One way to do this without using the helper queries would be...
If you have your query with the 'Content' and 'Name' columns add a column using Pdf.Tables()

= Table.AddColumn(yourPreviousStep, "pdf_content", each Pdf.Tables([Content]), type table)

Depending on how your pdf file is structured you will see something like the following when viewing the resulting tables...

jgeddes_0-1741897841052.png

Since you want the Page number info, you would want to filter the nested tables to only include Pages.

= Table.TransformColumns(add_pdf_content, {{"pdf_content", each Table.SelectRows(_, each [Kind] = "Page"), type table}})

Resulting in...

jgeddes_1-1741897953144.png

You can remove the 'Content' column as it is no longer needed.

= Table.RemoveColumns(select_nested_pdf_pages,{"Content"})

Expand the 'pdf_content' column getting the 'Name' and 'Data' columns. (You can rename the 'Name' column to 'Page' at this point.)

= Table.ExpandTableColumn(remove_binary, "pdf_content", {"Name", "Data"}, {"Page", "Data"})

The resulting tables require the headers to be promoted...

jgeddes_2-1741898117034.png

= Table.TransformColumns(expand_nested_pdf, {{"Data", each Table.PromoteHeaders(_), type table}})

The nested tables are now ready to be expanded...

jgeddes_3-1741898172407.png

= Table.ExpandTableColumn(promote_nested_headers, "Data", {"Column1", "Column2"}, {"Column1", "Column2"})

Ending up with the final result of...

jgeddes_4-1741898215609.png

Hope this helps.





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!





View solution in original post

6 REPLIES 6
v-kpoloju-msft
Community Support
Community Support

Hi @CHHSSSE

Thank you for reaching out to the Microsoft fabric community forum. Thank you @jgeddes, for your valuable inputs on this thread.

 

After thoroughly reviewing the details you provided, I have identified few workarounds that may help resolve the issue. Please follow these steps:

 

Open Power BI Desktop and navigate to Home > Get Data > More, Select SharePoint Folder and click Connect. Enter the SharePoint site URL (not the document library URL) and click OK. Finally, select Transform Data to proceed.

After importing data from the SharePoint folder, identify the Binary column, which contains the PDF files. Click the expand icon (⇊) in the Binary column to extract helper queries with parameters. Power BI will then generate a structured table that includes the Content (Binary), File Name, and additional metadata such as the Modified Date and File Path.
Power Query SharePoint folder connector - Power Query | Microsoft Learn


To extract data from a PDF, click on the Binary field in any row and select "Extract Table from PDF." Power BI will analyse the document and display the detected tables. Choose the table that contains the required data. If the table is not automatically detected, use the "Extract Text from PDF" option and apply further data processing as necessary.


To apply the transformation to all files, edit the Helper Query by selecting the Transform Sample File query. Adjust the steps to extract tables from all PDFs instead of a single file. Ensure proper column naming, such as extracting the Page Number and key-value pairs (e.g., Col1, Col2). Finally, verify that the data structure is correctly formatted for consistency across all files.
Power Query PDF connector - Power Query | Microsoft Learn

After transforming the data, Close & Apply. Load the structured table into Power BI and create visuals or reports.

vkpolojumsft_0-1741930623625.pngvkpolojumsft_1-1741930639064.png

 

For more information kindly refer to the below link:

New action to extract tables from PDF - Power Platform Release Plan | Microsoft Learn
PDF actions reference - Power Automate | Microsoft Learn

 

If this post helps, then please give us ‘Kudos’ and consider Accept it as a solution to help the other members find it more quickly.

 

Thank you.



Thank you kindly for sharing those links. I didn't dig deep into Desktop automation because of licensing, we still have yet to understand how to have a service do desktop automations rather than a user account.

 

Everything up to paragraph 6 was what I had done that drove me to reach out to the community, so thank you for providing the language that described my steps. The Power Query PDF connector is also powerful and I think it a fantastic tool for many users. Unfortunately this reply does not solve my problem but describes how to recreate it. 

 

The trouble lies within the lines 

 


To apply the transformation to all files, edit the Helper Query by selecting the Transform Sample File query. Adjust the steps to extract tables from all PDFs instead of a single file. Ensure proper column naming, such as extracting the Page Number and key-value pairs (e.g., Col1, Col2). Finally, verify that the data structure is correctly formatted for consistency across all files.


Perhaps I should restate the question as "What are the steps to extract all the pages from all the PDFs so that I can apply the remainder of the needed query?"

jgeddes
Super User
Super User

One way to do this without using the helper queries would be...
If you have your query with the 'Content' and 'Name' columns add a column using Pdf.Tables()

= Table.AddColumn(yourPreviousStep, "pdf_content", each Pdf.Tables([Content]), type table)

Depending on how your pdf file is structured you will see something like the following when viewing the resulting tables...

jgeddes_0-1741897841052.png

Since you want the Page number info, you would want to filter the nested tables to only include Pages.

= Table.TransformColumns(add_pdf_content, {{"pdf_content", each Table.SelectRows(_, each [Kind] = "Page"), type table}})

Resulting in...

jgeddes_1-1741897953144.png

You can remove the 'Content' column as it is no longer needed.

= Table.RemoveColumns(select_nested_pdf_pages,{"Content"})

Expand the 'pdf_content' column getting the 'Name' and 'Data' columns. (You can rename the 'Name' column to 'Page' at this point.)

= Table.ExpandTableColumn(remove_binary, "pdf_content", {"Name", "Data"}, {"Page", "Data"})

The resulting tables require the headers to be promoted...

jgeddes_2-1741898117034.png

= Table.TransformColumns(expand_nested_pdf, {{"Data", each Table.PromoteHeaders(_), type table}})

The nested tables are now ready to be expanded...

jgeddes_3-1741898172407.png

= Table.ExpandTableColumn(promote_nested_headers, "Data", {"Column1", "Column2"}, {"Column1", "Column2"})

Ending up with the final result of...

jgeddes_4-1741898215609.png

Hope this helps.





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!





Thanks for the well laid out steps. It leads me to some new questions.

 

What is the difference of 

Pdf.Tables([Content]), type table)

and 

#"Transform File"([Content]))

 

I will look into on the webs but for continuity, I ask here. 

 
Also what is going on the makes the Promoted Headers necessary? Each file had a specific name for Column3, most likely a header for the Document ID. Thankfully, Power Query seemed to ignore it, which worked out in my favor. I was worried I might have 100s of unique columns for column 3. 😱

The promote headers step is not required. I chose the wrong word. You can promote the headers at that point, but it is not required that you do so. You can delete that step and then deal with the headers afterwards if you choose.

#"Transform File"([Content]))

is a function call to the "Transform File" function where the [Content] is supplied as input to the function.

The "Transform File" function likely looks like something like...

let
    Source = (Parameter1) => let
        Source = Pdf.Tables(Parameter1, [Implementation="1.3"]),
        Page1 = Source{[Id="Page001"]}[Data],
        #"Promoted Headers" = Table.PromoteHeaders(Page1, [PromoteAllScalars=true])
    in
        #"Promoted Headers"

in
    Source

You can see in this case that the first row that the Pdf.Tables() function is present and would accept the input [Content]. The major difference is that the following two steps are applied before returning the output of the function to the table. 

Page1 = Source{[Id="Page001"]}[Data],

This step is taking data from Rows that are "Page001" only. (This is what prevents you from seeing multiple "pages" from a single file.

 

There is nothing stopping you from making code changes inside the "Transform File" function and continuing to use it and the rest of the helper queries. It becomes a matter of choice and asthetic preference. 
On some larger models there may be performance differences, but most users are not dealing with that size of model.
Hope this helps a bit.





Did I answer your question? Mark my post as a solution!

Proud to be a Super User!





Thank you @jgeddes I am trying this out. Let you know how this goes.

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code FABINSIDER for a $400 discount!

FebPBI_Carousel

Power BI Monthly Update - February 2025

Check out the February 2025 Power BI update to learn about new features.

March2025 Carousel

Fabric Community Update - March 2025

Find out what's new and trending in the Fabric community.