Solved: SharePoint Library extract table from pdf page for...

CHHSSSE · ‎03-13-2025

When accessing data from a library I would like to get from

Content	Name
Binary	file1.pdf
Binary	file2.pdf

to at least

Name	Page	Col1	Col2
file1.pdf	Page1	Key1	Value1
file1.pdf	Page1	Key2	Value2
file2.pdf	Page1	Key1	Value2

I pressed the double down arrows on Binary Column and it built

Helper Queries with Parameters.

Any help to better form my question or documentation I could read would be helpful.

jgeddes · ‎03-13-2025

One way to do this without using the helper queries would be...
If you have your query with the 'Content' and 'Name' columns add a column using Pdf.Tables()

= Table.AddColumn(yourPreviousStep, "pdf_content", each Pdf.Tables([Content]), type table)

Depending on how your pdf file is structured you will see something like the following when viewing the resulting tables...

Since you want the Page number info, you would want to filter the nested tables to only include Pages.

= Table.TransformColumns(add_pdf_content, {{"pdf_content", each Table.SelectRows(_, each [Kind] = "Page"), type table}})

Resulting in...

You can remove the 'Content' column as it is no longer needed.

= Table.RemoveColumns(select_nested_pdf_pages,{"Content"})

Expand the 'pdf_content' column getting the 'Name' and 'Data' columns. (You can rename the 'Name' column to 'Page' at this point.)

= Table.ExpandTableColumn(remove_binary, "pdf_content", {"Name", "Data"}, {"Page", "Data"})

The resulting tables require the headers to be promoted...

= Table.TransformColumns(expand_nested_pdf, {{"Data", each Table.PromoteHeaders(_), type table}})

The nested tables are now ready to be expanded...

= Table.ExpandTableColumn(promote_nested_headers, "Data", {"Column1", "Column2"}, {"Column1", "Column2"})

Ending up with the final result of...

Hope this helps.

Did I answer your question? Mark my post as a solution!

Proud to be a Super User!

View solution in original post

v-kpoloju-msft · ‎03-13-2025

Hi @CHHSSSE,

Thank you for reaching out to the Microsoft fabric community forum. Thank you @jgeddes, for your valuable inputs on this thread.

After thoroughly reviewing the details you provided, I have identified few workarounds that may help resolve the issue. Please follow these steps:

Open Power BI Desktop and navigate to Home > Get Data > More, Select SharePoint Folder and click Connect. Enter the SharePoint site URL (not the document library URL) and click OK. Finally, select Transform Data to proceed.

After importing data from the SharePoint folder, identify the Binary column, which contains the PDF files. Click the expand icon (⇊) in the Binary column to extract helper queries with parameters. Power BI will then generate a structured table that includes the Content (Binary), File Name, and additional metadata such as the Modified Date and File Path.
Power Query SharePoint folder connector - Power Query | Microsoft Learn

To extract data from a PDF, click on the Binary field in any row and select "Extract Table from PDF." Power BI will analyse the document and display the detected tables. Choose the table that contains the required data. If the table is not automatically detected, use the "Extract Text from PDF" option and apply further data processing as necessary.

To apply the transformation to all files, edit the Helper Query by selecting the Transform Sample File query. Adjust the steps to extract tables from all PDFs instead of a single file. Ensure proper column naming, such as extracting the Page Number and key-value pairs (e.g., Col1, Col2). Finally, verify that the data structure is correctly formatted for consistency across all files.
Power Query PDF connector - Power Query | Microsoft Learn

After transforming the data, Close & Apply. Load the structured table into Power BI and create visuals or reports.

For more information kindly refer to the below link:

New action to extract tables from PDF - Power Platform Release Plan | Microsoft Learn
PDF actions reference - Power Automate | Microsoft Learn

If this post helps, then please give us ‘Kudos’ and consider Accept it as a solution to help the other members find it more quickly.

Thank you.

CHHSSSE · ‎03-14-2025

Thank you kindly for sharing those links. I didn't dig deep into Desktop automation because of licensing, we still have yet to understand how to have a service do desktop automations rather than a user account.

Everything up to paragraph 6 was what I had done that drove me to reach out to the community, so thank you for providing the language that described my steps. The Power Query PDF connector is also powerful and I think it a fantastic tool for many users. Unfortunately this reply does not solve my problem but describes how to recreate it.

The trouble lies within the lines

To apply the transformation to all files, edit the Helper Query by selecting the Transform Sample File query. Adjust the steps to extract tables from all PDFs instead of a single file. Ensure proper column naming, such as extracting the Page Number and key-value pairs (e.g., Col1, Col2). Finally, verify that the data structure is correctly formatted for consistency across all files.

Perhaps I should restate the question as "What are the steps to extract all the pages from all the PDFs so that I can apply the remainder of the needed query?"

jgeddes · ‎03-13-2025

One way to do this without using the helper queries would be...
If you have your query with the 'Content' and 'Name' columns add a column using Pdf.Tables()

= Table.AddColumn(yourPreviousStep, "pdf_content", each Pdf.Tables([Content]), type table)

Depending on how your pdf file is structured you will see something like the following when viewing the resulting tables...

Since you want the Page number info, you would want to filter the nested tables to only include Pages.

= Table.TransformColumns(add_pdf_content, {{"pdf_content", each Table.SelectRows(_, each [Kind] = "Page"), type table}})

Resulting in...

You can remove the 'Content' column as it is no longer needed.

= Table.RemoveColumns(select_nested_pdf_pages,{"Content"})

Expand the 'pdf_content' column getting the 'Name' and 'Data' columns. (You can rename the 'Name' column to 'Page' at this point.)

= Table.ExpandTableColumn(remove_binary, "pdf_content", {"Name", "Data"}, {"Page", "Data"})

The resulting tables require the headers to be promoted...

= Table.TransformColumns(expand_nested_pdf, {{"Data", each Table.PromoteHeaders(_), type table}})

The nested tables are now ready to be expanded...

= Table.ExpandTableColumn(promote_nested_headers, "Data", {"Column1", "Column2"}, {"Column1", "Column2"})

Ending up with the final result of...

Hope this helps.

Did I answer your question? Mark my post as a solution!

Proud to be a Super User!

CHHSSSE · ‎03-14-2025

Thanks for the well laid out steps. It leads me to some new questions.

What is the difference of

Pdf.Tables([Content]), type table)

and

#"Transform File"([Content]))

I will look into on the webs but for continuity, I ask here.

Also what is going on the makes the Promoted Headers necessary? Each file had a specific name for Column3, most likely a header for the Document ID. Thankfully, Power Query seemed to ignore it, which worked out in my favor. I was worried I might have 100s of unique columns for column 3. 😱

jgeddes · ‎03-17-2025

The promote headers step is not required. I chose the wrong word. You can promote the headers at that point, but it is not required that you do so. You can delete that step and then deal with the headers afterwards if you choose.

#"Transform File"([Content]))

is a function call to the "Transform File" function where the [Content] is supplied as input to the function.

The "Transform File" function likely looks like something like...

let
    Source = (Parameter1) => let
        Source = Pdf.Tables(Parameter1, [Implementation="1.3"]),
        Page1 = Source{[Id="Page001"]}[Data],
        #"Promoted Headers" = Table.PromoteHeaders(Page1, [PromoteAllScalars=true])
    in
        #"Promoted Headers"

in
    Source

You can see in this case that the first row that the Pdf.Tables() function is present and would accept the input [Content]. The major difference is that the following two steps are applied before returning the output of the function to the table.

Page1 = Source{[Id="Page001"]}[Data],

This step is taking data from Rows that are "Page001" only. (This is what prevents you from seeing multiple "pages" from a single file.

There is nothing stopping you from making code changes inside the "Transform File" function and continuing to use it and the rest of the helper queries. It becomes a matter of choice and asthetic preference.
On some larger models there may be performance differences, but most users are not dealing with that size of model.
Hope this helps a bit.

Did I answer your question? Mark my post as a solution!

Proud to be a Super User!

CHHSSSE · ‎03-14-2025

Thank you @jgeddes I am trying this out. Let you know how this goes.

SharePoint Library extract table from pdf page for all docs

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025

Party with Power BI’s own Guy in a Cube

SharePoint Library extract table from pdf page for all docs

Helpful resources

Fabric Community Update - July 2025

Power BI Monthly Update - July 2025