Good morning,
How do I import a PDF without dividing it into columns?
Regards,
Julio
The only built-in PDF connector is Pdf.Tables, and I do not believe there is any way to prevent it from parsing the PDF content into a table of nested tables (page text plus any detected tables, charts, etc.). You may want to look into using an R or Python script in Power Query instead.
That said, here is a general set of PQ transformations you can use to convert the default Pdf.Tables output into something similar to what Lines.FromBinary returns, which I think is what you are asking for.
let
    // Load the PDF file from a location
    // Note: Use File.Contents for local files
    Source = Web.Contents(
        "https://file-examples.com/storage/feeed4f6296807c3196e058/2017/10/file-example_PDF_1MB.pdf"
    ),
    // Parse the PDF file to extract its content
    PdfParse = Pdf.Tables(Source),
    // Filter the parsed content to get only the pages
    GetPagesOnly = Table.SelectRows(PdfParse, each ([Kind] = "Page")),
    // Combine all the separate tables of page content into one table for the whole file
    CombineAllPageContent = Table.Combine(GetPagesOnly[Data]),
    // Merge all columns into a single column (called "Merged")
    MergeAllPageColumns = Table.CombineColumns(
        CombineAllPageContent,
        Table.ColumnNames(CombineAllPageContent),
        Combiner.CombineTextByDelimiter("", QuoteStyle.None),
        "Merged"
    )
in
    MergeAllPageColumns
Quick visual of the steps to see what the above is doing:
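For a local file, the same idea can be sketched as below. This is only a variant under assumptions: the path C:\Data\example.pdf is hypothetical, the conversion of every column to text is an extra guard that the query above does not include, and the single-space delimiter is a choice you may want to change.

```powerquery
let
    // Hypothetical local path; replace with the location of your own PDF
    Source = File.Contents("C:\Data\example.pdf"),
    // Parse the PDF and keep only the page-level entries
    PdfParse = Pdf.Tables(Source),
    GetPagesOnly = Table.SelectRows(PdfParse, each [Kind] = "Page"),
    // Stack the content of all pages into one table
    CombineAllPageContent = Table.Combine(GetPagesOnly[Data]),
    ColumnNames = Table.ColumnNames(CombineAllPageContent),
    // Assumption: convert every column to text first, as a guard against non-text values
    AllText = Table.TransformColumnTypes(
        CombineAllPageContent,
        List.Transform(ColumnNames, each {_, type text})
    ),
    // Assumption: a single space as delimiter, so text from adjacent columns does not run together
    MergeAllPageColumns = Table.CombineColumns(
        AllText,
        ColumnNames,
        Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),
        "Merged"
    )
in
    MergeAllPageColumns
```

If the original column breaks fall in the middle of words, an empty delimiter as in the query above may give a cleaner result than a space.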
Hi @Jbuzios,
As we haven't heard back from you, we wanted to follow up and check whether the solution provided resolved the issue. Let us know if you need any further assistance.
If our response addressed your question, please mark it as the accepted solution and click Yes if you found it helpful.
Regards,
Hi @Jbuzios
When you import a PDF in Power BI, the connector automatically tries to detect tables and split them into columns. If you want the entire page or table content in a single column, you can merge the columns back together in Power Query after the import.
If the PDF layout makes it hard to merge properly, you can also import it as plain text (using a PDF-to-text converter first) and then load that into Power BI as a single column.
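The plain-text route could look roughly like the sketch below. It assumes the PDF has already been converted to a UTF-8 text file by some external tool; the path C:\Data\example.txt and the column name "Line" are hypothetical.

```powerquery
let
    // Hypothetical path to the text file produced by a PDF-to-text converter
    Source = File.Contents("C:\Data\example.txt"),
    // Split the file into a list with one item per line (assumes UTF-8 encoding)
    Lines = Lines.FromBinary(Source, null, null, TextEncoding.Utf8),
    // Turn the list into a single-column table
    AsTable = Table.FromColumns({Lines}, {"Line"})
in
    AsTable
```

Each line of the text file becomes one row, so nothing is split into columns.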
Assuming the data is already loaded into Power Query, adding the following step afterwards should work (replace YourTableName with the name of your previous step):
= Table.FromColumns( { List.Combine( Table.ToColumns( YourTableName ) ) }, { "Merged" } )
If you're looking for all the contents in different columns to be in one single column, then this should work. Thanks!
Let me know if I understood your query correctly.
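To see what this kind of flattening does, here is a small self-contained sketch on made-up data. The Sample table and the column name "Merged" are hypothetical, and Table.FromColumns is used because it accepts a list of column lists.

```powerquery
let
    // Made-up two-column table standing in for the imported PDF content
    Sample = #table({"Column1", "Column2"}, {{"a", "b"}, {"c", "d"}}),
    // Stack the columns one after another: a, c, b, d
    ByColumn = Table.FromColumns({List.Combine(Table.ToColumns(Sample))}, {"Merged"}),
    // Alternative: walk the rows instead, which keeps reading order: a, b, c, d
    ByRow = Table.FromColumns({List.Combine(Table.ToRows(Sample))}, {"Merged"})
in
    ByRow
```

Note that either version puts every cell on its own row. If you want one row per original row with the cells joined together, Table.CombineColumns is the closer fit.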
Hi @Jbuzios,
There are probably multiple steps to this, depending on the PDF structure. Potential steps:
1. Combine Data from Multiple PDF Files into a Single Excel File or Combine Data from Multiple PDFs with Inconsistent Column Names!
2. To get it into a couple of columns: Unpivot Multiple Column Groups
If this doesn't resolve the issue, kindly provide a sample input (with sensitive data masked) and a sample of the expected output.