Solved: PDF header inconstistency across pages

Antmkjr · ‎01-11-2025

Hi ,

I have a pdf document in which my data spans across multiple tables in multiple pages. But the header in different pages , even though it appears exactly similar structure in the pdf document, when pulled in power query , they are bit inconsistent.

Example shown: is the header in Page 1 and Page 14. Page 14 is having null in Column 8 Vs D= A+B+C as header in Page 1.

there are even more differences like this, how do I make all headers in all pages consistent .

Anonymous · ‎01-12-2025

It appears that you don't have any header rows. Were they taken out in the transform process or do they not exist? I would first select only the tables, and get rid of the pages. Then you should be able to promote the headers. Then you can choose the column names that you want to keep. Having the table with the greatest number of columns in the first row means that none of the subsequent tables' column names will be missing. After that, try to filter out any tables that you don't need--you know how sometimes on PDFs there will be some random three column table that you don't want as part of your dataset. Assuming the data you need is in table format in the PDFs, you should be able to filter to just the tables you need. Sometimes it's a pattern like every third table--you'll figure that part out. My point is that you need to get some headers in those tables before you expand them.

If there are missing fields when you reorder, the MissingFields.Ignore parameter handles that.

These are the steps I always take when combining PDFs.

--Nate

View solution in original post

Anonymous · ‎02-16-2025

Hi @Antmkjr
I hope this information is helpful. Please let me know if you have any further questions or if you'd like to discuss this further. If this answers your question, please Accept it as a solution and give it a 'Kudos' so others can find it easily.
Thank you.

Anonymous · ‎02-03-2025

Hi @Antmkjr

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

Thank you.

Anonymous · ‎01-20-2025

Hi @Antmkjr
Thank you for reaching out to the Microsoft Fabric Community Forum!

We just wanted to check in and see if your issue has been resolved. If it has, it would be great if you could mark the reply that helped as the accepted solution. This can be really helpful for other community members who might face similar issues in the future.

If you have any other questions or need further assistance, please feel free to reach us.

Thank you

Anonymous · ‎01-12-2025

True but now that you know that you are going to have all of the columns, you can just (before you expand the tables) you can either add a column or transform the existing table column like

= Table.AddColumn(PriorStepOrTableName, "Ordered", each Table.ReorderColumns(Table.PromoteHeaders([TableColumnName]), {"Column1", "Column2", "Column3"}, MissingFields.Ignore))

So you have to sort by number of columns, then promote the headers of the tables, and finally re-order the columns. Then expand the tables.

The purpose of sorting descending by number of columns then promoting headers is to make sure that the table with the most columns can be at the top of the list of tables, thus ensuring that there are names for all of your columns when you expand.

--Nate

Antmkjr · ‎01-12-2025

Hi Nate,

I am not sure if I understood you correctly. But in actually scenario there are around more than 10 pages, and unwanted columns(Columns with null values) can appear at any position. how does the reordering technique work when the position of columns with null value are uncertain? Would you be so kind to explain further.

The table with most columns -- there are more columns because those are unwanted columns which has to be removed. Eg: Column 8 in Page 14 is extra and is unwanted. So I dont understand the need for sorting also.

Anonymous · ‎01-12-2025

It appears that you don't have any header rows. Were they taken out in the transform process or do they not exist? I would first select only the tables, and get rid of the pages. Then you should be able to promote the headers. Then you can choose the column names that you want to keep. Having the table with the greatest number of columns in the first row means that none of the subsequent tables' column names will be missing. After that, try to filter out any tables that you don't need--you know how sometimes on PDFs there will be some random three column table that you don't want as part of your dataset. Assuming the data you need is in table format in the PDFs, you should be able to filter to just the tables you need. Sometimes it's a pattern like every third table--you'll figure that part out. My point is that you need to get some headers in those tables before you expand them.

If there are missing fields when you reorder, the MissingFields.Ignore parameter handles that.

These are the steps I always take when combining PDFs.

--Nate

Anonymous · ‎01-11-2025

Before you expand the tables. You have to sort the tables by number of columns greatest to least. so before you expand the tables, add a custom column using each Table.ColumnCount. Then sort, the resulting column, descending. Then expand your columns.

--Nate

Antmkjr · ‎01-11-2025

By doing this , I will be able to see that there is a difference in the no. of columns . But im still not able to align the coloums in the correct place.

Just showing expected output:

PDF header inconstistency across pages

Helpful resources

Power BI Monthly Update - August 2025

Fabric Community Update - August 2025

Join us at FabCon Vienna from September 15-18, 2025

PDF header inconstistency across pages

Helpful resources

Power BI Monthly Update - August 2025

Fabric Community Update - August 2025