Join us at FabCon Atlanta from March 16 - 20, 2026, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.
Register now!The Power BI Data Visualization World Championships is back! Get ahead of the game and start preparing now! Learn more
Hi,
I have quickly searched on the forum but did not find what I need, I want to combine multiple PDF, but I prefer to write my own M code than using the UI, which is sometime limited and maybe my M code will be more performant than the UI (maybe)
Here is my M code so far
let
Source = Folder.Files(Ffiles),
FilterPDF = Table.SelectRows(Source, each ([Extension] = ".pdf")),
BinaryToTables = Table.TransformColumns(FilterPDF, {"Content", each Pdf.Tables(_)}),
KeepContent = Table.SelectColumns(BinaryToTables,{"Content"}),
KeepPages = Table.TransformColumns(KeepContent,{"Content", each Table.SelectRows(_, each [Kind] = "Page")}),
CombineData = Table.TransformColumns(KeepPages, {"Content", each Table.Combine(_[Data])}),
SkipInvoice = Table.TransformColumns(CombineData,{"Content", each Table.Skip(_, each [Column1] <> "Invoice")}),
ColumnCount = Table.AddColumn(SkipInvoice,"Columns", each List.Count (Table.ColumnNames([Content]))),
SortRows = Table.Sort(ColumnCount,{{"Columns", Order.Ascending}})
in
SortRows
I did a count on the columns of each table and it goes from 12 columns to 20 columns...
My approach is to remove all blank columns - and this is my question because I am not sure how to do it
then combine some columns together to get the same number of columns in each table
@lbendlin @dufoq3 @AlienSx @ManuelBolz @jennratten @ronrsnfld @m_dekorte @Rickmaurinus
thanks for your time and for your help
Solved! Go to Solution.
@Einomi my bad - wrong name after "in". I'll correct that soon. You may create blank query and place the following code in there (half of job is done by @dufoq3 - check his query as well, it's also works)
(tbl) =>
[ReplaceBlankToNull = Table.TransformColumns(tbl, {}, each if _ = "" then null else _),
profile = Table.Buffer(Table.Profile(ReplaceBlankToNull)),
columns_to_delete = Table.SelectRows(profile, each [Count] = [NullCount])[Column],
delete_columns = Table.RemoveColumns(ReplaceBlankToNull, columns_to_delete)][delete_columns]
name the query as you wish. E.g. "fxRemoveBlankColumns". This is now a custom function that removes all null/blank columns from the table. In your "main" column make the following statement:
remove_blanks = Table.TransformColumns(ref_to_last_step, {"Content", fxRemoveBlankColumns})
Let us know if it worked.
Hello @Einomi
If my post helped you, please give me a 👍kudos and mark this post with Accept as Solution.
When it comes to my solutions, I don't focus 100% on performance optimization. I always try to develop a compromise between performance and "code understanding". @dufoq3´s code is probably more performant than my solution. Your "problem" is multiple file reads (unknown number of files). Therefore, I would generally advise you to choose a high-performance solution.
let
Source = Folder.Files(Ffiles),
FilterPDF = Table.SelectRows(Source, each ([Extension] = ".pdf")),
Content = Table.TransformColumns(FilterPDF, {"Content", each Pdf.Tables(_)}),
Columns = Table.SelectColumns(Content,{"Content"}),
Rows = Table.TransformColumns(Columns,{"Content", each Table.SelectRows(_, each [Kind] = "Page")}),
Combine = Table.TransformColumns(Rows, {"Content", each Table.Combine(_[Data])}),
Function = (table as table) as table =>
let
NonEmptyColumns = List.Select(Table.ColumnNames(table), each List.NonNullCount(Table.Column(table, _)) > 0),
Result = Table.SelectColumns(table, NonEmptyColumns)
in
Result,
ColumnsRemove = Table.TransformColumns(Combine, {"Content", each Function(_)}),
ColumnsList = List.Accumulate(ColumnsRemove[Content], {}, (state, current) => List.Union({state, Table.ColumnNames(current)})),
ColumnsNormalize = Table.TransformColumns(ColumnsRemove, {"Content", each Table.SelectColumns(_, List.Intersect({Table.ColumnNames(_), ColumnsList}))}),
CombinedTables = Table.Combine(ColumnsNormalize[Content]),
Headers = Table.PromoteHeaders(CombinedTables, [PromoteAllScalars=true])
in
Headers
Best regards from Germany
Manuel Bolz
🟦Follow me on LinkedIn
🟨How to Get Your Question Answered Quickly
🟩Fabric Community Conference
🟪My Solutions on Github
Hi @ManuelBolz
I would like to try your solution, but when I paste your M code into my Excel file, it just refreshing without stopping and I cannot see if this solution works or not, I have only 50 pdf files at the moment, each of them is less than 1 MB
If you have any clue 😉 except the source step I did not change anything
Hello @Einomi
in my solution the source step is relatively important. Your example shows that you are using a Power Query parameter here.
I didn't fully understand your error with my solution. Here are a few comments/ideas:
- Make sure the "Ffiles" parameter points to the correct folder,
- First add only 2 - 4 PDF files to the folder to test the code,
- Make sure none of these files are open (PFD viewer) while testing my solution. A file handler is usually active here and blocks access to the file.
If my post helped you, please give me a 👍kudos and mark this post with Accept as Solution.
Best regards from Germany
Manuel Bolz
🟦Follow me on LinkedIn
🟨How to Get Your Question Answered Quickly
🟩Fabric Community Conference
🟪My Solutions on Github
@ManuelBolz thanks for the comment
- I use a path in my query, I replaced it with a parameter for confidentialy purposes in the forum
- No PDF file is open
I will test again your solution and post the error message I get (it is French so I need to change the language of PQ then send the screenshot)
Hi @Einomi, you can delete blank columns this way:
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtIBIUcIpRSrAxMBIkNUEScgNsJQZIwq4ozNIBOwSCwA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t]),
// You can probably delete this step.
ReplaceBlankToNull = Table.TransformColumns(Source, {}, each if _ = "" then null else _),
RemoveBlankColumns = Table.SelectColumns(ReplaceBlankToNull,
[ a = Table.ColumnNames(ReplaceBlankToNull),
b = List.Select(a, (x)=> List.NonNullCount(Table.Column(ReplaceBlankToNull, x)) > 0)
][b] )
in
RemoveBlankColumns
or
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtIBIUcIpRSrAxMBIkNUEScgNsJQZIwq4ozNIBOwSCwA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, Column3 = _t, Column4 = _t, Column5 = _t]),
// You can probably delete this step.
ReplaceBlankToNull = Table.TransformColumns(Source, {}, each if _ = "" then null else _),
profile = Table.Buffer(Table.Profile(ReplaceBlankToNull)),
columns_to_delete = Table.SelectRows(profile, each [Count] = [NullCount])[Column],
delete_columns = Table.RemoveColumns(ReplaceBlankToNull, columns_to_delete)
in
delete_columns
Thanks @AlienSx but I must miss something, when I paste your code, I get a list of Column1, Column 2 and Column3 (or something similar I did not pay attention to the column numbers)
I want to delete the columns containing only null values from my nested tables that are in the "Content" column 😉
@Einomi my bad - wrong name after "in". I'll correct that soon. You may create blank query and place the following code in there (half of job is done by @dufoq3 - check his query as well, it's also works)
(tbl) =>
[ReplaceBlankToNull = Table.TransformColumns(tbl, {}, each if _ = "" then null else _),
profile = Table.Buffer(Table.Profile(ReplaceBlankToNull)),
columns_to_delete = Table.SelectRows(profile, each [Count] = [NullCount])[Column],
delete_columns = Table.RemoveColumns(ReplaceBlankToNull, columns_to_delete)][delete_columns]
name the query as you wish. E.g. "fxRemoveBlankColumns". This is now a custom function that removes all null/blank columns from the table. In your "main" column make the following statement:
remove_blanks = Table.TransformColumns(ref_to_last_step, {"Content", fxRemoveBlankColumns})
Let us know if it worked.
I understood what the issue was, I assumed that some columns are only null because when I have previewed my nested tables, I was quite sure that that the preview is really the full data because my pdf contains maximum 2 pages and there is not much in them
So, I duplicated my query and expanded one of the tables and ideed there was one row at the bottom which was not null...
I'll test it and will come back to you, thanks all for your time
Hi @dufoq3
thannks for your proposition, I have tried this and it does not seem to work, do I miss something ?
let
Source = Folder.Files(Ffiles),
FilterPDF = Table.SelectRows(Source, each ([Extension] = ".pdf")),
BinaryToTables = Table.TransformColumns(FilterPDF, {"Content", each Pdf.Tables(_)}),
KeepContent = Table.SelectColumns(BinaryToTables,{"Content"}),
KeepPages = Table.TransformColumns(KeepContent,{"Content", each Table.SelectRows(_, each [Kind] = "Page")}),
CombineData = Table.TransformColumns(KeepPages, {"Content", each Table.Combine(_[Data])}),
SkipInvoice = Table.TransformColumns(CombineData,{"Content", each Table.Skip(_, each [Column1] <> "Invoice")}),
ColumnCount = Table.AddColumn(SkipInvoice,"Columns", each List.Count (Table.ColumnNames([Content]))),
SortRows = Table.Sort(ColumnCount,{{"Columns", Order.Ascending}}),
BlankCol = Table.TransformColumns(SortRows, {"Content", each Table.SelectColumns(_,
[ a = Table.ColumnNames(_),
b = List.Select(a, (x)=> List.NonNullCount(Table.Column(_, x)) > 0)
][b] )})
in
BlankCol
I do not see your data, but maybe you don't have blank values as null. Try this:
let
Source = Folder.Files(Ffiles),
FilterPDF = Table.SelectRows(Source, each ([Extension] = ".pdf")),
BinaryToTables = Table.TransformColumns(FilterPDF, {"Content", each Pdf.Tables(_)}),
KeepContent = Table.SelectColumns(BinaryToTables,{"Content"}),
KeepPages = Table.TransformColumns(KeepContent,{"Content", each Table.SelectRows(_, each [Kind] = "Page")}),
CombineData = Table.TransformColumns(KeepPages, {"Content", each Table.Combine(_[Data])}),
SkipInvoice = Table.TransformColumns(CombineData,{"Content", each Table.Skip(_, each [Column1] <> "Invoice")}),
ColumnCount = Table.AddColumn(SkipInvoice,"Columns", each List.Count (Table.ColumnNames([Content]))),
SortRows = Table.Sort(ColumnCount,{{"Columns", Order.Ascending}}),
BlankCol = Table.TransformColumns(SortRows, {"Content", each Table.SelectColumns(_,
[ a = Table.ColumnNames(_),
b = List.Select(a, (x)=> List.NonNullCount(List.Select(Table.Column(_, x), (y)=> Text.Trim(y) <> "")) > 0)
][b] )})
in
BlankCol
I copied paste your query into my PQ, and unfortunately as shown in the picture, Column18 and Column20 are still around 🤔
The Power BI Data Visualization World Championships is back! Get ahead of the game and start preparing now!
| User | Count |
|---|---|
| 19 | |
| 9 | |
| 8 | |
| 7 | |
| 6 |