nimblecat
Frequent Visitor

Extracting PDF tables and appending when some column names are not the same

Hello,

I am trying to expand multiple binaries (PDF tables) in multiple nested folders so that the table content from these PDF files is usable as data points. I've set up a custom function with a sample file, but the problem is that not all of these PDFs have the same format, although the majority of them contain the same column names.

 

Here are the challenges:

  1. Column names may differ between PDF files (the majority are the same)
  2. Some column names are repeated (in the original PDF file, a column header may be stored in two cells stacked on top of each other)
  3. The number of columns differs between PDF files. Some files contain 
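
For illustration, challenges 2 and 3 can be sketched in pandas. This is only a rough sketch under assumptions: the two small tables are hypothetical stand-ins for parsed PDF files, and `dedupe_columns` is a helper made up here, not part of pandas. It makes repeated names unique first, since aligning tables on column names generally needs unique names, and then appends the tables so that columns missing from a file are filled with NaN.

```python
import pandas as pd


def dedupe_columns(names):
    """Make repeated column names unique by appending .1, .2, ... to repeats.

    Hypothetical helper; it does not guard against a name such as
    "Content.1" already being present in the input.
    """
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return unique


# Hypothetical tables standing in for two parsed PDF files.
first = pd.DataFrame([[1, "a", "b"]], columns=["Customer Name", "Content", "Content"])
second = pd.DataFrame([[2, "x"]], columns=["Customer Name", "Folder Path"])

tables = [first, second]
for t in tables:
    t.columns = dedupe_columns(t.columns)

# concat aligns on column names; a column missing from a table becomes NaN.
combined = pd.concat(tables, ignore_index=True, sort=False)
print(combined)
```

The renaming step from the question would run before the append, so that equivalent columns end up under one name.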

 

I want to write an M query with logic like this:

* If column X exists in the table read from the PDF file, rename it to XX; otherwise leave it unchanged

 

In Python this would look something like:

import pandas as pd

# VARS defined:
table = pd.DataFrame({"Customer Name": [1,2,3,4], "Content": ["s","a","3","4"], "Not exist": ["s","a","3","4"]})
a = ['Content',
 'Filename',
 'Extension',
 'Date accessed',
 'Date modified',
 'Date created',
 'Folder Path',
 'Customer Name',
 'Assumed Eff. Date']
b = ['Content Temp',
 'Filename Temp',
 'Extension Temp',
 'Date accessed Temp',
 'Date modified Temp',
 'Date created Temp',
 'Folder Path Temp',
 'Customer Name Temp',
 'Assumed Eff. Date Temp']
# Map each original column name to its replacement
c = dict(zip(a,b))

# CODE
# Rename a column only if it appears in the mapping; otherwise keep its name
mylist = []
for i in table.columns:
    if i in c:
        mylist.append(c[i])
    else:
        mylist.append(i)
table.columns = mylist

Is this possible? 

 

Thank you!!

 

 

1 ACCEPTED SOLUTION
v-jingzhang
Community Support

Hi @nimblecat 

 

Here is my solution for transforming the column names. You can download the pbix at the bottom to see the details.

 

TempNameTable

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45Wcs7PK0nNK1GK1YlWcsvMSc1LzE0Fc1wrgOLFmfl5YJ5zaXFJfm5qkYIfWD4WAA==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Name = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Name", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "TempName", each [Name] & " Temp")
in
    #"Added Custom"

 

Table

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUSoG4jSl2FgA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Customer Name" = _t, Content = _t, #"Not exist" = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Customer Name", Int64.Type}, {"Content", type text}, {"Not exist", type text}}),
    Name1 = TempNameTable[Name],
    Name2 = TempNameTable[TempName],
    ChangeColumnName = Table.TransformColumnNames(
        #"Changed Type",
        // Rename a column only when it appears in Name1; otherwise keep its name
        each if List.Contains(Name1, _) then Name2{List.PositionOf(Name1, _)} else _
    )
in
    ChangeColumnName
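
For comparison with the Python in the question, the lookup done in ChangeColumnName corresponds roughly to a dictionary-based rename. The sketch below is only an illustration: `name1` and `name2` are assumed stand-ins for TempNameTable[Name] and TempNameTable[TempName], and the actual contents of that table may differ. It relies on pandas' `DataFrame.rename`, which by default leaves any column not found in the mapping unchanged.

```python
import pandas as pd

# Assumed stand-ins for TempNameTable[Name] and TempNameTable[TempName].
name1 = ["Content", "Filename", "Customer Name"]
name2 = [n + " Temp" for n in name1]
mapping = dict(zip(name1, name2))

table = pd.DataFrame({"Customer Name": [1], "Content": ["s"], "Not exist": ["s"]})

# rename only touches labels that are keys in the mapping;
# "Not exist" has no entry, so it passes through unchanged.
renamed = table.rename(columns=mapping)
print(list(renamed.columns))
```

Mapping keys that are absent from the table, such as "Filename" here, are skipped rather than raising an error, which matches the "else pass" behaviour asked for.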

 

Best Regards,
Community Support Team _ Jing
If this post helps, please Accept it as Solution to help other members find it.

Thank you! Very helpful!
