Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Calling all Data Engineers! Fabric Data Engineer (Exam DP-700) live sessions are back! Starting October 16th. Sign up.

Reply
Userpath77
Helper II
Helper II

Transform PDF

Hello All---I have a challenge. I have a PDF document that spans a few hundred pages, but the text alignment is sporadic. The good news is the format is pretty much the same on each page. The challenge is I need to extract text from strings, create a column for it and then fill it in rows next to the numerical data. I've attached an example that shows how the PDF looks and what I need it to look like in Power Query. I can get the numbers aligned in the columns for Income, Disbursement and Profit. The issue is the "PM" and "BSA" text. I need to extract the PM and any 2 digits that follow and make that into a new column and fill for each row where there are  values. I need to do the same exact things for the BSA. The alignment is off on each page like shown on the example. Can this transformation be done?

 

Userpath77_0-1696426411421.png

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @Userpath77 

You can create a blank query and put the following code to Advanced Editor

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtJBQrE60UoBQf7uQY6+GOKYCn0VDAwVfD2dgwmrdQp2JKwIwjQyABKmIMLCAEnYBCRiDiIsUYQtQKpNQHKW+ExFdbeRgUKwh6tPmCum04EuVTA0VQjNS6Gqg2MB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, INCOME = _t, DISBURSEMENT = _t, REOFIT = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"INCOME", Int64.Type}, {"DISBURSEMENT", Int64.Type}, {"REOFIT", Int64.Type}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column2"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns",each [Column1],each if Text.Contains([Column1],"PM") or Text.Contains([Column1],"BSA") then [Column1] else "",Replacer.ReplaceValue,{"Column1"}),
    #"Removed Top Rows" = Table.Skip(#"Replaced Value",3),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Top Rows", {{"Column1", Splitter.SplitTextByDelimiter("#(tab)", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Added Index" = Table.AddIndexColumn(#"Split Column by Delimiter", "Index", 1, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"PM"))[Column1]) else null),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"BSA"))[Column1]) else null),
    #"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Column1"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns1",{"Custom", "Custom.1", "INCOME", "DISBURSEMENT", "REOFIT", "Index"}),
    #"Filtered Rows" = Table.SelectRows(#"Reordered Columns", each ([Custom] <> null)),
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows", "Custom", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1", "Custom.2", "Custom.3"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Custom.1.1", type text}, {"Custom.2", Int64.Type}, {"Custom.3", type text}}),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type1", "Custom.1", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1.1", "Custom.1.2", "Custom.1.3"}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Custom.1.1.1", type text}, {"Custom.1.2", Int64.Type}, {"Custom.1.3", type text}}),
    #"Removed Columns2" = Table.RemoveColumns(#"Changed Type2",{"Custom.3", "Custom.1.3"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"Custom.2", "PM"}, {"Custom.1.2", "BSA"}}),
    #"Removed Columns3" = Table.RemoveColumns(#"Renamed Columns",{"Custom.1.1", "Custom.1.1.1"})
in
    #"Removed Columns3"

vxinruzhumsft_0-1696570165133.png

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

1 REPLY 1
Anonymous
Not applicable

Hi @Userpath77 

You can create a blank query and put the following code to Advanced Editor

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtJBQrE60UoBQf7uQY6+GOKYCn0VDAwVfD2dgwmrdQp2JKwIwjQyABKmIMLCAEnYBCRiDiIsUYQtQKpNQHKW+ExFdbeRgUKwh6tPmCum04EuVTA0VQjNS6Gqg2MB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, INCOME = _t, DISBURSEMENT = _t, REOFIT = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"INCOME", Int64.Type}, {"DISBURSEMENT", Int64.Type}, {"REOFIT", Int64.Type}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column2"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Columns",each [Column1],each if Text.Contains([Column1],"PM") or Text.Contains([Column1],"BSA") then [Column1] else "",Replacer.ReplaceValue,{"Column1"}),
    #"Removed Top Rows" = Table.Skip(#"Replaced Value",3),
    #"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Top Rows", {{"Column1", Splitter.SplitTextByDelimiter("#(tab)", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Added Index" = Table.AddIndexColumn(#"Split Column by Delimiter", "Index", 1, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"PM"))[Column1]) else null),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"BSA"))[Column1]) else null),
    #"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Column1"}),
    #"Reordered Columns" = Table.ReorderColumns(#"Removed Columns1",{"Custom", "Custom.1", "INCOME", "DISBURSEMENT", "REOFIT", "Index"}),
    #"Filtered Rows" = Table.SelectRows(#"Reordered Columns", each ([Custom] <> null)),
    #"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows", "Custom", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1", "Custom.2", "Custom.3"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Custom.1.1", type text}, {"Custom.2", Int64.Type}, {"Custom.3", type text}}),
    #"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type1", "Custom.1", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1.1", "Custom.1.2", "Custom.1.3"}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Custom.1.1.1", type text}, {"Custom.1.2", Int64.Type}, {"Custom.1.3", type text}}),
    #"Removed Columns2" = Table.RemoveColumns(#"Changed Type2",{"Custom.3", "Custom.1.3"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"Custom.2", "PM"}, {"Custom.1.2", "BSA"}}),
    #"Removed Columns3" = Table.RemoveColumns(#"Renamed Columns",{"Custom.1.1", "Custom.1.1.1"})
in
    #"Removed Columns3"

vxinruzhumsft_0-1696570165133.png

Best Regards!

Yolo Zhu

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

Helpful resources

Announcements
FabCon Global Hackathon Carousel

FabCon Global Hackathon

Join the Fabric FabCon Global Hackathon—running virtually through Nov 3. Open to all skill levels. $10,000 in prizes!

October Power BI Update Carousel

Power BI Monthly Update - October 2025

Check out the October 2025 Power BI update to learn about new features.

FabCon Atlanta 2026 carousel

FabCon Atlanta 2026

Join us at FabCon Atlanta, March 16-20, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.

Top Kudoed Authors