The ultimate Fabric, Power BI, SQL, and AI community-led learning event. Save €200 with code FABCOMM.
Get registeredEnhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends September 15. Request your voucher.
Hello All---I have a challenge. I have a PDF document that spans a few hundred pages, but the text alignment is sporadic. The good news is the format is pretty much the same on each page. The challenge is I need to extract text from strings, create a column for it and then fill it in rows next to the numerical data. I've attached an example that shows how the PDF looks and what I need it to look like in Power Query. I can get the numbers aligned in the columns for Income, Disbursement and Profit. The issue is the "PM" and "BSA" text. I need to extract the PM and any 2 digits that follow and make that into a new column and fill for each row where there are values. I need to do the same exact things for the BSA. The alignment is off on each page like shown on the example. Can this transformation be done?
Solved! Go to Solution.
Hi @Userpath77
You can create a blank query and put the following code to Advanced Editor
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtJBQrE60UoBQf7uQY6+GOKYCn0VDAwVfD2dgwmrdQp2JKwIwjQyABKmIMLCAEnYBCRiDiIsUYQtQKpNQHKW+ExFdbeRgUKwh6tPmCum04EuVTA0VQjNS6Gqg2MB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, INCOME = _t, DISBURSEMENT = _t, REOFIT = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"INCOME", Int64.Type}, {"DISBURSEMENT", Int64.Type}, {"REOFIT", Int64.Type}}),
#"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column2"}),
#"Replaced Value" = Table.ReplaceValue(#"Removed Columns",each [Column1],each if Text.Contains([Column1],"PM") or Text.Contains([Column1],"BSA") then [Column1] else "",Replacer.ReplaceValue,{"Column1"}),
#"Removed Top Rows" = Table.Skip(#"Replaced Value",3),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Top Rows", {{"Column1", Splitter.SplitTextByDelimiter("#(tab)", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Added Index" = Table.AddIndexColumn(#"Split Column by Delimiter", "Index", 1, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"PM"))[Column1]) else null),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"BSA"))[Column1]) else null),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Column1"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Columns1",{"Custom", "Custom.1", "INCOME", "DISBURSEMENT", "REOFIT", "Index"}),
#"Filtered Rows" = Table.SelectRows(#"Reordered Columns", each ([Custom] <> null)),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows", "Custom", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1", "Custom.2", "Custom.3"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Custom.1.1", type text}, {"Custom.2", Int64.Type}, {"Custom.3", type text}}),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type1", "Custom.1", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1.1", "Custom.1.2", "Custom.1.3"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Custom.1.1.1", type text}, {"Custom.1.2", Int64.Type}, {"Custom.1.3", type text}}),
#"Removed Columns2" = Table.RemoveColumns(#"Changed Type2",{"Custom.3", "Custom.1.3"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"Custom.2", "PM"}, {"Custom.1.2", "BSA"}}),
#"Removed Columns3" = Table.RemoveColumns(#"Renamed Columns",{"Custom.1.1", "Custom.1.1.1"})
in
#"Removed Columns3"
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @Userpath77
You can create a blank query and put the following code to Advanced Editor
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WUtJBQrE60UoBQf7uQY6+GOKYCn0VDAwVfD2dgwmrdQp2JKwIwjQyABKmIMLCAEnYBCRiDiIsUYQtQKpNQHKW+ExFdbeRgUKwh6tPmCum04EuVTA0VQjNS6Gqg2MB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t, Column2 = _t, INCOME = _t, DISBURSEMENT = _t, REOFIT = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}, {"Column2", type text}, {"INCOME", Int64.Type}, {"DISBURSEMENT", Int64.Type}, {"REOFIT", Int64.Type}}),
#"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"Column2"}),
#"Replaced Value" = Table.ReplaceValue(#"Removed Columns",each [Column1],each if Text.Contains([Column1],"PM") or Text.Contains([Column1],"BSA") then [Column1] else "",Replacer.ReplaceValue,{"Column1"}),
#"Removed Top Rows" = Table.Skip(#"Replaced Value",3),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Top Rows", {{"Column1", Splitter.SplitTextByDelimiter("#(tab)", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
#"Added Index" = Table.AddIndexColumn(#"Split Column by Delimiter", "Index", 1, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"PM"))[Column1]) else null),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if [INCOME]<>null and [DISBURSEMENT]<>null and [REOFIT]<>null then List.Max(Table.SelectRows(#"Added Index",(x)=>x[Index] <[Index] and Text.Contains(x[Column1],"BSA"))[Column1]) else null),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Column1"}),
#"Reordered Columns" = Table.ReorderColumns(#"Removed Columns1",{"Custom", "Custom.1", "INCOME", "DISBURSEMENT", "REOFIT", "Index"}),
#"Filtered Rows" = Table.SelectRows(#"Reordered Columns", each ([Custom] <> null)),
#"Split Column by Delimiter1" = Table.SplitColumn(#"Filtered Rows", "Custom", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1", "Custom.2", "Custom.3"}),
#"Changed Type1" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Custom.1.1", type text}, {"Custom.2", Int64.Type}, {"Custom.3", type text}}),
#"Split Column by Delimiter2" = Table.SplitColumn(#"Changed Type1", "Custom.1", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Custom.1.1.1", "Custom.1.2", "Custom.1.3"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Split Column by Delimiter2",{{"Custom.1.1.1", type text}, {"Custom.1.2", Int64.Type}, {"Custom.1.3", type text}}),
#"Removed Columns2" = Table.RemoveColumns(#"Changed Type2",{"Custom.3", "Custom.1.3"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns2",{{"Custom.2", "PM"}, {"Custom.1.2", "BSA"}}),
#"Removed Columns3" = Table.RemoveColumns(#"Renamed Columns",{"Custom.1.1", "Custom.1.1.1"})
in
#"Removed Columns3"
Best Regards!
Yolo Zhu
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.