Solved: Categorisation through keywords slow

Nad_PBI_21 · ‎02-16-2024

Hi,

I'm looking for a more efficient method to categorize dates based on a reference table that contains keywords.

In the query below, I search for keywords (Booking and Partner) twice and then merge to retrieve specific columns from the reference table. However, any modification is quite slow, even simple tasks like changing the format. The left pane indicates a high amount of MB loading. I've applied some filters to reduce the number of lines and also buffers.

Would anyone know of a smarter approach to achieve the same results?

Thank you very much,

Nad

Steps:

Added Booking keyword: This column is created by searching for matches between the "Booking keyword" column in a reference table named REF_KEY and the "Buchungstext" column in the Source table. If a match is found, the value from the REF_KEY table is added to the new column.

Added Partner keyword: Similar to the previous step,

…

Merged Queries: Performs a left outer join between the Source table and the REF_KEY table based on three columns: "Kostenart", "Match Booking keyword", and "Match Partner keyword".

…

Query

let

// Rename the TRANSFORM table as Source

Source = TRANSFORM,

// Add custom column for Booking keyword match

#"added Booking keyword" = Table.AddColumn(bufferedSource, "Match Booking keyword", each

let

// Filter REF_KEY table and buffer it

filtered_REF_KEY = Table.Buffer(Table.SelectColumns(REF_KEY, {"Kostenart", "Booking keyword"})),

condition1 = [Kostenart],

filteredCondition = Table.SelectRows(filtered_REF_KEY, each [Kostenart] = condition1),

whatSearch = List.RemoveNulls(filteredCondition[Booking keyword]),

whereSearch = [Buchungstext], // Assuming [Buchungstext] is a column in Source

matches = List.Select(whatSearch, each Text.Contains(Text.Lower(whereSearch), Text.Lower(_)))

in

if List.IsEmpty(matches) then null else matches{0}

),

// Add custom column for Partner keyword match

#"added Partner keyword" = Table.AddColumn(#"added Booking keyword", "Match Partner keyword", each

let

// Filter REF_KEY table and buffer it

filtered_REF_KEY = Table.Buffer(Table.SelectColumns(REF_KEY, {"Kostenart", "Partner keyword"})),

condition1 = [Kostenart],

filteredCondition = Table.SelectRows(filtered_REF_KEY, each [Kostenart] = condition1),

whatSearch = List.RemoveNulls(filteredCondition[Partner keyword]),

whereSearch = [Partner],

matches = List.Select(whatSearch, each Text.Contains(Text.Lower(whereSearch), Text.Lower(_)))

in

if List.IsEmpty(matches) then null else matches{0}

),

#"Inserted Year" = Table.AddColumn(#"added Partner keyword", "Year", each Date.Year([Belegdatum]), Int64.Type),

#"Merged Queries" = Table.NestedJoin(#"Inserted Year", {"Kostenart", "Match Booking keyword", "Match Partner keyword"}, REF_KEY, {"Kostenart", "Booking keyword", "Partner keyword"}, "REF_KEY", JoinKind.LeftOuter),

#"Expanded REF_KEY" = Table.ExpandTableColumn(#"Merged Queries", "REF_KEY", {"Assumption", "Em. fact. key"}, {"Assumption", "Em. fact. key"})

in

#"Expanded REF_KEY"

Anonymous · ‎02-18-2024

Hi @Nad_PBI_21 ,

I can give you some suggestions:

1. Minimize the use of 'Table.Buffer'. While buffering can improve performance by reducing the number of times a source is accessed, overusing it, especially with large tables, can lead to high memory consumption. Consider buffering only the smallest necessary tables or the ones that are used multiple times in subsequent transformations.
2. Instead of performing the keyword search within an `each` function for every row, you could try to transform your reference table into a list of keywords and their corresponding categories once, and then use this list to categorize your main table. This way, you perform the heavy lifting of creating the keyword-category list just once.
3. Use 'Table.Join' instead of 'Table.NestedJoin' when possible. If you don't need nested tables and only require a few columns from the reference table, a regular join might be more efficient.
4. Ensure that the columns used for joins and searches have the appropriate data types. Text operations can be slow on numeric or date fields if they are not properly typed.

Best Regards,
Dino Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

View solution in original post

Anonymous · ‎02-18-2024

Hi @Nad_PBI_21 ,

I can give you some suggestions:

1. Minimize the use of 'Table.Buffer'. While buffering can improve performance by reducing the number of times a source is accessed, overusing it, especially with large tables, can lead to high memory consumption. Consider buffering only the smallest necessary tables or the ones that are used multiple times in subsequent transformations.
2. Instead of performing the keyword search within an `each` function for every row, you could try to transform your reference table into a list of keywords and their corresponding categories once, and then use this list to categorize your main table. This way, you perform the heavy lifting of creating the keyword-category list just once.
3. Use 'Table.Join' instead of 'Table.NestedJoin' when possible. If you don't need nested tables and only require a few columns from the reference table, a regular join might be more efficient.
4. Ensure that the columns used for joins and searches have the appropriate data types. Text operations can be slow on numeric or date fields if they are not properly typed.

Best Regards,
Dino Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.