Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.
Hi,
I have a table that I download from a Salesforce report. One of the columns is the Saleforce case number and another one ('Defect')is a free text which may contain the string 'ABC-number', where number is 1-99999. There may be some other text before the string and/or after.
I wrote Python script that extracts the required string via a regexp and discards all other text, and puts it in a new column.
This works fine in most cases. However, in a few rows of the table, the text contains newlines, like line 3 here:
Serial | Case number | Defect | Comment |
1 | 1234 | ABC-123 | no problem, ACBDEF-123 |
2 | 1235 | ABC-123,XYZ-123 | no newline |
3 | 1277 | ABC-201 DEF-777 | 1 newline |
4 | 1345 | ABC-786 | no newline |
When the text in Defect column contains one or more newline characters, the dataframe that is provided to the Python script looks like this:
Case number | Defect | Comment |
1277 | ABC-201 | 'Defect' column was truncated... |
DEF-777 | null | ...Text after newline created this new dataframe row, with invalid data in 'Case Number' column and null in all other columns |
1345 | ABC-786 | Good row, not corrupted |
My guess is that there is a bug in the Power BI code that converts the table data to a Pandas dataframe and it barfs on newline characters.
Has anyone seen that?
Thanks,
Amir
Solved! Go to Solution.
The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:
So the only solution is to "sanitize" the data before invoking Python or R.
To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.
#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})
The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:
So the only solution is to "sanitize" the data before invoking Python or R.
To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.
#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})
To add an interesting (and relevant) data point:
I fed the same table into an R script, using this simple-minded script (which does not change any data):
new_ds <- dataset
new_ds$DefectCopy <- new_ds$Defect
And the resulting dataset is similar:
1. The field 'Defect' now contains only the data before the first newline
2. New "dummy" rows are added after this row, each containing the part from the previous newline to the next.
3. In these "dummy" rows, all other columns, including 'Defect', are empty
So this adds weight to my theory that the Power Bi code that parses the table and generates a Python/R dataframe cannot handle newlines embedded in text columns.