Solved: Pandas dataframe passed to Python script is corrup...

Anonymous · ‎01-11-2023

Hi,

I have a table that I download from a Salesforce report. One of the columns is the Saleforce case number and another one ('Defect')is a free text which may contain the string 'ABC-number', where number is 1-99999. There may be some other text before the string and/or after.

I wrote Python script that extracts the required string via a regexp and discards all other text, and puts it in a new column.

This works fine in most cases. However, in a few rows of the table, the text contains newlines, like line 3 here:

Serial	Case number	Defect	Comment
1	1234	ABC-123	no problem, ACBDEF-123
2	1235	ABC-123,XYZ-123	no newline
3	1277	ABC-201 DEF-777	1 newline
4	1345	ABC-786	no newline

When the text in Defect column contains one or more newline characters, the dataframe that is provided to the Python script looks like this:

Case number	Defect	Comment
1277	ABC-201	'Defect' column was truncated...
DEF-777	null	...Text after newline created this new dataframe row, with invalid data in 'Case Number' column and null in all other columns
1345	ABC-786	Good row, not corrupted

My guess is that there is a bug in the Power BI code that converts the table data to a Pandas dataframe and it barfs on newline characters.

Has anyone seen that?

Thanks,

Amir

Anonymous · ‎01-17-2023

The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:

https://community.powerbi.com/t5/Issues/Python-pandas-R-dataframes-passed-to-Python-R-script-are/idc....

So the only solution is to "sanitize" the data before invoking Python or R.

To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.

#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})

View solution in original post

Anonymous · ‎01-17-2023

The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:

https://community.powerbi.com/t5/Issues/Python-pandas-R-dataframes-passed-to-Python-R-script-are/idc....

So the only solution is to "sanitize" the data before invoking Python or R.

To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.

#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})

Anonymous · ‎01-12-2023

I have submitted an issue, here

Anonymous · ‎01-12-2023

To add an interesting (and relevant) data point:

I fed the same table into an R script, using this simple-minded script (which does not change any data):

new_ds <- dataset
new_ds$DefectCopy <- new_ds$Defect

And the resulting dataset is similar:

1. The field 'Defect' now contains only the data before the first newline

2. New "dummy" rows are added after this row, each containing the part from the previous newline to the next.

3. In these "dummy" rows, all other columns, including 'Defect', are empty

So this adds weight to my theory that the Power Bi code that parses the table and generates a Python/R dataframe cannot handle newlines embedded in text columns.