Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Earn a 50% discount on the DP-600 certification exam by completing the Fabric 30 Days to Learn It challenge.

Reply
AmirKatz
Advocate I
Advocate I

Pandas dataframe passed to Python script is corrupted

Hi,

I have a table that I download from a Salesforce report. One of the columns is the Saleforce case number and another one ('Defect')is a free text which may contain the string 'ABC-number', where number is 1-99999. There may be some other text before the string and/or after.

I wrote Python script that extracts the required string via a regexp and discards all other text, and puts it in a new column. 

This works fine in most cases. However, in a few rows of the table, the text contains newlines, like line 3 here:

SerialCase numberDefectComment
11234ABC-123no problem, ACBDEF-123
21235ABC-123,XYZ-123no newline
31277

ABC-201

DEF-777

1 newline
41345

ABC-786

no newline

 

When the text in Defect column contains one or more newline characters, the dataframe that is provided to the Python script looks like this:

Case numberDefectComment
1277ABC-201'Defect' column was truncated...
DEF-777null...Text after newline created this new dataframe row, with invalid data in 'Case Number' column and null in all other columns
1345ABC-786Good row, not corrupted

 

My guess is that there is a bug in the Power BI code that converts the table data to a Pandas dataframe and it barfs on newline characters.

 

Has anyone seen that?

 

Thanks,

Amir

1 ACCEPTED SOLUTION
AmirKatz
Advocate I
Advocate I

The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:

https://community.powerbi.com/t5/Issues/Python-pandas-R-dataframes-passed-to-Python-R-script-are/idc....

 

So the only solution is to "sanitize" the data before invoking Python or R.

To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.

 

#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})

View solution in original post

3 REPLIES 3
AmirKatz
Advocate I
Advocate I

The response from Microsoft is that Power BI was not designed to handle embedded newlines - see here:

https://community.powerbi.com/t5/Issues/Python-pandas-R-dataframes-passed-to-Python-R-script-are/idc....

 

So the only solution is to "sanitize" the data before invoking Python or R.

To do that, use the Table.ReplaceValue() in M Query, see below. All these actions can be invoked from the GUI.

 

#"Duplicated Column" = Table.DuplicateColumn(#"some-previous-step", "Defect", "Defect - Copy")
// Santize the new column
#"Replaced Value" = Table.ReplaceValue(#"Duplicated Column","#(lf)","___",Replacer.ReplaceText,{"Defect-Copy"}),#"Replaced Value1" = Table.ReplaceValue(#"Replaced Value","#(cr)","___",Replacer.ReplaceText,{"Defect-Copy"}),
// Must remove the original bad column
#"Removed Columns1" = Table.RemoveColumns(#"Replaced Value1",{"Defect"}),
// Rename the new column
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns1",{{"Defect-Copy", "DefectSanitized"}})
AmirKatz
Advocate I
Advocate I

I have submitted an issue, here 

AmirKatz
Advocate I
Advocate I

To add an interesting (and relevant) data point:

I fed the same table into an R script, using this simple-minded script (which does not change any data):

 

new_ds <- dataset
new_ds$DefectCopy <- new_ds$Defect

 

And the resulting dataset is similar:

1. The field 'Defect' now contains only the data before the first newline

2. New "dummy" rows are added after this row, each containing the part from the previous newline to the next.

3. In these "dummy" rows, all other columns, including 'Defect', are empty

 

So this adds weight to my theory that the Power Bi code that parses the table and generates a Python/R dataframe cannot handle newlines embedded in text columns.

Helpful resources

Announcements
LearnSurvey

Fabric certifications survey

Certification feedback opportunity for the community.

PBI_APRIL_CAROUSEL1

Power BI Monthly Update - April 2024

Check out the April 2024 Power BI update to learn about new features.

April Fabric Community Update

Fabric Community Update - April 2024

Find out what's new and trending in the Fabric Community.

Top Solution Authors
Top Kudoed Authors