
ToddChitt
Super User

PySpark to rename a column in a dataframe

Hello. I am quite new to Spark Notebooks. I am using one to extract JSON data and save it to tables in a Lakehouse. It works, but there are some slight issues. The data, being JSON, has nested objects. I have included a screenshot here to highlight my issues.

 

[Screenshot: ToddChitt_0-1710266541471.png]

I am starting with a data frame that reads the entire JSON file, but some of its fields contain nested objects. So I have a second data frame that selects elements from the first using something like this:

df2 = df1.select("Id", "EmployeeNumber", ..., "PositionData.Manager.Id", ..., "WorkLocation.Id", ...)

But the second and third "Id" columns, the ones coming from the nested objects, also come out named just "Id", so I now have THREE columns named "Id" in the data frame. I want to rename the second and third ones to "Manager_Id" and "WorkLocation_Id", respectively.
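
To make it concrete, here is a minimal sketch that reproduces the problem. The JSON below is a made-up stand-in for the real file, just enough to show the shape of the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Tiny, made-up stand-in for the real JSON: nested objects only, no arrays
sample = ['{"Id": 1, "EmployeeNumber": "E100", "PositionData": {"Manager": {"Id": 42}}, "WorkLocation": {"Id": 7}}']
df1 = spark.read.json(spark.sparkContext.parallelize(sample))

# Selecting the nested fields drops the parent names, so all three come back as "Id"
df2 = df1.select("Id", "EmployeeNumber", "PositionData.Manager.Id", "WorkLocation.Id")
df2.printSchema()   # Id, EmployeeNumber, Id, Id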

 

I want to flatten the entire JSON file (there are no nested arrays, just nested objects) such that I have the original Id (for the Employee), the Manager Id, and the Work Location Id.

 

I tried the data frame withColumnRenamed method, but it renames every column named Id.

If this were SQL I could write it as: SELECT ... "PositionData.Manager.Id" AS [Manager_Id] ...

Is there a way to rename a column inline in a dataframe select operation? Or is there another/better option?

 

Thanks in advance

 

 




Did I answer your question? If so, mark my post as a solution. Also consider helping someone else in the forums!

Proud to be a Super User!





1 ACCEPTED SOLUTION
AndyDDC
Solution Sage

Hi @ToddChitt, can you try aliasing the column when using .select?

 

df_renamed = df.select(col("Name").alias("EmployeeName"), col("Department").alias("Dept")) 
df_renamed.show() 


3 REPLIES
ToddChitt
Super User

Hello @AndyDDC and thank you for the reply. 

I tried your suggestion but it generated an error: ...name 'col' is not defined.

But from this website, PySpark alias() Column & DataFrame Examples - Spark By {Examples} (sparkbyexamples.com) (which I think is about to become my new best friend 🙂), I added this line of code at the top of the block:

from pyspark.sql.functions import col
And that fixed it.
Another example from the site shows this syntax will work without the import statement above:
df.select(df.Id.alias("Employee_Id"), ...)
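
Putting it all together, my select now looks roughly like this (the nested paths are the ones from my original post; the other columns are left out):

from pyspark.sql.functions import col

df2 = df1.select(
    col("Id"),
    col("EmployeeNumber"),
    # ... other top-level columns omitted ...
    col("PositionData.Manager.Id").alias("Manager_Id"),
    col("WorkLocation.Id").alias("WorkLocation_Id"),
)
df2.printSchema()   # Id, EmployeeNumber, Manager_Id, WorkLocation_Id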
 
Thanks for your help.
 



Did I answer your question? If so, mark my post as a solution. Also consider helping someone else in the forums!

Proud to be a Super User!





AndyDDC
Solution Sage

Great to hear. And yes, the Spark By Example website is awesome!

