
ToddChitt
Super User

PySpark to rename a column in a dataframe

Hello. I am quite new to Spark Notebooks. I am using one to extract JSON data and save it to tables in a Lakehouse. It works, but there are some slight issues. The data, being JSON, has nested objects. I have included a screenshot here to highlight my issues.

 

[Screenshot: ToddChitt_0-1710266541471.png]

I am starting with a data frame that reads the entire JSON file, but some of its fields contain nested objects. So I have a second data frame that selects elements from the first using something like this:

df2 = df1.select("Id", "EmployeeNumber", ..., "PositionData.Manager.Id", ..., "WorkLocation.Id", ...)

But the second and third "Id" columns, the ones coming from the nested objects, also come out named just "Id", so I now have THREE columns named "Id" in the data frame. I want to rename the second and third ones to "Manager_Id" and "WorkLocation_Id", respectively.
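
To make it concrete, here is a minimal sketch that reproduces the problem. The JSON below is a made-up stand-in for the real file, just enough to show the shape of the data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Tiny, made-up stand-in for the real JSON: nested objects only, no arrays
sample = ['{"Id": 1, "EmployeeNumber": "E100", "PositionData": {"Manager": {"Id": 42}}, "WorkLocation": {"Id": 7}}']
df1 = spark.read.json(spark.sparkContext.parallelize(sample))

# Selecting the nested fields drops the parent names, so all three come back as "Id"
df2 = df1.select("Id", "EmployeeNumber", "PositionData.Manager.Id", "WorkLocation.Id")
df2.printSchema()   # Id, EmployeeNumber, Id, Id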

 

I want to flatten the entire JSON file (there are no nested arrays, just nested objects) such that I have the original Id (for the Employee), the Manager Id, and the Work Location Id.

 

I tried the data frame withColumnRenamed method, but it renames every column named Id.

If this were SQL I could write it as: SELECT ... "PositionData.Manager.Id" AS [Manager_Id] ...

Is there a way to rename a column inline in a dataframe select operation? Or is there another/better option?

 

Thanks in advance

 

 




Did I answer your question? If so, mark my post as a solution. Also consider helping someone else in the forums!

Proud to be a Super User!





1 ACCEPTED SOLUTION
AndyDDC
Solution Sage

Hi @ToddChitt, can you try aliasing the column when using .select?

 

df_renamed = df.select(col("Name").alias("EmployeeName"), col("Department").alias("Dept")) 
df_renamed.show() 


3 REPLIES
ToddChitt
Super User

Hello @AndyDDC and thank you for the reply. 

I tried your suggestion but it generated an error: ...name 'col' is not defined.

But from this website, PySpark alias() Column & DataFrame Examples - Spark By {Examples} (sparkbyexamples.com) (which I think is about to become my new best friend 🙂), I added this line of code at the top of the block:

from pyspark.sql.functions import col
And that fixed it.
Another example from the site shows this syntax will work without the import statement above:
df.select(df.Id.alias("Employee_Id"), ...)
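
Putting it all together, my select now looks roughly like this (the nested paths are the ones from my original post; the other columns are left out):

from pyspark.sql.functions import col

df2 = df1.select(
    col("Id"),
    col("EmployeeNumber"),
    # ... other top-level columns omitted ...
    col("PositionData.Manager.Id").alias("Manager_Id"),
    col("WorkLocation.Id").alias("WorkLocation_Id"),
)
df2.printSchema()   # Id, EmployeeNumber, Manager_Id, WorkLocation_Id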
 
Thanks for your help.
 



Did I answer your question? If so, mark my post as a solution. Also consider helping someone else in the forums!

Proud to be a Super User!





AndyDDC
Solution Sage

Great to hear. And yes, the Spark By Example website is awesome!

