dnauflett
Frequent Visitor

Notebook PySpark question from a newbie (applying data from another DataFrame in a notebook)

I'm probably going to have a series of questions like this!! I'm going to baby-step this. The 50,000-foot view of what I'm trying to do: I have 524 files that I need to load into a Fabric Warehouse every day. The csv files themselves do not have headers. I have figured out how to get the headers I need. One of those files contains the business date of the data. I want to grab the date from that one file and then add that date as a column to all of the other files. For example:

 

File 1 (I was able to load it as a DataFrame and rename the columns). I need the 3rd column:

BGAAKY | BGBTDT   | BGBGDT   | BGETDT   | BGEGDT
AA     | 20240721 | 20240719 | 20240721 | 20240719

Files 2-524 (files with various columns and numbers of records; I have the metadata for each file):

XXXXX | SomeData | 324 | cde | 2024
XXXXV | SomeData | 234 | cdf | 2024
XXXXH | SomeData | 234 | cdd | 2024

Output result (I want to add a column at the beginning with the date from File 1, as a date; that value currently loads as a string):

BusinessEffectiveDate | Field1 | Field2   | Field3 | Field4 | Field5
07-19-2024            | XXXXX  | SomeData | 324    | cde    | 2024
07-19-2024            | XXXXV  | SomeData | 234    | cdf    | 2024
07-19-2024            | XXXXH  | SomeData | 234    | cdd    | 2024

I tried to keep it simple. So: I'm trying to pull field 3 from File 1, where the data is in yyyyMMdd format, and then add that date to all the other files as the first field.

 

Here is some code I have been playing with. I can load the file and rename the columns, but I could not figure out how to extract the column value and then use it on the other tables.
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Read the headerless CSVs; Spark assigns default column names (_c0, _c1, ...).
df = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PBATCHCT")
df_pnote = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PNOTES")

# Rename the default columns to the real field names.
oldColumns = df.schema.names
newColumns = ["BGAAKY", "BGBTDT", "BGBGDT", "BGETDT", "BGEGDT"]
df = reduce(
    lambda acc, idx: acc.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df,
)

df.printSchema()

def clean_data(df):
    # Derive column 'BusinessEffectiveDate' (placeholder for now; the real
    # value should eventually come from column 'BGBGDT').
    df = df.withColumn('BusinessEffectiveDate', F.lit(None).cast(T.DateType()))
    # df = df.select(*(df.columns[:3] + ['BusinessEffectiveDate'] + df.columns[3:]))
    return df

df_clean = clean_data(df)
display(df_clean)
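
Side note: assuming the number of new names matches the number of columns, the rename loop above can also be collapsed into a single standard PySpark call:

# One-step rename: replaces all default column names (_c0, _c1, ...) at once.
df = df.toDF("BGAAKY", "BGBTDT", "BGBGDT", "BGETDT", "BGEGDT")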

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @dnauflett,

You can take a look at the following code, which uses the df.collect() and df.withColumn() functions to achieve your requirement:

 

 

# Import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a sample DataFrame
data = [("A", 34), ("B", 45), ("C", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
display(df)

# get the cell value from the DataFrame: second row, first column
cellValue = df.collect()[1][0]

# load new data
df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")

# Add a new column with the extracted value using lit
df = df.withColumn("NewColumn", lit(cellValue))

display(df)
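
Applied to your files, the same pattern looks roughly like this. This is a minimal sketch using the paths from your post; it assumes the business date is the 3rd column of the first row of PBATCHCT, as a yyyyMMdd string, and casts it to a real date with to_date:

from pyspark.sql.functions import lit, to_date

# File 1 is headerless, so Spark assigns _c0, _c1, ...; the business date
# is the 3rd column of the first row, as a yyyyMMdd string (e.g. "20240719").
df_batch = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PBATCHCT")
business_date = df_batch.first()[2]

# One of the other headerless files (repeat the same steps for files 2-524).
df_pnote = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PNOTES")

# Cast the yyyyMMdd string to a DateType while adding the column,
# then reorder with select so BusinessEffectiveDate comes first.
df_pnote = df_pnote.withColumn(
    "BusinessEffectiveDate", to_date(lit(business_date), "yyyyMMdd")
)
df_pnote = df_pnote.select(
    "BusinessEffectiveDate",
    *[c for c in df_pnote.columns if c != "BusinessEffectiveDate"],
)

display(df_pnote)

Using first() instead of collect() avoids pulling every row back to the driver when you only need a single cell.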

 

 


Spark dataframe: collect () vs select () - Stack Overflow

Regards,

Xiaoxin Sheng


dnauflett
Frequent Visitor

Thank you so much, Xiaoxin!!! You have really helped advance my knowledge and training in PySpark.

