dnauflett
Frequent Visitor

Notebook PySpark question from a newbie (applying data from another DataFrame in a notebook)

I'm probably going to have a series of questions like this!! I'm going to baby-step this. The 50,000-foot view of what I'm trying to do: I have 524 files that I need to load into a Fabric Warehouse every day. The csv files themselves do not have headers. I have figured out how to get the headers I need. One of those files contains the business date of the data. I want to grab the date from that one file and then add that date as a column to all of the other files. For example:

 

File 1 (I was able to load it as a DataFrame and rename the columns). I need the 3rd column:

BGAAKY | BGBTDT   | BGBGDT   | BGETDT   | BGEGDT
AA     | 20240721 | 20240719 | 20240721 | 20240719

Files 2-524 (files with various columns and numbers of records; I have the metadata for each file):

XXXXX | SomeData | 324 | cde | 2024
XXXXV | SomeData | 234 | cdf | 2024
XXXXH | SomeData | 234 | cdd | 2024

Output result (I want to add a column at the beginning with the date from File 1, as a date; that value currently loads as a string):

BusinessEffectiveDate | Field1 | Field2   | Field3 | Field4 | Field5
07-19-2024            | XXXXX  | SomeData | 324    | cde    | 2024
07-19-2024            | XXXXV  | SomeData | 234    | cdf    | 2024
07-19-2024            | XXXXH  | SomeData | 234    | cdd    | 2024

I tried to keep it simple. So: I'm trying to pull field 3 from File 1, where the data is in yyyyMMdd format, and then add that date to all the other files as the first field.

 

Here is some code I have been playing with. I can load the file and rename the columns, but I could not figure out how to extract the column value and then use it on the other tables.
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Read the headerless CSVs; Spark assigns default column names (_c0, _c1, ...).
df = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PBATCHCT")
df_pnote = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PNOTES")

# Rename the default columns to the real field names.
oldColumns = df.schema.names
newColumns = ["BGAAKY", "BGBTDT", "BGBGDT", "BGETDT", "BGEGDT"]
df = reduce(
    lambda acc, idx: acc.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df,
)

df.printSchema()

def clean_data(df):
    # Derive column 'BusinessEffectiveDate' (placeholder for now; the real
    # value should eventually come from column 'BGBGDT').
    df = df.withColumn('BusinessEffectiveDate', F.lit(None).cast(T.DateType()))
    # df = df.select(*(df.columns[:3] + ['BusinessEffectiveDate'] + df.columns[3:]))
    return df

df_clean = clean_data(df)
display(df_clean)
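
Side note: assuming the number of new names matches the number of columns, the rename loop above can also be collapsed into a single standard PySpark call:

# One-step rename: replaces all default column names (_c0, _c1, ...) at once.
df = df.toDF("BGAAKY", "BGBTDT", "BGBGDT", "BGETDT", "BGEGDT")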

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @dnauflett,

You can take a look at the following code, which uses the df.collect() and df.withColumn() functions to achieve your requirement:

 

 

# Import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a sample DataFrame
data = [("A", 34), ("B", 45), ("C", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
display(df)

# get the cell value from the DataFrame: second row, first column
cellValue = df.collect()[1][0]

# load new data
df = spark.read.format("csv").option("header","true").load("Files/churn/raw/churn.csv")

# Add a new column with the extracted value using lit
df = df.withColumn("NewColumn", lit(cellValue))

display(df)
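
Applied to your files, the same pattern looks roughly like this. This is a minimal sketch using the paths from your post; it assumes the business date is the 3rd column of the first row of PBATCHCT, as a yyyyMMdd string, and casts it to a real date with to_date:

from pyspark.sql.functions import lit, to_date

# File 1 is headerless, so Spark assigns _c0, _c1, ...; the business date
# is the 3rd column of the first row, as a yyyyMMdd string (e.g. "20240719").
df_batch = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PBATCHCT")
business_date = df_batch.first()[2]

# One of the other headerless files (repeat the same steps for files 2-524).
df_pnote = spark.read.csv("Files/Development/PersistentStaging/McCrackenDaily/PNOTES")

# Cast the yyyyMMdd string to a DateType while adding the column,
# then reorder with select so BusinessEffectiveDate comes first.
df_pnote = df_pnote.withColumn(
    "BusinessEffectiveDate", to_date(lit(business_date), "yyyyMMdd")
)
df_pnote = df_pnote.select(
    "BusinessEffectiveDate",
    *[c for c in df_pnote.columns if c != "BusinessEffectiveDate"],
)

display(df_pnote)

Using first() instead of collect() avoids pulling every row back to the driver when you only need a single cell.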

 

 


Spark dataframe: collect () vs select () - Stack Overflow

Regards,

Xiaoxin Sheng


dnauflett
Frequent Visitor

Thank you so much, Xiaoxin!!! You have really helped advance my knowledge and training in PySpark.

