I'm probably going to have a series of questions like this!! I'm going to baby-step this. The 50,000-foot view of what I'm trying to do: I have 524 files that I need to load into a Fabric Warehouse every day. The csv files themselves do not have headers on them. I have figured out how to get the headers I need. One of those files contains the business date of the data. I want to grab the date from that one file and then, for all of the other files, add that date as a column. For example:
File 1 (I was able to load it as a DataFrame and rename the columns; I need the 3rd column):
AA | 20240721 | 20240719 | 20240721 | 20240719 |
Files 2-524 (files with various columns and numbers of records; I have the metadata for each file):
XXXXX | SomeData | 324 | cde | 2024 |
XXXXV | SomeData | 234 | cdf | 2024 |
XXXXH | SomeData | 234 | cdd | 2024 |
OutputResult (I want to add a column to the beginning with the date from File 1 as a Date; that value is loading as a string):
BusinessEffectiveDate | Field1 | Field2 | Field3 | Field4 | Field5 |
07-19-2024 | XXXXX | SomeData | 324 | cde | 2024 |
07-19-2024 | XXXXV | SomeData | 234 | cdf | 2024 |
XXXXH | SomeData | 234 | cdd | 2024 |
I tried to keep it simple. So I'm trying to pull field 3 in File 1, where the data is in yyyymmdd format, and then use that date and add it to all the other files as the first field.
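To make that concrete, here's a minimal sketch of where I am so far. The path and the column names are placeholders for my real file; File 1 has no header row, so I rename the columns after loading:
# Load File 1 (no header row); the path is a placeholder for my real file
file1 = spark.read.format("csv").load("Files/file1.csv")
# Rename the columns (these names are just examples for illustration)
file1 = file1.toDF("RecordType", "RunDate", "BusinessDate", "PrevRunDate", "PrevBusinessDate")
# The 3rd column of the first row holds the business date as a "yyyymmdd" string
business_date_str = file1.collect()[0]["BusinessDate"]  # e.g. "20240719"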
Hi @dnauflett,
You can take a look at the following code, which uses the df.collect and df.withColumn functions to achieve your requirement:
# Import modules
from pyspark.sql.functions import lit

# Create a sample DataFrame
data = [("A", 34), ("B", 45), ("C", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
display(df)

# Get the cell value from the DataFrame: second row, first column
cellValue = df.collect()[1][0]

# Load new data
df = spark.read.format("csv").option("header", "true").load("Files/churn/raw/churn.csv")

# Add a new column containing the extracted value as a constant
df = df.withColumn("NewColumn", lit(cellValue))
display(df)
Spark dataframe: collect() vs select() - Stack Overflow
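Since your source value is a yyyymmdd string and you want it first in the output, you can also convert it to a proper date and reorder the columns in the same step. A rough sketch building on the code above (the "yyyyMMdd" format string assumes the value looks like "20240719"):
from pyspark.sql.functions import lit, to_date

# Add the extracted value as a real DateType column instead of a string
df = df.withColumn("BusinessEffectiveDate", to_date(lit(cellValue), "yyyyMMdd"))
# Move the new column to the front of the DataFrame
df = df.select("BusinessEffectiveDate", *[c for c in df.columns if c != "BusinessEffectiveDate"])
display(df)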
Regards,
Xiaoxin Sheng
Thank you so much Xiaoxin!!! You have really helped advance my knowledge and training in PySpark.