October 28 & 29: Experts share their secrets on how to pass the Fabric Analytics Engineer certification exam—live. Learn more
Dear Everyone,
"{'manufacturerCodeValue': 'XXXX', 'specific': '1', 'toolNumber': 'SCSTU'}, {'manufacturerCodeValue': 'XXX', 'specific': '1', 'toolNumber': 'SCST1AK35U'}, {'specific': '0', 'toolNumber': 'No Specific'}")
I kindly need your help, I'm not good at JSON formatting and I tried a lot of options. I have this kind of data in a data frame and I want only to extract the value of the toolNumber which is SCSTU and SCST1AK35U.
My expected output is SCSTU, SCST1AK35U
The code below only captures SCSTU.
result_df = result_df.withColumn(
"SUPP_PART_REQ_PN_2",
expr(
"concat("
"get_json_object(SUPP_PART_REQ_PN, '$.toolNumber')"
")"
)
)
Thank you so much for your help in advance.
Hi @kirah2128,
I must confess, I leaned heavily on ChatGPT, as I'm not an expert in JSON or PySpark.
The JSON string has a dangling ")" at the end, and you refer to an existing DataFrame, so I made a few assumptions:
1) The ")" isn't part of the JSON string.
2) The remaining quotes around the JSON string is only to specify that it is a string.
If that's the case, then it might be invalid JSON. Maybe paste the JSON string you have here [https://jsonlint.com/] for vaidation check.
However, herewith the PySpark code where I've made the JSON string an array object, with some other assumptions. There might be a more elegant way of parsing the string to your requirement, but hopefully this gives you some ideas of how to adjust your code.
I've also pasted your JSON string in a comment, and running that through only returns 1 value: "SCSTU"
from pyspark.sql.functions import col, explode, expr
# Input data
#gwason = '{"manufacturerCodeValue": "XXXX", "specific": "1", "toolNumber": "SCSTU"}, {"manufacturerCodeValue": "XXX", "specific": "1", "toolNumber": "SCST1AK35U"}, {"specific": "0", "toolNumber": "No Specific"}'
gwason = '[{"manufacturerCodeValue": "XXXX", "specific": "1", "toolNumber": "SCSTU"}, {"manufacturerCodeValue": "XXX", "specific": "1", "toolNumber": "SCST1AK35U"}, {"specific": "0", "toolNumber": "No Specific"}]'
# Create DataFrame
data = [(1, gwason)]
df = spark.createDataFrame(data, ["id", "gwason"])
#display(df)
# Create array of structs directly, as the input for 'explode' needs to be an array or map data type
df_parsed = df.withColumn("parsed", expr("from_json(gwason, 'array<struct<manufacturerCodeValue:string,specific:string,toolNumber:string>>')"))
#display(df_parsed)
# Explode the array of structs
df_exploded = df_parsed.select(explode("parsed").alias("exploded"))
#display(df_exploded)
# Filter and collect toolNumber values
result_df = df_exploded.filter(col("exploded.toolNumber") != "No Specific").select("exploded.toolNumber")
display(result_df)
# Convert DataFrame to list and print
result_list = result_df.rdd.map(lambda row: row[0]).collect()
print(str(result_list))
Dear dphugo,
It's a great help 🙂 Thank you so much. If you don't mind, I'm having a problem collecting it back to my dataframe. I got [Object Object] as a result
df_parsed = result_df.withColumn("parsed", expr("from_json(SUPP_PART_REQ_PN, 'array<struct<manufacturerCodeValue:string,specific:string,toolNumber:string>>')"))
df_exploded = df_parsed.select("FileName", F.explode("parsed").alias("exploded_value"))
result_df = df_exploded.groupBy("FileName").agg(collect_list("exploded_value").alias("SUPP_PART_REQ_PN3"))
I'm using a display() to see the results.
Hi @kirah2128 ,
Thanks for using Fabric Community.
I understand that you're encountering an issue when trying to collect the data back into your DataFrame and seeing [Object Object] in your display. Here are a few things to check:
1. Data size:
The collect() method in Spark is used to retrieve all elements of a DataFrame to the driver node. Try using df.collect(). This can be problematic for large datasets as it might not fit in memory, leading to errors.
2. Data format:
You mentioned using display() to see the results. This function might not display complex data structures like DataFrames properly, leading to the [Object Object] output. Try using print(result_df.toPandas()) or result_df.show() to view the DataFrame contents in a more structured format.
3. Specific error message:
If you're encountering an error beyond just [Object Object], please provide the exact error message. This will help pinpoint the specific issue and suggest more tailored solutions.
Hope this is helpful. Please let me know incase of further queries.
Hello @kirah2128 ,
We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet .
In case if you have any resolution please do share that same with the community as it can be helpful to others .
Otherwise, will respond back with the more details and we will try to help .
Hello @kirah2128 ,
We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet .
In case if you have any resolution please do share that same with the community as it can be helpful to others .
If you have any question relating to the current thread, please do let us know and we will try out best to help you.
In case if you have any other question on a different issue, we request you to open a new thread .
User | Count |
---|---|
11 | |
3 | |
3 | |
3 | |
2 |
User | Count |
---|---|
18 | |
11 | |
7 | |
7 | |
4 |