Hi all,
I'm working on an API integration in a PySpark notebook, and one of the columns holds email and phone contact methods as an array in random order:
contactMethods = [{'name': 'Email', 'value': 'email.com'}, {'name': 'Mobile', 'value': '1234'}]
df = spark.createDataFrame(
[(1, contactMethods)],
("key", "contactMethods")
)
display(df)
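For reference, Spark infers the Python dicts as maps, so the column comes out as an array of maps (printSchema output abridged):
df.printSchema()
# root
#  |-- key: long (nullable = true)
#  |-- contactMethods: array (nullable = true)
#  |    |-- element: map (containsNull = true)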
I need to split it into separate "email" and "mobile" columns, but I can't figure out how to do this.
I have tried samples from several similar cases, but they all error out in the notebook.
Even the PySpark documentation samples don't work, for example the filter sample: pyspark.sql.functions.filter — PySpark 3.5.3 documentation
I'm stuck ;(
Please help!
It looks like I managed to figure out one solution.
I don't know whether it's good or bad; it's the only one I have:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
# take the first entry whose "name" matches and return its "value" ("" when there is none)
extract_email = udf(lambda cell: str(next(filter(lambda t: t["name"] == "Email", cell), {}).get("value", "")), StringType())
extract_mobile = udf(lambda cell: str(next(filter(lambda t: t["name"] == "Mobile", cell), {}).get("value", "")), StringType())
df = df.withColumn('email', extract_email(col("contactMethods"))).withColumn('mobile', extract_mobile(col("contactMethods")))
display(df)
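For comparison, here is a sketch of a pure-DataFrame alternative that avoids the UDF, using the built-in pyspark.sql.functions.filter together with element_at (assumes Spark 3.1+; first_value is just a local helper name):
from pyspark.sql import functions as F

def first_value(methods, name):
    # keep entries whose "name" matches, then take the first match's "value"
    return F.element_at(F.filter(methods, lambda m: m["name"] == name), 1)["value"]

df = df.withColumn("email", first_value(F.col("contactMethods"), "Email")) \
       .withColumn("mobile", first_value(F.col("contactMethods"), "Mobile"))
display(df)
If no entry matches, element_at returns null, so missing methods are handled gracefully.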
Pros, please advise!
Hi @silly_bird ,
It looks like you have found a solution. Could you please mark this helpful post as “Answered”?
This will help others in the community to easily find a solution if they are experiencing the same problem as you.
Thank you for your cooperation!
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
Update
If we read the data from JSON, like this:
df = spark.read.json(spark.sparkContext.parallelize([response.json()]))
df.head(1)  # inspect the first row
…then each cell is an array of Row objects, not an array of dicts.
I managed to work around it using the asDict method on the Row:
from pyspark.sql import Row
# each element is now a Row, so convert the first match to a dict before reading "value"
extract_email = udf(lambda cell: None if cell is None else next(filter(lambda t: t["name"] == "Email", cell), Row(value=None)).asDict().get("value", None), StringType())
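Applying it works the same as before; extract_mobile is the identical lambda with "Mobile" swapped in (a minimal sketch):
extract_mobile = udf(lambda cell: None if cell is None else next(filter(lambda t: t["name"] == "Mobile", cell), Row(value=None)).asDict().get("value", None), StringType())
df = df.withColumn('email', extract_email(col("contactMethods"))).withColumn('mobile', extract_mobile(col("contactMethods")))
display(df)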
Also, to add: the contact methods list can contain anywhere from 0 to many methods; I'm interested only in "email" and "phone".
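If the JSON schema is an array of name/value structs (which is what spark.read.json typically infers), another option is to turn the array into a map once and then look up the keys; missing methods come back as null, so anywhere from 0 to many entries is fine. A sketch, assuming that struct schema and no duplicate names:
from pyspark.sql import functions as F

# the struct's first field becomes the map key, the second the map value
methods_map = F.map_from_entries("contactMethods")
df = df.withColumn("email", methods_map["Email"]).withColumn("mobile", methods_map["Mobile"])
display(df)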