Hi team,
Can you please help with the below issue?
I have the below JSON file.
------------------------------------
{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}
----------------------------
I have to insert this JSON file into a lakehouse table in the below format. Basically, I need three records to be inserted into the lakehouse. I used PySpark to read the JSON, but it did not help me get the below format.
Thanks
Rajan
Hi,
I tried with another JSON file. The JSON file is as below:
----------------------------
{
  "container_records": [
    {
      "state": {
        "effectiveFrom": "2023-11-23T07:30:00.000Z",
        "effectiveTo": "2024-04-16T14:50:44.231Z",
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": "2024-04-16T14:50:44.810Z",
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2024-04-16T14:50:44.810Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    }
  ]
}
------------------------------------
My PySpark code is below.
Hi @ThiyagarajanLG ,
Can you please raise a new thread for this query, so I can look into it more closely?
After creating the new thread, please share its link here.
Thank you
Thank you so much, it really works.
Hi @ThiyagarajanLG ,
Glad to know that your query got resolved. Please continue using the Fabric Community for your further queries.
Hi @ThiyagarajanLG ,
Thanks for using Fabric Community.
As I understand, the input JSON is:
{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}
The expected table format in the lakehouse has 5 columns (FileName, ObjectId, MimeType, Language, FileCreated).
Please let me know if my understanding is correct.
Hi,
No, as I mentioned in my post, the target lakehouse table will have only 3 columns:
SEMPermID, MetaData, and Payload. The Payload column will hold the JSON-format values.
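For clarity, based on the sample input above, the three expected records would look roughly like this (Payload values abbreviated):
SEMPermID    | MetaData                     | Payload
139390475563 | Sentiment_1227523835698.json | [{"FileName": "transcript.pdf", ...}, {"FileName": "transcript.xml", ...}]
139390475563 | Sentiment_1227523835700.json | [{"FileName": "transcript.xml", ...}]
139390475563 | Sentiment_1227523835499.json | NULL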
Thanks
Rajan
Hi @ThiyagarajanLG ,
Can you please try the below code -
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("SEMPermID", StringType(), True),
    StructField("Contents", ArrayType(
        StructType([
            StructField("MetaData", StringType(), True),
            StructField("Payload", ArrayType(
                StructType([
                    StructField("FileName", StringType(), True),
                    StructField("ObjectId", StringType(), True),
                    StructField("MimeType", StringType(), True),
                    StructField("Language", StringType(), True),
                    StructField("FileCreated", StringType(), True),
                ])
            ), True),
        ])
    ), True),
])

# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
df3 = df3.withColumn("Contents", explode("Contents"))

# Perform transformations to Payload and MetaData
df3 = df3.withColumn("Payload", coalesce(col("Contents.Payload"), lit(None)))
df3 = df3.withColumn("MetaData", coalesce(col("Contents.MetaData"), lit(None)))

# Select only the required columns
df3 = df3.select("SEMPermID", "MetaData", "Payload")
display(df3)
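If you want to double-check that Payload keeps its nested structure, df3.printSchema() should print roughly the following (nullability flags may differ slightly):

root
 |-- SEMPermID: string (nullable = true)
 |-- MetaData: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FileName: string (nullable = true)
 |    |    |-- ObjectId: string (nullable = true)
 |    |    |-- MimeType: string (nullable = true)
 |    |    |-- Language: string (nullable = true)
 |    |    |-- FileCreated: string (nullable = true)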
FYI: In my case, the provided input JSON is in the testing2.json file.
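If you also need to write df3 into a lakehouse table where the Payload column stores the JSON-format values as text, one option is to serialize it with to_json before saving. A minimal sketch (the table name json_contents is just a placeholder, replace it with your actual lakehouse table):

from pyspark.sql.functions import to_json, col

# Serialize the Payload array into a JSON string; rows without a
# Payload (like the third record) stay NULL after to_json.
df_out = df3.withColumn("Payload", to_json(col("Payload")))

# Append the three records to a Delta table in the lakehouse.
df_out.write.mode("append").format("delta").saveAsTable("json_contents")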
Hope this is helpful. Please let me know in case of further queries.