Hi team,
Can you please help with the below issue?
I have the below JSON file.
------------------------------------
{
    "SEMPermID": "139390475563",
    "Contents": [
        {
            "MetaData": "Sentiment_1227523835698.json",
            "Payload": [
                {
                    "FileName": "transcript.pdf",
                    "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
                    "MimeType": "application/pdf",
                    "Language": "en",
                    "FileCreated": "2024-04-10T16:24:19.011Z"
                },
                {
                    "FileName": "transcript.xml",
                    "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
                    "MimeType": "application/pdf",
                    "Language": "en",
                    "FileCreated": "2024-04-10T16:24:19.011Z"
                }
            ]
        },
        {
            "MetaData": "Sentiment_1227523835700.json",
            "Payload": [
                {
                    "FileName": "transcript.xml",
                    "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
                    "MimeType": "application/pdf",
                    "Language": "en",
                    "FileCreated": "2024-04-10T16:24:19.011Z"
                }
            ]
        },
        {
            "MetaData": "Sentiment_1227523835499.json"
        }
    ]
}
----------------------------
I have to insert the JSON file into a lakehouse table in the below format. Basically, I need three records to be inserted into the lakehouse (one per element of the Contents array). I used PySpark to read the JSON, but it did not help me get the below format.
Thanks
Rajan
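To make the expected output concrete, the three target records can be sketched in plain Python (a minimal illustration only, with the Payload entries abbreviated; the actual loading is done in PySpark):

```python
# Abbreviated version of the input JSON above (full field lists omitted)
doc = {
    "SEMPermID": "139390475563",
    "Contents": [
        {"MetaData": "Sentiment_1227523835698.json",
         "Payload": [{"FileName": "transcript.pdf"}, {"FileName": "transcript.xml"}]},
        {"MetaData": "Sentiment_1227523835700.json",
         "Payload": [{"FileName": "transcript.xml"}]},
        {"MetaData": "Sentiment_1227523835499.json"},
    ],
}

# One output record per element of Contents; an entry with no Payload
# yields None (NULL in the lakehouse table)
records = [
    {"SEMPermID": doc["SEMPermID"],
     "MetaData": entry.get("MetaData"),
     "Payload": entry.get("Payload")}
    for entry in doc["Contents"]
]
```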
Hi @ThiyagarajanLG ,
Can you please try the below code -
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode
# Define the schema
schema = StructType([
StructField("SEMPermID", StringType(), True),
StructField("Contents", ArrayType(
StructType([
StructField("MetaData", StringType(), True),
StructField("Payload", ArrayType(
StructType([
StructField("FileName", StringType(), True),
StructField("ObjectId", StringType(), True),
StructField("MimeType", StringType(), True),
StructField("Language", StringType(), True),
StructField("FileCreated", StringType(), True),
])
), True),
])
), True),
])
# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)
# Explode the "Contents" array
df3 = df3.withColumn("Contents", explode("Contents"))
# Pull the nested MetaData and Payload fields up to top-level columns
# (an entry with no Payload simply comes through as NULL)
df3 = df3.withColumn("Payload", col("Contents.Payload"))
df3 = df3.withColumn("MetaData", col("Contents.MetaData"))
# Select only the required columns
df3 = df3.select("SEMPermID", "MetaData", "Payload")
display(df3)
FYI: In my case, the provided input JSON is in the testing2.json file.
Hope this is helpful. Please let me know in case of further queries.
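To actually land the three rows in a lakehouse table rather than just display them, the Payload array column would typically be serialized to a JSON string with to_json(col("Payload")) and the result written via df3.write.mode("append").saveAsTable(...) with a table name of your choice. The serialization step behaves like json.dumps applied to each row's array; a plain-Python sketch:

```python
import json

# One row's Payload array, abbreviated from the sample input
payload = [
    {"FileName": "transcript.pdf", "MimeType": "application/pdf"},
    {"FileName": "transcript.xml", "MimeType": "application/pdf"},
]

# Serialize the array to a JSON string, as to_json would do per row
payload_str = json.dumps(payload)

# The string round-trips back to the original structure
roundtrip = json.loads(payload_str)
```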
Hi,
I tried with another JSON file, shown below.
----------------------------
{
"container_records": [
{
"state": {
"effectiveFrom": "2023-11-23T07:30:00.000Z",
"effectiveTo": "2024-04-16T14:50:44.231Z",
"currentFrom": "2023-11-23T10:45:25.953Z",
"currentTo": null,
"aDataItem": {
"distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
"EventTypeId": 1000351539
}
}
},
{
"state": {
"effectiveFrom": "2024-04-16T14:50:44.231Z",
"effectiveTo": null,
"currentFrom": "2023-11-23T10:45:25.953Z",
"currentTo": "2024-04-16T14:50:44.810Z",
"aDataItem": {
"distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
"EventTypeId": 1000351539
}
}
},
{
"state": {
"effectiveFrom": "2024-04-16T14:50:44.231Z",
"effectiveTo": null,
"currentFrom": "2024-04-16T14:50:44.810Z",
"currentTo": null,
"aDataItem": {
"distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
"EventTypeId": 1000351539
}
}
}
]
}
------------------------------------
My PySpark code is below.
Hi @ThiyagarajanLG ,
Can you please raise a new thread for this query so I can look into it more closely?
After creating the new thread, please share its link here.
Thank you
Thank you so much, it really works.
Hi @ThiyagarajanLG ,
Glad to know that your query got resolved. Please continue to use the Fabric Community for your further queries.
Hi @ThiyagarajanLG ,
Thanks for using Fabric Community.
As I understand it, the input JSON is -
{
"SEMPermID": "139390475563",
"Contents": [
{
"MetaData": "Sentiment_1227523835698.json",
"Payload": [
{
"FileName": "transcript.pdf",
"ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
"MimeType": "application/pdf",
"Language": "en",
"FileCreated": "2024-04-10T16:24:19.011Z"
},
{
"FileName": "transcript.xml",
"ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
"MimeType": "application/pdf",
"Language": "en",
"FileCreated": "2024-04-10T16:24:19.011Z"
}
]
},
{
"MetaData": "Sentiment_1227523835700.json",
"Payload": [
{
"FileName": "transcript.xml",
"ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
"MimeType": "application/pdf",
"Language": "en",
"FileCreated": "2024-04-10T16:24:19.011Z"
}
]
},
{
"MetaData": "Sentiment_1227523835499.json"
}
]
}
The expected table format in the lakehouse has 5 columns (FileName, ObjectId, MimeType, Language, FileCreated).
Please let me know if my understanding is correct.
Hi,
No, as I mentioned in my post, the target lakehouse table will have only 3 columns:
SEMPermID, MetaData, and Payload. The Payload column will hold the JSON-formatted values.
Thanks
Rajan