ThiyagarajanLG
Frequent Visitor

Reading JSON from a notebook

Hi team,

 

Can you please help with the below issue?

 

I have the below JSON file.

------------------------------------

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}

----------------------------

 

I have to insert this JSON file into a lakehouse table in the below format. Basically, I need three records inserted into the lakehouse. I used PySpark to read the JSON, but it did not help me get the below format.

 

[Screenshot: expected output — three rows with columns SEMPermID, MetaData, and Payload]

 

Thanks

Rajan

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please try the below code -

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("SEMPermID", StringType(), True),
    StructField("Contents", ArrayType(
        StructType([
            StructField("MetaData", StringType(), True),
            StructField("Payload", ArrayType(
                StructType([
                    StructField("FileName", StringType(), True),
                    StructField("ObjectId", StringType(), True),
                    StructField("MimeType", StringType(), True),
                    StructField("Language", StringType(), True),
                    StructField("FileCreated", StringType(), True),
                ])
            ), True),
        ])
    ), True),
])

# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
df3 = df3.withColumn("Contents", explode("Contents"))

# Pull MetaData and Payload out of the exploded Contents struct;
# an entry without a Payload (the third one) comes through as null
df3 = df3.withColumn("Payload", coalesce(col("Contents.Payload"), lit(None)))
df3 = df3.withColumn("MetaData", coalesce(col("Contents.MetaData"), lit(None)))

# Select only the required columns
df3 = df3.select("SEMPermID", "MetaData", "Payload")

display(df3)
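
If the end goal is inserting these rows into a lakehouse table rather than only displaying them, here is a minimal follow-on sketch. Two assumptions of mine, not part of the accepted answer: "demo_table" is a placeholder table name, and to_json is one way to keep the Payload column as JSON-formatted strings.

from pyspark.sql.functions import to_json

# Optionally serialize the Payload array of structs into a JSON string,
# so the column holds JSON-format values (a null Payload stays null)
df3 = df3.withColumn("Payload", to_json(col("Payload")))

# Save the three rows as a Delta table in the lakehouse
# ("demo_table" is a placeholder name)
df3.write.format("delta").mode("overwrite").saveAsTable("demo_table")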

 

[Screenshot: resulting dataframe — three rows with columns SEMPermID, MetaData, and Payload]


FYI: In my case, the provided input JSON is in the testing2.json file.

Hope this is helpful. Please let me know in case of further queries.


7 REPLIES
ThiyagarajanLG
Frequent Visitor

Hi,

I tried with another JSON file. The JSON file is as below.

----------------------------

{
  "container_records": [
    {
      "state": {
        "effectiveFrom": "2023-11-23T07:30:00.000Z",
        "effectiveTo": "2024-04-16T14:50:44.231Z",
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": "2024-04-16T14:50:44.810Z",
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2024-04-16T14:50:44.810Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539
        }
      }
    }
  ]
}

------------------------------------

 

My PySpark code is below.

 

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("container_records", ArrayType(
        StructType([
            StructField("state", ArrayType(
                StructType([
                    StructField("effectiveFrom",StringType(),True),
                    StructField("effectiveTo", StringType(),True),
                    StructField("currentFrom", StringType(),True),
                    StructField("currentTo", StringType(),True),
                    StructField("aDataItem", ArrayType(
                        StructType([
                            StructField("distinguishingEventName", StringType(), True),
                            StructField("EventTypeId", StringType(), True),
                        ])
                    ),True),
                ])
            ), True),
        ])                  
    ), True),
])


# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "container_records" array
#df3 = df3.withColumn("container_records", explode("container_records"))
#df3 = df3.withColumn("state", coalesce(col("container_records.state"), lit(None)))

display(df3)

-------------------------

Below is the output. I don't see any rows coming up in my output, and I'm not sure where I am making a mistake.

Also, how do I explode the column effectiveFrom, which is in a nested array?
[Screenshot: notebook output showing no rows]
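
For what it's worth, a minimal sketch of a schema that matches this second file, untested and based only on the JSON shown above: state and aDataItem are single objects rather than arrays, so declaring them as ArrayType is the likely reason no rows come through. Modeling them as StructType (and, an assumption, reading the numeric EventTypeId as LongType) should fix it; nested struct fields such as effectiveFrom are then reached with dot notation and need no explode.

from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType
from pyspark.sql.functions import col, explode

# "container_records" is an array, but "state" and "aDataItem" are single
# objects, so they are modeled as StructType rather than ArrayType
schema = StructType([
    StructField("container_records", ArrayType(
        StructType([
            StructField("state", StructType([
                StructField("effectiveFrom", StringType(), True),
                StructField("effectiveTo", StringType(), True),
                StructField("currentFrom", StringType(), True),
                StructField("currentTo", StringType(), True),
                StructField("aDataItem", StructType([
                    StructField("distinguishingEventName", StringType(), True),
                    StructField("EventTypeId", LongType(), True),
                ]), True),
            ]), True),
        ])
    ), True),
])

df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# One output row per container record
df3 = df3.withColumn("record", explode("container_records"))

# Struct fields are selected with dot notation; no further explode is
# needed because effectiveFrom and its siblings are not arrays
df3 = df3.select(
    col("record.state.effectiveFrom").alias("effectiveFrom"),
    col("record.state.effectiveTo").alias("effectiveTo"),
    col("record.state.currentFrom").alias("currentFrom"),
    col("record.state.currentTo").alias("currentTo"),
    col("record.state.aDataItem.distinguishingEventName").alias("distinguishingEventName"),
    col("record.state.aDataItem.EventTypeId").alias("EventTypeId"),
)

display(df3)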

 

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please raise a new thread for this query so that I can look into it more closely?
After creating the new thread, please share the link here.

Thank you

ThiyagarajanLG
Frequent Visitor

Thank you so much, it really works.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Glad to know that your query got resolved. Please continue using Fabric Community for your further queries.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Thanks for using Fabric Community.
As I understand it, the input JSON is -

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}


The expected table format in the lakehouse has 5 columns (FileName, ObjectId, MimeType, Language, FileCreated).

Please let me know if my understanding is correct.

ThiyagarajanLG
Frequent Visitor

Hi,

 

No, as I mentioned in my post, the target lakehouse table will have only 3 columns: SEMPermID, MetaData, and Payload. The Payload column will hold the JSON-format values.

 

Thanks

Rajan

