ThiyagarajanLG
Frequent Visitor

Reading JSON from a notebook

Hi team,

 

Can you please help with the issue below.

 

I have the JSON file below.

------------------------------------

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}

----------------------------

 

I have to insert the JSON file into a lakehouse table in the format below. Basically, I need three records to be inserted into the lakehouse. I used PySpark to read the JSON, but it did not help me get the format below.

 

ThiyagarajanLG_0-1715142483076.png

 

Thanks

Rajan

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please try the code below:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("SEMPermID", StringType(), True),
    StructField("Contents", ArrayType(
        StructType([
            StructField("MetaData", StringType(), True),
            StructField("Payload", ArrayType(
                StructType([
                    StructField("FileName", StringType(), True),
                    StructField("ObjectId", StringType(), True),
                    StructField("MimeType", StringType(), True),
                    StructField("Language", StringType(), True),
                    StructField("FileCreated", StringType(), True),
                ])
            ), True),
        ])
    ), True),
])

# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
df3 = df3.withColumn("Contents", explode("Contents"))

# Perform transformations to Payload and MetaData
df3 = df3.withColumn("Payload", coalesce(col("Contents.Payload"), lit(None)))
df3 = df3.withColumn("MetaData", coalesce(col("Contents.MetaData"), lit(None)))

# Select only the required columns
df3 = df3.select("SEMPermID", "MetaData", "Payload")

display(df3)

 

vgchennamsft_0-1715151034216.png


FYI: in my case, the provided input JSON is in the testing2.json file.

Hope this is helpful. Please let me know in case of further queries.
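The expected row count can be sanity-checked in plain Python, with no Spark session needed. This is a minimal sketch using a shortened version of the same input shape; the field values are illustrative:

```python
import json

# Miniature of the input: same shape as testing2.json, fewer fields.
doc = json.loads("""
{"SEMPermID": "139390475563",
 "Contents": [
   {"MetaData": "Sentiment_1227523835698.json",
    "Payload": [{"FileName": "transcript.pdf"}, {"FileName": "transcript.xml"}]},
   {"MetaData": "Sentiment_1227523835700.json",
    "Payload": [{"FileName": "transcript.xml"}]},
   {"MetaData": "Sentiment_1227523835499.json"}
 ]}
""")

# Equivalent of explode("Contents") plus selecting the three columns;
# an entry without a Payload yields None, matching the coalesce behaviour.
rows = [(doc["SEMPermID"], c.get("MetaData"), c.get("Payload"))
        for c in doc["Contents"]]

print(len(rows))  # 3
print(rows[2])    # ('139390475563', 'Sentiment_1227523835499.json', None)
```

Once `df3` looks right, it can presumably be persisted with something like `df3.write.format("delta").saveAsTable("your_table")` (table name hypothetical); if Payload must land as a string column, `to_json(col("Payload"))` converts the array of structs first.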

View solution in original post

7 REPLIES
ThiyagarajanLG
Frequent Visitor

Hi,

I tried with another JSON file, shown below.

----------------------------

{
  "container_records": [
    {
      "state": {
        "effectiveFrom": "2023-11-23T07:30:00.000Z",
        "effectiveTo": "2024-04-16T14:50:44.231Z",
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": "2024-04-16T14:50:44.810Z",
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2024-04-16T14:50:44.810Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    }
  ]
}

------------------------------------

 

My PySpark code is below.

 

# Welcome to your new notebook
# Type here in the cell editor to add code!
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("container_records", ArrayType(
        StructType([
            StructField("state", ArrayType(
                StructType([
                    StructField("effectiveFrom",StringType(),True),
                    StructField("effectiveTo", StringType(),True),
                    StructField("currentFrom", StringType(),True),
                    StructField("currentTo", StringType(),True),
                    StructField("aDataItem", ArrayType(
                        StructType([
                            StructField("distinguishingEventName", StringType(), True),
                            StructField("EventTypeId", StringType(), True),
                        ])
                    ),True),
                ])
            ), True),
        ])                  
    ), True),
])


# Read the JSON data with the defined schema
 
 
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
#df3 = df3.withColumn("container_records", explode("container_records"))
#df3 = df3.withColumn("state", coalesce(col("container_records.state"), lit(None)))

display(df3)
 
 
-------------------------
 
Below is the output. I don't see any rows coming up, and I'm not sure where I am making a mistake.

Also, how do I explode the column effectiveFrom, which is in a nested array?
ThiyagarajanLG_0-1715187056458.png
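Since this question was redirected to a new thread, the following is only an unverified sketch of two likely causes. The pasted JSON has trailing commas after `EventTypeId`, which strict parsers reject (Spark would then typically return null or corrupt rows), and `state` and `aDataItem` are JSON objects rather than arrays, so wrapping them in `ArrayType` in the schema will not match. Both points can be checked in plain Python:

```python
import json

# Trailing comma exactly as in the pasted sample; strict JSON rejects it.
bad = '{"aDataItem": {"EventTypeId": 1000351539,}}'
try:
    json.loads(bad)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With the comma removed, note the types: "state" and "aDataItem" are
# objects (dicts), not arrays, so the schema should declare them as
# StructType rather than ArrayType(StructType(...)).
record = json.loads(
    '{"state": {"currentTo": null, "aDataItem": {"EventTypeId": 1000351539}}}')

print(strict_ok)                       # False
print(type(record["state"]).__name__)  # dict
```

Under that reading, `effectiveFrom` is not inside an array at all: after `explode("container_records")`, it should be reachable directly as `col("container_records.state.effectiveFrom")` (untested against the original file).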

 

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please raise a new thread for this query so I can look into it more closely?
After creating the new thread, please share the link here.

Thank you

ThiyagarajanLG
Frequent Visitor

Thank you so much, it really works.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Glad to know that your query got resolved. Please continue using Fabric Community on your further queries.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Thanks for using Fabric Community.
As I understand it, the input JSON is:

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}


The expected table format in the lakehouse has 5 columns (FileName, ObjectId, MimeType, Language, FileCreated).

Please let me know if my understanding is correct.

Hi,

 

No, as I mentioned in my post, the target lakehouse table will have only 3 columns:

SEMPermID, MetaData, and Payload. The Payload column will hold the JSON-formatted values.

 

Thanks

Rajan

