ThiyagarajanLG
Frequent Visitor

Reading JSON from a notebook

Hi team,

 

Can you please help with the issue below.

 

I have the JSON file below.

------------------------------------

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}

----------------------------

 

I have to insert the JSON file into a lakehouse table in the format below. Basically, I need three records to be inserted into the lakehouse. I used PySpark to read the JSON, but it did not help me get the format below.

 

ThiyagarajanLG_0-1715142483076.png

 

Thanks

Rajan

 

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please try the code below:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("SEMPermID", StringType(), True),
    StructField("Contents", ArrayType(
        StructType([
            StructField("MetaData", StringType(), True),
            StructField("Payload", ArrayType(
                StructType([
                    StructField("FileName", StringType(), True),
                    StructField("ObjectId", StringType(), True),
                    StructField("MimeType", StringType(), True),
                    StructField("Language", StringType(), True),
                    StructField("FileCreated", StringType(), True),
                ])
            ), True),
        ])
    ), True),
])

# Read the JSON data with the defined schema
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
df3 = df3.withColumn("Contents", explode("Contents"))

# Perform transformations to Payload and MetaData
df3 = df3.withColumn("Payload", coalesce(col("Contents.Payload"), lit(None)))
df3 = df3.withColumn("MetaData", coalesce(col("Contents.MetaData"), lit(None)))

# Select only the required columns
df3 = df3.select("SEMPermID", "MetaData", "Payload")

display(df3)

 

vgchennamsft_0-1715151034216.png


FYI: in my case, the provided input JSON is in the testing2.json file.

Hope this is helpful. Please let me know in case of further queries.
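The expected row count can be sanity-checked in plain Python, with no Spark session needed. This is a minimal sketch using a shortened version of the same input shape; the field values are illustrative:

```python
import json

# Miniature of the input: same shape as testing2.json, fewer fields.
doc = json.loads("""
{"SEMPermID": "139390475563",
 "Contents": [
   {"MetaData": "Sentiment_1227523835698.json",
    "Payload": [{"FileName": "transcript.pdf"}, {"FileName": "transcript.xml"}]},
   {"MetaData": "Sentiment_1227523835700.json",
    "Payload": [{"FileName": "transcript.xml"}]},
   {"MetaData": "Sentiment_1227523835499.json"}
 ]}
""")

# Equivalent of explode("Contents") plus selecting the three columns;
# an entry without a Payload yields None, matching the coalesce behaviour.
rows = [(doc["SEMPermID"], c.get("MetaData"), c.get("Payload"))
        for c in doc["Contents"]]

print(len(rows))  # 3
print(rows[2])    # ('139390475563', 'Sentiment_1227523835499.json', None)
```

Once `df3` looks right, it can presumably be persisted with something like `df3.write.format("delta").saveAsTable("your_table")` (table name hypothetical); if Payload must land as a string column, `to_json(col("Payload"))` converts the array of structs first.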

View solution in original post

7 REPLIES
ThiyagarajanLG
Frequent Visitor

Hi,

I tried with another JSON file, shown below.

----------------------------

{
  "container_records": [
    {
      "state": {
        "effectiveFrom": "2023-11-23T07:30:00.000Z",
        "effectiveTo": "2024-04-16T14:50:44.231Z",
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2023-11-23T10:45:25.953Z",
        "currentTo": "2024-04-16T14:50:44.810Z",
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    },
    {
      "state": {
        "effectiveFrom": "2024-04-16T14:50:44.231Z",
        "effectiveTo": null,
        "currentFrom": "2024-04-16T14:50:44.810Z",
        "currentTo": null,
        "aDataItem": {
          "distinguishingEventName": "Q3 2024 AAC Clyde Space AB Earnings Release",
          "EventTypeId": 1000351539,
        }
      }
    }
  ]
}

------------------------------------

 

My PySpark code is below.

 

# Welcome to your new notebook
# Type here in the cell editor to add code!
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import col, coalesce, lit, explode

# Define the schema
schema = StructType([
    StructField("container_records", ArrayType(
        StructType([
            StructField("state", ArrayType(
                StructType([
                    StructField("effectiveFrom",StringType(),True),
                    StructField("effectiveTo", StringType(),True),
                    StructField("currentFrom", StringType(),True),
                    StructField("currentTo", StringType(),True),
                    StructField("aDataItem", ArrayType(
                        StructType([
                            StructField("distinguishingEventName", StringType(), True),
                            StructField("EventTypeId", StringType(), True),
                        ])
                    ),True),
                ])
            ), True),
        ])                  
    ), True),
])


# Read the JSON data with the defined schema
 
 
df3 = spark.read.option("multiline", "true").json("Files/testing2.json", schema=schema)

# Explode the "Contents" array
#df3 = df3.withColumn("container_records", explode("container_records"))
#df3 = df3.withColumn("state", coalesce(col("container_records.state"), lit(None)))

display(df3)
 
 
-------------------------
 
Below is the output. I don't see any rows coming up, and I'm not sure where I am making a mistake.

Also, how do I explode the column effectiveFrom, which is in a nested array?
ThiyagarajanLG_0-1715187056458.png
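Since this question was redirected to a new thread, the following is only an unverified sketch of two likely causes. The pasted JSON has trailing commas after `EventTypeId`, which strict parsers reject (Spark would then typically return null or corrupt rows), and `state` and `aDataItem` are JSON objects rather than arrays, so wrapping them in `ArrayType` in the schema will not match. Both points can be checked in plain Python:

```python
import json

# Trailing comma exactly as in the pasted sample; strict JSON rejects it.
bad = '{"aDataItem": {"EventTypeId": 1000351539,}}'
try:
    json.loads(bad)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With the comma removed, note the types: "state" and "aDataItem" are
# objects (dicts), not arrays, so the schema should declare them as
# StructType rather than ArrayType(StructType(...)).
record = json.loads(
    '{"state": {"currentTo": null, "aDataItem": {"EventTypeId": 1000351539}}}')

print(strict_ok)                       # False
print(type(record["state"]).__name__)  # dict
```

Under that reading, `effectiveFrom` is not inside an array at all: after `explode("container_records")`, it should be reachable directly as `col("container_records.state.effectiveFrom")` (untested against the original file).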

 

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Can you please raise a new thread for this query so I can look into it more closely?
After creating the new thread, please share the link here.

Thank you

ThiyagarajanLG
Frequent Visitor

Thank you so much, it really works.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Glad to know that your query got resolved. Please continue using Fabric Community on your further queries.

Anonymous
Not applicable

Hi @ThiyagarajanLG ,

Thanks for using Fabric Community.
As I understand it, the input JSON is:

{
  "SEMPermID": "139390475563",
  "Contents": [
    {
      "MetaData": "Sentiment_1227523835698.json",
      "Payload": [
        {
          "FileName": "transcript.pdf",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c97",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        },
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c88",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835700.json",
      "Payload": [
        {
          "FileName": "transcript.xml",
          "ObjectId": "823bd00a-b399-4d47-900c-d181c84d3c79",
          "MimeType": "application/pdf",
          "Language": "en",
          "FileCreated": "2024-04-10T16:24:19.011Z"
        }
      ]
    },
    {
      "MetaData": "Sentiment_1227523835499.json"
    }
  ]
}


The expected table format in the lakehouse has 5 columns (FileName, ObjectId, MimeType, Language, FileCreated).

Please let me know if my understanding is correct.

Hi,

 

No, as I mentioned in my post, the target lakehouse table will have only 3 columns:

SEMPermID, MetaData, and Payload. The Payload column will hold the JSON-formatted values.

 

Thanks

Rajan

