I'm extremely new to notebooks and accessing data within JSON files that have been imported into a Lakehouse. I have the following JSON schema in a file (from notebook df.printSchema()):
The Modifiers array element doesn't have a name, it just looks like this:
"Modifiers": [
"US"
]
I can't figure out how to reference that using a notebook or SQL's OPENROWSET. Any ideas? Thanks!
Hi @Richtpt ,
Thanks for reaching out to the Microsoft Fabric Community forum.
If you just want to see the data in the same nested format as it appears in the JSON, you can run:
df.select("PRCCollection.Modifiers").show()
Since both PRCCollection and Modifiers are arrays, Spark will display the result as something like [[US]]. That simply means there is an array inside another array.
If you’d prefer to flatten the structure and extract the actual value, you can use explode() to expand each array level:
from pyspark.sql.functions import explode
df2 = df.withColumn("prc", explode("PRCCollection")) \
    .withColumn("Modifier", explode("prc.Modifiers"))
df2.select("Modifier").show()
Here, the first explode() opens up the PRCCollection array, and the second one expands the Modifiers array inside it. The output will then show the value directly (for example US) instead of the nested [[US]] structure.
For reference, this is the output of df.schema for my data:
I hope this information helps. Please do let us know if you have any further queries.
Thank you
That code helped a lot, but I still wasn't getting what I wanted, probably because I'm extremely new to PySpark (as in this is my first time using it). I finally asked Grok how to flatten this and got back several responses, including this one, which worked perfectly. Wanted to share to close this up. Thanks much for the help!
from pyspark.sql.functions import col, explode_outer, explode
df = spark.read.json("Files/test.json")
df_deep = (
    df
    .select("*", explode_outer("PRC0Collection").alias("PR"))
    .select(
        "*",
        col("PR.*"),
        explode_outer(col("PR.Modifiers")).alias("Modifier")
    )
    .drop("PRC0Collection", "PR", "Modifiers")
)
display(df_deep)
btw, I forgot to post my json earlier. Here it is:
{
  "AGAG_ID": "PHKETCHMC",
  "AGIP_CAP_IND": "",
  "CSPI_HIOS_ID_NVL": "73836AK09999",
  "CSPI_ID": "AMT01234",
  "GRGR_CK": 12345,
  "IPCD_ID": "R9999",
  "PRC0Collection": [
    {
      "Cache_ID": "1462da8b",
      "Exp_Date": "12/31/9999 00:00:00.000",
      "Modifiers": [
        "US"
      ],
      "PSCD_ID": "11",
      "Price": 16336.44,
      "Type": "negotiated"
    },
    {
      "Cache_ID": "156eab34",
      "Exp_Date": "12/31/9999 00:00:00.000",
      "Modifiers": [
        "UQ"
      ],
      "PSCD_ID": "11",
      "Price": 24499.76,
      "Type": "negotiated"
    }
  ],
  "PRPR_ID": "P00000099999",
  "SEDF_CAP_IND": "N"
}
Hi @Richtpt ,
We really appreciate your efforts and thank you for sharing the update on the issue.
Please continue to use the Fabric Community forum for any further assistance.
Thank you
Thanks, that helps a lot and I am able to see the Modifiers values.
Modifiers is an array, so the values inside it are accessed by index number, not by name. The first value is at index [0].
To get the value, use: df.select("PRC0Collection.Modifiers[0]").show()
Maybe this can help.
Thanks but this isn't working for me. I have
df = spark.read.json("file1.json")
df.select("PRC0Collection.Modifiers[0]").show()
I get this error: [FIELD_NOT_FOUND] No such struct field `Modifiers[0]` in `Cache_ID`, `Exp_Date`, `Modifiers`, `PSCD_ID`, `Price`, `Type`.