Hello
We have a data extraction pipeline that downloads JSON files from the source system into a Lakehouse. The data is split over many JSON files, each of which contains the URL linking to the next file to be downloaded. A significant slowdown in our pipeline is starting up a Spark cluster (~several minutes) just to run a tiny Python program (~1 second) that opens the file, reads the "next_url" property, and sets an ExitValue to feed into the next Copy Data activity.
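For reference, the notebook in question is essentially this trivial (a minimal sketch; the file path is illustrative):

```python
# Minimal sketch of the notebook step described above.
# The file path is illustrative; "next_url" is the property we read.
import json

from notebookutils import mssparkutils

# Location of the file the preceding Copy Data activity just downloaded.
file_path = "/lakehouse/default/Files/downloads/current.json"

with open(file_path, "r", encoding="utf-8") as f:
    next_url = json.load(f)["next_url"]

# Set the ExitValue that the pipeline feeds into the next Copy Data activity.
mssparkutils.notebook.exit(next_url)
```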
Hence my question. I want to efficiently extract just a single JSON property from JSON files saved in a Lakehouse, without waiting for a new spark cluster to start up for each file. This operation has to be performed many times, at semi-unpredictable intervals (whenever a file is done downloading). The JSON files can be very small (a few KB) or large (several hundred MB), but I am only interested in that one property at this stage in the pipeline.
What are my options?
Hi @KSPana,
Have you tried the 'pagination rules' option in the Copy Data activity? If not, you can take a look at the official documentation:
Configure REST in a copy activity - Microsoft Fabric | Microsoft Learn
See the pagination support section there for the details.
If that is not suitable for your scenario either, I think this currently cannot be handled directly. Perhaps you can submit an idea with your requirement.
Regards,
Xiaoxin Sheng
Hi @KSPana,
Can you please share some more detailed information about these operations? That would help us understand your scenario and test it.
Regards,
Xiaoxin Sheng
The data consists of JSON files on the source side. We have a pipeline with the following structure:
A Copy Data activity, which downloads a JSON file into the lakehouse -> a Notebook which takes the file location in the lakehouse as input and returns the "next url" field from the JSON file -> this URL goes back into the Copy Data activity to download the next file, and so on, until the last file is downloaded.
The JSON files that are downloaded range from several KB to several hundred MB, and can only be downloaded sequentially. Sometimes there are hundreds of JSON files in this process. Starting up a Spark cluster just to run the notebook that extracts this one JSON field can take from half a minute to several minutes, which adds up considerably over the entire download run. At the same time, keeping a cluster running throughout the entire process is costly, since most of the time it is idling while waiting for a download to complete.
Is there any other way to get that one field from a JSON file without having to start up a Spark cluster every time? It only needs a relatively lightweight environment. I have considered something like an Azure Function with access to the Lakehouse, but that would require going outside the Fabric environment. I have looked into the Lookup activity, but as far as I can tell it will not work, because the JSON files can be larger than the Lookup activity allows. I have also considered using SQL within a Script activity, but with limited success so far. Suggestions are welcome.
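To illustrate the Azure Function idea: OneLake exposes an ADLS Gen2-compatible endpoint, so a lightweight client outside Spark could stream-parse just the one property without loading the whole file. A rough sketch, where the workspace and lakehouse names, the file path, and the use of ijson for incremental parsing are all my assumptions:

```python
# Rough sketch: read one property from a Lakehouse file via OneLake's
# ADLS Gen2-compatible endpoint, from outside a Spark session (e.g. in an
# Azure Function). Workspace/lakehouse names and the path are assumptions.
import ijson  # incremental JSON parser; avoids loading the whole file
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("MyWorkspace")  # workspace = file system
file_client = fs.get_file_client(
    "MyLakehouse.Lakehouse/Files/downloads/current.json"
)

# download_file() returns a stream; chunks() yields bytes lazily, and
# ijson 3.x can consume an iterable of byte chunks directly.
downloader = file_client.download_file()
for prefix, event, value in ijson.parse(downloader.chunks()):
    if prefix == "next_url" and event == "string":
        print(value)  # the only field needed at this stage
        break
```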
Hi @KSPana,
So you mean to use a notebook to handle a paginated REST API? If that is the case, I'd like to suggest you take a look at the following blog about using a notebook to handle a paginated API:
Pagination in Fabric Notebook Using Fabric Rest API - Ben's Blog
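As a rough illustration of that approach (the URL, authentication, and file layout below are placeholders), the whole pagination loop can run inside a single notebook session, so the cluster only starts once:

```python
# Sketch of the pagination loop inside one notebook session, so the
# Spark cluster starts once. The URL and file layout are placeholders.
import requests

url = "https://example.com/api/export?page=1"  # hypothetical starting URL
page = 1

while url:
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()

    # Persist the raw page to the Lakehouse file area.
    out_path = f"/lakehouse/default/Files/downloads/page_{page:04d}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(resp.text)

    # Follow the link to the next file; stop when it is absent.
    url = resp.json().get("next_url")
    page += 1
```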
Regards,
Xiaoxin Sheng
Hello, thanks again for your response.
The data is accessed through an on-premises gateway, which means I cannot simply use a notebook to access it; as I understand it, that is currently unsupported in Fabric.
Using the pagination rule settings on the Copy Data activity itself does allow me to download all the data without starting a notebook, though this puts all the data into a single file instead of splitting it over several files as in our current workflow. If there is no other way, then I suppose it is still faster to do this and split the file manually later than to wait for notebooks to start every time.
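For completeness, the later split could itself be done with a streaming parser, so the combined file never needs to fit in memory. A sketch, assuming the combined output happens to be one top-level JSON array of page objects (which I have not verified):

```python
# Streaming split of the combined pagination output, assuming it is a
# single top-level JSON array of page objects. ijson.items(f, "item")
# yields one array element at a time, so memory stays small.
import json

import ijson

combined_path = "/lakehouse/default/Files/downloads/combined.json"  # placeholder

with open(combined_path, "rb") as f:
    # use_float=True makes ijson yield floats instead of Decimal,
    # so json.dump can serialize the pages without a custom encoder.
    for i, page in enumerate(ijson.items(f, "item", use_float=True)):
        out_path = f"/lakehouse/default/Files/downloads/page_{i:04d}.json"
        with open(out_path, "w", encoding="utf-8") as out_f:
            json.dump(page, out_f)
```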