Hi,
In summary, my main question is:
If I need to use runMultiple to call an external API multiple times and get the JSON result, are there specific tricks for extracting the API result from inside the DAG result?
Let me explain better why I'm asking:
I have a notebook which makes these calls sequentially. I want to improve performance by switching to parallel execution.
In the original notebook, when I call the API, this is the piece of code that starts processing the result:
```python
result = requests.post(function_url, data=body, headers=headers)
if result.status_code != 200:
    result.raise_for_status()
data = json.loads(result.text)
if data is None:
    continue  # this runs inside the loop over accounts; skip empty responses
for serviceLine in data:
    ...
```
From this point forward, everything works well.
When using runMultiple, I created a parameterized notebook that makes one call and returns the result. This is how I'm returning it:
```python
result = requests.post(function_url, data=body, headers=headers)
if result.status_code != 200:
    result.raise_for_status()
data = json.loads(result.text)      # data is a Python dict/list here
mssparkutils.notebook.exit(data)    # the exit value comes back as a string
```
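For context, the parent notebook builds the DAG and calls runMultiple roughly like this (simplified, with illustrative names; each activity is named after the account number):

```python
activities = []
for account_number in account_numbers:
    activities.append({
        "name": str(account_number),             # activity name = account number
        "path": "CallServiceLineAPI",            # the parameterized notebook (name illustrative)
        "args": {"account_number": account_number},
    })

result = mssparkutils.notebook.runMultiple({"activities": activities})
```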
The problem is that everything arrives in a different shape, and I'm having to make more changes to the processing code (not visible in the code blocks above) than I expected.
First, to extract the JSON from the DAG result, I had to apply some transformations to it:
```python
def prepareJSON(jsonValue):
    # Try to turn the string that comes back from the DAG into valid JSON text
    jsonValue = jsonValue.replace("'", '"')
    jsonValue = jsonValue.replace('None', '[]')
    jsonValue = jsonValue.replace('True', 'true')
    jsonValue = jsonValue.replace('False', 'false')
    return jsonValue

exitVal = result[account_number]['exitVal']
if exitVal is None:
    continue
exitVal = prepareJSON(exitVal)
row = json.loads(exitVal)
```
In the code above, "account_number" is the value used as the name of the activity in the DAG.
If it had stopped there, I would not be so concerned. But at the end of the processing it generated errors because it can't join the results: some columns end up with mixed data types.
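My suspicion is that notebook.exit is stringifying the dictionary with Python's repr, and the string replacements above can't reliably turn that back into JSON. A tiny illustration with made-up values:

```python
import json

data = {"customer": "O'Brien", "active": True, "notes": None}
as_text = str(data)   # {'customer': "O'Brien", 'active': True, 'notes': None}
# Replacing every ' with " now corrupts O'Brien, and replacing the word
# None would also mangle any value that happens to contain that text:
broken = as_text.replace("'", '"')
json.loads(broken)    # raises json.JSONDecodeError
```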
Back to the question: are there any special tricks or suggestions for extracting a value from the DAG result without having to change my original processing code so much?
Kind Regards,
Dennes
Hi @DennesTorres
Thanks for using Fabric Community.
At this time, we are reaching out to the internal team to get some help on this. We will update you once we hear back from them.
Thanks
Hi @DennesTorres
The internal team replied as follows:
Just a suggestion, but it seems like it's simply a problem with the JSON -> string -> JSON conversion. Perhaps try json.dumps at notebook exit to safely convert the JSON to a string, and then json.loads once you've combined the results. Also, remember the new soft limit for runMultiple is 50 by default, so it may be better to switch to Python multithreading; note that this will only run on the driver node, so use a single node pool.
Hope this helps. Please let me know if you have any further questions.
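In code, the suggestion would look roughly like this (a sketch; "results" stands for the runMultiple return value, and the other names mirror those in your post):

```python
import json

# Child notebook: serialize to a proper JSON string before exiting
data = json.loads(result.text)
mssparkutils.notebook.exit(json.dumps(data))

# Parent notebook: parse each activity's exit value after runMultiple returns
exitVal = results[account_number]['exitVal']
data = json.loads(exitVal)
```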
Hi,
Thank you. Yes, I found a similar solution.
The problem was the load: calling json.loads before returning the data converts the string into a dictionary. When the dictionary is returned through the DAG, the dictionary -> string conversion doesn't produce the same string.
The solution was to not call json.loads inside the parallel execution. I returned the original string, retrieved it outside, and applied json.loads outside the parallel execution. It works perfectly.
It's essentially the same as the recommendation.
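Concretely, the child notebook now exits with the raw response text and the parent parses it (sketch):

```python
# Child notebook: return the response body untouched
mssparkutils.notebook.exit(result.text)

# Parent notebook: parse the original JSON only after the DAG completes
data = json.loads(results[account_number]['exitVal'])
```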
I'm aware of the runMultiple limit. Python multithreading is something I still need to explore.
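If I go that route, I imagine something along these lines (a rough sketch with hypothetical helpers; it all runs on the driver, so a single node pool would do):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def call_api(account_number):
    body = build_body(account_number)   # hypothetical helper to build the request body
    response = requests.post(function_url, data=body, headers=headers)
    response.raise_for_status()
    return account_number, response.json()

with ThreadPoolExecutor(max_workers=8) as pool:
    for account_number, data in pool.map(call_api, account_numbers):
        if not data:
            continue
        for serviceLine in data:
            ...  # same processing as the sequential version
```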
Kind Regards,
Dennes