
Tom_CA
Helper I

Azure Pipeline - Lookup and CopyData

Hello colleagues, I'm attempting to store the JSON output from my Lookup activity in a file within my Lakehouse. I've experimented with the Copy Data activity, but it hasn't yielded the desired results. Could you suggest an effective method for saving the Lookup activity's output to a file?

 

There is no "inline" option as in Azure Data Factory.

 

Thank you

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @Tom_CA,

 

While you can't write the JSON output of the Lookup activity to a file directly within the data pipeline, there is a workaround: use a notebook to store the Lookup activity's JSON output as a file in the Lakehouse.

 

Please follow these steps:

 

First, run the Lookup activity successfully in the pipeline.

[screenshot: successful Lookup activity run in the pipeline]

 

Then, copy the JSON result from the activity's output.

[screenshot: copying the Lookup activity's JSON output]

 

In a notebook, use the following code to save the JSON result as a temporary file. Replace the sample object below with your own JSON output.

import json
import os
import shutil

# Sample Lookup output; replace with the JSON copied from your pipeline
test = {
    "firstRow": {
        "SalesOrderNumber": "SO51555",
        "SalesOrderLineNumber": 7,
        "OrderDate": "2021-06-23T00:00:00Z",
        "CustomerName": "Chloe Garcia",
        "Email": "chloe27@adventure-works.com",
        "Item": "Patch Kit/8 Patches",
        "Quantity": 1,
        "UnitPrice": 2.29,
        "Tax": 0.1832
    }
}

# Write to temporary file
temp_path = "/tmp/test.json"
with open(temp_path, 'w') as f:
    json.dump(test, f)

print(f"Schema temporarily saved to {temp_path}")

[screenshot: temporary file written from the notebook]

 

Next, use the following code to copy the file into the Files section of the Lakehouse. The path is the Lakehouse's File API path.

# File API path
lakehouse_dir = "/lakehouse/default/Files/"
lakehouse_path = lakehouse_dir + "test.json"

if not os.path.exists(lakehouse_dir):
    os.makedirs(lakehouse_dir)

# Use shutil.copy to copy files to the target location
shutil.copy(temp_path, lakehouse_path)

print(f"Schema saved to {lakehouse_path}")

[screenshots: test.json visible in the Lakehouse Files section]
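 

As a side note: in a Fabric notebook, the Files section of the attached default Lakehouse is mounted at /lakehouse/default/Files/, so a minimal sketch that skips the temporary file and writes the JSON there directly would look like this (it assumes a default Lakehouse is attached to the notebook):

import json

# 'test' is the dictionary holding your Lookup output (defined in the earlier cell);
# a tiny placeholder is shown here so the snippet runs on its own
test = {"firstRow": {"SalesOrderNumber": "SO51555"}}

# Assumes a default Lakehouse is attached, so its Files section is
# mounted at /lakehouse/default/Files/ in the notebook session
with open("/lakehouse/default/Files/test.json", "w") as f:
    json.dump(test, f)

Either way works; the temporary-file route just keeps a local copy you can inspect before the file lands in the Lakehouse.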

 

Finally, you can use this code to check that the stored JSON output is correct.

import json
import pandas as pd

# set Lakehouse path
lakehouse_path = "/lakehouse/default/Files/test.json"

# read the file with pandas; lines=True treats each line as one JSON record,
# and the file written above is a single-line JSON object, so this yields one row
df = pd.read_json(lakehouse_path, orient='records', lines=True)

schema_loaded = json.loads(df.to_json(orient='records'))

print("Loaded schema:")
print(json.dumps(schema_loaded, indent=4))

[screenshot: loaded JSON printed in the notebook]
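 

If you don't need a DataFrame, a plain json.load gives the same check with less machinery (a minimal sketch, assuming the same path as above):

import json

# read the stored file back and pretty-print it
with open("/lakehouse/default/Files/test.json") as f:
    schema_loaded = json.load(f)

print("Loaded schema:")
print(json.dumps(schema_loaded, indent=4))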

 

By using a notebook, you can save the JSON file to the Files section of the Lakehouse.

 

If you have any other questions, please feel free to contact me.

 

Best Regards,
Yang
Community Support Team

 

If any post helps, then please consider accepting it as the solution to help other members find it more quickly.
If I misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!


2 REPLIES 2

Hey @Anonymous, thank you for your suggestion. It's really interesting, and I will implement it. I'd like to add some important observations about passing parameters to the notebook:

  1. String conversion: In my tests, it was necessary to convert the Lookup output to a string before passing it as a parameter to the notebook. This is crucial to ensure that the data is transmitted correctly.
  2. Parsing in the notebook: If you want to work with a real JSON object at the end of your process, it's essential to parse the string in the notebook after receiving it, for example:

     array_json = json.loads(JsonVar)

     where 'JsonVar' is the parameter received in the notebook (see the sketch after this list).
  3. Complete flow:
    • Pipeline: Convert the Lookup output to a string.
    • Parameter passing: Use the string as a parameter for the notebook.
    • Notebook: Parse the string back into a JSON object.
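
A minimal sketch of both sides, assuming the Lookup activity is called 'Lookup1' and the notebook defines a base parameter named 'JsonVar' (both names are placeholders):

# --- Pipeline side ---
# In the Notebook activity, set the base parameter 'JsonVar' (type String)
# with an expression such as:
#     @string(activity('Lookup1').output)

# --- Notebook side ---
import json

# In the real notebook, JsonVar arrives via the parameter cell;
# the literal below only illustrates what the string looks like.
JsonVar = '{"firstRow": {"SalesOrderNumber": "SO51555", "Quantity": 1}}'

# Parse the string back into a JSON object
array_json = json.loads(JsonVar)

print(array_json["firstRow"]["SalesOrderNumber"])  # SO51555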

This process ensures that you can work with complex data structures while maintaining data integrity during the transfer between the pipeline and the notebook.

Does that make sense? This approach provides flexibility and robustness when dealing with JSON data in Azure Data Factory or Synapse Analytics pipelines.

Thank you for your support.
