Fabric Pipeline Activity-How to use API of activities in Notebook Activity

Existing functionality:

AWS: Currently, we have a Lambda function (a Python script) with many conditions (it fetches data from an on-premises SQL Server), which generates parameters and sends them to a shell script containing a Sqoop import query; similarly, there is one more Lambda for the Sqoop export.

Target:

We are planning to achieve the same in a Fabric Data pipeline using one of the three options below. Please suggest which would be the best one to proceed with and why.



We are thinking of 3 options:

1) Using only the notebook activity (for the existing Sqoop import and export) and calling a copyData API for the data movement.
The Python notebook will handle all the conditional statements that the Lambda was handling earlier and then call an API with parameters such as source and target to copy the data from SQL Server to ADLS Gen2.

2) Using a notebook activity and a Copy Data activity (for the existing Sqoop import and export) -- same as point 1, but using the pipeline's Copy Data activity for the data movement.

3) Using an Azure Function activity (for the Lambda functions) and a Copy Data activity for the existing Sqoop import and export.


Hi @Anonymous , this is pretty helpful. Could you please elaborate more on the cons of the copyData API, apart from it being less streamlined? Also, we have many parameters (and sometimes queries) that need to be passed from the notebook to the Copy Data activity; would this be feasible with Option 1? Any thoughts would be much appreciated. Thank you very much.

Anonymous
Not applicable

Hi @AmruthaVarshini ,

Cons of copyData API (compared to Datacopy Activity):

  1. Lower-Level Control: copyData offers more granular control over data transfer, but requires writing more code within the notebook to handle details like serialization, error handling, and progress tracking. Datacopy Activity simplifies these aspects.
  2. Error Handling: With copyData, you'll need to implement custom error handling logic within the notebook to capture and respond to potential issues during data transfer. Datacopy Activity provides built-in error handling and retries for robust data movement.
  3. Monitoring and Logging: Monitoring data transfer progress and logging details can be more cumbersome with copyData. Datacopy Activity offers better integration with Datapipeline's monitoring and logging capabilities for improved visibility.
  4. Security Considerations: When using copyData, ensure proper access control is in place for the API calls to prevent unauthorized data access. Datacopy Activity leverages Datapipeline's security model for secure data transfer.

Feasibility of Option 1 with Many Parameters:

  1. Passing Parameters: Option 1 can handle a large number of parameters from the notebook to the copyData API. You can leverage libraries like requests to build dynamic API calls with parameters passed as arguments or within the request body (see the sketch after this list).
  2. Passing Queries: While feasible, passing complex queries as parameters can become cumbersome and less readable. Consider storing reusable queries in separate files or leveraging configuration management tools for better maintainability.
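
To make the parameter question concrete, below is a minimal sketch (Python, as it would run in the notebook) of building a dynamic, parameter-heavy call with requests, including the basic retry handling that the Datacopy Activity would otherwise give you for free. The endpoint URL, payload shape, and token handling are placeholders for whatever copy API you actually call, not a documented contract.

```python
import time
import requests

# Placeholder endpoint -- substitute the real copyData / Fabric REST contract
# once it is confirmed against the documentation.
COPY_ENDPOINT = "https://api.example.com/copyData"

def build_copy_payload(source_query: str, target_path: str, **extra) -> dict:
    """Collect the many parameters produced by the notebook's conditional logic."""
    payload = {
        "source": {"type": "SqlServer", "query": source_query},
        "sink": {"type": "AdlsGen2", "path": target_path},
    }
    payload.update(extra)  # e.g. partition columns, load date, batch id
    return payload

def call_copy_api(payload: dict, token: str, retries: int = 3) -> dict:
    """POST the payload with simple retry/backoff -- the error handling you
    must write yourself when you bypass the Datacopy Activity."""
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(1, retries + 1):
        resp = requests.post(COPY_ENDPOINT, json=payload, headers=headers, timeout=60)
        if resp.ok:
            return resp.json()
        if attempt == retries:
            resp.raise_for_status()
        time.sleep(2 ** attempt)  # back off before retrying

# Example usage with values computed by the notebook's conditions:
# result = call_copy_api(
#     build_copy_payload("SELECT * FROM dbo.Orders", "Files/raw/orders", loadDate="2025-02-01"),
#     token="<access token>",
# )
```

Passing dozens of parameters this way is just a matter of adding keys to the payload dictionary; the real cost is the retry, logging, and monitoring code around the call, which is exactly the trade-off described in the list above.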


Overall Recommendation:

 

If your primary concern is simplicity and leveraging built-in functionalities, Option 2 (Notebook with Datacopy Activity) is still preferred. It provides a cleaner abstraction for data transfer with robust error handling and monitoring.

If you require maximum control over the data transfer process and have the resources to handle additional coding for error handling and monitoring, Option 1 (Notebook with copyData API) can be explored.



Additional Thoughts:

 

  • Explore Datapipeline's variable settings for storing frequently used parameters, reducing the number of values passed directly from the notebook.
  • If query complexity is a concern, consider pre-processing or storing reusable queries in separate locations and referencing them within the notebook for cleaner code (see the sketch below).
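
As a sketch of that suggestion: reusable queries can live in a small JSON file under the lakehouse Files area and be looked up by a short name inside the notebook, so only keys and runtime values travel as parameters. The file path and structure below are assumptions for illustration, not an existing artifact in your workspace.

```python
import json

# Hypothetical config file kept under the default lakehouse's Files area;
# in a Fabric notebook the attached lakehouse is typically reachable via this local mount path.
QUERY_CONFIG = "/lakehouse/default/Files/config/sqoop_queries.json"

with open(QUERY_CONFIG) as f:
    # e.g. {"orders_incremental": "SELECT * FROM dbo.Orders WHERE load_date = '{load_date}'"}
    queries = json.load(f)

def get_query(name: str, **params) -> str:
    """Resolve a query by short name and fill in runtime values, so the
    pipeline only needs to pass 'orders_incremental' plus a date, not SQL text."""
    return queries[name].format(**params)

# query = get_query("orders_incremental", load_date="2025-02-01")
```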


Ultimately, the best choice depends on your specific requirements and priorities. If you have a strong preference for a more low-level approach with copyData, carefully consider the added development and maintenance overhead. 

Hope this helps. Do let me know in case of further queries.

Anonymous
Not applicable

Hi @AmruthaVarshini ,

We haven't heard back from you on the last response and were just checking in to see if your query was answered.
If not, please respond back with more details and we will try to help.

Thanks


Hi @Anonymous , 

 

Thanks a lot for the additional insights. We are currently doing a POC of Option 1 and Option 2 to understand which is more affordable and performant for our particular requirement. We will get back with a few more questions once the POC is done. Hoping for the same support from you and the team.

Anonymous
Not applicable

Hi @AmruthaVarshini ,

Glad to know that you got some insights on your query. Do let me know in case of further queries.

Anonymous
Not applicable

Hi @AmruthaVarshini ,

Thanks for using Fabric Community.

I would suggest using Option 2: the notebook activity and datacopy activity approach.

Pros of Option 2:

 

  • Flexibility: Notebooks provide a familiar Python environment for handling complex logic and conditional statements, similar to your existing Lambda functions (see the hand-off sketch after this list).
  • Native Integration: Datacopy activity seamlessly integrates with Fabric Datapipeline, allowing efficient data movement from your on-premises SQL Server to ADLS Gen2. This eliminates the need for external shell scripts and simplifies the pipeline.
  • Cost-Effective: Notebooks within Datapipeline might be more cost-effective compared to Azure Functions, especially for simpler data transfer tasks. Functions incur separate execution costs.
  • Maintainability: Code resides within the pipeline, making it easier to manage and version control compared to separate Lambda functions.
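
For illustration, here is a minimal sketch of the Option 2 hand-off, assuming the notebook computes the copy parameters (the logic that used to live in the Lambda) and returns them through its exit value, which the pipeline then wires into the Copy Data activity via dynamic content. All names and values below are placeholders.

```python
import json
from notebookutils import mssparkutils  # available in Fabric notebooks

# Conditional logic previously handled by the Lambda function goes here,
# producing whatever the Copy Data activity needs for its source and sink.
copy_params = {
    "sourceQuery": "SELECT * FROM dbo.Orders WHERE load_date = '2025-02-01'",
    "sinkFolder": "Files/raw/orders/2025/02/01",
}

# Hand the parameters back to the pipeline as the notebook's exit value.
mssparkutils.notebook.exit(json.dumps(copy_params))
```

On the pipeline side, the Copy Data activity's source query and sink path would reference the notebook activity's output (its exit value) through dynamic content expressions; the exact property path is easiest to copy from a monitored run of the pipeline rather than typed from memory.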

 

Drawbacks of Other Options:

 

  • Option 1 (Only Notebook Activity): While feasible, using only notebooks with copyData API might require additional code to handle data transfer logic, making it less streamlined.
  • Option 3 (Azure Function Activity): Azure functions introduce additional complexity and potential cost compared to native datacopy within Datapipeline.

Ultimately, it completely depends on which approach you choose.

Hope this is helpful. Do let me know in case of further queries.
