Tviney
Regular Visitor

Replicating AWS S3 Bucket into Fabric recommendations

Hi All,

 

I have a data lake in AWS S3 buckets and want to create a replica inside Fabric. I'm aware of the shortcuts feature, and it's great, but it only allows connections to the Parquet files.

 

What would people recommend as the best method for replicating AWS S3 inside Fabric, using shortcuts, Data Factory, notebooks, etc.? I want to have all the tables available while being mindful of AWS egress costs.

 

At the moment I'm thinking of using shortcuts, then Apache Spark job definitions to load the data on a schedule.

5 REPLIES
HimanshuS-msft
Community Support

Hello @Tviney ,
As called out by @v-cboorla-msft , you can take any of the above approaches. One thing worth keeping an eye on is how you make sure you are not copying the old files/blobs again. I am not sure what the structure of the file system is on the S3 side; if it is partitioned, this will be easy to implement. In the worst case you may have to keep a watermark of the metadata of the blob that was copied last time.
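That watermark check can be sketched in plain Python (a minimal sketch: the S3 listing is mocked as a list of dicts shaped like `list_objects_v2` entries, and the keys shown are hypothetical; in practice you would fetch the listing with boto3 and persist the watermark between runs, e.g. in a Lakehouse file):

```python
from datetime import datetime, timezone

def select_new_blobs(blobs, watermark):
    """Return blobs modified after `watermark`, plus the new watermark.

    Each blob is a dict like an S3 listing entry:
    {"Key": "...", "LastModified": datetime}. Only blobs strictly newer
    than the stored watermark are selected, so files copied on a previous
    run are never transferred again.
    """
    new = [b for b in blobs if b["LastModified"] > watermark]
    next_watermark = max((b["LastModified"] for b in new), default=watermark)
    return new, next_watermark

# Mocked listing: two blobs copied on earlier runs, one new since the last run.
listing = [
    {"Key": "sales/2024/01/part-0.parquet",
     "LastModified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"Key": "sales/2024/02/part-0.parquet",
     "LastModified": datetime(2024, 2, 5, tzinfo=timezone.utc)},
    {"Key": "sales/2024/03/part-0.parquet",
     "LastModified": datetime(2024, 3, 5, tzinfo=timezone.utc)},
]
last_run = datetime(2024, 2, 10, tzinfo=timezone.utc)
to_copy, new_mark = select_new_blobs(listing, last_run)
print([b["Key"] for b in to_copy])  # only the March file is copied
```

On the next scheduled run you would load `new_mark` back in as the watermark, so each run transfers only what changed since the previous one.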

HTH 

Thanks 
Himanshu 

Hi @Tviney 

 

We haven't heard from you since the last response and wanted to check back to see if you have a resolution yet.
If you do, please share it with the community, as it can be helpful to others.
Otherwise, please respond with more details and we will try to help.


Thanks

Hi @Tviney 

 

We haven't heard from you since the last response and wanted to check back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.


Thanks

v-cboorla-msft
Community Support

Hi @Tviney 

 

Thanks for using Microsoft Fabric Community.

Your current approach of using shortcuts for initial data import and Apache Spark job definitions for scheduled updates is a viable option and has several advantages:

Advantages:

  • Ease of use: Shortcuts offer a simple setup without requiring complex scripting.
  • Incremental data updates: You can efficiently load only new or changed data, minimizing data transfer and processing costs.
  • Scalability: Spark jobs can handle large data volumes efficiently.
  • Cost-effectiveness: By focusing on transferring only new data, you minimize AWS egress costs.

The ideal approach depends on your specific needs and expertise.

  • Shortcuts + Spark Jobs: Simple and efficient, good for initial import and incremental updates.
  • Data Factory Pipelines: Flexible and customizable, suitable for complex scenarios and handling various data formats.

There's no single "best" method. The optimal approach depends on your specific requirements, technical expertise, and budget. Evaluating the pros and cons of each option and considering your specific needs will help you make the best decision for your situation.

 

I hope this helps. If you have any further questions please do let us know.

Hi,

 

I understand that there is no single best method. In terms of using shortcuts + Spark jobs, would this actually minimise data egress from AWS?

At the moment, my shortcut is just one Parquet file. If I want to do an incremental refresh, wouldn't I need to read the whole dataset into my Spark DataFrame in order to filter out just the newly updated records?
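One way to avoid reading the whole dataset is partition pruning (a sketch, assuming the S3 keys are laid out in Hive-style date partitions; the root path and `ingest_date` column are hypothetical): instead of reading the shortcut root and filtering in the DataFrame, build only the partition paths for the dates you need, so the reader lists and fetches just those objects and egress stays proportional to the new data.

```python
from datetime import date, timedelta

def partition_paths(root, start, end):
    """Build per-day partition paths (Hive-style `ingest_date=YYYY-MM-DD`)
    for the half-open date range [start, end).

    Handing only these paths to the reader means only those objects are
    listed and fetched; the rest of the bucket is never touched, which
    keeps S3 egress down.
    """
    days = (end - start).days
    return [f"{root}/ingest_date={start + timedelta(days=i):%Y-%m-%d}"
            for i in range(days)]

paths = partition_paths("Files/s3_shortcut/sales",
                        date(2024, 4, 1), date(2024, 4, 4))
print(paths)
# These paths would then be passed to the Spark reader,
# e.g. spark.read.parquet(*paths), rather than reading the shortcut root.
```

This only helps if the S3 side is actually partitioned by something like ingest date; with a single unpartitioned Parquet file, any refresh has to re-read that whole file.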

Additionally, my latest understanding is that the AWS data lake is not in Delta table format; otherwise Fabric would recognise the Delta Parquet files as tables and add them to the Tables section so they can be queried.

Are there any other viable options for loading data on a schedule from AWS to Fabric?
