Tviney
Regular Visitor

Replicating AWS S3 Bucket into Fabric recommendations

Hi All,

 

I have a data lake in AWS S3 buckets and want to create a replica inside Fabric. I'm aware of the shortcuts feature, and it's great, but it only allows connections to the Parquet files.

 

What would people recommend as the best method for replicating AWS S3 inside Fabric, using shortcuts, Data Factory, notebooks, etc.? I want to have all the tables available while being mindful of AWS egress costs.

 

At the moment I'm thinking of using shortcuts, then Apache Spark job definitions to load the data on a schedule.

5 REPLIES
HimanshuS-msft
Community Support

Hello @Tviney ,
As called out by @v-cboorla-msft , you can take any of the above approaches. One thing worth keeping an eye on is how you make sure you are not copying the old files/blobs again. I am not sure what the structure of the file system is on the S3 side; if it is partitioned, this will be easy to implement. In the worst case you may have to keep a watermark of the metadata of the blob that was copied last time.
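That watermark check can be sketched in plain Python (a minimal sketch: the S3 listing is mocked as a list of dicts shaped like `list_objects_v2` entries, and the keys shown are hypothetical; in practice you would fetch the listing with boto3 and persist the watermark between runs, e.g. in a Lakehouse file):

```python
from datetime import datetime, timezone

def select_new_blobs(blobs, watermark):
    """Return blobs modified after `watermark`, plus the new watermark.

    Each blob is a dict like an S3 listing entry:
    {"Key": "...", "LastModified": datetime}. Only blobs strictly newer
    than the stored watermark are selected, so files copied on a previous
    run are never transferred again.
    """
    new = [b for b in blobs if b["LastModified"] > watermark]
    next_watermark = max((b["LastModified"] for b in new), default=watermark)
    return new, next_watermark

# Mocked listing: two blobs copied on earlier runs, one new since the last run.
listing = [
    {"Key": "sales/2024/01/part-0.parquet",
     "LastModified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"Key": "sales/2024/02/part-0.parquet",
     "LastModified": datetime(2024, 2, 5, tzinfo=timezone.utc)},
    {"Key": "sales/2024/03/part-0.parquet",
     "LastModified": datetime(2024, 3, 5, tzinfo=timezone.utc)},
]
last_run = datetime(2024, 2, 10, tzinfo=timezone.utc)
to_copy, new_mark = select_new_blobs(listing, last_run)
print([b["Key"] for b in to_copy])  # only the March file is copied
```

On the next scheduled run you would load `new_mark` back in as the watermark, so each run transfers only what changed since the previous one.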

HTH 

Thanks 
Himanshu 

Hi @Tviney 

 

We haven't heard from you since the last response and wanted to check back to see if you have a resolution yet.
If you do, please share it with the community, as it can be helpful to others.
Otherwise, please respond with more details and we will try to help.


Thanks

Hi @Tviney 

 

We haven't heard from you since the last response and wanted to check back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others.
If you have any question relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.


Thanks

v-cboorla-msft
Community Support

Hi @Tviney 

 

Thanks for using Microsoft Fabric Community.

Your current approach of using shortcuts for initial data import and Apache Spark job definitions for scheduled updates is a viable option and has several advantages:

Advantages:

  • Ease of use: Shortcuts offer a simple setup without requiring complex scripting.
  • Incremental data updates: You can efficiently load only new or changed data, minimizing data transfer and processing costs.
  • Scalability: Spark jobs can handle large data volumes efficiently.
  • Cost-effectiveness: By focusing on transferring only new data, you minimize AWS egress costs.

The ideal approach depends on your specific needs and expertise.

  • Shortcuts + Spark Jobs: Simple and efficient, good for initial import and incremental updates.
  • Data Factory Pipelines: Flexible and customizable, suitable for complex scenarios and handling various data formats.

There's no single "best" method. The optimal approach depends on your specific requirements, technical expertise, and budget. Evaluating the pros and cons of each option and considering your specific needs will help you make the best decision for your situation.

 

I hope this helps. If you have any further questions please do let us know.

Hi,

 

I understand that there is no single best method. In terms of using shortcuts + Spark jobs, would this actually minimise data egress from AWS?

At the moment, my shortcut is just one Parquet file. If I want to do an incremental refresh, wouldn't I need to read the whole dataset into my Spark DataFrame in order to filter out just the newly updated records?
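One way to avoid reading the whole dataset is partition pruning (a sketch, assuming the S3 keys are laid out in Hive-style date partitions; the root path and `ingest_date` column are hypothetical): instead of reading the shortcut root and filtering in the DataFrame, build only the partition paths for the dates you need, so the reader lists and fetches just those objects and egress stays proportional to the new data.

```python
from datetime import date, timedelta

def partition_paths(root, start, end):
    """Build per-day partition paths (Hive-style `ingest_date=YYYY-MM-DD`)
    for the half-open date range [start, end).

    Handing only these paths to the reader means only those objects are
    listed and fetched; the rest of the bucket is never touched, which
    keeps S3 egress down.
    """
    days = (end - start).days
    return [f"{root}/ingest_date={start + timedelta(days=i):%Y-%m-%d}"
            for i in range(days)]

paths = partition_paths("Files/s3_shortcut/sales",
                        date(2024, 4, 1), date(2024, 4, 4))
print(paths)
# These paths would then be passed to the Spark reader,
# e.g. spark.read.parquet(*paths), rather than reading the shortcut root.
```

This only helps if the S3 side is actually partitioned by something like ingest date; with a single unpartitioned Parquet file, any refresh has to re-read that whole file.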

Additionally, my latest understanding is that the AWS data lake is not in Delta table format; otherwise Fabric would recognise the Delta Parquet files as tables and add them to the Tables section so they can be queried.

Are there any other viable options for loading data on a schedule from AWS to Fabric?
