Hi All,
I have a data lake in AWS S3 buckets and want to create a replica inside Fabric. I'm aware of the shortcuts feature and this is great, but it only allows connections to the parquet files.
What would people recommend as the best method for replicating AWS S3 inside Fabric, using shortcuts, Data Factory, notebooks, etc.? I want to have all the tables available whilst being mindful of AWS egress costs.
At the moment I'm thinking of using shortcuts and then Apache Spark job definitions to load the data on a schedule.
Hello @Tviney ,
As called out by @v-cboorla-msft, you can take any of the above approaches. But one thing worth keeping an eye on is how you make sure old files/blobs are not copied again. I am not sure what the structure of the file system is on the S3 side; if the data is partitioned, this will be easy to implement. In the worst case, you may have to keep a watermark of the metadata of the blob that was copied last time, along the lines of the sketch below.
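For example, here is a minimal sketch of that watermark idea: keep the LastModified timestamp of the newest blob copied so far, and on the next run only pick up S3 objects modified after it. The bucket name, prefix, and watermark file location are placeholders, and it assumes AWS credentials are already configured:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import boto3  # assumes AWS credentials are available in the environment

BUCKET = "my-datalake-bucket"            # placeholder bucket name
PREFIX = "sales/"                        # placeholder prefix within the bucket
WATERMARK_FILE = Path("watermark.json")  # where the last-copied timestamp is kept

def load_watermark() -> datetime:
    """Return the LastModified timestamp of the newest blob copied so far."""
    if WATERMARK_FILE.exists():
        return datetime.fromisoformat(json.loads(WATERMARK_FILE.read_text())["last_modified"])
    return datetime.min.replace(tzinfo=timezone.utc)

def save_watermark(ts: datetime) -> None:
    WATERMARK_FILE.write_text(json.dumps({"last_modified": ts.isoformat()}))

def new_objects_since(watermark: datetime) -> tuple[list[str], datetime]:
    """List S3 keys modified after the watermark, plus the newest timestamp seen."""
    s3 = boto3.client("s3")
    keys, newest = [], watermark
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > watermark:
                keys.append(obj["Key"])
                newest = max(newest, obj["LastModified"])
    return keys, newest

keys, newest = new_objects_since(load_watermark())
# ... copy only `keys` into Fabric here (pipeline, notebook, etc.) ...
if keys:
    save_watermark(newest)
```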
HTH
Thanks
Himanshu
Hi @Tviney
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you have found a resolution, please share it with the community, as it can be helpful to others.
Otherwise, please reply with more details and we will try to help.
Thanks
Hi @Tviney
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet. If you have found a resolution, please share it with the community, as it can be helpful to others.
If you have any questions relating to the current thread, please let us know and we will try our best to help you.
If you have a question about a different issue, we request that you open a new thread.
Thanks
Hi @Tviney
Thanks for using Microsoft Fabric Community.
Your current approach of using shortcuts for the initial data access and Apache Spark job definitions for scheduled updates is a viable option. Shortcuts reference the S3 data in place without copying it into OneLake, and Spark job definitions give you control over when and how the data is loaded; a rough sketch of the scheduled update is shown below.
There is no single "best" method. The optimal approach depends on your specific requirements, technical expertise, and budget, so weighing the pros and cons of each option against your needs will help you make the right decision for your situation.
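As a rough sketch only (not a definitive implementation), the scheduled Spark job definition could look something like the following. The shortcut path, the target table name, and the load_date column used for the incremental filter are assumptions about your data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

shortcut_path = "Files/aws_s3_shortcut/sales"   # hypothetical S3 shortcut location
target_table = "sales_replica"                  # hypothetical Lakehouse table name

# Read the parquet files exposed through the shortcut and keep only the most
# recent slice (assumes a load_date column exists in the source data).
incoming = (
    spark.read.parquet(shortcut_path)
         .where(F.col("load_date") >= F.date_sub(F.current_date(), 1))
)

# Append the new slice to the Lakehouse table; a Delta MERGE on a key column
# could be used instead if existing rows need to be updated rather than appended.
incoming.write.mode("append").saveAsTable(target_table)
```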
I hope this helps. If you have any further questions, please do let us know.
Hi,
I understand that there is no single best method. In terms of using shortcuts + Spark jobs, would this actually minimise data egress from AWS?
At the moment, my shortcut is just one parquet file. If I want to do an incremental refresh, would I not need to read the whole dataset into my Spark dataframe in order to filter just the newly updated records? (A rough sketch of what I am hoping is possible is at the end of this post.)
Additionally, my latest understanding is that the AWS data lake is not in Delta table format; otherwise Fabric would recognise the Delta parquet files as tables and add them to the Tables section so they can be queried.
Are there any other viable options for loading data on a schedule from AWS to Fabric?
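For context, this is roughly what I am hoping is possible. It is purely illustrative and assumes the S3 data were laid out in date partitions (e.g. .../sales/ingest_date=YYYY-MM-DD/), which is not how it is stored today; with such a layout, Spark would prune partitions so only the matching folders are read through the shortcut rather than the whole dataset:

```python
from pyspark.sql import functions as F

# Assumes a Spark session named `spark`, as in a Fabric notebook or job definition.
shortcut_path = "Files/aws_s3_shortcut/sales"   # hypothetical shortcut path

df_new = (
    spark.read.parquet(shortcut_path)
         .where(F.col("ingest_date") == "2024-05-01")  # filter on the partition column,
                                                       # so only that folder is scanned
)
```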