I've been polishing my algorithm with small data samples for days, trying to make it as efficient as possible. Currently my main table has 23 steps, a couple of which use external lists of about 200 elements.
The problem is that I want to apply that same series of steps to about 100 .csv files, each with 1 column and between 20 and 70 million rows of phrases (~2 GB per file). If there is no alternative, I can settle for a sample of 1 million rows from each file, but the bigger the sample, the better.
On my computer (Surface Pro 4, 8 GB RAM, i5), this process takes a very long time for each file (I have not been able to measure it accurately, but with a million rows it is over an hour).
I would like to know what recommendations you have for my case. I don't know anything about cloud computing, but I heard that it consists of renting powerful servers to run tasks. Anyway, could someone with experience guide me on which path I should take?
@Greg_Deckler I am completely unaware of those technologies. I have started doing my research based on your message, but I get dizzy with the endless new terms and technical questions. Do you think it would take me a long time to learn what I need in order to achieve my goal, or should I hire a professional? I already have the files and the algorithm; all I need is to run them with more computing power.
By the way, I would really appreciate it if you could give me a brief introduction or some guidance on the subject, whatever you think can save me time. Thank you very much.
@Jimmy801 Thanks, I had already done that 🙂
@carlpaul153 Well, if it were me, I would hire someone! I'm not sure of your technical savvy, but at the scale you are talking about this is a no-joke implementation, so I would pay someone who already has the requisite skills to do it right. To be clear, that is only one route to go; there are other technologies you could use, like Snowflake or Azure Synapse, etc. But they all share common steps: ship the data to the cloud, landing it in some low-cost storage like blob storage, then transform the data and land it in a high-powered, scalable database, cube, etc. If you need serious data transformation horsepower then you are looking at Databricks, but for normal workloads Azure Data Factory should suffice.
Hello @carlpaul153
the first thing that comes to my mind is to wrap your list in List.Buffer. If you reference the list in a new column, it may be re-evaluated for every row, whereas a buffered list is read into memory only once.
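Here is a minimal sketch of the idea. The list contents, file path, delimiter and step names are placeholders, so substitute your own query's references:

```
let
    // Placeholder for your ~200-element external list (use your actual list query here)
    ExternalList = {"phrase one", "phrase two", "phrase three"},
    // Buffer the list once so row-by-row steps read it from memory
    // instead of re-evaluating its source repeatedly
    BufferedList = List.Buffer(ExternalList),
    // Load one of the single-column .csv files (path and options are assumptions)
    Source = Csv.Document(File.Contents("C:\data\phrases.csv"), [Delimiter = ",", Columns = 1, Encoding = 65001]),
    // Example step that references the buffered list for each row
    Filtered = Table.SelectRows(Source, each List.Contains(BufferedList, [Column1]))
in
    Filtered
```

Table.Buffer works the same way if your query references another table in several steps.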
Give it a try and let us know
You could also post your query; that might make it easier to help you.
All the best
Jimmy
@carlpaul153 Well, this sounds like optimizing Power Query, so that is probably a good topic for @edhans or @ImkeF. But that's a lot of rows. I would highly recommend importing all of that data into something like Analysis Services, which can scale to handle that sort of thing. Azure Data Factory to Azure Data Lake Storage Gen2 and then on to Analysis Services is the typical path. Then you use DirectQuery and aggregation tables. I highly doubt you are going to get a few billion rows into a data model on your laptop.