I've been polishing my algorithm with small data samples for days, trying to make it as efficient as possible. Currently my main table has 23 steps, a couple of which use external lists of about 200 elements.
The problem is that I want to apply that same series of steps to about 100 .csv files, each with 1 column and between 20 and 70 million rows of phrases (~2 GB per file). If there is no alternative, I can settle for a sample of 1 million rows from each file, but the bigger the sample, the better.
On my computer (Surface Pro 4, 8 GB RAM, i5), this process takes a very long time for each file (I have not been able to measure it accurately, but with a million rows it is over an hour).
I would like to know what recommendations you have for my case. I don't know anything about cloud computing, but I heard that it consists of renting powerful servers to run tasks. Anyway, could someone with experience guide me on which path I should take?
@Greg_Deckler I am completely unaware of those technologies. I have started doing my research based on your message, but I get dizzy with the endless new terms and technical questions. Do you think it would take me a long time to learn what I need in order to achieve my goal, or should I hire a professional? I already have the files and the algorithm; all I need is to run them with more computing power.
By the way, I would really appreciate it if you could give me a brief introduction or some guidance on the subject, whatever you think can save me time. Thank you very much.
@Jimmy801 Thanks, I had already done that 🙂
@carlpaul153 Well, if it were me, I would hire someone! I'm not sure of your technical savvy, but at the scale you are talking about this is a no-joke implementation, so I would pay someone who already has the requisite skills to do it right. To be clear, that is only one route to go; there are other technologies you could use, like Snowflake or Azure Synapse, etc. But they all share common steps: ship the data to the cloud, landing it in some low-cost storage like blob storage, then transform the data and land it in a high-powered, scalable database, cube, etc. If you need serious data transformation horsepower then you are looking at Databricks, but for normal workloads Azure Data Factory should suffice.
Hello @carlpaul153
the first thing that comes to my mind is to wrap your list in List.Buffer. If you reference the list in a new column, it may be re-evaluated for every row, whereas a buffered list is read into memory only once.
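Here is a minimal sketch of the idea. The list contents, file path, delimiter and step names are placeholders, so substitute your own query's references:

```
let
    // Placeholder for your ~200-element external list (use your actual list query here)
    ExternalList = {"phrase one", "phrase two", "phrase three"},
    // Buffer the list once so row-by-row steps read it from memory
    // instead of re-evaluating its source repeatedly
    BufferedList = List.Buffer(ExternalList),
    // Load one of the single-column .csv files (path and options are assumptions)
    Source = Csv.Document(File.Contents("C:\data\phrases.csv"), [Delimiter = ",", Columns = 1, Encoding = 65001]),
    // Example step that references the buffered list for each row
    Filtered = Table.SelectRows(Source, each List.Contains(BufferedList, [Column1]))
in
    Filtered
```

Table.Buffer works the same way if your query references another table in several steps.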
Give it a try and let us know
You could also post your query; that might make it easier to help you.
All the best
Jimmy
@carlpaul153 Well, this sounds like optimizing Power Query, so that is probably a good topic for @edhans or @ImkeF. But that's a lot of rows. I would highly recommend importing all of that data into something like Analysis Services, which can scale to handle that sort of thing. Azure Data Factory to Azure Data Lake Storage Gen2 and then on to Analysis Services is the typical path. Then you use DirectQuery and aggregation tables. I highly doubt you are going to get a few billion rows into a data model on your laptop.