richa_gupta_224
New Member

How do you decide which ML algorithm to start with for a new dataset?

Hi everyone,
I’m currently learning data science and machine learning, and I often get confused when starting a new project. With so many algorithms available (Linear Regression, Random Forest, XGBoost, SVM, etc.), how do you decide which one to try first?

Do you rely more on dataset size, feature types, interpretability, or just experimentation? I’d really appreciate hearing how experienced data scientists approach this step in real-world projects.

Thanks in advance!

1 ACCEPTED SOLUTION
v-echaithra
Community Support

Hi @richa_gupta_224 ,

In real-world ML projects, we usually don’t try to identify the “best” algorithm upfront. Instead, the focus is on understanding the data and establishing baselines.

A typical approach looks like this:

1. Start with a simple baseline (Linear or Logistic Regression). This helps validate the data, uncover leakage, and set a reference point.

2. Evaluate a few strong models in parallel: usually one linear model, one tree-based model (Random Forest), and one boosting model (XGBoost/LightGBM). The data often makes the choice clearer.
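A minimal sketch of that baseline-plus-candidates loop with scikit-learn. The dataset here is synthetic, and GradientBoostingClassifier stands in for XGBoost/LightGBM; swap in your own X and y:

    # Baseline-first model comparison: a minimal sketch using scikit-learn.
    # X, y are synthetic here; substitute your own tabular features and target.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # One linear baseline, one tree ensemble, one boosting model.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=42),
        "gradient_boosting": GradientBoostingClassifier(random_state=42),
    }

    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

The baseline's score is the reference point: any heavier model that cannot beat it is not earning its complexity.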

Let practical constraints guide selection:

Need interpretability → simpler models (see the sketch after this list)

Tabular data and performance-driven → tree-based or boosting models

Small datasets → avoid overly complex models
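For the interpretability point, here is an illustrative sketch: a linear model explains itself through per-feature coefficients, which is often all a stakeholder needs. The feature names below are made up for the example; substitute your own columns:

    # Interpretability sketch: a fitted linear model explains itself via coefficients.
    # Feature names are hypothetical; substitute your own columns.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    feature_names = ["age", "income", "tenure_months", "num_purchases"]
    X, y = make_classification(n_samples=500, n_features=4, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Rank features by the magnitude of their coefficients.
    for idx in np.argsort(-np.abs(model.coef_[0])):
        print(f"{feature_names[idx]}: coef = {model.coef_[0][idx]:+.3f}")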

Be cautious of extremely high accuracy: in practice it more often signals target leakage than a genuinely superior model. Iterate over time. As data, features, or business requirements evolve, the "best" model can change.

The key takeaway is that model selection is an iterative engineering process, not a one-time decision. In most cases, improvements come more from better data and features than from switching algorithms.
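One common source of those too-good-to-be-true scores is preprocessing fitted on the full dataset before cross-validation. A minimal guard, sketched on synthetic data (your own X, y, and preprocessing steps would differ): keep every step inside a Pipeline so it is re-fit on each training fold only.

    # Leakage guard sketch: scaling lives inside the pipeline, so it is
    # fit only on each training fold, never on the held-out fold.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5)

    print(f"mean CV accuracy: {scores.mean():.3f}")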

Hope this helps.
Thank you.


5 REPLIES
AnShikagautam
New Member

We can use this for upcoming technology.


Thank you for the clear explanation. I really liked the idea of starting with simple baselines and letting data and constraints guide the choice; it makes the process feel much more practical than chasing the best algorithm upfront.
One follow-up question: when setting a baseline, how do you decide whether the model is good enough to move forward, or whether the issue lies more in feature engineering than in trying a more complex model?

lbendlin
Super User

You do as many as you can in parallel, then pick the one that has the best success with your actual data. Marvel at what you have achieved, and then rinse and repeat every so often, when your chosen model is no longer the "best".

Do not pick a model whose success rate is close to 100%. Such a rate is an indication that you don't need ML in the first place.

75% to 90% is very respectable.
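A rough sketch of that sanity check (synthetic data; RandomForestClassifier is just a stand-in for whichever model won your parallel comparison, and the thresholds mirror the bands above):

    # Sanity-check the winning model's cross-validated score band.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

    best_model = RandomForestClassifier(random_state=7)  # stand-in for your winner
    score = cross_val_score(best_model, X, y, cv=5, n_jobs=-1).mean()

    print(f"cross-validated accuracy: {score:.3f}")
    if score >= 0.99:
        print("Near-perfect: maybe you don't need ML, or something is leaking.")
    elif 0.75 <= score <= 0.90:
        print("In the respectable 75-90% band.")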

Thank you for the clear explanation; this really helped me understand the importance of baselines and iteration.
