Hi everyone,
I’m currently learning data science and machine learning, and I often get confused when starting a new project. With so many algorithms available (Linear Regression, Random Forest, XGBoost, SVM, etc.), how do you decide which one to try first?
Do you rely more on dataset size, feature types, interpretability, or just experimentation? I’d really appreciate hearing how experienced data scientists approach this step in real-world projects.
Thanks in advance!
Hi @richa_gupta_224 ,
In real-world ML projects, we usually don’t try to identify the “best” algorithm upfront. Instead, the focus is on understanding the data and establishing baselines.
A typical approach looks like this:
Start with a simple baseline (Linear or Logistic Regression). This helps validate the data, uncover leakage, and set a reference point. Evaluate a few strong models in parallel, usually one linear model, one tree-based model (Random Forest), and one boosting model (XGBoost/LightGBM). The data often makes the choice clearer.
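To make that concrete, here is a minimal sketch in Python (assuming scikit-learn and a synthetic placeholder dataset; GradientBoostingClassifier stands in for XGBoost/LightGBM so the example runs without extra dependencies) of fitting a simple baseline and evaluating a couple of stronger candidates in parallel with cross-validation:

```python
# Minimal sketch: one simple baseline plus a few candidate models,
# compared with cross-validation. The dataset below is a synthetic placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    # Simple, interpretable baseline: validates the pipeline and sets a reference score.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # One tree-based model and one boosting model, evaluated alongside the baseline.
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Whatever the exact models, the idea is the same: the baseline score becomes the reference that the stronger candidates have to clearly beat.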
Let practical constraints guide selection:
Need interpretability → prefer simpler models
Tabular data and performance-driven → tree-based or boosting models
Small datasets → avoid overly complex models
Be cautious of extremely high accuracy, and iterate over time: as data, features, or business requirements evolve, the “best” model can change.
The key takeaway is that model selection is an iterative engineering process, not a one-time decision. In most cases, improvements come more from better data and features than from switching algorithms.
Hope this helps.
Thank you.
We can use this for upcoming technology.
Thank you for the clear explanation. I really liked the idea of starting with simple baselines and letting data and constraints guide the choice; it makes the process feel much more practical than chasing the best algorithm upfront.
I had one follow-up question: when setting a baseline, how do you decide whether the model is good enough to move forward, or whether the issue lies more in feature engineering than in trying a more complex model?
You run as many models as you can in parallel, then pick the one that has the best success with your actual data. Marvel at what you have achieved, and then rinse and repeat every so often, when your chosen model is no longer the "best".
Do not pick a model that has close to 100% success rate. Such a rate is an indication that you don't need ML in the first place.
75% to 90% is very respectable.
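To illustrate the "don't trust near-100% accuracy" point, here is a minimal sanity-check sketch (the synthetic dataset, the model choices, and the 0.98 threshold are all hypothetical): compare each candidate against a trivial majority-class baseline and flag anything that scores suspiciously high, since per the point above that usually means something is off, such as leakage or a problem that doesn't need ML at all.

```python
# Minimal sanity-check sketch: a trivial baseline plus a real model,
# with a flag for suspiciously high cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Trivial reference: always predicts the majority class, so any useful model should beat it.
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    note = "  <- suspiciously high, check for leakage" if acc > 0.98 else ""
    print(f"{name}: mean accuracy {acc:.3f}{note}")
```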
Thank you for the clear explanation, this really helped me understand the importance of baselines and iteration.