Lessons Learned: 10 Common Mistakes in Data Science Projects
The more you know about the past, the better you are prepared for the future — Theodore Roosevelt
Data Scientist: Data scientists are a new branch of analytical data experts who examine which questions need answering, and have the analytical skills to solve more complicated problems.
In today’s world, many data analytics teams at companies such as Microsoft and IBM have adopted the CRISP-DM methodology. It stands for Cross-Industry Standard Process for Data Mining, a structured way to guide data science efforts. Frameworks like this also improve team collaboration by recommending how the team roles should work together. Yet even when data science teams follow this cycle, many projects still fail in the end, for a variety of reasons.
Therefore, I would like to talk about the most common mistakes that I and other data scientists have made on projects, framed as “lessons learned”. Based on my discussions with many senior data scientists working for big tech companies such as Amazon, Google, Facebook, Uber, and Apple, I created this lessons-learned document to help aspiring data scientists in their future work.
1. The biggest reason many projects fail is a lack of business understanding and communication. Before moving on to the “Data Acquisition & Preparation” phase of the project, make sure that stakeholders and project members are on the same page. To get there, the data analytics team should define the project objectives by asking and refining “sharp” questions that are relevant, specific, and unambiguous, in line with the business requirements.
2. Exploring and verifying data are two vital processes in data preparation. If these steps are not completed properly, the project is very likely to fail. In the real world, some data analysts rush through this step to finish the project on time. However, this leads to problems and lost time in later steps.
3. According to many data scientists, data preparation is the longest part of a data mining project. They report spending the majority of their time (60%) on data understanding and preparation. This is because steps such as understanding data structures, preparing attributes for the model, dealing with missing data, finding the right inputs, and removing correlated features take more time than other steps. The better data scientists understand the dynamics of the data, the more accurate the final model will be.
4. Attributes have to be analyzed statistically before model selection begins. Data scientists sometimes underestimate the importance of statistical analysis. Statistics (sampling, hypothesis testing, correlation, and distributions) plays a significant role in understanding the relationship between the independent variables and the dependent variable. You should also devote more time to statistical techniques when you work on outlier detection and root cause analysis. Adding this scientific approach increases a project’s chance of success.
5. Data scientists often apply many advanced algorithms to reach conclusions more quickly, and in doing so they skip understanding the logic behind these models. Take linear regression, which everyone knows and which is commonly used to solve regression problems. Data scientists nevertheless forget to check whether their data meet the five assumptions of linear regression: linearity, normality of the residuals, independence of the residuals, little or no multicollinearity, and homoscedasticity. If the data do not meet these requirements, the results of a linear regression will not be reliable.
6. If data scientists want to finish a project successfully, they must know how to use algorithms and packages effectively. For example, when you use the Random Forest algorithm in Python, you have to spend time on its parameters, such as ‘max_features’, ‘max_depth’, ‘min_samples_split’, ‘min_samples_leaf’ and ‘bootstrap’. These parameters control the complexity of the model you train, and therefore whether it overfits or underfits. With a good understanding of the packages, you are far better placed to build an accurate model on the dataset.
7. Most models face overfitting problems in data mining projects. You should therefore always make an effort to handle overfitting with techniques such as sampling, regularization, early stopping, cross-validation, and pruning. On the other hand, the model you created could instead be underfitting. This means the model is not complex enough to capture the relationships between the dataset’s features and the target variable. In that case, increase the complexity of the model by changing its parameters or doing feature engineering (e.g., adding external variables or PCA).
8. You are likely to encounter imbalanced datasets in the real world. Before modeling, check the distribution of the classes; if it is imbalanced, use resampling techniques such as over-sampling, under-sampling, or a mix of both.
9. The efficiency of the code also plays a vital role in data science projects. As a project moves forward, verify whether your code is efficient in terms of both time and space. First, write the code in the simplest way. Then improve its efficiency so that it holds up at large scale. If you are working on big data projects, inefficient code can cause serious problems in a later phase.
10. Validation is a vital process in big data projects. You have to validate your code at each step, since the approach in your mind and the code you write in a programming language (Python or R) can sometimes differ. For example, suppose you work in the Consumer Packaged Goods (CPG) sector and are building a forecasting model for all SKUs. Take a random sample that represents the whole SKU population, then confirm that your intended and actual results match by checking the results for this sample.
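As an illustration of point 5, here is a minimal sketch of how two of the linear regression assumptions might be checked with NumPy alone. The data is synthetic, and the helper names (`fit_ols`, `durbin_watson`, `vif`) are hypothetical, chosen for this example. The Durbin-Watson statistic and the variance inflation factor (VIF) are standard diagnostics that the text above does not name; they are used here as one possible way to probe independence of the residuals and multicollinearity.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and residuals."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef


def durbin_watson(residuals):
    """Values near 2 suggest the residuals are not serially correlated."""
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


def vif(X):
    """Variance inflation factor per column; large values flag multicollinearity."""
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns and compute its R^2.
        _, resid = fit_ols(others, target)
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Synthetic data (an assumption for illustration): x3 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - x2 + rng.normal(size=500)

_, residuals = fit_ols(X, y)
dw_stat = durbin_watson(residuals)
vif_values = vif(X)
print("Durbin-Watson:", round(dw_stat, 2))
print("VIF per feature:", np.round(vif_values, 1))
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a hard rule. Linearity, normality, and homoscedasticity are not covered by this sketch and are usually inspected with residual plots or dedicated tests.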
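For point 6, the Random Forest parameters listed there can be explored with a small grid search. The sketch below assumes scikit-learn is the package in use (the text only says "Python"), runs on a synthetic dataset, and uses candidate values picked purely for illustration; sensible ranges depend on the data at hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a real dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Illustrative candidate values only, not recommended defaults.
param_grid = {
    "max_features": ["sqrt", None],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "bootstrap": [True, False],
}

# Every combination is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
```

Shallow trees and large leaf sizes push the model toward underfitting, while unlimited depth and single-sample leaves push it toward overfitting. Scoring each combination on held-out folds is one way to see where a given dataset sits between the two.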
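For point 7, cross-validation is one of the listed ways to spot overfitting and underfitting. The following sketch uses polynomial fits of increasing degree on synthetic data, which is an assumption made to keep the example small. The helper `kfold_mse` is hypothetical and written for this illustration.

```python
import numpy as np


def kfold_mse(x, y, degree, k=5, seed=0):
    """Average train and validation MSE of a polynomial fit over k folds."""
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    train_errors, val_errors = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_pred = np.polyval(coefs, x[train_idx])
        val_pred = np.polyval(coefs, x[val_idx])
        train_errors.append(np.mean((train_pred - y[train_idx]) ** 2))
        val_errors.append(np.mean((val_pred - y[val_idx]) ** 2))
    return float(np.mean(train_errors)), float(np.mean(val_errors))


# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=80)
y = np.sin(3.0 * x) + 0.2 * rng.normal(size=80)

# Model complexity grows with the polynomial degree.
results = {degree: kfold_mse(x, y, degree) for degree in (1, 3, 10)}
for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```

When both errors are high, the model is likely underfitting and needs more complexity. When the training error keeps falling but the validation error stalls or rises, the extra complexity is probably fitting noise.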
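For point 8, checking the class distribution and over-sampling the minority class could look roughly like this. The function `random_oversample` is a hypothetical helper written for this example, and the toy data is made up. Libraries such as imbalanced-learn offer more complete resampling tools.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Keep all rows and add random duplicates of smaller classes
    until every class matches the size of the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for label, count in zip(classes, counts):
        members = np.flatnonzero(y == label)
        # Draw the missing rows with replacement from this class.
        extra = rng.choice(members, size=target - count, replace=True)
        keep.append(np.concatenate([members, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Toy imbalanced data (an assumption for illustration): 95 negatives, 5 positives.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 95 + [1] * 5)

print("Before:", dict(zip(*np.unique(y, return_counts=True))))
X_balanced, y_balanced = random_oversample(X, y)
print("After: ", dict(zip(*np.unique(y_balanced, return_counts=True))))
```

Resampling should be applied to the training split only. Duplicating rows before the train/test split lets copies of the same observation land on both sides and inflates the measured performance.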
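The random-sample check described in point 10 can be sketched as follows. Everything here is hypothetical: the sales data is simulated, and a simple moving average stands in for whatever forecasting model the project actually uses. The idea is only to compare the pipeline's output against a deliberately simple reference calculation for a random sample of SKUs.

```python
import random

import numpy as np


def forecast_all(sales, window=3):
    """Vectorized pipeline: forecast each SKU as the mean of its last `window` weeks."""
    return sales[:, -window:].mean(axis=1)


def forecast_one_by_hand(history, window=3):
    """Deliberately simple reference: what we intend the forecast to be."""
    recent = list(history)[-window:]
    return sum(recent) / len(recent)


# Hypothetical data: 1,000 SKUs with 12 weeks of sales each.
rng = np.random.default_rng(42)
sales = rng.poisson(lam=20, size=(1000, 12)).astype(float)
pipeline_forecasts = forecast_all(sales)

# Check a random sample of SKUs against the reference calculation.
random.seed(42)
sample_skus = random.sample(range(len(sales)), k=25)
mismatches = [
    sku
    for sku in sample_skus
    if not np.isclose(pipeline_forecasts[sku], forecast_one_by_hand(sales[sku]))
]
print(f"Checked {len(sample_skus)} SKUs, mismatches: {mismatches}")
```

Any SKU that appears in `mismatches` points to a gap between the intended logic and the code, which is much cheaper to find on a small sample than after the full forecast has shipped.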
Keep in mind: if you want to be a data scientist,
1. Be Patient
2. Never Give Up
3. Love Working On New Challenges
4. Never Stop Learning
5. Understand The Business