5 Concepts Every Data Scientist Should Know
Once a Data Scientist, there are certain skills you will apply each and every day of your career. Some of these might be common techniques you learned during your education, while others may develop fully only after you become more established in your organization. Continuing to hone these skills will provide you with valuable professional benefits.
By Matthew Przybyla, Senior Data Scientist at Favor Delivery.
I have written about common skills that Data Scientists can expect to use in their professional careers, so now I want to highlight some key concepts of Data Science that are beneficial to know and later employ. You may already know some of them and not others; my goal is to explain, from a professional standpoint, why these concepts are beneficial regardless of what you know now. Multicollinearity, one-hot encoding, undersampling and oversampling, error metrics, and lastly, storytelling are the key concepts I think of first when I picture a professional Data Scientist's day-to-day. The last one is perhaps a combination of a skill and a concept, but I still want to highlight its importance in your everyday work life as a Data Scientist. I will expound upon all of these concepts below.
Although the word is somewhat long and hard to say, multicollinearity is simple when you break it down: multi means many, and collinearity means linearly related. Multicollinearity is the situation in which two or more explanatory variables in a regression model explain similar information or are highly related. There are a few reasons this can raise a concern.
For some modeling techniques, it can cause overfitting and, ultimately, a decline in model performance.
The data becomes redundant, and not every feature or attribute is needed in your model. There are a few ways to find out which features contribute to multicollinearity and are candidates for removal:
- variance inflation factor (VIF)
- correlation matrices
Both techniques are commonly used by Data Scientists. Correlation matrices and plots are especially popular and are usually visualized with a heatmap of some sort, while VIF is lesser known.
The higher the VIF value, the more that feature is already explained by the other features, and the less usable it is for your regression model.
A great, simple resource for VIF is Variance Inflation Factor – Statistics How To.
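To make both checks concrete, here is a minimal sketch, assuming NumPy is available. The feature names (x1 to x4), the random data, and the helper `vif_scores` are all hypothetical and exist only for illustration. The helper relies on a standard identity: the VIF of a feature equals the matching diagonal entry of the inverse of the feature correlation matrix.

```python
import numpy as np

def vif_scores(X):
    """Return one VIF per column of X (rows = observations, columns = features).

    The VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on the other features. That value equals the j-th diagonal entry
    of the inverse of the feature correlation matrix, which is used here.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Hypothetical data: x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)
x4 = rng.normal(size=200)  # unrelated feature
X = np.column_stack([x1, x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)  # the matrix a heatmap would visualize
vifs = vif_scores(X)

for name, v in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: VIF = {v:.1f}")
```

In this toy data, x1, x2, and x3 should all show large VIF values because they explain each other, while x4 should stay close to 1. A common rule of thumb treats values above roughly 5 or 10 as worth investigating, but the right cutoff depends on your problem.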
One-hot encoding is a form of feature transformation that represents your categorical features numerically. Whereas the categorical features have text values themselves, one-hot encoding transforms that information so that each distinct value becomes its own feature, and each observation in a row is denoted as either a 0 or a 1 for that feature. For example, if we have the categorical variable gender, the numerical representation after one-hot encoding would look like this (gender before, and male/female after):
Before and after one-hot encoding. Screenshot by Author.
This transformation is useful when you are not just working with numerical features, and need to create that numerical representation with text/categorical features.
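The before-and-after idea can be sketched in a few lines of plain Python. The helper `one_hot` and the sample gender column are hypothetical and only illustrate the mechanics; in practice you would more likely reach for a library routine such as pandas' `get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): the sorted distinct categories, and one
    0/1 row per input value with a single 1 in that value's column.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

gender = ["male", "female", "female", "male"]  # illustrative column
categories, encoded = one_hot(gender)

print(categories)  # ['female', 'male']
for original, row in zip(gender, encoded):
    print(original, row)
```

Each row contains exactly one 1, in the column that matches its original text value.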
When you do not have enough data, oversampling may be suggested as a form of compensation. Say you are working on a classification problem and you have a minority class like the example down below:
class_1 = 100 rows
class_2 = 1000 rows
class_3 = 1100 rows
As you can see, class_1 has a small amount of data compared with the other classes, which means your dataset is imbalanced; class_1 is referred to as the minority class. There are several oversampling techniques. One of them is called SMOTE, which stands for Synthetic Minority Over-sampling Technique. SMOTE works by finding the k nearest neighbors of a minority-class sample and creating synthetic samples between that sample and its neighbors. There are similar techniques that work in reverse for undersampling, reducing the larger classes instead.
These techniques are beneficial when you have outliers in your classification data, or even your regression data, and you want to ensure your sample is the best representation of the data that your model will run on in the future.
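Here is a rough sketch of the interpolation idea behind SMOTE, written in plain Python. It is a simplified illustration and not the reference implementation: the function `smote_like`, its parameters, and the five-row class_1 are hypothetical. For real work, a maintained library such as imbalanced-learn is the safer choice.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class points.

    For each synthetic point: pick a random minority point, pick one of its
    k nearest minority neighbors, and interpolate a random fraction of the
    way between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of base among the other minority points
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # fraction of the way from base to neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical 2-D minority class with only five rows.
class_1 = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_rows = smote_like(class_1, n_new=10)
balanced_class_1 = class_1 + new_rows

print(len(class_1), "->", len(balanced_class_1))  # 5 -> 15
```

Because every synthetic row lies on a line segment between two real minority rows, the new data stays inside the region the minority class already occupies.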
There are plenty of error metrics used for both classification and regression models in Data Science. The scikit-learn library provides several that you can use specifically for regression models.
The two most popular error metrics for regression are MSE and RMSE:
MSE: the concept is → mean squared error regression loss (sklearn)
RMSE: the concept is → the square root of the mean squared error, which puts the error back in the same units as the target
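To see how these regression metrics relate, here is a small sketch in plain Python with made-up numbers. The function names mirror common naming but are defined locally here, and the values in `y_true` and `y_pred` are hypothetical. Mean absolute error (MAE) is included for comparison, since it is easy to mix up with MSE.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """MAE: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """MSE: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical actual values and model predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print(mean_absolute_error(y_true, y_pred))      # 1.0
print(mean_squared_error(y_true, y_pred))       # 1.5
print(root_mean_squared_error(y_true, y_pred))  # about 1.22
```

Squaring penalizes the single error of 2 more heavily than the two errors of 1, which is why MSE and RMSE react more strongly to large misses than MAE does.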
For classification, you can expect to evaluate your model’s performance with accuracy and AUC (Area Under the Curve).
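As a small illustration of the two classification metrics, the sketch below computes them in plain Python. The labels, scores, and the 0.5 threshold are hypothetical. AUC is computed here through its ranking interpretation: the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))   # 0.75
```

Accuracy depends on the threshold you choose, while AUC looks only at how the scores rank the rows, so the two can disagree on the same model.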
The last concept I want to add is a unique one in Data Science: storytelling. I cannot stress enough how important this concept is. It can be seen as a concept or a skill, but the label is not important here. What matters is how well you articulate your problem-solving techniques in a business setting. A lot of Data Scientists will focus solely on model accuracy, but will then fail to understand the entire business process. That process includes:
- what is the business?
- what is the problem?
- why do we need Data Science?
- what is the goal of Data Science here?
- when will we get usable results?
- how can we apply our results?
- what is the impact of our results?
- how do we share our results and overall process?
As you can see, none of these points is the model itself or corresponds to an improvement in accuracy. The focus here is how you will use data to solve your company’s problems. It is beneficial to become acquainted with the stakeholders and non-technical coworkers you will ultimately be working with. You will also work with Product Managers, who will help you assess the problem, and with Data Engineers, who will help collect the data before you even run a base model. At the end of your model process, you will share your results with key individuals, who will usually want to see the impact in some type of visual representation (Tableau, a Google Slides deck, etc.), so being able to present and communicate is beneficial as well.
Original. Reposted with permission.