Statistical and Visual Exploratory Data Analysis with One Line of Code
If EDA is not executed correctly, it can cause us to start modeling with “unclean” data. See how to use Pandas Profiling to perform EDA with a single line of code.
By Brenda Hali, Marketing Data Specialist
Exploratory Data Analysis (EDA) is, in my opinion, the most important part of machine learning modeling on new datasets. If EDA is not executed correctly, we can end up modeling with “unclean” data, and the problem behaves like a snowball rolling downhill: it only gets bigger and worse.
Basic elements of a good Exploratory Data Analysis
Exploratory Data Analysis can be as deep as you want or need it to be, but a basic analysis needs to cover the elements below:
- First and last values
- Dataset shape (#rows and #columns)
- Data/variable types
- Missing and Null values
- Duplicated values
- Descriptive Statistics (Mean, Minimum, Maximum)
- Variable distributions
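To make the checklist concrete, here is a minimal sketch of how each element could be checked by hand with plain pandas. The small DataFrame and its column names (`name`, `mass_g`, `fall`) are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "C", "E"],
        "mass_g": [21.0, 720.0, 15.5, 15.5, None],
        "fall": ["Fell", "Fell", "Found", "Found", "Fell"],
    }
)

print(df.head(2))   # first values
print(df.tail(2))   # last values
print(df.shape)     # (#rows, #columns)
print(df.dtypes)    # data/variable types

n_missing = df.isna().sum()           # missing/null values per column
n_dupes = int(df.duplicated().sum())  # rows that repeat an earlier row
print(n_missing)
print(n_dupes)

print(df.describe())              # count, mean, std, min, quartiles, max
fall_counts = df["fall"].value_counts()
print(fall_counts)                # distribution of a categorical variable
```

Each call covers one bullet, which is exactly the repetitive work a profiling tool bundles into a single report.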
I enjoy performing manual EDA to get to know my data better, but a couple of months ago, Adi Bronshtein introduced me to Pandas Profiling. As it takes quite some time to process, I use it when I want to explore small datasets quickly, and I hope that it speeds up your EDA, too.
Getting started with Pandas Profiling
In this demonstration, I will conduct EDA on NASA's Meteorite Landings dataset.
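As a rough sketch of what the single profiling line can look like: the helper below assumes the package is installed under either its current name (`ydata-profiling`) or the older `pandas-profiling`. The `build_profile` helper, the report title, and the toy DataFrame are hypothetical stand-ins; in practice you would load the real Meteorite Landings file with `pd.read_csv` instead.

```python
import pandas as pd


def build_profile(df, title="Profiling Report"):
    """Return a profiling report for df (hypothetical helper).

    The import is deferred so the helper can be defined even when the
    profiling package is not installed yet.
    """
    try:
        from ydata_profiling import ProfileReport  # current package name
    except ImportError:
        from pandas_profiling import ProfileReport  # older package name
    # This is the "one line": the report covers types, missing values,
    # duplicates, descriptive statistics, and distributions.
    return ProfileReport(df, title=title)


# Toy stand-in for the meteorite data; the values are invented.
meteorites = pd.DataFrame(
    {
        "name": ["Sample A", "Sample B", "Sample C"],
        "mass_g": [21.0, 720.0, None],
        "fall": ["Fell", "Found", "Fell"],
    }
)
```

Calling `build_profile(meteorites).to_file("meteorite_report.html")` would then write the report to an HTML file that you can open in a browser.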
Did you run it already?
Et voilà, easy peasy!
Now the fun starts.
Explore more about Pandas Profiling in its documentation: https://pandas-profiling.github.io/pandas-profiling/docs/
Bio: Brenda Hali is a Marketing Data Specialist based in Washington, D.C. She is passionate about women’s inclusion in technology and data.
Original. Reposted with permission.