A data wrangling cheat sheet for Python data scientists starts with an initial exploratory analysis. This is the crucial first step in data cleaning. Just as an experienced chess player's opening moves can be scripted, a data scientist might have several scripted steps for getting familiar with the data.
Data Wrangling Cheat Sheet First Steps
Some good commands to start the data analysis are to import pandas, load the data into a data frame, and get an overview of the data. These commands are as follows:
import pandas as pd
# Read the file into a DataFrame: df
df = pd.read_csv('file_name.csv')
# Print the head of df
print(df.head())
# Print the tail of df
print(df.tail())
# Print the shape of df
print(df.shape)
# Print the columns of df
print(df.columns)
The commands above will reveal whether the data already meets the standard of tidy data. Note that .shape and .columns are not methods, but attributes. Thus, they don't require parentheses.
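As a minimal sketch of the difference between methods and attributes, the example below builds a small in-memory DataFrame instead of reading a file. The toy values and the variable names (first_rows, last_rows, and so on) are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for 'file_name.csv'
df = pd.DataFrame({
    "String_Column": ["a", "b", "a", None, "c", "a"],
    "Numeric_Column": [1.0, 2.5, float("nan"), 4.0, 5.5, 6.0],
})

# .head() and .tail() are methods, so they are called with parentheses
first_rows = df.head(3)
last_rows = df.tail(2)

# .shape and .columns are attributes, so no parentheses are used
n_rows, n_cols = df.shape
column_names = list(df.columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(column_names)
```

Calling df.shape() with parentheses would raise a TypeError, because a tuple is not callable.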
Next, consider these steps:
# Print the info of df (.info() prints its report directly)
df.info()
# Print the value counts for 'String_Column'
print(df['String_Column'].value_counts(dropna=False))
# Print the description for 'Numeric_Column'
print(df['Numeric_Column'].describe())
The .info() method reports the non-null count for each column, which indicates whether there is missing data. It also shows the number of rows, the number of columns, and the data type of each column.
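One way to turn the .info() report into numbers is to count the missing values per column directly. The sketch below assumes a small toy DataFrame; the data and the name missing_per_column are illustrative, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column
df = pd.DataFrame({
    "String_Column": ["a", "b", "a", None, "c", "a"],
    "Numeric_Column": [1.0, 2.5, float("nan"), 4.0, 5.5, 6.0],
})

# Prints the row count, the non-null count and dtype per column, and memory usage
df.info()

# Count missing values per column: isna() marks them, sum() adds up the True flags
missing_per_column = df.isna().sum()
print(missing_per_column)

# Data type of each column, the same information .info() lists
print(df.dtypes)
```

A column whose non-null count in .info() is lower than the number of rows has a nonzero entry in missing_per_column.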
The .describe() method will calculate summary statistics on numeric data columns.
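A short sketch of .describe() on a single numeric column follows. The toy values are assumed for illustration. Missing values are excluded from the statistics, so the count here is lower than the number of rows.

```python
import pandas as pd

# Hypothetical numeric column with one missing value
df = pd.DataFrame({"Numeric_Column": [1.0, 2.0, 3.0, 4.0, float("nan")]})

# Returns a Series with count, mean, std, min, quartiles, and max
summary = df["Numeric_Column"].describe()
print(summary)

# Individual statistics can be looked up by label
print(summary["count"])  # number of non-missing values
print(summary["50%"])    # the median
```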
The .value_counts() method returns the frequency counts for each unique value in a column. This method also has an optional parameter called dropna, which is True by default. Set dropna=False to include the frequency count of missing values as well.
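The effect of dropna is easiest to see side by side. This is a minimal sketch on assumed toy data; the names counts_default and counts_with_missing are hypothetical.

```python
import pandas as pd

# Hypothetical string column with two missing entries
df = pd.DataFrame({"String_Column": ["a", "b", "a", None, "a", None]})

# Default behaviour (dropna=True): missing values are left out of the counts
counts_default = df["String_Column"].value_counts()
print(counts_default)

# dropna=False: missing values get their own row in the result
counts_with_missing = df["String_Column"].value_counts(dropna=False)
print(counts_with_missing)

# Pull out the count that belongs to the missing-value row
n_missing = counts_with_missing[counts_with_missing.index.isna()].iloc[0]
print(n_missing)
```

With the default setting the counts sum to the number of non-missing entries. With dropna=False they sum to the total number of rows.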
Once a thorough exploration of the data is complete, use visualization techniques such as scatter plots and histograms.
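A possible starting point for these plots, sketched with matplotlib, is shown below. The choice of matplotlib, the toy data, and the column name Other_Numeric_Column are all assumptions for illustration. The non-interactive "Agg" backend is selected only so the sketch runs without a display; remove that line and call plt.show() when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless environment; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with two numeric columns
df = pd.DataFrame({
    "Numeric_Column": [1.0, 2.5, 3.0, 4.0, 5.5, 6.0],
    "Other_Numeric_Column": [2.1, 4.8, 6.2, 8.1, 10.9, 12.2],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of a single numeric column
counts, bin_edges, _ = ax_hist.hist(df["Numeric_Column"].dropna(), bins=4)
ax_hist.set_xlabel("Numeric_Column")
ax_hist.set_ylabel("Frequency")

# Scatter plot: relationship between two numeric columns
points = ax_scatter.scatter(df["Numeric_Column"], df["Other_Numeric_Column"])
ax_scatter.set_xlabel("Numeric_Column")
ax_scatter.set_ylabel("Other_Numeric_Column")

fig.tight_layout()
```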