Data Cleaning with Pandas Code Examples Part 1

Data cleaning is the process of preparing raw data for analysis. It is also known as data pre-processing or data wrangling. Data cleaning is a critical step in data analysis and machine learning. In particular, missing values must be considered for their impact on any future analysis.

Here is a summary of the principles, along with the corresponding pandas commands for Python. I’ll use the abbreviation df throughout, where df = a pandas dataframe.

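If you’re cleaning data for data science/machine learning, you should split the dataset into train and test sets first, and then base your cleaning decisions (such as the mean used to fill a gap) on the train dataset only, so that nothing from the test dataset leaks into the model.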
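
One way to put this into practice is sketched below: the fill value is computed from the training rows and then reused on the test rows, so nothing is learned from the test data. The column names, the sample values, and the 75/25 split are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical demand data with gaps in the "quantity" column.
df = pd.DataFrame({
    "sku": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "quantity": [10.0, np.nan, 30.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})

# Split first: 75% of the rows go to train, the remainder to test.
train = df.sample(frac=0.75, random_state=0)
test = df.drop(train.index)

# Compute the fill value from the training rows only...
train_mean = train["quantity"].mean()

# ...then apply that same value to both sets.
train = train.assign(quantity=train["quantity"].fillna(train_mean))
test = test.assign(quantity=test["quantity"].fillna(train_mean))
```

If you use scikit-learn, its train_test_split function is a common alternative to df.sample() for the split itself.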

  • How should you deal with missing values?  Your options are the following:
    • Try to find the value in the original data source.
    • Drop the missing value or the entire entry, which is done in pandas with the function df.dropna().
      • df.dropna(subset=["quantity"], axis=0, inplace=True) drops every row from your pandas dataframe df that has a missing value in the quantity column.
    • Replace the missing value. You can fill a missing numerical value with the average of similar data points. For a categorical value, you can use the mode, or your own judgment if you have insight into the data. In pandas this is done with the function df.replace(missing_value, new_value).
      • To replace a missing quantity with the mean, perform the following:
        • mean = df["quantity"].mean()
        • df["quantity"] = df["quantity"].replace(np.nan, mean)
        • Alternatively, this can be done in one step with df["quantity"] = df["quantity"].fillna(df["quantity"].mean()).
    • Leave it as is.
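
The drop and replace options above can be tried on a small example. This is a minimal sketch; the dataframe, its column names, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order data; the column names are illustrative.
df = pd.DataFrame({
    "quantity": [5.0, np.nan, 7.0, 9.0],
    "region": ["east", "west", None, "east"],
})

# Drop: remove rows where "quantity" is missing (returns a new dataframe).
dropped = df.dropna(subset=["quantity"])

# Replace (numerical): fill the gap with the column mean.
filled = df.copy()
filled["quantity"] = filled["quantity"].fillna(filled["quantity"].mean())

# Replace (categorical): fill the gap with the most frequent value.
# mode() returns a Series because ties are possible; take the first entry.
filled["region"] = filled["region"].fillna(filled["region"].mode().iloc[0])

print(dropped)
print(filled)
```

Assigning the result back to a column, rather than relying on inplace=True, keeps the original df untouched so the options can be compared side by side.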

In demand planning, missing demand values can be the result of errors, stockouts, or true periods of zero demand. Understanding the reason for the periods of zero demand is critical to deciding how to deal with them. If the periods of zero demand are not errors or canceled orders due to stockouts, then you must determine whether the zero demand will repeat or whether it is a one-time event.
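
As a rough sketch of that distinction, the snippet below treats a zero recorded during a stockout as missing, while keeping a zero recorded when the item was available as a true zero. The in_stock flag, the column names, and the values are all assumptions; your data may record stockouts differently or not at all.

```python
import pandas as pd

# Hypothetical weekly demand; "in_stock" is an assumed availability flag.
demand = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "quantity": [12, 0, 15, 0, 11],
    "in_stock": [True, False, True, True, True],
})

# A zero recorded while out of stock says nothing about true demand,
# so mark it as missing instead of keeping it as a zero.
stockout_zero = (demand["quantity"] == 0) & (~demand["in_stock"])
demand["quantity_clean"] = demand["quantity"].astype(float).mask(stockout_zero)

print(demand)
```

Week 2 becomes a missing value that can then be handled with one of the options listed earlier, while week 4 remains a genuine zero whose cause still needs to be understood.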

 
