Data Cleaning with Pandas Code Examples Part 1

Data cleaning is the process of preparing raw data for analysis. It is also known as data pre-processing or data wrangling. Data cleaning is a critical step in data analysis and machine learning. Missing values must be considered for their impact on any future data analysis.

Here is a summary of the principles, along with the Pandas commands for Python that apply them. I’ll use the abbreviation df for a Pandas dataframe.

If you’re cleaning data for data science/machine learning, you should split the dataset into train and test sets first and then base your cleaning decisions, such as the value used to fill gaps, on the train dataset only.
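A minimal sketch of splitting before cleaning, using only pandas. The column names, the 80/20 ratio, and the choice to reuse the train mean on the test rows are assumptions for illustration, not something the advice above prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: "quantity" has gaps that need cleaning.
df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "quantity": [10.0, np.nan, 30.0, 40.0, np.nan, 60.0, 70.0, 80.0, 90.0, 100.0],
})

# Hold out 20% of the rows as the test set; random_state makes the split repeatable.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# Work out the fill value from the train rows only.
train_mean = train["quantity"].mean()

# Fill gaps in each set with that same train-derived value.
train = train.assign(quantity=train["quantity"].fillna(train_mean))
test = test.assign(quantity=test["quantity"].fillna(train_mean))
```

Computing the fill value from the train rows alone keeps information about the test rows out of the cleaning step.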

  • How should you deal with missing values?  Your options are the following:
    • Try to find the value in the original data source.
    • Drop the missing value or the entire entry, which is done in pandas with the function df.dropna().
      • df.dropna(subset=["quantity"], axis=0, inplace=True) will drop every row from your pandas dataframe df where there is a missing value in the quantity column.
    • Replace the missing value. You can fill the unknown value with the average of similar numerical data points. If it’s a categorical value, you can use the mode, or your own judgment if you have insight into the data. In pandas this is done with the function df.replace(missing_value, new_value).
      • To replace a missing quantity with the mean, perform the following:
        • mean = df["quantity"].mean()
        • df["quantity"] = df["quantity"].replace(np.nan, mean)
        • Alternatively, this can all be done in one step with df["quantity"] = df["quantity"].fillna(df["quantity"].mean()).
    • Leave it as is.
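The options in the list above can be sketched side by side as follows. The small dataframe and its "quantity" and "color" columns are made up for illustration; the pandas calls (isna, dropna, fillna, mean, mode) are standard.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric gap and one categorical gap.
df = pd.DataFrame({
    "quantity": [5.0, np.nan, 7.0, 9.0],
    "color": ["red", "blue", None, "red"],
})

# Count missing values per column before deciding what to do.
missing_counts = df.isna().sum()

# Option: drop every row where "quantity" is missing.
dropped = df.dropna(subset=["quantity"])

# Option: fill numeric gaps with the column mean.
filled = df.copy()
filled["quantity"] = filled["quantity"].fillna(filled["quantity"].mean())

# Option: fill categorical gaps with the mode (the most frequent value).
filled["color"] = filled["color"].fillna(filled["color"].mode()[0])
```

Working on a copy and assigning the result back, rather than relying on inplace=True, keeps the original dataframe available for comparison.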

In demand planning, missing demand values can be the result of errors, stockouts, or true periods of zero demand. Understanding the reason for the periods of zero demand is critical to understanding how to deal with them. If the periods of zero demand are not errors or canceled orders due to stockouts, then you must determine whether the zero demand will repeat or if this is a one-time event.
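As a rough sketch of that triage, the gaps in a demand history can be labeled by likely cause before deciding how to treat them. Everything here is hypothetical: the "units" and "in_stock" columns, the labels, and the assumption that an inventory flag is available to join against.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly demand history; "in_stock" is an assumed flag from inventory records.
demand = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "units": [12.0, np.nan, 15.0, np.nan, 11.0],
    "in_stock": [True, False, True, True, True],
})

missing = demand["units"].isna()

# Label each period so the gaps can be treated differently later.
demand["gap_reason"] = "observed"
demand.loc[missing & ~demand["in_stock"], "gap_reason"] = "stockout"
demand.loc[missing & demand["in_stock"], "gap_reason"] = "needs review"
```

Periods marked "needs review" are the ones where you still have to decide between an error and a true period of zero demand.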

 

Installing Python for Data Science

Python has some fantastic libraries with powerful data analysis tools, but they can be difficult to install. These are the installation instructions for a Mac.

  1.  You’re going to need to install the version of Xcode that matches your operating system. Xcode comes with the development and command line tools needed for Python. You can find Xcode at https://developer.apple.com/download/more/.
  2.  Homebrew is the best way to install and manage Python, so you’re going to need to install it first. In your terminal run: $ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)". Just copy and paste this command (without the leading $) into your command line tool and Homebrew will automatically download and install. The install command has changed over time, so if this one fails, use the current command listed on the Homebrew home page.
  3.  Install the latest stable version of Python 3 onto your Mac using Homebrew by running this command: $ brew install python

Now you’re going to need to install some data science libraries to get going:

  1. To install Matplotlib, the 2D Python plotting library that can produce bar charts, scatterplots, error charts, histograms, and more with just a few lines of code, paste the command pip3 install matplotlib into your command shell.
  2. To install Pandas, the data analysis and data structure toolkit, use the command pip3 install pandas in your command shell.

There are many more tools available, but these two are all you need to start doing some serious data analysis.
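One way to confirm that the installs worked is a short check run with python3. This is a sketch that uses only the standard library, so it runs whether or not the packages are present.

```python
import importlib.util

# The two packages installed in the steps above.
packages = ["matplotlib", "pandas"]

# find_spec returns None when a package cannot be found, without importing it.
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If a package shows as missing, rerunning the matching pip3 install command is the first thing to try.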