- Getting Started with Machine Learning using Python-Data Preprocessing.

**Step 1: Importing the required libraries.**

NumPy and pandas are two essential libraries that we will import every time.

NumPy is a library that contains mathematical functions.

Pandas is the library used to import and manage data sets.
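A minimal sketch of this step, assuming the conventional `np` and `pd` aliases. The small array and dataframe are made up here, only to confirm that both libraries load and work together.

```python
import numpy as np   # mathematical functions and arrays
import pandas as pd  # importing and managing data sets

# Hypothetical values, used only as a quick sanity check of the imports.
values = np.array([1.0, 2.0, 3.0])
frame = pd.DataFrame({"x": values})
mean_x = frame["x"].mean()
print(mean_x)
```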

**Step 2: Importing the Data set**

Data sets are generally available in CSV format. A CSV file stores tabular data in plain text, and each line of the file is a data record. We use the `read_csv` function of the pandas library to read a local CSV file into a dataframe. We then split the dataframe into a matrix of independent variables (the features) and a vector of the dependent variable.
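The following sketch illustrates the idea. The column names and rows are hypothetical, and the CSV text is read from an in-memory string so the example runs without a file on disk. With a real file you would pass its path to `pd.read_csv` instead. It also assumes the dependent variable sits in the last column.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a local file path.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Assumption: every column except the last is an independent variable,
# and the last column is the dependent variable.
X = dataset.iloc[:, :-1].values  # matrix of independent variables
y = dataset.iloc[:, -1].values   # vector of the dependent variable
```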

**Step 3: Handling the missing data.**

The data we get is rarely homogeneous. Data can be missing for various reasons, and the gaps need to be handled so that they do not reduce the performance of our machine learning model. We can replace missing values with the mean or median of the entire column. Older scikit-learn releases provided an `Imputer` class in `sklearn.preprocessing` for this task; in current releases the equivalent is the `SimpleImputer` class in `sklearn.impute`.
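As a rough sketch, mean imputation might look like this. It assumes a recent scikit-learn release, where the imputer is `SimpleImputer` in `sklearn.impute`, and uses a small made-up numeric matrix with `NaN` marking the missing entries.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features (age, salary) with missing entries as NaN.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Replace each missing value with the mean of its column.
# strategy="median" would use the column median instead.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
```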

**Step 4: Encoding Categorical Data**

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this, we import the `LabelEncoder` class from the `sklearn.preprocessing` module.
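A small sketch of label encoding, using a made-up list of "Yes"/"No" labels. `LabelEncoder` assigns integers to the sorted class labels, so the mapping here is "No" to 0 and "Yes" to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical dependent variable with two label values.
y = ["No", "Yes", "No", "Yes", "Yes"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # integer code for each label

# encoder.classes_ holds the original labels, indexed by their code.
print(list(encoder.classes_), list(y_encoded))
```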

**Step 5: Splitting the dataset into test set and training set.**

We partition the dataset into two parts: a training set, used to train the model, and a test set, used to evaluate the performance of the trained model. The split is generally 80/20. We import the `train_test_split` function, which lives in `sklearn.model_selection` in current scikit-learn releases (older releases exposed it in `sklearn.cross_validation`).
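One way to sketch an 80/20 split, assuming a recent scikit-learn release and a toy dataset of ten samples. The fixed `random_state` is an arbitrary choice that makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, plus 10 targets.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 keeps 20% of the samples for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```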

**Step 6: Feature Scaling.**

Many machine learning algorithms use the Euclidean distance between two data points in their computations. Features that vary widely in magnitude, units, and range pose a problem for them. If left alone, these algorithms only take in the magnitude of the features and neglect the units, so the results would vary greatly between different units, for example 5 kg versus 5000 g. Features with high magnitudes weigh in far more in the distance calculations than features with low magnitudes.
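To make the unit problem concrete, here is a small illustration with two made-up data points, each described by a weight and a height. The only difference between the two distance calculations is whether the weight is written in kilograms or in grams.

```python
import numpy as np

# Hypothetical points as [weight, height in metres].
a_kg = np.array([5.0, 1.5])
b_kg = np.array([6.0, 1.9])

# The same two points with the weight expressed in grams.
a_g = np.array([5000.0, 1.5])
b_g = np.array([6000.0, 1.9])

dist_kg = np.linalg.norm(a_kg - b_kg)  # Euclidean distance, weight in kg
dist_g = np.linalg.norm(a_g - b_g)     # Euclidean distance, weight in g

# In grams the weight difference dominates the distance almost entirely.
print(dist_kg, dist_g)
```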

To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.

One method of feature scaling is standardization, which replaces each value with its Z-score: the value minus the column mean, divided by the column standard deviation.
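A minimal sketch of standardization, assuming scikit-learn's `StandardScaler` and small made-up training and test matrices. The scaler is fitted on the training set only, and the test set is transformed with the same mean and standard deviation, which is a common convention rather than something the text above prescribes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features (age, salary) on very different scales.
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data,
# then replace every value with its Z-score.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data.
X_test_scaled = scaler.transform(X_test)
```

After scaling, each training column has a mean of 0 and a standard deviation of 1, so no single feature dominates a distance calculation because of its units.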