Data is of two types:

  1. Categorical
  2. Numerical

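Furthermore, categorical data can be:

  1. Nominal (the categories have no inherent order)
  2. Ordinal (the categories have a natural order)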

Machine learning algorithms often require their input and output to be numeric, so if we have categorical data we need to convert it to a numeric format first.

This can be done in two ways:

  1. Integer or label encoding
  2. One hot encoding

When we have ordinal categorical data, we can simply assign a unique integer to each category, following the order of the categories. This is called integer or label encoding. It is also advantageous because the ML algorithm can make use of the implicit order among these categories.
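As a minimal sketch of integer encoding, the snippet below uses plain Python and a made-up ordinal feature (T-shirt sizes). The category names, `SIZE_ORDER`, and the helper `label_encode` are illustrative assumptions, not part of any library.

```python
# Hypothetical ordinal feature: T-shirt sizes, listed from smallest to largest.
SIZE_ORDER = ["small", "medium", "large"]

# Map each category to an integer that respects the order above.
size_to_int = {category: index for index, category in enumerate(SIZE_ORDER)}


def label_encode(values):
    """Replace each category with its integer code."""
    return [size_to_int[value] for value in values]


sizes = ["medium", "small", "large", "small"]
encoded_sizes = label_encode(sizes)
print(encoded_sizes)  # [1, 0, 2, 0]
```

Because the integers follow the category order, comparisons such as `0 < 1 < 2` mirror `small < medium < large`.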

On the other hand, if we have nominal categorical data, integer encoding can actually make things worse, since the integers imply an order where none exists. Instead, we create "dummy variables", which are just binary variables, one for each category. For each such variable we assign 1 if the data point belongs to that category and 0 otherwise. This is called one hot encoding.

In essence we convert the categorical data to a binary numeric vector.
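The dummy-variable idea can be sketched in a few lines of plain Python. The colour values and the helper `one_hot_encode` are hypothetical, chosen only for illustration; sorting the categories is just one way to fix a reproducible column order.

```python
# Hypothetical nominal feature: colours, which have no natural order.
colors = ["red", "green", "blue", "green"]

# Fix a column order for the dummy variables (sorted only for reproducibility).
categories = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot_encode(values, categories):
    """Return one binary vector per value, with a single 1 marking its category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


encoded_colors = one_hot_encode(colors, categories)
for color, vector in zip(colors, encoded_colors):
    print(color, vector)
# red [0, 0, 1]
# green [0, 1, 0]
# blue [1, 0, 0]
# green [0, 1, 0]
```

Each vector contains exactly one 1, so no category is treated as larger or smaller than another.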