Fundamentally, classification is about predicting a label and regression is about predicting a quantity.
Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.
The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.
Classification Predictive Modeling
Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y). The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
For example, the text of an email can be classified as belonging to one of two classes: “spam” and “not spam”.
- A classification problem requires that examples be classified into one of two or more classes.
- A classification problem can have real-valued or discrete input variables.
- A problem with two classes is often called a two-class or binary classification problem.
- A problem with more than two classes is often called a multi-class classification problem.
- A problem where an example is assigned multiple classes is called a multi-label classification problem.
It is common for classification models to predict, for each class, the probability that a given example belongs to it. For example, a specific email may be assigned a probability of 0.1 of being “spam” and 0.9 of being “not spam”. We can convert these probabilities to a class label by selecting the “not spam” label, as it has the highest predicted probability.
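The conversion from predicted probabilities to a class label can be sketched in a few lines of Python. This is only an illustration: the dictionary of probabilities reuses the numbers from the example above, and the helper name `to_label` is hypothetical rather than part of any library.

```python
# Predicted probabilities for one email, taken from the example in the text.
probabilities = {"spam": 0.1, "not spam": 0.9}


def to_label(class_probabilities):
    """Return the class whose predicted probability is highest (hypothetical helper)."""
    return max(class_probabilities, key=class_probabilities.get)


predicted_label = to_label(probabilities)
print(predicted_label)  # not spam
```

Picking the most probable class is the simplest rule; nothing here assumes a particular model produced the probabilities.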
Regression Predictive Modeling
Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y). A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.
For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.
- A regression problem requires the prediction of a quantity.
- A regression problem can have real-valued or discrete input variables.
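- A problem with multiple input variables is often called a multivariate regression problem, although strictly this is multiple regression; “multivariate” formally refers to multiple output variables.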
- A regression problem where input variables are ordered by time is called a time series forecasting problem.
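As a rough illustration of predicting a quantity, the sketch below fits a straight line by ordinary least squares and uses it to predict a house price. Everything specific here is an assumption made for the example: the three (size, price) pairs are invented, the single input variable is house size, and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one input variable)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted mapping function to a new input."""
    return slope * x + intercept


# Invented training data: house size in square feet and sale price in dollars.
sizes = [1000.0, 1500.0, 2000.0]
prices = [100000.0, 150000.0, 200000.0]

slope, intercept = fit_line(sizes, prices)
predicted_price = predict(1200.0, slope, intercept)
print(predicted_price)  # 120000.0
```

The output is a dollar amount rather than a label, which is what makes this a regression problem; a straight line is just one possible choice of mapping function.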
Classification vs Regression
Classification predictive modeling problems are different from regression predictive modeling problems.
- Classification is the task of predicting a discrete class label.
- Regression is the task of predicting a continuous quantity.
- A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.
- A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.
Importantly, classification and regression predictions are evaluated differently, and the standard metrics for one generally do not apply to the other, for example:
- Classification predictions can be evaluated using accuracy (the fraction of predicted labels that are correct), whereas regression predictions cannot.
- Regression predictions can be evaluated using root mean squared error, whereas predicted class labels cannot.
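The two metrics named above can be sketched as follows. This is a minimal illustration, not a reference implementation: the labels and prices are invented for the example, and the function names `accuracy` and `rmse` are hypothetical.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predicted labels that exactly match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted quantities."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented classification example: 4 of 5 labels are predicted correctly.
labels_true = ["spam", "not spam", "spam", "not spam", "not spam"]
labels_pred = ["spam", "not spam", "not spam", "not spam", "not spam"]
label_accuracy = accuracy(labels_true, labels_pred)
print(label_accuracy)  # 0.8

# Invented regression example: prices in dollars.
prices_true = [150000.0, 120000.0, 180000.0]
prices_pred = [153000.0, 116000.0, 180000.0]
price_rmse = rmse(prices_true, prices_pred)
print(round(price_rmse, 2))  # 2886.75
```

Accuracy only asks whether each label matches, so it has no notion of "how far off" a price is, while root mean squared error needs numeric differences, which class labels such as “spam” do not have. The error is also in the same units as the target, here dollars.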
Conclusion
In summary:
- Predictive modeling is the problem of learning a mapping function from inputs to outputs, known as function approximation.
- Classification is the problem of predicting a discrete class label output for an example.
- Regression is the problem of predicting a continuous quantity output for an example.