Testing and Evaluating an ML Model

To measure the performance of an ML model, it needs to be tested and evaluated. The dataset must be divided into a training set and a test set: the model is trained on the training data and tested on the test data.

It is not enough to train a model on the whole dataset you have and deploy it directly via software systems. The model must be validated on the training set and finally evaluated on the test set. The final model should generalize well to unseen data without overfitting the training set.
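
As a minimal sketch of such a split using scikit-learn (the synthetic dataset here is just a stand-in for real data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset: 1000 samples, 20 features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the data as an untouched test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```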

Validation

During the training stage, several models are trained with different sets of hyper-parameters, and each model is validated during training. The validation data is allocated from the training data.

Cross-validation is a widely used validation technique that automatically divides the training set into a number of training/validation pairs within the training set. Once the best model, the one that outperforms the others, is selected, it is finally tested on the test data.

[Figure: Cross-validation splitting of the training set (Credit: Scikit-learn)]
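
A minimal cross-validation sketch with scikit-learn (the logistic-regression model and synthetic data are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5-fold cross-validation: the data is split into 5 folds, and each fold
# serves once as the validation set while the rest is used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```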

Testing

The model with the best validation score is evaluated on the test data. The test data should be:
- representative of the dataset as a whole
- unseen by the model during training
- large enough to yield meaningful results.

Note: Never train on test data.

The model is then retrained on the whole training set with the best hyper-parameter values and evaluated on the test data. A model performs well when it generalizes to unseen data; it should neither overfit nor underfit the training data.
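
As an illustrative sketch of this whole workflow (search on the training set, refit on all the training data, score once on the test set), assuming scikit-learn's GridSearchCV and a placeholder SVC model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyper-parameter search with 5-fold cross-validation, on training data only.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# refit=True (the default) retrains the best model on the whole training set;
# the test set is used exactly once, for the final score.
print(search.best_params_, search.score(X_test, y_test))
```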

Performance Measures

An ML model is evaluated using various performance measures rather than by making assumptions based on its predictions. Different metrics suit different tasks: MAE and MSE are popular for regression models, whereas the confusion matrix and accuracy are widely used for evaluating classification models.

The most common evaluation metrics are:

Mean Absolute Error (MAE):

It calculates the difference between each predicted value and the corresponding actual value, takes the absolute value of that difference, and averages those absolute values over all N samples. It measures how far, on average, the predictions are from the actual target values.
Mathematically, it is represented as:

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|

where y_i is the actual value and \hat{y}_i is the predicted value for the i-th sample.
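
As a quick, hand-checkable NumPy sketch of the formula (with toy values):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # actual targets (toy values)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # model predictions

mae = np.mean(np.abs(y_true - y_pred))    # (0.5 + 0.5 + 0.0 + 1.0) / 4
print(mae)  # 0.5
```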

Mean Squared Error (MSE):

This is similar to MAE except that Mean Squared Error computes the square of the difference instead of its absolute value. Squaring makes a large error grow much faster than a small one, which pushes the model to focus on large errors rather than small ones.
Mathematically, it is represented as:

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
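
The same toy values make the squaring effect visible: the single error of 1.0 contributes as much as four errors of 0.5 would.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0.25 + 0.0 + 1.0) / 4
print(mse)  # 0.375
```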

Confusion Matrix:

It outputs a matrix that describes the overall performance of a classification model.
For example, the confusion matrix of a binary classifier tells a lot about the classifier:

|          | Predicted P | Predicted N |
|----------|-------------|-------------|
| Actual P | TP          | FN          |
| Actual N | FP          | TN          |

Where,
P = Positive class
N = Negative class
TP (True Positive) = the model predicted positive for an actual positive sample.
FN (False Negative) = the model predicted negative for an actual positive sample.
TN (True Negative) = the model predicted negative for an actual negative sample.
FP (False Positive) = the model predicted positive for an actual negative sample.


From the above confusion matrix, the following metrics can be calculated (see the sketch after this list):
Accuracy:
It gives the fraction of correct predictions over the total number of predictions made: Accuracy = (TP + TN) / (TP + TN + FP + FN). It works well only when each class has a similar number of samples.
Precision:
It answers the question: what proportion of positive predictions was actually correct? Precision = TP / (TP + FP).
Recall:
It answers the question: what proportion of actual positives was identified? Recall = TP / (TP + FN).
Note: Precision and Recall typically trade off against each other: adjusting a classifier's decision threshold to increase Precision usually decreases Recall, and vice versa.
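
The sketch below computes the confusion matrix and all three scores with scikit-learn on toy labels; note that scikit-learn prints the matrix as [[TN, FP], [FN, TP]], which differs from the layout shown above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))  # [[4 2]   -> [[TN FP]
                                         #  [1 3]]      [FN TP]]
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 7 / 10 = 0.7
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 5 = 0.6
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75
```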

Other Metrics:

- Logarithmic Loss
- F1 Score
- Area Under Curve (AUC)
