Classification - SGD Classifier / Precision & Recall

Hello!

I am working through the session on 'Classification in Machine Learning', which is explained using the MNIST data set of 70,000 images.

In the explanation, when the decision scores are calculated using cross_val_predict, the length of the output equals the number of instances (i.e. 60,000 in the training set). However, when we calculate the precision and recall for each threshold, the lengths of the outputs drop below 60,000. The code is reproduced below.

I am really curious to understand this behaviour and would appreciate the help.

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
print(y_scores.size)
>> 60000

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
print('Precision:', precisions.size, '\nRecall:', recalls.size, '\nThreshold:', thresholds.size)
>> Precision: 59903
>> Recall: 59903
>> Threshold: 59902

@sgiri: Please help with this query.

Very interesting question.

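Actually, precision_recall_curve does not return one threshold per instance. It keeps only the distinct score values, and it stops at the threshold where recall first reaches 1, so every score below that of the lowest-scoring positive instance is dropped. Therefore, not all scores become thresholds. In your output, 60,000 - 59,902 = 98 scores were left out this way. The precisions and recalls arrays are one element longer than thresholds because a final point (precision 1, recall 0) is appended.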

Take a look at the source code here:

fps, tps, thresholds = _binary_clf_curve(y_true, probas_pred,
                                         pos_label=pos_label,
                                         sample_weight=sample_weight)

precision = tps / (tps + fps)
precision[np.isnan(precision)] = 0
recall = tps / tps[-1]

# stop when full recall attained
# and reverse the outputs so recall is decreasing
last_ind = tps.searchsorted(tps[-1])
sl = slice(last_ind, None, -1)
return np.r_[precision[sl], 1], np.r_[recall[sl], 0], thresholds[sl]
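
To see the truncation on a toy example, here is a minimal pure-Python sketch of the same logic. The helper name kept_thresholds and the sample data are made up for illustration and are not part of scikit-learn; it only mimics the excerpt above, so results may differ from what other scikit-learn versions return.

```python
def kept_thresholds(y_true, scores):
    """Return the thresholds that survive the 'stop at full recall' rule.

    Hypothetical helper for illustration only; it mirrors the quoted
    excerpt, not any particular scikit-learn release.
    """
    # Sort instances by score, highest first.
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)

    # Keep one candidate threshold per distinct score, together with the
    # cumulative number of true positives at or above that score.
    thresholds, tps = [], []
    tp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        is_last_of_value = i == len(pairs) - 1 or pairs[i + 1][0] != score
        if is_last_of_value:
            thresholds.append(score)
            tps.append(tp)

    # First position where all positives have been found (recall == 1).
    # tps is non-decreasing, so this matches tps.searchsorted(tps[-1]).
    last_ind = tps.index(tps[-1])

    # Same reversed slice as in the excerpt: thresholds in increasing order.
    return thresholds[last_ind::-1]


y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]

kept = kept_thresholds(y_true, scores)
print(kept)                    # [0.35, 0.4, 0.8, 0.9]
print(len(scores), len(kept))  # 6 4
```

The two lowest scores (0.1 and 0.2) lie below the lowest-scoring positive (0.35), where recall is already 1, so they never become thresholds. Precision and recall would each have len(kept) + 1 = 5 entries here, which is the same pattern as 59,903 and 59,902 in the question.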

Many thanks for the guidance.

It was indeed helpful.

Regards,
Dhyey Kotecha