When you run a binary classifier over a population, the proportion of predicted positives gives you an estimate of the proportion of actual positives in that population. That proportion is known as the prevalence.
But that estimate is biased, because no classifier is perfect. For example, if your classifier tells you that you have 20% of positive cases, but its precision is known to be only 50%, you would expect the true prevalence to be 20% × 50% = 10%. This leads to the common formula for getting the true prevalence:

$$\text{prevalence} = \text{predicted prevalence} \times \text{precision}$$
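In code, this naive adjustment is just a multiplication; here is a trivial R sketch using the numbers from the example above:

```r
# naive adjustment: scale the predicted prevalence by the precision
predicted_prevalence <- 0.20
precision <- 0.50
predicted_prevalence * precision  # 0.10, i.e. 10%
```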
But suppose that you want to run the classifier more than once. For example, you might want to do this at regular intervals to detect trends in the prevalence. You can’t use this formula anymore, because precision depends on the prevalence. To use the formula above you would have to re-estimate the precision regularly (say, with human eval), but then you could just as well also re-estimate the prevalence itself.
How do we get out of this circular reasoning? It turns out that binary classifiers have other performance metrics (besides precision) that do not depend on the prevalence. These include not only the recall $r$ (also known as sensitivity), but also the specificity $s$. Together they are enough to recover the true prevalence:

$$p = \frac{\hat{p} - (1 - s)}{r - (1 - s)}$$
where:
- $p$ is the true prevalence
- $s$ is the specificity
- $r$ is the sensitivity, or recall
- $\hat{p}$ is the proportion of predicted positives
The proof is straightforward: the predicted positives are made up of true positives and false positives, so

$$\hat{p} = P(\hat{y} = 1) = P(\hat{y} = 1 \mid y = 1)\,P(y = 1) + P(\hat{y} = 1 \mid y = 0)\,P(y = 0) = r\,p + (1 - s)(1 - p).$$

Solving for $p$ gives the formula above.
Notice that this formula breaks down when the denominator $r - (1 - s)$ is zero, that is, when the recall equals the false positive rate $1 - s$.

An ROC curve plots the recall $r$ against the false positive rate $1 - s$, so the formula breaks down exactly on the diagonal of the ROC plot, where $r = 1 - s$ and the classifier is no better than random guessing.
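Before the simulation, here is how the formula might look as a small helper function. This is just a sketch: the function name, argument names and the guard against a near-zero denominator are my own additions, not part of the code below.

```r
# adjusted prevalence from the predicted prevalence p_hat,
# the recall r and the specificity s
adjust_prevalence <- function(p_hat, r, s) {
  denominator <- r - (1 - s)
  # the formula is undefined when recall equals the false positive rate,
  # i.e. when the classifier sits on the ROC diagonal
  if (abs(denominator) < 1e-8) {
    stop("recall is (almost) equal to the false positive rate; cannot adjust")
  }
  (p_hat - (1 - s)) / denominator
}

# example with made-up values: 20% predicted positives,
# recall 0.6 and specificity 0.9
adjust_prevalence(0.20, 0.6, 0.9)  # 0.2
```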
Enough theory, let’s see if this works in practice:
```r
# randomly draw some covariate
x <- runif(10000, -1, 1)

# map it to a probability with the inverse logit and draw the outcome
p_y <- plogis(x)
y <- runif(10000) < p_y

# fit a logistic regression model
m <- glm(y ~ x, family = binomial)

# make some predictions, using an absurdly low threshold
y_hat <- predict(m, type = "response") < 0.3

# get the recall (aka sensitivity) and specificity
cm <- caret::confusionMatrix(factor(y_hat), factor(y), positive = "TRUE")
recall <- unname(cm$byClass['Sensitivity'])
specificity <- unname(cm$byClass['Specificity'])

# get the adjusted prevalence
(mean(y_hat) - (1 - specificity)) / (recall - (1 - specificity))

# compare with the actual prevalence
mean(y)
```
In this simulation I get recall = 0.049 and specificity = 0.875. The predicted prevalence is a ridiculously biased 0.087, but the adjusted prevalence is essentially equal to the true prevalence (0.498).
To sum up: this shows how, using a classifier's recall and specificity, you can adjust the predicted prevalence in order to track it over time, assuming that recall and specificity are stable over time. You cannot do this using precision and recall, because precision depends on the prevalence, whereas recall and specificity don't.
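To make this concrete, here is a sketch that continues the simulation above: it builds a "second period" by resampling the original observations with different class proportions, so that the true prevalence drops to 25% while $P(x \mid y)$, and therefore recall and specificity, stay the same. The resampling scheme and the 25% figure are just illustrative assumptions on my part.

```r
# period 2: simulate a change in prevalence by resampling the original
# observations with different class proportions; P(x | y) is unchanged,
# so recall and specificity stay the same while the prevalence moves
pos <- which(y)
neg <- which(!y)
idx2 <- c(sample(pos, 2500, replace = TRUE), sample(neg, 7500, replace = TRUE))

y2 <- y[idx2]
y_hat2 <- y_hat[idx2]

# adjust the new predicted prevalence with the recall and specificity
# estimated on the first sample
(mean(y_hat2) - (1 - specificity)) / (recall - (1 - specificity))

# compare with the new true prevalence (0.25 by construction)
mean(y2)
```

The adjusted estimate should land close to the new true prevalence of 0.25, even though recall and specificity were estimated on the first sample only; the second period needs no new labels.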
Thank you. Excellent post.
Can you elaborate on why you "# make some predictions, using an absurdly low threshold"? Is the starting threshold irrelevant when using the adjusted prevalence?
Yes: a different threshold will give you a different recall and specificity, but the adjusted prevalence will remain the same.
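For example, continuing the simulation from the post (the adjusted_at wrapper below is just a sketch I'm adding here for illustration):

```r
# adjusted prevalence at a given threshold, reusing m and y from the
# simulation in the post
adjusted_at <- function(threshold) {
  y_hat <- predict(m, type = "response") < threshold
  cm <- caret::confusionMatrix(factor(y_hat), factor(y), positive = "TRUE")
  r <- unname(cm$byClass['Sensitivity'])
  s <- unname(cm$byClass['Specificity'])
  (mean(y_hat) - (1 - s)) / (r - (1 - s))
}

# very different thresholds, (essentially) the same adjusted prevalence
adjusted_at(0.3)
adjusted_at(0.6)
```

Both calls return essentially the same value, namely the true prevalence mean(y), even though recall and specificity differ at the two thresholds.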