We study the statistical properties of weighted estimators in unbiased pool-based active learning where instances are sampled at random with unequal probabilities. For classification problems, the use of probabilistic uncertainty sampling has previously been suggested for such algorithms, motivated by the heuristic argument that this would target the most informative instances, and further by the assertion that this also would minimise the variance of the estimated (logarithmic) loss. We show that probabilistic uncertainty sampling does, in fact, not reach any of these targets.

   Considering a general family of parametric prediction models, we derive asymptotic expansions for the mean squared prediction error and for the variance of the total loss, and consequently present sampling schemes that minimise these quantities. We show that the resulting sampling schemes depend both on label uncertainty and on the influence on model fitting through the location of data points in the feature space, and have a close connection to statistical leverage.

   The proposed active learning algorithm is evaluated on a number of datasets, and we demonstrate better predictive performance than competing methods on all benchmark datasets. In contrast, deterministic uncertainty sampling always performed worse than simple random sampling, as did probabilistic uncertainty sampling in one of the examples.

Time: 22 January, 1 - 2 pm Place: B705

All welcome.