Which one should I trust more: a survey based on a random sample that covers 1% of the population but has a 60% response rate, or a dataset taken from social media that covers 80% of the population? This is the motivating question of Xiao-Li Meng's paper, and he is able to give a general answer to it (spoiler: the social media dataset is more trustworthy). This amazing paper is worth a discussion. A key formula expresses the error of an estimate of the population mean as the product of three and only three factors: the “data quality” (the correlation between being included in the data and the variable of interest), the “data quantity” (a function of the fraction of the population that is covered) and the “problem difficulty” (the standard deviation of the variable of interest in the population). A further result is that the design effect for a large non-probability sample is N-1 times the expected square of the “data quality” term, where N is the population size. That is, for a given level of data quality, the design effect grows with N.
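To make the three-factor formula concrete, here is a minimal sketch in Python. It assumes the usual statement of the identity: error = rho * sqrt((1 - f) / f) * sigma, where rho is the population correlation between the inclusion indicator and the variable, f = n / N is the fraction of the population covered, and sigma is the population standard deviation (dividing by N). The function name `meng_decomposition` and the toy data are illustrative and do not come from the paper.

```python
import math


def meng_decomposition(values, included):
    """Split the error of a sample mean into three factors.

    values   : the variable of interest for every unit in the population
    included : 0/1 flags, 1 if the unit ended up in the dataset
    Returns (error, data_quality, data_quantity, problem_difficulty).
    """
    N = len(values)
    n = sum(included)
    f = n / N  # fraction of the population covered

    pop_mean = sum(values) / N
    sample_mean = sum(v for v, r in zip(values, included) if r) / n

    # Population (divide-by-N) standard deviations of the variable and the flag.
    sigma_g = math.sqrt(sum((v - pop_mean) ** 2 for v in values) / N)
    sigma_r = math.sqrt(f * (1 - f))

    # Population covariance between being included and the variable.
    cov = sum((r - f) * (v - pop_mean) for v, r in zip(values, included)) / N

    data_quality = cov / (sigma_r * sigma_g)   # correlation rho
    data_quantity = math.sqrt((1 - f) / f)     # shrinks as coverage grows
    problem_difficulty = sigma_g               # spread of the variable

    return sample_mean - pop_mean, data_quality, data_quantity, problem_difficulty


# Toy population of four units in which only the two largest values are observed.
error, rho, quantity, sigma = meng_decomposition([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print("actual error:        ", error)
print("product of factors:  ", rho * quantity * sigma)
print("(N - 1) * rho squared:", (4 - 1) * rho ** 2)
```

The first two printed numbers agree because the decomposition is an algebraic identity, not an approximation. The last line is the single-sample analogue of the design effect; the result quoted above takes the expectation of rho squared over the mechanism that decides who is included.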

Reference: Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685-726.

