Invited Talk 2017

Data Quality and Data Mining with Crowdsourcing

Prof. Shengli Victor Sheng

University of Central Arkansas, USA


Crowdsourcing systems provide convenient platforms for collecting human intelligence for a variety of tasks (e.g., labeling objects) from a vast pool of independent workers (a crowd). Compared with traditional expert labeling methods, crowdsourcing is considerably more efficient and cost-effective, but the quality of any single labeler cannot be guaranteed. To take advantage of the low cost of crowdsourcing, it is common to obtain multiple labels per object (i.e., repeated labeling) from the crowd. In this talk, we outline our research on crowdsourcing from three aspects: (1) crowdsourcing mechanisms, specifically repeated-labeling strategies; (2) ground truth inference, specifically noise correction after inference and the biased wisdom of the crowd; and (3) learning from crowdsourced data.
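To make the repeated-labeling idea concrete, here is a minimal illustrative sketch (not from the talk itself): multiple noisy labels for one object are integrated by majority vote, the simplest consensus rule. The function name `majority_vote` is a hypothetical helper for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Integrate repeated noisy labels for one object by majority vote.

    `labels` is the list of labels that different crowd workers assigned
    to the same object; the most frequent label wins (ties are broken by
    first occurrence, per Counter.most_common).
    """
    return Counter(labels).most_common(1)[0][0]

# Three crowd workers label the same image; two say "cat", one says "dog",
# so the integrated label is "cat".
integrated = majority_vote(["cat", "dog", "cat"])
```

Majority vote ignores worker reliability, which is exactly the gap that the EM-based inference discussed below is designed to fill.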

We first present repeated-labeling strategies of increasing complexity for obtaining multiple labels. Repeatedly labeling a carefully chosen set of points is generally preferable, and we recommend a robust technique that combines different notions of uncertainty to select the data points that should receive more labels. Recent research on crowdsourcing focuses on deriving an integrated label from multiple noisy labels via expectation-maximization-based (EM-based) ground truth inference. We present a novel framework that introduces noise-correction techniques to further improve the quality of the integrated labels obtained after ground truth inference. We further show that biased labeling is a systematic tendency and that state-of-the-art ground truth inference algorithms cannot handle it well; our simple consensus algorithm performs much better. Finally, we present pairwise solutions for maximizing the utility of multiple noisy labels for learning. Pairwise solutions completely avoid the potential bias introduced by ground truth inference. Because they take both sides into account (potentially correct as well as incorrect/noisy information), they perform well whether few or many labels are available.
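As a hedged illustration of EM-based ground truth inference, the sketch below implements a simplified "one-coin" model for binary labels: it alternates between estimating each worker's accuracy against the current soft labels (M-step) and re-estimating each object's label posterior from those accuracies (E-step). This is a minimal didactic version, not the algorithms presented in the talk; all names and the uniform-prior assumption are ours.

```python
import numpy as np

def em_ground_truth(label_matrix, n_iter=50):
    """One-coin EM sketch for binary ground truth inference.

    label_matrix[i, j] in {0, 1} is worker j's label for object i.
    Returns (posterior P(true label = 1) per object, per-worker accuracy).
    Assumes a uniform class prior and conditionally independent workers.
    """
    n_items, n_workers = label_matrix.shape
    p = label_matrix.mean(axis=1).astype(float)  # soft majority vote to start
    acc = np.empty(n_workers)
    for _ in range(n_iter):
        # M-step: a worker's accuracy is its expected agreement with the
        # current soft labels.
        for j in range(n_workers):
            acc[j] = np.where(label_matrix[:, j] == 1, p, 1 - p).mean()
        # E-step: posterior log-odds that each object's true label is 1.
        log_odds = np.zeros(n_items)
        for j in range(n_workers):
            a = np.clip(acc[j], 1e-6, 1 - 1e-6)
            agree = np.where(label_matrix[:, j] == 1, a, 1 - a)
            log_odds += np.log(agree) - np.log(1 - agree)
        p = 1.0 / (1.0 + np.exp(-log_odds))
    return p, acc

# Two reliable workers and one consistently wrong worker: EM recovers both
# the integrated labels and the per-worker accuracies, which plain majority
# voting cannot do.
labels = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 0],
                   [0, 0, 1], [0, 0, 1], [0, 0, 1]])
posteriors, accuracies = em_ground_truth(labels)
```

Note how a consistently wrong worker ends up with a low estimated accuracy, so the model effectively flips that worker's votes; this is the kind of per-worker modeling that motivates noise correction and bias handling after inference.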
