In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs.
For example, an email may be assigned to the "spam" or "non-spam" class, or a patient may be assigned a diagnosis based on observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
Types of questions
Suppose there are 3 classes: cat, dog and raccoon. The main goal is to classify all the images presented to each user. To do so, each user is randomly presented with 2 types of questions:
- Yes or No questions: for each object, each user will answer one, two or three of these questions:
Is this a cat? yes or no
Is this a dog? yes or no
Is this a raccoon? yes or no
- A or B or C questions: for some objects (at most 10% of the questions), each user will answer this question:
What is the class of this object? cat or dog or raccoon
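The question mix described above can be sketched in Python as follows. This is only an illustration, not the project's actual implementation: the function name `make_questions`, the dictionary layout, and the reading that A or B or C questions make up at most 10% of all questions shown to a user are assumptions.

```python
import random

CLASSES = ["cat", "dog", "raccoon"]
MAX_MULTI_PERCENT = 10  # assumed reading of "at most 10% of the questions"


def make_questions(object_ids, rng=None):
    """Build a shuffled question list for one user (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    objects = list(object_ids)
    questions = []

    # Yes or No questions: one, two or three per object, each about a distinct class.
    for obj in objects:
        for cls in rng.sample(CLASSES, rng.randint(1, 3)):
            questions.append({
                "object": obj,
                "type": "yes_no",
                "text": f"Is this a {cls}?",
                "options": ["yes", "no"],
            })

    # A or B or C questions: pick m objects so that m / (yes_no + m) <= 10%.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    yes_no_count = len(questions)
    m = min(len(objects),
            yes_no_count * MAX_MULTI_PERCENT // (100 - MAX_MULTI_PERCENT))
    for obj in rng.sample(objects, m):
        questions.append({
            "object": obj,
            "type": "multi",
            "text": "What is the class of this object?",
            "options": list(CLASSES),
        })

    # Present both question types in random order.
    rng.shuffle(questions)
    return questions
```

Calling `make_questions(range(50), random.Random(0))` returns a reproducible list in which every object has between one and three Yes or No questions and the A or B or C questions never exceed the 10% cap.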
Crowdsourcing is the process of obtaining needed content by soliciting contributions from a large group of people, especially an online community. In a classification scenario, the requested content may be the class of an observation.
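As a rough illustration of how class labels collected from a crowd could be combined into one label per observation, the sketch below uses simple majority voting. Majority voting is chosen here only as a common baseline; the text does not state which aggregation method the project uses, and the function names and the `(object_id, labeler_id, label)` tuple layout are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None for a tie or an empty list."""
    top = Counter(labels).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie between the two most frequent labels
    return top[0][0]


def aggregate(responses):
    """Group (object_id, labeler_id, label) tuples by object and vote."""
    by_object = {}
    for object_id, _labeler_id, label in responses:
        by_object.setdefault(object_id, []).append(label)
    return {obj: majority_vote(labels) for obj, labels in by_object.items()}


# Made-up responses, for illustration only.
responses = [
    ("img1", "u1", "cat"),
    ("img1", "u2", "cat"),
    ("img1", "u3", "dog"),
    ("img2", "u1", "dog"),
    ("img2", "u2", "raccoon"),
]
print(aggregate(responses))
```

Objects whose top labels are tied map to `None`, which leaves open how such cases would be resolved (for example, by asking more labelers).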
The Catalina Surveys Data
In this project we present a subset of the Catalina Surveys Data to a crowd of labelers. Every labeler must answer about 1000 questions in order to complete their participation in the project: Catalina DB.
The selected subset contains only 4 classes of stars:
- Eclipsing Binary
- Long-Period Variable
- RR Lyrae
Therefore, all the questions assume that the only possible classes are these four.