In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs.

An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient based on the patient's observed characteristics (gender, blood pressure, presence or absence of certain symptoms, etc.).

Types of questions

Let's say there are 3 classes: cat, dog and raccoon. The main goal is to classify all the images presented to each user. To do so, each user is presented at random with 2 types of questions:

  • Yes or No questions: for each object, each user will answer one, two or three of these questions:

Is this a cat? yes or no
Is this a dog? yes or no
Is this a raccoon? yes or no

  • A or B or C questions: for some objects (at most 10% of the questions), each user will answer this question:

What is the class of this object? cat or dog or raccoon
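
The two question types above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the names `build_questions`, `yes_no_questions`, `multi_choice_question` and `MAX_MULTI_FRACTION` are hypothetical, and the sketch assumes the 10% cap is enforced by giving the A or B or C question to at most 10% of the objects shown to a user. Since every other object receives at least one Yes or No question, the share of A or B or C questions then also stays at or below 10% of all questions.

```python
import random

CLASSES = ["cat", "dog", "raccoon"]
MAX_MULTI_FRACTION = 0.10  # assumed reading of "at most 10% of the questions"


def yes_no_questions(classes, rng):
    """Ask one, two or three Yes or No questions about randomly chosen classes."""
    k = rng.randint(1, min(3, len(classes)))
    return [f"Is this a {c}? yes or no" for c in rng.sample(classes, k)]


def multi_choice_question(classes):
    """Build the single A or B or C question listing every class."""
    return "What is the class of this object? " + " or ".join(classes)


def build_questions(object_ids, classes=CLASSES, seed=0):
    """Return (object_id, question) pairs for one user.

    object_ids must be a list. A random subset of at most 10% of the
    objects gets the A or B or C question; all other objects get one
    to three Yes or No questions.
    """
    rng = random.Random(seed)
    n_multi = int(len(object_ids) * MAX_MULTI_FRACTION)
    multi_ids = set(rng.sample(object_ids, n_multi))

    questions = []
    for obj in object_ids:
        if obj in multi_ids:
            questions.append((obj, multi_choice_question(classes)))
        else:
            for q in yes_no_questions(classes, rng):
                questions.append((obj, q))
    return questions


if __name__ == "__main__":
    for obj, question in build_questions(list(range(20)))[:5]:
        print(obj, question)
```

Passing a fixed `seed` makes the question sequence reproducible for a given user; a real deployment would presumably vary it per labeler.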


Crowdsourcing is the process of obtaining needed content by soliciting contributions from a large group of people, especially an online community. In a classification scenario, the requested content may be the class of an observation.

The Catalina Surveys Data

Go to the surveys data

Go to the Catalogs paper

In this project we present a subset of the Catalina Surveys Data to a crowd of labelers. Every labeler must answer about 1000 questions to complete their participation in the project: Catalina DB.

The selected subset contains only 4 classes of stars:

  • Cepheid-II
  • Eclipsing Binary
  • Long-Period Variable
  • RR Lyrae

Therefore, all the questions assume that these four are the only possible classes.

Learn about the stars