
Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng

Andrew Ng, founder of Landing AI and co-founder and former head of Google Brain

@NeurIPSConf – @TolokaAI, a platform which generates #MachineLearning data at scale, has been selected to present at Data-Centric #AI, a NeurIPS 2021 workshop spearheaded by renowned data scientist Andrew Ng. Ng is the founder of Landing AI and co-founder and former head of Google Brain.

Apart from Andrew Ng, the workshop's organizing committee includes renowned AI experts Lora Aroyo (Google Research), Cody Coleman (Stanford University), Greg Diamos (Landing AI), Vijay Janapa Reddi (Harvard University), Joaquin Vanschoren (Eindhoven University of Technology), Carole-Jean Wu (Facebook), and Sharon Zhou (Stanford University).

Toloka’s paper presents a dataset for evaluating methods that use subjective human responses to improve human-centric artificial intelligence (AI) systems. The dataset is intended to support the further development of human-computer systems such as e-commerce, recommendation, ranking, and search applications.

Data-Centric AI was launched by Ng as part of his broader data-centric AI movement. The workshop aims to address the challenges of accelerating dataset creation while making datasets more efficient to use and reuse, by democratizing data engineering and evaluation. Of more than 150 papers submitted, “IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons” is one of several granted a presentation slot.

Olga Megorskaya, CEO and co-founder of Toloka, comments: “The Data-Centric AI mission is closely aligned with Toloka’s mission to advance global AI by empowering individuals and making data ownership universally available. AI is developing at a fast pace, but it could be much faster if it wasn’t impeded by a lack of original, high-quality data that could be fed to AI models. We are honoured to join this important conversation by presenting our ground-breaking dataset as an example of quality, scalability, and speed in data collection and hope to further the advancement of data engineering by sharing these findings with the Data-Centric AI community.”

Pairwise comparison tasks ask humans to choose between two options, which makes it possible to account for people’s preferences in challenging AI tasks such as information retrieval and recommender system evaluation. IMDB-WIKI-SbS is a new large-scale dataset that contains 9,150 images appearing in 250,249 pairs annotated on a crowdsourcing platform. Before Toloka released IMDB-WIKI-SbS, the only available datasets for pairwise comparisons were small and/or proprietary, slowing progress in gathering better feedback from human users. The new dataset uses the well-known IMDB-WIKI dataset as ground truth and has a balanced distribution of age and gender. The research to be presented at Data-Centric AI describes how IMDB-WIKI-SbS was built and compares several baseline methods, indicating the dataset’s suitability for AI model evaluation.
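To illustrate how crowdsourced pairwise comparisons can be turned into a ranking, the sketch below fits a Bradley-Terry model, one widely used aggregation approach. It is an illustration only: the article does not say which baseline methods the paper compares, and the function name, the image identifiers, and the sample answers are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=200, tol=1e-9):
    """Estimate item scores from (winner, loser) pairs.

    Uses the classic iterative (minorization-maximization) update for the
    Bradley-Terry model. Scores are normalized to sum to 1.
    """
    wins = defaultdict(int)         # number of comparisons each item won
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    items = sorted(items)
    scores = {item: 1.0 / len(items) for item in items}

    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (scores[i] + scores[j])
            # Floor the score so items with zero wins never cause division by zero.
            updated[i] = max(wins[i] / denom, 1e-12) if denom else scores[i]

        total = sum(updated.values())
        updated = {i: s / total for i, s in updated.items()}

        delta = max(abs(updated[i] - scores[i]) for i in items)
        scores = updated
        if delta < tol:
            break
    return scores


# Hypothetical crowd answers: each tuple is (chosen image, rejected image).
answers = (
    [("img_a", "img_b")] * 3 + [("img_b", "img_a")]
    + [("img_b", "img_c")] * 3 + [("img_c", "img_b")]
    + [("img_a", "img_c")] * 2
)
scores = bradley_terry(answers)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A dataset with known ground truth, as described above, lets a ranking recovered this way be checked against the true ordering.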

The workshop will be held virtually on 14 December 2021, and Toloka’s paper and presentation will be published on the Data-Centric AI website.
