Add NetEaseCrowd dataset by shenxiangzhuang · Pull Request #101 · Toloka/crowd-kit

shenxiangzhuang · 2024-03-12T08:00:34Z

Checklist

I have read the CONTRIBUTING document
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Dataset info

Adding our open-source dataset, NetEaseCrowd (https://github.com/fuxiAIlab/NetEaseCrowd-Dataset).

NetEaseCrowd is a large-scale dataset for long-term and online crowdsourcing truth inference, which contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over 6 months. We believe that this dataset could be an invaluable asset to the Toloka/crowd-kit community by providing a new benchmark for crowdsourcing-related research and development.
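For readers unfamiliar with the term, truth inference means deriving one label per task from several noisy worker annotations. The sketch below illustrates the simplest baseline, majority voting, on toy (task, worker, label) triples. The data and the `majority_vote` helper are hypothetical and are not taken from NetEaseCrowd or from crowd-kit.

```python
from collections import Counter, defaultdict

# Toy annotations as (task, worker, label) triples; illustrative data only,
# not taken from NetEaseCrowd.
annotations = [
    ("t1", "w1", "cat"),
    ("t1", "w2", "cat"),
    ("t1", "w3", "dog"),
    ("t2", "w1", "dog"),
    ("t2", "w3", "dog"),
]


def majority_vote(rows):
    """Return the most frequent label per task (ties go to the label seen first)."""
    votes = defaultdict(Counter)
    for task, _worker, label in rows:
        votes[task][label] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


inferred = majority_vote(annotations)
print(inferred)  # {'t1': 'cat', 't2': 'dog'}
```

Majority voting treats every worker as equally reliable; methods such as Dawid-Skene additionally estimate per-worker error rates, which is where a dataset with long-running worker histories becomes useful.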

codecov-commenter · 2024-03-12T08:08:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.96%. Comparing base (07c4240) to head (08440a2).
Report is 34 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
+ Coverage   92.80%   92.96%   +0.15%     
==========================================
  Files          47       47              
  Lines        2070     2216     +146     
==========================================
+ Hits         1921     2060     +139     
- Misses        149      156       +7
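The percentages in the table follow from the hit and line counts. A minimal check, assuming coverage is simply hits divided by coverable lines (the `coverage_percent` helper is hypothetical, not part of Codecov):

```python
def coverage_percent(hits, lines):
    """Line coverage as a percentage of coverable lines."""
    return 100.0 * hits / lines


base = coverage_percent(1921, 2070)  # main
head = coverage_percent(2060, 2216)  # PR #101

print(f"{base:.2f}% -> {head:.2f}%")  # 92.80% -> 92.96%
print("misses:", 2070 - 1921, "->", 2216 - 2060)  # misses: 149 -> 156
```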

pilot7747

Hi @shenxiangzhuang! Thank you for contributing this dataset. LGTM.

shenxiangzhuang · 2024-03-12T08:11:46Z

Besides the CI tests, I also tried this dataset with categorical aggregation, and it works well:

from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

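# df holds the crowd annotations; gt holds the ground-truth labels.
df, gt = load_dataset('netease_crowd')

# Dawid-Skene with 10 EM iterations (the first positional argument is n_iter).
ds = DawidSkene(n_iter=10)
result = ds.fit_predict(df)

# Number of tasks that received an aggregated label.
print(len(result))
# 999799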
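The snippet above only confirms how many tasks received a label. A natural next step is to score the aggregated labels against the ground truth. The sketch below is one way to do that, assuming both objects behave like mappings from task to label; the `accuracy` helper and the toy values are hypothetical stand-ins for `result` and `gt`.

```python
def accuracy(predicted, truth):
    """Fraction of tasks with a known true label whose prediction matches it."""
    scored = [(task, label) for task, label in predicted.items() if task in truth]
    if not scored:
        return 0.0
    correct = sum(1 for task, label in scored if truth[task] == label)
    return correct / len(scored)


# Toy stand-ins for `result` and `gt`; hypothetical values for illustration.
toy_result = {"t1": "cat", "t2": "dog", "t3": "cat"}
toy_gt = {"t1": "cat", "t2": "cat"}

print(accuracy(toy_result, toy_gt))  # 0.5
```

Tasks without a ground-truth entry (here `t3`) are skipped rather than counted as wrong. The helper should also accept pandas Series indexed by task, since those expose `.items()` and index membership, though that has not been tested against this dataset.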

dustalov

Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Co-authored-by: Dmitry Ustalov <dmitry.ustalov@gmail.com>

shenxiangzhuang · 2024-03-12T11:37:04Z

> Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Thanks a lot for your careful review!

dustalov · 2024-03-12T11:48:44Z

Great job, thank you again!

shenxiangzhuang added 2 commits March 12, 2024 15:55

add: netease_crowd dataset

af3501c

change: revert the debug setting

08440a2

shenxiangzhuang requested review from DrhF, Pocoder, alexdrydew, aliskin, denaxen, dustalov, pilot7747, varfolomeii and vlad-mois as code owners March 12, 2024 08:00

shenxiangzhuang mentioned this pull request Mar 12, 2024

Add the dataset to crowd-kit fuxiAIlab/NetEaseCrowd-Dataset#5

Closed

pilot7747 approved these changes Mar 12, 2024

dustalov approved these changes Mar 12, 2024

Comment thread crowdkit/datasets/_loaders.py

Fix description spaces

d4674eb

Co-authored-by: Dmitry Ustalov <dmitry.ustalov@gmail.com>

dustalov merged commit ec05dcc into Toloka:main Mar 12, 2024

dustalov merged 3 commits into Toloka:main from shenxiangzhuang:add/netease_classification_dataset

4 participants
