Add NetEaseCrowd dataset #101

Merged
dustalov merged 3 commits into Toloka:main from shenxiangzhuang:add/netease_classification_dataset on Mar 12, 2024

@shenxiangzhuang
Contributor

Checklist

  • I have read the CONTRIBUTING document
  • I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

Dataset info

Adding our open-source dataset, NetEaseCrowd (https://github.com/fuxiAIlab/NetEaseCrowd-Dataset).

NetEaseCrowd is a large-scale dataset for long-term and online crowdsourcing truth inference, which contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over 6 months. We believe that this dataset could be an invaluable asset to the Toloka/crowd-kit community by providing a new benchmark for crowdsourcing-related research and development.
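
As a rough illustration of how figures like these are derived, the sketch below computes the same kind of summary (distinct workers, distinct tasks, total annotations, and mean annotations per task) from a list of annotation records. The records are toy data, and `Annotation` and `summarize` are hypothetical names used only for this example; they are not part of crowd-kit or of the dataset's own tooling.

```python
from typing import Iterable, NamedTuple


class Annotation(NamedTuple):
    """One crowd annotation: a worker assigns a label to a task."""
    task: str
    worker: str
    label: int


def summarize(annotations: Iterable[Annotation]) -> dict:
    """Count distinct workers and tasks, and the mean annotations per task."""
    rows = list(annotations)
    tasks = {a.task for a in rows}
    workers = {a.worker for a in rows}
    return {
        "workers": len(workers),
        "tasks": len(tasks),
        "annotations": len(rows),
        # Guard against an empty input to avoid division by zero.
        "annotations_per_task": len(rows) / len(tasks) if tasks else 0.0,
    }


# Toy records for illustration only (not taken from NetEaseCrowd).
toy = [
    Annotation("t1", "w1", 1),
    Annotation("t1", "w2", 1),
    Annotation("t1", "w3", 0),
    Annotation("t2", "w1", 0),
    Annotation("t2", "w2", 0),
]

stats = summarize(toy)
print(stats)
```

With the rounded figures quoted above, 6,000,000 annotations over 1,000,000 tasks works out to about 6 annotations per task.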

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.96%. Comparing base (07c4240) to head (08440a2).
Report is 34 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #101      +/-   ##
==========================================
+ Coverage   92.80%   92.96%   +0.15%     
==========================================
  Files          47       47              
  Lines        2070     2216     +146     
==========================================
+ Hits         1921     2060     +139     
- Misses        149      156       +7     

Collaborator

@pilot7747 pilot7747 left a comment

Hi @shenxiangzhuang! Thank you for contributing this dataset. LGTM.

@shenxiangzhuang
Contributor Author

Besides the CI tests, I also tried using this dataset for categorical aggregation, and it works well:

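from crowdkit.aggregation import DawidSkene
from crowdkit.datasets import load_dataset

# Load the crowd annotations (df) and the ground-truth labels (gt).
df, gt = load_dataset('netease_crowd')

# Run Dawid-Skene for 10 EM iterations and infer one label per task.
ds = DawidSkene(n_iter=10)
result = ds.fit_predict(df)

print(len(result))
# 999799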
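
To check quality rather than only the number of aggregated tasks, the predictions can be compared against the ground truth returned as `gt`. The following is a minimal, self-contained sketch of that idea on toy data. It deliberately uses a plain majority vote in place of Dawid-Skene so that it runs without downloading the dataset, and `majority_vote` and `accuracy` are hypothetical helpers written for this example, not crowd-kit functions.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Pick the most frequent label per task from (task, worker, label) tuples."""
    votes = defaultdict(Counter)
    for task, _worker, label in annotations:
        votes[task][label] += 1
    # most_common(1) returns the label with the highest count for each task.
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


def accuracy(predicted, truth):
    """Share of tasks with known ground truth whose predicted label matches."""
    shared = [task for task in truth if task in predicted]
    if not shared:
        return 0.0
    correct = sum(predicted[task] == truth[task] for task in shared)
    return correct / len(shared)


# Toy annotations and ground truth for illustration only.
toy_annotations = [
    ("t1", "w1", 1), ("t1", "w2", 1), ("t1", "w3", 0),
    ("t2", "w1", 0), ("t2", "w2", 0), ("t2", "w3", 1),
    ("t3", "w1", 1), ("t3", "w2", 0), ("t3", "w3", 0),
]
toy_truth = {"t1": 1, "t2": 0, "t3": 1}

predicted = majority_vote(toy_annotations)
acc = accuracy(predicted, toy_truth)
print(predicted, acc)
```

Restricting the comparison to tasks present in the ground truth matters in practice, since a crowdsourcing dataset may provide true labels for only a subset of its tasks.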

Collaborator

@dustalov dustalov left a comment

Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Comment thread crowdkit/datasets/_loaders.py
Co-authored-by: Dmitry Ustalov <dmitry.ustalov@gmail.com>
@shenxiangzhuang
Contributor Author

> Thank you for a very well-done PR! I noticed a small imperfection in the dataset metadata. Could you please check my suggestion?

Thanks a lot for your careful review!

@dustalov dustalov merged commit ec05dcc into Toloka:main Mar 12, 2024
@dustalov
Collaborator

Great job, thank you again!
