At CineStream, we ingest a huge volume of long-form user reviews for newly released films and series, far more than teams can read manually. We need a sentiment model that classifies reviews as positive or negative to power real-time dashboards, flag early signs of a release going off track, and surface strong positive reactions for marketing. We use the IMDB review dataset as a close public proxy to build and validate the training pipeline before moving to our internal reviews.
Train a transformer model for binary sentiment classification on a small dataset and report performance. We recommend a BERT-style model, but you can use any model you like.
Use the following IMDB dataset: https://huggingface.co/datasets/stanfordnlp/imdb/viewer/plain_text/train
```python
from datasets import load_dataset

# 1,000-review subset with a 20% held-out test split
ds = load_dataset("imdb", split="train[:1000]").train_test_split(test_size=0.2)
```

Be sure to examine its suitability for the task on Hugging Face first; you can downsample further if needed to keep training fast.
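For example, a quick local check of the label balance is worthwhile, since the head of the train split is not guaranteed to be shuffled and a naive slice can be skewed toward one class:

```python
from collections import Counter

# If one class dominates, shuffle before slicing, e.g.:
# ds = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1000)).train_test_split(test_size=0.2)
print(Counter(ds["train"]["label"]))
```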
- Use an auto model from Hugging Face (distilbert-base-uncased is a good default).
- Configure it for binary classification; consider using AutoTokenizer and AutoModelForSequenceClassification.
- Tokenize text with truncation + padding.
- Keep max length reasonable for speed (e.g., 128–256).
- Ensure labels are correctly attached to the tokenized examples (see the sketch after this list).
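A minimal setup sketch, assuming the distilbert-base-uncased default and the `ds` split from above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 gives the model a two-way classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # Truncate/pad to a modest max length to keep CPU training fast
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# map keeps the existing "label" column, so labels stay attached per example
tokenized = ds.map(tokenize, batched=True)
```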
We recommend using the Hugging Face Trainer for this exercise, but a custom training loop is also fine. Choose sensible hyperparameters: training should run on the CPU and take only a few minutes, so we have enough time to iterate.
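One possible Trainer configuration, as a sketch; the hyperparameters here are illustrative assumptions, not tuned values:

```python
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="sentiment-model",
    num_train_epochs=2,               # assumption: enough for ~800 training examples
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()          # runs on CPU when no GPU is available
print(trainer.evaluate())  # accuracy on the held-out 20%
```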
Provide predictions for 3 reviews of your choice, e.g.:
- “I loved this film.”
- “This was a waste of time.”
- “Pretty good overall, but slow in parts.”
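A prediction sketch reusing the tokenizer and fine-tuned model from above (IMDB uses 0 = negative, 1 = positive):

```python
import torch

reviews = [
    "I loved this film.",
    "This was a waste of time.",
    "Pretty good overall, but slow in parts.",
]
inputs = tokenizer(reviews, truncation=True, padding=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    preds = model(**inputs).logits.argmax(dim=-1).tolist()
for review, pred in zip(reviews, preds):
    print(f"{'positive' if pred == 1 else 'negative'} <- {review}")
```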
Determine whether each classification is correct and iterate until you are happy with the model's performance.