[BUG]: Bug problem for sft_dataset.py

### 🐛 Describe the bug

https://github.com/hpcaitech/ColossalAI/blob/09fe9dc704dd388f08e3b19dca65d3d0be64f106/applications/Chat/coati/dataset/sft_dataset.py#L97-L113

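`label[:source_len] = IGNORE_INDEX` in the `preprocess()` function may mask the wrong positions when the tokenizer pads on the left. With left padding, a padded sequence starts with pad tokens, so the first `source_len` positions no longer cover exactly the prompt and some prompt tokens can remain unmasked in `labels`.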

![img_v2_133b5dca-f6a9-46e8-a7fe-1180163b001g](https://github.com/hpcaitech/ColossalAI/assets/31888981/77f2b3e0-79bc-4d4a-9db2-9577d9453eb9)

### Environment

_No response_

	def preprocess(
	    sources: Sequence[str],
	    targets: Sequence[str],
	    tokenizer: transformers.PreTrainedTokenizer,
	    max_length: int,
	) -> Dict:
	    """Preprocess the data by tokenizing."""
	    examples = [s + t for s, t in zip(sources, targets)]
	    examples_tokenized, sources_tokenized = [
	        _tokenize_fn(strings, tokenizer, max_length)
	        for strings in (examples, sources)
	    ]
	    input_ids = examples_tokenized["input_ids"]
	    labels = copy.deepcopy(input_ids)
	    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
	        label[:source_len] = IGNORE_INDEX
	    return dict(input_ids=input_ids, labels=labels)
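
A minimal sketch of the problem and one possible fix, using plain Python lists instead of tensors so it runs on its own. The helper `mask_source_tokens`, its parameters, and the toy token ids are hypothetical and not part of the repository. `IGNORE_INDEX = -100` is assumed here (it matches the default `ignore_index` of `torch.nn.CrossEntropyLoss`). The sketch also assumes the pad token never appears as the first real token of a prompt.

```python
from typing import List

# Assumed value; matches the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_source_tokens(
    input_ids: List[int],
    source_len: int,
    pad_token_id: int,
    padding_side: str = "right",
) -> List[int]:
    """Hypothetical helper: copy input_ids and mask the prompt, padding-aware."""
    labels = list(input_ids)
    offset = 0
    if padding_side == "left":
        # Count the pad tokens that sit in front of the prompt.
        for token in input_ids:
            if token != pad_token_id:
                break
            offset += 1
    # Mask leading padding (if any) plus the prompt tokens.
    end = min(offset + source_len, len(labels))
    labels[:end] = [IGNORE_INDEX] * end
    return labels


# Toy example: pad id 0, prompt tokens [11, 12], response tokens [21, 22].
left_padded = [0, 0, 11, 12, 21, 22]

# What `label[:source_len] = IGNORE_INDEX` does: only the padding is masked.
naive = [IGNORE_INDEX] * 2 + left_padded[2:]

# Shifting the mask by the number of leading pad tokens covers the prompt.
fixed = mask_source_tokens(left_padded, source_len=2, pad_token_id=0, padding_side="left")
```

In this toy case `naive` leaves the prompt tokens `11` and `12` in the labels, while `fixed` masks everything up to the first response token. With right padding the helper behaves like the original slice.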

[BUG]: Bug problem for sft_dataset.py #4135