If we try to use `UnslothVisionDataCollator` on a dataset of conversations that have no images and only contain video samples as multimodal inputs, the script fails in two places.
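For context, a minimal video-only sample of the kind that triggers both failures might look like the following. The exact structure is an assumption based on the standard multimodal chat format; the collator may expect slightly different keys:

```python
# A conversation sample with video content but no "images" key at all.
# The collator then builds images = [] for it, which triggers the
# failures described below.
video_only_sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "clip_0001.mp4"},
                {"type": "text", "text": "Describe what happens in this clip."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A person waters a plant."}],
        },
    ],
}
```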
- Empty images list: if no images are provided as part of the message, `images` defaults to an empty list (not `None`). Downstream processors (like Qwen2.5-VL's image processor) do not know how to handle an empty list for a sample's images and throw an error.
```python
images = []
videos = []
video_kwargs = {'fps': []}
for example in examples:
    messages = self._select_messages_or_raw(example)

    # Check if data format is correct for VLMs!
    if len(messages) != 0:
        messages = self._validate_and_normalize_first_message(messages)

    # Also fix the messages if assistant must only be 1 string!
    # Only affects Mistral V3 I think!
    if self.assistant_single_content:
        messages = self._collapse_assistant_content(messages)
    pass

    message = self.processor.apply_chat_template(
        messages,
        tokenize = False,
        add_generation_prompt = False,
    )
    texts.append(message)
    # Dataset with 2 columns messages / images
    image, video, video_kwarg = self._extract_images_videos_for_example(example, messages)
    image = self._resize_images_inplace(image)
    images.append(image)

    if len(video) > 0: # Works for list, tuple or tensor
        videos.append(video)
        if video_kwarg is None:
            video_kwarg = {"fps": []}
        video_kwargs['fps'].extend(video_kwarg['fps'])
pass
```
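A minimal sketch of one possible guard for the first failure, assuming downstream processors accept `images=None` when a batch is text- or video-only (the helper name is hypothetical, not part of unsloth-zoo):

```python
def normalize_media_arg(media):
    # Hypothetical helper (not part of unsloth-zoo): collapse an empty
    # media list to None so image processors that cannot handle [] are
    # skipped entirely instead of raising on an empty images argument.
    if media is None or len(media) == 0:
        return None
    return media
```

The collator could apply this to `images` (and symmetrically to `videos`) just before calling the processor, so a video-only batch passes `images=None` rather than `images=[]`.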
- `_cast_pixel_values_dtype_inplace` expects `pixel_values`: if there are no images, the batch has no `pixel_values` entry and the `_cast_pixel_values_dtype_inplace` function errors out.
```python
def _cast_pixel_values_dtype_inplace(self, batch):
    # Pixtral accepts multiple images, so we have to cast it individually
    pixel_values = batch["pixel_values"]
    if type(pixel_values) is list:
        for j, pixel_value_j in enumerate(pixel_values):
            if type(pixel_value_j) is list:
                for k, pixel_value_k in enumerate(pixel_value_j):
                    pixel_value_j[k] = pixel_value_k.to(self.dtype)
            else:
                pixel_values[j] = pixel_value_j.to(self.dtype)
        pass
        batch["pixel_values"] = pixel_values
    else:
        batch["pixel_values"] = batch["pixel_values"].to(self.dtype)
    pass
    return batch
```
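A sketch of a guarded variant for the second failure: bail out early when the batch has no `pixel_values` key. This is a hypothetical rewrite, not the library's actual fix; the nested-list handling mirrors the Pixtral branch above:

```python
def cast_pixel_values_dtype_safe(batch, dtype):
    # Video-only or text-only batches may lack "pixel_values" entirely;
    # returning early avoids the KeyError the current code raises.
    if "pixel_values" not in batch:
        return batch
    pixel_values = batch["pixel_values"]
    if isinstance(pixel_values, list):
        # Pixtral-style input: a (possibly nested) list of per-image tensors.
        batch["pixel_values"] = [
            [t.to(dtype) for t in item] if isinstance(item, list) else item.to(dtype)
            for item in pixel_values
        ]
    else:
        batch["pixel_values"] = pixel_values.to(dtype)
    return batch
```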
(Both snippets above are from `unsloth-zoo/unsloth_zoo/vision_utils.py` at commit `ea85a26`: lines 775 to 807 and lines 950 to 965, respectively.)