If we try to use `UnslothVisionDataCollator` on a dataset of conversations that have no images and only contain video samples as multimodal inputs, the script fails in two places.
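For context, a minimal video-only sample of the kind that triggers both failures might look like the following. The exact structure is an assumption based on the standard multimodal chat format; the collator may expect slightly different keys:

```python
# A conversation sample with video content but no "images" key at all.
# The collator then builds images = [] for it, which triggers the
# failures described below.
video_only_sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "clip_0001.mp4"},
                {"type": "text", "text": "Describe what happens in this clip."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A person waters a plant."}],
        },
    ],
}
```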
- Empty images list: if no images are provided as part of the message, `images` defaults to an empty list (not `None`). Downstream processors (like Qwen2.5-VL's image processor) do not know how to handle an empty list for a sample's images and throw an error.
```python
images = []
videos = []
video_kwargs = {'fps': []}
for example in examples:
    messages = self._select_messages_or_raw(example)

    # Check if data format is correct for VLMs!
    if len(messages) != 0:
        messages = self._validate_and_normalize_first_message(messages)

    # Also fix the messages if assistant must only be 1 string!
    # Only affects Mistral V3 I think!
    if self.assistant_single_content:
        messages = self._collapse_assistant_content(messages)
    pass

    message = self.processor.apply_chat_template(
        messages,
        tokenize = False,
        add_generation_prompt = False,
    )
    texts.append(message)
    # Dataset with 2 columns messages / images
    image, video, video_kwarg = self._extract_images_videos_for_example(example, messages)
    image = self._resize_images_inplace(image)
    images.append(image)

    if len(video) > 0: # Works for list, tuple or tensor
        videos.append(video)
        if video_kwarg is None:
            video_kwarg = {"fps": []}
        video_kwargs['fps'].extend(video_kwarg['fps'])
pass
```
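A minimal sketch of one possible guard for the first failure, assuming downstream processors accept `images=None` when a batch is text- or video-only (the helper name is hypothetical, not part of unsloth-zoo):

```python
def normalize_media_arg(media):
    # Hypothetical helper (not part of unsloth-zoo): collapse an empty
    # media list to None so image processors that cannot handle [] are
    # skipped entirely instead of raising on an empty images argument.
    if media is None or len(media) == 0:
        return None
    return media
```

The collator could apply this to `images` (and symmetrically to `videos`) just before calling the processor, so a video-only batch passes `images=None` rather than `images=[]`.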
- `_cast_pixel_values_dtype_inplace` expects `pixel_values`: if there are no images, the batch has no `pixel_values` entry and the `_cast_pixel_values_dtype_inplace` function errors out.
```python
def _cast_pixel_values_dtype_inplace(self, batch):
    # Pixtral accepts multiple images, so we have to cast it individually
    pixel_values = batch["pixel_values"]
    if type(pixel_values) is list:
        for j, pixel_value_j in enumerate(pixel_values):
            if type(pixel_value_j) is list:
                for k, pixel_value_k in enumerate(pixel_value_j):
                    pixel_value_j[k] = pixel_value_k.to(self.dtype)
            else:
                pixel_values[j] = pixel_value_j.to(self.dtype)
        pass
        batch["pixel_values"] = pixel_values
    else:
        batch["pixel_values"] = batch["pixel_values"].to(self.dtype)
    pass
    return batch
```
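A sketch of a guarded variant for the second failure: bail out early when the batch has no `pixel_values` key. This is a hypothetical rewrite, not the library's actual fix; the nested-list handling mirrors the Pixtral branch above:

```python
def cast_pixel_values_dtype_safe(batch, dtype):
    # Video-only or text-only batches may lack "pixel_values" entirely;
    # returning early avoids the KeyError the current code raises.
    if "pixel_values" not in batch:
        return batch
    pixel_values = batch["pixel_values"]
    if isinstance(pixel_values, list):
        # Pixtral-style input: a (possibly nested) list of per-image tensors.
        batch["pixel_values"] = [
            [t.to(dtype) for t in item] if isinstance(item, list) else item.to(dtype)
            for item in pixel_values
        ]
    else:
        batch["pixel_values"] = pixel_values.to(dtype)
    return batch
```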
(Both snippets above are from `unsloth-zoo/unsloth_zoo/vision_utils.py` at commit `ea85a26`: lines 775 to 807 and lines 950 to 965, respectively.)