Search before asking
Description
As far as I know, Qwen2.5-VL is the first open-source multimodal model that can output bounding boxes for objects in an image.
For an example, see the spatial understanding cookbook: https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/spatial_understanding.ipynb

It would be great to add support for this, in a way that other models with the same capability can reuse.
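
To make the request more concrete, here is a rough sketch of the kind of parsing step this would involve. It assumes the model replies with a JSON list of objects, each carrying a `bbox_2d` field (`[x1, y1, x2, y2]` in pixels) and a `label` field, which is my reading of the cookbook output and may not hold for every prompt. `Box` and `parse_boxes` are hypothetical names for illustration, not an existing API.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """One detected region (hypothetical container, not a library type)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_boxes(model_text: str) -> List[Box]:
    """Extract boxes from a model reply.

    Assumption: the reply contains a single JSON list whose items look like
    {"bbox_2d": [x1, y1, x2, y2], "label": "..."}; any text before the first
    "[" or after the last "]" (prose, code fences) is ignored.
    """
    start = model_text.find("[")
    end = model_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in model output")

    boxes = []
    for item in json.loads(model_text[start:end + 1]):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append(Box(item.get("label", ""), x1, y1, x2, y2))
    return boxes
```

A model-agnostic version would probably keep `Box` as the shared type and add one small parser per model, since other models may format their coordinates differently.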
Use case
We would use this for generative process automation in https://github.com/OpenAdaptAI/OpenAdapt
Additional
No response
Are you willing to submit a PR?