Default to optimal confidence from model-eval #2206

Open · leeclemnet wants to merge 16 commits into main from feat/model-eval-recommended-defaults
Conversation

leeclemnet (Contributor) commented Apr 7, 2026

What does this PR do?

Model eval calculates F1-optimal confidence thresholds, but they aren't currently used for model inference. This PR, together with https://github.com/roboflow/roboflow/pull/11053, wires those thresholds into inference. This feature applies only to the inference_models pathway; the legacy inference pathways keep their existing default confidence.

The key changes are:

Wire recommendedParameters from model_eval through to inference

  • New RecommendedParameters pydantic model in inference_models/weights_providers/entities.py (confidence, per_class_confidence); parsed from Roboflow API in weights_providers/roboflow.py and threaded through auto_loaders/core.py → initialize_model
  • Auto-loader injects it onto the model instance via hasattr(type(model), "recommended_parameters"), so model classes opt in by declaring a class-level recommended_parameters: Optional[RecommendedParameters] = None
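As a sketch, the new entity and the opt-in injection could look roughly like this (the field names and the hasattr opt-in come from the description above; the constraints, helper name, and docstrings are illustrative assumptions, not the actual code):

```python
from typing import Dict, Optional

from pydantic import BaseModel, Field


class RecommendedParameters(BaseModel):
    # Field names per the PR description; the ge/le constraints are assumed.
    confidence: Optional[float] = Field(default=None, ge=0.0, le=1.0)
    per_class_confidence: Optional[Dict[str, float]] = None


def inject_recommended_parameters(model, params: Optional[RecommendedParameters]) -> None:
    # Models opt in by declaring a class-level `recommended_parameters`
    # attribute; the auto-loader only injects when that declaration exists.
    if hasattr(type(model), "recommended_parameters"):
        model.recommended_parameters = params
```

A model class that never declares the attribute is simply skipped, so the feature is invisible to models that don't opt in.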

New ConfidenceFilter and post_process_with_confidence_filter() wrapper (inference_models/models/base/)

  • Encodes the 4-tier priority chain: explicit user → per-class optimal → global optimal → hardcoded default
  • Wrapper rewrites the confidence kwarg to the filter's floor before calling post_process (so NMS keeps boxes any class might still want), then refines per-class on the way out
  • Added on ObjectDetectionModel, InstanceSegmentationModel, KeypointsDetectionModel, SemanticSegmentationModel, MultiLabelClassificationModel. Single-label classification deliberately opts out (top-1 always wins)
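A minimal sketch of that 4-tier priority chain (the function name and return shape are assumptions; the tier-2 floor is the minimum per-class threshold, so NMS keeps every box that any class might still accept, matching the floor/fallback split in the debug logs below):

```python
from typing import Dict, Optional, Tuple


def resolve_confidence(
    user_confidence: Optional[float],
    per_class_optimal: Optional[Dict[str, float]],
    global_optimal: Optional[float],
    hardcoded_default: float,
) -> Tuple[float, Optional[Dict[str, float]]]:
    """Return (floor_for_post_process, per_class_thresholds_or_None)."""
    # Tier 1: an explicit user value wins outright; no per-class refinement.
    if user_confidence is not None:
        return user_confidence, None
    # Tier 2: per-class optimal thresholds. post_process runs at the lowest
    # per-class threshold (the floor); refinement re-applies each class's
    # own threshold on the way out.
    if per_class_optimal:
        return min(per_class_optimal.values()), dict(per_class_optimal)
    # Tier 3: global optimal threshold for every class.
    if global_optimal is not None:
        return global_optimal, None
    # Tier 4: the model's hardcoded default.
    return hardcoded_default, None
```

In the tier-2 debug log below, floor=0.2000 corresponds to the minimum of the per-class map ('white-queen': 0.2) while each class keeps its own refinement threshold.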

Inference adapters route through the new wrapper (inference/core/models/inference_models_adapters.py)

  • OD/IS/KP/multi-label classification adapters call post_process_with_confidence_filter instead of post_process
  • Multi-label response builder now reads prediction.class_ids directly instead of re-thresholding the full confidence vector — the model's per-class decision survives to the API response
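The multi-label change above means the response builder trusts the model's per-class decision rather than re-applying a single threshold; schematically (names and signature are illustrative, not the adapter's actual code):

```python
from typing import Dict, List


def build_multilabel_response(
    class_names: List[str],
    confidences: List[float],   # full per-class confidence vector
    kept_class_ids: List[int],  # the model's per-class decision (prediction.class_ids)
) -> Dict[str, float]:
    # Read the kept class ids directly instead of re-thresholding the
    # confidence vector, so per-class optimal thresholds survive to the API.
    return {class_names[i]: confidences[i] for i in kept_class_ids}
```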

API request schema (inference/core/entities/requests/inference.py)

  • ObjectDetectionInferenceRequest.confidence default flipped from 0.4 → None so model-eval recommendations can take effect; explicit user values still win. Description updated to document the fallback chain

OLD inference path compatibility (inference/core/models/{object_detection,instance_segmentation,classification}_base.py)

  • Coalesce confidence is None to the existing per-class default at the entry of infer() / make_response(), so the USE_INFERENCE_MODELS=false matrix variant still works after the request default flipped to None
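The legacy-path coalescing amounts to something like this sketch (DEFAULT_CONFIDENCE stands in for whatever default each legacy model class already carries; the helper name is hypothetical):

```python
from typing import Optional

DEFAULT_CONFIDENCE = 0.4  # illustrative; each legacy model keeps its own value


def coalesce_confidence(confidence: Optional[float]) -> float:
    # A request-level None (the new schema default) falls back to the
    # model's existing hardcoded default, preserving legacy behaviour.
    return DEFAULT_CONFIDENCE if confidence is None else confidence
```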

Plumbing

  • BackendType moved from auto_loaders/entities.py → weights_providers/entities.py to break a circular import introduced by the new schema
  • CI workflow integration_tests_workflows_x86.yml overlays a local source build of inference_models (pip install --force-reinstall --no-deps ./inference_models) so adapter changes are actually exercised against pinned PyPI requirements

Dependencies:

Type of Change

  • New feature (non-breaking change that adds functionality)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Test details:

  • New unit tests: test_confidence_filter.py, test_confidence_filter_attribute.py, test_post_process_filter.py, test_recommended_parameters.py, plus expanded test_roboflow.py and test_core.py coverage

  • Tested the rfdetr OD workflow in staging against a local inference server without https://github.com/roboflow/roboflow/pull/11053 deployed - verified the hard-coded default is still in effect when the API doesn't yet serve recommendedParameters

Debug logging:

ConfidenceFilter: tier 4 (hardcoded default), floor=0.4000, fallback=0.4000, per_class=None

Debug logging:

ConfidenceFilter: tier 2 (per-class), floor=0.2000, fallback=0.3600, per_class={'bishop': 0.5, 'black-bishop': 0.26, 'black-king': 0.75, 'black-knight': 0.22, 'black-pawn': 0.47, 'black-queen': 0.24, 'black-rook': 0.27, 'white-bishop': 0.67, 'white-king': 0.46, 'white-knight': 0.33, 'white-pawn': 0.47, 'white-queen': 0.2, 'white-rook': 0.45}

Debug logging:

ConfidenceFilter: tier 2 (per-class), floor=0.4600, fallback=0.4600, per_class={'Car-rims': 0.46, 'music-note': 0.5}

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Additional Context

@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch 5 times, most recently from a19cb78 to 5660d57 Compare April 8, 2026 18:37
@leeclemnet leeclemnet changed the title Use recommended optimal confidence from model-eval Default to optimal per-class/global confidence from model-eval Apr 8, 2026
@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch 2 times, most recently from 32fa48d to 473dd08 Compare April 8, 2026 20:07
@leeclemnet leeclemnet changed the title Default to optimal per-class/global confidence from model-eval Default to optimal confidence from model-eval Apr 8, 2026
Comment thread inference_models/inference_models/models/auto_loaders/core.py Outdated
Comment thread inference_models/inference_models/models/auto_loaders/entities.py
@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch 5 times, most recently from 33ea1b0 to 262b5bf Compare April 10, 2026 16:39
@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch from ba6acfd to 65e7aad Compare April 14, 2026 12:36
Comment thread inference_models/inference_models/models/base/confidence_filter.py Outdated
)
)
if confidence_filter.has_per_class_refinement:
results = [
leeclemnet (Contributor Author):

This one needs to stay as a second loop since post_process_predictions_for_precomputed_embeddings is also used by OWLv2

@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch from 581382e to da83b8e Compare April 14, 2026 18:39
leeclemnet (Contributor Author):

Changes:

  • Moved ConfidenceFilter into common post_processing
  • ConfidenceFilter now accepts per-model default confidence and uses it as the fallback if no other confidence is set
  • Optimization: refactored prediction filters and moved them into the existing post-process loops (except Instant because of shared post-processing method with OWLv2) -- now only one loop over detections instead of two
  • Made confidence Optional[float] = None on concrete classes to match bases

call when this is False."""
return self._per_class is not None

def passes(self, class_name: str, confidence: float) -> bool:
PawelPeczek-Roboflow (Collaborator), Apr 15, 2026:

is passes(...) actually used outside?
also - the name is misleading - it looks like it should return False if confidence is below the global floor, regardless of the presence of self._per_class

if not used - I am voting to turn it into a private helper

if global_optimal is not None
else default_confidence
)
self._per_class = dict(per_class)
Collaborator:

is dict(...) needed?

3. Global optimal — single threshold for everything
4. Model's hardcoded default — single threshold for everything

Exposes:
Collaborator:

try to reduce comments that express interface meaning in natural language - the interface should be easily interpretable by looking at the code, or redesigned such that the reader builds good intuitions just looking at the signatures

also handles the no-refinement case (returns all-True)."""
n = len(class_ids)
if not self.has_per_class_refinement:
return torch.ones(n, dtype=torch.bool)
PawelPeczek-Roboflow (Collaborator), Apr 15, 2026:

retracted
still haven't consumed the whole PR, but why not just shortcut with confidences >= torch.full_like(confidences, self._fallback)?
the name of the method requires additional clarification in the comment, increasing the cognitive load

Collaborator:

I would make this a private helper - and maybe it's not worth creating a mask and filtering with it when it will be all True; simply always return the original object?

class_id=detections.class_id[keep],
confidence=detections.confidence[keep],
image_metadata=detections.image_metadata,
bboxes_metadata=detections.bboxes_metadata,
Collaborator:

looks like bboxes_metadata is not filtered?
maybe we should re-use the other method?

bboxes_metadata=bboxes_metadata,
)

def refine_keypoints_and_detections(
Collaborator:

from what I remember, Detections production is optional for keypoints models - we should probably reflect that, or let the user decide to refine separately

)
return refined_keypoints, refined_detections

def refine_multilabel_prediction(
PawelPeczek-Roboflow (Collaborator), Apr 15, 2026:

retracted
use of passes(...) which only works with per-class refinement

image_metadata=prediction.image_metadata,
)

def refine_segmentation_result(
PawelPeczek-Roboflow (Collaborator), Apr 15, 2026:

retracted
this method looks aligned with what I would expect (auto-fallback even if the client does not care about has_per_class_refinement), and at the same time not aligned with the other methods, creating inconsistency

Collaborator:

maybe we should check if class alignment is there and only apply if present?

safe_idx = class_ids_long.clamp(0, max(len(class_names) - 1, 0))
per_detection_thresholds = torch.where(
in_range,
thresholds_per_class[safe_idx] if len(class_names) > 0 else torch.full_like(confidences, self._fallback),
Collaborator:

maybe empty class names deserve a shortcut earlier in the logic?

)
return confidences >= per_detection_thresholds

def per_class_thresholds(self, class_names: List[str]) -> List[float]:
PawelPeczek-Roboflow (Collaborator), Apr 15, 2026:

maybe a tensor should be returned from the function?

Collaborator:

maybe also private helper?

recommended_parameters: Optional[RecommendedParameters],
default_confidence: float,
):
# Tier 1: explicit user value wins outright. No per-class refinement
Collaborator:

this is just sanitisation, optional - maybe a helper function to establish all the values we want to set, and then set the state of the class based on that - this way, when any field needs to be added, it's easier not to get lost in a jungle of if-elif-else

# existing per-image loop when has_per_class_refinement is True.
# ------------------------------------------------------------------

def refine_detections(
Collaborator:

looks like the true public interface is refine_detections, refine_instance_detections, refine_keypoints_and_detections, refine_multilabel_prediction, refine_segmentation_result - I would keep them at the top of the class

return image_bboxes, masks


class ConfidenceFilter:
Collaborator:

looking at the code over and over, and I still have the feeling that this does not fully match the puzzle.

When I see ConfidenceFilter - I believe the responsibility of this class is generating and applying filtering criteria based on confidence. And I believe this gut feeling is fair.

Current implementations are trivial (comparing tensor to constant float value)
Examples:

# rfdetr OD
confidence_mask = predicted_confidence > confidence

# rfdetr IS
confidence_mask = confidence > threshold

# resnet ML
batch_element_confidence >= confidence

# resnet single label: N/A

# YOLO IS
mask = class_conf > conf_thresh

The proposition in this class is to extend the logic into a confidence filter on steroids, which basically

  • requires running the old-style filtering (but based on the floor value dictated by this class)
  • and then running another construction of tensors in the output entities
  • at the same time entangling knowledge about the internals of the data formats output by the model classes at the end of the whole forward pass - which does not need to be known by ConfidenceFilter

we end up with:

  • worst-case-scenario double filtering, which is avoidable (performance loss)
  • unnatural interactions between the model and this class
  • a class interface which is not generic and requires mutation for future variations of output entities

Let's discuss whether you find those observations correct - maybe I lack some visibility, maybe this is for some reason not possible.

@leeclemnet leeclemnet marked this pull request as draft April 15, 2026 15:00
@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch 2 times, most recently from cdbc2bd to e688ae7 Compare April 15, 2026 19:35
@leeclemnet leeclemnet force-pushed the feat/model-eval-recommended-defaults branch from e688ae7 to 341300d Compare April 15, 2026 19:58
@leeclemnet leeclemnet marked this pull request as ready for review April 15, 2026 23:58