
Fix audit tool type undercount for Copilot MCP-only runs#26689

Merged
pelikhan merged 5 commits into main from copilot/fix-tool-types-audit-issue
Apr 16, 2026

Conversation

Contributor

Copilot AI commented Apr 16, 2026

Copilot workflow audits could report tool_types: 0 (and tool_breadth: narrow) even when MCP tools were clearly used, causing misleading behavior fingerprints and observability summaries. This change ensures MCP usage is included in tool-type calculations when standard parsed tool calls are missing.

  • Tool usage aggregation now includes MCP sources

    • buildAuditData now merges MCP-derived tool usage into tool_usage before generating behavior fingerprint and observability insights.
    • Added mergeMCPToolUsageInfo(...) to combine:
      • parsed metrics.ToolCalls
      • mcp_tool_usage.summary (primary source)
      • mcp_tool_usage.tool_calls (fallback when summary is absent)
  • Consistent downstream analytics

    • Behavior fingerprint tool_types/tool_breadth now reflect MCP activity for Copilot runs.
    • Observability execution insight text/evidence now reports MCP-derived tool type counts instead of 0.
  • Regression coverage for Copilot-style MCP-only runs

    • Added focused tests for:
      • summary-based MCP merge
      • tool-call fallback merge
      • end-to-end audit data behavior where metrics tool calls are empty but MCP usage exists
toolUsage := buildToolUsageInfo(metrics)
toolUsage = mergeMCPToolUsageInfo(toolUsage, mcpToolUsage)
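The merge step quoted above could be sketched as follows. This is a minimal illustration based only on this PR description: every type and helper name except the quoted call is an assumption, not the actual gh-aw definition.

```go
package main

import (
	"fmt"
	"sort"
)

// ToolUsageEntry, MCPToolSummary, MCPToolCall, and MCPToolUsageData are
// assumed shapes inferred from the PR summary, not the real gh-aw types.
type ToolUsageEntry struct {
	Name      string
	CallCount int
}

type MCPToolSummary struct {
	ServerName, ToolName string
	CallCount            int
}

type MCPToolCall struct {
	ServerName, ToolName string
}

type MCPToolUsageData struct {
	Summary   []MCPToolSummary
	ToolCalls []MCPToolCall
}

// mergeMCPToolUsage folds MCP-derived usage into the parsed entries,
// preferring the summary and falling back to raw tool calls, as the
// bullet list above describes.
func mergeMCPToolUsage(base []ToolUsageEntry, mcp *MCPToolUsageData) []ToolUsageEntry {
	if mcp == nil {
		return base
	}
	counts := map[string]int{}
	var order []string
	add := func(name string, n int) {
		if _, seen := counts[name]; !seen {
			order = append(order, name)
		}
		counts[name] += n
	}
	for _, b := range base {
		add(b.Name, b.CallCount)
	}
	if len(mcp.Summary) > 0 { // primary source
		for _, s := range mcp.Summary {
			name := s.ToolName
			if s.ServerName != "" {
				name = s.ServerName + "." + s.ToolName
			}
			add(name, s.CallCount)
		}
	} else { // fallback when the summary is absent
		for _, c := range mcp.ToolCalls {
			name := c.ToolName
			if c.ServerName != "" {
				name = c.ServerName + "." + c.ToolName
			}
			add(name, 1)
		}
	}
	merged := make([]ToolUsageEntry, 0, len(order))
	for _, name := range order {
		merged = append(merged, ToolUsageEntry{Name: name, CallCount: counts[name]})
	}
	// Higher call counts sort first, matching the test expectations below.
	sort.SliceStable(merged, func(i, j int) bool { return merged[i].CallCount > merged[j].CallCount })
	return merged
}

func main() {
	mcp := &MCPToolUsageData{ToolCalls: []MCPToolCall{
		{ServerName: "safeoutputs", ToolName: "create_discussion"},
		{ServerName: "safeoutputs", ToolName: "create_discussion"},
		{ServerName: "safeoutputs", ToolName: "push_repo_memory"},
	}}
	for _, e := range mergeMCPToolUsage(nil, mcp) {
		fmt.Printf("%s x%d\n", e.Name, e.CallCount)
	}
}
```

With an empty base (the Copilot MCP-only scenario), the merged slice still has two distinct tool types, which is exactly what keeps tool_types from collapsing to 0.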

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/d7000781-6a64-476c-84a4-6eb743492530

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@github-actions
Contributor

👋 Thanks for this agent-created WIP draft targeting the tool_types: 0 bug in Copilot engine audit runs! The problem description, reproduction steps, expected vs. actual behaviour, root cause hypothesis, and impact sections are all well laid out — great foundation for the fix.

Since this is still a placeholder (no code written yet), the only gap to address once implementation begins is test coverage for the corrected tool_types counting logic.

When you're ready to add tests, here's a prompt to get started:

The `tool_types` counter in the Copilot engine audit path has been fixed. Please add unit tests that cover:
1. A Copilot engine audit run where tool calls are present — assert that `tool_types` reflects the correct non-zero count.
2. A Copilot engine audit run with no tool calls — assert that `tool_types` is 0.
3. Any edge cases identified in the root cause analysis (e.g., mixed tool types, large tool lists).
Follow the existing test conventions in the audit package (table-driven tests, `require`/`assert` from testify, `//go:build !integration` build tag at the top of the test file).

Generated by Contribution Check

Copilot AI and others added 2 commits April 16, 2026 17:26
Copilot AI changed the title from "[WIP] Fix incorrect tool types in Copilot engine audits" to "Fix audit tool type undercount for Copilot MCP-only runs" on Apr 16, 2026
Copilot AI requested a review from pelikhan April 16, 2026 17:35
@pelikhan pelikhan marked this pull request as ready for review April 16, 2026 18:22
Copilot AI review requested due to automatic review settings April 16, 2026 18:22
@pelikhan pelikhan merged commit 2289a7b into main Apr 16, 2026
54 of 55 checks passed
@pelikhan pelikhan deleted the copilot/fix-tool-types-audit-issue branch April 16, 2026 18:26
Contributor

Copilot AI left a comment

Pull request overview

Ensures audit reporting counts MCP tool usage when standard parsed tool calls are missing (e.g., Copilot MCP-only runs), fixing tool_types: 0 / misleading behavior fingerprints and observability insights.

Changes:

  • Merge MCP-derived tool usage into tool_usage during buildAuditData generation.
  • Add mergeMCPToolUsageInfo(...) to combine parsed tool calls with MCP summary/tool-call fallback.
  • Add regression tests covering MCP-only runs and merge behavior (summary and tool-call fallback).
Summary per file:

  • pkg/cli/audit_report.go: merges MCP tool usage into toolUsage before downstream analytics.
  • pkg/cli/audit_agentic_analysis.go: introduces the mergeMCPToolUsageInfo helper for combining tool-usage sources.
  • pkg/cli/audit_mcp_tool_usage_test.go: adds an end-to-end regression test ensuring MCP-only runs produce nonzero tool-type analytics.
  • pkg/cli/audit_agentic_analysis_test.go: adds focused unit tests for summary-based and tool-call fallback MCP merges.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment on lines +140 to +156
if len(mcpToolUsage.Summary) > 0 {
	for _, summary := range mcpToolUsage.Summary {
		switch {
		case summary.ServerName != "" && summary.ToolName != "":
			addOrUpdateToolUsage(summary.ServerName+"."+summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		case summary.ToolName != "":
			addOrUpdateToolUsage(summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		}
	}
} else {
	for _, call := range mcpToolUsage.ToolCalls {
		switch {
		case call.ServerName != "" && call.ToolName != "":
			addOrUpdateToolUsage(call.ServerName+"."+call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		case call.ToolName != "":
			addOrUpdateToolUsage(call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		}

Copilot AI Apr 16, 2026


MCP tool names are currently constructed as serverName+"."+toolName before calling workflow.PrettifyToolName, which means they won’t dedupe/merge with parsed tool names from metrics.ToolCalls (those are prettified from mcp__server__tool into server_tool). This can double-count tool types (and break any substring-based logic like task-domain detection that expects underscore names). Consider constructing an MCP-style raw name first (e.g., mcp__<server>__<tool>) or otherwise normalizing to the same canonical format as PrettifyToolName so MCP and parsed sources collapse to one entry.
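The normalization this comment suggests could be sketched as follows. prettify here is a simplified stand-in for workflow.PrettifyToolName, implementing only the behavior the review describes (mcp__server__tool becoming server_tool); the real function may handle more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// prettify is a simplified stand-in for workflow.PrettifyToolName as the
// review describes it: mcp__<server>__<tool> collapses to <server>_<tool>,
// and anything else passes through unchanged.
func prettify(raw string) string {
	if rest, ok := strings.CutPrefix(raw, "mcp__"); ok {
		return strings.ReplaceAll(rest, "__", "_")
	}
	return raw
}

// canonicalMCPName builds the MCP-style raw name first, so entries derived
// from the MCP summary collapse to the same key as parsed tool calls that
// went through prettification.
func canonicalMCPName(server, tool string) string {
	if server == "" {
		return prettify(tool)
	}
	return prettify("mcp__" + server + "__" + tool)
}

func main() {
	parsed := prettify("mcp__github__search_issues")       // from metrics.ToolCalls
	merged := canonicalMCPName("github", "search_issues")  // from mcp_tool_usage
	fmt.Println(parsed, merged, parsed == merged)
}
```

Because both paths converge on github_search_issues, the two sources dedupe into one tool-usage entry instead of double-counting as github_search_issues and github.search_issues.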

Comment on lines +336 to +355
	require.Len(t, merged, 2, "MCP summary should be represented as tool usage entries")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "higher call-count MCP tool should sort first")
	assert.Equal(t, 2, merged[0].CallCount, "call count should come from MCP summary")
	assert.Equal(t, "safeoutputs.push_repo_memory", merged[1].Name, "second MCP tool should be preserved")
}

func TestMergeMCPToolUsageInfoFallsBackToToolCalls(t *testing.T) {
	merged := mergeMCPToolUsageInfo(nil, &MCPToolUsageData{
		ToolCalls: []MCPToolCall{
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 20, OutputSize: 200},
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 40, OutputSize: 250},
			{ServerName: "safeoutputs", ToolName: "push_repo_memory", InputSize: 10, OutputSize: 100},
		},
	})

	require.Len(t, merged, 2, "fallback should aggregate MCP tool calls when summary is absent")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "tool call fallback should aggregate by MCP tool name")
	assert.Equal(t, 2, merged[0].CallCount, "fallback should count repeated MCP tool calls")
	assert.Equal(t, 40, merged[0].MaxInputSize, "fallback should keep max input size from calls")
	assert.Equal(t, 250, merged[0].MaxOutputSize, "fallback should keep max output size from calls")

Copilot AI Apr 16, 2026


These assertions expect merged MCP tool names to use a dot separator (e.g. safeoutputs.create_discussion), but elsewhere the codebase represents MCP tools in underscore form after prettification (e.g. mcp__github__search_issues -> github_search_issues). If mergeMCPToolUsageInfo is updated to normalize MCP tool names to match workflow.PrettifyToolName output, these expectations should be updated accordingly to avoid locking in an inconsistent naming scheme that prevents deduping with parsed tool calls.

@github-actions
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100

Excellent test quality

  • New/modified tests analyzed: 3
  • ✅ Design tests (behavioral contracts): 3 (100%)
  • ⚠️ Implementation tests (low value): 0 (0%)
  • Tests with error/edge cases: 3 (100%)
  • Duplicate test clusters: 0
  • Test inflation detected: No
  • 🚨 Coding-guideline violations: None

Test Classification Details

  • TestMergeMCPToolUsageInfoUsesSummaryWhenMetricsToolCallsEmpty (pkg/cli/audit_agentic_analysis_test.go:328): ✅ Design, no issues detected
  • TestMergeMCPToolUsageInfoFallsBackToToolCalls (pkg/cli/audit_agentic_analysis_test.go:344): ✅ Design, no issues detected
  • TestBuildAuditDataUsesMCPToolUsageForToolTypes (pkg/cli/audit_mcp_tool_usage_test.go:210): ✅ Design, no issues detected

Test Analysis

TestMergeMCPToolUsageInfoUsesSummaryWhenMetricsToolCallsEmpty

Classification: ✅ Design test — behavioral contract
What it enforces: When mergeMCPToolUsageInfo receives no metrics tool calls but a populated MCP Summary, it must produce correctly-named, correctly-sorted tool usage entries with accurate call counts.
Edge/error coverage: Explicitly exercises the empty-metrics edge case (nil first argument), covering the primary bug scenario this PR fixes. All assertions carry descriptive messages.

TestMergeMCPToolUsageInfoFallsBackToToolCalls

Classification: ✅ Design test — behavioral contract
What it enforces: When the MCP Summary is absent but raw ToolCalls are present, mergeMCPToolUsageInfo must aggregate by tool name, count repetitions, and preserve max input/output sizes.
Edge/error coverage: Covers the fallback code path with aggregation logic and boundary values (max of two call sizes). All 5 assertions carry descriptive messages.
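The aggregation contract those assertions pin down (repeated calls increment the count, input/output sizes keep the maximum seen) can be illustrated in isolation. The types below are assumptions for the sketch, not the actual gh-aw definitions.

```go
package main

import "fmt"

// mcpCall and callStats are assumed shapes for this sketch only.
type mcpCall struct {
	Tool                  string
	InputSize, OutputSize int
}

type callStats struct {
	CallCount, MaxInputSize, MaxOutputSize int
}

// aggregate folds raw MCP tool calls into per-tool stats: counts accumulate,
// sizes keep the per-tool maximum.
func aggregate(calls []mcpCall) map[string]callStats {
	out := map[string]callStats{}
	for _, c := range calls {
		s := out[c.Tool]
		s.CallCount++
		if c.InputSize > s.MaxInputSize {
			s.MaxInputSize = c.InputSize
		}
		if c.OutputSize > s.MaxOutputSize {
			s.MaxOutputSize = c.OutputSize
		}
		out[c.Tool] = s
	}
	return out
}

func main() {
	stats := aggregate([]mcpCall{
		{"safeoutputs.create_discussion", 20, 200},
		{"safeoutputs.create_discussion", 40, 250},
		{"safeoutputs.push_repo_memory", 10, 100},
	})
	s := stats["safeoutputs.create_discussion"]
	fmt.Println(s.CallCount, s.MaxInputSize, s.MaxOutputSize) // 2 40 250
}
```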

TestBuildAuditDataUsesMCPToolUsageForToolTypes

Classification: ✅ Design test — behavioral contract
What it enforces: buildAuditData must incorporate MCP-derived tool types into ToolUsage, BehaviorFingerprint.ToolBreadth, and the execution ObservabilityInsight.Evidence even when LogMetrics.ToolCalls is nil (the exact scenario for Copilot MCP-only runs).
Edge/error coverage: Directly models the production bug (nil ToolCalls), and asserts three distinct observable outputs. Uses require correctly for preconditions (Len, NotNil) and assert for validations.


Coding Guideline Checks

  • Build tags (//go:build !integration): ✅ Both test files have correct tags
  • No mock libraries (gomock, testify/mock, .EXPECT()): ✅ No mocks used
  • Assertion messages present: ✅ All assertions include descriptive messages
  • No test inflation (test/prod line ratio ≤ 2:1): ✅ 69 test lines added vs 84 production lines (ratio 0.82)

Score Breakdown

  • Behavioral coverage (3/3 design tests): 40/40
  • Error/edge case coverage (3/3 tests): 30/30
  • Low duplication (0 duplicate clusters): 20/20
  • Proportional growth (ratio 0.82, no inflation): 10/10
  • Total: 100/100

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 3 tests — unit (//go:build !integration)
  • 🟨 JavaScript (*.test.cjs, *.test.js): 0 tests

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All 3 tests enforce behavioral contracts directly tied to the bug being fixed — they verify observable outputs (ToolUsage, ToolBreadth, Evidence) under the exact real-world scenario that caused the undercount.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §24526833184

🧪 Test quality analysis by Test Quality Sentinel

Contributor

@github-actions github-actions bot left a comment


✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 3 new tests enforce clear behavioral contracts covering the exact bug scenario fixed in this PR.

@github-actions
Contributor


Warning

The push_to_pull_request_branch operation failed: Failed to fetch branch copilot/fix-tool-types-audit-issue: The process '/usr/bin/git' failed with exit code 128. The code changes were not applied.

🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic (pkg/) with 153 new lines of code, but does not have a linked Architecture Decision Record (ADR).

AI has analyzed the PR diff and generated a draft ADR to help you get started:

📄 Draft ADR: docs/adr/26689-merge-mcp-tool-usage-into-audit-behavior-fingerprint.md

What to do next

  1. Review the draft ADR committed to your branch — it was generated from the PR diff
  2. Complete the missing sections — add context the AI couldn't infer, refine the decision rationale, and list real alternatives you considered
  3. Commit the finalized ADR to docs/adr/ on your branch
  4. Reference the ADR in this PR body by adding a line such as:

    ADR: ADR-26689: Merge MCP Tool Usage into Audit Behavior Fingerprint

Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter

"AI made me procrastinate on key design decisions. Because refactoring was cheap, I could always say 'I'll deal with this later.' Deferring decisions corroded my ability to think clearly."

ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.


📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete:

  • Context — What is the problem? What forces are at play?
  • Decision — What did you decide? Why?
  • Alternatives Considered — What else could have been done?
  • Consequences — What are the trade-offs (positive and negative)?

All ADRs are stored in docs/adr/ as Markdown files numbered by PR number (e.g., 26689-merge-mcp-tool-usage-into-audit-behavior-fingerprint.md for PR #26689).

🔒 This PR cannot merge until an ADR is linked in the PR body.

References: §24526833175

🏗️ ADR gate enforced by Design Decision Gate 🏗️



Development

Successfully merging this pull request may close these issues.

[cli-tools-test] audit: tool_types=0 for Copilot engine runs despite MCP tool usage

3 participants