Fix audit tool type undercount for Copilot MCP-only runs #26689
Conversation
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/d7000781-6a64-476c-84a4-6eb743492530 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
👋 Thanks for this agent-created WIP draft. Since this is still a placeholder (no code written yet), the only gap to address once implementation begins is test coverage for the corrected behavior. When you're ready to add tests, here's a prompt to get started:
Pull request overview
Ensures audit reporting counts MCP tool usage when standard parsed tool calls are missing (e.g., Copilot MCP-only runs), fixing tool_types: 0 / misleading behavior fingerprints and observability insights.
Changes:
- Merge MCP-derived tool usage into `tool_usage` during `buildAuditData` generation.
- Add `mergeMCPToolUsageInfo(...)` to combine parsed tool calls with MCP summary/tool-call fallback.
- Add regression tests covering MCP-only runs and merge behavior (summary and tool-call fallback).
Summary per file
| File | Description |
|---|---|
| pkg/cli/audit_report.go | Merges MCP tool usage into toolUsage before downstream analytics. |
| pkg/cli/audit_agentic_analysis.go | Introduces mergeMCPToolUsageInfo helper for combining tool-usage sources. |
| pkg/cli/audit_mcp_tool_usage_test.go | Adds end-to-end regression test ensuring MCP-only runs produce nonzero tool-type analytics. |
| pkg/cli/audit_agentic_analysis_test.go | Adds focused unit tests for summary-based and tool-call fallback MCP merges. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 2
```go
if len(mcpToolUsage.Summary) > 0 {
	for _, summary := range mcpToolUsage.Summary {
		switch {
		case summary.ServerName != "" && summary.ToolName != "":
			addOrUpdateToolUsage(summary.ServerName+"."+summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		case summary.ToolName != "":
			addOrUpdateToolUsage(summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		}
	}
} else {
	for _, call := range mcpToolUsage.ToolCalls {
		switch {
		case call.ServerName != "" && call.ToolName != "":
			addOrUpdateToolUsage(call.ServerName+"."+call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		case call.ToolName != "":
			addOrUpdateToolUsage(call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		}
	}
}
```
MCP tool names are currently constructed as serverName+"."+toolName before calling workflow.PrettifyToolName, which means they won’t dedupe/merge with parsed tool names from metrics.ToolCalls (those are prettified from mcp__server__tool into server_tool). This can double-count tool types (and break any substring-based logic like task-domain detection that expects underscore names). Consider constructing an MCP-style raw name first (e.g., mcp__<server>__<tool>) or otherwise normalizing to the same canonical format as PrettifyToolName so MCP and parsed sources collapse to one entry.
```go
	require.Len(t, merged, 2, "MCP summary should be represented as tool usage entries")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "higher call-count MCP tool should sort first")
	assert.Equal(t, 2, merged[0].CallCount, "call count should come from MCP summary")
	assert.Equal(t, "safeoutputs.push_repo_memory", merged[1].Name, "second MCP tool should be preserved")
}

func TestMergeMCPToolUsageInfoFallsBackToToolCalls(t *testing.T) {
	merged := mergeMCPToolUsageInfo(nil, &MCPToolUsageData{
		ToolCalls: []MCPToolCall{
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 20, OutputSize: 200},
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 40, OutputSize: 250},
			{ServerName: "safeoutputs", ToolName: "push_repo_memory", InputSize: 10, OutputSize: 100},
		},
	})

	require.Len(t, merged, 2, "fallback should aggregate MCP tool calls when summary is absent")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "tool call fallback should aggregate by MCP tool name")
	assert.Equal(t, 2, merged[0].CallCount, "fallback should count repeated MCP tool calls")
	assert.Equal(t, 40, merged[0].MaxInputSize, "fallback should keep max input size from calls")
	assert.Equal(t, 250, merged[0].MaxOutputSize, "fallback should keep max output size from calls")
```
These assertions expect merged MCP tool names to use a dot separator (e.g. safeoutputs.create_discussion), but elsewhere the codebase represents MCP tools in underscore form after prettification (e.g. mcp__github__search_issues -> github_search_issues). If mergeMCPToolUsageInfo is updated to normalize MCP tool names to match workflow.PrettifyToolName output, these expectations should be updated accordingly to avoid locking in an inconsistent naming scheme that prevents deduping with parsed tool calls.
🧪 Test Quality Sentinel Report

Test Quality Score: 100/100 — ✅ Excellent test quality
Test Classification Details (all 3 test classifications)
Test Analysis
| Check | Status |
|---|---|
| Build tags (`//go:build !integration`) | ✅ Both test files have correct tags |
| No mock libraries (gomock, testify/mock, `.EXPECT()`) | ✅ No mocks used |
| Assertion messages present | ✅ All assertions include descriptive messages |
| No test inflation (test/prod line ratio ≤ 2:1) | ✅ 69 test lines added vs 84 production lines (ratio 0.82) |
Score Breakdown
| Component | Score |
|---|---|
| Behavioral coverage (3/3 design tests) | 40/40 |
| Error/edge case coverage (3/3 tests) | 30/30 |
| Low duplication (0 duplicate clusters) | 20/20 |
| Proportional growth (ratio 0.82, no inflation) | 10/10 |
| Total | 100/100 |
Language Support
Tests analyzed:
- 🐹 Go (`*_test.go`): 3 tests — unit (`//go:build !integration`)
- 🟨 JavaScript (`*.test.cjs`, `*.test.js`): 0 tests
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All 3 tests enforce behavioral contracts directly tied to the bug being fixed: they verify observable outputs (`ToolUsage`, `ToolBreadth`, `Evidence`) under the exact real-world scenario that caused the undercount.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls (mocking internals)
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
References: §24526833184
🧪 Test quality analysis by Test Quality Sentinel
Warning

🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic. AI has analyzed the PR diff and generated a draft ADR to help you get started.

📄 Draft ADR

What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.

📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete: Context, Decision, Status, and Consequences.
All ADRs are stored in
References: §24526833175
Copilot workflow audits could report `tool_types: 0` (and `tool_breadth: narrow`) even when MCP tools were clearly used, causing misleading behavior fingerprints and observability summaries. This change ensures MCP usage is included in tool-type calculations when standard parsed tool calls are missing.

Tool usage aggregation now includes MCP sources

`buildAuditData` now merges MCP-derived tool usage into `tool_usage` before generating behavior fingerprint and observability insights. Added `mergeMCPToolUsageInfo(...)` to combine:
- `metrics.ToolCalls`
- `mcp_tool_usage.summary` (primary source)
- `mcp_tool_usage.tool_calls` (fallback when summary is absent)

Consistent downstream analytics

`tool_types` / `tool_breadth` now reflect MCP activity for Copilot runs instead of 0.

Regression coverage for Copilot-style MCP-only runs