Fix audit tool type undercount for Copilot MCP-only runs #26689
Conversation
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/d7000781-6a64-476c-84a4-6eb743492530 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
👋 Thanks for this agent-created WIP draft. Since this is still a placeholder (no code written yet), the only gap to address once implementation begins is test coverage for the corrected behavior. When you're ready to add tests, here's a prompt to get started:
Pull request overview
Ensures audit reporting counts MCP tool usage when standard parsed tool calls are missing (e.g., Copilot MCP-only runs), fixing tool_types: 0 / misleading behavior fingerprints and observability insights.
Changes:
- Merge MCP-derived tool usage into `tool_usage` during `buildAuditData` generation.
- Add `mergeMCPToolUsageInfo(...)` to combine parsed tool calls with MCP summary/tool-call fallback.
- Add regression tests covering MCP-only runs and merge behavior (summary and tool-call fallback).
Summary per file
| File | Description |
|---|---|
| pkg/cli/audit_report.go | Merges MCP tool usage into toolUsage before downstream analytics. |
| pkg/cli/audit_agentic_analysis.go | Introduces mergeMCPToolUsageInfo helper for combining tool-usage sources. |
| pkg/cli/audit_mcp_tool_usage_test.go | Adds end-to-end regression test ensuring MCP-only runs produce nonzero tool-type analytics. |
| pkg/cli/audit_agentic_analysis_test.go | Adds focused unit tests for summary-based and tool-call fallback MCP merges. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 2
```go
if len(mcpToolUsage.Summary) > 0 {
	for _, summary := range mcpToolUsage.Summary {
		switch {
		case summary.ServerName != "" && summary.ToolName != "":
			addOrUpdateToolUsage(summary.ServerName+"."+summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		case summary.ToolName != "":
			addOrUpdateToolUsage(summary.ToolName, summary.CallCount, summary.MaxInputSize, summary.MaxOutputSize, summary.MaxDuration)
		}
	}
} else {
	for _, call := range mcpToolUsage.ToolCalls {
		switch {
		case call.ServerName != "" && call.ToolName != "":
			addOrUpdateToolUsage(call.ServerName+"."+call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		case call.ToolName != "":
			addOrUpdateToolUsage(call.ToolName, 1, call.InputSize, call.OutputSize, call.Duration)
		}
	}
}
```
MCP tool names are currently constructed as serverName+"."+toolName before calling workflow.PrettifyToolName, which means they won’t dedupe/merge with parsed tool names from metrics.ToolCalls (those are prettified from mcp__server__tool into server_tool). This can double-count tool types (and break any substring-based logic like task-domain detection that expects underscore names). Consider constructing an MCP-style raw name first (e.g., mcp__<server>__<tool>) or otherwise normalizing to the same canonical format as PrettifyToolName so MCP and parsed sources collapse to one entry.
```go
	require.Len(t, merged, 2, "MCP summary should be represented as tool usage entries")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "higher call-count MCP tool should sort first")
	assert.Equal(t, 2, merged[0].CallCount, "call count should come from MCP summary")
	assert.Equal(t, "safeoutputs.push_repo_memory", merged[1].Name, "second MCP tool should be preserved")
}

func TestMergeMCPToolUsageInfoFallsBackToToolCalls(t *testing.T) {
	merged := mergeMCPToolUsageInfo(nil, &MCPToolUsageData{
		ToolCalls: []MCPToolCall{
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 20, OutputSize: 200},
			{ServerName: "safeoutputs", ToolName: "create_discussion", InputSize: 40, OutputSize: 250},
			{ServerName: "safeoutputs", ToolName: "push_repo_memory", InputSize: 10, OutputSize: 100},
		},
	})

	require.Len(t, merged, 2, "fallback should aggregate MCP tool calls when summary is absent")
	assert.Equal(t, "safeoutputs.create_discussion", merged[0].Name, "tool call fallback should aggregate by MCP tool name")
	assert.Equal(t, 2, merged[0].CallCount, "fallback should count repeated MCP tool calls")
	assert.Equal(t, 40, merged[0].MaxInputSize, "fallback should keep max input size from calls")
	assert.Equal(t, 250, merged[0].MaxOutputSize, "fallback should keep max output size from calls")
```
These assertions expect merged MCP tool names to use a dot separator (e.g. safeoutputs.create_discussion), but elsewhere the codebase represents MCP tools in underscore form after prettification (e.g. mcp__github__search_issues -> github_search_issues). If mergeMCPToolUsageInfo is updated to normalize MCP tool names to match workflow.PrettifyToolName output, these expectations should be updated accordingly to avoid locking in an inconsistent naming scheme that prevents deduping with parsed tool calls.
🧪 Test Quality Sentinel Report

Test Quality Score: 100/100 — ✅ Excellent test quality
Test Classification Details (all 3 test classifications)
Test Analysis
| Check | Status |
|---|---|
| Build tags (`//go:build !integration`) | ✅ Both test files have correct tags |
| No mock libraries (gomock, testify/mock, `.EXPECT()`) | ✅ No mocks used |
| Assertion messages present | ✅ All assertions include descriptive messages |
| No test inflation (test/prod line ratio ≤ 2:1) | ✅ 69 test lines added vs 84 production lines (ratio 0.82) |
Score Breakdown
| Component | Score |
|---|---|
| Behavioral coverage (3/3 design tests) | 40/40 |
| Error/edge case coverage (3/3 tests) | 30/30 |
| Low duplication (0 duplicate clusters) | 20/20 |
| Proportional growth (ratio 0.82, no inflation) | 10/10 |
| Total | 100/100 |
Language Support
Tests analyzed:
- 🐹 Go (`*_test.go`): 3 tests — unit (`//go:build !integration`)
- 🟨 JavaScript (`*.test.cjs`, `*.test.js`): 0 tests
Verdict
✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). All 3 tests enforce behavioral contracts directly tied to the bug being fixed: they verify observable outputs (`ToolUsage`, `ToolBreadth`, `Evidence`) under the exact real-world scenario that caused the undercount.
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls (mocking internals)
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
References: §24526833184
🧪 Test quality analysis by Test Quality Sentinel
Warning

🏗️ Design Decision Gate — ADR Required

This PR makes significant changes to core business logic. AI has analyzed the PR diff and generated a draft ADR to help you get started.

📄 Draft ADR

What to do next
Once an ADR is linked in the PR body, this gate will re-run and verify the implementation matches the decision.

Why ADRs Matter
ADRs create a searchable, permanent record of why the codebase looks the way it does. Future contributors (and your future self) will thank you.

📋 Michael Nygard ADR Format Reference

An ADR must contain these four sections to be considered complete: Context, Decision, Status, and Consequences.
All ADRs are stored in
References: §24526833175
Copilot workflow audits could report `tool_types: 0` (and `tool_breadth: narrow`) even when MCP tools were clearly used, causing misleading behavior fingerprints and observability summaries. This change ensures MCP usage is included in tool-type calculations when standard parsed tool calls are missing.

Tool usage aggregation now includes MCP sources

`buildAuditData` now merges MCP-derived tool usage into `tool_usage` before generating behavior fingerprint and observability insights. Added `mergeMCPToolUsageInfo(...)` to combine:
- `metrics.ToolCalls`
- `mcp_tool_usage.summary` (primary source)
- `mcp_tool_usage.tool_calls` (fallback when summary is absent)

Consistent downstream analytics

`tool_types` / `tool_breadth` now reflect MCP activity for Copilot runs instead of 0.

Regression coverage for Copilot-style MCP-only runs