Skip to content

feat(lsp): add shebang detection for extensionless files#3460

Open
unprolix wants to merge 2 commits intocode-yeongyu:devfrom
unprolix:feat/lsp-shebang-detection
Open

feat(lsp): add shebang detection for extensionless files#3460
unprolix wants to merge 2 commits intocode-yeongyu:devfrom
unprolix:feat/lsp-shebang-detection

Conversation

@unprolix
Copy link
Copy Markdown

@unprolix unprolix commented Apr 15, 2026

Summary

 - Adds shebang detection for executable scripts without file extensions
 - Enables proper LSP language identification for Makefile, Dockerfile, and scripts

 ## Changes

 - New \`shebang-detection.ts\` module with parser and language mapping
 - Updated \`getLanguageId()\` to fallback to shebang detection when no extension
 - Added LRU cache to avoid repeated file reads
 - Comprehensive test coverage for 29 interpreters

 ## Testing

 \`\`\`bash
 bun run typecheck

Summary by cubic

Adds shebang detection to identify the language of extensionless executable scripts in the LSP. Improves syntax highlighting and tooling for Node, Python, Bash, Deno, and more.

  • New Features

    • Added shebang parser and interpreter-to-language map (e.g., node→javascript, python3→python, bash→shellscript, deno→typescript).
    • getLanguageId() now falls back to shebang when no extension; extension still takes priority.
    • More robust detection: LRU cache with mtime-based invalidation and recency refresh; parser handles interpreter args (e.g., #!/bin/bash -e); CRLF supported; tests added.
  • Migration

    • Update calls to getLanguageId(filePath, ext) to pass the file path and extension.

Written for commit c95ddc9. Summary will update on new commits.

Enables LSP language detection for executable scripts without file extensions
by reading and parsing shebang lines.

Changes:
- Add shebang-detection.ts with parser and language mapping
- Update getLanguageId() to fallback to shebang detection for extensionless files
- Add LRU cache to avoid repeated file reads
- Add comprehensive tests for shebang parsing and detection

Supports common formats:
- #!/usr/bin/node -> javascript
- #!/usr/bin/env python3 -> python
- #!/bin/bash -> shellscript
- #!/usr/bin/env -S node --flags -> javascript

Fixes: LSP tools now correctly handle Makefile, Dockerfile, and executable
scripts that rely on shebang lines for language identification.
@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution! Before we can merge this PR, we need you to sign our Contributor License Agreement (CLA).

To sign the CLA, please comment on this PR with:

I have read the CLA Document and I hereby sign the CLA

This is a one-time requirement. Once signed, all your future contributions will be automatically accepted.


I have read the CLA Document and I hereby sign the CLA


Jeremy Bornstein seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 5 files

Confidence score: 3/5

  • There is a concrete user-facing regression risk in src/tools/lsp/shebang-detection.ts: shebangs with interpreter arguments (for example #!/bin/bash -e) may fail detection, so valid scripts can open with the wrong or missing language mode.
  • src/tools/lsp/lsp-client.ts keeps a path-only shebang fallback cache without invalidation, so extensionless files that change shebangs may continue using a stale language ID until cache state is reset.
  • Given a medium-severity, high-confidence detection bug plus related cache behavior concerns, this carries some merge risk but appears tractable with targeted fixes.
  • Pay close attention to src/tools/lsp/shebang-detection.ts and src/tools/lsp/lsp-client.ts - shebang parsing/caching behavior can produce stale or incorrect language detection.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/tools/lsp/shebang-detection.ts">

<violation number="1" location="src/tools/lsp/shebang-detection.ts:57">
P2: Shebang parsing fails when the interpreter path includes arguments (for example `#!/bin/bash -e`), causing valid scripts to miss language detection.</violation>

<violation number="2" location="src/tools/lsp/shebang-detection.ts:67">
P2: Cache hits do not refresh entry recency, so frequently used paths can still be evicted as stale despite repeated access.</violation>
</file>

<file name="src/tools/lsp/lsp-client.ts">

<violation number="1" location="src/tools/lsp/lsp-client.ts:22">
P2: This fallback uses a path-only shebang cache with no invalidation, so an extensionless file that changes its shebang can keep opening with the stale language ID.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread src/tools/lsp/shebang-detection.ts Outdated
Comment thread src/tools/lsp/shebang-detection.ts
if (!this.openedFiles.has(absPath)) {
const ext = extname(absPath)
const languageId = getLanguageId(ext)
const languageId = getLanguageId(absPath, ext)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: This fallback uses a path-only shebang cache with no invalidation, so an extensionless file that changes its shebang can keep opening with the stale language ID.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/lsp/lsp-client.ts, line 22:

<comment>This fallback uses a path-only shebang cache with no invalidation, so an extensionless file that changes its shebang can keep opening with the stale language ID.</comment>

<file context>
@@ -19,7 +19,7 @@ export class LSPClient extends LSPClientConnection {
     if (!this.openedFiles.has(absPath)) {
       const ext = extname(absPath)
-      const languageId = getLanguageId(ext)
+      const languageId = getLanguageId(absPath, ext)
       const version = 1
 
</file context>
Fix with Cubic

- Handle interpreter arguments (e.g. #!/bin/bash -e) by splitting on whitespace
- Refresh LRU cache recency on hit (delete + re-set pattern)
- Add mtime-based cache invalidation to detect file changes
- Add tests for interpreter args and cache invalidation
@unprolix
Copy link
Copy Markdown
Author

@cubic-dev-ai All 3 issues addressed in c95ddc9:

 1. Shebang parsing with arguments
 Now handles #!/bin/bash -e by splitting the interpreter string on whitespace.

 2. LRU cache recency refresh
 Cache hits now delete + re-set the entry to update Map ordering (true LRU behavior).

 3. Cache invalidation on file changes
 Cache stores {language, mtime}. If file mtime differs from cached value, cache is bypassed and refreshed.

 Added tests for all 3 fixes. Ready for re-review.

@cubic-dev-ai
Copy link
Copy Markdown

cubic-dev-ai Bot commented Apr 16, 2026

@cubic-dev-ai All 3 issues addressed in c95ddc9:

 1. Shebang parsing with arguments
 Now handles #!/bin/bash -e by splitting the interpreter string on whitespace.

...

@unprolix I have started the AI code review. It will take a few minutes to complete.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 5 files

Confidence score: 2/5

  • High-confidence issues in src/tools/lsp/shebang-detection.ts indicate real regression risk: cached shebang results can become stale after file edits, causing incorrect language detection for users.
  • parseShebang() in src/tools/lsp/shebang-detection.ts keeps interpreter arguments (for example bash -e), which can fail language mapping for valid shebangs and reduce LSP accuracy.
  • Negative-result caching in src/tools/lsp/shebang-detection.ts can persist misclassification when file readability/state changes, so behavior may remain wrong until cache is manually cleared.
  • Pay close attention to src/tools/lsp/shebang-detection.ts, src/tools/lsp/shebang-detection.test.ts - cache invalidation and test isolation need tightening to avoid persistent misdetection and flaky verification.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/tools/lsp/shebang-detection.ts">

<violation number="1" location="src/tools/lsp/shebang-detection.ts:57">
P1: `parseShebang()` does not strip interpreter arguments, so shebangs like `#!/bin/bash -e` resolve to `"bash -e"` and miss language mapping.</violation>

<violation number="2" location="src/tools/lsp/shebang-detection.ts:66">
P1: The shebang cache can return stale language after file edits because cached values are reused without validating file modification time.</violation>

<violation number="3" location="src/tools/lsp/shebang-detection.ts:67">
P2: Negative shebang results are cached forever, so extensionless files can stay misclassified after a file appears or becomes readable again.</violation>
</file>

<file name="src/tools/lsp/shebang-detection.test.ts">

<violation number="1" location="src/tools/lsp/shebang-detection.test.ts:102">
P2: `clearShebangCache` test is order-dependent because it assumes the shared shebang cache starts empty, but earlier tests can leave entries behind.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread src/tools/lsp/shebang-detection.ts
Comment thread src/tools/lsp/shebang-detection.ts Outdated
Comment thread src/tools/lsp/shebang-detection.ts Outdated
expect(getShebangCacheSize()).toBe(0)

detectShebangLanguage(file)
expect(getShebangCacheSize()).toBe(1)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: clearShebangCache test is order-dependent because it assumes the shared shebang cache starts empty, but earlier tests can leave entries behind.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/lsp/shebang-detection.test.ts, line 102:

<comment>`clearShebangCache` test is order-dependent because it assumes the shared shebang cache starts empty, but earlier tests can leave entries behind.</comment>

<file context>
@@ -0,0 +1,142 @@
+    expect(getShebangCacheSize()).toBe(0)
+
+    detectShebangLanguage(file)
+    expect(getShebangCacheSize()).toBe(1)
+
+    detectShebangLanguage(file)
</file context>
Fix with Cubic

@unprolix
Copy link
Copy Markdown
Author

@cubic-dev-ai The review appears to be analyzing commit 1124574 (the original). All fixes are in c95ddc9:

 - Interpreter args: interpreterWithArgs.split(/\s+/)[0] strips flags like -e
 - Stale cache: Cache entries store {language, mtime} and validate before use
 - Negative caching: Handled by mtime validation (file appearing/changing updates mtime)
 - Test isolation: beforeEach(clearShebangCache) ensures clean state per test

 Could you re-run the review on the latest commit? The file context in the current review shows the old code.

@cubic-dev-ai
Copy link
Copy Markdown

cubic-dev-ai Bot commented Apr 16, 2026

@cubic-dev-ai The review appears to be analyzing commit 1124574 (the original). All fixes are in c95ddc9:

 - Interpreter args: interpreterWithArgs.split(/\s+/)[0] strips flags like -e
 - Stale cache: Cache entries store {language, mtime} and validate before use
 - Negative caching: Handled by mtime validation (file appearing/changing updates mtime)

...

@unprolix I have started the AI code review. It will take a few minutes to complete.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Confidence score: 3/5

  • There is a concrete behavior risk in src/tools/lsp/shebang-detection.ts: env shebangs with variable assignments (for example, env VAR=value command) can be misparsed, so valid interpreters may not be detected.
  • I’m scoring this as moderate merge risk because the issue is medium severity (6/10) with high confidence (9/10) and can directly affect user-facing shebang recognition.
  • Pay close attention to src/tools/lsp/shebang-detection.ts - the env token parsing should skip variable assignments before resolving the interpreter command.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/tools/lsp/shebang-detection.ts">

<violation number="1" location="src/tools/lsp/shebang-detection.ts:51">
P2: The `env` parser treats variable assignments as the interpreter token, so valid `env VAR=value command` shebangs are not recognized.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

let command = trimmed.slice(2).trim()

if (command.includes("env ")) {
const envMatch = command.match(/env\s+(?:-\S+\s+)*(\S+)/)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The env parser treats variable assignments as the interpreter token, so valid env VAR=value command shebangs are not recognized.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/lsp/shebang-detection.ts, line 51:

<comment>The `env` parser treats variable assignments as the interpreter token, so valid `env VAR=value command` shebangs are not recognized.</comment>

<file context>
@@ -0,0 +1,115 @@
+  let command = trimmed.slice(2).trim()
+
+  if (command.includes("env ")) {
+    const envMatch = command.match(/env\s+(?:-\S+\s+)*(\S+)/)
+    if (envMatch) {
+      command = envMatch[1]
</file context>
Fix with Cubic

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 2 files (changes from recent commits).

Requires human review: Auto-approval blocked by 8 unresolved issues from previous reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant