Feishu/Lark WebSocket drops lead to Zombie Gateway Process without auto-recovery #10616

@watsonctl

Description

When using the Feishu integration, if the underlying WebSocket connection hits a keepalive ping timeout, the SDK's message loop exits, but the main Hermes agent process neither terminates nor reconnects. This leaves the Gateway in a zombie state: it appears "running" to service managers such as systemd, but it no longer receives any messages.

Logs

[Lark] [ERROR] receive message loop exit, err: sent 1011 (internal error) keepalive ping timeout; no close frame received
[Lark] [WARNING] ping failed, err: sent 1011 (internal error) keepalive ping timeout

Expected Behavior (Crash-Only Architecture)

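If the Feishu WebSocket loop drops permanently and cannot reconnect on its own, the feishu.py integration thread should bring the whole process down with a non-zero exit code, for example by bubbling the exception to the parent thread so that it can raise SystemExit(1). (In Python, a SystemExit raised inside a non-main thread ends only that thread, not the process.) System-level managers (e.g. systemd with Restart=always) can then respawn a healthy agent stack.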
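
To illustrate the behavior requested above, here is a minimal sketch of the crash-only pattern using only the standard library. It is not taken from the Hermes code base: `run_feishu_loop`, `supervise`, and `main` are hypothetical names, and `run_feishu_loop` merely simulates the SDK's blocking message loop dying on a keepalive timeout. The idea is that the worker thread reports any exit of the loop (clean or not) to the main thread, which then returns a non-zero exit code.

```python
import sys
import threading


def run_feishu_loop():
    """Hypothetical stand-in for the SDK's blocking message loop."""
    raise ConnectionError("keepalive ping timeout")


def supervise(target, fatal, errors):
    """Run target in the worker thread and treat ANY exit as fatal."""
    try:
        target()
        # A message loop that returns without error is still a dead loop.
        errors.append(RuntimeError("message loop returned unexpectedly"))
    except BaseException as exc:  # includes SystemExit raised in the thread
        errors.append(exc)
    finally:
        fatal.set()  # wake the main thread


def main(target=run_feishu_loop):
    """Block until the loop dies, then return a non-zero exit code."""
    fatal = threading.Event()
    errors = []
    worker = threading.Thread(
        target=supervise, args=(target, fatal, errors), daemon=True
    )
    worker.start()
    fatal.wait()  # the main thread could also do other work and poll instead
    print(f"fatal: Feishu loop exited: {errors[0]!r}", file=sys.stderr)
    return 1

# In the service entry point (assumption): sys.exit(main())
```

Exiting from the main thread rather than from the worker matters here: the process only ends with status 1, and so only gets restarted by the service manager, if the exit happens in the main thread.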

Environment

  • OS: Ubuntu 24.04 via WSL2
  • Deploy type: systemd service
  • Provider: feishu
