
Data race in pkg/leakybucket.EventsFromQueue causes 'concurrent map iteration/read and map write' panic (v1.7.7) #4459

@sirianap

Description

What happened?

CrowdSec v1.7.7 (docker) crashes repeatedly with a Go runtime fatal error caused by a data race in pkg/leakybucket.EventsFromQueue, called from NewAlert during bucket overflow. Two distinct variants of the race were observed within ~35 minutes on the same instance:

Crash 1 — concurrent map iteration and map write:

fatal error: concurrent map iteration and map write

goroutine 420788 [running]:
internal/runtime/maps.(*Iter).Next(...)
      internal/runtime/maps/table.go:792 +0x86
github.com/crowdsecurity/crowdsec/pkg/leakybucket.EventsFromQueue(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/overflows.go:215 +0x129
github.com/crowdsecurity/crowdsec/pkg/leakybucket.NewAlert(0xc003ff91e0, 0xc0053258e0)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/overflows.go:350 +0x95c
github.com/crowdsecurity/crowdsec/pkg/leakybucket.(*Leaky).overflow(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/bucket.go:294 +0x68
github.com/crowdsecurity/crowdsec/pkg/leakybucket.(*Leaky).LeakRoutine(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/bucket.go:209 +0xcdd
github.com/crowdsecurity/crowdsec/pkg/leakybucket.LoadOrStoreBucketFromHolder.func1()
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/manager_run.go:188 +0x97
created by github.com/crowdsecurity/crowdsec/pkg/leakybucket.LoadOrStoreBucketFromHolder in goroutine 989
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/manager_run.go:184 +0x35d

Crash 2 — concurrent map read and map write (35 min later, same instance):

fatal error: concurrent map read and map write

goroutine 695697 [running]:
github.com/crowdsecurity/crowdsec/pkg/leakybucket.EventsFromQueue(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/overflows.go:222 +0x725
github.com/crowdsecurity/crowdsec/pkg/leakybucket.NewAlert(0xc0011651e0, 0xc005cc22e0)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/overflows.go:350 +0x95c
github.com/crowdsecurity/crowdsec/pkg/leakybucket.(*Leaky).overflow(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/bucket.go:294 +0x68
github.com/crowdsecurity/crowdsec/pkg/leakybucket.(*Leaky).LeakRoutine(...)
      github.com/crowdsecurity/crowdsec/pkg/leakybucket/bucket.go:209 +0xcdd

Both traces point at EventsFromQueue iterating over and reading qEvents[idx].Meta (the Event's Meta map) without synchronization. The relevant loop in master pkg/leakybucket/overflows.go:

for idx := range qEvents {
    if qEvents[idx].Meta == nil { continue }
    meta := models.Meta{}
    skeys := make([]string, 0, len(qEvents[idx].Meta))
    for k := range qEvents[idx].Meta {        // <-- iteration race (line 215 in v1.7.7)
        skeys = append(skeys, k)
    }
    sort.Strings(skeys)
    for _, k := range skeys {
        v := qEvents[idx].Meta[k]              // <-- read race (line 222 in v1.7.7)
        ...
    }
}

The Event queued in the bucket appears to still be referenced and mutated by other goroutines (e.g. the enrichment/parse paths) while the alert builder iterates over its Meta.
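For reference, this failure mode is easy to reproduce outside CrowdSec. The following standalone program is purely illustrative (not CrowdSec code): one goroutine keeps writing to a map that stands in for Event.Meta while another ranges over it, and it dies with the same class of fatal error (concurrent map iteration and map write, or the read variant, depending on timing):

package main

import "fmt"

func main() {
    meta := map[string]string{"source_ip": "198.51.100.1"}

    // Writer: stands in for an enrichment/parse path that still holds a
    // reference to the queued event and keeps updating its Meta.
    go func() {
        for i := 0; ; i++ {
            meta[fmt.Sprintf("k%d", i)] = "v"
        }
    }()

    // Reader: stands in for EventsFromQueue ranging over the queued
    // events' Meta maps while building the alert.
    for {
        for k, v := range meta {
            _, _ = k, v
        }
    }
}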

What did you expect to happen?

No fatal panic. Alert generation should be safe against concurrent access to Event.Meta, either by deep-copying the map before it is iterated or by guarding reads and writes with a mutex (a rough sketch of both options follows).
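For illustration only, a minimal sketch of what that could look like. The type and function names here are hypothetical, not CrowdSec's actual structures, and this is not a tested patch; maps.Clone requires Go 1.21+. The important detail is that the copy must be taken under the same lock as the writers (or at a point where no writer can still touch the map), since cloning inside EventsFromQueue alone would still range over the shared map and race the same way.

package sketch

import (
    "maps"
    "sync"
)

// safeMeta is a hypothetical wrapper showing the mutex option: every write
// and every read/iteration of the underlying map goes through the same lock.
type safeMeta struct {
    mu sync.RWMutex
    m  map[string]string
}

func (s *safeMeta) Set(k, v string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.m == nil {
        s.m = make(map[string]string)
    }
    s.m[k] = v
}

// Snapshot returns a private deep copy that the alert builder can sort and
// range over without holding the lock; this also covers the "deep-copy
// before iteration" option.
func (s *safeMeta) Snapshot() map[string]string {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return maps.Clone(s.m)
}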

How can we reproduce it?

  • Image: crowdsecurity/crowdsec:latest (digest sha256:6ca53ad26196ca59ddd4fa692a586b73d8fcde085046163b9ca2f04887dca563)
  • Version: v1.7.7-981e6166, Go 1.25.5, docker build 2026-04-01
  • Host: Ubuntu 24.04, Linux 6.17, x86_64, Docker Engine
  • Stack: Traefik (reverse proxy with crowdsec-bouncer-traefik-plugin) → CrowdSec LAPI + AppSec
  • Collections: crowdsecurity/traefik, crowdsecurity/appsec-virtual-patching, crowdsecurity/appsec-generic-rules
  • Acquisition: Traefik access log file + AppSec endpoint
  • Traffic profile during incident:
    • ~6 bouncer GET /v1/decisions?ip= queries per second (32,685 queries over 90 min)
    • 160,730 lines read from the Traefik log, 94,990 poured to buckets in ~90 min
    • 402 AppSec events
    • Active scanner traffic generating multiple overflow scenarios (crowdsecurity/http-admin-interface-probing, crowdsecurity/http-sensitive-files) every few seconds

Steps:

  1. Run crowdsecurity/crowdsec:v1.7.7 with the Traefik + AppSec collections.
  2. Drive sustained mixed traffic — legitimate + scanner — sufficient to overflow several scenarios per minute.
  3. The container crashes within ~30–45 minutes with one of the two stack traces above. Without a restart policy, CrowdSec stays down; with restart: always, the crash is masked but RestartCount accumulates.

Why this differs from #4216

Issue #4216 reports nil-pointer panics in the same area and was tentatively attributed to "high memory pressure". This report is different and rules that explanation out:

  • The crash is not a nil-pointer dereference: the Go runtime explicitly reports concurrent map iteration and map write / concurrent map read and map write, messages it only emits when it detects parallel access to the same map.
  • Memory headroom is large: container RSS 127 MiB / 15.42 GiB host RAM (0.8 %), no cgroup limit, no swap pressure, no OOM-kill (OOMKilled: false, ExitCode: 0).
  • The race reproduced twice on the same instance within 35 minutes, in independent goroutines.

Anything else we need to know?

Crowdsec version

version: v1.7.7-981e6166
Codename: alphaga
BuildDate: 2026-04-01_13:14:30
GoVersion: 1.25.5
Platform: docker
libre2: C++

OS version

Ubuntu Server 24.04, Linux 6.17.0-1012-aws, x86_64
Docker Engine (rootful)

Enabled collections and parsers

  • crowdsecurity/traefik
  • crowdsecurity/appsec-virtual-patching
  • crowdsecurity/appsec-generic-rules
  • crowdsecurity/whitelists
  • a few local custom postoverflows / parsers / appsec rules (whitelists only)

Acquisition config

  • Traefik JSON access log via file source
  • AppSec endpoint
