Is it platform specific
generic
Importance or Severity
Critical
Description of the bug
PTF released version 0.12.0 on 05/02; the previous release, 0.10.0, was about 3 years earlier:
https://pypi.org/project/ptf/#history
docker-sonic-mgmt doesn't pin the ptf version: https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-sonic-mgmt/Dockerfile.j2#L106
On sonic-mgmt builds that pick up the new ptf version, multiple tests started failing with the error "A worker was found in a dead state". The problem reproduces 100% of the time on acl/test_stress_acl.py:
> pt_assert(False, failure_message)
E Failed: Processes "['_check_processes_on_dut--<MultiAsicSonicHost ld600>']" failed with exit code "1"
E Exception:
E A worker was found in a dead state
E Traceback:
E Traceback (most recent call last):
E File "/data/tests/common/helpers/parallel.py", line 87, in run
E Process.run(self)
E File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
E self._target(*self._args, **self._kwargs)
E File "/data/tests/common/helpers/parallel.py", line 323, in wrapper
E target(*args, **kwargs)
E File "/data/tests/common/plugins/sanity_check/checks.py", line 947, in _check_processes_on_dut
E processes_status = dut.all_critical_process_status()
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/data/tests/common/devices/sonic.py", line 724, in all_critical_process_status
E group_process_results = self.critical_group_process()
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/data/tests/common/devices/sonic.py", line 643, in critical_group_process
E results = self.shell_cmds(cmds=cmds, continue_on_fail=True, module_ignore_errors=True, timeout=30)['results']
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/data/tests/common/devices/base.py", line 125, in _run
E adhoc_res: AdHocResult = self.module(*module_args, **complex_args)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/opt/venv/lib/python3.12/site-packages/pytest_ansible/module_dispatcher/v213.py", line 264, in _run
E tqm.run(play)
E File "/opt/venv/lib/python3.12/site-packages/ansible/executor/task_queue_manager.py", line 344, in run
E play_return = strategy.run(iterator, play_context)
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/opt/venv/lib/python3.12/site-packages/ansible/plugins/strategy/linear.py", line 216, in run
E results.extend(self._wait_on_pending_results(iterator))
E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E File "/opt/venv/lib/python3.12/site-packages/ansible/plugins/strategy/__init__.py", line 812, in _wait_on_pending_results
E raise AnsibleError("A worker was found in a dead state")
E ansible.errors.AnsibleError: A worker was found in a dead state
alive = []
args = ()
concurrent_tasks = 24
delta_time = datetime.timedelta(seconds=1, microseconds=666717)
end_time = datetime.datetime(2026, 5, 5, 22, 28, 18, 640590)
failed_processes = {'_check_processes_on_dut--<MultiAsicSonicHost ld600>': {'exception': (A worker was found in a dead state, 'Traceback ...rker was found in a dead state")\\nansible.errors.AnsibleError: A worker was found in a dead state\\n'), 'exit_code': 1}}
failure_message = 'Processes "[\\'_check_processes_on_dut--<MultiAsicSonicHost ld600>\\']" failed with exit code "1"\\nException:\\nA worker... AnsibleError("A worker was found in a dead state")\\nansible.errors.AnsibleError: A worker was found in a dead state\\n'
force_terminate = <function parallel_run.<locals>.force_terminate at 0x7f4c01ea9620>
gone = [<SonicProcess name='_check_processes_on_dut--<MultiAsicSonicHost ld600>' pid=15516 parent=14513 stopped exitcode=1>]
init_result = {'check_item': 'processes', 'failed': False, 'host': 'ld600'}
kwargs = {'node': <MultiAsicSonicHost ld600>, 'results': <DictProxy object, typeid 'dict' at 0x7f4c0501cd70>, 'stage': 'stage_post_test'}
node = <MultiAsicSonicHost ld600>
nodes = []
nodes_list = [<MultiAsicSonicHost ld600>]
on_terminate = <function parallel_run.<locals>.on_terminate at 0x7f4c01ea9580>
only_mux_status_fail = False
p_exception = A worker was found in a dead state
p_exitcode = 1
p_traceback = 'Traceback (most recent call last):\\n File "/data/tests/common/helpers/parallel.py", line 87, in run\\n Process.run... AnsibleError("A worker was found in a dead state")\\nansible.errors.AnsibleError: A worker was found in a dead state\\n'
process = {'exception': (A worker was found in a dead state, 'Traceback (most recent call last):\\n File "/data/tests/common/hel...orker was found in a dead state")\\nansible.errors.AnsibleError: A worker was found in a dead state\\n'), 'exit_code': 1}
process_name = '_check_processes_on_dut--<MultiAsicSonicHost ld600>'
results = <DictProxy object, typeid 'dict' at 0x7f4c0501cd70>
start_time = datetime.datetime(2026, 5, 5, 22, 28, 16, 973873)
target = <function reset_ansible_local_tmp.<locals>.wrapper at 0x7f4c01599bc0>
tasks_done = 1
tasks_running = 0
timeout = 600
total_tasks = 1
total_timeout = 600
worker = <SonicProcess name='_check_processes_on_dut--<MultiAsicSonicHost ld600>' pid=15516 parent=14513 stopped exitcode=1>
worker_exception = (A worker was found in a dead state, 'Traceback (most recent call last):\\n File "/data/tests/common/helpers/parallel....AnsibleError("A worker was found in a dead state")\\nansible.errors.AnsibleError: A worker was found in a dead state\\n')
workers = []
We confirmed that the problem goes away when ptf is downgraded to 0.10.0 using pip, and that it reproduces when ptf is upgraded to 0.12.0 on older sonic-mgmt builds.
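To tell quickly whether a given sonic-mgmt container is on the affected ptf version, something like the following could be run inside it. This is only a rough sketch: the function names and the three labels are made up for illustration, and it only knows about the two versions named in this issue (0.10.0 working, 0.12.0 failing). Any other version is reported as untested rather than guessed at.

```python
from importlib import metadata

# The only two data points from this issue; everything else is unknown.
KNOWN_GOOD = "0.10.0"  # reported as working
KNOWN_BAD = "0.12.0"   # reported as failing


def classify_ptf_version(version):
    """Classify a ptf version string against the versions named in this issue.

    Hypothetical helper for illustration, not part of sonic-mgmt or ptf.
    """
    if version == KNOWN_GOOD:
        return "known-good"
    if version == KNOWN_BAD:
        return "known-bad"
    return "untested"


def installed_ptf_version():
    """Return the installed ptf version string, or None if ptf is not installed."""
    try:
        return metadata.version("ptf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_ptf_version()
    if version is None:
        print("ptf is not installed")
    else:
        print(f"ptf {version}: {classify_ptf_version(version)}")
```

If it reports known-bad, pinning with `pip install ptf==0.10.0` matches the downgrade described above.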
Steps to Reproduce
Run acl/test_stress_acl.py on a sonic-mgmt build that has the new ptf version (0.12.0).
Actual Behavior and Expected Behavior
Actual: the run fails with "A worker was found in a dead state".
Expected: the test should run without hitting this error.
Relevant log output
Output of show version, show techsupport
Attach files (if any)
No response