
Commit b3749f1

prati0100 authored and akpm00 committed
mm: memfd_luo: allow preserving memfd
The ability to preserve a memfd allows userspace to use KHO and LUO to transfer its memory contents to the next kernel. This is useful in many ways.

For one, it can be used with IOMMUFD as the backing store for IOMMU page tables. Preserving IOMMUFD is essential for performing a hypervisor live update with passthrough devices. memfd support provides the first building block for making that possible.

For another, for applications with a large amount of memory that takes time to reconstruct, rebooting to pick up a kernel upgrade can be very expensive. memfd with LUO gives those applications reboot-persistent memory that they can use to quickly save and reconstruct that state.

While memfd is backed by either hugetlbfs or shmem, support is currently added only for shmem. To be more precise, support is added for anonymous shmem files.

The handover to the next kernel is not transparent. Not all properties of the file are preserved; only its memory contents, position, and size are. The recreated file gets the UID and GID of the task doing the restore, and that task's cgroup is charged with the memory.

Once preserved, the file cannot grow or shrink, and all its pages are pinned to avoid migration and swapping. The file can still be read from or written to.

The buffer that holds the serialized folio array is allocated with vmalloc and preserved using kho_preserve_vmalloc(). This removes the size limit on that buffer.
Link: https://lkml.kernel.org/r/20251125165850.3389713-15-pasha.tatashin@soleen.com
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: David Matlack <dmatlack@google.com>
Cc: Aleksander Lobakin <aleksander.lobakin@intel.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: anish kumar <yesanishhere@gmail.com>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Wagner <wagi@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Jeffery <djeffery@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guixin Liu <kanie@linux.alibaba.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matthew Maurer <mmaurer@google.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Myugnjoo Ham <myungjoo.ham@samsung.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Song Liu <song@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: William Tu <witu@nvidia.com>
Cc: Yoann Congal <yoann.congal@smile.fr>
Cc: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 8def186 commit b3749f1
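
The commit message says that only a memfd's memory contents, position, and size survive the handover. As a rough userspace illustration of those three properties, the sketch below creates a memfd with standard Linux calls and checks each one. The helper names `make_sample_memfd()` and `memfd_state_matches()` are hypothetical and are not part of this commit. The LUO preservation and restore steps themselves are not shown, since the userspace interface for them does not appear on this page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the commit): create a memfd of the given
 * size and write the payload at offset 0, which advances the file position.
 */
static int make_sample_memfd(const char *payload, off_t size)
{
	size_t len = strlen(payload);
	int fd = memfd_create("luo-sample", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0 || write(fd, payload, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: return 1 if the size, the current position, and the
 * contents of the memfd match what the caller expects, else 0. These are
 * the three properties the commit message lists as preserved.
 */
static int memfd_state_matches(int fd, const char *payload, off_t size)
{
	struct stat st;
	char buf[64] = { 0 };
	size_t len = strlen(payload);

	if (len >= sizeof(buf) || fstat(fd, &st) < 0)
		return 0;
	if (st.st_size != size)				/* size (i_size) */
		return 0;
	if (lseek(fd, 0, SEEK_CUR) != (off_t)len)	/* position (f_pos) */
		return 0;
	if (pread(fd, buf, len, 0) != (ssize_t)len)	/* contents */
		return 0;
	return memcmp(buf, payload, len) == 0;
}
```

A restore path could presumably run a check like `memfd_state_matches()` against the recreated file. Ownership and cgroup charging would not be expected to match, because the message states that the recreated file takes the UID, GID, and cgroup of the restoring task.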

4 files changed

Lines changed: 596 additions & 0 deletions


MAINTAINERS

Lines changed: 2 additions & 0 deletions
@@ -14469,6 +14469,7 @@ F:	tools/testing/selftests/livepatch/
 LIVE UPDATE
 M:	Pasha Tatashin <pasha.tatashin@soleen.com>
 M:	Mike Rapoport <rppt@kernel.org>
+R:	Pratyush Yadav <pratyush@kernel.org>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 F:	Documentation/core-api/liveupdate.rst
@@ -14477,6 +14478,7 @@ F:	include/linux/liveupdate.h
 F:	include/linux/liveupdate/
 F:	include/uapi/linux/liveupdate.h
 F:	kernel/liveupdate/
+F:	mm/memfd_luo.c
 
 LLC (802.2)
 L:	netdev@vger.kernel.org

include/linux/kho/abi/memfd.h

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ *
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates.
+ * Pratyush Yadav <ptyadav@amazon.de>
+ */
+
+#ifndef _LINUX_KHO_ABI_MEMFD_H
+#define _LINUX_KHO_ABI_MEMFD_H
+
+#include <linux/types.h>
+#include <linux/kexec_handover.h>
+
+/**
+ * DOC: memfd Live Update ABI
+ *
+ * This header defines the ABI for preserving the state of a memfd across a
+ * kexec reboot using the LUO.
+ *
+ * The state is serialized into a packed structure `struct memfd_luo_ser`
+ * which is handed over to the next kernel via the KHO mechanism.
+ *
+ * This interface is a contract. Any modification to the structure layout
+ * constitutes a breaking change. Such changes require incrementing the
+ * version number in the MEMFD_LUO_FH_COMPATIBLE string.
+ */
+
+/**
+ * MEMFD_LUO_FOLIO_DIRTY - The folio is dirty.
+ *
+ * This flag indicates the folio contains data from user. A non-dirty folio is
+ * one that was allocated (say using fallocate(2)) but not written to.
+ */
+#define MEMFD_LUO_FOLIO_DIRTY		BIT(0)
+
+/**
+ * MEMFD_LUO_FOLIO_UPTODATE - The folio is up-to-date.
+ *
+ * An up-to-date folio has been zeroed out. shmem zeroes out folios on first
+ * use. This flag tracks which folios need zeroing.
+ */
+#define MEMFD_LUO_FOLIO_UPTODATE	BIT(1)
+
+/**
+ * struct memfd_luo_folio_ser - Serialized state of a single folio.
+ * @pfn:       The page frame number of the folio.
+ * @flags:     Flags to describe the state of the folio.
+ * @index:     The page offset (pgoff_t) of the folio within the original file.
+ */
+struct memfd_luo_folio_ser {
+	u64 pfn:52;
+	u64 flags:12;
+	u64 index;
+} __packed;
+
+/**
+ * struct memfd_luo_ser - Main serialization structure for a memfd.
+ * @pos:       The file's current position (f_pos).
+ * @size:      The total size of the file in bytes (i_size).
+ * @nr_folios: Number of folios in the folios array.
+ * @folios:    KHO vmalloc descriptor pointing to the array of
+ *             struct memfd_luo_folio_ser.
+ */
+struct memfd_luo_ser {
+	u64 pos;
+	u64 size;
+	u64 nr_folios;
+	struct kho_vmalloc folios;
+} __packed;
+
+/* The compatibility string for memfd file handler */
+#define MEMFD_LUO_FH_COMPATIBLE	"memfd-v1"
+
+#endif /* _LINUX_KHO_ABI_MEMFD_H */
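
To make the per-folio record in this header easier to reason about, here is a minimal userspace sketch that mirrors `struct memfd_luo_folio_ser` with standard C types. It is not kernel code: the names `folio_ser`, `folio_ser_make()`, `folio_needs_zeroing()`, and `compat_matches()` are hypothetical, `uint64_t` bit-fields and `__attribute__((packed))` rely on GCC/Clang behaviour, and reading a missing UPTODATE flag as "still needs zeroing" is an interpretation of the header comment rather than something the diff shows being implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for MEMFD_LUO_FOLIO_DIRTY and MEMFD_LUO_FOLIO_UPTODATE. */
#define FOLIO_DIRTY	(1u << 0)
#define FOLIO_UPTODATE	(1u << 1)

/* Mirror of struct memfd_luo_folio_ser: 52-bit pfn, 12-bit flags, 64-bit index. */
struct folio_ser {
	uint64_t pfn:52;
	uint64_t flags:12;
	uint64_t index;
} __attribute__((packed));

/* 52 + 12 bits share one 64-bit word, so each entry should be 16 bytes. */
_Static_assert(sizeof(struct folio_ser) == 16, "unexpected entry size");

/* Hypothetical helper: build one serialized entry. */
static struct folio_ser folio_ser_make(uint64_t pfn, unsigned int flags,
				       uint64_t index)
{
	struct folio_ser e;

	memset(&e, 0, sizeof(e));
	e.pfn = pfn;		/* values wider than 52 bits are truncated */
	e.flags = flags;	/* values wider than 12 bits are truncated */
	e.index = index;
	return e;
}

/* Assumed reading of the header comment: no UPTODATE flag means not yet zeroed. */
static int folio_needs_zeroing(const struct folio_ser *e)
{
	return !(e->flags & FOLIO_UPTODATE);
}

/* Hypothetical check against the "memfd-v1" compatibility string. */
static int compat_matches(const char *compatible)
{
	return strcmp(compatible, "memfd-v1") == 0;
}
```

Under these assumptions, an array of `nr_folios` entries takes `nr_folios * 16` bytes. Because the layout is a contract between kernels, any change to the field widths would call for a new compatibility string, as the DOC comment in the header requires.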

mm/Makefile

Lines changed: 1 addition & 0 deletions
@@ -100,6 +100,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
+obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
