Commit b0319c4

Merge tag 'nfsd-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:

 - Mike Snitzer's mechanism for disabling I/O caching introduced in
   v6.18 is extended to include using direct I/O. The goal is to
   further reduce the memory footprint consumed by NFS clients
   accessing large data sets via NFSD.

 - The NFSD community adopted a maintainer entry profile during this
   cycle. See
   Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst

 - Work continues on hardening NFSD's implementation of the pNFS block
   layout type. This type enables pNFS clients to directly access the
   underlying block devices that contain an exported file system,
   reducing server overhead and increasing data throughput.

 - The remaining patches are clean-ups and minor optimizations.

Many thanks to the contributors, reviewers, testers, and bug reporters
who participated during the v6.19 NFSD development cycle.

* tag 'nfsd-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (38 commits)
  NFSD: nfsd-io-modes: Separate lists
  NFSD: nfsd-io-modes: Wrap shell snippets in literal code blocks
  NFSD: Add toctree entry for NFSD IO modes docs
  NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
  NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
  NFSD: Make FILE_SYNC WRITEs comply with spec
  NFSD: Add trace point for SCSI fencing operation.
  NFSD: use correct reservation type in nfsd4_scsi_fence_client
  xdrgen: Don't generate unnecessary semicolon
  xdrgen: Fix union declarations
  NFSD: don't start nfsd if sv_permsocks is empty
  xdrgen: handle _XdrString in union encoder/decoder
  xdrgen: Fix the variable-length opaque field decoder template
  xdrgen: Make the xdrgen script location-independent
  xdrgen: Generalize/harden pathname construction
  lockd: don't allow locking on reexported NFSv2/3
  MAINTAINERS: add a nfsd blocklayout reviewer
  nfsd: Use MD5 library instead of crypto_shash
  nfsd: stop pretending that we cache the SEQUENCE reply.
  NFS: nfsd-maintainer-entry-profile: Inline function name prefixes
  ...
2 parents 1a68aef + df8c841 commit b0319c4

50 files changed

Lines changed: 1431 additions & 373 deletions

Documentation/filesystems/nfs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -13,5 +13,6 @@ NFS
     rpc-cache
     rpc-server-gss
     nfs41-server
+    nfsd-io-modes
     knfsd-stats
     reexport
Documentation/filesystems/nfs/nfsd-io-modes.rst

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+NFSD IO MODES
+=============
+
+Overview
+========
+
+NFSD has historically used buffered IO when servicing READ and
+WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
+to override that default to use either DONTCACHE or DIRECT IO modes.
+
+Experimental NFSD debugfs interfaces are available to allow the NFSD IO
+mode used for READ and WRITE to be configured independently. See both:
+
+- /sys/kernel/debug/nfsd/io_cache_read
+- /sys/kernel/debug/nfsd/io_cache_write
+
+The default value for both io_cache_read and io_cache_write reflects
+NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
+
+Based on the configured settings, NFSD's IO will be one of:
+
+- cached using the page cache (NFSD_IO_BUFFERED=0)
+- cached but removed from the page cache on completion (NFSD_IO_DONTCACHE=1)
+- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
+
+To set an NFSD IO mode, write a supported value (0 - 2) to the
+corresponding IO operation's debugfs interface, e.g.::
+
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
+  echo 2 > /sys/kernel/debug/nfsd/io_cache_write
+
+To check which IO mode NFSD is using for READ or WRITE, read the
+corresponding IO operation's debugfs interface, e.g.::
+
+  cat /sys/kernel/debug/nfsd/io_cache_read
+  cat /sys/kernel/debug/nfsd/io_cache_write
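When experimenting, it can help to see the configured modes by name rather than by number. The following is a minimal sketch and not part of NFSD: the helper names `nfsd_io_mode_name` and `nfsd_show_io_modes` are hypothetical, and it assumes debugfs is mounted at /sys/kernel/debug and that the caller may read the two interfaces described above.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of NFSD.

# Map a numeric NFSD IO mode to the name used in this document.
nfsd_io_mode_name() {
    case "$1" in
        0) echo BUFFERED ;;
        1) echo DONTCACHE ;;
        2) echo DIRECT ;;
        *) echo "unknown($1)" ;;
    esac
}

# Print the current READ and WRITE IO modes by name.
# $1: debugfs nfsd directory (assumed default: /sys/kernel/debug/nfsd)
nfsd_show_io_modes() {
    dir="${1:-/sys/kernel/debug/nfsd}"
    for op in read write; do
        val=$(cat "$dir/io_cache_$op") || return 1
        echo "$op: $(nfsd_io_mode_name "$val")"
    done
}
```

Sourcing this file and calling `nfsd_show_io_modes` prints one line per operation. Passing a directory argument makes it possible to try the helper against a scratch directory first.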
+
+If you experiment with NFSD's IO modes on a recent kernel and have
+interesting results, please report them to linux-nfs@vger.kernel.org.
+
+NFSD DONTCACHE
+==============
+
+DONTCACHE offers a hybrid approach to servicing IO that aims to offer
+the benefits of using DIRECT IO without any of the strict alignment
+requirements that DIRECT IO imposes. To achieve this, buffered IO is used
+but the IO is flagged to "drop behind" (meaning associated pages are
+dropped from the page cache) when IO completes.
+
+DONTCACHE aims to avoid what has proven to be a fairly significant
+limitation of Linux's memory management subsystem if/when large amounts of
+data are infrequently accessed (e.g. read once _or_ written once but not
+read until much later). Such use-cases are particularly problematic
+because the page cache will eventually become a bottleneck to servicing
+new IO requests.
+
+For more context on DONTCACHE, please see these Linux commit headers:
+
+- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
+  to take a struct kiocb")
+- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
+  RWF_DONTCACHE")
+- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
+
+NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
+filesystem doesn't indicate support by setting FOP_DONTCACHE.
+
+NFSD DIRECT
+===========
+
+DIRECT IO doesn't make use of the page cache; as such, it is able to
+avoid Linux memory management's page reclaim scalability problems
+without resorting to the hybrid use of page cache that DONTCACHE does.
+
+Some workloads benefit from NFSD avoiding the page cache, particularly
+those with a working set that is significantly larger than available
+system memory. The pathological worst-case workload that NFSD DIRECT has
+proven to help most is an NFS client issuing large sequential IO to a file
+that is 2-3 times larger than the NFS server's available system memory.
+The reason for such improvement is that NFSD DIRECT eliminates a lot of
+work that the memory management subsystem would otherwise be required to
+perform (e.g. page allocation, dirty writeback, page reclaim). When
+using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
+time trying to find adequate free pages so that forward IO progress can
+be made.
+
+The performance win associated with using NFSD DIRECT was previously
+discussed on linux-nfs, see:
+https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
+
+But in summary:
+
+- NFSD DIRECT can significantly reduce memory requirements
+- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
+- NFSD DIRECT can offer more deterministic IO performance
+
+As always, your mileage may vary and so it is important to carefully
+consider if/when it is beneficial to make use of NFSD DIRECT. When
+assessing comparative performance of your workload, please be sure to log
+relevant performance metrics during testing (e.g. memory usage, CPU
+usage, IO performance). Using perf to collect perf data that may be used
+to generate a "flamegraph" for work Linux must perform on behalf of your
+test is a meaningful way to compare the relative health of the
+system and how switching NFSD's IO mode changes what is observed.
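One low-overhead way to log memory usage during such a test is to sample a few /proc/meminfo fields at intervals. The following is a rough sketch under stated assumptions: `log_meminfo` is a hypothetical helper name, and the choice of fields (MemFree, Cached, Dirty, Writeback) is only a suggestion of what may be relevant when comparing IO modes.

```shell
#!/bin/sh
# Sketch only: log_meminfo is a hypothetical helper name.

# Print one line containing a label and selected meminfo fields (in kB).
# $1: label for the sample (for example a timestamp)
# $2: meminfo path (assumed default: /proc/meminfo)
log_meminfo() {
    awk -v label="$1" '
        # Anchored match, so SwapCached and WritebackTmp are skipped.
        /^(MemFree|Cached|Dirty|Writeback):/ {
            sub(":", "", $1)
            out = out " " $1 "=" $2
        }
        END { print label out }
    ' "${2:-/proc/meminfo}"
}
```

A loop such as `while sleep 5; do log_meminfo "$(date +%s)"; done` could run alongside the workload once per IO mode, so that the resulting logs can be compared side by side with the perf data.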
+
+If NFSD_IO_DIRECT is specified by writing 2 to NFSD's debugfs
+interfaces, ideally the IO will be aligned relative to
+the underlying block device's logical_block_size. Also the memory buffer
+used to store the READ or WRITE payload must be aligned relative to the
+underlying block device's dma_alignment.
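Both limits can usually be inspected through the block device's sysfs queue directory. The following is a small sketch, assuming the exported file system sits on a single block device: `show_dio_limits` is a hypothetical helper name and the device name passed to it is only an example. How the dma_alignment value is encoded is described in the kernel's sysfs-block ABI documentation, which should be consulted before interpreting it.

```shell
#!/bin/sh
# Sketch only: show_dio_limits is a hypothetical helper name.

# Print the logical block size and DMA alignment of a block device.
# $1: device name such as "sda" (an example, not taken from this document)
# $2: sysfs block directory (assumed default: /sys/block)
show_dio_limits() {
    q="${2:-/sys/block}/$1/queue"
    echo "logical_block_size=$(cat "$q/logical_block_size")"
    echo "dma_alignment=$(cat "$q/dma_alignment")"
}
```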
+
+But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
+it can:
+
+Misaligned READ:
+    If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
+    DIO-aligned block (on either end of the READ). The expanded READ is
+    checked for proper offset/len (logical_block_size) and
+    dma_alignment.
+
+Misaligned WRITE:
+    If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
+    middle and end as needed. The large middle segment is DIO-aligned
+    and the start and/or end are misaligned. Buffered IO is used for the
+    misaligned segments and O_DIRECT is used for the middle DIO-aligned
+    segment. DONTCACHE buffered IO is _not_ used for the misaligned
+    segments because using normal buffered IO offers significant
+    read-modify-write (RMW) performance benefit when handling streaming
+    misaligned WRITEs.
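The offset arithmetic behind the READ expansion and the WRITE split described above can be illustrated with shell integer math. This is only a sketch of the idea, not NFSD's actual code: the function names are hypothetical, only file offsets are considered (memory buffer dma_alignment is ignored), and treating a WRITE with no aligned middle as entirely misaligned is an assumption.

```shell
#!/bin/sh
# Sketch only: illustrative arithmetic, not NFSD's implementation.
# Function names are hypothetical. All values are in bytes.

# Expand a READ at offset $1 with length $2 to block size $3.
# Prints: "<aligned_offset> <aligned_length>"
expand_read() {
    start=$(( $1 / $3 * $3 ))                   # round start down
    end=$(( ($1 + $2 + $3 - 1) / $3 * $3 ))     # round end up
    echo "$start $((end - start))"
}

# Split a WRITE at offset $1 with length $2 around block size $3.
# Prints: "<start_len> <middle_len> <end_len>"
split_write() {
    first=$(( ($1 + $3 - 1) / $3 * $3 ))        # first aligned offset
    last=$(( ($1 + $2) / $3 * $3 ))             # last aligned offset
    if [ "$first" -ge "$last" ]; then
        echo "$2 0 0"                           # assumed: no aligned middle
    else
        echo "$((first - $1)) $((last - first)) $(( $1 + $2 - last ))"
    fi
}
```

For example, `expand_read 1000 3000 512` prints `512 3584`, and `split_write 1000 10000 4096` prints `3096 4096 2808`: a 3096-byte misaligned start, a 4096-byte aligned middle, and a 2808-byte misaligned end.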
+
+Tracing:
+    The nfsd_read_direct trace event shows how NFSD expands any
+    misaligned READ to the next DIO-aligned block (on either end of the
+    original READ, as needed).
+
+    This combination of trace events is useful for READs::
+
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
+      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
+
+    The nfsd_write_direct trace event shows how NFSD splits a given
+    misaligned WRITE into a DIO-aligned middle segment.
+
+    This combination of trace events is useful for WRITEs::
+
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
+      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
+      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
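The two groups of trace events listed above can be toggled together with a small wrapper. The following is a sketch only: `nfsd_trace_events` is a hypothetical helper name, and it assumes tracefs is mounted at /sys/kernel/tracing and that the xfs events are present (that is, the export is on XFS).

```shell
#!/bin/sh
# Sketch only: nfsd_trace_events is a hypothetical helper name.

# Enable (1) or disable (0) the trace events listed above for one operation.
# $1: "read" or "write"
# $2: 1 to enable, 0 to disable
# $3: tracefs events directory (assumed default: /sys/kernel/tracing/events)
nfsd_trace_events() {
    ev="${3:-/sys/kernel/tracing/events}"
    case "$1" in
        read)
            set -- "$2" nfsd/nfsd_read_vector nfsd/nfsd_read_direct \
                nfsd/nfsd_read_io_done xfs/xfs_file_direct_read ;;
        write)
            set -- "$2" nfsd/nfsd_write_opened nfsd/nfsd_write_direct \
                nfsd/nfsd_write_io_done xfs/xfs_file_direct_write ;;
        *)
            echo "usage: nfsd_trace_events read|write 0|1" >&2
            return 1 ;;
    esac
    val=$1
    shift
    # Remaining arguments are the event paths for the chosen operation.
    for e in "$@"; do
        echo "$val" > "$ev/$e/enable" || return 1
    done
}
```

After `nfsd_trace_events read 1`, the resulting events can be followed by reading /sys/kernel/tracing/trace_pipe while the client issues IO, and `nfsd_trace_events read 0` turns them off again.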
