Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that
replaces int timeout with struct timespec. It is equivalent otherwise.
int epoll_pwait2(int fd, struct epoll_event *events,
int maxevents,
const struct timespec *timeout,
const sigset_t *sigset);
The underlying hrtimer is already programmed with nsec resolution.
pselect and ppoll also set nsec resolution timeout with timespec.
The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs
the same.
For timespec, only support this new interface on 2038 aware platforms
that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.
Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com
Change-Id: I8cb4e756aacb4bee7cbe1c2cb5320a59e07626f8
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using the helper functions do_epoll_create() and do_epoll_wait() allows us
to remove in-kernel calls to the related syscall functions.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Change-Id: I73c8f43d7db845e0b7ec5ec00fbc575fbcb8ca5e
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
commit 678379e1d4f7443b170939525d3312cfc37bf86b upstream.
Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.
Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().
The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.
That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.
It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.
So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.
Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Change-Id: I6782a2edf98970b6c2d662048061e28f7e57b9c9
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d888c83fcec75194a8a48ccd283953bdba7b2550 ]
Jason Donenfeld reports that my commit 1c24a186398f ("fs: fd tables have
to be multiples of BITS_PER_LONG") doesn't work, and the reason is an
embarrassing brown-paper-bag bug.
Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the
reason they might not be aligned is because the incoming 'max_fd'
argument might not be aligned.
But aligining the argument - while simple - will cause a "infinitely
big" maxfd (eg NR_OPEN_MAX) to just overflow to zero. Which most
definitely isn't what we want either.
The obvious fix was always just to do the alignment last, but I had
moved it earlier just to make the patch smaller and the code look
simpler. Duh. It certainly made _me_ look simple.
Fixes: 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG")
Reported-and-tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Fedor Pchelkin <aissur0002@gmail.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Christian Brauner <brauner@kernel.org>
Change-Id: I6d3a0c28896e8cf8cab30f60679aab785aeee193
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1c24a186398f59c80adb9a967486b65c1423a59d ]
This has always been the rule: fdtables have several bitmaps in them,
and as a result they have to be sized properly for bitmaps. We walk
those bitmaps in chunks of 'unsigned long' in serveral cases, but even
when we don't, we use the regular kernel bitops that are defined to work
on arrays of 'unsigned long', not on some byte array.
Now, the distinction between arrays of bytes and 'unsigned long'
normally only really ends up being noticeable on big-endian systems, but
Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps()
could be called with an argument that wasn't even a multiple of
BITS_PER_BYTE. And then it fails to do the proper copy even on
little-endian machines.
The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which
didn't actually sanitize the fdtable size sufficiently, and never made
sure it had the proper BITS_PER_LONG alignment.
That's partly because the alignment historically came not from having to
explicitly align things, but simply from previous fdtable sizes, and
from count_open_files(), which counts the file descriptors by walking
them one 'unsigned long' word at a time and thus naturally ends up doing
sizing in the proper 'chunks of unsigned long'.
But with the introduction of close_range(), we now have an external
source of "this is how many files we want to have", and so
sane_fdtable_size() needs to do a better job.
This also adds that explicit alignment to alloc_fdtable(), although
there it is mainly just for documentation at a source code level. The
arithmetic we do there to pick a reasonable fdtable size already aligns
the result sufficiently.
In fact,clang notices that the added ALIGN() in that function doesn't
actually do anything, and does not generate any extra code for it.
It turns out that gcc ends up confusing itself by combining a previous
constant-sized shift operation with the variable-sized shift operations
in roundup_pow_of_two(). And probably due to that doesn't notice that
the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't
matter. Having the explicit alignment makes sense, and would actually
matter on a 128-bit architecture if we ever go there.
This also adds big comments above both functions about how fdtable sizes
have to have that BITS_PER_LONG alignment.
Fixes: 60997c3d45d9 ("close_range: add CLOSE_RANGE_UNSHARE")
Reported-by: Fedor Pchelkin <aissur0002@gmail.com>
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/
Tested-and-acked-by: Christian Brauner <brauner@kernel.org>
Change-Id: Ib64319f20fecec1c367bf0167e1b14b47751537a
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d3b4bec3ce55e0c46cdce7d0402dbd6b4af3a3d ]
First of all, tell it how many slots do we want, not which slot
is wanted. It makes one caller (dup_fd()) more straightforward
and doesn't harm another (expand_fdtable()).
Furthermore, make it return ERR_PTR() on failure rather than
returning NULL. Simplifies the callers.
Simplify the size calculation, while we are at it - note that we
always have slots_wanted greater than BITS_PER_LONG. What the
rules boil down to is
* use the smallest power of two large enough to give us
that many slots
* on 32bit skip 64 and 128 - the minimal capacity we want
there is 256 slots (i.e. 1Kb fd array).
* on 64bit don't skip anything, the minimal capacity is
128 - and we'll never be asked for 64 or less. 128 slots means
1Kb fd array, again.
* on 128bit, if that ever happens, don't skip anything -
we'll never be asked for 128 or less, so the fd array allocation
will be at least 2Kb.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Change-Id: Id6ae21da951bdb42fdb4361285ada3eb377f3a6b
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli@kernel.org>
It never looked too pleasant and it doesn't really buy us anything
anymore now that CLOSE_RANGE_CLOEXEC exists and we need to retake the
current maximum under the lock for it anyway. This also makes the logic
easier to follow.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Change-Id: I72be5b078d7edc061318b0eda36a351e1dc01e56
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
syzbot reported a bug when putting the last reference to a tasks file
descriptor table. Debugging this showed we didn't recalculate the
current maximum fd number for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC
after we unshared the file descriptors table. So max_fd could exceed the
current fdtable maximum causing us to set excessive bits. As a concrete
example, let's say the user requested everything from fd 4 to ~0UL to be
closed and their current fdtable size is 256 with their highest open fd
being 4. With CLOSE_RANGE_UNSHARE the caller will end up with a new
fdtable which has room for 64 file descriptors since that is the lowest
fdtable size we accept. But now max_fd will still point to 255 and needs
to be adjusted. Fix this by retrieving the correct maximum fd value in
__range_cloexec().
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
Fixes: fec8a6a69103 ("close_range: unshare all fds for CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Change-Id: Ib6aeb02d2b4556ba048cd4ca8150f9d2c29abdb3
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
After introducing CLOSE_RANGE_CLOEXEC syzbot reported a crash when
CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE.
When CLOSE_RANGE_UNSHARE is specified the caller will receive a private
file descriptor table in case their file descriptor table is currently
shared.
For the case where the caller has requested all file descriptors to be
actually closed via e.g. close_range(3, ~0U, 0) the kernel knows that
the caller does not need any of the file descriptors anymore and will
optimize the close operation by only copying all files in the range from
0 to 3 and no others.
However, if the caller requested CLOSE_RANGE_CLOEXEC together with
CLOSE_RANGE_UNSHARE the caller wants to still make use of the file
descriptors so the kernel needs to copy all of them and can't optimize.
The original patch didn't account for this and thus could cause oopses
as evidenced by the syzbot report because it assumed that all fds had
been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.
syzbot reported
==================================================================
BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
Read of size 8 at addr 0000000000000077 by task syz-executor511/8522
CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
__kasan_report mm/kasan/report.c:549 [inline]
kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
filp_close+0x22/0x170 fs/open.c:1274
close_files fs/file.c:402 [inline]
put_files_struct fs/file.c:417 [inline]
put_files_struct+0x1cc/0x350 fs/file.c:414
exit_files+0x12a/0x170 fs/file.c:435
do_exit+0xb4f/0x2a00 kernel/exit.c:818
do_group_exit+0x125/0x310 kernel/exit.c:920
get_signal+0x428/0x2100 kernel/signal.c:2792
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x447039
Code: Unable to access opcode bytes at RIP 0x44700f.
RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
==================================================================
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Tested on:
commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN..
git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
compiler: gcc (GCC) 10.1.0-syz 20200507
Reported-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Fixes: 582f1fb6b721 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
Change-Id: I92120c3f81e2811d3a2a1a76fa60cf2e75a56f42
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
immediately close the files but it sets the close-on-exec bit.
It is useful for e.g. container runtimes that usually install a
seccomp profile "as late as possible" before execv'ing the container
process itself. The container runtime could either do:
1 2
- install_seccomp_profile(); - close_range(MIN_FD, MAX_INT, 0);
- close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
- execve(...); - execve(...);
Both alternative have some disadvantages.
In the first variant the seccomp_profile cannot block the close_range
syscall, as well as opendir/read/close/... for the fallback on older
kernels.
In the second variant, close_range() can be used only on the fds
that are not going to be needed by the runtime anymore, and it must be
potentially called multiple times to account for the different ranges
that must be closed.
Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
The runtime is able to use the existing open fds, the seccomp profile
can block close_range() and the syscalls used for its fallback.
Change-Id: I1c84a733698c2853a0126cd22960ada25b229c5a
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Link: https://lore.kernel.org/r/20201118104746.873084-2-gscrivan@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.
The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I0813045886501e40a45693ee1edad50bdf2b66e5
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.
I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once its merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061
The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.
First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):
/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);
The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)
Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).
The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:
static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;
dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}
to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us
2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us
close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.
From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.
/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: 9e4f2f3a6b/Modules/_posixsubprocess.c (L220)
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: 5238e95759/src/basic/fd-util.c (L217)
[5]: ddf4b77e11/src/lxc/start.c (L236)
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
Note, in a recent patch series Florian made grantpt() a nop thereby
removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: 5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I2f7abcf9210a1e79855837d1b7c580cc7f7a38e2
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Remove the WQ_MEM_RECLAIM flag from alloc_ordered_workqueue() as
network driver flushes its workqueues without it which generates
a warning after MHI clean-up is performed.
Change-Id: I835203692496c75d58587fd2bcd5a1287ab59955
Signed-off-by: Bhaumik Bhatt <bbhatt@codeaurora.org>
We can use a inner function to init the disk time
of f2fs_inode_info for cleaning redundant code.
Change-Id: Iafd6cd55a764372c17b7eb90931336a9b9727f1a
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts(). Replacing this with timespec64
based interfaces lets us remove that deprecated API.
I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.
The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.
Change-Id: I638be3e081770e6b766b972343ed7652a473383d
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The VFS structures are finally converted to always use 64-bit timestamps,
and this file system can represent a long range of on-disk timestamps
already, so now let's fit in the missing bits for udf.
Change-Id: Ic886649a0d3fa0fa6a0d1ae2771c29cffb35631d
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.
Change-Id: I847b77ca8cca3b7d8eec7eb6cad22ab509beb052
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both vfs and the on-disk inode structures can deal with fine-grained
timestamps now, so this is the last missing piece to make ubifs
y2038-safe on 32-bit architectures.
Change-Id: I8c1fa7cc6e64cd30771c51de8efd99ceaf8d94fd
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
The on-disk representation and the vfs both use 64-bit tv_sec values,
so let's change the last missing piece in the middle.
Change-Id: Iebc60247d79e006d101cd0a63a19591aaa871a16
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All of fuse uses 64-bit timestamps with the exception of the
fuse_change_attributes(), so let's convert this one as well.
Change-Id: Ibe120305d151ea8c8231ef7a73a1ae593296d2a8
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.
In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Change-Id: Ia1a10ea74dc07ce95bfc8ad6ddb5080d5ec41db7
As vfs moves to using struct timespec64 to represent times,
update the argument to timespec_truncate() to use
struct timespec64. Also change the name of the function.
The rest of the implementation logic is the same.
Move this to fs/inode.c instead of kernel/time/time.c as all the
users of this api are filesystems.
Change-Id: I19b58d7d1d5ec263b04237a33b8f981e7d407171
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <viro@zeniv.linux.org.uk>
This prepares pstore for converting the VFS layer to timespec64.
Change-Id: Id2dfb977f0e3d41a935eb03344555eeace394875
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
__getnstimeofday() is a rather odd interface, with a number of quirks:
- The caller may come from NMI context, but the implementation is not NMI safe,
one way to get there from NMI is
NMI handler:
something bad
panic()
kmsg_dump()
pstore_dump()
pstore_record_init()
__getnstimeofday()
- The calling conventions are different from any other timekeeping functions,
to deal with returning an error code during suspended timekeeping.
Address the above issues by using a completely different method to get the
time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
when timekeeping is suspended: it returns the time at which it got
suspended. As Thomas Gleixner explained, this is safe, as
ktime_get_real_fast_ns() does not call into the clocksource driver that
might be suspended.
The result can easily be transformed into a timespec structure. Since
ktime_get_real_fast_ns() was not exported to modules, add the export.
The pstore behavior for the suspended case changes slightly, as it now
stores the timestamp at which timekeeping was suspended instead of storing
a zero timestamp.
This change is not addressing y2038-safety, that's subject to a more
complex follow up patch.
Change-Id: Id259b99ce7a7cba52abad39f61420ec977e26e37
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Colin Cross <ccross@android.com>
Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de
As part of changing all the timekeeping code to use 64-bit
time_t consistently, this removes the uses of timeval
and timespec as much as possible from do_adjtimex() and
timekeeping_inject_offset(). The timeval_inject_offset_valid()
and timespec_inject_offset_valid() just complicate this,
so I'm folding them into the respective callers.
This leaves the actual 'struct timex' definition, which
is part of the user-space ABI and should be dealt with
separately when we have agreed on the ABI change.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Change-Id: I7e05c6e81c919d6174a759af6d8a734265df6d01
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The configurable printk timestamping wants access to clock realtime. Right
now there is no ktime_get_real_fast_ns() accessor because reading the
monotonic base and the realtime offset cannot be done atomically. Contrary
to boot time this offset can change during runtime and cause half updated
readouts.
struct tk_read_base was fully packed when the fast timekeeper access was
implemented. commit ceea5e3771 ("time: Fix clock->read(clock) race around
clocksource changes") removed the 'read' function pointer from the
structure, but of course left the comment stale.
So now the structure can fit a new 64bit member w/o violating the cache
line constraints.
Add real_base to tk_read_base and update it in the fast timekeeper update
sequence.
Implement an accessor which follows the same scheme as the accessor to
clock monotonic, but uses the new real_base to access clock real time.
The runtime overhead for updating real_base is minimal as it just adds two
cache hot values and stores them into an already dirtied cache line along
with the other fast timekeeper updates.
Change-Id: I67297bb116aba7400645dd593f83dba149ae65d3
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead,org>
Link: https://lkml.kernel.org/r/1505757060-2004-3-git-send-email-prarit@redhat.com
printk timestamps will be extended to include mono and boot time by using
the fast timekeeping accessors ktime_get_mono|boot_fast_ns(). The
functions can return garbage before timekeeping is initialized resulting in
garbage timestamps.
Initialize the fast timekeepers with dummy clocks which guarantee a 0
readout up to timekeeping_init().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Change-Id: Ia92cfef7f56ea216c40bf18b32c1bf334a1aea72
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1503922914-10660-2-git-send-email-prarit@redhat.com
Usage of these apis and their compat versions makes
the syscalls: select family of syscalls and their
compat implementations simpler.
This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.
Change-Id: I5fa34f0baccc0eba709d450dd7616bb6e45613f2
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Using getnstimeofday()/timespec_to_ns() causes an overflow on 32-bit
architectures in 2038, and may suffer from time jumps due to
settimeofday() or leap seconds.
I don't see a reason why this needs to be UTC, so either monotonic
or boot time would be better here. Assuming that the fw time keeps
running during suspend, boottime is better than monotonic, and
ktime_get_boot_ns() will also save the additional conversion to
nanoseconds.
Change-Id: I1857989fcef22ff6112fed7f8897b6ee86590c0a
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
We have replacements for all the deprecated timespec based interfaces now,
so this can finally convert iio_get_time_ns() to consistently use the
nanosecond or timespec64 based interfaces instead, avoiding the y2038
overflow.
Change-Id: I358a958789f8a8389ac3fb67e7f58ea6b3088a6f
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The current_kernel_time64, get_monotonic_coarse64, getrawmonotonic64,
get_monotonic_boottime64 and timekeeping_clocktai64 interfaces have
rather inconsistent naming, and they differ in the calling conventions
by passing the output either by reference or as a return value.
Rename them to ktime_get_coarse_real_ts64, ktime_get_coarse_ts64,
ktime_get_raw_ts64, ktime_get_boottime_ts64 and ktime_get_clocktai_ts64
respectively, and provide the interfaces with macros or inline
functions as needed.
Change-Id: Ifa617b3cb1bff3821c1387efaeca899418e8bc9a
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-4-arnd@arndb.de
In a move to make ktime_get_*() the preferred driver interface into the
timekeeping code, sanitizes ktime_get_real_ts64() to be a proper exported
symbol rather than an alias for getnstimeofday64().
The internal __getnstimeofday64() is no longer used, so remove that
and merge it into ktime_get_real_ts64().
Change-Id: I79e7ad3d71807eafb7eb0827e3945ccbfcd1cfdc
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-3-arnd@arndb.de
At this point, we have converted most of the kernel to use timespec64
consistently in place of timespec, so it seems it's time to make
timespec64 the native structure and define timespec in terms of that
one on 64-bit architectures.
Starting with gcc-5, the compiler can completely optimize away the
timespec_to_timespec64 and timespec64_to_timespec functions on 64-bit
architectures. With older compilers, we introduce a couple of extra
copies of local variables, but those are easily avoided by using
the timespec64 based interfaces consistently, as we do in most of the
important code paths already.
The main upside of removing the hack is that printing the tv_sec
field of a timespec64 structure can now use the %lld format
string on all architectures without a cast to time64_t. Without
this patch, the field is a 'long' type and would have to be printed
using %ld on 64-bit architectures.
Change-Id: I5f90acafd007c8f2a2d0a0a597c5b10b7803a0d5
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-2-arnd@arndb.de
The new struct __kernel_timespec is similar to current
internal kernel struct timespec64 on 64 bit architecture.
The compat structure however is similar to below on little
endian systems (padding and tv_nsec are switched for big
endian systems):
typedef s32 compat_long_t;
typedef s64 compat_kernel_time64_t;
struct compat_kernel_timespec {
compat_kernel_time64_t tv_sec;
compat_long_t tv_nsec;
compat_long_t padding;
};
This allows for both the native and compat representations to
be the same and syscalls using this type as part of their ABI
can have a single entry point to both.
Note that the compat define is not included anywhere in the
kernel explicitly to avoid confusion.
These types will be used by the new syscalls that will be
introduced in the consequent patches.
Most of the new syscalls are just an update to the existing
native ones with this new type. Hence, put this new type under
an ifdef so that the architectures can define CONFIG_64BIT_TIME
when they are ready to handle this switch.
Cc: linux-arch@vger.kernel.org
Change-Id: I06fbf455714d0f1bae69b9ac9def451e9a5abd53
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Dealing with 'struct timeval' users in the y2038 series is a bit tricky:
We have two definitions of timeval that are visible to user space,
one comes from glibc (or some other C library), the other comes from
linux/time.h. The kernel copy is what we want to be used for a number of
structures defined by the kernel itself, e.g. elf_prstatus (used it core
dumps), sysinfo and rusage (used in system calls). These generally tend
to be used for passing time intervals rather than absolute (epoch-based)
times, so they do not suffer from the y2038 overflow. Some of them
could be changed to use 64-bit timestamps by creating new system calls,
others like the core files cannot easily be changed.
An application using these interfaces likely also uses gettimeofday()
or other interfaces that use absolute times, and pass 'struct timeval'
pointers directly into kernel interfaces, so glibc must redefine their
timeval based on a 64-bit time_t when they introduce their y2038-safe
interfaces.
The only reasonable way forward I see is to remove the 'timeval'
definion from the kernel's uapi headers, and change the interfaces that
we do not want to (or cannot) duplicate for 64-bit times to use a new
__kernel_old_timeval definition instead. This type should be avoided
for all new interfaces (those can use 64-bit nanoseconds, or the 64-bit
version of timespec instead), and should be used with great care when
converting existing interfaces from timeval, to be sure they don't suffer
from the y2038 overflow, and only with consensus for the particular user
that using __kernel_old_timeval is better than moving to a 64-bit based
interface. The structure name is intentionally chosen to not conflict
with user space types, and to be ugly enough to discourage its use.
Note that ioctl based interfaces that pass a bare 'timeval' pointer
cannot change to '__kernel_old_timeval' because the user space source
code refers to 'timeval' instead, and we don't want to modify the user
space sources if possible. However, any application that relies on a
structure to contain an embedded 'timeval' (e.g. by passing a pointer
to the member into a function call that expects a timeval pointer) is
broken when that structure gets converted to __kernel_old_timeval. I
don't see any way around that, and we have to rely on the compiler to
produce a warning or compile failure that will alert users when they
recompile their sources against a new libc.
Change-Id: I89fee6b13307424100b8a630f75414c77129ee5c
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20180315161739.576085-1-arnd@arndb.de
On 64-bit architectures, the timespec64 based helpers in linux/time.h
are defined as macros pointing to their timespec based counterparts.
This made sense when they were first introduced, but as we are migrating
away from timespec in general, it's much less intuitive now.
This changes the macros to work in the exact opposite way: we always
provide the timespec64 based helpers and define the old interfaces as
macros for them. Now we can move those macros into linux/time32.h, which
already contains the respective helpers for 32-bit architectures.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Change-Id: I0305f4df7eebc94e364c702c39dc5ca79c63f67e
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Interfaces based on 'struct timespec' or 'struct timeval' should no
longer be used for new code, which can use either ktime_t or 'struct
timespec64' instead.
To make this a little clearer, this moves the various helpers into a new
time32.h header. For the moment, this gets included by the normal time.h,
but we may be able to separate it entirely when most users of time32.h
are gone.
Individual helpers in the new file can get removed once they become unused
in the future.
Since the contents of time32.h look a lot like what's in time64.h, I'm
reordering them during the move to make them more similar, and to allow
a follow-up patch to redirect the 'timespec' based functions to thei
'timespec64' based counterparts on 64-bit architectures later.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Change-Id: I10bbc02d2639115fec39d34480ace5b00dc3c8d2
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[jstultz: Whitespace & checkpatch fixups]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The (slow but) ongoing work on conversion from timespec to timespec64
has led some timespec based helper functions to become unused.
No new code should use them, so we can remove the functions entirely.
I'm planning to obsolete additional interfaces next and remove
more of these.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Change-Id: Ib2c46b76e8186b0f0d06252ccc328884bfb359e9
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The code to check the adjtimex() or clock_adjtime() arguments is spread
out across multiple files for presumably only historic reasons. As a
preparatation for a rework to get rid of the use of 'struct timeval'
and 'struct timespec' in there, this moves all the portions into
kernel/time/timekeeping.c and marks them as 'static'.
The warp_clock() function here is not as closely related as the others,
but I feel it still makes sense to move it here in order to consolidate
all callers of timekeeping_inject_offset().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Change-Id: Ib268add953c41504b8a88067c18b4086fa37beb8
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[jstultz: Whitespace fixup]
Signed-off-by: John Stultz <john.stultz@linaro.org>