Commit Graph

52648 Commits

Author SHA1 Message Date
Gilad Ben-Yossef
f1eb0c0b51 crypto: introduce crypto wait for async op
Invoking a possibly async. crypto op and waiting for completion
while correctly handling backlog processing is a common task
in the crypto API implementation and outside users of it.

This patch adds a generic implementation for doing so in
preparation for using it across the board instead of hand
rolled versions.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
CC: Eric Biggers <ebiggers3@gmail.com>
CC: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-01-04 18:47:13 -08:00
Eric Biggers
9e48a9fd98 fscrypt: new helper function - fscrypt_prepare_setattr()
Introduce a helper function for filesystems to call when processing
->setattr() on a possibly-encrypted inode.  It handles enforcing that an
encrypted file can only be truncated if its encryption key is available.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:11 -08:00
Eric Biggers
ec822ff8b5 fscrypt: new helper function - fscrypt_prepare_lookup()
Introduce a helper function which prepares to look up the given dentry
in the given directory.  If the directory is encrypted, it handles
loading the directory's encryption key, setting the dentry's ->d_op to
fscrypt_d_ops, and setting DCACHE_ENCRYPTED_WITH_KEY if the directory's
encryption key is available.

Note: once all filesystems switch over to this, we'll be able to move
fscrypt_d_ops and fscrypt_set_encrypted_dentry() to fscrypt_private.h.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:10 -08:00
Eric Biggers
98fe83a195 fscrypt: new helper function - fscrypt_prepare_rename()
Introduce a helper function which prepares to rename a file into a
possibly encrypted directory.  It handles loading the encryption keys
for the source and target directories if needed, and it handles
enforcing that if the target directory (and the source directory for a
cross-rename) is encrypted, then the file being moved into the directory
has the same encryption policy as its containing directory.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:10 -08:00
Eric Biggers
f521870259 fscrypt: new helper function - fscrypt_prepare_link()
Introduce a helper function which prepares to link an inode into a
possibly-encrypted directory.  It handles setting up the target
directory's encryption key, then verifying that the link won't violate
the constraint that all files in an encrypted directory tree use the
same encryption policy.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:09 -08:00
Eric Biggers
d61dffbd4f fscrypt: new helper function - fscrypt_file_open()
Add a helper function which prepares to open a regular file which may be
encrypted.  It handles setting up the file's encryption key, then
checking that the file's encryption policy matches that of its parent
directory (if the parent directory is encrypted).  It may be set as the
->open() method or it can be called from another ->open() method.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:08 -08:00
Eric Biggers
5190ed0766 fscrypt: new helper function - fscrypt_require_key()
Add a helper function which checks if an inode is encrypted, and if so,
tries to set up its encryption key.  This is a pattern which is
duplicated in multiple places in each of ext4, f2fs, and ubifs --- for
example, when a regular file is asked to be opened or truncated.

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:07 -08:00
Eric Biggers
8745aa36e4 fscrypt: remove ->is_encrypted()
Now that all callers of fscrypt_operations.is_encrypted() have been
switched to IS_ENCRYPTED(), remove ->is_encrypted().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:05 -08:00
Eric Biggers
d750ec720f fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
IS_ENCRYPTED() now gives the same information as
i_sb->s_cop->is_encrypted() but is more efficient, since IS_ENCRYPTED()
is just a simple flag check.  Prepare to remove ->is_encrypted() by
switching all callers to IS_ENCRYPTED().

Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:04 -08:00
Eric Biggers
685285b0b3 fs, fscrypt: add an S_ENCRYPTED inode flag
Introduce a flag S_ENCRYPTED which can be set in ->i_flags to indicate
that the inode is encrypted using the fscrypt (fs/crypto/) mechanism.

Checking this flag will give the same information that
inode->i_sb->s_cop->is_encrypted(inode) currently does, but will be more
efficient.  This will be useful for adding higher-level helper functions
for filesystems to use.  For example we'll be able to replace this:

	if (ext4_encrypted_inode(inode)) {
		ret = fscrypt_get_encryption_info(inode);
		if (ret)
			return ret;
		if (!fscrypt_has_encryption_key(inode))
			return -ENOKEY;
	}

with this:

	ret = fscrypt_require_key(inode);
	if (ret)
		return ret;

... since we'll be able to retain the fast path for unencrypted files as
a single flag check, using an inline function.  This wasn't possible
before because we'd have had to frequently call through the
->i_sb->s_cop->is_encrypted function pointer, even when the encryption
support was disabled or not being used.

Note: we don't define S_ENCRYPTED to 0 if CONFIG_FS_ENCRYPTION is
disabled because we want to continue to return an error if an encrypted
file is accessed without encryption support, rather than pretending that
it is unencrypted.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 18:47:02 -08:00
Dave Chinner
1617929c3b fscrypt: clean up include file mess
Filesystems have to include different header files based on whether they
are compiled with encryption support or not. That's nasty and messy.

Instead, rationalise the headers so we have a single include fscrypt.h
and let it decide what internal implementation to include based on the
__FS_HAS_ENCRYPTION define.  Filesystems set __FS_HAS_ENCRYPTION to 1
before including linux/fscrypt.h if they are built with encryption
support.  Otherwise, they must set __FS_HAS_ENCRYPTION to 0.

Add guards to prevent fscrypt_supp.h and fscrypt_notsupp.h from being
directly included by filesystems.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[EB: use 1 and 0 rather than defined/undefined]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 16:26:54 -08:00
Eric Biggers
e77e7df060 fscrypt: make ->dummy_context() return bool
This makes it consistent with ->is_encrypted(), ->empty_dir(), and
fscrypt_dummy_context_enabled().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-01-04 16:26:45 -08:00
Jaegeuk Kim
0108c481d7 f2fs: support quota sys files
This patch supports hidden quota files in the system, which will be used for
Android. It requires up-to-date f2fs-tools later than v1.9.0.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-15 22:11:12 -08:00
Jaegeuk Kim
d4c292db7b f2fs: add quota_ino feature infra
This patch adds quota_ino feature infra to be used for quota files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-15 22:10:09 -08:00
Chao Yu
50ffaa980f f2fs: support flexible inline xattr size
Now, in product, more and more features based on file encryption were
introduced, their demand of xattr space is increasing, however, inline
xattr has fixed-size of 200 bytes, once inline xattr space is full, new
increased xattr data would occupy additional xattr block which may bring
us more space usage and performance regression during persisting.

In order to resolve above issue, it's better to expand inline xattr size
flexibly according to user's requirement.

So this patch introduces new filesystem feature 'flexible inline xattr',
and new mount option 'inline_xattr_size=%u', once mkfs enables the
feature, we can use the option to make f2fs supporting flexible inline
xattr size.

To support this feature, we add extra attribute i_inline_xattr_size in
inode layout, indicating that how many space inline xattr borrows from
block address mapping space in inode layout, by this, we can easily
locate and store flexible-sized inline xattr data in inode.

Inode disk layout:
  +----------------------+
  | .i_mode              |
  | ...                  |
  | .i_ext               |
  +----------------------+
  | .i_extra_isize       |
  | .i_inline_xattr_size |-----------+
  | ...                  |           |
  +----------------------+           |
  | .i_addr              |           |
  |  - block address or  |           |
  |  - inline data       |           |
  +----------------------+<---+      v
  |    inline xattr      |    +---inline xattr range
  +----------------------+<---+
  | .i_nid               |
  +----------------------+
  |   node_footer        |
  | (nid, ino, offset)   |
  +----------------------+

Note that, we have to cnosider backward compatibility which reserved
inline_data space, 200 bytes, all the time, reported by Sheng Yong.

Previous inline data or directory always reserved 200 bytes in inode layout,
even if inline_xattr is disabled. In order to keep inline_dentry's structure
for backward compatibility, we get the space back only from inline_data.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reported-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-15 22:09:59 -08:00
Chao Yu
4310125aa0 f2fs: support inode checksum
This patch adds to support inode checksum in f2fs.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix verification flow]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-21 13:55:21 -07:00
Chao Yu
5647b30281 f2fs: support project quota
This patch adds to support plain project quota.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-21 12:56:42 -07:00
Chao Yu
fbcf931d63 f2fs: enhance on-disk inode structure scalability
This patch add new flag F2FS_EXTRA_ATTR storing in inode.i_inline
to indicate that on-disk structure of current inode is extended.

In order to extend, we changed the inode structure a bit:

Original one:

struct f2fs_inode {
	...
	struct f2fs_extent i_ext;
	__le32 i_addr[DEF_ADDRS_PER_INODE];
	__le32 i_nid[DEF_NIDS_PER_INODE];
}

Extended one:

struct f2fs_inode {
        ...
        struct f2fs_extent i_ext;
	union {
		struct {
			__le16 i_extra_isize;
			__le16 i_padding;
			__le32 i_extra_end[0];
		};
		__le32 i_addr[DEF_ADDRS_PER_INODE];
	};
        __le32 i_nid[DEF_NIDS_PER_INODE];
}

Once F2FS_EXTRA_ATTR is set, we will steal four bytes in the head of
i_addr field for storing i_extra_isize and i_padding. with i_extra_isize,
we can calculate actual size of reserved space in i_addr, available
attribute fields included in total extra attribute fields for current
inode can be described as below:

  +--------------------+
  | .i_mode            |
  | ...                |
  | .i_ext             |
  +--------------------+
  | .i_extra_isize     |-----+
  | .i_padding         |     |
  | .i_prjid           |     |
  | .i_atime_extra     |     |
  | .i_ctime_extra     |     |
  | .i_mtime_extra     |<----+
  | .i_inode_cs        |<----- store blkaddr/inline from here
  | .i_xattr_cs        |
  | ...                |
  +--------------------+
  |                    |
  |    block address   |
  |                    |
  +--------------------+
  | .i_nid             |
  +--------------------+
  |   node_footer      |
  | (nid, ino, offset) |
  +--------------------+

Hence, with this patch, we would enhance scalability of f2fs inode for
storing more newly added attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-21 12:55:41 -07:00
Chao Yu
e84f88e6c8 f2fs: make max inline size changeable
This patch tries to make below macros calculating max inline size,
inline dentry field size considerring reserving size-changeable
space:
- MAX_INLINE_DATA
- NR_INLINE_DENTRY
- INLINE_DENTRY_BITMAP_SIZE
- INLINE_RESERVED_SIZE

Then, when inline_{data,dentry} options is enabled, it allows us to
reserve inline space with different size flexibly for adding newly
introduced inode attribute.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-09-21 12:44:11 -07:00
Daniel Walter
8e989de6f8 fscrypt: add support for AES-128-CBC
fscrypt provides facilities to use different encryption algorithms which
are selectable by userspace when setting the encryption policy. Currently,
only AES-256-XTS for file contents and AES-256-CBC-CTS for file names are
implemented. This is a clear case of kernel offers the mechanism and
userspace selects a policy. Similar to what dm-crypt and ecryptfs have.

This patch adds support for using AES-128-CBC for file contents and
AES-128-CBC-CTS for file name encryption. To mitigate watermarking
attacks, IVs are generated using the ESSIV algorithm. While AES-CBC is
actually slightly less secure than AES-XTS from a security point of view,
there is more widespread hardware support. Using AES-CBC gives us the
acceptable performance while still providing a moderate level of security
for persistent storage.

Especially low-powered embedded devices with crypto accelerators such as
CAAM or CESA often only support AES-CBC. Since using AES-CBC over AES-XTS
is basically thought of a last resort, we use AES-128-CBC over AES-256-CBC
since it has less encryption rounds and yields noticeable better
performance starting from a file size of just a few kB.

Signed-off-by: Daniel Walter <dwalter@sigma-star.at>
[david@sigma-star.at: addressed review comments]
Signed-off-by: David Gstir <david@sigma-star.at>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-13 17:48:36 -07:00
Eric Biggers
05b79cd18f fscrypt: inline fscrypt_free_filename()
fscrypt_free_filename() only needs to do a kfree() of crypto_buf.name,
which works well as an inline function.  We can skip setting the various
pointers to NULL, since no user cares about it (the name is always freed
just before it goes out of scope).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-13 17:48:35 -07:00
Jaegeuk Kim
e6b120d4d0 f2fs/fscrypt: catch up to v4.12
Applied f2fs-related patches only, and tried to exclude other layers' changes.

- fs/f2fs

c6b1e36c8fa0 Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block
81e3e044897b Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid
fdd050b5b3c9 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base
4e4cbee93d56 block: switch bios to blk_status_t
d41519a69b35 crypto: Work around deallocated stack frame reference gcc bug on sparc.
85787090a21e fs: switch ->s_uuid to uuid_t
bf5f89463f5b Merge branch 'akpm' (patches from Andrew)
48fbfe50f1d5 fs: f2fs: use ktime_get_real_seconds for sit_info times
a7c3e901a46f mm: introduce kv[mz]alloc helpers
70ef8f0d3757 Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
...

- fs/crypto/

4e4cbee93d56 block: switch bios to blk_status_t
17159420a6c1 fscrypt: introduce helper function for filename matching
6b06cdee81d6 fscrypt: avoid collisions when presenting long encrypted filenames
272f98f68462 fscrypt: fix context consistency check when key(s) unavailable
9c8268def612 fscrypt: Move key structure and constants to uapi
cd39e4bac111 fscrypt: remove unnecessary checks for NULL operations
a643f9054c21 Merge tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
...

- fs/ext4/
: Applied fscrypt-related changes only.

Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2017-07-10 19:45:14 -07:00
David Howells
4d7931d727 KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()
rcu_dereference_key() and user_key_payload() are currently being used in
two different, incompatible ways:

 (1) As a wrapper to rcu_dereference() - when only the RCU read lock used
     to protect the key.

 (2) As a wrapper to rcu_dereference_protected() - when the key semaphor is
     used to protect the key and the may be being modified.

Fix this by splitting both of the key wrappers to produce:

 (1) RCU accessors for keys when caller has the key semaphore locked:

	dereference_key_locked()
	user_key_payload_locked()

 (2) RCU accessors for keys when caller holds the RCU read lock:

	dereference_key_rcu()
	user_key_payload_rcu()

This should fix following warning in the NFS idmapper

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.10.0 #1 Tainted: G        W
  -------------------------------
  ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 0
  1 lock held by mount.nfs/5987:
    #0:  (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4]
  stack backtrace:
  CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G        W       4.10.0 #1
  Call Trace:
    dump_stack+0xe8/0x154 (unreliable)
    lockdep_rcu_suspicious+0x140/0x190
    nfs_idmap_get_key+0x380/0x420 [nfsv4]
    nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4]
    decode_getfattr_attrs+0xfac/0x16b0 [nfsv4]
    decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4]
    nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4]
    rpcauth_unwrap_resp+0xe8/0x140 [sunrpc]
    call_decode+0x29c/0x910 [sunrpc]
    __rpc_execute+0x140/0x8f0 [sunrpc]
    rpc_run_task+0x170/0x200 [sunrpc]
    nfs4_call_sync_sequence+0x68/0xa0 [nfsv4]
    _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4]
    nfs4_lookup_root+0xe0/0x350 [nfsv4]
    nfs4_lookup_root_sec+0x70/0xa0 [nfsv4]
    nfs4_find_root_sec+0xc4/0x100 [nfsv4]
    nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4]
    nfs4_get_rootfh+0x6c/0x190 [nfsv4]
    nfs4_server_common_setup+0xc4/0x260 [nfsv4]
    nfs4_create_server+0x278/0x3c0 [nfsv4]
    nfs4_remote_mount+0x50/0xb0 [nfsv4]
    mount_fs+0x74/0x210
    vfs_kern_mount+0x78/0x220
    nfs_do_root_mount+0xb0/0x140 [nfsv4]
    nfs4_try_mount+0x60/0x100 [nfsv4]
    nfs_fs_mount+0x5ec/0xda0 [nfs]
    mount_fs+0x74/0x210
    vfs_kern_mount+0x78/0x220
    do_mount+0x254/0xf70
    SyS_mount+0x94/0x100
    system_call+0x38/0xe0

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-07-10 13:20:52 -07:00
Dave Kleikamp
68a5dc3857 coredump: Ensure proper size of sparse core files
[ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ]

If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.

After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.

This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:40:26 +02:00
Daniel Borkmann
251d00bf13 bpf: don't trigger OOM killer under pressure with map alloc
[ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ]

This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:40:21 +02:00
jbrunet
97ace18307 net: phy: add an option to disable EEE advertisement
[ Upstream commit d853d145ea3e63387a2ac759aa41d5e43876e561 ]

This patch adds an option to disable EEE advertisement in the generic PHY
by providing a mask of prohibited modes corresponding to the value found in
the MDIO_AN_EEE_ADV register.

On some platforms, PHY Low power idle seems to be causing issues, even
breaking the link some cases. The patch provides a convenient way for these
platforms to disable EEE advertisement and work around the issue.

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:40:19 +02:00
John Stultz
a53bfdda06 time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
commit 3d88d56c5873f6eebe23e05c3da701960146b801 upstream.

Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone un-noticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.

This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.

Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Daniel Mentz <danielmentz@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: http://lkml.kernel.org/r/1496965462-20003-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29 13:00:30 +02:00
John Stultz
02a37ccd63 time: Fix clock->read(clock) race around clocksource changes
commit ceea5e3771ed2378668455fa21861bead7504df5 upstream.

In tests, which excercise switching of clocksources, a NULL
pointer dereference can be observed on AMR64 platforms in the
clocksource read() function:

u64 clocksource_mmio_readl_down(struct clocksource *c)
{
	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}

This is called from the core timekeeping code via:

	cycle_now = tkr->read(tkr->clock);

tkr->read is the cached tkr->clock->read() function pointer.
When the clocksource is changed then tkr->clock and tkr->read
are updated sequentially. The code above results in a sequential
load operation of tkr->read and tkr->clock as well.

If the store to tkr->clock hits between the loads of tkr->read
and tkr->clock, then the old read() function is called with the
new clock pointer. As a consequence the read() function
dereferences a different data structure and the resulting 'reg'
pointer can point anywhere including NULL.

This problem was introduced when the timekeeping code was
switched over to use struct tk_read_base. Before that, it was
theoretically possible as well when the compiler decided to
reload clock in the code sequence:

     now = tk->clock->read(tk->clock);

Add a helper function which avoids the issue by reading
tk_read_base->clock once into a local variable clk and then issue
the read function via clk->read(clk). This guarantees that the
read() function always gets the proper clocksource pointer handed
in.

Since there is now no use for the tkr.read pointer, this patch
also removes it, and to address stopping the fast timekeeper
during suspend/resume, it introduces a dummy clocksource to use
rather then just a dummy read function.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-29 13:00:30 +02:00
Hugh Dickins
cfc0eb4038 mm: larger stack guard gap, between vmas
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-24 07:11:18 +02:00
Don Zickus
b13b3b706a kernel/watchdog: prevent false hardlockup on overloaded system
[ Upstream commit b94f51183b0617e7b9b4fb4137d4cf1cab7547c2 ]

On an overloaded system, it is possible that a change in the watchdog
threshold can be delayed long enough to trigger a false positive.

This can easily be achieved by having a cpu spinning indefinitely on a
task, while another cpu updates watchdog threshold.

What happens is while trying to park the watchdog threads, the hrtimers
on the other cpus trigger and reprogram themselves with the new slower
watchdog threshold.  Meanwhile, the nmi watchdog is still programmed
with the old faster threshold.

Because the one cpu is blocked, it prevents the thread parking on the
other cpus from completing, which is needed to shutdown the nmi watchdog
and reprogram it correctly.  As a result, a false positive from the nmi
watchdog is reported.

Fix this by setting a park_in_progress flag to block all lockups until
the parking is complete.

Fix provided by Ulrich Obergfell.

[akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/]
Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com
Signed-off-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:57 +02:00
Babu Moger
0ce66ee6ae kernel/watchdog.c: move shared definitions to nmi.h
[ Upstream commit 249e52e35580fcfe5dad53a7dcd7c1252788749c ]

Patch series "Clean up watchdog handlers", v2.

This is an attempt to cleanup watchdog handlers.  Right now,
kernel/watchdog.c implements both softlockup and hardlockup detectors.
Softlockup code is generic.  Hardlockup code is arch specific.  Some
architectures don't use hardlockup detectors.  They use their own
watchdog detectors.  To make both these combination work, we have
numerous #ifdefs in kernel/watchdog.c.

We are trying here to make these handlers independent of each other.
Also provide an interface for architectures to implement their own
handlers.  watchdog_nmi_enable and watchdog_nmi_disable will be defined
as weak such that architectures can override its definitions.

Thanks to Don Zickus for his suggestions.
Here are our previous discussions
http://www.spinics.net/lists/sparclinux/msg16543.html
http://www.spinics.net/lists/sparclinux/msg16441.html

This patch (of 3):

Move shared macros and definitions to nmi.h so that watchdog.c, new file
watchdog_hld.c or any other architecture specific handler can use those
definitions.

Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Josh Hunt <johunt@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:57 +02:00
Sean Nyekjaer
7dddbfcd96 net: phy: micrel: add support for KSZ8795
[ Upstream commit 9d162ed69f51cbd9ee5a0c7e82aba7acc96362ff ]

This is adds support for the PHYs in the KSZ8795 5port managed switch.

It will allow to detect the link between the switch and the soc
and uses the same read_status functions as the KSZ8873MLL switch.

Signed-off-by: Sean Nyekjaer <sean.nyekjaer@prevas.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:54 +02:00
David Howells
11696dcea2 fscache: Fix dead object requeue
[ Upstream commit e26bfebdfc0d212d366de9990a096665d5c0209a ]

Under some circumstances, an fscache object can become queued such that it
fscache_object_work_func() can be called once the object is in the
OBJECT_DEAD state.  This results in the kernel oopsing when it tries to
invoke the handler for the state (which is hard coded to 0x2).

The way this comes about is something like the following:

 (1) The object dispatcher is processing a work state for an object.  This
     is done in workqueue context.

 (2) An out-of-band event comes in that isn't masked, causing the object to
     be queued, say EV_KILL.

 (3) The object dispatcher finishes processing the current work state on
     that object and then sees there's another event to process, so,
     without returning to the workqueue core, it processes that event too.
     It then follows the chain of events that initiates until we reach
     OBJECT_DEAD without going through a wait state (such as
     WAIT_FOR_CLEARANCE).

     At this point, object->events may be 0, object->event_mask will be 0
     and oob_event_mask will be 0.

 (4) The object dispatcher returns to the workqueue processor, and in due
     course, this sees that the object's work item is still queued and
     invokes it again.

 (5) The current state is a work state (OBJECT_DEAD), so the dispatcher
     jumps to it - resulting in an OOPS.

When I'm seeing this, the work state in (1) appears to have been either
LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
fscache_osm_lookup_oob).

The window for (2) is very small:

 (A) object->event_mask is cleared whilst the event dispatch process is
     underway - though there's no memory barrier to force this to the top
     of the function.

     The window, therefore is from the time the object was selected by the
     workqueue processor and made requeueable to the time the mask was
     cleared.

 (B) fscache_raise_event() will only queue the object if it manages to set
     the event bit and the corresponding event_mask bit was set.

     The enqueuement is then deferred slightly whilst we get a ref on the
     object and get the per-CPU variable for workqueue congestion.  This
     slight deferral slightly increases the probability by allowing extra
     time for the workqueue to make the item requeueable.

Handle this by giving the dead state a processor function and checking the
for the dead state address rather than seeing if the processor function is
address 0x2.  The dead state processor function can then set a flag to
indicate that it's occurred and give a warning if it occurs more than once
per object.

If this race occurs, an oops similar to the following is seen (note the RIP
value):

BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
IP: [<0000000000000002>] 0x1
PGD 0
Oops: 0010 [#1] SMP
Modules linked in: ...
CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
Workqueue: fscache_object fscache_object_work_func [fscache]
task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
RIP: 0010:[<0000000000000002>]  [<0000000000000002>] 0x1
RSP: 0018:ffff880717547df8  EFLAGS: 00010202
RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
FS:  0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
Call Trace:
 [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache]
 [<ffffffff8109d5db>] process_one_work+0x17b/0x470
 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400
 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
 [<ffffffff810a5acf>] kthread+0xcf/0xe0
 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
 [<ffffffff816460d8>] ret_from_fork+0x58/0x90
 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeremy McNicoll <jeremymc@redhat.com>
Tested-by: Frank Sorenson <sorenson@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:52 +02:00
Dimitris Michailidis
f4d2d05ffb net: fix ndo_features_check/ndo_fix_features comment ordering
[ Upstream commit 1a2a14444d32b89b28116daea86f63ced1716668 ]

Commit cdba756f58 ("net: move ndo_features_check() close to
ndo_start_xmit()") inadvertently moved the doc comment for
.ndo_fix_features instead of .ndo_features_check. Fix the comment
ordering.

Fixes: cdba756f58 ("net: move ndo_features_check() close to ndo_start_xmit()")
Signed-off-by: Dimitris Michailidis <dmichail@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:52 +02:00
Ard Biesheuvel
8de6ea44af log2: make order_base_2() behave correctly on const input value zero
commit 29905b52fad0854351f57bab867647e4982285bf upstream.

The function order_base_2() is defined (according to the comment block)
as returning zero on input zero, but subsequently passes the input into
roundup_pow_of_two(), which is explicitly undefined for input zero.

This has gone unnoticed until now, but optimization passes in GCC 7 may
produce constant folded function instances where a constant value of
zero is passed into order_base_2(), resulting in link errors against the
deliberately undefined '____ilog2_NaN'.

So update order_base_2() to adhere to its own documented interface.

[ See

     http://marc.info/?l=linux-kernel&m=147672952517795&w=2

  and follow-up discussion for more background. The gcc "optimization
  pass" is really just broken, but now the GCC trunk problem seems to
  have escaped out of just specially built daily images, so we need to
  work around it in mainline.    - Linus ]

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:51 +02:00
Imre Deak
b372d35a52 PCI/PM: Add needs_resume flag to avoid suspend complete optimization
commit 4d071c3238987325b9e50e33051a40d1cce311cc upstream.

Some drivers - like i915 - may not support the system suspend direct
complete optimization due to differences in their runtime and system
suspend sequence.  Add a flag that when set resumes the device before
calling the driver's system suspend handlers which effectively disables
the optimization.

Needed by a future patch fixing suspend/resume on i915.

Suggested by Rafael.

Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org
(rebased on v4.8, added kernel version to commit message stable tag)
Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17 06:41:48 +02:00
Fabian Frederick
61604a2626 fs: add i_blocksize()
commit 93407472a21b82f39c955ea7787e5bc7da100642 upstream.

Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:06:00 +02:00
Tejun Heo
829a1cab22 cpuset: consider dying css as offline
commit 41c25707d21716826e3c1f60967f5550610ec1c9 upstream.

In most cases, a cgroup controller don't care about the liftimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:06:00 +02:00
Waiman Long
dff4c8bb13 cgroup: Prevent kill_css() from being called more than once
commit 33c35aa4817864e056fd772230b0c6b552e36ea2 upstream.

The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:06:00 +02:00
Eric W. Biederman
7c24a70c70 ptrace: Properly initialize ptracer_cred on fork
commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream.

When I introduced ptracer_cred I failed to consider the weirdness of
fork where the task_struct copies the old value by default.  This
winds up leaving ptracer_cred set even when a process forks and
the child process does not wind up being ptraced.

Because ptracer_cred is not set on non-ptraced processes whose
parents were ptraced this has broken the ability of the enlightenment
window manager to start setuid children.

Fix this by properly initializing ptracer_cred in ptrace_init_task

This must be done with a little bit of care to preserve the current value
of ptracer_cred when ptrace carries through fork.  Re-reading the
ptracer_cred from the ptracing process at this point is inconsistent
with how PT_PTRACE_CAP has been maintained all of these years.

Tested-by: Takashi Iwai <tiwai@suse.de>
Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:05:54 +02:00
Michal Hocko
292f70cd96 mm: consider memblock reservations for deferred memory initialization sizing
commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream.

We have seen an early OOM killer invocation on ppc64 systems with
crashkernel=4096M:

	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
	kthreadd cpuset=/ mems_allowed=7
	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
	Call Trace:
	  dump_stack+0xb0/0xf0 (unreliable)
	  dump_header+0xb0/0x258
	  out_of_memory+0x5f0/0x640
	  __alloc_pages_nodemask+0xa8c/0xc80
	  kmem_getpages+0x84/0x1a0
	  fallback_alloc+0x2a4/0x320
	  kmem_cache_alloc_node+0xc0/0x2e0
	  copy_process.isra.25+0x260/0x1b30
	  _do_fork+0x94/0x470
	  kernel_thread+0x48/0x60
	  kthreadd+0x264/0x330
	  ret_from_kernel_thread+0x5c/0xa4

	Mem-Info:
	active_anon:0 inactive_anon:0 isolated_anon:0
	 active_file:0 inactive_file:0 isolated_file:0
	 unevictable:0 dirty:0 writeback:0 unstable:0
	 slab_reclaimable:5 slab_unreclaimable:73
	 mapped:0 shmem:0 pagetables:0 bounce:0
	 free:0 free_pcp:0 free_cma:0
	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
	lowmem_reserve[]: 0 0 0 0
	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
	0 total pagecache pages
	0 pages in swap cache
	Swap cache stats: add 0, delete 0, find 0/0
	Free swap  = 0kB
	Total swap = 0kB
	819200 pages RAM
	0 pages HighMem/MovableOnly
	817481 pages reserved
	0 pages cma reserved
	0 pages hwpoisoned

the reason is that the managed memory is too low (only 110MB) while the
rest of the the 50GB is still waiting for the deferred intialization to
be done.  update_defer_init estimates the initial memoty to initialize
to 2GB at least but it doesn't consider any memory allocated in that
range.  In this particular case we've had

	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)

so the low 2GB is mostly depleted.

Fix this by considering memblock allocations in the initial static
initialization estimation.  Move the max_initialise to
reset_deferred_meminit and implement a simple memblock_reserved_memory
helper which iterates all reserved blocks and sums the size of all that
start below the given address.  The cumulative size is than added on top
of the initial estimation.  This is still not ideal because
reset_deferred_meminit doesn't consider holes and so reservation might
be above the initial estimation whihch we ignore but let's make the
logic simpler until we really need to handle more complicated cases.

Fixes: 3a80a7fa79 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:07:49 +02:00
Vlad Yasevich
5f595d5297 vlan: Fix tcp checksum offloads in Q-in-Q vlans
[ Upstream commit 35d2f80b07bbe03fb358afb0bdeff7437a7d67ff ]

It appears that TCP checksum offloading has been broken for
Q-in-Q vlans.  The behavior was execerbated by the
series
    commit afb0bc972b ("Merge branch 'stacked_vlan_tso'")
that that enabled accleleration features on stacked vlans.

However, event without that series, it is possible to trigger
this issue.  It just requires a lot more specialized configuration.

The root cause is the interaction between how
netdev_intersect_features() works, the features actually set on
the vlan devices and HW having the ability to run checksum with
longer headers.

The issue starts when netdev_interesect_features() replaces
NETIF_F_HW_CSUM with a combination of NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM,
if the HW advertises IP|IPV6 specific checksums.  This happens
for tagged and multi-tagged packets.   However, HW that enables
IP|IPV6 checksum offloading doesn't gurantee that packets with
arbitrarily long headers can be checksummed.

This patch disables IP|IPV6 checksums on the packet for multi-tagged
packets.

CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:07:45 +02:00
Mohamad Haj Yahia
4fb5fd27de net/mlx5: Avoid using pending command interface slots
[ Upstream commit 73dd3a4839c1d27c36d4dcc92e1ff44225ecbeb7 ]

Currently when firmware command gets stuck or it takes long time to
complete, the driver command will get timeout and the command slot is
freed and can be used for new commands, and if the firmware receive new
command on the old busy slot its behavior is unexpected and this could
be harmful.
To fix this when the driver command gets timeout we return failure,
but we don't free the command slot and we wait for the firmware to
explicitly respond to that command.
Once all the entries are busy we will stop processing new firmware
commands.

Fixes: 9cba4ebcf3 ('net/mlx5: Fix potential deadlock in command mode change')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:07:45 +02:00
Thomas Gleixner
dd0023d710 tracing/kprobes: Enforce kprobes teardown after testing
commit 30e7d894c1478c88d50ce94ddcdbd7f9763d9cdd upstream.

Enabling the tracer selftest triggers occasionally the warning in
text_poke(), which warns when the to be modified page is not marked
reserved.

The reason is that the tracer selftest installs kprobes on functions marked
__init for testing. These probes are removed after the tests, but that
removal schedules the delayed kprobes_optimizer work, which will do the
actual text poke. If the work is executed after the init text is freed,
then the warning triggers. The bug can be reproduced reliably when the work
delay is increased.

Flush the optimizer work and wait for the optimizing/unoptimizing lists to
become empty before returning from the kprobes tracer selftest. That
ensures that all operations which were queued due to the probes removal
have completed.

Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.home

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 6274de498 ("kprobes: Support delayed unoptimizing")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 15:44:47 +02:00
Srinivas Pandruvada
6bee0b1fe4 iio: hid-sensor: Store restore poll and hysteresis on S3
commit 5d9854eaea776441b38a9a45b4e6879524c4f48c upstream.

This change undo the change done by 'commit 3bec24747446
("iio: hid-sensor-trigger: Change get poll value function order to avoid
sensor properties losing after resume from S3")' as this breaks some
USB/i2c sensor hubs.

Instead of relying on HW for restoring poll and hysteresis, driver stores
and restores on resume (S3). In this way user space modified settings are
not lost for any kind of sensor hub behavior.

In this change, whenever user space modifies sampling frequency or
hysteresis driver will get the feature value from the hub and store in the
per device hid_sensor_common data structure. On resume callback from S3,
system will set the feature to sensor hub, if user space ever modified the
feature value.

Fixes: 3bec24747446 ("iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3")
Reported-by: Ritesh Raj Sarraf <rrs@researchut.com>
Tested-by: Ritesh Raj Sarraf <rrs@researchut.com>
Tested-by: Song, Hongyan <hongyan.song@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 15:44:41 +02:00
Ilya Dryomov
6a7620744e block: get rid of blk_integrity_revalidate()
commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

Commit 25520d55cd ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time.  This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cd ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
[idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14 14:00:22 +02:00
Jin Qian
93862955cb f2fs: sanity check segment count
commit b9dd46188edc2f0d1f37328637860bb65a771124 upstream.

F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14 14:00:22 +02:00
Stephen Boyd
b6c17077e0 usb: chipidea: Handle extcon events properly
commit a89b94b53371bbfa582787c2fa3378000ea4263d upstream.

We're currently emulating the vbus and id interrupts in the OTGSC
read API, but we also need to make sure that if we're handling
the events with extcon that we don't enable the interrupts for
those events in the hardware. Therefore, properly emulate this
register if we're using extcon, but don't enable the interrupts.
This allows me to get my cable connect/disconnect working
properly without getting spurious interrupts on my device that
uses an extcon for these two events.

Acked-by: Peter Chen <peter.chen@nxp.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Ivan T. Ivanov" <iivanov.xz@gmail.com>
Fixes: 3ecb3e09b0 ("usb: chipidea: Use extcon framework for VBUS and ID detect")
Signed-off-by: Stephen Boyd <stephen.boyd@linaro.org>
Signed-off-by: Peter Chen <peter.chen@nxp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14 14:00:18 +02:00
Alexander Kochetkov
ae6a762dcd net: phy: fix auto-negotiation stall due to unavailable interrupt
[ Upstream commit f555f34fdc586a56204cd16d9a7c104ec6cb6650 ]

The Ethernet link on an interrupt driven PHY was not coming up if the Ethernet
cable was plugged before the Ethernet interface was brought up.

The patch trigger PHY state machine to update link state if PHY was requested to
do auto-negotiation and auto-negotiation complete flag already set.

During power-up cycle the PHY do auto-negotiation, generate interrupt and set
auto-negotiation complete flag. Interrupt is handled by PHY state machine but
doesn't update link state because PHY is in PHY_READY state. After some time
MAC bring up, start and request PHY to do auto-negotiation. If there are no new
settings to advertise genphy_config_aneg() doesn't start PHY auto-negotiation.
PHY continue to stay in auto-negotiation complete state and doesn't fire
interrupt. At the same time PHY state machine expect that PHY started
auto-negotiation and is waiting for interrupt from PHY and it won't get it.

Fixes: 321beec504 ("net: phy: Use interrupts when available in NOLINK state")
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com>
Cc: stable <stable@vger.kernel.org> # v4.9+
Tested-by: Roger Quadros <rogerq@ti.com>
Tested-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-03 08:36:37 -07:00
Al Viro
ff76ab9e03 new privimitive: iov_iter_revert()
commit 27c0e3748e41ca79171ffa3e97415a20af6facd0 upstream.

opposite to iov_iter_advance(); the caller is responsible for never
using it to move back past the initial position.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:31:21 +02:00