Commit Graph

97 Commits

Author SHA1 Message Date
Hou Tao
4c84b3e072 dm btree remove: assign new_root only when removal succeeds
commit b6e58b5466b2959f83034bead2e2e1395cca8aeb upstream.

remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().

For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:

  general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
  CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
  Hardware name: QEMU Standard PC
  RIP: 0010:metadata_ll_load_ie+0x14/0x30
  Call Trace:
   sm_metadata_count_is_more_than_one+0xb9/0xe0
   dm_tm_shadow_block+0x52/0x1c0
   shadow_step+0x59/0xf0
   remove_raw+0xb2/0x170
   dm_btree_remove+0xf4/0x1c0
   dm_pool_delete_thin_device+0xc3/0x140
   pool_message+0x218/0x2b0
   target_message+0x251/0x290
   ctl_ioctl+0x1c4/0x4d0
   dm_ctl_ioctl+0xe/0x20
   __x64_sys_ioctl+0x7b/0xb0
   do_syscall_64+0x40/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixing it by only assign new_root when removal succeeds

Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-07-20 16:22:40 +02:00
Joe Thornber
013642e6bf dm space maps: don't reset space map allocation cursor when committing
[ Upstream commit 5faafc77f7de69147d1e818026b9a0cbf036a7b2 ]

Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device.  There are a couple
of repercussions to this:

- The first allocation after the commit is likely to take longer than
  normal as it searches for a free block in an area that is likely to
  have very few free blocks (if any).

- Any free blocks it finds will have been recently freed.  Reusing them
  means we have fewer old copies of the metadata to aid recovery from
  hardware error.

Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-20 16:22:37 +02:00
Joe Thornber
e9ebaaa29e dm space map common: fix division bug in sm_ll_find_free_block()
commit 5208692e80a1f3c8ce2063a22b675dd5589d1d80 upstream.

This division bug meant the search for free metadata space could skip
the final allocation bitmap's worth of entries. Fix affects DM thinp,
cache and era targets.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22 10:38:20 +02:00
Joe Thornber
e9a972de2d dm persistent data: packed struct should have an aligned() attribute too
commit a88b2358f1da2c9f9fcc432f2e0a79617fea397c upstream.

Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
byte-by-byte access.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-22 10:38:20 +02:00
Joe Thornber
4c3816edfb dm space map common: fix to ensure new block isn't already in use
commit 4feaef830de7ffdd8352e1fe14ad3bf13c9688f8 upstream.

The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets.  There are variants for
tracking metadata blocks and data blocks.

Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).

This fix keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free.  Much simpler.

This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.

Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14 16:29:56 -05:00
Hou Tao
14883c635a dm btree: increase rebalance threshold in __rebalance2()
commit 474e559567fa631dea8fb8407ab1b6090c903755 upstream.

We got the following warnings from thin_check during thin-pool setup:

  $ thin_check /dev/vdb
  examining superblock
  examining devices tree
    missing devices: [1, 84]
      too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
  examining mapping tree

The phenomenon is the number of entries in one node of details_info tree is
less than (max_entries / 3). And it can be easily reproduced by the following
procedures:

  $ new a thin pool
  $ presume the max entries of details_info tree is 126
  $ new 127 thin devices (e.g. 1~127) to make the root node being full
    and then split
  $ remove the first 43 (e.g. 1~43) thin devices to make the children
    reblance repeatedly
  $ stop the thin pool
  $ thin_check

The root cause is that the B-tree removal procedure in __rebalance2()
doesn't guarantee the invariance: the minimal number of entries in
non-root node should be >= (max_entries / 3).

Simply fix the problem by increasing the rebalance threshold to
make sure the number of entries in each child will be greater
than or equal to (max_entries / 3 + 1), so no matter which
child is used for removal, the number will still be valid.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:35:46 +01:00
ZhangXiaoxu
77529c247c dm space map metadata: fix missing store of apply_bops() return value
commit ae148243d3f0816b37477106c05a2ec7d5f32614 upstream.

In commit 6096d91af0 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'.  But when that logic was replaced in out() the
return value was not stored.  This may lead out() returning a wrong
value to the caller.

Fixes: 6096d91af0 ("dm space map metadata: fix occasional leak of a metadata block on resize")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-06 10:18:10 +02:00
ZhangXiaoxu
063c4d1868 dm btree: fix order of block initialization in btree_split_beneath
commit e4f9d6013820d1eba1432d51dd1c5795759aa77f upstream.

When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right.  If right block's allocation
failed, the left block will be unlocked and marked dirty.  If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block.  Upon return, when flushing the left block to disk, the
validator will fail when check this block.  Then a BUG_ON is raised.

Fix this by completely initializing the left block before allocating and
initializing the right block.

Fixes: 4dcb8b57df ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-06 10:18:10 +02:00
Joe Thornber
9c7755af77 dm btree: fix serious bug in btree_split_beneath()
commit bc68d0a43560e950850fc69b58f0f8254b28f6d6 upstream.

When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistibutes the root nodes entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period; until another low valued key is inserted that will uncover
it.

This is a serious bug, and quite hard to make trigger in normal use.  A
reproducing test case ("thin create devices-in-reverse-order") is
available as part of the thin-provision-tools project:
  https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

Fix the issue by changing btree_split_beneath() so it no longer adjusts
the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any
neccessary adjustments.

Reported-by: Monty Pavel <monty_pavel@sina.com>
Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:50:16 +01:00
Joe Thornber
d3df9403c0 dm space map disk: fix some book keeping in the disk space map
commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a upstream.

When decrementing the reference count for a block, the free count wasn't
being updated if the reference count went to zero.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 14:30:08 +02:00
Vinothkumar Raja
fa499b361b dm btree: fix for dm_btree_find_lowest_key()
commit 7d1fedb6e96a960aa91e4ff70714c3fb09195a5a upstream.

dm_btree_find_lowest_key() is giving incorrect results.  find_key()
traverses the btree correctly for finding the highest key, but there is
an error in the way it traverses the btree for retrieving the lowest
key.  dm_btree_find_lowest_key() fetches the first key of the rightmost
block of the btree instead of fetching the first key from the leftmost
block.

Fix this by conditionally passing the correct parameter to value64()
based on the @find_highest flag.

Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu>
Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu>
Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 14:30:07 +02:00
Benjamin Marzinski
701ec6e5ce dm space map metadata: fix 'struct sm_metadata' leak on failed create
commit 314c25c56c1ee5026cf99c570bdfe01847927acb upstream.

In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).

If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').

Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 11:16:15 +01:00
Mike Snitzer
d7a853e6f7 dm space map metadata: remove unused variable in brb_pop()
commit 512167788a6fe9481a33a3cce5f80b684631a1bb upstream.

Remove the unused struct block_op pointer that was inadvertantly
introduced, via cut-and-paste of previous brb_op() code, as part of
commit 50dd842ad.

(Cc'ing stable@ because commit 50dd842ad did)

Fixes: 50dd842ad ("dm space map metadata: fix ref counting bug when bootstrapping a new space map")
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:14 -08:00
Joe Thornber
ed8b45a367 dm btree: fix bufio buffer leaks in dm_btree_del() error path
If dm_btree_del()'s call to push_frame() fails, e.g. due to
btree_node_validator finding invalid metadata, the dm_btree_del() error
path must unlock all frames (which have active dm-bufio buffers) that
were pushed onto the del_stack.

Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
buffers have leaked, e.g.:
  device-mapper: bufio: leaked buffer 3, hold count 1, list 0

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-10 10:30:18 -05:00
Joe Thornber
50dd842ad8 dm space map metadata: fix ref counting bug when bootstrapping a new space map
When applying block operations (BOPs) do not remove them from the
uncommitted BOP ring-buffer until after they've been applied -- in case
we recurse.

Also, perform BOP_INC operation, in dm_sm_metadata_create() and
sm_metadata_extend(), in terms of the uncommitted BOP ring-buffer rather
than using direct calls to sm_ll_inc().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-09 13:27:25 -05:00
Joe Thornber
993ceab919 dm thin metadata: fix bug in dm_thin_remove_range()
dm_btree_remove_leaves() only unmaps a contiguous region so we need a
loop, in __remove_range(), to handle ranges that contain multiple
regions.

A new btree function, dm_btree_lookup_next(), is introduced which is
more efficiently able to skip over regions of the thin device which
aren't mapped.  __remove_range() uses dm_btree_lookup_next() for each
iteration of __remove_range()'s loop.

Also, improve description of dm_btree_remove_leaves().

Fixes: 6550f075 ("dm thin metadata: add dm_thin_remove_range()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-12-02 13:26:49 -05:00
Mike Snitzer
30ce6e1cc5 dm btree: fix leak of bufio-backed block in btree_split_sibling error path
The block allocated at the start of btree_split_sibling() is never
released if later insert_at() fails.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-12-02 13:20:34 -05:00
Linus Torvalds
e0700ce709 Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
 "Smaller set of DM changes for this merge.  I've based these changes on
  Jens' for-4.4/reservations branch because the associated DM changes
  required it.

   - Revert a dm-multipath change that caused a regression for
     unprivledged users (e.g. kvm guests) that issued ioctls when a
     multipath device had no available paths.

   - Include Christoph's refactoring of DM's ioctl handling and add
     support for passing through persistent reservations with DM
     multipath.

   - All other changes are very simple cleanups"

* tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm switch: simplify conditional in alloc_region_table()
  dm delay: document that offsets are specified in sectors
  dm delay: capitalize the start of an delay_ctr() error message
  dm delay: Use DM_MAPIO macros instead of open-coded equivalents
  dm linear: remove redundant target name from error messages
  dm persistent data: eliminate unnecessary return values
  dm: eliminate unused "bioset" process for each bio-based DM device
  dm: convert ffs to __ffs
  dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()
  dm: add support for passing through persistent reservations
  dm: refactor ioctl handling
  Revert "dm mpath: fix stalls when handling invalid ioctls"
  dm: initialize non-blk-mq queue data before queue is used
2015-11-04 21:19:53 -08:00
Mikulas Patocka
4c7da06f5a dm persistent data: eliminate unnecessary return values
dm_bm_unlock and dm_tm_unlock return an integer value but the returned
value is always 0.  The calling code sometimes checks the return value
and sometimes doesn't.

Eliminate these unnecessary return values and also the checks for them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-10-31 19:06:02 -04:00
Mike Snitzer
4dcb8b57df dm btree: fix leak of bufio-backed block in btree_split_beneath error path
btree_split_beneath()'s error path had an outstanding FIXME that speaks
directly to the potential for _not_ cleaning up a previously allocated
bufio-backed block.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Cc: stable@vger.kernel.org
2015-10-23 14:02:55 -04:00
Joe Thornber
2871c69e02 dm btree remove: fix a bug when rebalancing nodes after removal
Commit 4c7e309340 ("dm btree remove: fix bug in redistribute3") wasn't
a complete fix for redistribute3().

The redistribute3 function takes 3 btree nodes and shares out the entries
evenly between them.  If the three nodes in total contained
(MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
the center.

Fix this issue by being more careful about calculating the target number
of entries for the left and right nodes.

Unit tested in userspace using this program:
https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-10-23 14:02:55 -04:00
viresh kumar
fc0a446152 dm: remove unlikely() before IS_ERR()
IS_ERR() already contains an 'unlikely' compiler flag so there is no
need to do that again from IS_ERR() callers.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:21 -04:00
Vivek Goyal
8c747fd0c3 dm btree remove: remove unused function get_nr_entries()
rebalance_children() calls get_nr_entries() and assigns the result to an
unused local 'child_entries' variable.  Remove get_nr_entries() and
cleanup rebalance_children() accordingly.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:19 -04:00
Vivek Goyal
0a8d4c3ef8 dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-12 11:32:19 -04:00
Joe Thornber
b0dc3c8bc1 dm btree: add ref counting ops for the leaves of top level btrees
When using nested btrees, the top leaves of the top levels contain
block addresses for the root of the next tree down.  If we shadow a
shared leaf node the leaf values (sub tree roots) should be incremented
accordingly.

This is only an issue if there is metadata sharing in the top levels.
Which only occurs if metadata snapshots are being used (as is possible
with dm-thinp).  And could result in a block from the thinp metadata
snap being reused early, thus corrupting the thinp metadata snap.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-08-12 10:50:37 -04:00
Joe Thornber
aa0cd28d05 dm btree remove: fix bug in remove_one()
remove_one() was not incrementing the key for the beginning of the
range, so not all entries were being removed.  This resulted in
discards that were not unmapping all blocks.

Fixes: 4ec331c3ea ("dm btree: add dm_btree_remove_leaves()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-08-07 11:56:43 -04:00
Joe Thornber
1c7518794a dm btree: silence lockdep lock inversion in dm_btree_del()
Allocate memory using GFP_NOIO when deleting a btree.  dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-06 10:45:02 -04:00
Dennis Yang
4c7e309340 dm btree remove: fix bug in redistribute3
redistribute3() shares entries out across 3 nodes.  Some entries were
being moved the wrong way, breaking the ordering.  This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.

For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html

Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 17:43:58 -04:00
Joe Thornber
6096d91af0 dm space map metadata: fix occasional leak of a metadata block on resize
The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps.  Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes).  These decrements were
not being applied when switching out of bootstrap mode.

The effect of this bug was the leaking of a 4k metadata block.  This is
detected by the latest version of thin_check as a non fatal error.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-06-17 10:09:23 -04:00
Joe Thornber
4ec331c3ea dm btree: add dm_btree_remove_leaves()
Removes a range of leaf values from the tree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:03 -04:00
Mike Snitzer
49f154c732 dm thin metadata: remove in-core 'read_only' flag
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:59 -04:00
Linus Torvalds
a911dcdba1 Merge tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull more device mapper changes from Mike Snitzer:

- Significant dm-crypt CPU scalability performance improvements thanks
  to changes that enable effective use of an unbound workqueue across
  all available CPUs.  A large battery of tests were performed to
  validate these changes, summary of results is available here:
  https://www.redhat.com/archives/dm-devel/2015-February/msg00106.html

- A few additional stable fixes (to DM core, dm-snapshot and dm-mirror)
  and a small fix to the dm-space-map-disk.

* tag 'dm-3.20-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm snapshot: fix a possible invalid memory access on unload
  dm: fix a race condition in dm_get_md
  dm crypt: sort writes
  dm crypt: add 'submit_from_crypt_cpus' option
  dm crypt: offload writes to thread
  dm crypt: remove unused io_pool and _crypt_io_pool
  dm crypt: avoid deadlock in mempools
  dm crypt: don't allocate pages for a partial request
  dm crypt: use unbound workqueue for request processing
  dm io: reject unsupported DISCARD requests with EOPNOTSUPP
  dm mirror: do not degrade the mirror on discard error
  dm space map disk: fix sm_disk_count_is_more_than_one()
2015-02-21 13:28:45 -08:00
Mike Snitzer
145b9006a0 dm space map disk: fix sm_disk_count_is_more_than_one()
dm_tm_shadow_block() is the only caller of
dm_sm_count_is_more_than_one() which only ever operates on a metadata
space-map.  So in practice, sm_disk_count_is_more_than_one() isn't
actually used (which explains why this bug never amounted to anything).

But fix sm_disk_count_is_more_than_one() to properly set *result and
return 0.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-02-13 19:32:58 -05:00
Christoph Jaeger
6341e62b21 kconfig: use bool instead of boolean for type definition attributes
Support for keyword 'boolean' will be dropped later on.

No functional change.

Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
2015-01-07 13:08:04 +01:00
Joe Thornber
02717d9855 dm space map metadata: fix sm_bootstrap_get_count()
Must set 'result' accordingly rather than return it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-02 10:25:06 -05:00
Dan Carpenter
c1c6156fe4 dm space map metadata: fix sm_bootstrap_get_nr_blocks()
This function isn't right and it causes a static checker warning:

	drivers/md/dm-thin.c:3016 maybe_resize_data_dev()
	error: potentially using uninitialized 'sb_data_size'.

It should set "*count" and return zero on success the same as the
sm_metadata_get_nr_blocks() function does earlier.

Fixes: 3241b1d3e0 ('dm: add persistent data library')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:31:58 -05:00
Joe Thornber
8001e87d0e dm array: if resizing the array is a noop set the new root to the old one
This could've been quite bad (to return success but not update the new
root to point at the old) but in practice the only known consumer of the
dm array code is the DM cache target.  And the DM cache target passes in
the same old root to array_resize() anyway.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-12-01 11:30:07 -05:00
Joe Thornber
4646015d7e dm transaction manager: add support for prefetching blocks of metadata
Introduce the dm_tm_issue_prefetches interface.  If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core.  dm_tm_issue_prefetches will request those blocks to be
prefetched.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-11-10 15:25:26 -05:00
Joe Thornber
9b460d3699 dm btree: fix a recursion depth bug in btree walking code
The walk code was using a 'ro_spine' to hold it's locked btree nodes.
But this data structure is designed for the rolling lock scheme, and
as such automatically unlocks blocks that are two steps up the call
chain.  This is not suitable for the simple recursive walk algorithm,
which retraces its steps.

This code is only used by the persistent array code, which in turn is
only used by dm-cache.  In order to trigger it you need to have a
mapping tree that is more than 2 levels deep; which equates to 8-16
million cache blocks.  For instance a 4T ssd with a very small block
size of 32k only just triggers this bug.

The fix just places the locked blocks on the stack, and stops using
the ro_spine altogether.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-11-10 15:23:58 -05:00
Joe Thornber
a9d45396f5 dm transaction manager: fix corruption due to non-atomic transaction commit
The persistent-data library used by dm-thin, dm-cache, etc is
transactional.  If anything goes wrong, such as an io error when writing
new metadata or a power failure, then we roll back to the last
transaction.

Atomicity when committing a transaction is achieved by:

a) Never overwriting data from the previous transaction.
b) Writing the superblock last, after all other metadata has hit the
   disk.

This commit and the following commit ("dm: take care to copy the space
map roots before locking the superblock") fix a bug associated with (b).
When committing it was possible for the superblock to still be written
in spite of an io error occurring during the preceeding metadata flush.
With these commits we're careful not to take the write lock out on the
superblock until after the metadata flush has completed.

Change the transaction manager's semantics for dm_tm_commit() to assume
all data has been flushed _before_ the single superblock that is passed
in.

As a prerequisite, split the block manager's block unlocking and
flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush().  Now
the unlocking must be done separately.

This issue was discovered by forcing io errors at the crucial time
using dm-flakey.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-03-27 16:56:23 -04:00
Joe Thornber
428e469864 dm bitset: only flush the current word if it has been dirtied
This change offers a big performance boost for dm-era.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-27 16:56:23 -04:00
Joe Thornber
cebc2de44d dm space map metadata: fix refcount decrement below 0 which caused corruption
This has been a relatively long-standing issue that wasn't nailed down
until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014,
see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html

From that report:
  "When decreasing the reference count of a metadata block with its
  reference count equals 3, we will call dm_btree_remove() to remove
  this enrty from the B+tree which keeps the reference count info in
  metadata device.

  The B+tree will try to rebalance the entry of the child nodes in each
  node it traversed, and the rebalance process contains the following
  steps.

  (1) Finding the corresponding children in current node (shadow_current(s))
  (2) Shadow the children block (issue BOP_INC)
  (3) redistribute keys among children, and free children if necessary (issue BOP_DEC)

  Since the update of a metadata block's reference count could be
  recursive, we will stash these reference count update operations in
  smm->uncommitted and then process them in a FILO fashion.

  The problem is that step(3) could free the children which is created
  in step(2), so the BOP_DEC issued in step(3) will be carried out
  before the BOP_INC issued in step(2) since these BOPs will be
  processed in FILO fashion. Once the BOP_DEC from step(3) tries to
  decrease the reference count of newly shadow block, it will report
  failure for its reference equals 0 before decreasing. It looks like we
  can solve this issue by processing these BOPs in a FIFO fashion
  instead of FILO."

Commit 5b564d80 ("dm space map: disallow decrementing a reference count
below zero") changed the code to report an error for this temporary
refcount decrement below zero.  So what was previously a harmless
invalid refcount became a hard failure due to the new error path:

 device-mapper: space map common: unable to decrement a reference count below 0
 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
 device-mapper: thin: 253:6: switching pool to read-only mode

This bug is in dm persistent-data code that is common to the DM thin and
cache targets.  So any users of those targets should apply this fix.

Fix this by applying recursive space map operations in FIFO order rather
than FILO.

Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801

Reported-by: Apollon Oikonomopoulos <apoikos@debian.org>
Reported-by: edwillam1007@gmail.com
Reported-by: Teng-Feng Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.13+
2014-03-07 12:02:47 -05:00
Mike Snitzer
c64d240df3 dm: fix Kconfig indentation
Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option
move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig.

Doing so fixes indentation for other DM config options.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-03 17:31:07 -05:00
Mike Snitzer
7d48935eff dm thin: allow metadata space larger than supported to go unused
It was always intended that a user could provide a thin metadata device
that is larger than the max supported by the on-disk format.  The extra
space would just go unused.

Unfortunately that never worked.  If the user attempted to use a larger
metadata device on creation they would get an error like the following:

 device-mapper: space map common: space map too large
 device-mapper: transaction manager: couldn't create metadata space map
 device-mapper: thin metadata: tm_create_with_sm failed
 device-mapper: table: 252:17: thin-pool: Error creating metadata object
 device-mapper: ioctl: error adding target to table

Fix this by allowing the initial metadata space map creation to cap its
size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).

Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
the sizeof the disk_bitmap_header.  So the supported maximum metadata
size is a bit smaller (reduced from 33423360 to 33292800 sectors).

Lastly, remove the "excess space will not be used" warning message from
get_metadata_dev_size(); it resulted in printing the warning multiple
times.  Factor out warn_if_metadata_device_too_big(), call it from
pool_ctr() and maybe_resize_metadata_dev().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-02-27 11:49:08 -05:00
Joe Thornber
fca028438f dm space map metadata: fix bug in resizing of thin metadata
This bug was introduced in commit 7e664b3dec ("dm space map metadata:
fix extending the space map").

When extending a dm-thin metadata volume we:

- Switch the space map into a simple bootstrap mode, which allocates
  all space linearly from the newly added space.
- Add new bitmap entries for the new space
- Increment the reference counts for those newly allocated bitmap
  entries
- Commit changes to disk
- Switch back out of bootstrap mode.

But, the disk commit may allocate space itself, if so this fact will be
lost when switching out of bootstrap mode.

The bug exhibited itself as an error when the bitmap_root, with an
erroneous ref count of 0, was subsequently decremented as part of a
later disk commit.  This would cause the disk commit to fail, and thinp
to enter read_only mode.  The metadata was not damaged (thin_check
passed).

The fix is to put the increments + commit into a loop, running until
the commit has not allocated extra space.  In practise this loop only
runs twice.

With this fix the following device mapper testsuite test passes:
 dmtest run --suite thin-provisioning -n thin_remove_works_after_resize

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # depends on commit 7e664b3dec
2014-01-21 12:15:01 -05:00
Joe Thornber
f164e6900f dm btree: add dm_btree_find_lowest_key
dm_btree_find_lowest_key is the reciprocal of dm_btree_find_highest_key.
Factor out common code for dm_btree_find_{highest,lowest}_key.

dm_btree_find_lowest_key is needed for an upcoming DM target, as such it
is best to get this interface in place.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-01-09 16:29:17 -05:00
Joe Thornber
7e664b3dec dm space map metadata: fix extending the space map
When extending a metadata space map we should do the first commit whilst
still in bootstrap mode -- a mode where all blocks get allocated in the
new area.

That way the commit overhead is allocated from the newly added space.
Otherwise we risk running out of space.

With this fix, and the previous commit "dm space map common: make sure
new space is used during extend", the following device mapper testsuite
test passes:
 dmtest run --suite thin-provisioning -n /resize_metadata_no_io/

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-01-07 21:05:18 -05:00
Joe Thornber
12c91a5c2d dm space map common: make sure new space is used during extend
When extending a low level space map we should update nr_blocks at
the start so the new space is used for the index entries.

Otherwise extend can fail, e.g.: sm_metadata_extend call sequence
that fails:
 -> sm_ll_extend
    -> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block
    => returns -ENOSPC because smm->begin == smm->ll.nr_blocks

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2014-01-07 21:05:17 -05:00
Mike Snitzer
10343180f5 dm persistent data: cleanup dm-thin specific references in text
DM's persistent-data library is now used my multiple targets so
exclusive references to "pool" or "thin provisioning" need to be
cleaned up.  Adjust Kconfig's DM_DEBUG_BLOCK_STACK_TRACING text
and remove "pool" from a block manager error message.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07 10:11:54 -05:00
Mike Snitzer
c46985e211 dm space map metadata: limit errors in sm_metadata_new_block
The "unable to allocate new metadata block" error can be a particularly
verbose error if there is a systemic issue with the metadata device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2014-01-07 10:11:46 -05:00