Squashed commit of the following:

commit 37695a77521cfccbf92840cc13dcc4d8cb7dda96
Author: pwnrazr <1644943+pwnrazr@users.noreply.github.com>
Date: Thu Feb 16 00:00:20 2023 +0800

raphael_defconfig: enable erofs highpri percpu kthread

commit 816e4801de2002f5f53e7cd2f7aea282755d5391
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Mon Mar 6 15:48:21 2023 -0500

fs/(erofs || f2fs): drop WQ_UNBOUND

Due to asym arm64 latency regression on WQ_UNBOUND

commit d0e5cb53f102962d0d40ff12f548542d71f6340e
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Wed Feb 15 10:44:37 2023 -0500

erofs/zdata: modify set sched to use RR at high prio for lower latency

Fixes: bdd668d3b54202

commit afc1c08015966909a27c9d3d53d8796e80c3e4ef
Author: Sandeep Dhavale <dhavale@google.com>
Date: Wed Feb 8 06:53:49 2023 +0000

[WIP] BACKPORT: FROMLIST: erofs: add per-cpu threads for decompression

Using per-cpu thread pool we can reduce the scheduling latency compared to workqueue implementation. With this patch scheduling latency and variation is reduced as per-cpu threads are high priority kthread_workers.

The results were evaluated on arm64 Android devices running 5.10 kernel.

The table below shows resulting improvements of total scheduling latency for the same app launch benchmark runs with 50 iterations. Scheduling latency is the latency between when the task (workqueue kworker vs kthread_worker) became eligible to run to when it actually started running.
+-------------------------+-----------+----------------+---------+
|                         | workqueue | kthread_worker |  diff   |
+-------------------------+-----------+----------------+---------+
| Average (us)            |     15253 |           2914 | -80.89% |
| Median (us)             |     14001 |           2912 | -79.20% |
| Minimum (us)            |      3117 |           1027 | -67.05% |
| Maximum (us)            |     30170 |           3805 | -87.39% |
| Standard deviation (us) |      7166 |            359 |         |
+-------------------------+-----------+----------------+---------+

Background: Boot times and cold app launch benchmarks are very important to the android ecosystem as they directly translate to responsiveness from user point of view. While erofs provides a lot of important features like space savings, we saw some performance penalty in cold app launch benchmarks in few scenarios. Analysis showed that the significant variance was coming from the scheduling cost while decompression cost was more or less the same.

Having per-cpu thread pool we can see from the above table that this variation is reduced by ~80% on average. This problem was discussed at LPC 2022. Link to LPC 2022 slides and talk at [1]

[1] https://lpc.events/event/16/contributions/1338/

Link: https://lore.kernel.org/lkml/Y+DP6V9fZG7XPPGy@debian/
Change-Id: I454da5bc17f285d99047b93dc1fc70444f287156
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

commit 354d97368e8ffd832a43f6aa0d7c43f52268ca80
Author: pwnrazr <1644943+pwnrazr@users.noreply.github.com>
Date: Sat May 7 13:21:24 2022 +0800

sm8150: dtsi: remove barrier and discard mount options

commit 6c0b4a711ecb5b0e30c6115959b48af641e9b5bf
Author: pwnrazr <1644943+pwnrazr@users.noreply.github.com>
Date: Sat May 7 13:20:47 2022 +0800

Revert "arch: arm64: disable erofs"

This reverts commit fe6fe5ef6107fc245ca50cd38f585e580fe2fc59.
commit 515b1441ad6ac0f9e1c74013cd80e9b30065edc0 Author: kondors1995 <normandija1945@gmail.com> Date: Wed Feb 8 16:43:29 2023 +0200 Revert "raphael_defconfig: Revert FBEv2 defconfig changes" This reverts commit 97bb4a1d5d103804c72617481fca9b6cf93660a2. commit c010e1a5176d73f3829ce49cfdb0fcc0ee5c777c Author: Yue Hu <huyue2@coolpad.com> Date: Thu Apr 7 13:05:43 2022 +0800 erofs: do not prompt for risk any more when using big pcluster The big pcluster feature has been merged for a year, it has been mostly stable now. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220407050505.12683-1-huyue2@coolpad.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Cyber Knight <cyberknight755@gmail.com> commit b135290ae7af3f5f7b69e24c6ca678c4f6572cf2 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:23:06 2022 -0400 erofs: Squashed revert of some recent backports: Keep out of release branch until d71eb1da8e8b59a7072c51ce48175e159ecfd79a is fixed, and also readmore decompress strategy is introduced. commit b9494371e2493f1a8ccc18b1c80f67867f6f623a Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:49 2022 -0400 Revert "erofs: iomap support for non-tailpacking DIO" This reverts commit 804ddc92b769a9cc9926d0262725e6330d0f0a76. commit 0649a6ed5e759857aabc334abeddacbe4eac7859 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:41 2022 -0400 Revert "erofs: adapt 3f4e33b91a28 to our tree" This reverts commit 016f1ffa36da74ab67ed99abd474a0b2da5133eb. commit a3704a5a79990f75c8336c9001939db6e6d21181 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:33 2022 -0400 Revert "erofs: add support for the full decompressed length" This reverts commit a4a195b954114aeb741cf4f8b14256ed92e7c545. 
commit 5a506fe78d7624f1a94e60d0e3d7113ae6934ea7 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:27 2022 -0400 Revert "erofs: add fiemap support with iomap" This reverts commit 07577933c3fb397791f113ad36fac7a061385826. commit dd93cf9efb3d1f9608780c44a50a860eb9921cf4 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:16 2022 -0400 Revert "erofs: introduce chunk-based file on-disk format" This reverts commit 690f4dc6d3b27ed6278b8fbae20273883f616e56. commit a1846fe6257df43564f42eb153131796f3fd84ed Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:08 2022 -0400 Revert "erofs: support reading chunk-based uncompressed files" This reverts commit 5bd83bfc55b6169af5bbf3c0ba4528577c2fa1ff. commit 3e1c2530db00b6605d8db09e207cb3633e61cdba Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:22:03 2022 -0400 Revert "erofs: fix double free of 'copied'" This reverts commit c608a6f861e0d457d6c9a5905e8b3d928e672075. commit 7a9e0f351f8d41a01a0763316bbd4b6ace94bea0 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:21:52 2022 -0400 Revert "erofs: fix misbehavior of unsupported chunk format check" This reverts commit 751e7c533e451b3c6a51f7d2a69224cca39e8c20. commit 37b05816e45d519643dd9d162b827311abf3b034 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:21:44 2022 -0400 Revert "erofs: get compression algorithms directly on mapping" This reverts commit 98b09cde747826f6fe3aae50eb05659f7f2803f7. commit de74ca4af181a35ac037a44f07cf6a7e55e0f127 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:21:35 2022 -0400 Revert "erofs: introduce the secondary compression head" This reverts commit feea4ee667bf5d5fa2c6d0c5f57697476dce7ca7. commit dda6e8eaddd3203cfafd6c82d2e751f2e6d16766 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:21:29 2022 -0400 Revert "erofs: clean up z_erofs_extent_lookback" This reverts commit c08dbda40a4f3016ee6c60ae2a19e3ecc518361c. 
commit 2e5fd527a76eba733464b0ba71fe92abc839b62b Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon Jun 6 13:21:23 2022 -0400 Revert "erofs: clean up erofs_map_blocks tracepoints" This reverts commit d71eb1da8e8b59a7072c51ce48175e159ecfd79a. commit ed6e7f36515d6d80c75c4d0803b636e17f328a6c Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Thu Dec 9 09:29:18 2021 +0800 erofs: clean up erofs_map_blocks tracepoints Since the new type of chunk-based files is introduced, there is no need to leave flatmode tracepoints. Rename to erofs_map_blocks instead. Link: https://lore.kernel.org/r/20211209012918.30337-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 525147ad9beef7e521c1667509db763e970c06d3 Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Fri Mar 11 02:27:42 2022 +0800 erofs: clean up z_erofs_extent_lookback Avoid the unnecessary tail recursion since it can be converted into a loop directly in order to prevent potential stack overflow. It's a pretty straightforward conversion. Link: https://lore.kernel.org/r/20220310182743.102365-1-hsiangkao@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit db45bcfb35a2cd8d49e159c0cc70635b713183a4 Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Mon Oct 18 00:57:21 2021 +0800 erofs: introduce the secondary compression head Previously, for each HEAD lcluster, it can be either HEAD or PLAIN lcluster to indicate whether the whole pcluster is compressed or not. In this patch, a new HEAD2 head type is introduced to specify another compression algorithm other than the primary algorithm for each compressed file, which can be used for upcoming LZMA compression and LZ4 range dictionary compression for various data patterns. It has been stayed in the EROFS roadmap for years. Complete it now! 
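A minimal userspace sketch of the idea described in the message above: each file carries at most two algorithm IDs, and the head type of a pcluster selects one of them. This is an illustration, not the patch itself. The enum names, the algorithm IDs, the nibble packing of both IDs into one byte, and the helper `pick_algorithm()` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical head types; the names are illustrative only. */
enum head_type { HEAD_PLAIN, HEAD_1, HEAD_2 };

/* Hypothetical algorithm IDs. */
enum { ALG_NONE = -1, ALG_LZ4 = 0, ALG_LZMA = 1 };

/*
 * Assumption: both per-file algorithm IDs are packed into one byte,
 * the primary one in the low nibble, the secondary one in the high nibble.
 */
static int pick_algorithm(uint8_t packed, enum head_type type)
{
	switch (type) {
	case HEAD_1:
		return packed & 0x0f;	/* primary algorithm */
	case HEAD_2:
		return packed >> 4;	/* secondary algorithm */
	default:
		/* Only PLAIN is left: the pcluster is stored uncompressed. */
		assert(type == HEAD_PLAIN);
		return ALG_NONE;
	}
}
```

With `packed = (ALG_LZMA << 4) | ALG_LZ4`, a `HEAD_1` pcluster resolves to `ALG_LZ4` and a `HEAD_2` pcluster to `ALG_LZMA`, so one file can mix both algorithms per pcluster.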
Link: https://lore.kernel.org/r/20211017165721.2442-1-xiang@kernel.org
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

commit f0fe9e97d03ed484a51f764373ad0c5941949869
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date: Sat Oct 9 04:08:37 2021 +0800

erofs: get compression algorithms directly on mapping

Currently, z_erofs_map_blocks_iter() returns whether extents are compressed or not, and the decompression frontend gets the specific algorithms then.

It works but not quite well in many aspects, for example:
- The decompression frontend has to deal with whether extents are compressed or not again and lookup the algorithms if compressed. It's duplicated and too detailed about the on-disk mapping.
- A new secondary compression head will be introduced later so that each file can have 2 compression algorithms at most for different types of data. It could increase the complexity of the decompression frontend if still handled in this way;
- A new readmore decompression strategy will be introduced to get better performance for much bigger pcluster and lzma, which needs the specific algorithm in advance as well.

Let's look up compression algorithms in z_erofs_map_blocks_iter() directly instead.

Link: https://lore.kernel.org/r/20211008200839.24541-2-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

commit 588fc2156404c552d4c2c7bcc5def820966a1ba1
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date: Wed Sep 22 17:51:41 2021 +0800

erofs: fix misbehavior of unsupported chunk format check

Unsupported chunk format should be checked with "if (vi->chunkformat & ~EROFS_CHUNK_FORMAT_ALL)"

Found when checking with 4k-byte blockmap (although currently mkfs uses inode chunk indexes format by default.)
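The check quoted above follows the usual "reject any bit outside the known set" pattern. A small userspace sketch of that pattern, for illustration only: the macro names, the bit values, and the helper `chunk_format_unsupported()` are assumptions for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout, chosen for illustration only. */
#define CHUNK_FORMAT_BLKBITS_MASK 0x001Fu
#define CHUNK_FORMAT_INDEXES      0x0020u
#define CHUNK_FORMAT_ALL \
	(CHUNK_FORMAT_BLKBITS_MASK | CHUNK_FORMAT_INDEXES)

/* The two known fields must not overlap, or the mask would be ambiguous. */
static_assert((CHUNK_FORMAT_BLKBITS_MASK & CHUNK_FORMAT_INDEXES) == 0,
	      "known format fields overlap");

/* True when any bit outside the known set is present. */
static bool chunk_format_unsupported(uint16_t chunkformat)
{
	return (chunkformat & ~CHUNK_FORMAT_ALL) != 0;
}
```

Masking with the complement means that a format word with only known bits set passes, including the all-zero word, while a single unknown bit is enough to reject the inode.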
Link: https://lore.kernel.org/r/20210922095141.233938-1-hsiangkao@linux.alibaba.com Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 613122535bafaabb0e58a9c347c5b6f1b8e6fa91 Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Wed Aug 25 20:07:57 2021 +0800 erofs: fix double free of 'copied' Dan reported a new smatch warning [1] "fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'" Due to new chunk-based format handling logic, the error path can be called after kfree(copied). Set "copied = NULL" after kfree(copied) to fix this. [1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 7b648f684ea7c99deab7278f0c2cbbf74797a56d Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Fri Aug 20 18:00:19 2021 +0800 erofs: support reading chunk-based uncompressed files Add runtime support for chunk-based uncompressed files described in the previous patch. Link: https://lore.kernel.org/r/20210820100019.208490-2-hsiangkao@linux.alibaba.com Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit d9737546275a3c460177a3ce9e01096bc3cfc3ad Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Fri Aug 20 18:00:18 2021 +0800 erofs: introduce chunk-based file on-disk format Currently, uncompressed data except for tail-packing inline is consecutive on disk. 
In order to support chunk-based data deduplication, add a new corresponding inode data layout. In the future, the data source of chunks can be either (un)compressed. Link: https://lore.kernel.org/r/20210820100019.208490-1-hsiangkao@linux.alibaba.com Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 47f6bed39a7a83aa59be667657cba886dbd4b79b Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Fri Aug 13 13:29:31 2021 +0800 erofs: add fiemap support with iomap This adds fiemap support for both uncompressed files and compressed files by using iomap infrastructure. Link: https://lore.kernel.org/r/20210813052931.203280-3-hsiangkao@linux.alibaba.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 82cc95ee585c9b033a43b0564173d4c444e3a4ac Author: Gao Xiang <hsiangkao@linux.alibaba.com> Date: Wed Aug 18 23:22:31 2021 +0800 erofs: add support for the full decompressed length Previously, there is no need to get the full decompressed length since EROFS supports partial decompression. However for some other cases such as fiemap, the full decompressed length is necessary for iomap to make it work properly. This patch adds a way to get the full decompressed length. Note that it takes more metadata overhead and it'd be avoided if possible in the performance sensitive scenario. 
Link: https://lore.kernel.org/r/20210818152231.243691-1-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

commit 8ff30ee6aaa1130bc26af4a98a818d91820c0bdb
Author: John Galt <johngaltfirstrun@gmail.com>
Date: Thu May 12 12:08:04 2022 -0400

erofs: adapt 3f4e33b91a28 to our tree

commit 71e2f8865698e382349a16d8f90e5d74f935ff2a
Author: Huang Jianan <huangjianan@oppo.com>
Date: Thu Aug 5 08:35:59 2021 +0800

erofs: iomap support for non-tailpacking DIO

Add iomap support for non-tailpacking uncompressed data in order to support DIO and DAX.

Direct I/O is useful in certain scenarios for uncompressed files. For example, double pagecache can be avoided by direct I/O when a loop device is used for uncompressed files containing an upper layer compressed filesystem.

This adds iomap DIO support for non-tailpacking cases first and tail-packing inline files are handled in the follow-up patch.

Link: https://lore.kernel.org/r/20210805003601.183063-2-hsiangkao@linux.alibaba.com
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

commit 8bc571a229c3701405ac47f689db283ac99f2b2d
Author: Goldwyn Rodrigues <rgoldwyn@suse.com>
Date: Fri Aug 30 12:09:24 2019 -0500

fs: export generic_file_buffered_read()

Export generic_file_buffered_read() to be used to supplement incomplete direct reads.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> commit 34c8cbbc7b932ac50e90da6e838524fd1f162aca Author: Dan Williams <dan.j.williams@intel.com> Date: Wed Mar 7 15:26:44 2018 -0800 fs, dax: prepare for dax-specific address_space_operations In preparation for the dax implementation to start associating dax pages to inodes via page->mapping, we need to provide a 'struct address_space_operations' instance for dax. Define some generic VFS aops helpers for dax. These noop implementations are there in the dax case to prevent the VFS from falling back to operations with page-cache assumptions, dax_writeback_mapping_range() may not be referenced in the FS_DAX=n case. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Matthew Wilcox <mawilcox@microsoft.com> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> commit b0da008763834f165e8a055e011f223b3981316d Author: Andreas Gruenbacher <agruenba@redhat.com> Date: Sun Oct 1 17:55:54 2017 -0400 iomap: Switch from blkno to disk offset Replace iomap->blkno, the sector number, with iomap->addr, the disk offset in bytes. For invalid disk offsets, use the special value IOMAP_NULL_ADDR instead of IOMAP_NULL_BLOCK. This allows to use iomap for mappings which are not block aligned, such as inline data on ext4. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. 
Wong <darrick.wong@oracle.com> # iomap, xfs Reviewed-by: Jan Kara <jack@suse.cz> commit b74997cce993dd0408a0beeb36bd28652e272108 Author: Matthew Wilcox <mawilcox@microsoft.com> Date: Tue Nov 28 15:39:51 2017 -0500 idr: Rename idr_for_each_entry_ext Most places in the kernel that we need to distinguish functions by the type of their arguments, we use '_ul' as a suffix for the unsigned long variant, not '_ext'. Also add kernel-doc. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> commit a562faeba73cfb13de1f278c95be606faa3e4f21 Author: Matthew Wilcox <mawilcox@microsoft.com> Date: Tue Nov 28 10:14:27 2017 -0500 idr: Add idr_alloc_u32 helper All current users of idr_alloc_ext() actually want to allocate a u32 and idr_alloc_u32() fits their needs better. Like idr_get_next(), it uses a 'nextid' argument which serves as both a pointer to the start ID and the assigned ID (instead of a separate minimum and pointer-to-assigned-ID argument). It uses a 'max' argument rather than 'end' because the semantics that idr_alloc has for 'end' don't work well for unsigned types. Since idr_alloc_u32() returns an errno instead of the allocated ID, mark it as __must_check to help callers use it correctly. Include copious kernel-doc. Chris Mi <chrism@mellanox.com> has promised to contribute test-cases for idr_alloc_u32. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> commit 4b24e4564260899c64b9532440a9b5545dbfe7f9 Author: Matthew Wilcox <mawilcox@microsoft.com> Date: Tue Apr 10 16:36:48 2018 -0700 fscache: use appropriate radix tree accessors Don't open-code accesses to data structure internals. Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Darrick J. 
Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> commit 7469480be01c3394807cbd0991f06b8d6f2d4403 Author: Matthew Wilcox <mawilcox@microsoft.com> Date: Tue Apr 10 16:36:44 2018 -0700 export __set_page_dirty XFS currently contains a copy-and-paste of __set_page_dirty(). Export it from buffer.c instead. Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> commit c53045287025992bc775081dbab63ac926a597e8 Author: Matthew Wilcox <mawilcox@microsoft.com> Date: Tue Apr 10 16:36:28 2018 -0700 radix tree: use GFP_ZONEMASK bits of gfp_t for flags Patch series "XArray", v9. (First part thereof). This patchset is, I believe, appropriate for merging for 4.17. It contains the XArray implementation, to eventually replace the radix tree, and converts the page cache to use it. This conversion keeps the radix tree and XArray data structures in sync at all times. That allows us to convert the page cache one function at a time and should allow for easier bisection. Other than renaming some elements of the structures, the data structures are fundamentally unchanged; a radix tree walk and an XArray walk will touch the same number of cachelines. I have changes planned to the XArray data structure, but those will happen in future patches. 
Improvements the XArray has over the radix tree: - The radix tree provides operations like other trees do; 'insert' and 'delete'. But what most users really want is an automatically resizing array, and so it makes more sense to give users an API that is like an array -- 'load' and 'store'. We still have an 'insert' operation for users that really want that semantic. - The XArray considers locking as part of its API. This simplifies a lot of users who formerly had to manage their own locking just for the radix tree. It also improves code generation as we can now tell RCU that we're holding a lock and it doesn't need to generate as much fencing code. The other advantage is that tree nodes can be moved (not yet implemented). - GFP flags are now parameters to calls which may need to allocate memory. The radix tree forced users to decide what the allocation flags would be at creation time. It's much clearer to specify them at allocation time. - Memory is not preloaded; we don't tie up dozens of pages on the off chance that the slab allocator fails. Instead, we drop the lock, allocate a new node and retry the operation. We have to convert all the radix tree, IDA and IDR preload users before we can realise this benefit, but I have not yet found a user which cannot be converted. - The XArray provides a cmpxchg operation. The radix tree forces users to roll their own (and at least four have). - Iterators take a 'max' parameter. That simplifies many users and will reduce the amount of iteration done. - Iteration can proceed backwards. We only have one user for this, but since it's called as part of the pagefault readahead algorithm, that seemed worth mentioning. - RCU-protected pointers are not exposed as part of the API. There are some fun bugs where the page cache forgets to use rcu_dereference() in the current codebase. - Value entries gain an extra bit compared to radix tree exceptional entries. 
That gives us the extra bit we need to put huge page swap entries in the page cache. - Some iterators now take a 'filter' argument instead of having separate iterators for tagged/untagged iterations. The page cache is improved by this: - Shorter, easier to read code - More efficient iterations - Reduction in size of struct address_space - Fewer walks from the top of the data structure; the XArray API encourages staying at the leaf node and conducting operations there. This patch (of 8): None of these bits may be used for slab allocations, so we can use them as radix tree flags as long as we mask them off before passing them to the slab allocator. Move the IDR flag from the high bits to the GFP_ZONEMASK bits. Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> commit c95250f9f545568b87775f1a2a48d412203161f7 Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon May 16 10:45:14 2022 -0400 Revert "erofs: compression fixes" This reverts commit 208dabff2d5e3e616a86df8bdba814d54b1a8a1f. Fixes a deadlock when fix shrinking erofs slab. commit d07627505cd871bb1a539377434dede2f4a18d9c Author: John Galt <johngaltfirstrun@gmail.com> Date: Mon May 16 09:41:14 2022 -0400 Revert "erofs: fixes for compilation" This reverts commit c7bf11979051cda0e7b37857289503fa4831c549. 
commit 7846d0f267ba3572570917e4880d60c79939bf5c
Author: Hongyu Jin <hongyu.jin@unisoc.com>
Date: Fri Apr 1 19:55:27 2022 +0800

erofs: fix use-after-free of on-stack io[]

The root cause is the race as follows:

Thread #1                                Thread #2(irq ctx)

z_erofs_runqueue()
  struct z_erofs_decompressqueue io_A[];
  submit bio A
  z_erofs_decompress_kickoff(,,1)
                                         z_erofs_decompressqueue_endio(bio A)
                                         z_erofs_decompress_kickoff(,,-1)
                                         spin_lock_irqsave()
                                         atomic_add_return()
  io_wait_event() -> pending_bios is already 0
  [end of function]
                                         wake_up_locked(io_A[]) // crash

Referenced backtrace in kernel 5.4:

[ 10.129422] Unable to handle kernel paging request at virtual address eb0454a4
[ 10.364157] CPU: 0 PID: 709 Comm: getprop Tainted: G WC O 5.4.147-ab09225 #1
[ 11.556325] [<c01b33b8>] (__wake_up_common) from [<c01b3300>] (__wake_up_locked+0x40/0x48)
[ 11.565487] [<c01b3300>] (__wake_up_locked) from [<c044c8d0>] (z_erofs_vle_unzip_kickoff+0x6c/0xc0)
[ 11.575438] [<c044c8d0>] (z_erofs_vle_unzip_kickoff) from [<c044c854>] (z_erofs_vle_read_endio+0x16c/0x17c)
[ 11.586082] [<c044c854>] (z_erofs_vle_read_endio) from [<c06a80e8>] (clone_endio+0xb4/0x1d0)
[ 11.595428] [<c06a80e8>] (clone_endio) from [<c04a1280>] (blk_update_request+0x150/0x4dc)
[ 11.604516] [<c04a1280>] (blk_update_request) from [<c06dea28>] (mmc_blk_cqe_complete_rq+0x144/0x15c)
[ 11.614640] [<c06dea28>] (mmc_blk_cqe_complete_rq) from [<c04a5d90>] (blk_done_softirq+0xb0/0xcc)
[ 11.624419] [<c04a5d90>] (blk_done_softirq) from [<c010242c>] (__do_softirq+0x184/0x56c)
[ 11.633419] [<c010242c>] (__do_softirq) from [<c01051e8>] (irq_exit+0xd4/0x138)
[ 11.641640] [<c01051e8>] (irq_exit) from [<c010c314>] (__handle_domain_irq+0x94/0xd0)
[ 11.650381] [<c010c314>] (__handle_domain_irq) from [<c04fde70>] (gic_handle_irq+0x50/0xd4)
[ 11.659641] [<c04fde70>] (gic_handle_irq) from [<c0101b70>] (__irq_svc+0x70/0xb0)

Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu
<chao@kernel.org> Link: https://lore.kernel.org/r/20220401115527.4935-1-hongyu.jin.cn@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 9fa705504bf016a360c10edc3c9c5cbf8d870a78 Author: John Galt <johngaltfirstrun@gmail.com> Date: Thu May 5 22:40:43 2022 -0400 erofs: extend 3812dc21ec commit 4cda8c8c3d0ea4b3cb0f660db01697b50f7bfddc Author: Yue Hu <huyue2@yulong.com> Date: Thu Oct 14 14:57:44 2021 +0800 erofs: remove the fast path of per-CPU buffer decompression As Xiang mentioned, such path has no real impact to our current decompression strategy, remove it directly. Also, update the return value of z_erofs_lz4_decompress() to 0 if success to keep consistent with LZMA which will return 0 as well for that case. Link: https://lore.kernel.org/r/20211014065744.1787-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 20122adf7721eff6c6ff90db545e0597501d942f Author: Yue Hu <huyue2@yulong.com> Date: Tue Sep 14 11:59:15 2021 +0800 erofs: clear compacted_2b if compacted_4b_initial > totalidx Currently, the whole indexes will only be compacted 4B if compacted_4b_initial > totalidx. So, the calculated compacted_2b is worthless for that case. It may waste CPU resources. No need to update compacted_4b_initial as mkfs since it's used to fulfill the alignment of the 1st compacted_2b pack and would handle the case above. We also need to clarify compacted_4b_end here. It's used for the last lclusters which aren't fitted in the previous compacted_2b packs. Some messages are from Xiang. Link: https://lore.kernel.org/r/20210914035915.1190-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> [ Gao Xiang: it's enough to use "compacted_4b_initial < totalidx". 
] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 3243783e85d10ccc00b9e8cb37960ed1fc1e9fef Author: Yue Hu <huyue2@yulong.com> Date: Tue Aug 10 15:24:16 2021 +0800 erofs: remove the mapping parameter from erofs_try_to_free_cached_page() The mapping is not used at all, remove it and update related code. Link: https://lore.kernel.org/r/20210810072416.1392-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 2936d3798b6c340459813a0eeb2409a4cb34e44f Author: Yue Hu <huyue2@yulong.com> Date: Tue Aug 10 14:54:50 2021 +0800 erofs: directly use wrapper erofs_page_is_managed() when shrinking We already have the wrapper function to identify managed page. Link: https://lore.kernel.org/r/20210810065450.1320-1-zbestahu@gmail.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Yue Hu <huyue2@yulong.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> commit 09b3effb67cdec2ce718d83a363c0a2df5f3d372 Author: Yue Hu <huyue2@yulong.com> Date: Mon Apr 19 18:26:23 2021 +0800 erofs: remove the occupied parameter from z_erofs_pagevec_enqueue() No any behavior to variable occupied in z_erofs_attach_page() which is only caller to z_erofs_pagevec_enqueue(). 
Link: https://lore.kernel.org/r/20210419102623.2015-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Gao Xiang <xiang@kernel.org>

commit b5b28aefcf024c86c3f930293ba36482f96faf34
Author: Gao Xiang <xiang@kernel.org>
Date: Mon May 10 14:47:15 2021 +0800

erofs: fix 1 lcluster-sized pcluster for big pcluster

If the 1st NONHEAD lcluster of a pcluster isn't CBLKCNT lcluster type rather than a HEAD or PLAIN type instead, which means its pclustersize _must_ be 1 lcluster (since its uncompressed size < 2 lclusters), as illustrated below:

       HEAD     HEAD / PLAIN    lcluster type
   ____________ ____________
  |_:__________|_________:__|   file data (uncompressed)
   .                .
  .____________.
  |____________|                pcluster data (compressed)

Such on-disk case was explained before [1] but missed to be handled properly in the runtime implementation.

It can be observed if manually generating 1 lcluster-sized pcluster with 2 lclusters (thus CBLKCNT doesn't exist.) Let's fix it now.

[1] https://lore.kernel.org/r/20210407043927.10623-1-xiang@kernel.org

Link: https://lore.kernel.org/r/20210510064715.29123-1-xiang@kernel.org
Fixes: cec6e93beadf ("erofs: support parsing big pcluster compress indexes")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>

commit 2cfa0bcf32db1431e18d636e0ff5c592768b9620
Author: Gao Xiang <hsiangkao@redhat.com>
Date: Wed Apr 7 12:39:27 2021 +0800

erofs: enable big pcluster feature

Enable COMPR_CFGS and BIG_PCLUSTER since the implementations are all settled properly.

Link: https://lore.kernel.org/r/20210407043927.10623-11-xiang@kernel.org
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit d75144d8d0395bca0a1a629a3b9ab6a95112a083
Author: Gao Xiang <hsiangkao@redhat.com>
Date: Wed Apr 7 12:39:26 2021 +0800

erofs: support decompress big pcluster for lz4 backend

Prior to big pcluster, there was only one compressed page so it'd be easy to map this.
    However, when big pcluster is enabled, more work needs to be done to handle multiple compressed pages. In detail,

     - (maptype 0) if there is only one compressed page + no need to copy inplace I/O, just map it directly as we did before;

     - (maptype 1) if there are more compressed pages + no need to copy inplace I/O, vmap such compressed pages instead;

     - (maptype 2) if inplace I/O needs to be copied, use per-CPU buffers for decompression then.

    Another thing is how to detect whether inplace decompression is feasible or not (it's still quite easy for non big pclusters). Apart from the inplace margin calculation, the inplace I/O page reusing order also needs to be considered for each compressed page. Currently, if the compressed page is the xth page, it shouldn't be reused as [0 ... nrpages_out - nrpages_in + x], otherwise a full copy will be triggered.

    Although there are some extra optimization ideas for this, I'd like to make big pcluster work correctly first and obviously it can be further optimized later since it has nothing to do with the on-disk format at all.

    Link: https://lore.kernel.org/r/20210407043927.10623-10-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit f344f71c42af2866c748ae22e1b133a02594b367
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:25 2021 +0800

    erofs: support parsing big pcluster compact indexes

    Different from non-compact indexes, several lclusters are packed in the compact form at once and a unique base blkaddr is stored for each pack, so each lcluster index takes less space on average (e.g. 2 bytes for COMPACT_2B). btw, that is also why the BIG_PCLUSTER switch should be consistent for compact head0/1.

    Prior to big pcluster, the size of all pclusters was 1 lcluster. Therefore, when a new HEAD lcluster was scanned, blkaddr would be bumped by 1 lcluster.
    However, that way doesn't work anymore for big pcluster since we actually don't know the compressed size of pclusters in advance (before reading the CBLKCNT lcluster).

    So, instead, let blkaddr of each pack be the first pcluster blkaddr with a valid CBLKCNT, in detail,

    1) if CBLKCNT starts at the pack, this first valid pcluster is itself, e.g.
      _____________________________________________________________
     |_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
     ^ = blkaddr base          ^ += CBLKCNT0           ^ += CBLKCNT1

    2) if CBLKCNT doesn't start at the pack, the first valid pcluster is the next pcluster, e.g.
      _________________________________________________________
     | NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
                    ^ = blkaddr base        ^ += CBLKCNT0
                                                   ^ += 1

    When a CBLKCNT is found, blkaddr will be increased by CBLKCNT lclusters; or, if a new HEAD is found immediately, blkaddr is bumped by 1 instead (see the picture above.)

    Also note that if CBLKCNT is at the end of the pack, instead of storing delta1 (distance of the next HEAD lcluster) as normal NONHEADs, it still uses the compressed block count (delta0) since delta1 can be calculated indirectly but the block count can't.

    Adjust decoding logic to fit big pcluster compact indexes as well.

    Link: https://lore.kernel.org/r/20210407043927.10623-9-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 7af2a5cf065073d6f43298b2c96676f9315709d5
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:24 2021 +0800

    erofs: support parsing big pcluster compress indexes

    When INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress indexes will also have the same on-disk header as compact indexes to keep per-file configurations instead of leaving it zeroed.

    If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each pcluster in this file by parsing the 1st non-head lcluster.
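The CBLKCNT rule described in the commits above can be sketched roughly as follows. This is only an illustration under stated assumptions: the enum, struct, and function names are hypothetical and simplified, and they reflect neither the on-disk index layout nor the kernel's actual decoding code. Returning 0 for a NONHEAD lcluster that carries no block count is a convention of this sketch, not something the commit messages specify.

```c
#include <stddef.h>

/* Hypothetical, simplified lcluster index; NOT the on-disk layout. */
enum lcl_type { LCL_PLAIN, LCL_HEAD, LCL_NONHEAD };

struct lcl_index {
	enum lcl_type type;
	int is_cblkcnt;      /* NONHEAD only: delta0 holds a block count */
	unsigned int delta0; /* block count if is_cblkcnt is set */
};

/*
 * Return the compressed size, in lclusters, of the pcluster whose
 * HEAD (or PLAIN) lcluster is idx[head].
 *
 *  - no following lcluster, or the next one is HEAD/PLAIN: 1 lcluster
 *  - the next one is a CBLKCNT lcluster: delta0 lclusters
 *  - the next one is a NONHEAD without a block count: 0, which this
 *    sketch uses to mean "unexpected input"
 */
static unsigned int pcluster_lclusters(const struct lcl_index *idx,
				       size_t n, size_t head)
{
	size_t next = head + 1;

	if (next >= n || idx[next].type != LCL_NONHEAD)
		return 1;
	if (!idx[next].is_cblkcnt)
		return 0;
	return idx[next].delta0;
}
```

The second branch is the case addressed by "erofs: fix 1 lcluster-sized pcluster for big pcluster": a HEAD immediately followed by another HEAD or PLAIN lcluster has no CBLKCNT to read, so its size falls back to 1 lcluster.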
    Link: https://lore.kernel.org/r/20210407043927.10623-8-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 81a0c5100c6b09b91b7cfdad429fc66d65335be2
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:23 2021 +0800

    erofs: adjust per-CPU buffers according to max_pclusterblks

    Adjust per-CPU buffers on demand since big pcluster definition is available. Also, bail out on unsupported pcluster sizes according to Z_EROFS_PCLUSTER_MAX_SIZE.

    Link: https://lore.kernel.org/r/20210407043927.10623-7-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 56612c78a9aeefc38d6b9bd7a6fef06eebe0c4b6
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:22 2021 +0800

    erofs: add big physical cluster definition

    Big pcluster indicates the size of compressed data for each physical pcluster is no longer fixed as block size, but could be more than 1 block (more accurately, 1 logical pcluster).

    When big pcluster feature is enabled for head0/1, delta0 of the 1st non-head lcluster index will keep the block count of this pcluster in lcluster size instead of 1. Otherwise, the compressed size of the pcluster is 1 lcluster if the pcluster has no non-head lcluster index.

    Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since it depends on COMPR_CFGS and will be released together.

    Link: https://lore.kernel.org/r/20210407043927.10623-6-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit a67309917444753f1cebfee2d2503cf68269e54a
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:21 2021 +0800

    erofs: fix up inplace I/O pointer for big pcluster

    When picking up inplace I/O pages, they should be traversed in reverse order, in line with the traversal order of file-backed online pages. Also, index should be updated together when preloading compressed pages.
    Previously, only page-sized pclustersize was supported so there was no problem at all. Also rename `compressedpages' to `icpage_ptr' to reflect its functionality.

    Link: https://lore.kernel.org/r/20210407043927.10623-5-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 8fabf77d1a435d68b2bbb89c51f8351ef8efed26
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:20 2021 +0800

    erofs: introduce physical cluster slab pools

    Since multiple pcluster sizes could be used at once, the number of compressed pages will become a variable factor. It's necessary to introduce slab pools rather than a single slab cache now.

    This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE), and gets rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no use now.

    Link: https://lore.kernel.org/r/20210407043927.10623-4-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit c9b891a3fd81d315815f496f1282c95e98507812
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Sat Apr 10 03:06:30 2021 +0800

    erofs: introduce multipage per-CPU buffers

    To deal with the cases in which inplace decompression is infeasible for some inplace I/O, per-CPU buffers were introduced to get rid of page allocation latency and thrash for low-latency decompression algorithms such as lz4.

    For the big pcluster feature, introduce multipage per-CPU buffers to keep such inplace I/O pclusters temporarily as well, but note that per-CPU pages are only virtually consecutive.

    When a new big pcluster fs is mounted, its max pclustersize will be read and per-CPU buffers can be grown if needed. Shrinking adjustable per-CPU buffers is more complex (because we don't know if such a size is still in use), so currently just release them all when unloading.
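The grow-only policy described above can be illustrated with a small userspace analogy. This is a sketch under stated assumptions, not the kernel implementation: `struct pcpubuf`, `pcpubuf_grow()`, `pcpubuf_free()`, and `SKETCH_PAGE_SIZE` are hypothetical names, and `malloc()` stands in for page allocation. A real per-CPU buffer would also involve per-CPU data and locking, which are left out here.

```c
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the sketch */

/* Hypothetical buffer: an array of separately allocated pages. */
struct pcpubuf {
	void **pages;
	unsigned int nrpages;
};

/* Grow pcb to at least nrpages pages; it never shrinks. 0 or -1. */
static int pcpubuf_grow(struct pcpubuf *pcb, unsigned int nrpages)
{
	void **pages;
	unsigned int i;

	if (nrpages <= pcb->nrpages)
		return 0; /* already large enough, nothing to do */

	pages = realloc(pcb->pages, nrpages * sizeof(*pages));
	if (!pages)
		return -1;
	pcb->pages = pages;

	for (i = pcb->nrpages; i < nrpages; i++) {
		pages[i] = malloc(SKETCH_PAGE_SIZE);
		if (!pages[i])
			break;
	}
	pcb->nrpages = i; /* keep whatever was obtained */
	return i == nrpages ? 0 : -1;
}

/* Release everything at once, mirroring "release them all on unload". */
static void pcpubuf_free(struct pcpubuf *pcb)
{
	unsigned int i;

	for (i = 0; i < pcb->nrpages; i++)
		free(pcb->pages[i]);
	free(pcb->pages);
	pcb->pages = NULL;
	pcb->nrpages = 0;
}
```

Because a smaller request is a no-op, mounting a filesystem with a smaller max pclustersize after a larger one leaves the buffer untouched, which sidesteps the question of whether the larger size is still in use.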
    Link: https://lore.kernel.org/r/20210409190630.19569-1-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 6751c7549b38cfe2044fc3d6e03c25c0067e700d
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Apr 7 12:39:18 2021 +0800

    erofs: reserve physical_clusterbits[]

    Formal big pcluster design is actually more powerful / flexible than previously thought, where pclustersize was fixed as power-of-2 blocks, which was obviously inefficient and space-wasting. Instead, pclustersize can now be set independently for each pcluster, so various pcluster sizes can also be used together in one file if mkfs wants (for example, according to data type and/or compression ratio).

    Let's get rid of previous physical_clusterbits[] setting (also notice that corresponding on-disk fields are still 0 for now). Therefore, head1/2 can be used for at most 2 different algorithms in one file and again pclustersize is now independent of these.

    Link: https://lore.kernel.org/r/20210407043927.10623-2-xiang@kernel.org
    Acked-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 7c717bd2fb96a7ee82346bc88ddd28c5812c689d
Author: Ruiqi Gong <gongruiqi1@huawei.com>
Date:   Wed Mar 31 05:39:20 2021 -0400

    erofs: Clean up spelling mistakes found in fs/erofs

    zmap.c:  s/correspoinding/corresponding
    zdata.c: s/endding/ending

    Link: https://lore.kernel.org/r/20210331093920.31923-1-gongruiqi1@huawei.com
    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com>
    Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 44f277dee13de691fe1fc483b55b4bc8ade3da36
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Mon Mar 29 18:00:12 2021 +0800

    erofs: add on-disk compression configurations

    Add a bitmap for available compression algorithms and a variable-sized on-disk table for compression options in preparation for upcoming big pcluster and LZMA algorithm, which
    follows the end of super block.

    To parse the compression options, the bitmap is scanned one by one. For each available algorithm, there is data followed by 2-byte `length' correspondingly (it's enough for most cases, or entire fs blocks should be used.)

    With such available algorithm bitmap, kernel itself can also refuse to mount such filesystem if any unsupported compression algorithm exists.

    Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER.

    Link: https://lore.kernel.org/r/20210329100012.12980-1-hsiangkao@aol.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit e43a280cd5ca073e9d8cfa0471cdabf8f8500181
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Mon Mar 29 09:23:07 2021 +0800

    erofs: introduce on-disk lz4 fs configurations

    Introduce z_erofs_lz4_cfgs to store all lz4 configurations. Currently it's only max_distance, but will be used for new features later.

    Link: https://lore.kernel.org/r/20210329012308.28743-4-hsiangkao@aol.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit d4108bf277b411bfdfa0eb12c2172b4035471d8b
Author: Huang Jianan <huangjianan@oppo.com>
Date:   Mon Mar 29 09:23:06 2021 +0800

    erofs: support adjust lz4 history window size

    lz4 uses LZ4_DISTANCE_MAX to record history preservation. When using rolling decompression, a block with a higher compression ratio will cause a larger memory allocation (up to 64k). It may cause a large resource burden in extreme cases on devices with small memory and a large number of concurrent IOs. So appropriately reducing this value can improve performance.

    Decreasing this value will reduce the compression ratio (except when input_size < LZ4_DISTANCE_MAX). But considering that erofs currently only supports 4k output, reducing this value will not significantly reduce the compression benefits.

    The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and we can only reduce this value.
    For the old kernel, it just can't reduce the memory allocation during rolling decompression without affecting the decompression result.

    Link: https://lore.kernel.org/r/20210329012308.28743-3-hsiangkao@aol.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Huang Jianan <huangjianan@oppo.com>
    Signed-off-by: Guo Weichao <guoweichao@oppo.com>
    [ Gao Xiang: introduce struct erofs_sb_lz4_info for configurations. ]
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 89a30917b8f584f34216b053ff5e4b8e1fa1a81a
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Mon Mar 29 09:23:05 2021 +0800

    erofs: introduce erofs_sb_has_xxx() helpers

    Introduce erofs_sb_has_xxx() to make long checks short, especially for later big pcluster & LZMA features.

    Link: https://lore.kernel.org/r/20210329012308.28743-2-hsiangkao@aol.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 83849318acff8125846f2447ed318f80db4dde38
Author: Yue Hu <huyue2@yulong.com>
Date:   Thu Mar 25 15:10:08 2021 +0800

    erofs: don't use erofs_map_blocks() any more

    Currently, erofs_map_blocks() will be called only from erofs_{bmap, read_raw_page} which are all for uncompressed files. So, the compression branch in erofs_map_blocks() is pointless. Let's remove it and use erofs_map_blocks_flatmode() directly. Also update related comments.

    Link: https://lore.kernel.org/r/20210325071008.573-1-zbestahu@gmail.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Yue Hu <huyue2@yulong.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit dd3b7a71fb79a620a8df1138d74c990df27e04a5
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Mon Mar 22 02:32:27 2021 +0800

    erofs: complete a missing case for inplace I/O

    Add a missing case which could cause unnecessary page allocation rather than directly using inplace I/O, which increases the extra runtime memory footprint.

    The detail is, considering an online file-backed page, the right half of the page is chosen to be cached (e.g.
    the end page of a readahead request) and some of its data doesn't exist in managed cache, so the pcluster will be definitely kept in the submission chain. (IOWs, it cannot be decompressed without I/O, e.g., due to the bypass queue).

    Currently, DELAYEDALLOC/TRYALLOC cases can be downgraded as NOINPLACE, and stop online pages from inplace I/O. After this patch, unneeded page allocations won't be observed in pickup_page_for_submission() then.

    Link: https://lore.kernel.org/r/20210321183227.5182-1-hsiangkao@aol.com
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 2195652f604a78eff0b808c94f7c31c0648d42e8
Author: Huang Jianan <huangjianan@oppo.com>
Date:   Wed Mar 17 11:54:47 2021 +0800

    erofs: use workqueue decompression for atomic contexts only

    z_erofs_decompressqueue_endio may not be executed in the atomic context, for example, when dm-verity is turned on. In this scenario, data can be decompressed directly to get rid of additional kworker scheduling overhead.

    Link: https://lore.kernel.org/r/20210317035448.13921-2-huangjianan@oppo.com
    Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Huang Jianan <huangjianan@oppo.com>
    Signed-off-by: Guo Weichao <guoweichao@oppo.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 50a12c462dbc5e3e4d14dc427392fc8e571b1b0b
Author: Huang Jianan <huangjianan@oppo.com>
Date:   Tue Mar 16 11:15:14 2021 +0800

    erofs: avoid memory allocation failure during rolling decompression

    Currently, err would be treated as io error. Therefore, it'd be better to ensure memory allocation during rolling decompression to avoid such io error.

    In the long term, we might consider adding another !Uptodate case for such case.
    Link: https://lore.kernel.org/r/20210316031515.90954-1-huangjianan@oppo.com
    Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Huang Jianan <huangjianan@oppo.com>
    Signed-off-by: Guo Weichao <guoweichao@oppo.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>

commit 5a664357076596a3af1100413bafb00a88dc5ef2
Author: kondors1995 <normandija1945@gmail.com>
Date:   Mon May 9 16:44:49 2022 +0000

    raphael_defconfig: Enable EROFS

commit 2409ea765730e7ca72fcc71dc3989eb37306ed81
Author: Tom Levy <tomlevy93@gmail.com>
Date:   Tue Jul 16 16:30:24 2019 -0700

    include/linux/lz4.h: fix spelling and copy-paste errors in documentation

    Fix a few spelling and grammar errors, and two places where fast/safe in the documentation did not match the function.

    Link: http://lkml.kernel.org/r/20190321014452.13297-1-tomlevy93@gmail.com
    Signed-off-by: Tom Levy <tomlevy93@gmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jiri Kosina <trivial@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>

commit 416572f0ce1a90146cb73dd5ea3667899d0f8241
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Tue May 3 16:09:48 2022 -0400

    erofs: compression fixes

commit 8af69e641af0cd017664fbf2fbd9ce2509b2b8dc
Author: Luan Cachoroski Halaiko <luhalaiko@gmail.com>
Date:   Tue Feb 8 20:20:47 2022 -0300

    erofs: fixes for compilation

    Signed-off-by: Luan Cachoroski Halaiko <luhalaiko@gmail.com>

commit ad81e37ce0d0af5bdb0115a7eccd673c03d293f0
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Wed Dec 9 20:37:17 2020 +0800

    erofs: force inplace I/O under low memory scenario

    Try to forcibly switch to inplace I/O under low memory scenario in order to avoid direct memory reclaim due to cached page allocation.
    Link: https://lore.kernel.org/r/20201209123717.12430-1-hsiangkao@aol.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Change-Id: I8ea2d3b59c68125271f66853cf5dc6ca39e7aaa9

commit e4018facd91f25eb223b94416d1b64f641618577
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Tue Dec 8 17:58:34 2020 +0800

    erofs: simplify try_to_claim_pcluster()

    simplify try_to_claim_pcluster() by directly using cmpxchg() here (the retry loop caused more overhead.) Also, move the chain loop detection in and rename it to z_erofs_try_to_claim_pcluster().

    Link: https://lore.kernel.org/r/20201208095834.3133565-3-hsiangkao@redhat.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Change-Id: I8d091ff44123b099ef199eaa4200a00b8854623f

commit f28d114732f644b4a6445316095db1f0e818472f
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Tue Dec 8 17:58:33 2020 +0800

    erofs: insert to managed cache after adding to pcl

    Previously, it could be some concern to call add_to_page_cache_lru() with page->mapping == Z_EROFS_MAPPING_STAGING (!= NULL).

    In contrast, page->private is used instead now, so partially revert commit 5ddcee1f3a1c ("erofs: get rid of __stagingpage_alloc helper") with some adaption for simplicity.

    Link: https://lore.kernel.org/r/20201208095834.3133565-2-hsiangkao@redhat.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Change-Id: If250d62b47083649e96d0937eb1990b6c84d768f

commit 1a79fe1a476ae08ed0609618951fe863df0ac03a
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Tue Dec 8 17:58:32 2020 +0800

    erofs: get rid of magical Z_EROFS_MAPPING_STAGING

    Previously, we played around with magical page->mapping for short-lived temporary pages since we need to identify different types of pages in the same pcluster but both invalidated and short-lived temporary pages can have page->mapping == NULL.
    It was considered safe because such temporary pages are all non-LRU / non-movable pages.

    This patch uses a specific page->private to identify short-lived pages instead so it won't rely on page->mapping anymore. Details are described in "compress.h" as well.

    Link: https://lore.kernel.org/r/20201208095834.3133565-1-hsiangkao@redhat.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Change-Id: I2c8650e80cb6016ed828d04f89f8bd3512ca3fb2

commit a50789da7af81e73a8cb0081e788cea5543eff5c
Author: Vladimir Zapolskiy <vladimir@tuxera.com>
Date:   Fri Oct 30 14:28:39 2020 +0200

    erofs: remove a void EROFS_VERSION macro set in Makefile

    Since commit 4f761fa253b4 ("erofs: rename errln/infoln/debugln to erofs_{err, info, dbg}") the defined macro EROFS_VERSION has no effect, therefore removing it from the Makefile is a non-functional change.

    Link: https://lore.kernel.org/r/20201030122839.25431-1-vladimir@tuxera.com
    Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Change-Id: Id63ad279985db2a156d62be814bf381c9bea8342

commit d929ef94d4aab35ae96fb6d6efd1a0a23f7d1b48
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date:   Mon Aug 30 11:44:53 2021 +0800

    erofs: move from drivers/staging/ to fs/

    Since 5.4, erofs has been moved into fs/.

    Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    Change-Id: I95dd967a0097629a9d8eaed1dc11e2cd04f47701

commit 2758a8239cc772c63d5463073b44626ee4e7695a
Author: Gao Xiang <hsiangkao@linux.alibaba.com>
Date:   Wed Aug 25 11:42:03 2021 +0800

    erofs: sync up with kernel 5.10

    Backport 5.10 LTS erofs to 4.19.
    Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
    Change-Id: Ibf9c0c47e46090b72e75f09a347100f4ff64f28d

commit 1ee3b56216b0d92e2134d6134d2027c842f495b6
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Mon Mar 29 08:36:14 2021 +0800

    erofs: add unsupported inode i_format check

    commit 24a806d849c0b0c1d0cd6a6b93ba4ae4c0ec9f08 upstream.

    If any unknown i_format fields are set (may be of some new incompat inode features), mark such inode as unsupported.

    Just in case of any new incompat i_format fields added in the future.

    Link: https://lore.kernel.org/r/20210329003614.6583-1-hsiangkao@aol.com
    Fixes: 431339ba9042 ("staging: erofs: add inode operations")
    Cc: <stable@vger.kernel.org> # 4.19+
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 316472dda45a6a8142fc80800fa92f2846911008
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Thu Jul 30 01:58:01 2020 +0800

    erofs: fix extended inode could cross boundary

    commit 0dcd3c94e02438f4a571690e26f4ee997524102a upstream.

    Each ondisk inode should be aligned with inode slot boundary (32-byte alignment) because of nid calculation formula, so all compact inodes (32 byte) cannot cross page boundary. However, extended inode is now 64-byte form, which can cross page boundary in principle if the location is specified on purpose, although it's hard to be generated by mkfs due to the allocation policy and rarely used by Android use case now mainly for > 4GiB files.

    For now, only two fields `i_ctime_nsec` and `i_nlink' couldn't be read from disk properly and cause out-of-bound memory read with random value.

    Let's fix now.

    Fixes: 431339ba9042 ("staging: erofs: add inode operations")
    Cc: <stable@vger.kernel.org> # 4.19+
    Link: https://lore.kernel.org/r/20200729175801.GA23973@xiangao.remote.csb
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    [ Gao Xiang: resolve non-trivial conflicts for latest 4.19.y.
    ]
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ee000f1badb6ca558527d2e99e6130e56fe6acfb
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Sun Nov 1 03:51:02 2020 +0800

    erofs: derive atime instead of leaving it empty

    commit d3938ee23e97bfcac2e0eb6b356875da73d700df upstream.

    EROFS has _only one_ ondisk timestamp (ctime is currently documented and recorded, we might also record mtime instead with a new compat feature if needed) for each extended inode since EROFS isn't mainly for archival purposes so no need to keep all timestamps on disk especially for Android scenarios due to security concerns. Also, romfs/cramfs don't have their own on-disk timestamp, and squashfs only records mtime instead.

    Let's also derive access time from ondisk timestamp rather than leaving it empty, and if mtime/atime for each file are really needed for specific scenarios as well, we can also use xattrs to record them then.

    Link: https://lore.kernel.org/r/20201031195102.21221-1-hsiangkao@aol.com
    [ Gao Xiang: It'd be better to backport for user-friendly concern. ]
    Fixes: 431339ba9042 ("staging: erofs: add inode operations")
    Cc: stable <stable@vger.kernel.org> # 4.19+
    Reported-by: nl6720 <nl6720@gmail.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    [ Gao Xiang: Manually backport to 4.19.y due to trivial conflicts. ]
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 0601575a0ca46c49aaf765aaba9df8c1ce63cc9a
Author: Gao Xiang <hsiangkao@redhat.com>
Date:   Fri Jun 19 07:43:49 2020 +0800

    erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup

    commit 3c597282887fd55181578996dca52ce697d985a5 upstream.

    Hongyu reported "id != index" in z_erofs_onlinepage_fixup() with specific aarch64 environment easily, which wasn't shown before.
    After digging into that, I found that high 32 bits of page->private was set to 0xaaaaaaaa rather than 0 (due to z_erofs_onlinepage_init behavior with specific compiler options). Actually we only use low 32 bits to keep the page information since page->private is only 4 bytes on most 32-bit platforms. However z_erofs_onlinepage_fixup() uses the upper 32 bits by mistake.

    Let's fix it now.

    Reported-and-tested-by: Hongyu Jin <hongyu.jin@unisoc.com>
    Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
    Cc: <stable@vger.kernel.org> # 4.19+
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Link: https://lore.kernel.org/r/20200618234349.22553-1-hsiangkao@aol.com
    Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 02cee974cb788dd6b23837c04e347dbadccb7e67
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Wed Feb 26 16:10:06 2020 +0800

    erofs: correct the remaining shrink objects

    commit 9d5a09c6f3b5fb85af20e3a34827b5d27d152b34 upstream.

    The remaining count should not include successful shrink attempts.

    Fixes: e7e9a307be9d ("staging: erofs: introduce workstation for decompression")
    Cc: <stable@vger.kernel.org> # 4.19+
    Link: https://lore.kernel.org/r/20200226081008.86348-1-gaoxiang25@huawei.com
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit afe022d9f5721497e63d11d3fdb06c95c6256c23
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Sun Dec 1 16:01:09 2019 +0800

    erofs: zero out when listxattr is called with no xattr

    commit 926d1650176448d7684b991fbe1a5b1a8289e97c upstream.

    As David reported [1], ENODATA returns when attempting to modify files by using EROFS as an overlayfs lower layer.

    The root cause is that listxattr could return unexpected -ENODATA by mistake for inodes without xattr. That breaks listxattr return value convention and it can cause copy up failure when used with overlayfs.
    Resolve by zeroing out if no xattr is found for listxattr.

    [1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com
    Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com
    Fixes: cadf1ccf1b00 ("staging: erofs: add error handling for xattr submodule")
    Cc: <stable@vger.kernel.org> # 4.19+
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit fceffbd856369cedfa23b313844d3906de8fd36e
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Wed Oct 9 18:12:39 2019 +0800

    staging: erofs: detect potential multiref due to corrupted images

    commit e12a0ce2fa69798194f3a8628baf6edfbd5c548f upstream.

    As reported by erofs-utils fuzzer, multiref (ondisk deduplication) isn't supported for now, so we should forbid it properly.

    Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
    Cc: <stable@vger.kernel.org> # 4.19+
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Link: https://lore.kernel.org/r/20190821140152.229648-1-gaoxiang25@huawei.com
    [ Gao Xiang: Since earlier kernels don't define EFSCORRUPTED, let's use EIO instead. ]
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9b3495631f1dba2feac41c880e564df6e242c8ab
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Wed Oct 9 18:12:38 2019 +0800

    staging: erofs: add two missing erofs_workgroup_put for corrupted images

    commit 138e1a0990e80db486ab9f6c06bd5c01f9a97999 upstream.

    As reported by erofs-utils fuzzer, these error handling paths will be entered to handle corrupted images.

    The lack of erofs_workgroup_put calls will cause unmounting to fail.

    Fix these return values to EFSCORRUPTED as well.
    Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
    Cc: <stable@vger.kernel.org> # 4.19+
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Link: https://lore.kernel.org/r/20190819103426.87579-4-gaoxiang25@huawei.com
    [ Gao Xiang: Older kernel versions don't have length validity check and EFSCORRUPTED, thus backport pageofs check for now. ]
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 20b9eea304f612a2cff8690eebc57d228e45b95e
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Wed Oct 9 18:12:37 2019 +0800

    staging: erofs: some compressed cluster should be submitted for corrupted images

    commit ee45197c807895e156b2be0abcaebdfc116487c8 upstream.

    As reported by erofs_utils fuzzer, a logical page can belong to at most 2 compressed clusters. If one compressed cluster is corrupted but the other is already in the submission chain, the chain needs to be submitted anyway in order to keep the page working properly (page unlocked with PG_error set, PG_uptodate not set).

    Let's fix it now.

    Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
    Cc: <stable@vger.kernel.org> # 4.19+
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Link: https://lore.kernel.org/r/20190819103426.87579-2-gaoxiang25@huawei.com
    [ Gao Xiang: Manually backport to v4.19.y stable. ]
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c61556faf792f95db0edbee6646fa2f52c8515d1
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Wed Oct 9 18:12:36 2019 +0800

    staging: erofs: fix an error handling in erofs_readdir()

    commit acb383f1dcb4f1e79b66d4be3a0b6f519a957b0d upstream.
    Richard observed a forever loop of erofs_read_raw_page() [1] which can be generated by forcibly setting ->u.i_blkaddr to 0xdeadbeef (to my understanding, the block layer can handle access beyond end of device correctly).

    After digging into that, it seems the problem is highly related with directories and then I found the root cause is an improper error handling in erofs_readdir().

    Let's fix it now.

    [1] https://lore.kernel.org/r/1163995781.68824.1566084358245.JavaMail.zimbra@nod.at/

    Reported-by: Richard Weinberger <richard@nod.at>
    Fixes: 3aa8ec716e52 ("staging: erofs: add directory operations")
    Cc: <stable@vger.kernel.org> # 4.19+
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Link: https://lore.kernel.org/r/20190818125457.25906-1-hsiangkao@aol.com
    [ Gao Xiang: Since earlier kernels don't define EFSCORRUPTED, let's use original error code instead. ]
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 44e25b73c4772f5f08d483bbdcfe81c95758e955
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Thu Jun 13 16:35:41 2019 +0800

    staging: erofs: add requirements field in superblock

    commit 5efe5137f05bbb4688890620934538c005e7d1d6 upstream.

    There are some backward incompatible features pending for months, mainly due to on-disk format expansions.

    However, we should ensure that it cannot be mounted with old kernels. Otherwise, it will cause unexpected behaviors.

    Fixes: ba2b77a82022 ("staging: erofs: add super block operations")
    Cc: <stable@vger.kernel.org> # 4.19+
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c458b3206aa217c67af63679c67cda21d1bb63fd
Author: Gao Xiang <gaoxiang25@huawei.com>
Date:   Fri Mar 29 04:14:58 2019 +0800

    staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir()

    commit 33bac912840fe64dbc15556302537dc6a17cac63 upstream.
After commit 419d6efc50e9, a corrupted image can no longer crash the kernel in the namei path. However, a corrupted nameoff can still do harm during readdir in scenarios without dm-verity. Fix it now. Fixes: 3aa8ec716e52 ("staging: erofs: add directory operations") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 77a2c8cadafb7972b2812c097f518ac3099e8a3b Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 25 11:40:07 2019 +0800 staging: erofs: fix error handling when failed to read compresssed data commit b6391ac73400eff38377a4a7364bd3df5efb5178 upstream. Complete read error handling paths for all three kinds of compressed pages: 1) For cache-managed pages, PG_uptodate will be checked since read_endio will unlock and SetPageUptodate for these pages; 2) For inplaced pages, read_endio cannot SetPageUptodate directly since it should be used to mark the final decompressed data, PG_error will be set with page locked for IO error instead; 3) For staging pages, PG_error is used, which is similar to what we do for inplaced pages. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 74528ff6c38df709674cc676f67e79eac815e23f Author: Chao Yu <yuchao0@huawei.com> Date: Mon Mar 11 23:10:10 2019 +0800 staging: erofs: fix to handle error path of erofs_vmap() commit 8bce6dcede65139a087ff240127e3f3c01363eed upstream. erofs_vmap() wraps vmap() and vm_map_ram() to return virtually contiguous memory, but both of them can fail for many reasons. Previously, erofs_vmap()'s callers didn't handle such failures, which could cause a NULL pointer access. Fix it.
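The caller-side check described for erofs_vmap() can be sketched in user space as follows. This is only an illustration under stated assumptions: sketch_vmap(), sketch_vunmap() and sketch_decompress() are hypothetical names, calloc() stands in for vmap()/vm_map_ram(), and the simulate_failure flag exists purely to exercise the error path. It is not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for erofs_vmap(): like vmap() and vm_map_ram(),
 * it may return NULL when the mapping cannot be set up. */
static void *sketch_vmap(size_t nr_pages, int simulate_failure)
{
	if (simulate_failure || nr_pages == 0)
		return NULL;
	return calloc(nr_pages, SKETCH_PAGE_SIZE);
}

static void sketch_vunmap(void *addr)
{
	free(addr);
}

/* Hypothetical caller: checks the mapping and propagates -ENOMEM
 * instead of dereferencing a NULL pointer. */
static int sketch_decompress(size_t nr_pages, int simulate_failure)
{
	unsigned char *vin = sketch_vmap(nr_pages, simulate_failure);

	if (!vin)
		return -ENOMEM;

	vin[0] = 0;		/* safe: the mapping is known to be valid */
	sketch_vunmap(vin);
	return 0;
}
```

With this shape, sketch_decompress(2, 0) returns 0 and sketch_decompress(2, 1) returns -ENOMEM, so a failed mapping is reported to the caller rather than accessed.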
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Fixes: 0d40d6e399c1 ("staging: erofs: add a generic z_erofs VLE decompressor") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 910cd92ee289977f064971f7659cda0228ec1615 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Fri Nov 23 01:16:00 2018 +0800 staging: erofs: fix race when the managed cache is enabled commit 51232df5e4b268936beccde5248f312a316800be upstream. When the managed cache is enabled, the last reference count of a workgroup must be used for its workstation. Otherwise, it could lead to incorrect (un)freezes in the reclaim path, and it would be harmful. A typical race as follows:

Thread 1 (In the reclaim path)  Thread 2
workgroup_freeze(grp, 1)                                refcnt = 1
...
workgroup_unfreeze(grp, 1)                              refcnt = 1
                                workgroup_get(grp)      refcnt = 2 (x)
workgroup_put(grp)                                      refcnt = 1 (x)
                                ...unexpected behaviors

* grp is detached but still used, which violates cache-managed freeze constraint. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit a906ead6ff3295233d3643d662309cddb7efd896 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 11 14:08:58 2019 +0800 staging: erofs: keep corrupted fs from crashing kernel in erofs_namei() commit 419d6efc50e94bcf5d6b35cd8c71f79edadec564 upstream. As Al pointed out, " ... and while we are at it, what happens to unsigned int nameoff = le16_to_cpu(de[mid].nameoff); unsigned int matched = min(startprfx, endprfx); struct qstr dname = QSTR_INIT(data + nameoff, unlikely(mid >= ndirents - 1) ?
maxsize - nameoff : le16_to_cpu(de[mid + 1].nameoff) - nameoff); /* string comparison without already matched prefix */ int ret = dirnamecmp(name, &dname, &matched); if le16_to_cpu(de[...].nameoff) is not monotonically increasing? I.e. what's to prevent e.g. (unsigned)-1 ending up in dname.len? Corrupted fs image shouldn't oops the kernel.. " Revisit the related lookup flow to address the issue. Fixes: d72d1ce60174 ("staging: erofs: add namei functions") Cc: <stable@vger.kernel.org> # 4.19+ Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 6dbf1a15dcd2f0097d819daa4ee1926b2345d02f Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 11 14:08:57 2019 +0800 staging: erofs: fix race of initializing xattrs of a inode at the same time commit 62dc45979f3f8cb0ea67302a93bff686f0c46c5a upstream. In real scenario, there could be several threads accessing xattrs of the same xattr-uninitialized inode, and init_inode_xattrs() almost at the same time. That's actually an unexpected behavior, this patch closes the race. Fixes: b17500a0fdba ("staging: erofs: introduce xattr & acl support") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 044ba07158562ecf1b2e9079fa97c9980b523eb0 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 11 14:08:56 2019 +0800 staging: erofs: fix memleak of inode's shared xattr array From: Sheng Yong <shengyong1@huawei.com> commit 3b1b5291f79d040d549d7c746669fc30e8045b9b upstream. If it fails to read a shared xattr page, the inode's shared xattr array is not freed. The next time the inode's xattr is accessed, the previously allocated array is leaked. 
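The leak described above follows a common pattern: an array is attached to the inode, a later step fails, and the error path returns without releasing the array. A minimal user-space sketch of the corrected error path is shown below. All names (sketch_inode, sketch_read_shared_page, sketch_init_shared_xattrs) and the fail_at parameter are hypothetical and only illustrate the idea; the real fix lives in the erofs xattr code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified view of the per-inode shared xattr state. */
struct sketch_inode {
	unsigned int xattr_shared_count;
	unsigned int *xattr_shared_xattrs;
};

/* Hypothetical page read: fails with -EIO at index fail_at. */
static int sketch_read_shared_page(unsigned int idx, unsigned int fail_at)
{
	return idx == fail_at ? -EIO : 0;
}

static int sketch_init_shared_xattrs(struct sketch_inode *vi,
				     unsigned int count,
				     unsigned int fail_at)
{
	unsigned int i;

	vi->xattr_shared_count = 0;
	vi->xattr_shared_xattrs = NULL;
	if (count == 0)
		return 0;

	vi->xattr_shared_xattrs = malloc(count * sizeof(unsigned int));
	if (!vi->xattr_shared_xattrs)
		return -ENOMEM;

	for (i = 0; i < count; ++i) {
		int ret = sketch_read_shared_page(i, fail_at);

		if (ret) {
			/* the fix: release the array on the error path so
			 * that a later retry does not leak it */
			free(vi->xattr_shared_xattrs);
			vi->xattr_shared_xattrs = NULL;
			return ret;
		}
		vi->xattr_shared_xattrs[i] = i;
	}
	vi->xattr_shared_count = count;
	return 0;
}
```

After a failed call the pointer is NULL again, so a second attempt starts from a clean state instead of overwriting a still-allocated array.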
Signed-off-by: Sheng Yong <shengyong1@huawei.com> Fixes: b17500a0fdba ("staging: erofs: introduce xattr & acl support") Cc: <stable@vger.kernel.org> # 4.19+ Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 240517d98c12632095f2848bd94c30debdcaf600 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 11 14:08:55 2019 +0800 staging: erofs: fix fast symlink w/o xattr when fs xattr is on commit 7077fffcb0b0b65dc75e341306aeef4d0e7f2ec6 upstream. Currently, this will hit a BUG_ON for these symlinks as follows: - kernel message ------------[ cut here ]------------ kernel BUG at drivers/staging/erofs/xattr.c:59! SMP PTI CPU: 1 PID: 1170 Comm: getllxattr Not tainted 4.20.0-rc6+ #92 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 RIP: 0010:init_inode_xattrs+0x22b/0x270 Code: 48 0f 45 ea f0 ff 4d 34 74 0d 41 83 4c 24 e0 01 31 c0 e9 00 fe ff ff 48 89 ef e8 e0 31 9e ff eb e9 89 e8 e9 ef fd ff ff 0f 0$ <0f> 0b 48 89 ef e8 fb f6 9c ff 48 8b 45 08 a8 01 75 24 f0 ff 4d 34 RSP: 0018:ffffa03ac026bdf8 EFLAGS: 00010246 ------------[ cut here ]------------ ... Call Trace: erofs_listxattr+0x30/0x2c0 ? selinux_inode_listxattr+0x5a/0x80 ? kmem_cache_alloc+0x33/0x170 ? security_inode_listxattr+0x27/0x40 listxattr+0xaf/0xc0 path_listxattr+0x5a/0xa0 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ... ---[ end trace 3c24b49408dc0c72 ]--- Fix it by checking ->xattr_isize in init_inode_xattrs(), and it also fixes improper return value -ENOTSUPP (it should be -ENODATA if xattr is enabled) for those inodes. 
Fixes: b17500a0fdba ("staging: erofs: introduce xattr & acl support") Cc: <stable@vger.kernel.org> # 4.19+ Reported-by: Li Guifu <bluce.liguifu@huawei.com> Tested-by: Li Guifu <bluce.liguifu@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 78544513d768a1559d7e61d5e29270844db027d2 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Mar 11 14:08:54 2019 +0800 staging: erofs: add error handling for xattr submodule commit cadf1ccf1b0021d0b7a9347e102ac5258f9f98c8 upstream. This patch adds the missing error handling code for the xattr submodule, which improves stability in rare cases. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit f1f405af62a3f3b37bf965ddbc3ef5aa2fab2f57 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Feb 27 13:33:30 2019 +0800 staging: erofs: compressed_pages should not be accessed again after freed commit af692e117cb8cd9d3d844d413095775abc1217f9 upstream. This patch resolves the following page use-after-free issue:

z_erofs_vle_unzip:
    ...
    for (i = 0; i < nr_pages; ++i) {
        ...
        z_erofs_onlinepage_endio(page);  (1)
    }

    for (i = 0; i < clusterpages; ++i) {
        page = compressed_pages[i];

        if (page->mapping == mngda)      (2)
            continue;
        /* recycle all individual staging pages */
        (void)z_erofs_gather_if_stagingpage(page_pool, page); (3)
        WRITE_ONCE(compressed_pages[i], NULL);
    }
    ...

After (1) is executed, the page is freed and could then be reused. If compressed_pages is scanned after that, it could fall into (2) or (3) by mistake and end up in a mess. This patch aims to solve the above issue with as few changes as possible in order to make the fix easier to backport.
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit b3a98208a957c0e05850b82ebf7f474ab295ff00 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Feb 27 13:33:31 2019 +0800 staging: erofs: fix illegal address access under memory pressure commit 1e5ceeab6929585512c63d05911d6657064abf7b upstream. Considering a read request with two decompressed file pages, If a decompression work cannot be started on the previous page due to memory pressure but in-memory LTP map lookup is done, builder->work should be still NULL. Moreover, if the current page also belongs to the same map, it won't try to start the decompression work again and then run into trouble. This patch aims to solve the above issue only with little changes as much as possible in order to make the fix backport easier. kernel message is: <4>[1051408.015930s]SLUB: Unable to allocate memory on node -1, gfp=0x2408040(GFP_NOFS|__GFP_ZERO) <4>[1051408.015930s] cache: erofs_compress, object size: 144, buffer size: 144, default order: 0, min order: 0 <4>[1051408.015930s] node 0: slabs: 98, objs: 2744, free: 0 * Cannot allocate the decompression work <3>[1051408.015960s]erofs: z_erofs_vle_normalaccess_readpages, readahead error at page 1008 of nid 5391488 * Note that the previous page was failed to read <0>[1051408.015960s]Internal error: Accessing user space memory outside uaccess.h routines: 96000005 [#1] PREEMPT SMP ... <4>[1051408.015991s]Hardware name: kirin710 (DT) ... <4>[1051408.016021s]PC is at z_erofs_vle_work_add_page+0xa0/0x17c <4>[1051408.016021s]LR is at z_erofs_do_read_page+0x12c/0xcf0 ... 
<4>[1051408.018096s][<ffffff80c6fb0fd4>] z_erofs_vle_work_add_page+0xa0/0x17c <4>[1051408.018096s][<ffffff80c6fb3814>] z_erofs_vle_normalaccess_readpages+0x1a0/0x37c <4>[1051408.018096s][<ffffff80c6d670b8>] read_pages+0x70/0x190 <4>[1051408.018127s][<ffffff80c6d6736c>] __do_page_cache_readahead+0x194/0x1a8 <4>[1051408.018127s][<ffffff80c6d59318>] filemap_fault+0x398/0x684 <4>[1051408.018127s][<ffffff80c6d8a9e0>] __do_fault+0x8c/0x138 <4>[1051408.018127s][<ffffff80c6d8f90c>] handle_pte_fault+0x730/0xb7c <4>[1051408.018127s][<ffffff80c6d8fe04>] __handle_mm_fault+0xac/0xf4 <4>[1051408.018157s][<ffffff80c6d8fec8>] handle_mm_fault+0x7c/0x118 <4>[1051408.018157s][<ffffff80c8c52998>] do_page_fault+0x354/0x474 <4>[1051408.018157s][<ffffff80c8c52af8>] do_translation_fault+0x40/0x48 <4>[1051408.018157s][<ffffff80c6c002f4>] do_mem_abort+0x80/0x100 <4>[1051408.018310s]---[ end trace 9f4009a3283bd78b ]--- Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 14b20a49fc73c4818efa3327451904ef6f9c07ab Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Feb 27 13:33:32 2019 +0800 staging: erofs: fix mis-acted TAIL merging behavior commit a112152f6f3a2a88caa6f414d540bd49e406af60 upstream. EROFS has an optimized path called TAIL merging, which is designed to merge multiple reads and the corresponding decompressions into one if these requests read continuous pages almost at the same time. In general, it behaves as follows:

 ________________________________________________________________
  ... |  TAIL  .  HEAD  |  PAGE  |  PAGE  |  TAIL    . HEAD  | ...
 _____|_combined page A_|________|________|_combined page B_|____
        1  ]  ->  [  2                          ]  ->  [ 3

If the above three reads are requested in the order 1-2-3, it will generate a large work chain rather than 3 individual work chains to reduce scheduling overhead and boost up sequential read. However, if Read 2 is processed slightly earlier than Read 1, currently it still generates 2 individual work chains (chain 1, 2) but it does in-place decompression for combined page A, moreover, if chain 2 decompresses ahead of chain 1, it will be a race and lead to corrupted decompressed page. This patch fixes it. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Cc: <stable@vger.kernel.org> # 4.19+ Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 1931a6c5fe28edd9c62d54dd67806c3806e9cdb7 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Dec 11 15:17:50 2018 +0800 staging: erofs: unzip_vle_lz4.c,utils.c: rectify BUG_ONs commit b8e076a6ef253e763bfdb81e5c72bcc828b0fbeb upstream. remove all redundant BUG_ONs, and turn the rest useful usages to DBG_BUGONs. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 0773d1966061cba2de6b226947470baf88feda72 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Dec 11 15:17:49 2018 +0800 staging: erofs: unzip_{pagevec.h,vle.c}: rectify BUG_ONs commit 70b17991d89554cdd16f3e4fb0179bcc03c808d9 upstream. remove all redundant BUG_ONs, and turn the rest useful usages to DBG_BUGONs.
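The BUG_ON to DBG_BUGON conversions in this series rely on a debug-only assertion plus an ordinary error return for production builds. A rough user-space sketch of that pattern follows. The non-debug definition ((void)(x)) is the one given in the DBG_BUGON definition fix elsewhere in this log; everything else is an assumption made for illustration: SKETCH_EROFS_FS_DEBUG is a hypothetical switch, abort() stands in for the kernel's BUG_ON(), and sketch_get_block() is an invented example function.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#ifdef SKETCH_EROFS_FS_DEBUG
/* debug builds: crash early so developers notice (abort() is a
 * user-space stand-in for BUG_ON) */
#define DBG_BUGON(x)	do { if (x) abort(); } while (0)
#else
/* production builds: evaluate and discard, which also keeps variables
 * used only in the assertion from triggering unused warnings */
#define DBG_BUGON(x)	((void)(x))
#endif

/* Hypothetical lookup: a debug assertion plus a real error return. */
static int sketch_get_block(const int *blocks, int nblocks, int idx, int *out)
{
	const int out_of_range = (idx < 0 || idx >= nblocks);

	DBG_BUGON(out_of_range);
	if (out_of_range)
		return -EINVAL;	/* non-debug builds report an error */

	*out = blocks[idx];
	return 0;
}
```

Built without SKETCH_EROFS_FS_DEBUG, an out-of-range index yields -EINVAL rather than a crash, which is the behavior these cleanups aim for on corrupted input.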
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 6a00c9d7066562e418e30b1b211c77aed5c40551 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Dec 5 21:23:13 2018 +0800 staging: erofs: {dir,inode,super}.c: rectify BUG_ONs commit 8b987bca2d09649683cbe496419a011df8c08493 upstream. remove all redundant BUG_ONs, and turn the rest useful usages to DBG_BUGONs. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ef609890e1f8f27546f25d058bcaeb3c5a7a982f Author: Gao Xiang <gaoxiang25@huawei.com> Date: Fri Nov 23 01:16:03 2018 +0800 staging: erofs: add a full barrier in erofs_workgroup_unfreeze commit 948bbdb1818b7ad6e539dad4fbd2dd4650793ea9 upstream. Just like other generic locks, insert a full barrier in case of memory reorder. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit e88d7d9adb52d0f9ba8028c6b4a13e7e83d743a5 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Fri Nov 23 01:16:02 2018 +0800 staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}' commit 73f5c66df3e26ab750cefcb9a3e08c71c9f79cad upstream. There are two minor issues in the current freeze interface: 1) Freeze interfaces have not related with CONFIG_DEBUG_SPINLOCK, therefore fix the incorrect conditions; 2) For SMP platforms, it should also disable preemption before doing atomic_cmpxchg in case that some high priority tasks preempt between atomic_cmpxchg and disable_preempt, then spin on the locked refcount later. 
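The freeze and unfreeze pair discussed above can be approximated in user space with C11 atomics. This is a sketch under assumptions, not the kernel code: SKETCH_LOCKED_MAGIC is a hypothetical sentinel, the struct and function names are invented, preempt_disable()/preempt_enable() have no user-space equivalent and appear only as comments, and a sequentially consistent store stands in for the full barrier that the unfreeze commit adds.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel meaning "refcount is frozen"; the kernel uses
 * its own magic value. */
#define SKETCH_LOCKED_MAGIC	INT_MIN

struct sketch_workgroup {
	atomic_int refcount;
};

/* Freeze succeeds only if the refcount equals the expected value. */
static bool sketch_try_to_freeze(struct sketch_workgroup *grp, int val)
{
	int expected = val;

	/* kernel code would call preempt_disable() here, before the cmpxchg */
	if (!atomic_compare_exchange_strong(&grp->refcount, &expected,
					    SKETCH_LOCKED_MAGIC)) {
		/* kernel code would call preempt_enable() here */
		return false;
	}
	return true;
}

static void sketch_unfreeze(struct sketch_workgroup *grp, int orig_val)
{
	/* sequentially consistent store: work done while frozen is
	 * ordered before the refcount becomes visible again */
	atomic_store(&grp->refcount, orig_val);
	/* kernel code would call preempt_enable() here */
}
```

A second freeze attempt on an already frozen group fails because the refcount no longer matches the expected value; in the kernel, disabling preemption before the cmpxchg keeps the holder from being preempted while others spin on the locked refcount.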
Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 26b9413853f64a44d858c24bc2b4c834a2e6a1fc Author: Gao Xiang <gaoxiang25@huawei.com> Date: Fri Nov 23 01:16:01 2018 +0800 staging: erofs: atomic_cond_read_relaxed on ref-locked workgroup commit df134b8d17b90c1e7720e318d36416b57424ff7a upstream. It's better to use atomic_cond_read_relaxed, which is implemented in hardware instructions to monitor a variable changes currently for ARM64, instead of open-coded busy waiting. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 28e3fa73e294002f8e7c48b6e9ea92784bf9e21a Author: Gao Xiang <gaoxiang25@huawei.com> Date: Sat Nov 3 17:23:56 2018 +0800 staging: erofs: remove the redundant d_rehash() for the root dentry commit e9c892465583c8f42d61fafe30970d36580925df upstream. There is actually no need at all to d_rehash() for the root dentry as Al pointed out, fix it. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit e3e7bbe526acfac4307a2a6d7e2aaf5222ea88de Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Sep 19 13:49:07 2018 +0800 staging: erofs: drop multiref support temporarily commit e5e3abbadf0dbd1068f64f8abe70401c5a178180 upstream. Multiref support means that a compressed page could have more than one reference, which is designed for on-disk data deduplication. However, mkfs doesn't support this mode at this moment, and the kernel implementation is also broken. Let's drop multiref support. If it is fully implemented in the future, it can be reverted later. 
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 2dd8bd1abced431fa4be477299fa9ddce4677642 Author: Chen Gong <gongchen4@huawei.com> Date: Tue Sep 18 22:27:28 2018 +0800 staging: erofs: replace BUG_ON with DBG_BUGON in data.c commit 9141b60cf6a53c99f8a9309bf8e1c6650a6785c1 upstream. This patch replaces BUG_ON with DBG_BUGON in data.c, and adds the necessary error handling. Signed-off-by: Chen Gong <gongchen4@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit a14a5cf712938fadd39fb99a8f8a46d72b19cd4d Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Sep 18 22:27:25 2018 +0800 staging: erofs: complete error handing of z_erofs_do_read_page commit 1e05ff36e6921ca61bdbf779f81a602863569ee3 upstream. This patch completes the error handling code of z_erofs_do_read_page. PG_error will be set when a read error happens, so z_erofs_onlinepage_endio will unlock the page without setting PG_uptodate. Reviewed-by: Chao Yu <yucxhao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 381d39d1c2d471e4c318320bae60806c5d0b04bd Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Sep 18 22:25:36 2018 +0800 staging: erofs: fix a bug when appling cache strategy commit 0734ffbf574ee813b20899caef2fe0ed502bb783 upstream. As described in Kconfig, the last compressed pack should be cached for further reading for either `EROFS_FS_ZIP_CACHE_UNIPOLAR' or `EROFS_FS_ZIP_CACHE_BIPOLAR' by design. However, there is a bug in z_erofs_do_read_page: it switches `initial' to `false' at the very beginning, before it decides whether to cache the last compressed pack. The caching strategy should work properly after applying this patch.
Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 3dc0616d60bcc3888f5dcf4585bcc5e2131a64df Author: Gao Xiang <gaoxiang25@huawei.com> Date: Fri Nov 23 01:15:59 2018 +0800 staging: erofs: fix the definition of DBG_BUGON [ Upstream commit eef168789866514e5d4316f030131c9fe65b643f ] It's better not to BUG_ON the kernel outright; however, developers need a way to locate issues as soon as possible. DBG_BUGON is introduced for this, and it can only crash when EROFS_FS_DEBUG (an EROFS development feature) is on. It helps developers find and fix bugs quickly with eng builds. Previously, DBG_BUGON was defined as ((void)0) if EROFS_FS_DEBUG was off, but unused variable warnings such as the following could occur:

drivers/staging/erofs/unzip_vle.c: In function `init_always':
drivers/staging/erofs/unzip_vle.c:61:33: warning: unused variable `work' [-Wunused-variable]
  struct z_erofs_vle_work *const work =
                                 ^~~~

Fix it to #define DBG_BUGON(x) ((void)(x)). Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> commit 92c97ef11b111b764dc92c5edaf9385f74c72e7d Author: Gao Xiang <gaoxiang25@huawei.com> Date: Sat Dec 8 00:19:12 2018 +0800 staging: erofs: fix use-after-free of on-stack `z_erofs_vle_unzip_io' [ Upstream commit 848bd9acdcd00c164b42b14aacec242949ecd471 ] The root cause is the race as follows:

 Thread #0                      Thread #1

 z_erofs_vle_unzip_kickoff      z_erofs_submit_and_unzip

                                struct z_erofs_vle_unzip_io io[]
   atomic_add_return()
                                wait_event()
                                [end of function]
   wake_up()

Fix it by taking the waitqueue lock between atomic_add_return and wake_up to close this race. kernel message: Unable to handle kernel paging request at virtual address 97f7052caa1303dc ...
Workqueue: kverityd verity_work task: ffffffe32bcb8000 task.stack: ffffffe3298a0000 PC is at __wake_up_common+0x48/0xa8 LR is at __wake_up+0x3c/0x58 ... Call trace: ... [<ffffff94a08ff648>] __wake_up_common+0x48/0xa8 [<ffffff94a08ff8b8>] __wake_up+0x3c/0x58 [<ffffff94a0c11b60>] z_erofs_vle_unzip_kickoff+0x40/0x64 [<ffffff94a0c118e4>] z_erofs_vle_read_endio+0x94/0x134 [<ffffff94a0c83c9c>] bio_endio+0xe4/0xf8 [<ffffff94a1076540>] dec_pending+0x134/0x32c [<ffffff94a1076f28>] clone_endio+0x90/0xf4 [<ffffff94a0c83c9c>] bio_endio+0xe4/0xf8 [<ffffff94a1095024>] verity_work+0x210/0x368 [<ffffff94a08c4150>] process_one_work+0x188/0x4b4 [<ffffff94a08c45bc>] worker_thread+0x140/0x458 [<ffffff94a08cad48>] kthread+0xec/0x108 [<ffffff94a0883ab4>] ret_from_fork+0x10/0x1c Code: d1006273 54000260 f9400804 b9400019 (b85fc081) ---[ end trace be9dde154f677cd1 ]--- Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> commit 323056dc5fbe4768311194a3a2adf14806f25074 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Sep 18 22:25:33 2018 +0800 staging: erofs: fix a missing endian conversion [ Upstream commit 37ec35a6cc2b99eb7fd6b85b7d7b75dff46bc353 ] This patch fixes a missing endian conversion in vle_get_logical_extent_head. 
Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 69f2b4eaba237770f5c696942595d064ae3340f8 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Sep 6 17:01:47 2018 +0800 staging: erofs: rename superblock flags (MS_xyz -> SB_xyz) This patch follows commit 1751e8a6cb93 ("Rename superblock flags (MS_xyz -> SB_xyz)") and after commit ("vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled"), there is no MS_RDONLY and MS_NOATIME at all. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 1621b077d53285bd5127532ce160cec69adbe660 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Tue Aug 28 11:39:48 2018 +0800 Revert "staging: erofs: disable compiling temporarile" This reverts commit 156c3df8d4db4e693c062978186f44079413d74d. Since XArray and the new mount apis aren't merged in 4.19-rc1 merge window, the BROKEN mark can be reverted directly without any problems. Fixes: 156c3df8d4db ("staging: erofs: disable compiling temporarile") Cc: Matthew Wilcox <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 3bbdccddb4ee53c0b81545226439f231ae698f65 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Aug 6 11:27:53 2018 +0800 staging: erofs: remove an extra semicolon in z_erofs_vle_unzip_all There is an extra semicolon in z_erofs_vle_unzip_all, remove it. 
Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ee25ad8cd5b803ae4cda0116304ff15383cb6881 Author: Kristaps Čivkulis <kristaps.civkulis@gmail.com> Date: Sun Aug 5 18:21:01 2018 +0300 staging: erofs: fix if assignment style issue Fix coding style issue "do not use assignment in if condition" detected by checkpatch.pl. Signed-off-by: Kristaps Čivkulis <kristaps.civkulis@gmail.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 81d71d6e9a330f4471f3edec412d6124031eac46 Author: Chao Yu <yuchao0@huawei.com> Date: Thu Aug 2 17:39:17 2018 +0800 staging: erofs: disable compiling temporarile As Stephen Rothwell reported: "After merging the staging tree, today's linux-next build (x86_64 allmodconfig) failed like this: drivers/staging/erofs/super.c: In function 'erofs_read_super': drivers/staging/erofs/super.c:343:17: error: 'MS_RDONLY' undeclared (first use in this function); did you mean 'IS_RDONLY'? sb->s_flags |= MS_RDONLY | MS_NOATIME; ^~~~~~~~~ IS_RDONLY drivers/staging/erofs/super.c:343:17: note: each undeclared identifier is reported only once for each function it appears in drivers/staging/erofs/super.c:343:29: error: 'MS_NOATIME' undeclared (first use in this function); did you mean 'S_NOATIME'? 
sb->s_flags |= MS_RDONLY | MS_NOATIME; ^~~~~~~~~~ S_NOATIME drivers/staging/erofs/super.c: In function 'erofs_mount': drivers/staging/erofs/super.c:501:10: warning: passing argument 5 of 'mount_bdev' makes integer from pointer without a cast [-Wint-conversion] &priv, erofs_fill_super); ^~~~~~~~~~~~~~~~ In file included from include/linux/buffer_head.h:12:0, from drivers/staging/erofs/super.c:14: include/linux/fs.h:2151:23: note: expected 'size_t {aka long unsigned int}' but argument is of type 'int (*)(struct super_block *, void *, int)' extern struct dentry *mount_bdev(struct file_system_type *fs_type, ^~~~~~~~~~ drivers/staging/erofs/super.c:500:9: error: too few arguments to function 'mount_bdev' return mount_bdev(fs_type, flags, dev_name, ^~~~~~~~~~ In file included from include/linux/buffer_head.h:12:0, from drivers/staging/erofs/super.c:14: include/linux/fs.h:2151:23: note: declared here extern struct dentry *mount_bdev(struct file_system_type *fs_type, ^~~~~~~~~~ drivers/staging/erofs/super.c: At top level: drivers/staging/erofs/super.c:518:20: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types] .mount = erofs_mount, ^~~~~~~~~~~ drivers/staging/erofs/super.c:518:20: note: (near initialization for 'erofs_fs_type.mount') drivers/staging/erofs/super.c: In function 'erofs_remount': drivers/staging/erofs/super.c:630:12: error: 'MS_RDONLY' undeclared (first use in this function); did you mean 'IS_RDONLY'? *flags |= MS_RDONLY; ^~~~~~~~~ IS_RDONLY drivers/staging/erofs/super.c: At top level: drivers/staging/erofs/super.c:640:16: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types] .remount_fs = erofs_remount, ^~~~~~~~~~~~~ Caused by various commits creating erofs in the staging tree interacting with various commits redoing the mount infrastructure in the vfs tree. 
I have disabled CONFIG_EROFS_FS for now." The reason for the compile errors is that -next collects and merges in-development patches, including common vfs changes, from multiple trees, but those patches didn't cover erofs, such as: ("vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled") https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=for-next&id=109b45090d7d3ce2797bb1ef7f70eead5bfe0ff3 ("vfs: Require specification of size of mount data for internal mounts") https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=for-next&id=0a191e4505a4f255e6513b49426213da69bf0e80 The above vfs-related patches have not been merged into the staging tree, so if we submit the matching erofs patches to the staging mailing list and they are included in the staging-{test,nexts} trees, they can easily cause compile errors. We worked out some patches to adapt to those vfs changes, but for now we only submit them to the -next tree temporarily to avoid compile errors. To avoid potential conflicts between erofs and vfs changes in the incoming merge window, Stephen suggested that we disable CONFIG_EROFS_FS temporarily to get through the merge window, and afterwards restore it by re-enabling CONFIG_EROFS_FS and applying those fixing patches. Greg also confirmed this solution. So, let's disable compiling erofs for a while. Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 2d4499c8b8b78dd00788a7513373d0900013e850 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Aug 1 14:38:31 2018 +0800 staging: erofs: remove a redundant marco in xattr There is no need for '#if CONFIG_EROFS_FS_XATTR' in xattr.c, let's remove it.
Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 8eaefd9be86fa3d85305f05192568ba1507dab75 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Wed Aug 1 17:36:54 2018 +0800 staging: erofs: add the missing break in z_erofs_map_blocks_iter This patch adds the break that was missed when the default case was added. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit b79d82f61f25e532b98f7d3a3d49b250f1728e0d Author: Gao Xiang <gaoxiang25@huawei.com> Date: Mon Jul 30 09:51:01 2018 +0800 staging: erofs: use the wrapped PTR_ERR_OR_ZERO instead of open code Just a cleanup; the logic doesn't change. Link: https://lists.01.org/pipermail/kbuild-all/2018-July/050766.html Fixes: d72d1ce60174 ("staging: erofs: add namei functions") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ef38dd74d8389a7474b0f947185f347e24419686 Author: Gao Xiang <hsiangkao@aol.com> Date: Sun Jul 29 13:37:57 2018 +0800 staging: erofs: fix conditional uninitialized `pcn' in z_erofs_map_blocks_iter This patch adds error handling code to z_erofs_map_blocks_iter to fix the compiler warning. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 294e8e93272dcbdabdd6e033e4d14bbfe6d91bb7 Author: Gao Xiang <hsiangkao@aol.com> Date: Sun Jul 29 13:34:58 2018 +0800 staging: erofs: fix compile error without built-in decompression support This patch fixes incorrect code snippets introduced by mistake when the code was split into small patches.
Link: https://lists.01.org/pipermail/kbuild-all/2018-July/050747.html Link: https://lists.01.org/pipermail/kbuild-all/2018-July/050750.html Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit db6fedf04cecf5fa3d78a941fd068d581813dfa0 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Sat Jul 28 15:10:32 2018 +0800 staging: erofs: fix a compile warning of Z_EROFS_VLE_VMAP_ONSTACK_PAGES There is a type mismatch in the definition of Z_EROFS_VLE_VMAP_ONSTACK_PAGES, let's fix it. Link: https://lists.01.org/pipermail/kbuild-all/2018-July/050707.html Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 73c620c52e51ff2bf93cf02509cdbd9da3d50220 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:08 2018 +0800 staging: erofs: add a TODO and update MAINTAINERS for staging This patch adds a TODO to list the things to be done, and the relevant info to MAINTAINERS so we can take all the blame :) Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit fd66e0b7e7510165f9c2214a0e68d7025f8b8d83 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:07 2018 +0800 staging: erofs: introduce cached decompression This patch adds an optional choice which can be enabled by users in order to cache both incomplete ends of compressed clusters as a complement to the in-place decompression in order to boost random read, but it costs more memory than the in-place decompression only. 
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit e84127077ff509f7204888244cf848bf9cddd794 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:06 2018 +0800 staging: erofs: introduce VLE decompression support This patch introduces the basic in-place VLE decompression implementation for the erofs file system. Compared with fixed-sized input compression, it implements what we call 'the variable-length extent compression' which specifies the same output size for each compression block to make the full use of IO bandwidth (which means almost all data from block device can be directly used for decomp- ression), improve the real (rather than just via data caching, which costs more memory) random read and keep the relatively lower compression ratios (it saves more storage space than fixed-sized input compression which is also configured with the same input block size), as illustrated below: |--- variable-length extent ---|------ VLE ------|--- VLE ---| /> clusterofs /> clusterofs /> clusterofs /> clusterofs ++---|-------++-----------++---------|-++-----------++-|---------++-| ...|| | || || | || || | || | ... original data ++---|-------++-----------++---------|-++-----------++-|---------++-| ++->cluster<-++->cluster<-++->cluster<-++->cluster<-++->cluster<-++ size size size size size \ / / / \ / / / \ / / / ++-----------++-----------++-----------++ ... || || || || ... compressed clusters ++-----------++-----------++-----------++ ++->cluster<-++->cluster<-++->cluster<-++ size size size The main point of 'in-place' refers to the decompression mode: Instead of allocating independent compressed pages and data structures, it reuses the allocated file cache pages at most to store its compressed data and the corresponding pagevec in a time-sharing approach by default, which will be useful for low memory scenario. 
In the end, unlike the other filesystems with (de)compression support using a relatively large compression block size, which reads and decompresses >= 128KB at once, and gains a more good-looking random read (In fact it collects small random reads into large sequential reads and caches all decompressed data in memory, but it is unacceptable especially for embedded devices with limited memory, and it is not the real random read), we select a universal small-sized 4KB compressed cluster, which is the smallest page size for most architectures, and all compressed clusters can be read and decompressed independently, which ensures random read number for all use cases. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ab43173ff3316c0120f9b2c3abc325a18773f30f Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:05 2018 +0800 staging: erofs: introduce workstation for decompression This patch introduces another concept used by the unzip subsystem called 'workstation'. It can be seen as a sparse array that stores pointers pointed to data structures related to the corresponding physical blocks. All lookup cases are protected by RCU read lock. Besides, reference count and spin_lock are also introduced to manage its lifetime and serialize all update operations. 'workstation' is currently implemented on the in-kernel radix tree approach for backward compatibility. With the evolution of linux kernel, it could be migrated into XArray implementation in the future. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 84c882ba349e57fa654b0a52d6529bff5c18c0e0 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:04 2018 +0800 staging: erofs: introduce erofs shrinker This patch adds a dedicated shrinker targeting to free unneeded memory consumed by a number of erofs in-memory data structures. 
Like F2FS and UBIFS, it also adds: - sbi->umount_mutex to avoid races on shrinker and put_super - sbi->shrinker_run_no to not revisit recently scanned objects Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit dc98494e64df2c56c3d6658f60a86b257e9735a3 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:03 2018 +0800 staging: erofs: introduce superblock registration In order to introduce a shrinker solution for erofs, let's manage all mounted erofs instances first. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 8ded5dd185d595bc3664cabc5de54b84021d3314 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:02 2018 +0800 staging: erofs: add a generic z_erofs VLE decompressor Currently, this patch only implements an LZ4 decompressor due to its development priority. In the future, erofs will support more compression algorithms and formats other than LZ4, thus a generic decompressor interface will be needed. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 6406d5e0a4a3a6e88c6898268c36e421d1c5006b Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:01 2018 +0800 staging: erofs: introduce a customized LZ4 decompression We have to reduce the memory cost as much as possible, so we don't want to decompress more data beyond the output buffer size. However, "LZ4_decompress_safe_partial" isn't guaranteed to stop at an arbitrary end position; it stops just after its current LZ4 "sequence" is completed. Link: https://groups.google.com/forum/#!topic/lz4c/_3kkz5N6n00 Therefore, I hacked the LZ4 decompression logic by hand, probably NOT the fastest approach, and hope for better implementation.
Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit c21aeb7e5feca41005feac999d4cf446dc65a701 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:22:00 2018 +0800 staging: erofs: globalize prepare_bio and __submit_bio The unzip subsystem also uses these functions, let's export them to internal.h. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit a5908581d539ef37d1d390e3ad647216440c0ace Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:59 2018 +0800 staging: erofs: add erofs_allocpage This patch introduces a temporary _on-stack_ page pool to reuse freed pages directly as much as it can for better performance and to release all pages at once; it also slightly reduces the possibility of memory allocation failure. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit bbd3e12ab2521a7c982ac0707bf8da7a0d22653b Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:58 2018 +0800 staging: erofs: add erofs_map_blocks_iter This patch introduces an iterable L2P mapping operation 'erofs_map_blocks_iter'. Compared with 'erofs_map_blocks', it avoids a number of redundant 'release and regrab' processes if they request the same meta page.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 70622cae335b9140e4358d7084c10cfb3da3301c Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:57 2018 +0800 staging: erofs: introduce pagevec for unzip subsystem For each compressed cluster, there is a straight-forward way of allocating a fixed or variable-sized (for VLE) array to record the corresponding file pages for its decompression if we decide to decompress these pages asynchronously (eg. read-ahead case), however it could take much extra on-heap memory compared with traditional uncompressed filesystems. This patch introduces a pagevec solution to reuse some allocated file page in the time-sharing approach storing parts of the array itself in order to minimize the extra memory overhead, thus only a constant and small-sized array used for booting the whole array itself up will be needed. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ff29dac3b4b402729a0b75b8724793158701b1f5 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:56 2018 +0800 staging: erofs: <linux/tagptr.h>: introduce tagged pointer Currently kernel has scattered tagged pointer usages hacked by hand in plain code, without a unique and portable functionset to highlight the tagged pointer itself and wrap these hacked code in order to clean up all over meaningless magic masks. Therefore, this patch introduces simple generic methods to fold tags into a pointer integer. It currently supports the last n bits of the pointer for tags, which can be selected by users. In addition, it will also be used for the upcoming EROFS filesystem, which heavily uses tagged pointer approach for high performance and reducing extra memory allocation. 
Link: https://en.wikipedia.org/wiki/Tagged_pointer Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 153f5ad87b67c45b453790ce206ced7c6cc62609 Author: Chao Yu <yuchao0@huawei.com> Date: Thu Jul 26 20:21:55 2018 +0800 staging: erofs: support tracepoint Add basic tracepoints for ->readpage{,s}, ->lookup, ->destroy_inode, fill_inode and map_blocks. Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 98dd1e3a3f42df26003ae86fd1767b03bef433a6 Author: Chao Yu <yuchao0@huawei.com> Date: Thu Jul 26 20:21:54 2018 +0800 staging: erofs: introduce error injection infrastructure This patch introduces error injection infrastructure, with it, we can inject error in any kernel exported common functions which erofs used, so that it can force erofs running into error paths, it turns out that tests can cover real rare paths more easily to find bugs. Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 220c7448cdc4c38e5177777de23793e653969904 Author: Chao Yu <yuchao0@huawei.com> Date: Thu Jul 26 20:21:53 2018 +0800 staging: erofs: support special inode This patch adds to support special inode, such as block dev, char, socket, pipe inode. Reviewed-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 7ed68385c49ac127e5baa07220d37cbf937e89d9 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:52 2018 +0800 staging: erofs: introduce xattr & acl support This implements xattr and acl functionalities. Inline and shared xattrs are introduced for flexibility. 
Specifically, if the same xattr occurs for many times in a large number of inodes or the value of a xattr is so large that it isn't suitable to be inlined, a shared xattr kept in the xattr meta will be used instead. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit db9bea5cf638b0683376b4118754dad0d444dd7c Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:51 2018 +0800 staging: erofs: update Kconfig and Makefile This commit adds Makefile and Kconfig for erofs, and updates Makefile and Kconfig files in the fs directory. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit afad040452afed7d552fb853c8613c6002e17ccb Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:50 2018 +0800 staging: erofs: add namei functions This commit adds functions that transfer names to inodes. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 4e7097e1a4a0170e8e51866a1242cc9556dcca5d Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:49 2018 +0800 staging: erofs: add directory operations This adds functions for directory, mainly readdir. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 421bfd9b50b8051aa451be073f6387bc678cccab Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:48 2018 +0800 staging: erofs: add inode operations This adds core functions to get, read an inode. 
Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 944a5ab5bd4fc480e4099c8e5e97a6dca490a239 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:47 2018 +0800 staging: erofs: add raw address_space operations This commit adds functions for meta and raw data, and also provides address_space_operations for raw data access. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit 8305bea76c9178ce211e5759061d10effbba958e Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:46 2018 +0800 staging: erofs: add super block operations This commit adds erofs super block operations, including (u)mount, remount_fs, show_options, statfs, in addition to some private icache management functions. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit ae2a66470bd70480e4953ff12bc96902e0b59617 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:45 2018 +0800 staging: erofs: add erofs in-memory stuffs - erofs_sb_info: contains erofs-specific in-memory information. - erofs_vnode: contains vfs_inode and other fs-specific information. same as super block, the only one in-memory definition exists. 
- erofs_map_blocks plays a role in the file L2P mapping Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> commit dd48cb6b27dc6c5977e59e05c59209ddb68c2f51 Author: Gao Xiang <gaoxiang25@huawei.com> Date: Thu Jul 26 20:21:44 2018 +0800 staging: erofs: add on-disk layout This commit adds the on-disk layout header file of erofs. Note that the on-disk layout is still WIP, and some fields are reserved for the future use by design. Any comments are welcome. Thanks-to: Li Guifu <liguifu2@huawei.com> Thanks-to: Sun Qiuyang <sunqiuyang@huawei.com> Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
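The tagged-pointer commit in the log above describes folding a few tag bits into the low bits of a pointer. A minimal user-space sketch of that idea follows. The names `tp_t`, `tp_fold`, `tp_ptr`, `tp_tags` and the choice of 2 tag bits are hypothetical and chosen for illustration only; this is not the `<linux/tagptr.h>` interface, and it assumes the pointed-to object is at least 4-byte aligned so the low bits are free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the <linux/tagptr.h> API. */
#define TP_TAG_BITS 2u
#define TP_TAG_MASK ((uintptr_t)((1u << TP_TAG_BITS) - 1u))

/* Wrap the folded value in a struct so it cannot be dereferenced by accident. */
typedef struct { uintptr_t v; } tp_t;

/* Fold a small tag into the low bits of a suitably aligned pointer. */
static tp_t tp_fold(void *ptr, unsigned int tag)
{
	uintptr_t raw = (uintptr_t)ptr;

	assert((raw & TP_TAG_MASK) == 0);	/* low bits must be free */
	assert(tag <= TP_TAG_MASK);		/* tag must fit in TP_TAG_BITS */
	return (tp_t){ raw | tag };
}

/* Recover the original pointer by masking off the tag bits. */
static void *tp_ptr(tp_t t)
{
	return (void *)(t.v & ~TP_TAG_MASK);
}

/* Recover the tag stored in the low bits. */
static unsigned int tp_tags(tp_t t)
{
	return (unsigned int)(t.v & TP_TAG_MASK);
}
```

Folding `&obj` with tag 3 and then calling `tp_ptr()` and `tp_tags()` should give back `&obj` and 3 respectively. Keeping the mask in one place is the point of the commit's argument: callers no longer scatter magic masks through plain code.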
3659 lines
94 KiB
C
/*
 * linux/fs/buffer.c
 *
 * Copyright (C) 1991, 1992, 2002 Linus Torvalds
 */

/*
 * Start bdflush() with kernel_thread not syscall - Paul Gortmaker, 12/95
 *
 * Removed a lot of unnecessary code and simplified things now that
 * the buffer cache isn't our primary cache - Andrew Tridgell 12/96
 *
 * Speed up hash, lru, and free list operations. Use gfp() for allocating
 * hash table, use SLAB cache for buffer heads. SMP threading. -DaveM
 *
 * Added 32k buffer block sizes - these are required older ARM systems. - RMK
 *
 * async buffer flushing, 1999 Andrea Arcangeli <andrea@suse.de>
 */

#include <linux/kernel.h>
#include <linux/sched/signal.h>
#include <linux/syscalls.h>
#include <linux/fs.h>
#include <linux/iomap.h>
#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/capability.h>
#include <linux/blkdev.h>
#include <linux/file.h>
#include <linux/quotaops.h>
#include <linux/highmem.h>
#include <linux/export.h>
#include <linux/backing-dev.h>
#include <linux/writeback.h>
#include <linux/hash.h>
#include <linux/suspend.h>
#include <linux/buffer_head.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/bio.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
#include <linux/pagevec.h>
#include <trace/events/block.h>
#include <linux/fscrypt.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
			 enum rw_hint hint, struct writeback_control *wbc);

#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)

void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
{
	bh->b_end_io = handler;
	bh->b_private = private;
}
EXPORT_SYMBOL(init_buffer);

inline void touch_buffer(struct buffer_head *bh)
{
	trace_block_touch_buffer(bh);
	mark_page_accessed(bh->b_page);
}
EXPORT_SYMBOL(touch_buffer);

void __lock_buffer(struct buffer_head *bh)
{
	wait_on_bit_lock_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_buffer);

void unlock_buffer(struct buffer_head *bh)
{
	clear_bit_unlock(BH_Lock, &bh->b_state);
	smp_mb__after_atomic();
	wake_up_bit(&bh->b_state, BH_Lock);
}
EXPORT_SYMBOL(unlock_buffer);

/*
 * Returns if the page has dirty or writeback buffers. If all the buffers
 * are unlocked and clean then the PageDirty information is stale. If
 * any of the buffers are locked, it is assumed they are locked for IO.
 */
void buffer_check_dirty_writeback(struct page *page,
				  bool *dirty, bool *writeback)
{
	struct buffer_head *head, *bh;
	*dirty = false;
	*writeback = false;

	BUG_ON(!PageLocked(page));

	if (!page_has_buffers(page))
		return;

	if (PageWriteback(page))
		*writeback = true;

	head = page_buffers(page);
	bh = head;
	do {
		if (buffer_locked(bh))
			*writeback = true;

		if (buffer_dirty(bh))
			*dirty = true;

		bh = bh->b_this_page;
	} while (bh != head);
}
EXPORT_SYMBOL(buffer_check_dirty_writeback);

/*
 * Block until a buffer comes unlocked. This doesn't stop it
 * from becoming locked again - you have to lock it yourself
 * if you want to preserve its state.
 */
void __wait_on_buffer(struct buffer_head * bh)
{
	wait_on_bit_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__wait_on_buffer);

static void
__clear_page_buffers(struct page *page)
{
	ClearPagePrivate(page);
	set_page_private(page, 0);
	put_page(page);
}

static void buffer_io_error(struct buffer_head *bh, char *msg)
{
	if (!test_bit(BH_Quiet, &bh->b_state))
		printk_ratelimited(KERN_ERR
			"Buffer I/O error on dev %pg, logical block %llu%s\n",
			bh->b_bdev, (unsigned long long)bh->b_blocknr, msg);
}

/*
 * End-of-IO handler helper function which does not touch the bh after
 * unlocking it.
 * Note: unlock_buffer() sort-of does touch the bh after unlocking it, but
 * a race there is benign: unlock_buffer() only uses the bh's address for
 * hashing after unlocking the buffer, so it doesn't actually touch the bh
 * itself.
 */
static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		/* This happens, due to failed read-ahead attempts. */
		clear_buffer_uptodate(bh);
	}
	unlock_buffer(bh);
}

/*
 * Default synchronous end-of-IO handler. Just mark it up-to-date and
 * unlock the buffer. This is what ll_rw_block uses too.
 */
void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
{
	__end_buffer_read_notouch(bh, uptodate);
	put_bh(bh);
}
EXPORT_SYMBOL(end_buffer_read_sync);

void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		buffer_io_error(bh, ", lost sync page write");
		mark_buffer_write_io_error(bh);
		clear_buffer_uptodate(bh);
	}
	unlock_buffer(bh);
	put_bh(bh);
}
EXPORT_SYMBOL(end_buffer_write_sync);

/*
 * Various filesystems appear to want __find_get_block to be non-blocking.
 * But it's the page lock which protects the buffers. To get around this,
 * we get exclusion from try_to_free_buffers with the blockdev mapping's
 * private_lock.
 *
 * Hack idea: for the blockdev mapping, i_bufferlist_lock contention
 * may be quite high. This code could TryLock the page, and if that
 * succeeds, there is no need to take private_lock. (But if
 * private_lock is contended then so is mapping->tree_lock).
 */
static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
{
	struct inode *bd_inode = bdev->bd_inode;
	struct address_space *bd_mapping = bd_inode->i_mapping;
	struct buffer_head *ret = NULL;
	pgoff_t index;
	struct buffer_head *bh;
	struct buffer_head *head;
	struct page *page;
	int all_mapped = 1;
	static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);

	index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
	if (!page)
		goto out;

	spin_lock(&bd_mapping->private_lock);
	if (!page_has_buffers(page))
		goto out_unlock;
	head = page_buffers(page);
	bh = head;
	do {
		if (!buffer_mapped(bh))
			all_mapped = 0;
		else if (bh->b_blocknr == block) {
			ret = bh;
			get_bh(bh);
			goto out_unlock;
		}
		bh = bh->b_this_page;
	} while (bh != head);

	/* we might be here because some of the buffers on this page are
	 * not mapped. This is due to various races between
	 * file io on the block device and getblk. It gets dealt with
	 * elsewhere, don't buffer_error if we had some unmapped buffers
	 */
	ratelimit_set_flags(&last_warned, RATELIMIT_MSG_ON_RELEASE);
	if (all_mapped && __ratelimit(&last_warned)) {
		printk("__find_get_block_slow() failed. block=%llu, "
		       "b_blocknr=%llu, b_state=0x%08lx, b_size=%zu, "
		       "device %pg blocksize: %d\n",
		       (unsigned long long)block,
		       (unsigned long long)bh->b_blocknr,
		       bh->b_state, bh->b_size, bdev,
		       1 << bd_inode->i_blkbits);
	}
out_unlock:
	spin_unlock(&bd_mapping->private_lock);
	put_page(page);
out:
	return ret;
}

/*
 * I/O completion handler for block_read_full_page() - pages
 * which come unlocked at the end of I/O.
 */
static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
{
	unsigned long flags;
	struct buffer_head *first;
	struct buffer_head *tmp;
	struct page *page;
	int page_uptodate = 1;

	BUG_ON(!buffer_async_read(bh));

	page = bh->b_page;
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		clear_buffer_uptodate(bh);
		buffer_io_error(bh, ", async page read");
		SetPageError(page);
	}

	/*
	 * Be _very_ careful from here on. Bad things can happen if
	 * two buffer heads end IO at almost the same time and both
	 * decide that the page is now completely done.
	 */
	first = page_buffers(page);
	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
	clear_buffer_async_read(bh);
	unlock_buffer(bh);
	tmp = bh;
	do {
		if (!buffer_uptodate(tmp))
			page_uptodate = 0;
		if (buffer_async_read(tmp)) {
			BUG_ON(!buffer_locked(tmp));
			goto still_busy;
		}
		tmp = tmp->b_this_page;
	} while (tmp != bh);
	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);

	/*
	 * If none of the buffers had errors and they are all
	 * uptodate then we can set the page uptodate.
	 */
	if (page_uptodate && !PageError(page))
		SetPageUptodate(page);
	unlock_page(page);
	return;

still_busy:
	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);
	return;
}

/*
 * Completion handler for block_write_full_page() - pages which are unlocked
 * during I/O, and which have PageWriteback cleared upon I/O completion.
 */
void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
	unsigned long flags;
	struct buffer_head *first;
	struct buffer_head *tmp;
	struct page *page;

	BUG_ON(!buffer_async_write(bh));

	page = bh->b_page;
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		buffer_io_error(bh, ", lost async page write");
		mark_buffer_write_io_error(bh);
		clear_buffer_uptodate(bh);
		SetPageError(page);
	}

	first = page_buffers(page);
	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);

	clear_buffer_async_write(bh);
	unlock_buffer(bh);
	tmp = bh->b_this_page;
	while (tmp != bh) {
		if (buffer_async_write(tmp)) {
			BUG_ON(!buffer_locked(tmp));
			goto still_busy;
		}
		tmp = tmp->b_this_page;
	}
	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);
	end_page_writeback(page);
	return;

still_busy:
	bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
	local_irq_restore(flags);
	return;
}
EXPORT_SYMBOL(end_buffer_async_write);

/*
 * If a page's buffers are under async readin (end_buffer_async_read
 * completion) then there is a possibility that another thread of
 * control could lock one of the buffers after it has completed
 * but while some of the other buffers have not completed. This
 * locked buffer would confuse end_buffer_async_read() into not unlocking
 * the page. So the absence of BH_Async_Read tells end_buffer_async_read()
 * that this buffer is not under async I/O.
 *
 * The page comes unlocked when it has no locked buffer_async buffers
 * left.
 *
 * PageLocked prevents anyone starting new async I/O reads any of
 * the buffers.
 *
 * PageWriteback is used to prevent simultaneous writeout of the same
 * page.
 *
 * PageLocked prevents anyone from starting writeback of a page which is
 * under read I/O (PageWriteback is only ever set against a locked page).
 */
static void mark_buffer_async_read(struct buffer_head *bh)
{
	bh->b_end_io = end_buffer_async_read;
	set_buffer_async_read(bh);
}

static void mark_buffer_async_write_endio(struct buffer_head *bh,
					  bh_end_io_t *handler)
{
	bh->b_end_io = handler;
	set_buffer_async_write(bh);
}

void mark_buffer_async_write(struct buffer_head *bh)
{
	mark_buffer_async_write_endio(bh, end_buffer_async_write);
}
EXPORT_SYMBOL(mark_buffer_async_write);


/*
 * fs/buffer.c contains helper functions for buffer-backed address space's
 * fsync functions. A common requirement for buffer-based filesystems is
 * that certain data from the backing blockdev needs to be written out for
 * a successful fsync(). For example, ext2 indirect blocks need to be
 * written back and waited upon before fsync() returns.
 *
 * The functions mark_buffer_inode_dirty(), fsync_inode_buffers(),
 * inode_has_buffers() and invalidate_inode_buffers() are provided for the
 * management of a list of dependent buffers at ->i_mapping->private_list.
 *
 * Locking is a little subtle: try_to_free_buffers() will remove buffers
 * from their controlling inode's queue when they are being freed. But
 * try_to_free_buffers() will be operating against the *blockdev* mapping
 * at the time, not against the S_ISREG file which depends on those buffers.
 * So the locking for private_list is via the private_lock in the address_space
 * which backs the buffers. Which is different from the address_space
 * against which the buffers are listed. So for a particular address_space,
 * mapping->private_lock does *not* protect mapping->private_list! In fact,
 * mapping->private_list will always be protected by the backing blockdev's
 * ->private_lock.
 *
 * Which introduces a requirement: all buffers on an address_space's
 * ->private_list must be from the same address_space: the blockdev's.
 *
 * address_spaces which do not place buffers at ->private_list via these
 * utility functions are free to use private_lock and private_list for
 * whatever they want. The only requirement is that list_empty(private_list)
 * be true at clear_inode() time.
 *
 * FIXME: clear_inode should not call invalidate_inode_buffers(). The
 * filesystems should do that. invalidate_inode_buffers() should just go
 * BUG_ON(!list_empty).
 *
 * FIXME: mark_buffer_dirty_inode() is a data-plane operation. It should
 * take an address_space, not an inode. And it should be called
 * mark_buffer_dirty_fsync() to clearly define why those buffers are being
 * queued up.
 *
 * FIXME: mark_buffer_dirty_inode() doesn't need to add the buffer to the
 * list if it is already on a list. Because if the buffer is on a list,
 * it *must* already be on the right one. If not, the filesystem is being
 * silly. This will save a ton of locking. But first we have to ensure
 * that buffers are taken *off* the old inode's list when they are freed
 * (presumably in truncate). That requires careful auditing of all
 * filesystems (do it inside bforget()). It could also be done by bringing
 * b_inode back.
 */

/*
 * The buffer's backing address_space's private_lock must be held
 */
static void __remove_assoc_queue(struct buffer_head *bh)
{
	list_del_init(&bh->b_assoc_buffers);
	WARN_ON(!bh->b_assoc_map);
	bh->b_assoc_map = NULL;
}

int inode_has_buffers(struct inode *inode)
{
	return !list_empty(&inode->i_data.private_list);
}

/*
 * osync is designed to support O_SYNC io. It waits synchronously for
 * all already-submitted IO to complete, but does not queue any new
 * writes to the disk.
 *
 * To do O_SYNC writes, just queue the buffer writes with ll_rw_block as
 * you dirty the buffers, and then use osync_inode_buffers to wait for
 * completion. Any other dirty buffers which are not yet queued for
 * write will not be flushed to disk by the osync.
 */
static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
{
	struct buffer_head *bh;
	struct list_head *p;
	int err = 0;

	spin_lock(lock);
repeat:
	list_for_each_prev(p, list) {
		bh = BH_ENTRY(p);
		if (buffer_locked(bh)) {
			get_bh(bh);
			spin_unlock(lock);
			wait_on_buffer(bh);
			if (!buffer_uptodate(bh))
				err = -EIO;
			brelse(bh);
			spin_lock(lock);
			goto repeat;
		}
	}
	spin_unlock(lock);
	return err;
}

static void do_thaw_one(struct super_block *sb, void *unused)
|
|
{
|
|
while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb))
|
|
printk(KERN_WARNING "Emergency Thaw on %pg\n", sb->s_bdev);
|
|
}
|
|
|
|
static void do_thaw_all(struct work_struct *work)
|
|
{
|
|
iterate_supers(do_thaw_one, NULL);
|
|
kfree(work);
|
|
printk(KERN_WARNING "Emergency Thaw complete\n");
|
|
}
|
|
|
|
/**
|
|
* emergency_thaw_all -- forcibly thaw every frozen filesystem
|
|
*
|
|
* Used for emergency unfreeze of all filesystems via SysRq
|
|
*/
|
|
void emergency_thaw_all(void)
|
|
{
|
|
struct work_struct *work;
|
|
|
|
work = kmalloc(sizeof(*work), GFP_ATOMIC);
|
|
if (work) {
|
|
INIT_WORK(work, do_thaw_all);
|
|
schedule_work(work);
|
|
}
|
|
}
|
|
|
|
/**
|
|
* sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
|
|
* @mapping: the mapping which wants those buffers written
|
|
*
|
|
* Starts I/O against the buffers at mapping->private_list, and waits upon
|
|
* that I/O.
|
|
*
|
|
* Basically, this is a convenience function for fsync().
|
|
* @mapping is a file or directory which needs those buffers to be written for
|
|
* a successful fsync().
|
|
*/
|
|
int sync_mapping_buffers(struct address_space *mapping)
|
|
{
|
|
struct address_space *buffer_mapping = mapping->private_data;
|
|
|
|
if (buffer_mapping == NULL || list_empty(&mapping->private_list))
|
|
return 0;
|
|
|
|
return fsync_buffers_list(&buffer_mapping->private_lock,
|
|
&mapping->private_list);
|
|
}
|
|
EXPORT_SYMBOL(sync_mapping_buffers);
|
|
|
|
/*
|
|
* Called when we've recently written block `bblock', and it is known that
|
|
* `bblock' was for a buffer_boundary() buffer. This means that the block at
|
|
* `bblock + 1' is probably a dirty indirect block. Hunt it down and, if it's
|
|
* dirty, schedule it for IO. So that indirects merge nicely with their data.
|
|
*/
|
|
void write_boundary_block(struct block_device *bdev,
|
|
sector_t bblock, unsigned blocksize)
|
|
{
|
|
struct buffer_head *bh = __find_get_block(bdev, bblock + 1, blocksize);
|
|
if (bh) {
|
|
if (buffer_dirty(bh))
|
|
ll_rw_block(REQ_OP_WRITE, 0, 1, &bh);
|
|
put_bh(bh);
|
|
}
|
|
}
|
|
|
|
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
|
|
{
|
|
struct address_space *mapping = inode->i_mapping;
|
|
struct address_space *buffer_mapping = bh->b_page->mapping;
|
|
|
|
mark_buffer_dirty(bh);
|
|
if (!mapping->private_data) {
|
|
mapping->private_data = buffer_mapping;
|
|
} else {
|
|
BUG_ON(mapping->private_data != buffer_mapping);
|
|
}
|
|
if (!bh->b_assoc_map) {
|
|
spin_lock(&buffer_mapping->private_lock);
|
|
list_move_tail(&bh->b_assoc_buffers,
|
|
&mapping->private_list);
|
|
bh->b_assoc_map = mapping;
|
|
spin_unlock(&buffer_mapping->private_lock);
|
|
}
|
|
}
|
|
EXPORT_SYMBOL(mark_buffer_dirty_inode);
|
|
|
|
/*
|
|
* Mark the page dirty, and set it dirty in the radix tree, and mark the inode
|
|
* dirty.
|
|
*
|
|
* If warn is true, then emit a warning if the page is not uptodate and has
|
|
* not been truncated.
|
|
*
|
|
* The caller must hold lock_page_memcg().
|
|
*/
|
|
void __set_page_dirty(struct page *page, struct address_space *mapping,
|
|
int warn)
|
|
{
|
|
unsigned long flags;
|
|
|
|
spin_lock_irqsave(&mapping->tree_lock, flags);
|
|
if (page->mapping) { /* Race with truncate? */
|
|
WARN_ON_ONCE(warn && !PageUptodate(page));
|
|
account_page_dirtied(page, mapping);
|
|
radix_tree_tag_set(&mapping->page_tree,
|
|
page_index(page), PAGECACHE_TAG_DIRTY);
|
|
}
|
|
spin_unlock_irqrestore(&mapping->tree_lock, flags);
|
|
}
|
|
EXPORT_SYMBOL_GPL(__set_page_dirty);
|
|
|
|
/*
 * Add a page to the dirty page list.
 *
 * It is a sad fact of life that this function is called from several places
 * deeply under spinlocking. It may not sleep.
 *
 * If the page has buffers, the uptodate buffers are set dirty, to preserve
 * dirty-state coherency between the page and the buffers. If the page does
 * not have buffers then when they are later attached they will all be set
 * dirty.
 *
 * The buffers are dirtied before the page is dirtied. There's a small race
 * window in which a writepage caller may see the page cleanness but not the
 * buffer dirtiness. That's fine. If this code were to set the page dirty
 * before the buffers, a concurrent writepage caller could clear the page dirty
 * bit, see a bunch of clean buffers and we'd end up with dirty buffers/clean
 * page on the dirty page list.
 *
 * We use private_lock to lock against try_to_free_buffers while using the
 * page's buffer list. Also use this to protect against clean buffers being
 * added to the page after it was set dirty.
 *
 * FIXME: may need to call ->reservepage here as well. That's rather up to the
 * address_space though.
 */
int __set_page_dirty_buffers(struct page *page)
{
	int newly_dirty;
	struct address_space *mapping = page_mapping(page);

	if (unlikely(!mapping))
		return !TestSetPageDirty(page);

	spin_lock(&mapping->private_lock);
	if (page_has_buffers(page)) {
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		do {
			set_buffer_dirty(bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}
	/*
	 * Lock out page->mem_cgroup migration to keep PageDirty
	 * synchronized with per-memcg dirty page counters.
	 */
	lock_page_memcg(page);
	newly_dirty = !TestSetPageDirty(page);
	spin_unlock(&mapping->private_lock);

	if (newly_dirty)
		__set_page_dirty(page, mapping, 1);

	unlock_page_memcg(page);

	if (newly_dirty)
		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

	return newly_dirty;
}
EXPORT_SYMBOL(__set_page_dirty_buffers);

/*
|
|
* Write out and wait upon a list of buffers.
|
|
*
|
|
* We have conflicting pressures: we want to make sure that all
|
|
* initially dirty buffers get waited on, but that any subsequently
|
|
* dirtied buffers don't. After all, we don't want fsync to last
|
|
* forever if somebody is actively writing to the file.
|
|
*
|
|
* Do this in two main stages: first we copy dirty buffers to a
|
|
* temporary inode list, queueing the writes as we go. Then we clean
|
|
* up, waiting for those writes to complete.
|
|
*
|
|
* During this second stage, any subsequent updates to the file may end
|
|
* up refiling the buffer on the original inode's dirty list again, so
|
|
* there is a chance we will end up with a buffer queued for write but
|
|
* not yet completed on that list. So, as a final cleanup we go through
|
|
* the osync code to catch these locked, dirty buffers without requeuing
|
|
* any newly dirty buffers for write.
|
|
*/
|
|
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
|
|
{
|
|
struct buffer_head *bh;
|
|
struct list_head tmp;
|
|
struct address_space *mapping;
|
|
int err = 0, err2;
|
|
struct blk_plug plug;
|
|
|
|
INIT_LIST_HEAD(&tmp);
|
|
blk_start_plug(&plug);
|
|
|
|
spin_lock(lock);
|
|
while (!list_empty(list)) {
|
|
bh = BH_ENTRY(list->next);
|
|
mapping = bh->b_assoc_map;
|
|
__remove_assoc_queue(bh);
|
|
/* Avoid race with mark_buffer_dirty_inode() which does
|
|
* a lockless check and we rely on seeing the dirty bit */
|
|
smp_mb();
|
|
if (buffer_dirty(bh) || buffer_locked(bh)) {
|
|
list_add(&bh->b_assoc_buffers, &tmp);
|
|
bh->b_assoc_map = mapping;
|
|
if (buffer_dirty(bh)) {
|
|
get_bh(bh);
|
|
spin_unlock(lock);
|
|
/*
|
|
* Ensure any pending I/O completes so that
|
|
* write_dirty_buffer() actually writes the
|
|
* current contents - it is a noop if I/O is
|
|
* still in flight on potentially older
|
|
* contents.
|
|
*/
|
|
write_dirty_buffer(bh, REQ_SYNC);
|
|
|
|
/*
|
|
* Kick off IO for the previous mapping. Note
|
|
* that we will not run the very last mapping,
|
|
* wait_on_buffer() will do that for us
|
|
* through sync_buffer().
|
|
*/
|
|
brelse(bh);
|
|
spin_lock(lock);
|
|
}
|
|
}
|
|
}
|
|
|
|
spin_unlock(lock);
|
|
blk_finish_plug(&plug);
|
|
spin_lock(lock);
|
|
|
|
while (!list_empty(&tmp)) {
|
|
bh = BH_ENTRY(tmp.prev);
|
|
get_bh(bh);
|
|
mapping = bh->b_assoc_map;
|
|
__remove_assoc_queue(bh);
|
|
/* Avoid race with mark_buffer_dirty_inode() which does
|
|
* a lockless check and we rely on seeing the dirty bit */
|
|
smp_mb();
|
|
if (buffer_dirty(bh)) {
|
|
list_add(&bh->b_assoc_buffers,
|
|
&mapping->private_list);
|
|
bh->b_assoc_map = mapping;
|
|
}
|
|
spin_unlock(lock);
|
|
wait_on_buffer(bh);
|
|
if (!buffer_uptodate(bh))
|
|
err = -EIO;
|
|
brelse(bh);
|
|
spin_lock(lock);
|
|
}
|
|
|
|
spin_unlock(lock);
|
|
err2 = osync_buffers_list(lock, list);
|
|
if (err)
|
|
return err;
|
|
else
|
|
return err2;
|
|
}
|
|
|
|
/*
|
|
* Invalidate any and all dirty buffers on a given inode. We are
|
|
* probably unmounting the fs, but that doesn't mean we have already
|
|
* done a sync(). Just drop the buffers from the inode list.
|
|
*
|
|
* NOTE: we take the inode's blockdev's mapping's private_lock. Which
|
|
* assumes that all the buffers are against the blockdev. Not true
|
|
* for reiserfs.
|
|
*/
|
|
void invalidate_inode_buffers(struct inode *inode)
|
|
{
|
|
if (inode_has_buffers(inode)) {
|
|
struct address_space *mapping = &inode->i_data;
|
|
struct list_head *list = &mapping->private_list;
|
|
struct address_space *buffer_mapping = mapping->private_data;
|
|
|
|
spin_lock(&buffer_mapping->private_lock);
|
|
while (!list_empty(list))
|
|
__remove_assoc_queue(BH_ENTRY(list->next));
|
|
spin_unlock(&buffer_mapping->private_lock);
|
|
}
|
|
}
|
|
EXPORT_SYMBOL(invalidate_inode_buffers);
|
|
|
|
/*
|
|
* Remove any clean buffers from the inode's buffer list. This is called
|
|
* when we're trying to free the inode itself. Those buffers can pin it.
|
|
*
|
|
* Returns true if all buffers were removed.
|
|
*/
|
|
int remove_inode_buffers(struct inode *inode)
|
|
{
|
|
int ret = 1;
|
|
|
|
if (inode_has_buffers(inode)) {
|
|
struct address_space *mapping = &inode->i_data;
|
|
struct list_head *list = &mapping->private_list;
|
|
struct address_space *buffer_mapping = mapping->private_data;
|
|
|
|
spin_lock(&buffer_mapping->private_lock);
|
|
while (!list_empty(list)) {
|
|
struct buffer_head *bh = BH_ENTRY(list->next);
|
|
if (buffer_dirty(bh)) {
|
|
ret = 0;
|
|
break;
|
|
}
|
|
__remove_assoc_queue(bh);
|
|
}
|
|
spin_unlock(&buffer_mapping->private_lock);
|
|
}
|
|
return ret;
|
|
}
|
|
|
|
/*
 * Create the appropriate buffers when given a page for data area and
 * the size of each buffer. Use the bh->b_this_page linked list to
 * follow the buffers created. Return NULL if unable to create more
 * buffers.
 *
 * The retry flag is used to differentiate async IO (paging, swapping)
 * which may not fail from ordinary buffer allocations.
 */
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
		bool retry)
{
	struct buffer_head *bh, *head;
	gfp_t gfp = GFP_NOFS;
	long offset;

	if (retry)
		gfp |= __GFP_NOFAIL;

	head = NULL;
	offset = PAGE_SIZE;
	while ((offset -= size) >= 0) {
		bh = alloc_buffer_head(gfp);
		if (!bh)
			goto no_grow;

		bh->b_this_page = head;
		bh->b_blocknr = -1;
		head = bh;

		bh->b_size = size;

		/* Link the buffer to its page */
		set_bh_page(bh, page, offset);
	}
	return head;
/*
 * In case anything failed, we just free everything we got.
 */
no_grow:
	if (head) {
		do {
			bh = head;
			head = head->b_this_page;
			free_buffer_head(bh);
		} while (head);
	}

	return NULL;
}
EXPORT_SYMBOL_GPL(alloc_page_buffers);

static inline void
link_dev_buffers(struct page *page, struct buffer_head *head)
{
	struct buffer_head *bh, *tail;

	bh = head;
	do {
		tail = bh;
		bh = bh->b_this_page;
	} while (bh);
	tail->b_this_page = head;
	attach_page_buffers(page, head);
}

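/*
 * Illustration only, not part of the kernel source. A minimal user-space
 * sketch of the list construction done by alloc_page_buffers() and
 * link_dev_buffers(): buffers are pushed at the head while the offset walks
 * down from the end of the page, then the tail is pointed back at the head
 * to close the ring. struct sketch_buf and sketch_alloc_ring() are
 * hypothetical names, and malloc()/free() stand in for
 * alloc_buffer_head()/free_buffer_head().
 */

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct buffer_head: ring link and offset only. */
struct sketch_buf {
	struct sketch_buf *next;	/* plays the role of b_this_page */
	long offset;			/* byte offset of the buffer in the page */
};

/*
 * Allocate one sketch_buf per @size bytes of a @page_size page and close
 * the list into a ring. Returns the buffer with the lowest offset, or
 * NULL if an allocation fails.
 */
static struct sketch_buf *sketch_alloc_ring(long page_size, long size)
{
	struct sketch_buf *head = NULL, *b, *tail;
	long offset = page_size;

	assert(size > 0 && size <= page_size);

	/* Walk offsets from the end of the page down, pushing at the head. */
	while ((offset -= size) >= 0) {
		b = malloc(sizeof(*b));
		if (!b)
			goto no_grow;
		b->next = head;
		b->offset = offset;
		head = b;
	}

	/* Find the tail and point it back at the head to form the ring. */
	for (tail = head; tail->next; tail = tail->next)
		;
	tail->next = head;
	return head;

no_grow:
	/* Free everything allocated so far. */
	while (head) {
		b = head;
		head = head->next;
		free(b);
	}
	return NULL;
}
```
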
static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
|
|
{
|
|
sector_t retval = ~((sector_t)0);
|
|
loff_t sz = i_size_read(bdev->bd_inode);
|
|
|
|
if (sz) {
|
|
unsigned int sizebits = blksize_bits(size);
|
|
retval = (sz >> sizebits);
|
|
}
|
|
return retval;
|
|
}
|
|
|
|
/*
|
|
* Initialise the state of a blockdev page's buffers.
|
|
*/
|
|
static sector_t
|
|
init_page_buffers(struct page *page, struct block_device *bdev,
|
|
sector_t block, int size)
|
|
{
|
|
struct buffer_head *head = page_buffers(page);
|
|
struct buffer_head *bh = head;
|
|
int uptodate = PageUptodate(page);
|
|
sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size);
|
|
|
|
do {
|
|
if (!buffer_mapped(bh)) {
|
|
init_buffer(bh, NULL, NULL);
|
|
bh->b_bdev = bdev;
|
|
bh->b_blocknr = block;
|
|
if (uptodate)
|
|
set_buffer_uptodate(bh);
|
|
if (block < end_block)
|
|
set_buffer_mapped(bh);
|
|
}
|
|
block++;
|
|
bh = bh->b_this_page;
|
|
} while (bh != head);
|
|
|
|
/*
|
|
* Caller needs to validate requested block against end of device.
|
|
*/
|
|
return end_block;
|
|
}
|
|
|
|
/*
|
|
* Create the page-cache page that contains the requested block.
|
|
*
|
|
* This is used purely for blockdev mappings.
|
|
*/
|
|
static int
|
|
grow_dev_page(struct block_device *bdev, sector_t block,
|
|
pgoff_t index, int size, int sizebits, gfp_t gfp)
|
|
{
|
|
struct inode *inode = bdev->bd_inode;
|
|
struct page *page;
|
|
struct buffer_head *bh;
|
|
sector_t end_block;
|
|
int ret = 0; /* 0: buffers could not be freed; __getblk_slow() retries */
|
|
gfp_t gfp_mask;
|
|
|
|
gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
|
|
|
|
/*
|
|
* XXX: __getblk_slow() can not really deal with failure and
|
|
* will endlessly loop on improvised global reclaim. Prefer
|
|
* looping in the allocator rather than here, at least that
|
|
* code knows what it's doing.
|
|
*/
|
|
gfp_mask |= __GFP_NOFAIL;
|
|
|
|
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
|
|
|
|
BUG_ON(!PageLocked(page));
|
|
|
|
if (page_has_buffers(page)) {
|
|
bh = page_buffers(page);
|
|
if (bh->b_size == size) {
|
|
end_block = init_page_buffers(page, bdev,
|
|
(sector_t)index << sizebits,
|
|
size);
|
|
goto done;
|
|
}
|
|
if (!try_to_free_buffers(page))
|
|
goto failed;
|
|
}
|
|
|
|
/*
|
|
* Allocate some buffers for this page
|
|
*/
|
|
bh = alloc_page_buffers(page, size, true);
|
|
|
|
/*
|
|
* Link the page to the buffers and initialise them. Take the
|
|
* lock to be atomic wrt __find_get_block(), which does not
|
|
* run under the page lock.
|
|
*/
|
|
spin_lock(&inode->i_mapping->private_lock);
|
|
link_dev_buffers(page, bh);
|
|
end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
|
|
size);
|
|
spin_unlock(&inode->i_mapping->private_lock);
|
|
done:
|
|
ret = (block < end_block) ? 1 : -ENXIO;
|
|
failed:
|
|
unlock_page(page);
|
|
put_page(page);
|
|
return ret;
|
|
}
|
|
|
|
/*
 * Create buffers for the specified block device block's page. If
 * that page was dirty, the buffers are set dirty also.
 */
static int
grow_buffers(struct block_device *bdev, sector_t block, int size, gfp_t gfp)
{
	pgoff_t index;
	int sizebits;

	sizebits = -1;
	do {
		sizebits++;
	} while ((size << sizebits) < PAGE_SIZE);

	index = block >> sizebits;

	/*
	 * Check for a block which wants to lie outside our maximum possible
	 * pagecache index. (this comparison is done using sector_t types).
	 */
	if (unlikely(index != block >> sizebits)) {
		printk(KERN_ERR "%s: requested out-of-range block %llu for "
			"device %pg\n",
			__func__, (unsigned long long)block,
			bdev);
		return -EIO;
	}

	/* Create a page with the proper size buffers. */
	return grow_dev_page(bdev, block, index, size, sizebits, gfp);
}

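/*
 * Illustration only, not part of the kernel source. A user-space sketch of
 * the index arithmetic in grow_buffers(), assuming 4 KiB pages: sizebits is
 * log2(blocks per page), so block >> sizebits is the pagecache index and
 * index << sizebits is the first block held by that page. SKETCH_PAGE_SIZE,
 * sketch_sizebits() and sketch_page_index() are hypothetical names.
 */

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Smallest shift such that (size << shift) reaches the page size. */
static int sketch_sizebits(unsigned int size)
{
	int sizebits = -1;

	assert(size != 0 && size <= SKETCH_PAGE_SIZE);
	do {
		sizebits++;
	} while ((size << sizebits) < SKETCH_PAGE_SIZE);
	return sizebits;
}

/* Pagecache index of the page that holds @block for blocks of @size bytes. */
static uint64_t sketch_page_index(uint64_t block, unsigned int size)
{
	return block >> sketch_sizebits(size);
}
```
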
static struct buffer_head *
|
|
__getblk_slow(struct block_device *bdev, sector_t block,
|
|
unsigned size, gfp_t gfp)
|
|
{
|
|
/* Size must be multiple of hard sectorsize */
|
|
if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
|
|
(size < 512 || size > PAGE_SIZE))) {
|
|
printk(KERN_ERR "getblk(): invalid block size %d requested\n",
|
|
size);
|
|
printk(KERN_ERR "logical block size: %d\n",
|
|
bdev_logical_block_size(bdev));
|
|
|
|
dump_stack();
|
|
return NULL;
|
|
}
|
|
|
|
for (;;) {
|
|
struct buffer_head *bh;
|
|
int ret;
|
|
|
|
bh = __find_get_block(bdev, block, size);
|
|
if (bh)
|
|
return bh;
|
|
|
|
ret = grow_buffers(bdev, block, size, gfp);
|
|
if (ret < 0)
|
|
return NULL;
|
|
}
|
|
}
|
|
|
|
/*
|
|
* The relationship between dirty buffers and dirty pages:
|
|
*
|
|
* Whenever a page has any dirty buffers, the page's dirty bit is set, and
|
|
* the page is tagged dirty in its radix tree.
|
|
*
|
|
* At all times, the dirtiness of the buffers represents the dirtiness of
|
|
* subsections of the page. If the page has buffers, the page dirty bit is
|
|
* merely a hint about the true dirty state.
|
|
*
|
|
* When a page is set dirty in its entirety, all its buffers are marked dirty
|
|
* (if the page has buffers).
|
|
*
|
|
* When a buffer is marked dirty, its page is dirtied, but the page's other
|
|
* buffers are not.
|
|
*
|
|
* Also. When blockdev buffers are explicitly read with bread(), they
|
|
* individually become uptodate. But their backing page remains not
|
|
* uptodate - even if all of its buffers are uptodate. A subsequent
|
|
* block_read_full_page() against that page will discover all the uptodate
|
|
* buffers, will set the page uptodate and will perform no I/O.
|
|
*/
|
|
|
|
/**
|
|
* mark_buffer_dirty - mark a buffer_head as needing writeout
|
|
* @bh: the buffer_head to mark dirty
|
|
*
|
|
* mark_buffer_dirty() will set the dirty bit against the buffer, then set its
|
|
* backing page dirty, then tag the page as dirty in its address_space's radix
|
|
* tree and then attach the address_space's inode to its superblock's dirty
|
|
* inode list.
|
|
*
|
|
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
|
|
* mapping->tree_lock and mapping->host->i_lock.
|
|
*/
|
|
void mark_buffer_dirty(struct buffer_head *bh)
|
|
{
|
|
WARN_ON_ONCE(!buffer_uptodate(bh));
|
|
|
|
trace_block_dirty_buffer(bh);
|
|
|
|
/*
|
|
* Very *carefully* optimize the it-is-already-dirty case.
|
|
*
|
|
* Don't let the final "is it dirty" escape to before we
|
|
* perhaps modified the buffer.
|
|
*/
|
|
if (buffer_dirty(bh)) {
|
|
smp_mb();
|
|
if (buffer_dirty(bh))
|
|
return;
|
|
}
|
|
|
|
if (!test_set_buffer_dirty(bh)) {
|
|
struct page *page = bh->b_page;
|
|
struct address_space *mapping = NULL;
|
|
|
|
lock_page_memcg(page);
|
|
if (!TestSetPageDirty(page)) {
|
|
mapping = page_mapping(page);
|
|
if (mapping)
|
|
__set_page_dirty(page, mapping, 0);
|
|
}
|
|
unlock_page_memcg(page);
|
|
if (mapping)
|
|
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
|
|
}
|
|
}
|
|
EXPORT_SYMBOL(mark_buffer_dirty);
|
|
|
|
void mark_buffer_write_io_error(struct buffer_head *bh)
|
|
{
|
|
struct super_block *sb;
|
|
|
|
set_buffer_write_io_error(bh);
|
|
/* FIXME: do we need to set this in both places? */
|
|
if (bh->b_page && bh->b_page->mapping)
|
|
mapping_set_error(bh->b_page->mapping, -EIO);
|
|
if (bh->b_assoc_map)
|
|
mapping_set_error(bh->b_assoc_map, -EIO);
|
|
rcu_read_lock();
|
|
sb = READ_ONCE(bh->b_bdev->bd_super);
|
|
if (sb)
|
|
errseq_set(&sb->s_wb_err, -EIO);
|
|
rcu_read_unlock();
|
|
}
|
|
EXPORT_SYMBOL(mark_buffer_write_io_error);
|
|
|
|
/*
|
|
* Decrement a buffer_head's reference count. If all buffers against a page
|
|
* have zero reference count, are clean and unlocked, and if the page is clean
|
|
* and unlocked then try_to_free_buffers() may strip the buffers from the page
|
|
* in preparation for freeing it (sometimes, rarely, buffers are removed from
|
|
* a page but it ends up not being freed, and buffers may later be reattached).
|
|
*/
|
|
void __brelse(struct buffer_head * buf)
|
|
{
|
|
if (atomic_read(&buf->b_count)) {
|
|
put_bh(buf);
|
|
return;
|
|
}
|
|
WARN(1, KERN_ERR "VFS: brelse: Trying to free free buffer\n");
|
|
}
|
|
EXPORT_SYMBOL(__brelse);
|
|
|
|
/*
|
|
* bforget() is like brelse(), except it discards any
|
|
* potentially dirty data.
|
|
*/
|
|
void __bforget(struct buffer_head *bh)
|
|
{
|
|
clear_buffer_dirty(bh);
|
|
if (bh->b_assoc_map) {
|
|
struct address_space *buffer_mapping = bh->b_page->mapping;
|
|
|
|
spin_lock(&buffer_mapping->private_lock);
|
|
list_del_init(&bh->b_assoc_buffers);
|
|
bh->b_assoc_map = NULL;
|
|
spin_unlock(&buffer_mapping->private_lock);
|
|
}
|
|
__brelse(bh);
|
|
}
|
|
EXPORT_SYMBOL(__bforget);
|
|
|
|
static struct buffer_head *__bread_slow(struct buffer_head *bh)
|
|
{
|
|
lock_buffer(bh);
|
|
if (buffer_uptodate(bh)) {
|
|
unlock_buffer(bh);
|
|
return bh;
|
|
} else {
|
|
get_bh(bh);
|
|
bh->b_end_io = end_buffer_read_sync;
|
|
submit_bh(REQ_OP_READ, 0, bh);
|
|
wait_on_buffer(bh);
|
|
if (buffer_uptodate(bh))
|
|
return bh;
|
|
}
|
|
brelse(bh);
|
|
return NULL;
|
|
}
|
|
|
|
/*
|
|
* Per-cpu buffer LRU implementation. To reduce the cost of __find_get_block().
|
|
* The bhs[] array is sorted - newest buffer is at bhs[0]. Buffers have their
|
|
* refcount elevated by one when they're in an LRU. A buffer can only appear
|
|
* once in a particular CPU's LRU. A single buffer can be present in multiple
|
|
* CPU's LRUs at the same time.
|
|
*
|
|
* This is a transparent caching front-end to sb_bread(), sb_getblk() and
|
|
* sb_find_get_block().
|
|
*
|
|
* The LRUs themselves only need locking against invalidate_bh_lrus. We use
|
|
* a local interrupt disable for that.
|
|
*/
|
|
|
|
#define BH_LRU_SIZE 16
|
|
|
|
struct bh_lru {
|
|
struct buffer_head *bhs[BH_LRU_SIZE];
|
|
};
|
|
|
|
static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
|
|
|
|
#ifdef CONFIG_SMP
|
|
#define bh_lru_lock() local_irq_disable()
|
|
#define bh_lru_unlock() local_irq_enable()
|
|
#else
|
|
#define bh_lru_lock() preempt_disable()
|
|
#define bh_lru_unlock() preempt_enable()
|
|
#endif
|
|
|
|
static inline void check_irqs_on(void)
|
|
{
|
|
#ifdef irqs_disabled
|
|
BUG_ON(irqs_disabled());
|
|
#endif
|
|
}
|
|
|
|
/*
 * Install a buffer_head into this cpu's LRU. If not already in the LRU, it is
 * inserted at the front, and the buffer_head at the back, if any, is evicted.
 * Or, if already in the LRU it is moved to the front.
 */
static void bh_lru_install(struct buffer_head *bh)
{
	struct buffer_head *evictee = bh;
	struct bh_lru *b;
	int i;

	check_irqs_on();
	bh_lru_lock();

	b = this_cpu_ptr(&bh_lrus);
	for (i = 0; i < BH_LRU_SIZE; i++) {
		swap(evictee, b->bhs[i]);
		if (evictee == bh) {
			bh_lru_unlock();
			return;
		}
	}

	get_bh(bh);
	bh_lru_unlock();
	brelse(evictee);
}

/*
 * Look up the bh in this cpu's LRU. If it's there, move it to the head.
 */
static struct buffer_head *
lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
{
	struct buffer_head *ret = NULL;
	unsigned int i;

	check_irqs_on();
	bh_lru_lock();
	for (i = 0; i < BH_LRU_SIZE; i++) {
		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);

		if (bh && bh->b_blocknr == block && bh->b_bdev == bdev &&
		    bh->b_size == size) {
			if (i) {
				while (i) {
					__this_cpu_write(bh_lrus.bhs[i],
						__this_cpu_read(bh_lrus.bhs[i - 1]));
					i--;
				}
				__this_cpu_write(bh_lrus.bhs[0], bh);
			}
			get_bh(bh);
			ret = bh;
			break;
		}
	}
	bh_lru_unlock();
	return ret;
}

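/*
 * Illustration only, not part of the kernel source. A user-space sketch of
 * the swap chain used by bh_lru_install(): the new item is swapped into
 * slot 0 and each displaced entry is carried one slot further back. If the
 * carried entry turns out to be the item itself, it was already cached and
 * has simply moved to the front; otherwise whatever falls off the last slot
 * is the evictee. Reference counting, per-cpu storage and locking are left
 * out. struct sketch_lru, SKETCH_LRU_SIZE and sketch_lru_install() are
 * hypothetical names.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LRU_SIZE 4	/* hypothetical; the kernel uses BH_LRU_SIZE */

struct sketch_lru {
	const void *slot[SKETCH_LRU_SIZE];	/* slot[0] is the newest entry */
};

/*
 * Put @item at the front of @l. Returns the entry pushed off the back,
 * or NULL if nothing was evicted.
 */
static const void *sketch_lru_install(struct sketch_lru *l, const void *item)
{
	const void *evictee = item;
	size_t i;

	assert(item != NULL);
	for (i = 0; i < SKETCH_LRU_SIZE; i++) {
		const void *tmp = l->slot[i];

		/* Equivalent of swap(evictee, b->bhs[i]). */
		l->slot[i] = evictee;
		evictee = tmp;
		if (evictee == item)
			return NULL;	/* already cached: moved to front */
	}
	return evictee;	/* oldest entry, or NULL if the LRU was not full */
}
```
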
/*
|
|
* Perform a pagecache lookup for the matching buffer. If it's there, refresh
|
|
* it in the LRU and mark it as accessed. If it is not present then return
|
|
* NULL
|
|
*/
|
|
struct buffer_head *
|
|
__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
|
|
{
|
|
struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
|
|
|
|
if (bh == NULL) {
|
|
/* __find_get_block_slow will mark the page accessed */
|
|
bh = __find_get_block_slow(bdev, block);
|
|
if (bh)
|
|
bh_lru_install(bh);
|
|
} else
|
|
touch_buffer(bh);
|
|
|
|
return bh;
|
|
}
|
|
EXPORT_SYMBOL(__find_get_block);
|
|
|
|
/*
|
|
* __getblk_gfp() will locate (and, if necessary, create) the buffer_head
|
|
* which corresponds to the passed block_device, block and size. The
|
|
* returned buffer has its reference count incremented.
|
|
*
|
|
* __getblk_gfp() will lock up the machine if grow_dev_page's
|
|
* try_to_free_buffers() attempt is failing. FIXME, perhaps?
|
|
*/
|
|
struct buffer_head *
|
|
__getblk_gfp(struct block_device *bdev, sector_t block,
|
|
unsigned size, gfp_t gfp)
|
|
{
|
|
struct buffer_head *bh = __find_get_block(bdev, block, size);
|
|
|
|
might_sleep();
|
|
if (bh == NULL)
|
|
bh = __getblk_slow(bdev, block, size, gfp);
|
|
return bh;
|
|
}
|
|
EXPORT_SYMBOL(__getblk_gfp);
|
|
|
|
/*
 * Do async read-ahead on a buffer.
 */
void __breadahead(struct block_device *bdev, sector_t block, unsigned size)
{
	struct buffer_head *bh = __getblk(bdev, block, size);
	if (likely(bh)) {
		ll_rw_block(REQ_OP_READ, REQ_RAHEAD, 1, &bh);
		brelse(bh);
	}
}
EXPORT_SYMBOL(__breadahead);

void __breadahead_gfp(struct block_device *bdev, sector_t block, unsigned size,
		      gfp_t gfp)
{
	struct buffer_head *bh = __getblk_gfp(bdev, block, size, gfp);
	if (likely(bh)) {
		ll_rw_block(REQ_OP_READ, REQ_RAHEAD, 1, &bh);
		brelse(bh);
	}
}
EXPORT_SYMBOL(__breadahead_gfp);

/**
 * __bread_gfp() - reads a specified block and returns the bh
 * @bdev: the block_device to read from
 * @block: number of block
 * @size: size (in bytes) to read
 * @gfp: page allocation flag
 *
 * Reads the specified block and returns the buffer head that contains it.
 * If @gfp is zero, the page cache page is allocated from a non-movable
 * area so that it does not prevent page migration.
 * It returns NULL if the block was unreadable.
 */
struct buffer_head *
__bread_gfp(struct block_device *bdev, sector_t block,
		unsigned size, gfp_t gfp)
{
	struct buffer_head *bh = __getblk_gfp(bdev, block, size, gfp);

	if (likely(bh) && !buffer_uptodate(bh))
		bh = __bread_slow(bh);
	return bh;
}
EXPORT_SYMBOL(__bread_gfp);

/*
|
|
* invalidate_bh_lrus() is called rarely - but not only at unmount.
|
|
* This doesn't race because it runs in each cpu either in irq
|
|
* or with preempt disabled.
|
|
*/
|
|
static void invalidate_bh_lru(void *arg)
|
|
{
|
|
struct bh_lru *b = &get_cpu_var(bh_lrus);
|
|
int i;
|
|
|
|
for (i = 0; i < BH_LRU_SIZE; i++) {
|
|
brelse(b->bhs[i]);
|
|
b->bhs[i] = NULL;
|
|
}
|
|
put_cpu_var(bh_lrus);
|
|
}
|
|
|
|
static bool has_bh_in_lru(int cpu, void *dummy)
|
|
{
|
|
struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
|
|
int i;
|
|
|
|
for (i = 0; i < BH_LRU_SIZE; i++) {
|
|
if (b->bhs[i])
|
|
return 1;
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
|
|
static void __evict_bh_lru(void *arg)
|
|
{
|
|
struct bh_lru *b = &get_cpu_var(bh_lrus);
|
|
struct buffer_head *bh = arg;
|
|
int i;
|
|
|
|
for (i = 0; i < BH_LRU_SIZE; i++) {
|
|
if (b->bhs[i] == bh) {
|
|
brelse(b->bhs[i]);
|
|
b->bhs[i] = NULL;
|
|
goto out;
|
|
}
|
|
}
|
|
out:
|
|
put_cpu_var(bh_lrus);
|
|
}
|
|
|
|
static bool bh_exists_in_lru(int cpu, void *arg)
{
	struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
	struct buffer_head *bh = arg;
	int i;

	for (i = 0; i < BH_LRU_SIZE; i++) {
		if (b->bhs[i] == bh)
			return 1;
	}

	return 0;
}

void invalidate_bh_lrus(void)
{
	on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1, GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

static void evict_bh_lrus(struct buffer_head *bh)
|
|
{
|
|
on_each_cpu_cond(bh_exists_in_lru, __evict_bh_lru, bh, 1, GFP_ATOMIC);
|
|
}
|
|
|
|
void set_bh_page(struct buffer_head *bh,
|
|
struct page *page, unsigned long offset)
|
|
{
|
|
bh->b_page = page;
|
|
BUG_ON(offset >= PAGE_SIZE);
|
|
if (PageHighMem(page))
|
|
/*
|
|
* This catches illegal uses and preserves the offset:
|
|
*/
|
|
bh->b_data = (char *)(0 + offset);
|
|
else
|
|
bh->b_data = page_address(page) + offset;
|
|
}
|
|
EXPORT_SYMBOL(set_bh_page);
|
|
|
|
/*
 * Called when truncating a buffer on a page completely.
 */

/* Bits that are cleared during an invalidate */
#define BUFFER_FLAGS_DISCARD \
	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
	 1 << BH_Delay | 1 << BH_Unwritten)

static void discard_buffer(struct buffer_head *bh)
{
	unsigned long b_state, b_state_old;

	lock_buffer(bh);
	clear_buffer_dirty(bh);
	bh->b_bdev = NULL;
	b_state = bh->b_state;
	for (;;) {
		b_state_old = cmpxchg(&bh->b_state, b_state,
				      (b_state & ~BUFFER_FLAGS_DISCARD));
		if (b_state_old == b_state)
			break;
		b_state = b_state_old;
	}
	unlock_buffer(bh);
}

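/*
 * Illustration only, not part of the kernel source. A user-space sketch of
 * the compare-and-swap retry loop in discard_buffer(), written with C11
 * <stdatomic.h> instead of the kernel's cmpxchg(). The flag bits are
 * cleared in one atomic step, and the loop retries if another thread
 * changed the word in between. sketch_clear_flags() is a hypothetical name.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Atomically clear @flags in *@state. Returns the value that was stored
 * immediately before the successful update.
 */
static unsigned long sketch_clear_flags(atomic_ulong *state, unsigned long flags)
{
	unsigned long old;

	assert(state != NULL);
	old = atomic_load(state);
	/* On failure the call reloads 'old' with the current value; retry. */
	while (!atomic_compare_exchange_weak(state, &old, old & ~flags))
		;
	return old;
}
```
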
/**
 * block_invalidatepage - invalidate part or all of a buffer-backed page
 *
 * @page: the page which is affected
 * @offset: start of the range to invalidate
 * @length: length of the range to invalidate
 *
 * block_invalidatepage() is called when all or part of the page has become
 * invalidated by a truncate operation.
 *
 * block_invalidatepage() does not have to release all buffers, but it must
 * ensure that no dirty buffer is left outside @offset and that no I/O
 * is underway against any of the blocks which are outside the truncation
 * point. Because the caller is about to free (and possibly reuse) those
 * blocks on-disk.
 */
void block_invalidatepage(struct page *page, unsigned int offset,
			  unsigned int length)
{
	struct buffer_head *head, *bh, *next;
	unsigned int curr_off = 0;
	unsigned int stop = length + offset;

	BUG_ON(!PageLocked(page));
	if (!page_has_buffers(page))
		goto out;

	/*
	 * Check for overflow
	 */
	BUG_ON(stop > PAGE_SIZE || stop < length);

	head = page_buffers(page);
	bh = head;
	do {
		unsigned int next_off = curr_off + bh->b_size;
		next = bh->b_this_page;

		/*
		 * Are we still fully in range ?
		 */
		if (next_off > stop)
			goto out;

		/*
		 * is this block fully invalidated?
		 */
		if (offset <= curr_off)
			discard_buffer(bh);
		curr_off = next_off;
		bh = next;
	} while (bh != head);

	/*
	 * We release buffers only if the entire page is being invalidated.
	 * The get_block cached value has been unconditionally invalidated,
	 * so real IO is not possible anymore.
	 */
	if (offset == 0)
		try_to_release_page(page, 0);
out:
	return;
}
EXPORT_SYMBOL(block_invalidatepage);

/*
 * We attach and possibly dirty the buffers atomically wrt
 * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
 * is already excluded via the page lock.
 */
void create_empty_buffers(struct page *page,
			unsigned long blocksize, unsigned long b_state)
{
	struct buffer_head *bh, *head, *tail;

	head = alloc_page_buffers(page, blocksize, true);
	bh = head;
	do {
		bh->b_state |= b_state;
		tail = bh;
		bh = bh->b_this_page;
	} while (bh);
	tail->b_this_page = head;

	spin_lock(&page->mapping->private_lock);
	if (PageUptodate(page) || PageDirty(page)) {
		bh = head;
		do {
			if (PageDirty(page))
				set_buffer_dirty(bh);
			if (PageUptodate(page))
				set_buffer_uptodate(bh);
			bh = bh->b_this_page;
		} while (bh != head);
	}
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}
EXPORT_SYMBOL(create_empty_buffers);

/**
 * clean_bdev_aliases - clean a range of buffers in block device
 * @bdev: Block device to clean buffers in
 * @block: Start of a range of blocks to clean
 * @len: Number of blocks to clean
 *
 * We are taking a range of blocks for data and we don't want writeback of any
 * buffer-cache aliases starting from return from this function and until the
 * moment when something will explicitly mark the buffer dirty (hopefully that
 * will not happen until we free that block ;-). We don't even need to mark
 * it not-uptodate - nobody can expect anything from a newly allocated buffer
 * anyway. We used to use unmap_buffer() for such invalidation, but that was
 * wrong. We definitely don't want to mark the alias unmapped, for example - it
 * would confuse anyone who might pick it with bread() afterwards...
 *
 * Also, note that bforget() doesn't lock the buffer. So there can be
 * writeout I/O going on against recently-freed buffers. We don't wait on that
 * I/O in bforget() - it's more efficient to wait on the I/O only if we really
 * need to. That happens here.
 */
void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
{
	struct inode *bd_inode = bdev->bd_inode;
	struct address_space *bd_mapping = bd_inode->i_mapping;
	struct pagevec pvec;
	pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
	pgoff_t end;
	int i, count;
	struct buffer_head *bh;
	struct buffer_head *head;

	end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
	pagevec_init(&pvec);
	while (pagevec_lookup_range(&pvec, bd_mapping, &index, end)) {
		count = pagevec_count(&pvec);
		for (i = 0; i < count; i++) {
			struct page *page = pvec.pages[i];

			if (!page_has_buffers(page))
				continue;
			/*
			 * We use page lock instead of bd_mapping->private_lock
			 * to pin buffers here since we can afford to sleep and
			 * it scales better than a global spinlock.
			 */
			lock_page(page);
			/* Recheck when the page is locked which pins bhs */
			if (!page_has_buffers(page))
				goto unlock_page;
			head = page_buffers(page);
			bh = head;
			do {
				if (!buffer_mapped(bh) || (bh->b_blocknr < block))
					goto next;
				if (bh->b_blocknr >= block + len)
					break;
				clear_buffer_dirty(bh);
				wait_on_buffer(bh);
				clear_buffer_req(bh);
next:
				bh = bh->b_this_page;
			} while (bh != head);
unlock_page:
			unlock_page(page);
		}
		pagevec_release(&pvec);
		cond_resched();
		/* End of range already reached? */
		if (index > end || !index)
			break;
	}
}
EXPORT_SYMBOL(clean_bdev_aliases);

/*
 * Size is a power-of-two in the range 512..PAGE_SIZE,
 * and the case we care about most is PAGE_SIZE.
 *
 * So this *could* possibly be written with those
 * constraints in mind (relevant mostly if some
 * architecture has a slow bit-scan instruction)
 */
static inline int block_size_bits(unsigned int blocksize)
{
	return ilog2(blocksize);
}

static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
{
	BUG_ON(!PageLocked(page));

	if (!page_has_buffers(page))
		create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
				     b_state);
	return page_buffers(page);
}

/*
 * NOTE! All mapped/uptodate combinations are valid:
 *
 *	Mapped	Uptodate	Meaning
 *
 *	No	No		"unknown" - must do get_block()
 *	No	Yes		"hole" - zero-filled
 *	Yes	No		"allocated" - allocated on disk, not read in
 *	Yes	Yes		"valid" - allocated and up-to-date in memory.
 *
 * "Dirty" is valid only with the last case (mapped+uptodate).
 */

/*
 * While block_write_full_page is writing back the dirty buffers under
 * the page lock, whoever dirtied the buffers may decide to clean them
 * again at any time. We handle that by only looking at the buffer
 * state inside lock_buffer().
 *
 * If block_write_full_page() is called for regular writeback
 * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
 * locked buffer. This can only happen if someone has written the buffer
 * directly, with submit_bh(). At the address_space level PageWriteback
 * prevents this contention from occurring.
 *
 * If block_write_full_page() is called with wbc->sync_mode ==
 * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
 * causes the writes to be flagged as synchronous writes.
 */
int __block_write_full_page(struct inode *inode, struct page *page,
			get_block_t *get_block, struct writeback_control *wbc,
			bh_end_io_t *handler)
{
	int err;
	sector_t block;
	sector_t last_block;
	struct buffer_head *bh, *head;
	unsigned int blocksize, bbits;
	int nr_underway = 0;
	int write_flags = wbc_to_write_flags(wbc);

	head = create_page_buffers(page, inode,
					(1 << BH_Dirty)|(1 << BH_Uptodate));

	/*
	 * Be very careful. We have no exclusion from __set_page_dirty_buffers
	 * here, and the (potentially unmapped) buffers may become dirty at
	 * any time. If a buffer becomes dirty here after we've inspected it
	 * then we just miss that fact, and the page stays dirty.
	 *
	 * Buffers outside i_size may be dirtied by __set_page_dirty_buffers;
	 * handle that here by just cleaning them.
	 */

	bh = head;
	blocksize = bh->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_SHIFT - bbits);
	last_block = (i_size_read(inode) - 1) >> bbits;

	/*
	 * Get all the dirty buffers mapped to disk addresses and
	 * handle any aliases from the underlying blockdev's mapping.
	 */
	do {
		if (block > last_block) {
			/*
			 * mapped buffers outside i_size will occur, because
			 * this page can be outside i_size when there is a
			 * truncate in progress.
			 */
			/*
			 * The buffer was zeroed by block_write_full_page()
			 */
			clear_buffer_dirty(bh);
			set_buffer_uptodate(bh);
		} else if ((!buffer_mapped(bh) || buffer_delay(bh)) &&
			   buffer_dirty(bh)) {
			WARN_ON(bh->b_size != blocksize);
			err = get_block(inode, block, bh, 1);
			if (err)
				goto recover;
			clear_buffer_delay(bh);
			if (buffer_new(bh)) {
				/* blockdev mappings never come here */
				clear_buffer_new(bh);
				clean_bdev_bh_alias(bh);
			}
		}
		bh = bh->b_this_page;
		block++;
	} while (bh != head);

	do {
		if (!buffer_mapped(bh))
			continue;
		/*
		 * If it's a fully non-blocking write attempt and we cannot
		 * lock the buffer then redirty the page. Note that this can
		 * potentially cause a busy-wait loop from writeback threads
		 * and kswapd activity, but those code paths have their own
		 * higher-level throttling.
		 */
		if (wbc->sync_mode != WB_SYNC_NONE) {
			lock_buffer(bh);
		} else if (!trylock_buffer(bh)) {
			redirty_page_for_writepage(wbc, page);
			continue;
		}
		if (test_clear_buffer_dirty(bh)) {
			mark_buffer_async_write_endio(bh, handler);
		} else {
			unlock_buffer(bh);
		}
	} while ((bh = bh->b_this_page) != head);

	/*
	 * The page and its buffers are protected by PageWriteback(), so we can
	 * drop the bh refcounts early.
	 */
	BUG_ON(PageWriteback(page));
	set_page_writeback(page);

	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
					inode->i_write_hint, wbc);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);

	err = 0;
done:
	if (nr_underway == 0) {
		/*
		 * The page was marked dirty, but the buffers were
		 * clean. Someone wrote them back by hand with
		 * ll_rw_block/submit_bh. A rare case.
		 */
		end_page_writeback(page);

		/*
		 * The page and buffer_heads can be released at any time from
		 * here on.
		 */
	}
	return err;

recover:
	/*
	 * ENOSPC, or some other error. We may already have added some
	 * blocks to the file, so we need to write these out to avoid
	 * exposing stale data.
	 * The page is currently locked and not marked for writeback
	 */
	bh = head;
	/* Recovery: lock and submit the mapped buffers */
	do {
		if (buffer_mapped(bh) && buffer_dirty(bh) &&
		    !buffer_delay(bh)) {
			lock_buffer(bh);
			mark_buffer_async_write_endio(bh, handler);
		} else {
			/*
			 * The buffer may have been set dirty during
			 * attachment to a dirty page.
			 */
			clear_buffer_dirty(bh);
		}
	} while ((bh = bh->b_this_page) != head);
	SetPageError(page);
	BUG_ON(PageWriteback(page));
	mapping_set_error(page->mapping, err);
	set_page_writeback(page);
	do {
		struct buffer_head *next = bh->b_this_page;
		if (buffer_async_write(bh)) {
			clear_buffer_dirty(bh);
			submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
					inode->i_write_hint, wbc);
			nr_underway++;
		}
		bh = next;
	} while (bh != head);
	unlock_page(page);
	goto done;
}
EXPORT_SYMBOL(__block_write_full_page);

/*
 * If a page has any new buffers, zero them out here, and mark them uptodate
 * and dirty so they'll be written out (in order to prevent uninitialised
 * block data from leaking). And clear the new bit.
 */
void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
{
	unsigned int block_start, block_end;
	struct buffer_head *head, *bh;

	BUG_ON(!PageLocked(page));
	if (!page_has_buffers(page))
		return;

	bh = head = page_buffers(page);
	block_start = 0;
	do {
		block_end = block_start + bh->b_size;

		if (buffer_new(bh)) {
			if (block_end > from && block_start < to) {
				if (!PageUptodate(page)) {
					unsigned start, size;

					start = max(from, block_start);
					size = min(to, block_end) - start;

					zero_user(page, start, size);
					set_buffer_uptodate(bh);
				}

				clear_buffer_new(bh);
				mark_buffer_dirty(bh);
			}
		}

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);
}
EXPORT_SYMBOL(page_zero_new_buffers);

static void
iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
		struct iomap *iomap)
{
	loff_t offset = block << inode->i_blkbits;

	bh->b_bdev = iomap->bdev;

	/*
	 * Block points to offset in file we need to map, iomap contains
	 * the offset at which the map starts. If the map ends before the
	 * current block, then do not map the buffer and let the caller
	 * handle it.
	 */
	BUG_ON(offset >= iomap->offset + iomap->length);

	switch (iomap->type) {
	case IOMAP_HOLE:
		/*
		 * If the buffer is not up to date or beyond the current EOF,
		 * we need to mark it as new to ensure sub-block zeroing is
		 * executed if necessary.
		 */
		if (!buffer_uptodate(bh) ||
		    (offset >= i_size_read(inode)))
			set_buffer_new(bh);
		break;
	case IOMAP_DELALLOC:
		if (!buffer_uptodate(bh) ||
		    (offset >= i_size_read(inode)))
			set_buffer_new(bh);
		set_buffer_uptodate(bh);
		set_buffer_mapped(bh);
		set_buffer_delay(bh);
		break;
	case IOMAP_UNWRITTEN:
		/*
		 * For unwritten regions, we always need to ensure that
		 * sub-block writes cause the regions in the block we are not
		 * writing to to be zeroed. Set the buffer as new to ensure
		 * this.
		 */
		set_buffer_new(bh);
		set_buffer_unwritten(bh);
		/* FALLTHRU */
	case IOMAP_MAPPED:
		if (offset >= i_size_read(inode))
			set_buffer_new(bh);
		bh->b_blocknr = (iomap->addr + offset - iomap->offset) >>
				inode->i_blkbits;
		set_buffer_mapped(bh);
		break;
	}
}

int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
		get_block_t *get_block, struct iomap *iomap)
{
	unsigned from = pos & (PAGE_SIZE - 1);
	unsigned to = from + len;
	struct inode *inode = page->mapping->host;
	unsigned block_start, block_end;
	sector_t block;
	int err = 0;
	unsigned blocksize, bbits;
	struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;

	BUG_ON(!PageLocked(page));
	BUG_ON(from > PAGE_SIZE);
	BUG_ON(to > PAGE_SIZE);
	BUG_ON(from > to);

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	block = (sector_t)page->index << (PAGE_SHIFT - bbits);

	for(bh = head, block_start = 0; bh != head || !block_start;
	    block++, block_start=block_end, bh = bh->b_this_page) {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (PageUptodate(page)) {
				if (!buffer_uptodate(bh))
					set_buffer_uptodate(bh);
			}
			continue;
		}
		if (buffer_new(bh))
			clear_buffer_new(bh);
		if (!buffer_mapped(bh)) {
			WARN_ON(bh->b_size != blocksize);
			if (get_block) {
				err = get_block(inode, block, bh, 1);
				if (err)
					break;
			} else {
				iomap_to_bh(inode, block, bh, iomap);
			}

			if (buffer_new(bh)) {
				clean_bdev_bh_alias(bh);
				if (PageUptodate(page)) {
					clear_buffer_new(bh);
					set_buffer_uptodate(bh);
					mark_buffer_dirty(bh);
					continue;
				}
				if (block_end > to || block_start < from)
					zero_user_segments(page,
						to, block_end,
						block_start, from);
				continue;
			}
		}
		if (PageUptodate(page)) {
			if (!buffer_uptodate(bh))
				set_buffer_uptodate(bh);
			continue;
		}
		if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
		    !buffer_unwritten(bh) &&
		     (block_start < from || block_end > to)) {
			ll_rw_block(REQ_OP_READ, 0, 1, &bh);
			*wait_bh++=bh;
		}
	}
	/*
	 * If we issued read requests - let them complete.
	 */
	while(wait_bh > wait) {
		wait_on_buffer(*--wait_bh);
		if (!buffer_uptodate(*wait_bh))
			err = -EIO;
	}
	if (unlikely(err))
		page_zero_new_buffers(page, from, to);
	return err;
}

int __block_write_begin(struct page *page, loff_t pos, unsigned len,
		get_block_t *get_block)
{
	return __block_write_begin_int(page, pos, len, get_block, NULL);
}
EXPORT_SYMBOL(__block_write_begin);

static int __block_commit_write(struct inode *inode, struct page *page,
		unsigned from, unsigned to)
{
	unsigned block_start, block_end;
	int partial = 0;
	unsigned blocksize;
	struct buffer_head *bh, *head;

	bh = head = page_buffers(page);
	blocksize = bh->b_size;

	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end <= from || block_start >= to) {
			if (!buffer_uptodate(bh))
				partial = 1;
		} else {
			set_buffer_uptodate(bh);
			mark_buffer_dirty(bh);
		}
		clear_buffer_new(bh);

		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	/*
	 * If this is a partial write which happened to make all buffers
	 * uptodate then we can optimize away a bogus readpage() for
	 * the next read(). Here we 'discover' whether the page went
	 * uptodate as a result of this (potentially partial) write.
	 */
	if (!partial)
		SetPageUptodate(page);
	return 0;
}

/*
 * block_write_begin takes care of the basic task of block allocation and
 * bringing partial write blocks uptodate first.
 *
 * The filesystem needs to handle block truncation upon failure.
 */
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
		unsigned flags, struct page **pagep, get_block_t *get_block)
{
	pgoff_t index = pos >> PAGE_SHIFT;
	struct page *page;
	int status;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;

	status = __block_write_begin(page, pos, len, get_block);
	if (unlikely(status)) {
		unlock_page(page);
		put_page(page);
		page = NULL;
	}

	*pagep = page;
	return status;
}
EXPORT_SYMBOL(block_write_begin);

int block_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	unsigned start;

	start = pos & (PAGE_SIZE - 1);

	if (unlikely(copied < len)) {
		/*
		 * The buffers that were written will now be uptodate, so we
		 * don't have to worry about a readpage reading them and
		 * overwriting a partial write. However if we have encountered
		 * a short write and only partially written into a buffer, it
		 * will not be marked uptodate, so a readpage might come in and
		 * destroy our partial write.
		 *
		 * Do the simplest thing, and just treat any short write to a
		 * non uptodate page as a zero-length write, and force the
		 * caller to redo the whole thing.
		 */
		if (!PageUptodate(page))
			copied = 0;

		page_zero_new_buffers(page, start+copied, start+len);
	}
	flush_dcache_page(page);

	/* This could be a short (even 0-length) commit */
	__block_commit_write(inode, page, start, start+copied);

	return copied;
}
EXPORT_SYMBOL(block_write_end);

int generic_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = mapping->host;
	loff_t old_size = inode->i_size;
	int i_size_changed = 0;

	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

	/*
	 * No need to use i_size_read() here, the i_size
	 * cannot change under us because we hold i_mutex.
	 *
	 * But it's important to update i_size while still holding page lock:
	 * page writeout could otherwise come in and zero beyond i_size.
	 */
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		i_size_changed = 1;
	}

	unlock_page(page);
	put_page(page);

	if (old_size < pos)
		pagecache_isize_extended(inode, old_size, pos);
	/*
	 * Don't mark the inode dirty under page lock. First, it unnecessarily
	 * makes the holding time of page lock longer. Second, it forces lock
	 * ordering of page lock and transaction start for journaling
	 * filesystems.
	 */
	if (i_size_changed)
		mark_inode_dirty(inode);

	return copied;
}
EXPORT_SYMBOL(generic_write_end);

/*
 * block_is_partially_uptodate checks whether buffers within a page are
 * uptodate or not.
 *
 * Returns true if all buffers which correspond to a file portion
 * we want to read are uptodate.
 */
int block_is_partially_uptodate(struct page *page, unsigned long from,
					unsigned long count)
{
	unsigned block_start, block_end, blocksize;
	unsigned to;
	struct buffer_head *bh, *head;
	int ret = 1;

	if (!page_has_buffers(page))
		return 0;

	head = page_buffers(page);
	blocksize = head->b_size;
	to = min_t(unsigned, PAGE_SIZE - from, count);
	to = from + to;
	if (from < blocksize && to > PAGE_SIZE - blocksize)
		return 0;

	bh = head;
	block_start = 0;
	do {
		block_end = block_start + blocksize;
		if (block_end > from && block_start < to) {
			if (!buffer_uptodate(bh)) {
				ret = 0;
				break;
			}
			if (block_end >= to)
				break;
		}
		block_start = block_end;
		bh = bh->b_this_page;
	} while (bh != head);

	return ret;
}
EXPORT_SYMBOL(block_is_partially_uptodate);

/*
 * Generic "read page" function for block devices that have the normal
 * get_block functionality. This covers most block device filesystems.
 * Reads the page asynchronously --- the unlock_buffer() and
 * set/clear_buffer_uptodate() functions propagate buffer state into the
 * page struct once IO has completed.
 */
int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize, bbits;
	int nr, i;
	int fully_mapped = 1;

	head = create_page_buffers(page, inode, 0);
	blocksize = head->b_size;
	bbits = block_size_bits(blocksize);

	iblock = (sector_t)page->index << (PAGE_SHIFT - bbits);
	lblock = (i_size_read(inode)+blocksize-1) >> bbits;
	bh = head;
	nr = 0;
	i = 0;

	do {
		if (buffer_uptodate(bh))
			continue;

		if (!buffer_mapped(bh)) {
			int err = 0;

			fully_mapped = 0;
			if (iblock < lblock) {
				WARN_ON(bh->b_size != blocksize);
				err = get_block(inode, iblock, bh, 0);
				if (err)
					SetPageError(page);
			}
			if (!buffer_mapped(bh)) {
				zero_user(page, i * blocksize, blocksize);
				if (!err)
					set_buffer_uptodate(bh);
				continue;
			}
			/*
			 * get_block() might have updated the buffer
			 * synchronously
			 */
			if (buffer_uptodate(bh))
				continue;
		}
		arr[nr++] = bh;
	} while (i++, iblock++, (bh = bh->b_this_page) != head);

	if (fully_mapped)
		SetPageMappedToDisk(page);

	if (!nr) {
		/*
		 * All buffers are uptodate - we can set the page uptodate
		 * as well. But not if get_block() returned an error.
		 */
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
		return 0;
	}

	/* Stage two: lock the buffers */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		lock_buffer(bh);
		mark_buffer_async_read(bh);
	}

	/*
	 * Stage 3: start the IO. Check for uptodateness
	 * inside the buffer lock in case another process reading
	 * the underlying blockdev brought it uptodate (the sct fix).
	 */
	for (i = 0; i < nr; i++) {
		bh = arr[i];
		if (buffer_uptodate(bh))
			end_buffer_async_read(bh, 1);
		else
			submit_bh(REQ_OP_READ, 0, bh);
	}
	return 0;
}
EXPORT_SYMBOL(block_read_full_page);

/* utility function for filesystems that need to do work on expanding
 * truncates. Uses filesystem pagecache writes to allow the filesystem to
 * deal with the hole.
 */
int generic_cont_expand_simple(struct inode *inode, loff_t size)
{
	struct address_space *mapping = inode->i_mapping;
	struct page *page;
	void *fsdata = NULL;
	int err;

	err = inode_newsize_ok(inode, size);
	if (err)
		goto out;

	err = pagecache_write_begin(NULL, mapping, size, 0,
				    AOP_FLAG_CONT_EXPAND, &page, &fsdata);
	if (err)
		goto out;

	err = pagecache_write_end(NULL, mapping, size, 0, 0, page, fsdata);
	BUG_ON(err > 0);

out:
	return err;
}
EXPORT_SYMBOL(generic_cont_expand_simple);

static int cont_expand_zero(struct file *file, struct address_space *mapping,
			    loff_t pos, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned int blocksize = i_blocksize(inode);
	struct page *page;
	void *fsdata = NULL;
	pgoff_t index, curidx;
	loff_t curpos;
	unsigned zerofrom, offset, len;
	int err = 0;

	index = pos >> PAGE_SHIFT;
	offset = pos & ~PAGE_MASK;

	while (index > (curidx = (curpos = *bytes)>>PAGE_SHIFT)) {
		zerofrom = curpos & ~PAGE_MASK;
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = PAGE_SIZE - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len, 0,
					    &page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;

		balance_dirty_pages_ratelimited(mapping);

		if (unlikely(fatal_signal_pending(current))) {
			err = -EINTR;
			goto out;
		}
	}

	/* page covers the boundary, find the boundary offset */
	if (index == curidx) {
		zerofrom = curpos & ~PAGE_MASK;
		/* if we are going to expand, the last block will be filled */
		if (offset <= zerofrom) {
			goto out;
		}
		if (zerofrom & (blocksize-1)) {
			*bytes |= (blocksize-1);
			(*bytes)++;
		}
		len = offset - zerofrom;

		err = pagecache_write_begin(file, mapping, curpos, len, 0,
					    &page, &fsdata);
		if (err)
			goto out;
		zero_user(page, zerofrom, len);
		err = pagecache_write_end(file, mapping, curpos, len, len,
						page, fsdata);
		if (err < 0)
			goto out;
		BUG_ON(err != len);
		err = 0;
	}
out:
	return err;
}

/*
 * For moronic filesystems that do not allow holes in file.
 * We may have to extend the file.
 */
int cont_write_begin(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block, loff_t *bytes)
{
	struct inode *inode = mapping->host;
	unsigned int blocksize = i_blocksize(inode);
	unsigned int zerofrom;
	int err;

	err = cont_expand_zero(file, mapping, pos, bytes);
	if (err)
		return err;

	zerofrom = *bytes & ~PAGE_MASK;
	if (pos+len > *bytes && zerofrom & (blocksize-1)) {
		*bytes |= (blocksize-1);
		(*bytes)++;
	}

	return block_write_begin(mapping, pos, len, flags, pagep, get_block);
}
EXPORT_SYMBOL(cont_write_begin);

int block_commit_write(struct page *page, unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	__block_commit_write(inode,page,from,to);
	return 0;
}
EXPORT_SYMBOL(block_commit_write);

/*
 * block_page_mkwrite() is not allowed to change the file size as it gets
 * called from a page fault handler when a page is first dirtied. Hence we must
 * be careful to check for EOF conditions here. We set the page up correctly
 * for a written page which means we get ENOSPC checking when writing into
 * holes and correct delalloc and unwritten extent mapping on filesystems that
 * support these features.
 *
 * We are not allowed to take the i_mutex here so we have to play games to
 * protect against truncate races as the page could now be beyond EOF. Because
 * truncate writes the inode size before removing pages, once we have the
 * page lock we can determine safely if the page is beyond EOF. If it is not
 * beyond EOF, then the page is guaranteed safe against truncation until we
 * unlock the page.
 *
 * Direct callers of this function should protect against filesystem freezing
 * using sb_start_pagefault() - sb_end_pagefault() functions.
 */
int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
			 get_block_t get_block)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);
	unsigned long end;
	loff_t size;
	int ret;

	lock_page(page);
	size = i_size_read(inode);
	if ((page->mapping != inode->i_mapping) ||
	    (page_offset(page) > size)) {
		/* We overload EFAULT to mean page got truncated */
		ret = -EFAULT;
		goto out_unlock;
	}

	/* page is wholly or partially inside EOF */
	if (((page->index + 1) << PAGE_SHIFT) > size)
		end = size & ~PAGE_MASK;
	else
		end = PAGE_SIZE;

	ret = __block_write_begin(page, 0, end, get_block);
	if (!ret)
		ret = block_commit_write(page, 0, end);

	if (unlikely(ret < 0))
		goto out_unlock;
	set_page_dirty(page);
	wait_for_stable_page(page);
	return 0;
out_unlock:
	unlock_page(page);
	return ret;
}
EXPORT_SYMBOL(block_page_mkwrite);

/*
 * nobh_write_begin()'s prereads are special: the buffer_heads are freed
 * immediately, while under the page lock. So it needs a special end_io
 * handler which does not touch the bh after unlocking it.
 */
static void end_buffer_read_nobh(struct buffer_head *bh, int uptodate)
{
	__end_buffer_read_notouch(bh, uptodate);
}

/*
 * Attach the singly-linked list of buffers created by nobh_write_begin to
 * the page (converting it to a circular linked list and taking care of page
 * dirty races).
 */
static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
{
	struct buffer_head *bh;

	BUG_ON(!PageLocked(page));

	spin_lock(&page->mapping->private_lock);
	bh = head;
	do {
		if (PageDirty(page))
			set_buffer_dirty(bh);
		if (!bh->b_this_page)
			bh->b_this_page = head;
		bh = bh->b_this_page;
	} while (bh != head);
	attach_page_buffers(page, head);
	spin_unlock(&page->mapping->private_lock);
}

/*
 * On entry, the page is fully not uptodate.
 * On exit the page is fully uptodate in the areas outside (from,to)
 * The filesystem needs to handle block truncation upon failure.
 */
int nobh_write_begin(struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata,
			get_block_t *get_block)
{
	struct inode *inode = mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocksize = 1 << blkbits;
	struct buffer_head *head, *bh;
	struct page *page;
	pgoff_t index;
	unsigned from, to;
	unsigned block_in_page;
	unsigned block_start, block_end;
	sector_t block_in_file;
	int nr_reads = 0;
	int ret = 0;
	int is_mapped_to_disk = 1;

	index = pos >> PAGE_SHIFT;
	from = pos & (PAGE_SIZE - 1);
	to = from + len;

	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	*fsdata = NULL;

	if (page_has_buffers(page)) {
		ret = __block_write_begin(page, pos, len, get_block);
		if (unlikely(ret))
			goto out_release;
		return ret;
	}

	if (PageMappedToDisk(page))
		return 0;

	/*
	 * Allocate buffers so that we can keep track of state, and potentially
	 * attach them to the page if an error occurs. In the common case of
	 * no error, they will just be freed again without ever being attached
	 * to the page (which is all OK, because we're under the page lock).
	 *
	 * Be careful: the buffer linked list is a NULL terminated one, rather
	 * than the circular one we're used to.
	 */
	head = alloc_page_buffers(page, blocksize, false);
	if (!head) {
		ret = -ENOMEM;
		goto out_release;
	}

	block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits);

	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region. This is so we can discover if the
	 * page is fully mapped-to-disk.
	 */
	for (block_start = 0, block_in_page = 0, bh = head;
		  block_start < PAGE_SIZE;
		  block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
		int create;

		block_end = block_start + blocksize;
		bh->b_state = 0;
		create = 1;
		if (block_start >= to)
			create = 0;
		ret = get_block(inode, block_in_file + block_in_page,
					bh, create);
		if (ret)
			goto failed;
		if (!buffer_mapped(bh))
			is_mapped_to_disk = 0;
		if (buffer_new(bh))
			clean_bdev_bh_alias(bh);
		if (PageUptodate(page)) {
			set_buffer_uptodate(bh);
			continue;
		}
		if (buffer_new(bh) || !buffer_mapped(bh)) {
			zero_user_segments(page, block_start, from,
							to, block_end);
			continue;
		}
		if (buffer_uptodate(bh))
			continue;	/* reiserfs does this */
		if (block_start < from || block_end > to) {
			lock_buffer(bh);
			bh->b_end_io = end_buffer_read_nobh;
			submit_bh(REQ_OP_READ, 0, bh);
			nr_reads++;
		}
	}

	if (nr_reads) {
		/*
		 * The page is locked, so these buffers are protected from
		 * any VM or truncate activity. Hence we don't need to care
		 * for the buffer_head refcounts.
		 */
		for (bh = head; bh; bh = bh->b_this_page) {
			wait_on_buffer(bh);
			if (!buffer_uptodate(bh))
				ret = -EIO;
		}
		if (ret)
			goto failed;
	}

	if (is_mapped_to_disk)
		SetPageMappedToDisk(page);

	*fsdata = head; /* to be released by nobh_write_end */

	return 0;

failed:
	BUG_ON(!ret);
	/*
	 * Error recovery is a bit difficult. We need to zero out blocks that
	 * were newly allocated, and dirty them to ensure they get written out.
	 * Buffers need to be attached to the page at this point, otherwise
	 * the handling of potential IO errors during writeout would be hard
	 * (could try doing synchronous writeout, but what if that fails too?)
	 */
	attach_nobh_buffers(page, head);
	page_zero_new_buffers(page, from, to);

out_release:
	unlock_page(page);
	put_page(page);
	*pagep = NULL;

	return ret;
}
EXPORT_SYMBOL(nobh_write_begin);

int nobh_write_end(struct file *file, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata)
{
	struct inode *inode = page->mapping->host;
	struct buffer_head *head = fsdata;
	struct buffer_head *bh;
	BUG_ON(fsdata != NULL && page_has_buffers(page));

	if (unlikely(copied < len) && head)
		attach_nobh_buffers(page, head);
	if (page_has_buffers(page))
		return generic_write_end(file, mapping, pos, len,
					copied, page, fsdata);

	SetPageUptodate(page);
	set_page_dirty(page);
	if (pos+copied > inode->i_size) {
		i_size_write(inode, pos+copied);
		mark_inode_dirty(inode);
	}

	unlock_page(page);
	put_page(page);

	while (head) {
		bh = head;
		head = head->b_this_page;
		free_buffer_head(bh);
	}

	return copied;
}
EXPORT_SYMBOL(nobh_write_end);

/*
 * nobh_writepage() - based on block_full_write_page() except
 * that it tries to operate without attaching bufferheads to
 * the page.
 */
int nobh_writepage(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_SHIFT;
	unsigned offset;
	int ret;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		goto out;

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size.  It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped.  "A file is mapped
	 * in multiples of the page size.  For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_SIZE);
out:
	ret = mpage_writepage(page, get_block, wbc);
	if (ret == -EAGAIN)
		ret = __block_write_full_page(inode, page, get_block, wbc,
					      end_buffer_async_write);
	return ret;
}
EXPORT_SYMBOL(nobh_writepage);

int nobh_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_SHIFT;
	unsigned offset = from & (PAGE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head map_bh;
	int err;

	blocksize = i_blocksize(inode);
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (page_has_buffers(page)) {
has_buffers:
		unlock_page(page);
		put_page(page);
		return block_truncate_page(mapping, from, get_block);
	}

	/* Find the buffer that contains "offset" */
	pos = blocksize;
	while (offset >= pos) {
		iblock++;
		pos += blocksize;
	}

	map_bh.b_size = blocksize;
	map_bh.b_state = 0;
	err = get_block(inode, iblock, &map_bh, 0);
	if (err)
		goto unlock;
	/* unmapped? It's a hole - nothing to do */
	if (!buffer_mapped(&map_bh))
		goto unlock;

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (!PageUptodate(page)) {
		err = mapping->a_ops->readpage(NULL, page);
		if (err) {
			put_page(page);
			goto out;
		}
		lock_page(page);
		if (!PageUptodate(page)) {
			err = -EIO;
			goto unlock;
		}
		if (page_has_buffers(page))
			goto has_buffers;
	}
	zero_user(page, offset, length);
	set_page_dirty(page);
	err = 0;

unlock:
	unlock_page(page);
	put_page(page);
out:
	return err;
}
EXPORT_SYMBOL(nobh_truncate_page);

int block_truncate_page(struct address_space *mapping,
			loff_t from, get_block_t *get_block)
{
	pgoff_t index = from >> PAGE_SHIFT;
	unsigned offset = from & (PAGE_SIZE-1);
	unsigned blocksize;
	sector_t iblock;
	unsigned length, pos;
	struct inode *inode = mapping->host;
	struct page *page;
	struct buffer_head *bh;
	int err;

	blocksize = i_blocksize(inode);
	length = offset & (blocksize - 1);

	/* Block boundary? Nothing to do */
	if (!length)
		return 0;

	length = blocksize - length;
	iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);

	page = grab_cache_page(mapping, index);
	err = -ENOMEM;
	if (!page)
		goto out;

	if (!page_has_buffers(page))
		create_empty_buffers(page, blocksize, 0);

	/* Find the buffer that contains "offset" */
	bh = page_buffers(page);
	pos = blocksize;
	while (offset >= pos) {
		bh = bh->b_this_page;
		iblock++;
		pos += blocksize;
	}

	err = 0;
	if (!buffer_mapped(bh)) {
		WARN_ON(bh->b_size != blocksize);
		err = get_block(inode, iblock, bh, 0);
		if (err)
			goto unlock;
		/* unmapped? It's a hole - nothing to do */
		if (!buffer_mapped(bh))
			goto unlock;
	}

	/* Ok, it's mapped. Make sure it's up-to-date */
	if (PageUptodate(page))
		set_buffer_uptodate(bh);

	if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) {
		err = -EIO;
		ll_rw_block(REQ_OP_READ, 0, 1, &bh);
		wait_on_buffer(bh);
		/* Uhhuh. Read error. Complain and punt. */
		if (!buffer_uptodate(bh))
			goto unlock;
	}

	zero_user(page, offset, length);
	mark_buffer_dirty(bh);
	err = 0;

unlock:
	unlock_page(page);
	put_page(page);
out:
	return err;
}
EXPORT_SYMBOL(block_truncate_page);

/*
 * The generic ->writepage function for buffer-backed address_spaces
 */
int block_write_full_page(struct page *page, get_block_t *get_block,
			struct writeback_control *wbc)
{
	struct inode * const inode = page->mapping->host;
	loff_t i_size = i_size_read(inode);
	const pgoff_t end_index = i_size >> PAGE_SHIFT;
	unsigned offset;

	/* Is the page fully inside i_size? */
	if (page->index < end_index)
		return __block_write_full_page(inode, page, get_block, wbc,
					       end_buffer_async_write);

	/* Is the page fully outside i_size? (truncate in progress) */
	offset = i_size & (PAGE_SIZE-1);
	if (page->index >= end_index+1 || !offset) {
		unlock_page(page);
		return 0; /* don't care */
	}

	/*
	 * The page straddles i_size.  It must be zeroed out on each and every
	 * writepage invocation because it may be mmapped.  "A file is mapped
	 * in multiples of the page size.  For a file that is not a multiple of
	 * the page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */
	zero_user_segment(page, offset, PAGE_SIZE);
	return __block_write_full_page(inode, page, get_block, wbc,
							end_buffer_async_write);
}
EXPORT_SYMBOL(block_write_full_page);

sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
			    get_block_t *get_block)
{
	struct inode *inode = mapping->host;
	struct buffer_head tmp = {
		.b_size = i_blocksize(inode),
	};

	get_block(inode, block, &tmp, 0);
	return tmp.b_blocknr;
}
EXPORT_SYMBOL(generic_block_bmap);

static void end_bio_bh_io_sync(struct bio *bio)
{
	struct buffer_head *bh = bio->bi_private;

	if (unlikely(bio_flagged(bio, BIO_QUIET)))
		set_bit(BH_Quiet, &bh->b_state);

	bh->b_end_io(bh, !bio->bi_status);
	bio_put(bio);
}

/*
 * This allows us to do IO even on the odd last sectors
 * of a device, even if the block size is some multiple
 * of the physical sector size.
 *
 * We'll just truncate the bio to the size of the device,
 * and clear the end of the buffer head manually.
 *
 * Truly out-of-range accesses will turn into actual IO
 * errors, this only handles the "we need to be able to
 * do IO at the final sector" case.
 */
void guard_bio_eod(int op, struct bio *bio)
{
	sector_t maxsector;
	struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1];
	unsigned truncated_bytes;
	struct hd_struct *part;

	rcu_read_lock();
	part = __disk_get_part(bio->bi_disk, bio->bi_partno);
	if (part)
		maxsector = part_nr_sects_read(part);
	else
		maxsector = get_capacity(bio->bi_disk);
	rcu_read_unlock();

	if (!maxsector)
		return;

	/*
	 * If the *whole* IO is past the end of the device,
	 * let it through, and the IO layer will turn it into
	 * an EIO.
	 */
	if (unlikely(bio->bi_iter.bi_sector >= maxsector))
		return;

	maxsector -= bio->bi_iter.bi_sector;
	if (likely((bio->bi_iter.bi_size >> 9) <= maxsector))
		return;

	/* Uhhuh. We've got a bio that straddles the device size! */
	truncated_bytes = bio->bi_iter.bi_size - (maxsector << 9);

	/*
	 * The bio contains more than one segment which spans EOD, just return
	 * and let IO layer turn it into an EIO
	 */
	if (truncated_bytes > bvec->bv_len)
		return;

	/* Truncate the bio.. */
	bio->bi_iter.bi_size -= truncated_bytes;
	bvec->bv_len -= truncated_bytes;

	/* ..and clear the end of the buffer for reads */
	if (op == REQ_OP_READ) {
		zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
				truncated_bytes);
	}
}

static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
			 enum rw_hint write_hint, struct writeback_control *wbc)
{
	struct bio *bio;

	BUG_ON(!buffer_locked(bh));
	BUG_ON(!buffer_mapped(bh));
	BUG_ON(!bh->b_end_io);
	BUG_ON(buffer_delay(bh));
	BUG_ON(buffer_unwritten(bh));

	/*
	 * Only clear out a write error when rewriting
	 */
	if (test_set_buffer_req(bh) && (op == REQ_OP_WRITE))
		clear_buffer_write_io_error(bh);

	/*
	 * from here on down, it's all bio -- do the initial mapping,
	 * submit_bio -> generic_make_request may further map this bio around
	 */
	bio = bio_alloc(GFP_NOIO, 1);

	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);

	if (wbc) {
		wbc_init_bio(wbc, bio);
		wbc_account_io(wbc, bh->b_page, bh->b_size);
	}

	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
	bio_set_dev(bio, bh->b_bdev);
	bio->bi_write_hint = write_hint;

	bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
	BUG_ON(bio->bi_iter.bi_size != bh->b_size);

	bio->bi_end_io = end_bio_bh_io_sync;
	bio->bi_private = bh;

	/* Take care of bh's that straddle the end of the device */
	guard_bio_eod(op, bio);

	if (buffer_meta(bh))
		op_flags |= REQ_META;
	if (buffer_prio(bh))
		op_flags |= REQ_PRIO;
	bio_set_op_attrs(bio, op, op_flags);

	submit_bio(bio);
	return 0;
}

int submit_bh(int op, int op_flags, struct buffer_head *bh)
{
	return submit_bh_wbc(op, op_flags, bh, 0, NULL);
}
EXPORT_SYMBOL(submit_bh);

/**
 * ll_rw_block: low-level access to block devices (DEPRECATED)
 * @op: whether to %READ or %WRITE
 * @op_flags: req_flag_bits
 * @nr: number of &struct buffer_heads in the array
 * @bhs: array of pointers to &struct buffer_head
 *
 * ll_rw_block() takes an array of pointers to &struct buffer_heads, and
 * requests an I/O operation on them, either a %REQ_OP_READ or a %REQ_OP_WRITE.
 * @op_flags contains flags modifying the detailed I/O behavior, most notably
 * %REQ_RAHEAD.
 *
 * This function drops any buffer that it cannot get a lock on (with the
 * BH_Lock state bit), any buffer that appears to be clean when doing a write
 * request, and any buffer that appears to be up-to-date when doing read
 * request.  Further it marks as clean buffers that are processed for
 * writing (the buffer cache won't assume that they are actually clean
 * until the buffer gets unlocked).
 *
 * ll_rw_block sets b_end_io to simple completion handler that marks
 * the buffer up-to-date (if appropriate), unlocks the buffer and wakes
 * any waiters.
 *
 * All of the buffers must be for the same device, and must also be a
 * multiple of the current approved size for the device.
 */
void ll_rw_block(int op, int op_flags,  int nr, struct buffer_head *bhs[])
{
	int i;

	for (i = 0; i < nr; i++) {
		struct buffer_head *bh = bhs[i];

		if (!trylock_buffer(bh))
			continue;
		if (op == WRITE) {
			if (test_clear_buffer_dirty(bh)) {
				bh->b_end_io = end_buffer_write_sync;
				get_bh(bh);
				submit_bh(op, op_flags, bh);
				continue;
			}
		} else {
			if (!buffer_uptodate(bh)) {
				bh->b_end_io = end_buffer_read_sync;
				get_bh(bh);
				submit_bh(op, op_flags, bh);
				continue;
			}
		}
		unlock_buffer(bh);
	}
}
EXPORT_SYMBOL(ll_rw_block);

void write_dirty_buffer(struct buffer_head *bh, int op_flags)
{
	lock_buffer(bh);
	if (!test_clear_buffer_dirty(bh)) {
		unlock_buffer(bh);
		return;
	}
	bh->b_end_io = end_buffer_write_sync;
	get_bh(bh);
	submit_bh(REQ_OP_WRITE, op_flags, bh);
}
EXPORT_SYMBOL(write_dirty_buffer);

/*
 * For a data-integrity writeout, we need to wait upon any in-progress I/O
 * and then start new I/O and then wait upon it.  The caller must have a ref on
 * the buffer_head.
 */
int __sync_dirty_buffer(struct buffer_head *bh, int op_flags)
{
	int ret = 0;

	WARN_ON(atomic_read(&bh->b_count) < 1);
	lock_buffer(bh);
	if (test_clear_buffer_dirty(bh)) {
		/*
		 * The bh should be mapped, but it might not be if the
		 * device was hot-removed. Not much we can do but fail the I/O.
		 */
		if (!buffer_mapped(bh)) {
			unlock_buffer(bh);
			return -EIO;
		}

		get_bh(bh);
		bh->b_end_io = end_buffer_write_sync;
		ret = submit_bh(REQ_OP_WRITE, op_flags, bh);
		wait_on_buffer(bh);
		if (!ret && !buffer_uptodate(bh))
			ret = -EIO;
	} else {
		unlock_buffer(bh);
	}
	return ret;
}
EXPORT_SYMBOL(__sync_dirty_buffer);

int sync_dirty_buffer(struct buffer_head *bh)
{
	return __sync_dirty_buffer(bh, REQ_SYNC);
}
EXPORT_SYMBOL(sync_dirty_buffer);

/*
 * try_to_free_buffers() checks if all the buffers on this particular page
 * are unused, and releases them if so.
 *
 * Exclusion against try_to_free_buffers may be obtained by either
 * locking the page or by holding its mapping's private_lock.
 *
 * If the page is dirty but all the buffers are clean then we need to
 * be sure to mark the page clean as well.  This is because the page
 * may be against a block device, and a later reattachment of buffers
 * to a dirty page will set *all* buffers dirty.  Which would corrupt
 * filesystem data on the same device.
 *
 * The same applies to regular filesystem pages: if all the buffers are
 * clean then we set the page clean and proceed.  To do that, we require
 * total exclusion from __set_page_dirty_buffers().  That is obtained with
 * private_lock.
 *
 * try_to_free_buffers() is non-blocking.
 */
static inline int buffer_busy(struct buffer_head *bh)
{
	return atomic_read(&bh->b_count) |
		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}

static int
drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
{
	struct buffer_head *head = page_buffers(page);
	struct buffer_head *bh;

	bh = head;
	do {
		if (buffer_busy(bh)) {
			/*
			 * Check if the busy failure was due to an
			 * outstanding LRU reference
			 */
			evict_bh_lrus(bh);
			if (buffer_busy(bh))
				goto failed;
		}
		bh = bh->b_this_page;
	} while (bh != head);

	do {
		struct buffer_head *next = bh->b_this_page;

		if (bh->b_assoc_map)
			__remove_assoc_queue(bh);
		bh = next;
	} while (bh != head);
	*buffers_to_free = head;
	__clear_page_buffers(page);
	return 1;
failed:
	return 0;
}

int try_to_free_buffers(struct page *page)
{
	struct address_space * const mapping = page->mapping;
	struct buffer_head *buffers_to_free = NULL;
	int ret = 0;

	BUG_ON(!PageLocked(page));
	if (PageWriteback(page))
		return 0;

	if (mapping == NULL) {		/* can this still happen? */
		ret = drop_buffers(page, &buffers_to_free);
		goto out;
	}

	spin_lock(&mapping->private_lock);
	ret = drop_buffers(page, &buffers_to_free);

	/*
	 * If the filesystem writes its buffers by hand (eg ext3)
	 * then we can have clean buffers against a dirty page.  We
	 * clean the page here; otherwise the VM will never notice
	 * that the filesystem did any IO at all.
	 *
	 * Also, during truncate, discard_buffer will have marked all
	 * the page's buffers clean.  We discover that here and clean
	 * the page also.
	 *
	 * private_lock must be held over this entire operation in order
	 * to synchronise against __set_page_dirty_buffers and prevent the
	 * dirty bit from being lost.
	 */
	if (ret)
		cancel_dirty_page(page);
	spin_unlock(&mapping->private_lock);
out:
	if (buffers_to_free) {
		struct buffer_head *bh = buffers_to_free;

		do {
			struct buffer_head *next = bh->b_this_page;
			free_buffer_head(bh);
			bh = next;
		} while (bh != buffers_to_free);
	}
	return ret;
}
EXPORT_SYMBOL(try_to_free_buffers);

/*
 * There are no bdflush tunables left.  But distributions are
 * still running obsolete flush daemons, so we terminate them here.
 *
 * Use of bdflush() is deprecated and will be removed in a future kernel.
 * The `flush-X' kernel threads fully replace bdflush daemons and this call.
 */
SYSCALL_DEFINE2(bdflush, int, func, long, data)
{
	static int msg_count;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	if (msg_count < 5) {
		msg_count++;
		printk(KERN_INFO
			"warning: process `%s' used the obsolete bdflush"
			" system call\n", current->comm);
		printk(KERN_INFO "Fix your initscripts?\n");
	}

	if (func == 1)
		do_exit(0);
	return 0;
}

/*
 * Buffer-head allocation
 */
static struct kmem_cache *bh_cachep __read_mostly;

/*
 * Once the number of bh's in the machine exceeds this level, we start
 * stripping them in writeback.
 */
static unsigned long max_buffer_heads;

int buffer_heads_over_limit;

struct bh_accounting {
	int nr;			/* Number of live bh's */
	int ratelimit;		/* Limit cacheline bouncing */
};

static DEFINE_PER_CPU(struct bh_accounting, bh_accounting) = {0, 0};

static void recalc_bh_state(void)
{
	int i;
	int tot = 0;

	if (__this_cpu_inc_return(bh_accounting.ratelimit) - 1 < 4096)
		return;
	__this_cpu_write(bh_accounting.ratelimit, 0);
	for_each_online_cpu(i)
		tot += per_cpu(bh_accounting, i).nr;
	buffer_heads_over_limit = (tot > max_buffer_heads);
}

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
	struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
	if (ret) {
		INIT_LIST_HEAD(&ret->b_assoc_buffers);
		preempt_disable();
		__this_cpu_inc(bh_accounting.nr);
		recalc_bh_state();
		preempt_enable();
	}
	return ret;
}
EXPORT_SYMBOL(alloc_buffer_head);

void free_buffer_head(struct buffer_head *bh)
{
	BUG_ON(!list_empty(&bh->b_assoc_buffers));
	kmem_cache_free(bh_cachep, bh);
	preempt_disable();
	__this_cpu_dec(bh_accounting.nr);
	recalc_bh_state();
	preempt_enable();
}
EXPORT_SYMBOL(free_buffer_head);

static int buffer_exit_cpu_dead(unsigned int cpu)
{
	int i;
	struct bh_lru *b = &per_cpu(bh_lrus, cpu);

	for (i = 0; i < BH_LRU_SIZE; i++) {
		brelse(b->bhs[i]);
		b->bhs[i] = NULL;
	}
	this_cpu_add(bh_accounting.nr, per_cpu(bh_accounting, cpu).nr);
	per_cpu(bh_accounting, cpu).nr = 0;
	return 0;
}

/**
 * bh_uptodate_or_lock - Test whether the buffer is uptodate
 * @bh: struct buffer_head
 *
 * Return true if the buffer is up-to-date and false,
 * with the buffer locked, if not.
 */
int bh_uptodate_or_lock(struct buffer_head *bh)
{
	if (!buffer_uptodate(bh)) {
		lock_buffer(bh);
		if (!buffer_uptodate(bh))
			return 0;
		unlock_buffer(bh);
	}
	return 1;
}
EXPORT_SYMBOL(bh_uptodate_or_lock);

/**
 * bh_submit_read - Submit a locked buffer for reading
 * @bh: struct buffer_head
 *
 * Returns zero on success and -EIO on error.
 */
int bh_submit_read(struct buffer_head *bh)
{
	BUG_ON(!buffer_locked(bh));

	if (buffer_uptodate(bh)) {
		unlock_buffer(bh);
		return 0;
	}

	get_bh(bh);
	bh->b_end_io = end_buffer_read_sync;
	submit_bh(REQ_OP_READ, 0, bh);
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return 0;
	return -EIO;
}
EXPORT_SYMBOL(bh_submit_read);

/*
 * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff.
 *
 * Returns the offset within the file on success, and -ENOENT otherwise.
 */
static loff_t
page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
{
	loff_t offset = page_offset(page);
	struct buffer_head *bh, *head;
	bool seek_data = whence == SEEK_DATA;

	if (lastoff < offset)
		lastoff = offset;

	bh = head = page_buffers(page);
	do {
		offset += bh->b_size;
		if (lastoff >= offset)
			continue;

		/*
		 * Unwritten extents that have data in the page cache covering
		 * them can be identified by the BH_Unwritten state flag.
		 * Pages with multiple buffers might have a mix of holes, data
		 * and unwritten extents - any buffer with valid data in it
		 * should have BH_Uptodate flag set on it.
		 */

		if ((buffer_unwritten(bh) || buffer_uptodate(bh)) == seek_data)
			return lastoff;

		lastoff = offset;
	} while ((bh = bh->b_this_page) != head);
	return -ENOENT;
}

/*
 * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
 *
 * Within unwritten extents, the page cache determines which parts are holes
 * and which are data: unwritten and uptodate buffer heads count as data;
 * everything else counts as a hole.
 *
 * Returns the resulting offset on success, and -ENOENT otherwise.
 */
loff_t
page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length,
			  int whence)
{
	pgoff_t index = offset >> PAGE_SHIFT;
	pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE);
	loff_t lastoff = offset;
	struct pagevec pvec;

	if (length <= 0)
		return -ENOENT;

	pagevec_init(&pvec);

	do {
		unsigned nr_pages, i;

		nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index,
						end - 1);
		if (nr_pages == 0)
			break;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/*
			 * At this point, the page may be truncated or
			 * invalidated (changing page->mapping to NULL), or
			 * even swizzled back from swapper_space to tmpfs file
			 * mapping.  However, page->index will not change
			 * because we have a reference on the page.
			 *
			 * If current page offset is beyond where we've ended,
			 * we've found a hole.
			 */
			if (whence == SEEK_HOLE &&
			    lastoff < page_offset(page))
				goto check_range;

			lock_page(page);
			if (likely(page->mapping == inode->i_mapping) &&
			    page_has_buffers(page)) {
				lastoff = page_seek_hole_data(page, lastoff, whence);
				if (lastoff >= 0) {
					unlock_page(page);
					goto check_range;
				}
			}
			unlock_page(page);
			lastoff = page_offset(page) + PAGE_SIZE;
		}
		pagevec_release(&pvec);
	} while (index < end);

	/* When no page at lastoff and we are not done, we found a hole. */
	if (whence != SEEK_HOLE)
		goto not_found;

check_range:
	if (lastoff < offset + length)
		goto out;
not_found:
	lastoff = -ENOENT;
out:
	pagevec_release(&pvec);
	return lastoff;
}

void __init buffer_init(void)
{
	unsigned long nrpages;
	int ret;

	bh_cachep = kmem_cache_create("buffer_head",
			sizeof(struct buffer_head), 0,
				(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
				SLAB_MEM_SPREAD),
				NULL);

	/*
	 * Limit the bh occupancy to 10% of ZONE_NORMAL
	 */
	nrpages = (nr_free_buffer_pages() * 10) / 100;
	max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
	ret = cpuhp_setup_state_nocalls(CPUHP_FS_BUFF_DEAD, "fs/buffer:dead",
					NULL, buffer_exit_cpu_dead);
	WARN_ON(ret < 0);
}