aboutsummaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2015-05-11Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds5-85/+212
Pull nfsd bugfixes from Bruce Fields: "Mainly pnfs fixes (and for problems with generic callback code made more obvious by pnfs)" * 'for-4.1' of git://linux-nfs.org/~bfields/linux: nfsd: skip CB_NULL probes for 4.1 or later nfsd: fix callback restarts nfsd: split transport vs operation errors for callbacks svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures nfsd: fix pNFS return on close semantics nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op nfsd/blocklayout: pretend we can send deviceid notifications
2015-05-09Merge branch 'for-linus' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user-namespace fix from Eric Biederman: "Eric Windish recently reported a really bug that allows mounting fresh copies of proc and sysfs when it really should not be allowed. The code attempted to verify that proc and sysfs were fully visible but there is a test missing to ensure that the root of the filesystem is visible. Doh! The following patch fixes that. This fixes a containment issue that the docker folks are seeing" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mnt: Fix fs_fully_visible to verify the root directory is visible
2015-05-09mnt: Fix fs_fully_visible to verify the root directory is visibleEric W. Biederman1-0/+6
This fixes a dumb bug in fs_fully_visible that allows proc or sys to be mounted if there is a bind mount of part of /proc/ or /sys/ visible. Cc: stable@vger.kernel.org Reported-by: Eric Windisch <ewindisch@docker.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-05-08Merge branch 'for-linus' of ↵Linus Torvalds1-7/+15
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes for bugs caught while digging in fs/namei.c. The first one is this cycle regression, the second is 3.11 and later" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: path_openat(): fix double fput() namei: d_is_negative() should be checked before ->d_seq validation
2015-05-09path_openat(): fix double fput()Al Viro1-1/+2
path_openat() jumps to the wrong place after do_tmpfile() - it has already done path_cleanup() (as part of path_lookupat() called by do_tmpfile()), so doing that again can lead to double fput(). Cc: stable@vger.kernel.org # v3.11+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-09namei: d_is_negative() should be checked before ->d_seq validationAl Viro1-6/+13
Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to be true does *not* mean that inode we'd fetched had been NULL - that holds only while ->d_seq is still unchanged. Shift d_is_negative() checks into lookup_fast() prior to ->d_seq verification. Reported-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-08Merge branch 'for-linus-4.1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "When an arm user reported crashes near page_address(page) in my new code, it became clear that I can't be trusted with GFP masks. Filipe beat me to the patch, and I'll just be in the corner with my dunce cap on" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix wrong mapping flags for free space inode
2015-05-08Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+11
Pull block fixes from Jens Axboe: "A collection of fixes since the merge window; - fix for a double elevator module release, from Chao Yu. Ancient bug. - the splice() MORE flag fix from Christophe Leroy. - a fix for NVMe, fixing a patch that went in in the merge window. From Keith. - two fixes for blk-mq CPU hotplug handling, from Ming Lei. - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md. - two blk-mq fixes from Shaohua, fixing a race on queue stop and a bad merge issue with FUA writes. - division-by-zero fix for writeback from Tejun. - a block bounce page accounting fix, making sure we inc/dec after bouncing so that pre/post IO pages match up. From Wang YanQing" * 'for-linus' of git://git.kernel.dk/linux-block: splice: sendfile() at once fails for big files blk-mq: don't lose requests if a stopped queue restarts blk-mq: fix FUA request hang block: destroy bdi before blockdev is unregistered. block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE elevator: fix double release of elevator module writeback: use |1 instead of +1 to protect against div by zero blk-mq: fix CPU hotplug handling blk-mq: fix race between timeout and CPU hotplug NVMe: Fix VPD B0 max sectors translation
2015-05-07Merge tag 'for-f2fs-4.1-rc3' of ↵Linus Torvalds4-5/+12
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "Fix a performance regression and a bug" * tag 'for-f2fs-4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: fix wrong error hanlder in f2fs_follow_link Revert "f2fs: enhance multi-threads performance"
2015-05-06Btrfs: fix wrong mapping flags for free space inodeFilipe Manana1-1/+1
We were passing a flags value that differed from the intention in commit 2b108268006e ("Btrfs: don't use highmem for free space cache pages"). This caused problems in a ARM machine, leaving btrfs unusable there. Reported-by: Merlijn Wajer <merlijn@wizzup.org> Tested-by: Merlijn Wajer <merlijn@wizzup.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-05-06Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "EFI fixes, and FPU fix, a ticket spinlock boundary condition fix and two build fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Always restore_xinit_state() when use_eager_cpu() x86: Make cpu_tss available to external modules efi: Fix error handling in add_sysfs_runtime_map_entry() x86/spinlocks: Fix regression in spinlock contention detection x86/mm: Clean up types in xlate_dev_mem_ptr() x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr efivarfs: Ensure VariableName is NUL-terminated
2015-05-06splice: sendfile() at once fails for big filesChristophe Leroy1-1/+11
Using sendfile with below small program to get MD5 sums of some files, it appear that big files (over 64kbytes with 4k pages system) get a wrong MD5 sum while small files get the correct sum. This program uses sendfile() to send a file to an AF_ALG socket for hashing. /* md5sum2.c */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <fcntl.h> #include <sys/socket.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/if_alg.h> int main(int argc, char **argv) { int sk = socket(AF_ALG, SOCK_SEQPACKET, 0); struct stat st; struct sockaddr_alg sa = { .salg_family = AF_ALG, .salg_type = "hash", .salg_name = "md5", }; int n; bind(sk, (struct sockaddr*)&sa, sizeof(sa)); for (n = 1; n < argc; n++) { int size; int offset = 0; char buf[4096]; int fd; int sko; int i; fd = open(argv[n], O_RDONLY); sko = accept(sk, NULL, 0); fstat(fd, &st); size = st.st_size; sendfile(sko, fd, &offset, size); size = read(sko, buf, sizeof(buf)); for (i = 0; i < size; i++) printf("%2.2x", buf[i]); printf(" %s\n", argv[n]); close(fd); close(sko); } exit(0); } Test below is done using official linux patch files. First result is with a software based md5sum. Second result is with the program above. root@vgoip:~# ls -l patch-3.6.* -rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz -rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz 5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz After investivation, it appears that sendfile() sends the files by blocks of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each block, the SPLICE_F_MORE flag is missing, therefore the hashing operation is reset as if it was the end of the file. This patch adds SPLICE_F_MORE to the flags when more data is pending. With the patch applied, we get the correct sums: root@vgoip:~# md5sum patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz root@vgoip:~# ./md5sum2 patch-3.6.* b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-06Merge tag 'efi-urgent' of ↵Ingo Molnar1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent Pull EFI fixes from Matt Fleming: * Avoid garbage names in efivarfs due to buggy firmware by zeroing EFI variable name. (Ross Lagerwall) * Stop erroneously dropping upper 32 bits of boot command line pointer in EFI boot stub and stash them in ext_cmd_line_ptr. (Roy Franz) * Fix double-free bug in error handling code path of EFI runtime map code. (Dan Carpenter) Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-05ocfs2: dlm: fix race between purge and get lock resourceJunxiao Bi1-0/+13
There is a race window in dlm_get_lock_resource(), which may return a lock resource which has been purged. This will cause the process to hang forever in dlmlock() as the ast msg can't be handled due to its lock resource not existing. dlm_get_lock_resource { ... spin_lock(&dlm->spinlock); tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash); if (tmpres) { spin_unlock(&dlm->spinlock); >>>>>>>> race window, dlm_run_purge_list() may run and purge the lock resource spin_lock(&tmpres->spinlock); ... spin_unlock(&tmpres->spinlock); } } Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-05nilfs2: fix sanity check of btree level in nilfs_btree_root_broken()Ryusuke Konishi1-1/+1
The range check for b-tree level parameter in nilfs_btree_root_broken() is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even though the level is limited to values in the range of 0 to (NILFS_BTREE_LEVEL_MAX - 1). Since the level parameter is read from storage device and used to index nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it can cause memory overrun during btree operations if the boundary value is set to the level parameter on device. This fixes the broken sanity check and adds a comment to clarify that the upper bound NILFS_BTREE_LEVEL_MAX is exclusive. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-05configfs: init configfs module earlier at boot timeDaniel Baluta1-1/+1
We need this earlier in the boot process to allow various subsystems to use configfs (e.g Industrial IIO). Also, debugfs is at core_initcall level and configfs should be on the same level from infrastructure point of view. Signed-off-by: Daniel Baluta <daniel.baluta@intel.com> Suggested-by: Lars-Peter Clausen <lars@metafoo.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-04f2fs: fix wrong error hanlder in f2fs_follow_linkJaegeuk Kim1-5/+3
The page_follow_link_light returns NULL and its error pointer was remained in nd->path. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-05-04Revert "f2fs: enhance multi-threads performance"Jaegeuk Kim3-0/+9
This reports performance regression by Yuanhan Liu. The basic idea was to reduce one-point mutex, but it turns out this causes another contention like context swithes. https://lkml.org/lkml/2015/4/21/11 Until finishing the analysis on this issue, I'd like to revert this for a while. This reverts commit 78373b7319abdf15050af5b1632c4c8b8b398f33.
2015-05-04nfsd: skip CB_NULL probes for 4.1 or laterChristoph Hellwig1-0/+9
With sessions in v4.1 or later we don't need to manually probe the backchannel connection, so we can declare it up instantly after setting up the RPC client. Note that we really should split nfsd4_run_cb_work in the long run, this is just the least intrusive fix for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04nfsd: fix callback restartsChristoph Hellwig3-33/+24
Checking the rpc_client pointer is not a reliable way to detect backchannel changes: cl_cb_client is changed only after shutting down the rpc client, so the condition cl_cb_client = tk_client will always be true. Check the RPC_TASK_KILLED flag instead, and rewrite the code to avoid the buggy cl_callbacks list and fix the lifetime rules due to double calls of the ->prepare callback operations method for this retry case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04nfsd: split transport vs operation errors for callbacksChristoph Hellwig2-36/+25
We must only increment the sequence id if the client has seen and responded to a request. If we failed to deliver it to the client we must resend with the same sequence id. So just like the client track errors at the transport level differently from those returned in the XDR. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04nfsd: fix pNFS return on close semanticsSachin Bhamare3-7/+133
For the sake of forgetful clients, the server should return the layouts to the file system on 'last close' of a file (assuming that there are no delegations outstanding to that particular client) or on delegreturn (assuming that there are no opens on a file from that particular client). In theory the information is all there in current data structures, but it's not efficiently available; nfs4_file->fi_ref includes references on the file across all clients, but we need a per-(client, file) count. Walking through lots of stateid's to calculate this on each close or delegreturn would be painful. This patch introduces infrastructure to maintain per-client opens and delegation counters on a per-file basis. [hch: ported to the mainline pNFS support, merged various fixes from Jeff] Signed-off-by: Sachin Bhamare <sachin.bhamare@primarydata.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_opChristoph Hellwig1-10/+11
If we find a non-confirmed openowner we jump to exit the function, but do not set an error value. Fix this by factoring out a helper to do the check and properly set the error from nfsd4_validate_stateid. Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04nfsd/blocklayout: pretend we can send deviceid notificationsChristoph Hellwig1-0/+11
Commit df52699e4fcef ("NFSv4.1: Don't cache deviceids that have no notifications") causes the Linux NFS client to stop caching deviceid's unless a server pretends to support deviceid notifications. While this behavior is stupid and the language around this area in rfc5661 is a mess carified by an errata that I submittted, Trond insists on this behavior. Not caching deviceids degrades block layout performance massively as a GETDEVICEINFO is fairly expensive. So add this hack to make the Linux client happy again. Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-03Merge tag 'for_linus_stable' of ↵Linus Torvalds13-229/+210
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some miscellaneous bug fixes and some final on-disk and ABI changes for ext4 encryption which provide better security and performance" * tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix growing of tiny filesystems ext4: move check under lock scope to close a race. ext4: fix data corruption caused by unwritten and delayed extents ext4 crypto: remove duplicated encryption mode definitions ext4 crypto: do not select from EXT4_FS_ENCRYPTION ext4 crypto: add padding to filenames before encrypting ext4 crypto: simplify and speed up filename encryption
2015-05-02ext4: fix growing of tiny filesystemsJan Kara1-2/+5
The estimate of necessary transaction credits in ext4_flex_group_add() is too pessimistic. It reserves credit for sb, resize inode, and resize inode dindirect block for each group added in a flex group although they are always the same block and thus it is enough to account them only once. Also the number of modified GDT block is overestimated since we fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block. Make the estimation more precise. That reduces number of requested credits enough that we can grow 20 MB filesystem (which has 1 MB journal, 79 reserved GDT blocks, and flex group size 16 by default). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2015-05-02ext4: move check under lock scope to close a race.Davide Italiano1-7/+8
fallocate() checks that the file is extent-based and returns EOPNOTSUPP in case is not. Other tasks can convert from and to indirect and extent so it's safe to check only after grabbing the inode mutex. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-05-02ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner2-0/+10
Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-05-02ext4 crypto: remove duplicated encryption mode definitionsChanho Park1-6/+0
This patch removes duplicated encryption modes which were already in ext4.h. They were duplicated from commit 3edc18d and commit f542fb. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-02ext4 crypto: do not select from EXT4_FS_ENCRYPTIONHerbert Xu1-2/+7
This patch adds a tristate EXT4_ENCRYPTION to do the selections for EXT4_FS_ENCRYPTION because selecting from a bool causes all the selected options to be built-in, even if EXT4 itself is a module. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01ext4 crypto: add padding to filenames before encryptingTheodore Ts'o5-8/+31
This obscures the length of the filenames, to decrease the amount of information leakage. By default, we pad the filenames to the next 4 byte boundaries. This costs nothing, since the directory entries are aligned to 4 byte boundaries anyway. Filenames can also be padded to 8, 16, or 32 bytes, which will consume more directory space. Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01ext4 crypto: simplify and speed up filename encryptionTheodore Ts'o5-204/+149
Avoid using SHA-1 when calculating the user-visible filename when the encryption key is available, and avoid decrypting lots of filenames when searching for a directory entry in a directory block. Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01Merge branch 'for-linus-4.1' of ↵Linus Torvalds7-77/+118
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A few more btrfs fixes. These range from corners Filipe found in the new free space cache writeback to a grab bag of fixes from the list" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. btrfs: unlock i_mutex after attempting to delete subvolume during send btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache btrfs: fix race on ENOMEM in alloc_extent_buffer btrfs: handle ENOMEM in btrfs_alloc_tree_block Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Btrfs: don't check for delalloc_bytes in cache_save_setup Btrfs: fix deadlock when starting writeback of bg caches Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-04-29Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extentForrest Liu1-25/+26
btrfs_release_extent_buffer_page() can't handle dummy extent that allocated by btrfs_clone_extent_buffer() properly. That is because reference count of pages that allocated by btrfs_clone_extent_buffer() was 2, 1 by alloc_page(), and another by attach_extent_buffer_page(). Running following command repeatly can check this memory leak problem btrfs inspect-internal inode-resolve 256 /mnt/btrfs Signed-off-by: Chien-Kuan Yeh <ckya@synology.com> Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Merge branch 'for-linus-4.1' of ↵Linus Torvalds1-17/+25
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe hit two problems in my block group cache patches. We finalized the fixes last week and ran through more tests" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: prevent list corruption during free space cache processing Btrfs: fix inode cache writeout
2015-04-26Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds34-163/+313
Pull NFS client updates from Trond Myklebust: "Another set of mainly bugfixes and a couple of cleanups. No new functionality in this round. Highlights include: Stable patches: - Fix a regression in /proc/self/mountstats - Fix the pNFS flexfiles O_DIRECT support - Fix high load average due to callback thread sleeping Bugfixes: - Various patches to fix the pNFS layoutcommit support - Do not cache pNFS deviceids unless server notifications are enabled - Fix a SUNRPC transport reconnection regression - make debugfs file creation failure non-fatal in SUNRPC - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints - Fix locking around NFSv4.2 fallocate() support - Truncating NFSv4 file opens should also sync O_DIRECT writes - Prevent infinite loop in rpcrdma_ep_create() Features: - Various improvements to the RDMA transport code's handling of memory registration - Various code cleanups" * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits) fs/nfs: fix new compiler warning about boolean in switch nfs: Remove unneeded casts in nfs NFS: Don't attempt to decode missing directory entries Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one" NFS: Rename idmap.c to nfs4idmap.c NFS: Move nfs_idmap.h into fs/nfs/ NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h NFS: Add a stub for GETDEVICELIST nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes nfs: fix DIO good bytes calculation nfs: Fetch MOUNTED_ON_FILEID when updating an inode sunrpc: make debugfs file creation failure non-fatal nfs: fix high load average due to callback thread sleeping NFS: Reduce time spent holding the i_mutex during fallocate() NFS: Don't zap caches on fallocate() xprtrdma: Make rpcrdma_{un}map_one() into inline functions xprtrdma: Handle non-SEND completions via a callout xprtrdma: Add "open" memreg op xprtrdma: Add "destroy MRs" memreg op xprtrdma: Add "reset MRs" memreg op ...
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds285-1520/+1512
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.Yang Dongsheng2-9/+14
We need to fill inode when we found a node for it in delayed_nodes_tree. But we did not fill the ->last_trans currently, it will cause the test of xfstest/generic/311 fail. Scenario of the 311 is shown as below: Problem: (1). test_fd = open(fname, O_RDWR|O_DIRECT) (2). pwrite(test_fd, buf, 4096, 0) (3). close(test_fd) (4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches" (5). test_fd = open(fname, O_RDWR|O_DIRECT) (6). fsync(test_fd); <-------- we did not get the correct log entry for the file Reason: When we re-open this file in (5), we would find a node in delayed_nodes_tree and fill the inode we are lookup with the information. But the ->last_trans is not filled, then the fsync() will check the ->last_trans and found it's 0 then say this inode is already in our tree which is commited, not recording the extents for it. Fix: This patch fill the ->last_trans properly and set the runtime_flags if needed in this situation. Then we can get the log entries we expected after (6) and generic/311 passed. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: unlock i_mutex after attempting to delete subvolume during sendOmar Sandoval1-1/+2
Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled: [ +0.000059] ================================================ [ +0.000028] [ BUG: lock held when returning to user space! ] [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted [ +0.000026] ------------------------------------------------ [ +0.000029] btrfs/211 is leaving the kernel with locks still held! [ +0.000029] 1 lock held by btrfs/211: [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0 Make sure we unlock it in the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cacheOmar Sandoval1-4/+6
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid. When we try to access them later, things will blow up in various ways. Also fix the comment about the return value, which is an errno on error, not -1, and update the cases where it was not. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: fix race on ENOMEM in alloc_extent_bufferOmar Sandoval1-1/+2
Consider the following interleaving of overlapping calls to alloc_extent_buffer: Call 1: - Successfully allocates a few pages with find_or_create_page - find_or_create_page fails, goto free_eb - Unlocks the allocated pages Call 2: - Calls find_or_create_page and gets a page in call 1's extent_buffer - Finds that the page is already associated with an extent_buffer - Grabs a reference to the half-written extent_buffer and calls mark_extent_buffer_accessed on it mark_extent_buffer_accessed will then try to call mark_page_accessed on a null page and panic. The fix is to decrement the reference count on the half-written extent_buffer before unlocking the pages so call 2 won't use it. We should also set exists = NULL in the case that we don't use exists to avoid accidentally returning a freed extent_buffer in an error case. Signed-off-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: handle ENOMEM in btrfs_alloc_tree_blockOmar Sandoval1-13/+28
This is one of the first places to give out when memory is tight. Handle it properly rather than with a BUG_ON. Also fix the comment about the return value, which is an ERR_PTR, not NULL, on error. Signed-off-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: fix find_free_dev_extent() malfunction in case device tree has holeForrest Liu1-4/+11
If device tree has hole, find_free_dev_extent() cannot find available address properly. The problem can be reproduce by following script. mntpath=/btrfs loopdev=/dev/loop0 filepath=/home/forrest/image umount $mntpath losetup -d $loopdev truncate --size 100g $filepath losetup $loopdev $filepath mkfs.btrfs -f $loopdev mount $loopdev $mntpath # make device tree with one big hole for i in `seq 1 1 100`; do fallocate -l 1g $mntpath/$i done sync for i in `seq 1 1 95`; do rm $mntpath/$i done sync # wait cleaner thread remove unused block group sleep 300 fallocate -l 1g $mntpath/aaa # failed to allocate new chunk fallocate -l 1g $mntpath/bbb Above script will make device tree with one big hole, and can only allocate just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 106292051968 length 1073741824 item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 103108575232 length 1073741824 Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: don't check for delalloc_bytes in cache_save_setupChris Mason1-2/+1
Now that we're doing free space cache writeback outside the critical section in the commit, there is a bigger window for delalloc_bytes to be added after a cache has been written. find_free_extent may do this without putting the block group back into the dirty list, and also without a transaction running. Checking for delalloc_bytes in cache_save_setup means we might leave the cache marked as written without invalidating it. Consistency checks during mount will toss the cache, but it's better to get rid of the check in cache_save_setup and let it get invalidated by the checks already done during cache write out. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: fix deadlock when starting writeback of bg cachesFilipe Manana1-1/+1
While starting the writes of the dirty block group caches, if we don't find a block group item in the extent tree we were leaving without releasing our path, running delayed references and then looping again to process any new dirty block groups. However this second iteration of the loop could cause a deadlock because it tries to lock some other extent tree node/leaf which another task already locked and it's blocked because it's waiting for a lock on some node/leaf that is in our path that was not released before. We could also deadlock when running the delayed references - as we could end up trying to lock the same nodes/leafs that we have in our local path (with a different lock type). Got into such case when running xfstests: [20892.242791] ------------[ cut here ]------------ [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [20892.245874] BTRFS: Transaction aborted (error -2) (...) [20892.269378] Call Trace: [20892.269915] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.271097] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.272173] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.273386] [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.274857] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [20892.275851] [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.277341] [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs] [20892.278628] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.280191] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] (...) [20892.291316] ---[ end trace 597f77e664245373 ]--- [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry [20892.297390] BTRFS info (device sdg): forced readonly [20892.298222] ------------[ cut here ]------------ [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]() (...) [20892.326253] Call Trace: [20892.326904] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.329503] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.330815] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.332556] [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs] [20892.333955] [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c [20892.335562] [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs] [20892.336849] [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc [20892.338222] [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs] [20892.339823] [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs] [20892.341275] [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46 [20892.342810] [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs] [20892.344184] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.347162] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] (...) [20892.361015] ---[ end trace 597f77e664245374 ]--- [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds. [21120.689881] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) [21120.703696] Call Trace: [21120.704310] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.705490] [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs] [21120.706757] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.708156] [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs] [21120.709892] [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs] [21120.711605] [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs] [21120.723440] [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs] [21120.724943] [<ffffffff8110c4c8>] do_writepages+0x23/0x2c [21120.726008] [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa [21120.727230] [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b [21120.728526] [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b [21120.729701] [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b (...) [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds. [21120.749459] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) [21120.768457] Call Trace: [21120.769039] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.770107] [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs] [21120.771558] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.773659] [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs] [21120.776257] [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs] [21120.777755] [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs] [21120.779459] [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs] [21120.781153] [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs] [21120.783918] [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs] [21120.785436] [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f [21120.786434] [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs] (...) [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds. [21120.890526] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) [21120.899960] Call Trace: [21120.900743] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.903004] [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs] [21120.904383] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.905608] [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs] [21120.906812] [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c [21120.907874] [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb [21120.909551] [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs] [21120.910914] [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs] [21120.912181] [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs] [21120.913784] [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs] [21120.915374] [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs] [21120.916735] [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs] [21120.917996] [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [21120.919478] [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs] [21120.921226] [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs] [21120.923121] [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs] [21120.924449] [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [21120.926602] [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc [21120.927769] [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [21120.929324] [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [21120.930723] [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs] [21120.931897] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [21120.934446] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [21120.935528] [<ffffffff81153b58>] vfs_write+0xb2/0x117 (...) Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: fix race between start dirty bg cache writeout and bg deletionFilipe Manana1-17/+27
While running xfstests I ran into the following: [20892.242791] ------------[ cut here ]------------ [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [20892.245874] BTRFS: Transaction aborted (error -2) [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$ [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [20892.264738] 0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2 [20892.266244] ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48 [20892.267761] ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0 [20892.269378] Call Trace: [20892.269915] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.271097] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.272173] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.273386] [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.274857] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [20892.275851] [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.277341] [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs] [20892.278628] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.280191] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [20892.281781] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [20892.282873] [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs] [20892.284111] [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4 [20892.285203] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [20892.286290] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [20892.287469] [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e [20892.288412] [<ffffffff8117ae54>] do_fsync+0x34/0x4e [20892.289348] [<ffffffff8117b07c>] SyS_fsync+0x10/0x14 [20892.290255] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [20892.291316] ---[ end trace 597f77e664245373 ]--- [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry [20892.297390] BTRFS info (device sdg): forced readonly This happens because in btrfs_start_dirty_block_groups() we splice the transaction's list of dirty block groups into a local list and then we keep extracting the first element of the list without holding the cache_write_mutex mutex. This means that before we acquire that mutex the first block group on the list might be removed by a conurrent task running btrfs_remove_block_group(). So make sure we extract the first element (and test the list emptyness) while holding that mutex. Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-24RCU pathwalk breakage when running into a symlink overmounting somethingAl Viro1-2/+4
Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Cc: stable@vger.kernel.org # since 2.6.39 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24direct-io: only inc/dec inode->i_dio_count for file systemsJens Axboe8-32/+22
do_blockdev_direct_IO() increments and decrements the inode ->i_dio_count for each IO operation. It does this to protect against truncate of a file. Block devices don't need this sort of protection. For a capable multiqueue setup, this atomic int is the only shared state between applications accessing the device for O_DIRECT, and it presents a scaling wall for that. In my testing, as much as 30% of system time is spent incrementing and decrementing this value. A mixed read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with better latencies too. Before: clat percentiles (usec): | 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34], | 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35], | 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80], | 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155], | 99.99th=[ 165] After: clat percentiles (usec): | 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149], | 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171], | 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270], | 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422], | 99.99th=[ 438] In other setups, Robert Elliott reported seeing good performance improvements: https://lkml.org/lkml/2015/4/3/557 The more applications accessing the device, the worse it gets. Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells do_blockdev_direct_IO() that it need not worry about incrementing or decrementing the inode i_dio_count for this caller. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Elliott, Robert (Server Storage) <elliott@hp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24fs/9p: fix readdir()Johannes Berg1-0/+2
Al Viro's IOV changes broke 9p readdir() because the new code didn't abort the read when it returned nothing. The original code checked if the combined error/length was <= 0 but in the new code that accidentally got changed to just an error check. Add back the return from the function when nothing is read. Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: e1200fe68f20 ("9p: switch p9_client_read() to passing struct iov_iter *") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24Btrfs: prevent list corruption during free space cache processingChris Mason1-14/+18
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares a list of bitmaps to record in the free space cache. It was dropping the lock while it worked on other components, which made a window for free_bitmap() to free the bitmap struct without removing it from the list. This changes things to hold the lock the whole time, and also makes sure we hold the lock during enospc cleanup. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-24Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds10-77/+35
Pull nfsd updates from Bruce Fields: "A quiet cycle this time; this is basically entirely bugfixes. The few that aren't cc'd to stable are cleanup or seemed unlikely to affect anyone much" * 'for-4.1' of git://linux-nfs.org/~bfields/linux: uapi: Remove kernel internal declaration nfsd: fix nsfd startup race triggering BUG_ON nfsd: eliminate NFSD_DEBUG nfsd4: fix READ permission checking nfsd4: disallow SEEK with special stateids nfsd4: disallow ALLOCATE with special stateids nfsd: add NFSEXP_PNFS to the exflags array nfsd: Remove duplicate macro define for max sec label length nfsd: allow setting acls with unenforceable DENYs nfsd: NFSD_FAULT_INJECTION depends on DEBUG_FS nfsd: remove unused status arg to nfsd4_cleanup_open_state nfsd: remove bogus setting of status in nfsd4_process_open2 NFSD: Use correct reply size calculating function NFSD: Using path_equal() for checking two paths
2015-04-24Merge branch 'for-linus-4.1' of ↵Linus Torvalds46-897/+2113
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "I've been running these through a longer set of load tests because my commits change the free space cache writeout. It fixes commit stalls on large filesystems (~20T space used and up) that we have been triggering here. We were seeing new writers blocked for 10 seconds or more during commits, which is far from good. Josef and I fixed up ENOSPC aborts when deleting huge files (3T or more), that are triggered because our metadata reservations were not properly accounting for crcs and were not replenishing during the truncate. Also in this series, a number of qgroup fixes from Fujitsu and Dave Sterba collected most of the pending cleanups from the list" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits) btrfs: quota: Update quota tree after qgroup relationship change. btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations. btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota. btrfs: Update btrfs qgroup status item when rescan is done. btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value. btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created btrfs: Check qgroup level in kernel qgroup assign. btrfs: qgroup: allow to remove qgroup which has parent but no child. btrfs: qgroup: return EINVAL if level of parent is not higher than child's. btrfs: qgroup: do a reservation in a higher level. Btrfs: qgroup, Account data space in more proper timings. Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. Btrfs: qgroup: free reserved in exceeding quota. Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup(). btrfs: qgroup: fix limit args override whole limit struct btrfs: qgroup: update limit info in function btrfs_run_qgroups(). btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item(). btrfs: qgroup: update qgroup in memory at the same time when we update it in btree. btrfs: qgroup: inherit limit info from srcgroup in creating snapshot. btrfs: Support busy loop of write and delete ...
2015-04-24Merge tag 'xfs-for-linus-4.1-rc1' of ↵Linus Torvalds46-1929/+1992
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs update from Dave Chinner: "This update contains: - RENAME_WHITEOUT support - conversion of per-cpu superblock accounting to use generic counters - new inode mmap lock so that we can lock page faults out of truncate, hole punch and other direct extent manipulation functions to avoid racing mmap writes from causing data corruption - rework of direct IO submission and completion to solve data corruption issue when running concurrent extending DIO writes. Also solves problem of running IO completion transactions in interrupt context during size extending AIO writes. - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct extent manipulation to avoid needing to copy data within the file - attribute block header field overflow fix for 64k block size filesystems - Lots of changes to log messaging to be more informative and concise when errors occur. Also prevent a lot of unnecessary log spamming due to cascading failures in error conditions. - lots of cleanups and bug fixes One thing of note is the direct IO fixes that we merged last week after the window opened. Even though a little late, they fix a user reported data corruption and have been pretty well tested. I figured there was not much point waiting another 2 weeks for -rc1 to be released just so I could send them to you..." * tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits) xfs: using generic_file_direct_write() is unnecessary xfs: direct IO EOF zeroing needs to drain AIO xfs: DIO write completion size updates race xfs: DIO writes within EOF don't need an ioend xfs: handle DIO overwrite EOF update completion correctly xfs: DIO needs an ioend for writes xfs: move DIO mapping size calculation xfs: factor DIO write mapping from get_blocks xfs: unlock i_mutex in xfs_break_layouts xfs: kill unnecessary firstused overflow check on attr3 leaf removal xfs: use larger in-core attr firstused field and detect overflow xfs: pass attr geometry to attr leaf header conversion functions xfs: disallow ro->rw remount on norecovery mount xfs: xfs_shift_file_space can be static xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate fs: Add support FALLOC_FL_INSERT_RANGE for fallocate xfs: Fix incorrect positive ENOMEM return xfs: xfs_mru_cache_insert() should use GFP_NOFS xfs: %pF is only for function pointers xfs: fix shadow warning in xfs_da3_root_split() ...
2015-04-23Btrfs: fix inode cache writeoutChris Mason1-3/+7
The code to fix stalls during free spache cache IO wasn't using the correct root when waiting on the IO for inode caches. This is only a problem when the inode cache is enabled with mount -o inode_cache This fixes the inode cache writeout to preserve any error values and makes sure not to override the root when inode cache writeout is done. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-23Merge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust20-85/+439
NFS: NFSoRDMA Client Changes This patch series creates an operation vector for each of the different memory registration modes. This should make it easier to one day increase credit limit, rsize, and wsize. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-04-23Merge branch 'bugfixes'Trond Myklebust4-6/+3
* bugfixes: NFSv4: Return delegations synchronously in evict_inode SUNRPC: Fix a regression when reconnecting NFS: remount with security change should return EINVAL nfs: do not export discarded symbols NFSv4.1: don't export static symbol
2015-04-23fs/nfs: fix new compiler warning about boolean in switchAndre Przywara1-7/+4
The brand new GCC 5.1.0 warns by default on using a boolean in the switch condition. This results in the following warning: fs/nfs/nfs4proc.c: In function 'nfs4_proc_get_rootfh': fs/nfs/nfs4proc.c:3100:10: warning: switch condition has boolean value [-Wswitch-bool] switch (auth_probe) { ^ This code was obviously using switch to make use of the fall-through semantics (without the usual comment, though). Rewrite that code using if statements to avoid the warning and make the code a bit more readable on the way. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: Remove unneeded casts in nfsFiro Yang1-1/+1
Don't unnecessarily cast allocation return value in fs/nfs/inode.c::nfs_alloc_inode(). Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Don't attempt to decode missing directory entriesBenjamin Coddington1-0/+4
If a READDIR reply comes back without any page data, avoid a NULL pointer dereference in xdr_copy_to_scratch(). BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 IP: [<ffffffff813a378d>] memcpy+0xd/0x110 ... Call Trace: ? xdr_inline_decode+0x7a/0xb0 [sunrpc] nfs3_decode_dirent+0x73/0x320 [nfsv3] nfs_readdir_page_filler+0xd5/0x4e0 [nfs] ? nfs3_rpc_wrapper.constprop.9+0x42/0xc0 [nfsv3] nfs_readdir_xdr_to_array+0x1fa/0x330 [nfs] ? mem_cgroup_commit_charge+0xac/0x160 ? nfs_readdir_xdr_to_array+0x330/0x330 [nfs] nfs_readdir_filler+0x22/0x90 [nfs] do_read_cache_page+0x7e/0x1a0 read_cache_page+0x1c/0x20 nfs_readdir+0x18e/0x660 [nfs] ? nfs3_xdr_dec_getattr3res+0x80/0x80 [nfsv3] iterate_dir+0x97/0x130 SyS_getdents+0x94/0x120 ? fillonedir+0xd0/0xd0 system_call_fastpath+0x12/0x17 Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"Nicolas Iooss2-2/+2
This reverts commit 5a254d08b086d80cbead2ebcee6d2a4b3a15587a. Since commit 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats when add one"), nfs_readpage and nfs_do_writepage use nfs_inc_stats to increment NFSIOS_READPAGES and NFSIOS_WRITEPAGES instead of nfs_add_stats. However nfs_inc_stats does not do the same thing as nfs_add_stats with value 1 because these functions work on distinct stats: nfs_inc_stats increments stats from "enum nfs_stat_eventcounters" (in server->io_stats->events) and nfs_add_stats those from "enum nfs_stat_bytecounters" (in server->io_stats->bytes). Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Fixes: 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats...") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Rename idmap.c to nfs4idmap.cAnna Schumaker2-1/+1
I added the nfs4 prefix to make it obvious that this file is built into the NFS v4 module, and not the generic client. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Move nfs_idmap.h into fs/nfs/Anna Schumaker9-8/+76
This file is only used internally to the NFS v4 module, so it doesn't need to be in the global include path. I also renamed it from nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include file. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.hAnna Schumaker2-2/+0
The idmapper is completely internal to the NFS v4 module, so this macro will always evaluate to true. This patch also removes unnecessary includes of this file from the generic NFS client. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Add a stub for GETDEVICELISTAnna Schumaker1-0/+6
d4b18c3e (pnfs: remove GETDEVICELIST implementation) removed the GETDEVICELIST operation from the NFS client, but left a "hole" in the nfs4_procedures array. This caused /proc/self/mountstats to report an operation named "51" where GETDEVICELIST used to be. This patch adds a stub to fix mountstats. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Fixes: d4b18c3e (pnfs: remove GETDEVICELIST implementation) Cc: stable@vger.kernel.org # 3.17+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: remove WARN_ON_ONCE from nfs_direct_good_bytesPeng Tao1-2/+0
For flexfiles driver, we might choose to read from mirror index other than 0 while mirror_count is always 1 for read. Reported-by: Jean Spector <jean@primarydata.com> Cc: <stable@vger.kernel.org> # v3.19+ Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: fix DIO good bytes calculationPeng Tao1-12/+17
For direct read that has IO size larger than rsize, we'll split it into several READ requests and nfs_direct_good_bytes() would count completed bytes incorrectly by eating last zero count reply. Fix it by handling mirror and non-mirror cases differently such that we only count mirrored writes differently. This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring"). Reported-by: Jean Spector <jean@primarydata.com> Cc: <stable@vger.kernel.org> # v3.19+ Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: Fetch MOUNTED_ON_FILEID when updating an inodeAnna Schumaker2-2/+16
2ef47eb1 (NFS: Fix use of nfs_attr_use_mounted_on_fileid()) was a good start to fixing a circular directory structure warning for NFS v4 "junctioned" mountpoints. Unfortunately, further testing continued to generate this error. My server is configured like this: anna@nfsd ~ % df Filesystem Size Used Avail Use% Mounted on /dev/vda1 9.1G 2.0G 6.5G 24% / /dev/vdc1 1014M 33M 982M 4% /exports /dev/vdc2 1014M 33M 982M 4% /exports/vol1 /dev/vdc3 1014M 33M 982M 4% /exports/vol1/vol2 anna@nfsd ~ % cat /etc/exports /exports/ *(rw,async,no_subtree_check,no_root_squash) /exports/vol1/ *(rw,async,no_subtree_check,no_root_squash) /exports/vol1/vol2 *(rw,async,no_subtree_check,no_root_squash) I've been running chown across the entire mountpoint twice in a row to hit this problem. The first run succeeds, but the second one fails with the circular directory warning along with: anna@client ~ % dmesg [Apr 3 14:28] NFS: server 192.168.100.204 error: fileid changed fsid 0:39: expected fileid 0x100080, got 0x80 WHere 0x80 is the mountpoint's fileid and 0x100080 is the mounted-on fileid. This patch fixes the issue by requesting an updated mounted-on fileid from the server during nfs_update_inode(), and then checking that the fileid stored in the nfs_inode matches either the fileid or mounted-on fileid returned by the server. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23nfs: fix high load average due to callback thread sleepingJeff Layton1-3/+3
Chuck pointed out a problem that crept in with commit 6ffa30d3f734 (nfs: don't call blocking operations while !TASK_RUNNING). Linux counts tasks in uninterruptible sleep against the load average, so this caused the system's load average to be pinned at at least 1 when there was a NFSv4.1+ mount active. Not a huge problem, but it's probably worth fixing before we get too many complaints about it. This patch converts the code back to use TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop iteration. In practice no one should really be signalling this thread at all, so I think this is reasonably safe. With this change, there's also no need to game the hung task watchdog so we can also convert the schedule_timeout call back to a normal schedule. Cc: <stable@vger.kernel.org> Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Fixes: commit 6ffa30d3f734 (“nfs: don't call blocking . . .”) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Reduce time spent holding the i_mutex during fallocate()Anna Schumaker2-7/+10
At the very least, we should not be taking the i_mutex until after checking if the server even supports ALLOCATE or DEALLOCATE, allowing v4.0 or v4.1 to exit without potentially waiting on a lock. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23NFS: Don't zap caches on fallocate()Anna Schumaker4-10/+35
This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE operations so we can set the updated inode size and change attribute directly. DEALLOCATE will still need to release pagecache pages, so nfs42_proc_deallocate() now calls truncate_pagecache_range() before contacting the server. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-22Merge tag 'for-linus-20150422' of git://git.infradead.org/linux-mtdLinus Torvalds1-1/+0
Pull MTD updates from Brian Norris: "Common MTD: - Add Kconfig option for keeping both the 'master' and 'partition' MTDs registered as devices. This would really make a better default if we could do it over, as it allows a lot more flexibility in (1) determining the flash topology of the system from user-space and (2) adding temporary partitions at runtime (ioctl(BLKPG)). Unfortunately, this would possibly cause user-space breakage, as it will cause renumbering of the /dev/mtdX devices. We'll see if we can change this in the future, as there have already been a few people looking for this feature, and I know others have just been working around our current limitations instead of fixing them this way. - Along with the previous change, add some additional information to sysfs, so user-space can read the offset of each partition within its master device SPI NOR: - add new device tree compatible binding to represent the mostly-compatible class of SPI NOR flash which can be detected by their extended JEDEC ID bytes, cutting down the duplication of our ID tables - misc. new IDs Various other miscellaneous fixes and changes" * tag 'for-linus-20150422' of git://git.infradead.org/linux-mtd: (53 commits) mtd: spi-nor: Add support for Macronix mx25u6435f serial flash mtd: spi-nor: Add support for Winbond w25q64dw serial flash mtd: spi-nor: add support for the Winbond W25X05 flash mtd: spi-nor: support en25s64 device mtd: m25p80: bind to "nor-jedec" ID, for auto-detection Documentation: devicetree: m25p80: add "nor-jedec" binding mtd: Make MTD tests cancelable mtd: mtd_oobtest: Fix bitflip_limit usage in test case 3 mtd: docg3: remove invalid __exit annotations mtd: fsl_ifc_nand: use msecs_to_jiffies for time conversion mtd: atmel_nand: don't map the ROM table if no pmecc table offset in DT mtd: atmel_nand: add a definition for the oob reserved bytes mtd: part: Remove partition overlap checks mtd: part: Add sysfs variable for offset of partition mtd: part: Create the master device node when partitioned mtd: ts5500_flash: Fix typo in MODULE_DESCRIPTION in ts5500_flash.c mtd: denali: Disable sub-page writes in Denali NAND driver mtd: pxa3xx_nand: cleanup wait_for_completion handling mtd: nand: gpmi: Check for scan_bbt() error mtd: nand: gpmi: fixup return type of wait_for_completion_timeout ...
2015-04-22Merge branch 'for-linus' of ↵Linus Torvalds8-92/+190
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This time around we have a collection of CephFS fixes from Zheng around MDS failure handling and snapshots, support for a new CRUSH straw2 algorithm (to sync up with userspace) and several RBD cleanups and fixes from Ilya, an error path leak fix from Taesoo, and then an assorted collection of cleanups from others" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits) rbd: rbd_wq comment is obsolete libceph: announce support for straw2 buckets crush: straw2 bucket type with an efficient 64-bit crush_ln() crush: ensuring at most num-rep osds are selected crush: drop unnecessary include from mapper.c ceph: fix uninline data function ceph: rename snapshot support ceph: fix null pointer dereference in send_mds_reconnect() ceph: hold on to exclusive caps on complete directories libceph: simplify our debugfs attr macro ceph: show non-default options only libceph: expose client options through debugfs libceph, ceph: split ceph_show_options() rbd: mark block queue as non-rotational libceph: don't overwrite specific con error msgs ceph: cleanup unsafe requests when reconnecting is denied ceph: don't zero i_wrbuffer_ref when reconnecting is denied ceph: don't mark dirty caps when there is no auth cap ceph: keep i_snap_realm while there are writers libceph: osdmap.h: Add missing format newlines ...
2015-04-22ceph: fix uninline data functionYan, Zheng1-13/+21
For CEPH_OSD_CMPXATTR_MODE_U64, OSD expects the u64 to be encoded as string in object's xattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22ceph: rename snapshot supportYan, Zheng2-4/+10
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-22ceph: fix null pointer dereference in send_mds_reconnect()Yan, Zheng1-1/+2
sb->s_root can be null when umounting Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-21nfsd: fix nsfd startup race triggering BUG_ONGiuseppe Cantavenera1-8/+8
nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...) in fs/nfsd/nfs4recover.c was called before assigning ntfsd_net_id. The following was observed on a MIPS 32-core processor: kernel: Call Trace: kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd] kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8 kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70 kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0 kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0 kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8 kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0 kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28 kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8 kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68 kernel: kernel: Code: 0040f809 00000000 2e020001 <00020336> 3c12c00d 3c02801a de100000 6442eb98 0040f809 kernel: ---[ end trace 7471374335809536 ]--- Fixed this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before registering rpc_pipefs_event(...) with the notifier chain. Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com> Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com> Reviewed-by: Kinlong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd: eliminate NFSD_DEBUGMark Salter3-3/+3
Commit f895b252d4edf ("sunrpc: eliminate RPC_DEBUG") introduced use of IS_ENABLED() in a uapi header which leads to a build failure for userspace apps trying to use <linux/nfsd/debug.h>: linux/nfsd/debug.h:18:15: error: missing binary operator before token "(" #if IS_ENABLED(CONFIG_SUNRPC_DEBUG) ^ Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG. Cc: stable@vger.kernel.org Fixes: f895b252d4edf "sunrpc: eliminate RPC_DEBUG" Signed-off-by: Mark Salter <msalter@redhat.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd4: fix READ permission checkingJ. Bruce Fields1-4/+8
In the case we already have a struct file (derived from a stateid), we still need to do permission-checking; otherwise an unauthorized user could gain access to a file by sniffing or guessing somebody else's stateid. Cc: stable@vger.kernel.org Fixes: dc97618ddda9 "nfsd4: separate splice and readv cases" Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd4: disallow SEEK with special stateidsJ. Bruce Fields1-0/+2
If the client uses a special stateid then we'll pass a NULL file to vfs_llseek. Fixes: 24bab491220f " NFSD: Implement SEEK" Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21nfsd4: disallow ALLOCATE with special stateidsJ. Bruce Fields1-0/+2
vfs_fallocate will hit a NULL dereference if the client tries an ALLOCATE or DEALLOCATE with a special stateid. Fix that. (We also depend on the open to have broken any conflicting leases or delegations for us.) (If it turns out we need to allow special stateid's then we could do a temporary open here in the special-stateid case, as we do for read and write. For now I'm assuming it's not necessary.) Fixes: 95d871f03cae "nfsd: Add ALLOCATE support" Cc: stable@vger.kernel.org Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-04-21Revert "ocfs2: incorrect check for debugfs returns"Linus Torvalds3-37/+16
This reverts commit e2ac55b6a8e337fac7cc59c6f452caac92ab5ee6. Huang Ying reports that this causes a hang at boot with debugfs disabled. It is true that the debugfs error checks are kind of confusing, and this code certainly merits more cleanup and thinking about it, but there's something wrong with the trivial "check not just for NULL, but for error pointers too" patch. Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir pointer variable to an error pointer (-ENODEV), and then continue as if everything was fine. But since debugfs is disabled, all the _users_ of that pointer end up being compiled away, so even though the pointer can not be dereferenced, that's still fine. So it's confusing and somewhat questionable, but the "more correct" error checks end up causing more trouble than they fix. Reported-by: Huang Ying <ying.huang@intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Chengyu Song <csong84@gatech.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-20ceph: hold on to exclusive caps on complete directoriesYan, Zheng1-0/+12
If a directory is complete, we want to keep the exclusive cap. So that MDS does not end up revoking the shared cap on every create/unlink operation. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: show non-default options onlyIlya Dryomov1-5/+1
Don't pollute /proc/mounts with default options (presently these are dcache, nofsc and acl). Leave the acl/noacl however - it's a bit of a special case due to CONFIG_CEPH_FS_POSIX_ACL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20libceph, ceph: split ceph_show_options()Ilya Dryomov1-25/+15
Split ceph_show_options() into two pieces and move the piece responsible for printing client (libceph) options into net/ceph. This way people adding a libceph option wouldn't have to remember to update code in fs/ceph. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20ceph: cleanup unsafe requests when reconnecting is deniedYan, Zheng1-0/+28
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: don't zero i_wrbuffer_ref when reconnecting is deniedYan, Zheng1-7/+0
remove_session_caps_cb() does not truncate dirty data in page cache, but zeros i_wrbuffer_ref/i_wrbuffer_ref_head. This will result negtive i_wrbuffer_ref/i_wrbuffer_ref_head Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: don't mark dirty caps when there is no auth capYan, Zheng2-2/+8
No i_auth_cap means reconnecting to MDS was denied. So don't add new dirty caps. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: keep i_snap_realm while there are writersYan, Zheng1-9/+22
when reconnecting to MDS is denied, we remove session caps forcibly. But it's possible there are ongoing write, the write code needs to reference i_snap_realm. So if there are ongoing write, we keep i_snap_realm. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: kstrdup() memory handlingSanidhya Kashyap3-13/+44
Currently, there is no check for the kstrdup() for r_path2, r_path1 and snapdir_name as various locations as there is a possibility of failure during memory pressure. Therefore, returning ENOMEM where the checks have been missed. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: properly release page upon errorTaesoo Kim1-0/+4
When ceph_update_writeable_page fails (including -EAGAIN), it unlocks (w/ unlock_page) the page but does not 'release' (w/ page_cache_release) properly. Upon error, properly set *pagep to NULL, indicating an error. Signed-off-by: Taesoo Kim <tsgatesv@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: match wait_for_completion_timeout return typeNicholas Mc Guire1-4/+5
return type of wait_for_completion_timeout is unsigned long not int. An appropriately named unsigned long is added and the assignment fixed up. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: use msecs_to_jiffies for time conversionNicholas Mc Guire1-1/+1
This is only an API consolidation and should make things more readable it replaces var * HZ / 1000 by msecs_to_jiffies(var). Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: remove redundant declarationFabian Frederick1-1/+0
ceph_aops was already defined extern in addr.c section Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: fix dcache/nocache mount optionYan, Zheng2-1/+4
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-20ceph: drop cap releases in requests composed before cap reconnectYan, Zheng1-6/+13
These cap releases are stale because MDS will re-establish client caps according to the cap reconnect messages. Note: MDS can detect stale cap messages, so these stale cap releases are harmless even we don't drop them. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-04-19Merge tag 'ext4_for_linus' of ↵Linus Torvalds29-246/+3344
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A few bug fixes and add support for file-system level encryption in ext4" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits) ext4 crypto: enable encryption feature flag ext4 crypto: add symlink encryption ext4 crypto: enable filename encryption ext4 crypto: filename encryption modifications ext4 crypto: partial update to namei.c for fname crypto ext4 crypto: insert encrypted filenames into a leaf directory block ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames ext4 crypto: filename encryption facilities ext4 crypto: implement the ext4 decryption read path ext4 crypto: implement the ext4 encryption write path ext4 crypto: inherit encryption policies on inode and directory create ext4 crypto: enforce context consistency ext4 crypto: add encryption key management facilities ext4 crypto: add ext4 encryption facilities ext4 crypto: add encryption policy and password salt support ext4 crypto: add encryption xattr support ext4 crypto: export ext4_empty_dir() ext4 crypto: add ext4 encryption Kconfig ext4 crypto: reserve codepoints used by the ext4 encryption feature ext4 crypto: add ext4_mpage_readpages() ...
2015-04-19fs: take i_mutex during prepare_binprm for set[ug]id executablesJann Horn1-28/+48
This prevents a race between chown() and execve(), where chowning a setuid-user binary to root would momentarily make the binary setuid root. This patch was mostly written by Linus Torvalds. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-18Merge tag 'for-linus-4.1-merge-window' of ↵Linus Torvalds3-7/+6
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9pfs updates from Eric Van Hensbergen: "Some accumulated cleanup patches for kerneldoc and unused variables as well as some lock bug fixes and adding privateport option for RDMA" * tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: net/9p: add a privport option for RDMA transport. fs/9p: Initialize status in v9fs_file_do_lock. net/9p: Initialize opts->privport as it should be. net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show() 9p: use unsigned integers for nwqid/count 9p: do not crash on unknown lock status code 9p: fix error handling in v9fs_file_do_lock 9p: remove unused variable in p9_fd_create() 9p: kerneldoc warning fixes
2015-04-18Merge branch 'for-linus' of ↵Linus Torvalds4-57/+156
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull usernamespace mount fixes from Eric Biederman: "Way back in October Andrey Vagin reported that umount(MNT_DETACH) could be used to defeat MNT_LOCKED. As I worked to fix this I discovered that combined with mount propagation and an appropriate selection of shared subtrees a reference to a directory on an unmounted filesystem is not necessary. That MNT_DETACH is allowed in user namespace in a form that can break MNT_LOCKED comes from my early misunderstanding what MNT_DETACH does. To avoid breaking existing userspace the conflict between MNT_DETACH and MNT_LOCKED is fixed by leaving mounts that are locked to their parents in the mount hash table until the last reference goes away. While investigating this issue I also found an issue with __detach_mounts. The code was unnecessarily and incorrectly triggering mount propagation. Resulting in too many mounts going away when a directory is deleted, and too many cpu cycles are burned while doing that. Looking some more I realized that __detach_mounts by only keeping mounts connected that were MNT_LOCKED it had the potential to still leak information so I tweaked the code to keep everything locked together that possibly could be. This code was almost ready last cycle but Al invented fs_pin which slightly simplifies this code but required rewrites and retesting, and I have not been in top form for a while so it took me a while to get all of that done. Similiarly this pull request is late because I have been feeling absolutely miserable all week. The issue of being able to escape a bind mount has not yet been addressed, as the fixes are not yet mature" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mnt: Update detach_mounts to leave mounts connected mnt: Fix the error check in __detach_mounts mnt: Honor MNT_LOCKED when detaching mounts fs_pin: Allow for the possibility that m_list or s_list go unused. mnt: Factor umount_mnt from umount_tree mnt: Factor out unhash_mnt from detach_mnt and umount_tree mnt: Fail collect_mounts when applied to unmounted mounts mnt: Don't propagate unmounts to locked mounts mnt: On an unmount propagate clearing of MNT_LOCKED mnt: Delay removal from the mount hash. mnt: Add MNT_UMOUNT flag mnt: In umount_tree reuse mnt_list instead of mnt_hash mnt: Don't propagate umounts in __detach_mounts mnt: Improve the umount_tree flags mnt: Use hlist_move_list in namespace_unlock
2015-04-18Merge tag 'for-f2fs-4.1' of ↵Linus Torvalds20-261/+1230
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "New features: - in-memory extent_cache - fs_shutdown to test power-off-recovery - use inline_data to store symlink path - show f2fs as a non-misc filesystem Major fixes: - avoid CPU stalls on sync_dirty_dir_inodes - fix some power-off-recovery procedure - fix handling of broken symlink correctly - fix missing dot and dotdot made by sudden power cuts - handle wrong data index during roll-forward recovery - preallocate data blocks for direct_io ... and a bunch of minor bug fixes and cleanups" * tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits) f2fs: pass checkpoint reason on roll-forward recovery f2fs: avoid abnormal behavior on broken symlink f2fs: flush symlink path to avoid broken symlink after POR f2fs: change 0 to false for bool type f2fs: do not recover wrong data index f2fs: do not increase link count during recovery f2fs: assign parent's i_mode for empty dir f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries f2fs: fix mismatching lock and unlock pages for roll-forward recovery f2fs: fix sparse warnings f2fs: limit b_size of mapped bh in f2fs_map_bh f2fs: persist system.advise into on-disk inode f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get f2fs: preallocate fallocated blocks for direct IO f2fs: enable inline data by default f2fs: preserve extent info for extent cache f2fs: initialize extent tree with on-disk extent info of inode f2fs: introduce __{find,grab}_extent_tree f2fs: split set_data_blkaddr from f2fs_update_extent_cache f2fs: enable fast symlink by utilizing inline data ...
2015-04-17efivarfs: Ensure VariableName is NUL-terminatedRoss Lagerwall1-1/+1
Some buggy firmware implementations update VariableNameSize on success such that it does not include the final NUL character which results in garbage in the efivarfs name entries. Use kzalloc on the efivar_entry (as is done in efivars.c) to ensure that the name is always NUL-terminated. The buggy firmware is: BIOS Information Vendor: Intel Corp. Version: S1200RP.86B.02.02.0005.102320140911 Release Date: 10/23/2014 BIOS Revision: 4.6 System Information Manufacturer: Intel Corporation Product Name: S1200RP_SE Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Matthew Garrett <mjg59@coreos.com> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: <stable@vger.kernel.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-04-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds51-362/+561
Merge third patchbomb from Andrew Morton: - various misc things - a couple of lib/ optimisations - provide DIV_ROUND_CLOSEST_ULL() - checkpatch updates - rtc tree - befs, nilfs2, hfs, hfsplus, fatfs, adfs, affs, bfs - ptrace fixes - fork() fixes - seccomp cleanups - more mmap_sem hold time reductions from Davidlohr * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (138 commits) proc: show locks in /proc/pid/fdinfo/X docs: add missing and new /proc/PID/status file entries, fix typos drivers/rtc/rtc-at91rm9200.c: make IO endian agnostic Documentation/spi/spidev_test.c: fix warning drivers/rtc/rtc-s5m.c: allow usage on device type different than main MFD type .gitignore: ignore *.tar MAINTAINERS: add Mediatek SoC mailing list tomoyo: reduce mmap_sem hold for mm->exe_file powerpc/oprofile: reduce mmap_sem hold for exe_file oprofile: reduce mmap_sem hold for mm->exe_file mips: ip32: add platform data hooks to use DS1685 driver lib/Kconfig: fix up HAVE_ARCH_BITREVERSE help text x86: switch to using asm-generic for seccomp.h sparc: switch to using asm-generic for seccomp.h powerpc: switch to using asm-generic for seccomp.h parisc: switch to using asm-generic for seccomp.h mips: switch to using asm-generic for seccomp.h microblaze: use asm-generic for seccomp.h arm: use asm-generic for seccomp.h seccomp: allow COMPAT sigreturn overrides ...
2015-04-17proc: show locks in /proc/pid/fdinfo/XAndrey Vagin2-10/+55
Let's show locks which are associated with a file descriptor in its fdinfo file. Currently we don't have a reliable way to determine who holds a lock. We can find some information in /proc/locks, but PID which is reported there can be wrong. For example, a process takes a lock, then forks a child and dies. In this case /proc/locks contains the parent pid, which can be reused by another process. $ cat /proc/locks ... 6: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF ... $ ps -C rpcbind PID TTY TIME CMD 332 ? 00:00:00 rpcbind $ cat /proc/332/fdinfo/4 pos: 0 flags: 0100000 mnt_id: 22 lock: 1: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF $ ls -l /proc/332/fd/4 lr-x------ 1 root root 64 Mar 5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock $ ls -l /proc/324/fd/ total 0 lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0 lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0 lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0 You can see that the process with the 324 pid doesn't hold the lock. This information is required for proper dumping and restoring file locks. Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Jeff Layton <jlayton@poochiereds.net> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17bfs: correct return valuesSanidhya Kashyap1-2/+2
In case of failed memory allocation, the return should be ENOMEM instead of ENOSPC. Return -EIO when sb_bread() fails. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17affs: kstrdup() memory handlingSanidhya Kashyap1-1/+5
There is a possibility of kstrdup() failure upon memory pressure. Therefore, returning ENOMEM even for new_opts. [akpm@linux-foundation.org: cleanup] Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: Taesoo kim <taesoo@gatech.edu> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/affs: use affs_test_opt()Fabian Frederick5-21/+22
Replace mount option test by affs_test_opt(). Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/affs/super.c: use affs_set_opt()Fabian Frederick1-15/+16
Replace direct mount option assignation by affs_set_opt() macro. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/affs/affs.h: add mount option manipulation macrosFabian Frederick1-0/+4
Add clear/set/test affs mount option macros. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/affs: use AFFS_MOUNT prefix for mount optionsFabian Frederick6-48/+54
Currently, affs still uses direct access on mount_options. This patch prepares to use affs_clear/set/test_opt() like other filesystems. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17adfs: return correct return valuesSanidhya Kashyap2-5/+16
Fix the wrong values returned by various functions such as EIO and ENOMEM. Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/exec.c:de_thread: move notify_count write under lockKirill Tkhai1-1/+5
We set sig->notify_count = -1 between RELEASE and ACQUIRE operations: spin_unlock_irq(lock); ... if (!thread_group_leader(tsk)) { ... for (;;) { sig->notify_count = -1; write_lock_irq(&tasklist_lock); There are no restriction on it so other processors may see this STORE mixed with other STOREs in both areas limited by the spinlocks. Probably, it may be reordered with the above sig->group_exit_task = tsk; sig->notify_count = zap_other_threads(tsk); in some way. Set it under tasklist_lock locked to be sure nothing will be reordered. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17prctl: avoid using mmap_sem for exe_file serializationDavidlohr Bueso1-0/+6
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead of calling set_mm_exe_file() which requires some form of serialization -- mmap_sem in this case. For archs that do not have atomic rmw instructions we still fallback to a spinlock alternative, so this should always be safe. As such, we only need the mmap_sem for looking up the backing vm_file, which can be done sharing the lock. Naturally, this means we need to manually deal with both the new and old file reference counting, and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can probably be deleted in the future anyway. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17mm: rcu-protected get_mm_exe_file()Konstantin Khlebnikov1-2/+1
This patch removes mm->mmap_sem from mm->exe_file read side. Also it kills dup_mm_exe_file() and moves exe_file duplication into dup_mmap() where both mmap_sems are locked. [akpm@linux-foundation.org: fix comment typo] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/fat: comment fix, fat_bits can be also 32Alexander Kuleshov1-1/+1
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/fat: remove unnecessary includesAlexander Kuleshov9-33/+1
'fat.h' includes <linux/buffer_head.h> which includes <linux/fs.h> which includes all the header files required for all *.c files fat filesystem. [akpm@linux-foundation.org: fs/fat/iode.c needs seq_file.h] [sfr@canb.auug.org.au: put one actually necessary include file back] Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/fat: remove unnecessary defintionAlexander Kuleshov1-2/+1
'*sb' never used, so let's remote it and pass inode->i_sb directly to the MSDOS_SB. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17hfsplus: don't store special "osx" xattr prefix on-diskThomas Hebb1-5/+14
On Mac OS X, HFS+ extended attributes are not namespaced. Since we want to be compatible with OS X filesystems and yet still support the Linux namespacing system, the hfsplus driver implements a special "osx" namespace that is reported for any attribute that is not namespaced on-disk. However, the current code for getting and setting these unprefixed attributes is broken. hfsplus_osx_setattr() and hfsplus_osx_getattr() are passed names that have already had their "osx." prefixes stripped by the generic functions. The functions first, quite correctly, check those names to make sure that they aren't prefixed with a known namespace, which would allow namespace access restrictions to be bypassed. However, the functions then prepend "osx." to the name they're given before passing it on to hfsplus_getattr() and hfsplus_setattr(). Not only does this cause the "osx." prefix to be stored on-disk, defeating its purpose, it also breaks the check for the special "com.apple.FinderInfo" attribute, which is reported for all files, and as a consequence makes some userspace applications (e.g. GNU patch) fail even when extended attributes are not otherwise in use. There are five commits which have touched this particular code: 127e5f5ae51e ("hfsplus: rework functionality of getting, setting and deleting of extended attributes") b168fff72109 ("hfsplus: use xattr handlers for removexattr") bf29e886b242 ("hfsplus: correct usage of HFSPLUS_ATTR_MAX_STRLEN for non-English attributes") fcacbd95e121 ("fs/hfsplus: move xattr_name allocation in hfsplus_getxattr()") ec1bbd346f18 ("fs/hfsplus: move xattr_name allocation in hfsplus_setxattr()") The first commit creates the functions to begin with. The namespace is prepended by the original code, which I believe was correct at the time, since hfsplus_?etattr() stripped the prefix if found. The second commit removes this behavior from hfsplus_?etattr() and appears to have been intended to also remove the prefixing from hfsplus_osx_?etattr(). However, what it actually does is remove a necessary strncpy() call completely, breaking the osx namespace entirely. The third commit re-adds the strncpy() call as it was originally, but doesn't mention it in its commit message. The final two commits refactor the code and don't affect its functionality. This commit does what b168fff attempted to do (prevent the prefix from being added), but does it properly, instead of passing in an empty buffer (which is what b168fff actually did). Fixes: b168fff72109 ("hfsplus: use xattr handlers for removexattr") Signed-off-by: Thomas Hebb <tommyhebb@gmail.com> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Sergei Antonov <saproj@gmail.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17hfsplus: fix expand when not enough available spaceSergei Antonov1-0/+6
Fix a bug which is reproduced as follows. Create a file: echo abc > test_file Try to expand the file beyond available space: truncate --size=<size exceeding available space> test_file Since HFS+ does not support file size > allocated size, truncate should fail. However, it ends successfully. The driver returns success despite having been unable to allocate the requested space for the file. Also filesystem check finds an error: Checking catalog file. Incorrect size for file test_file (It should be 469094400 instead of 1000000000) Add a piece of code analogous to code in the fat driver. Now a proper error is returned and filesystem remains consistent. Signed-off-by: Sergei Antonov <saproj@gmail.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Reviewed-by: Anton Altaparmakov <anton@tuxera.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Sougata Santra <sougata@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17hfsplus: incorrect return valueChengyu Song1-2/+2
In case of memory allocation error, the return should be -ENOMEM, instead of -ENOSPC. Signed-off-by: Chengyu Song <csong84@gatech.edu> Reviewed-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/hfsplus: replace if/BUG by BUG_ONFabian Frederick1-3/+1
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/hfsplus: use bool instead of int for is_known_namespace() return valueFabian Frederick1-1/+1
is_known_namespace() only returns true/false. Also remove inline and let compiler decide what to do with static functions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/hfsplus: atomically set inode->i_flagsFabian Frederick1-7/+5
According to commit 5f16f3225b06 ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()"). Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/hfsplus: move xattr_name allocation in hfsplus_setxattr()Fabian Frederick5-65/+35
security/trusted/user/osx setxattr did the same xattr_name initialization. Move that operation in hfsplus_setxattr(). Tested with security/trusted/user getfattr/setfattr Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/hfsplus: move xattr_name allocation in hfsplus_getxattr()Fabian Frederick5-67/+38
security/trusted/user/osx getxattr did the same xattr_name initialization. Move that operation in hfsplus_getxattr(). Tested with security/trusted/user getfattr/setfattr Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17hfsplus: add missing curly braces in hfsplus_delete_cat()Dan Carpenter1-1/+2
This doesn't change how the code works, but clearly the curly braces were intended. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Sougata Santra <sougata@tuxera.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17hfs: incorrect return valuesChengyu Song1-2/+2
In case of memory allocation error, the return should be -ENOMEM, instead of -ENOSPC. Signed-off-by: Chengyu Song <csong84@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: use inode_set_flags() in nilfs_set_inode_flags()Ryusuke Konishi1-7/+8
Use inode_set_flags() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the FS_IMMUTABLE_FL, FS_APPEND_FL flags to avoid a race where an immutable file has the immutable flag cleared for a brief window of time. This is a similar fix to commit 5f16f3225b06 ("ext4: atomically set inode->i_flags in ext4_set_inode_flags()"). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: put out gfp mask manipulation from nilfs_set_inode_flags()Ryusuke Konishi1-2/+2
nilfs_set_inode_flags() function adjusts gfp-mask of inode->i_mapping as well as i_flags, however, this coupling of operations is not appropriate. For instance, nilfs_ioctl_setflags(), one of three callers of nilfs_set_inode_flags(), doesn't need to reinitialize the gfp-mask at all. In addition, nilfs_new_inode(), another caller of nilfs_set_inode_flags(), doesn't either because it has already initialized the gfp-mask. Only __nilfs_read_inode(), the remaining caller, needs it. So, this moves the gfp mask manipulation to __nilfs_read_inode() from nilfs_set_inode_flags(). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: fix gcc warning at nilfs_checkpoint_is_mounted()Ryusuke Konishi1-1/+1
Fix the following build warning: fs/nilfs2/super.c: In function 'nilfs_checkpoint_is_mounted': fs/nilfs2/super.c:1023:10: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (cno < 0 || cno > nilfs->ns_cno) ^ This warning indicates that the comparision "cno < 0" is useless because variable "cno" has an unsigned integer type "__u64". Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: improve execution time of NILFS_IOCTL_GET_CPINFO ioctlRyusuke Konishi1-6/+52
The older a filesystem gets, the slower lscp command becomes. This is because nilfs_cpfile_do_get_cpinfo() function meets more hole blocks as the start offset of valid checkpoint numbers gets bigger. This reduces the overhead by skipping hole blocks efficiently with nilfs_mdt_find_block() helper. A measurement result of this patch is as follows: Before: $ time lscp CNO DATE TIME MODE FLG BLKCNT ICNT 5769303 2015-02-22 19:31:33 cp - 108 1 5769304 2015-02-22 19:38:54 cp - 108 1 real 0m0.182s user 0m0.003s sys 0m0.180s After: $ time lscp CNO DATE TIME MODE FLG BLKCNT ICNT 5769303 2015-02-22 19:31:33 cp - 108 1 5769304 2015-02-22 19:38:54 cp - 108 1 real 0m0.003s user 0m0.001s sys 0m0.002s Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: add helper to find existent block on metadata fileRyusuke Konishi2-0/+57
Add a new metadata file function, nilfs_mdt_find_block(), which finds an existent block on a metadata file in a given range of blocks. This function skips continuous hole blocks efficiently by using nilfs_bmap_seek_key(). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: add bmap function to seek a valid keyRyusuke Konishi4-1/+115
Add a new bmap function, nilfs_bmap_seek_key(), which seeks a valid entry and returns its key starting from a given key. This function can be used to skip hole blocks efficiently. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: unify type of key arguments in bmap interfaceRyusuke Konishi4-20/+16
The type of key arguments in block mapping interface varies depending on function. For instance, nilfs_bmap_lookup_at_level() takes "__u64" for its key argument whereas nilfs_bmap_lookup() takes "unsigned long". This fits them to "__u64" to eliminate the variation. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: use bgl_lock_ptr()Ryusuke Konishi1-2/+5
Simplify nilfs_mdt_bgl_lock() by utilizing bgl_lock_ptr() helper in <linux/blockgroup_lock.h>. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: use set_mask_bits() for operations on buffer state bitmapRyusuke Konishi2-20/+18
nilfs_forget_buffer(), nilfs_clear_dirty_page(), and nilfs_segctor_complete_write() are using a bunch of atomic bit operations against buffer state bitmap. This reduces the number of them by utilizing set_mask_bits() macro. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17nilfs2: do not use async write flag for segment summary buffersRyusuke Konishi1-3/+0
The async write flag is introduced to nilfs2 in the commit 7f42ec394156 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks"), but the flag only makes sense for data buffers and btree node buffers. It is not needed for segment summary buffers. This gets rid of the latter uses as part of refactoring of atomic bit operations on buffer state bitmap. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17befs: replace typedef befs_inode_info by structureFabian Frederick2-7/+7
See Documentation/CodingStyle Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17befs: replace typedef befs_sb_info by structureFabian Frederick5-13/+11
See Documenation/CodingStyle Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17befs: replace typedef befs_mount_options by structureFabian Frederick2-5/+5
See Documentation/CodingStyle Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17fs/binfmt_misc.c: simplify entry_status()Rasmus Villemoes1-21/+9
sprintf() reliably returns the number of characters printed, so we don't need to ask strlen() where we are. Also replace calling sprintf("%02x") in a loop with the much simpler bin2hex(). [akpm@linux-foundation.org: it's odd to include kernel.h after everything else] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-16Merge branch 'for-linus' of ↵Linus Torvalds45-562/+439
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull third hunk of vfs changes from Al Viro: "This contains the ->direct_IO() changes from Omar + saner generic_write_checks() + dealing with fcntl()/{read,write}() races (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of repeatedly looking at ->f_flags, which can be changed by fcntl(2), check ->ki_flags - which cannot) + infrastructure bits for dhowells' d_inode annotations + Christophs switch of /dev/loop to vfs_iter_write()" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits) block: loop: switch to VFS ITER_BVEC configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode VFS: Make pathwalk use d_is_reg() rather than S_ISREG() VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR() VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk NFS: Don't use d_inode as a variable name VFS: Impose ordering on accesses of d_inode and d_flags VFS: Add owner-filesystem positive/negative dentry checks nfs: generic_write_checks() shouldn't be done on swapout... ocfs2: use __generic_file_write_iter() mirror O_APPEND and O_DIRECT into iocb->ki_flags switch generic_write_checks() to iocb and iter ocfs2: move generic_write_checks() before the alignment checks ocfs2_file_write_iter: stop messing with ppos udf_file_write_iter: reorder and simplify fuse: ->direct_IO() doesn't need generic_write_checks() ext4_file_write_iter: move generic_write_checks() up xfs_file_aio_write_checks: switch to iocb/iov_iter generic_write_checks(): drop isblk argument blkdev_write_iter: expand generic_file_checks() call in there ...
2015-04-16Merge branch 'for_linus' of ↵Linus Torvalds25-308/+461
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and udf updates from Jan Kara: "The pull contains quota changes which complete unification of XFS and VFS quota interfaces (so tools can use either interface to manipulate any filesystem). There's also a patch to support project quotas in VFS quota subsystem from Li Xi. Finally there's a bunch of UDF fixes and cleanups and tiny cleanup in reiserfs & ext3" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) udf: Update ctime and mtime when directory is modified udf: return correct errno for udf_update_inode() ext3: Remove useless condition in if statement. vfs: Add general support to enforce project quota limits reiserfs: fix __RASSERT format string udf: use int for allocated blocks instead of sector_t udf: remove redundant buffer_head.h includes udf: remove else after return in __load_block_bitmap() udf: remove unused variable in udf_table_free_blocks() quota: Fix maximum quota limit settings quota: reorder flags in quota state quota: paranoia: check quota tree root quota: optimize i_dquot access quota: Hook up Q_XSETQLIM for id 0 to ->set_info xfs: Add support for Q_SETINFO quota: Make ->set_info use structure with neccesary info to VFS and XFS quota: Remove ->get_xstate and ->get_xstatev callbacks gfs2: Convert to using ->get_state callback xfs: Convert to using ->get_state callback quota: Wire up Q_GETXSTATE and Q_GETXSTATV calls to work with ->get_state ...
2015-04-16Merge branch 'for-4.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-15/+30
Pull block layer core bits from Jens Axboe: "This is the core pull request for 4.1. Not a lot of stuff in here for this round, mostly little fixes or optimizations. This pull request contains: - An optimization that speeds up queue runs on blk-mq, especially for the case where there's a large difference between nr_cpu_ids and the actual mapped software queues on a hardware queue. From Chong Yuan. - Honor node local allocations for requests on legacy devices. From David Rientjes. - Cleanup of blk_mq_rq_to_pdu() from me. - exit_aio() fixup from me, greatly speeding up exiting multiple IO contexts off exit_group(). For my particular test case, fio exit took ~6 seconds. A typical case of both exposing RCU grace periods to user space, and serializing exit of them. - Make blk_mq_queue_enter() honor the gfp mask passed in, so we only wait if __GFP_WAIT is set. From Keith Busch. - blk-mq exports and two added helpers from Mike Snitzer, which will be used by the dm-mq code. - Cleanups of blk-mq queue init from Wei Fang and Xiaoguang Wang" * 'for-4.1/core' of git://git.kernel.dk/linux-block: blk-mq: reduce unnecessary software queue looping aio: fix serial draining in exit_aio() blk-mq: cleanup blk_mq_rq_to_pdu() blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue() block: remove redundant check about 'set->nr_hw_queues' in blk_mq_alloc_tag_set() block: allocate request memory local to request queue blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set blk-mq: export blk_mq_run_hw_queues blk-mq: add blk_mq_init_allocated_queue and export blk_mq_register_disk
2015-04-16Merge tag 'powerpc-4.1-1' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: - Numerous minor fixes, cleanups etc. - More EEH work from Gavin to remove its dependency on device_nodes. - Memory hotplug implemented entirely in the kernel from Nathan Fontenot. - Removal of redundant CONFIG_PPC_OF by Kevin Hao. - Rewrite of VPHN parsing logic & tests from Greg Kurz. - A fix from Nish Aravamudan to reduce memory usage by clamping nodes_possible_map. - Support for pstore on powernv from Hari Bathini. - Removal of old powerpc specific byte swap routines by David Gibson. - Fix from Vasant Hegde to prevent the flash driver telling you it was flashing your firmware when it wasn't. - Patch from Ben Herrenschmidt to add an OPAL heartbeat driver. - Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan Stancek. - Some fixes for migration from Tyrel Datwyler. - A new syscall to switch the cpu endian by Michael Ellerman. - Large series from Wei Yang to implement SRIOV, reviewed and acked by Bjorn. - A fix for the OPAL sensor driver from Cédric Le Goater. - Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman. - Large series from Daniel Axtens to make our PCI hooks per PHB rather than per machine. - Small patch from Sam Bobroff to explicitly abort non-suspended transactions on syscalls, plus a test to exercise it. - Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu. - Small patch to enable the hard lockup detector from Anton Blanchard. - Fix from Dave Olson for missing L2 cache information on some CPUs. - Some fixes from Michael Ellerman to get Cell machines booting again. - Freescale updates from Scott: Highlights include BMan device tree nodes, an MSI erratum workaround, a couple minor performance improvements, config updates, and misc fixes/cleanup. * tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits) powerpc/powermac: Fix build error seen with powermac smp builds powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE powerpc: Remove PPC32 code from pseries specific find_and_init_phbs() powerpc/cell: Fix iommu breakage caused by controller_ops change powerpc/eeh: Fix crash in eeh_add_device_early() on Cell powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails powerpc/pseries: Correct memory hotplug locking powerpc: Fix missing L2 cache size in /sys/devices/system/cpu powerpc: Add ppc64 hard lockup detector support oprofile: Disable oprofile NMI timer on ppc64 powerpc/perf/hv-24x7: Add missing put_cpu_var() powerpc/perf/hv-24x7: Break up single_24x7_request powerpc/perf/hv-24x7: Define update_event_count() powerpc/perf/hv-24x7: Whitespace cleanup powerpc/perf/hv-24x7: Define add_event_to_24x7_request() powerpc/perf/hv-24x7: Rename hv_24x7_event_update powerpc/perf/hv-24x7: Move debug prints to separate function powerpc/perf/hv-24x7: Drop event_24x7_request() powerpc/perf/hv-24x7: Use pr_devel() to log message ... Conflicts: tools/testing/selftests/powerpc/Makefile tools/testing/selftests/powerpc/tm/Makefile
2015-04-16f2fs: pass checkpoint reason on roll-forward recoveryJaegeuk Kim3-2/+7
This patch adds CP_RECOVERY to remain recovery information for checkpoint. And, it makes sure writing checkpoint in this case. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-04-16f2fs: avoid abnormal behavior on broken symlinkJaegeuk Kim1-1/+19
When f2fs_symlink was triggered and checkpoint was done before syncing its link path, f2fs can get broken symlink like "xxx -> \0\0\0". This incurs abnormal path_walk by VFS. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-04-16f2fs: flush symlink path to avoid broken symlink after PORJaegeuk Kim1-0/+11
This patch tries to avoid broken symlink case after POR in best effort. This results in performance regression. But, if f2fs has inline_data and the target path is under 3KB-sized long, the page would be stored in its inode_block, so that there would be no performance regression. Note that, if user wants to keep this file atomically, it needs to trigger dir->fsync. And, there is still a hole to produce broken symlink. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-04-16Merge branch 'xfs-dio-extend-fix' into for-nextxfs-for-linus-4.1-rc1Dave Chinner3-82/+239
Conflicts: fs/xfs/xfs_file.c
2015-04-16xfs: using generic_file_direct_write() is unnecessaryDave Chinner1-3/+20
generic_file_direct_write() does all sorts of things to make DIO work "sorta ok" with mixed buffered IO workloads. We already do most of this work in xfs_file_aio_dio_write() because of the locking requirements, so there's only a couple of things it does for us. The first thing is that it does a page cache invalidation after the ->direct_IO callout. This can easily be added to the XFS code. The second thing it does is that if data was written, it updates the iov_iter structure to reflect the data written, and then does EOF size updates if necessary. For XFS, these EOF size updates are now not necessary, as we do them safely and race-free in IO completion context. That leaves just the iov_iter update, and that's also moved to the XFS code. Therefore we don't need to call generic_file_direct_write() and in doing so remove redundant buffered writeback and page cache invalidation calls from the DIO submission path. We also remove a racy EOF size update, and make the DIO submission code in XFS much easier to follow. Wins all round, really. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: direct IO EOF zeroing needs to drain AIODave Chinner1-0/+10
When we are doing AIO DIO writes, the IOLOCK only provides an IO submission barrier. When we need to do EOF zeroing, we need to ensure that no other IO is in progress and all pending in-core EOF updates have been completed. This requires us to wait for all outstanding AIO DIO writes to the inode to complete and, if necessary, run their EOF updates. Once all the EOF updates are complete, we can then restart xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing that EOF is up to date and we have exclusive IO access to the file so we can run EOF block zeroing if we need to without interference. This gives EOF zeroing the same exclusivity against other IO as we provide truncate operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO write completion size updates raceDave Chinner2-1/+19
xfs_end_io_direct_write() can race with other IO completions when updating the in-core inode size. The IO completion processing is not serialised for direct IO - they are done either under the IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all during AIO DIO completion. Hence the non-atomic test-and-set update of the in-core inode size is racy and can result in the in-core inode size going backwards if the race if hit just right. If the inode size goes backwards, this can trigger the EOF zeroing code to run incorrectly on the next IO, which then will zero data that has successfully been written to disk by a previous DIO. To fix this bug, we need to serialise the test/set updates of the in-core inode size. This first patch introduces locking around the relevant updates and checks in the DIO path. Because we now have an ioend in xfs_end_io_direct_write(), we know exactly then we are doing an IO that requires an in-core EOF update, and we know that they are not running in interrupt context. As such, we do not need to use irqsave() spinlock variants to protect against interrupts while the lock is held. Hence we can use an existing spinlock in the inode to do this serialisation and so not need to grow the struct xfs_inode just to work around this problem. This patch does not address the test/set EOF update in generic_file_write_direct() for various reasons - that will be done as a followup with separate explanation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO writes within EOF don't need an ioendDave Chinner2-30/+40
DIO writes that lie entirely within EOF have nothing to do in IO completion. In this case, we don't need no steekin' ioend, and so we can avoid allocating an ioend until we have a mapping that spans EOF. This means that IO completion has two contexts - deferred completion to the dio workqueue that uses an ioend, and interrupt completion that does nothing because there is nothing that can be done in this context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: handle DIO overwrite EOF update completion correctlyDave Chinner2-31/+31
Currently a DIO overwrite that extends the EOF (e.g sub-block IO or write into allocated blocks beyond EOF) requires a transaction for the EOF update. Thi is done in IO completion context, but we aren't explicitly handling this situation properly and so it can run in interrupt context. Ensure that we defer IO that spans EOF correctly to the DIO completion workqueue, and now that we have an ioend in IO completion we can use the common ioend completion path to do all the work. Note: we do not preallocate the append transaction as we can have multiple mapping and allocation calls per direct IO. hence preallocating can still leave us with nested transactions by attempting to map and allocate more blocks after we've preallocated an append transaction. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: DIO needs an ioend for writesDave Chinner2-10/+85
Currently we can only tell DIO completion that an IO requires unwritten extent completion. This is done by a hacky non-null private pointer passed to Io completion, but the private pointer does not actually contain any information that is used. We also need to pass to IO completion the fact that the IO may be beyond EOF and so a size update transaction needs to be done. This is currently determined by checks in the io completion, but we need to determine if this is necessary at block mapping time as we need to defer the size update transactions to a completion workqueue, just like unwritten extent conversion. To do this, first we need to allocate and pass an ioend to to IO completion. Add this for unwritten extent conversion; we'll do the EOF updates in the next commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: move DIO mapping size calculationDave Chinner1-33/+46
The mapping size calculation is done last in __xfs_get_blocks(), but we are going to need the actual mapping size we will use to map the direct IO correctly in xfs_map_direct(). Factor out the calculation for code clarity, and move the call to be the first operation in mapping the extent to the returned buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16xfs: factor DIO write mapping from get_blocksDave Chinner1-13/+27
Clarify and separate the buffer mapping logic so that the direct IO mapping is not tangled up in propagating the extent status to teh mapping buffer. This makes it easier to extend the direct IO mapping to use an ioend in future. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16ext4 crypto: enable encryption feature flagTheodore Ts'o6-24/+79
Also add the test dummy encryption mode flag so we can more easily test the encryption patches using xfstests. Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-16ext4 crypto: add symlink encryptionTheodore Ts'o5-23/+184
Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds17-138/+183
Merge second patchbomb from Andrew Morton: - the rest of MM - various misc bits - add ability to run /sbin/reboot at reboot time - printk/vsprintf changes - fiddle with seq_printf() return value * akpm: (114 commits) parisc: remove use of seq_printf return value lru_cache: remove use of seq_printf return value tracing: remove use of seq_printf return value cgroup: remove use of seq_printf return value proc: remove use of seq_printf return value s390: remove use of seq_printf return value cris fasttimer: remove use of seq_printf return value cris: remove use of seq_printf return value openrisc: remove use of seq_printf return value ARM: plat-pxa: remove use of seq_printf return value nios2: cpuinfo: remove use of seq_printf return value microblaze: mb: remove use of seq_printf return value ipc: remove use of seq_printf return value rtc: remove use of seq_printf return value power: wakeup: remove use of seq_printf return value x86: mtrr: if: remove use of seq_printf return value linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK MAINTAINERS: CREDITS: remove Stefano Brivio from B43 .mailmap: add Ricardo Ribalda CREDITS: add Ricardo Ribalda Delgado ...
2015-04-15proc: remove use of seq_printf return valueJoe Perches2-36/+50
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15lib/string_helpers.c: change semantics of string_escape_memRasmus Villemoes1-2/+2
The current semantics of string_escape_mem are inadequate for one of its current users, vsnprintf(). If that is to honour its contract, it must know how much space would be needed for the entire escaped buffer, and string_escape_mem provides no way of obtaining that (short of allocating a large enough buffer (~4 times input string) to let it play with, and that's definitely a big no-no inside vsnprintf). So change the semantics for string_escape_mem to be more snprintf-like: Return the size of the output that would be generated if the destination buffer was big enough, but of course still only write to the part of dst it is allowed to, and (contrary to snprintf) don't do '\0'-termination. It is then up to the caller to detect whether output was truncated and to append a '\0' if desired. Also, we must output partial escape sequences, otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever they previously contained. This also fixes a bug in the escaped_string() helper function, which used to unconditionally pass a length of "end-buf" to string_escape_mem(); since the latter doesn't check osz for being insanely large, it would happily write to dst. For example, kasprintf(GFP_KERNEL, "something and then %pE", ...); is an easy way to trigger an oops. In test-string_helpers.c, the -ENOMEM test is replaced with testing for getting the expected return value even if the buffer is too small. We also ensure that nothing is written (by relying on a NULL pointer deref) if the output size is 0 by passing NULL - this has to work for kasprintf("%pE") to work. In net/sunrpc/cache.c, I think qword_add still has the same semantics. Someone should definitely double-check this. In fs/proc/array.c, I made the minimum possible change, but longer-term it should stop poking around in seq_file internals. [andriy.shevchenko@linux.intel.com: simplify qword_add] [andriy.shevchenko@linux.intel.com: add missed curly braces] Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15kernel: conditionally support non-root users, groups and capabilitiesIulia Manda2-1/+2
There are a lot of embedded systems that run most or all of their functionality in init, running as root:root. For these systems, supporting multiple users is not necessary. This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for non-root users, non-root groups, and capabilities optional. It is enabled under CONFIG_EXPERT menu. When this symbol is not defined, UID and GID are zero in any possible case and processes always have all capabilities. The following syscalls are compiled out: setuid, setregid, setgid, setreuid, setresuid, getresuid, setresgid, getresgid, setgroups, getgroups, setfsuid, setfsgid, capget, capset. Also, groups.c is compiled out completely. In kernel/capability.c, capable function was moved in order to avoid adding two ifdef blocks. This change saves about 25 KB on a defconfig build. The most minimal kernels have total text sizes in the high hundreds of kB rather than low MB. (The 25k goes down a bit with allnoconfig, but not that much. The kernel was booted in Qemu. All the common functionalities work. Adding users/groups is not possible, failing with -ENOSYS. Bloat-o-meter output: add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Iulia Manda <iulia.manda21@gmail.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15/proc/PID/status: show all sets of pid according to nsChen Hanxiao1-0/+18
If some issues occurred inside a container guest, host user could not know which process is in trouble just by guest pid: the users of container guest only knew the pid inside containers. This will bring obstacle for trouble shooting. This patch adds four fields: NStgid, NSpid, NSpgid and NSsid: a) In init_pid_ns, nothing changed; b) In one pidns, will tell the pid inside containers: NStgid: 21776 5 1 NSpid: 21776 5 1 NSpgid: 21776 5 1 NSsid: 21729 1 0 ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2. c) If pidns is nested, it depends on which pidns are you in. NStgid: 5 1 NSpid: 5 1 NSpgid: 5 1 NSsid: 1 0 ** Views from level 1 [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef] Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Nathan Scott <nathans@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15dax: unify ext2/4_{dax,}_file_operationsBoaz Harrosh9-64/+9
The original dax patchset split the ext2/4_file_operations because of the two NULL splice_read/splice_write in the dax case. In the vfs if splice_read/splice_write are NULL we then call default_splice_read/write. What we do here is make generic_file_splice_read aware of IS_DAX() so the original ext2/4_file_operations can be used as is. For write it appears that iter_file_splice_write is just fine. It uses the regular f_op->write(file,..) or new_sync_write(file, ...). Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15dax: use pfn_mkwrite to update c/mtime + freeze protectionBoaz Harrosh3-0/+19
From: Yigal Korman <yigal@plexistor.com> [v1] Without this patch, c/mtime is not updated correctly when mmap'ed page is first read from and then written to. A new xfstest is submitted for testing this (generic/080) [v2] Jan Kara has pointed out that if we add the sb_start/end_pagefault pair in the new pfn_mkwrite we are then fixing another bug where: A user could start writing to the page while filesystem is frozen. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15fs, jfs: remove slab object constructorDavid Rientjes2-20/+12
Mempools based on slab caches with object constructors are risky because element allocation can happen either from the slab cache itself, meaning the constructor is properly called before returning, or from the mempool reserve pool, meaning the constructor is not called before returning, depending on the allocation context. For this reason, we should disallow creating mempools based on slab caches that have object constructors. Callers of mempool_alloc() will be responsible for properly initializing the returned element. Then, it doesn't matter if the element came from the slab cache or the mempool reserved pool. The only occurrence of a mempool being based on a slab cache with an object constructor in the tree is in fs/jfs/jfs_metapage.c. Remove it and properly initialize the element in alloc_metapage(). At the same time, META_free is never used, so remove it as well. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15hugetlbfs: accept subpool min_size mount option and setup accordinglyMike Kravetz1-19/+71
Make 'min_size=<value>' be an option when mounting a hugetlbfs. This option takes the same value as the 'size' option. min_size can be specified without specifying size. If both are specified, min_size must be less that or equal to size else the mount will fail. If min_size is specified, then at mount time an attempt is made to reserve min_size pages. If the reservation fails, the mount fails. At umount time, the reserved pages are released. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15f2fs: change 0 to false for bool typeTaehee Yoo1-1/+1
in the f2fs_fill_super function, variable "retry" is bool type i think that it should be set as false. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-04-15Merge tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linuxLinus Torvalds2-37/+37
Pull file locking related changes from Jeff Layton: "This set is mostly minor cleanups to the overhaul that went in last cycle. The other noticeable items are the changes to the lm_get_owner and lm_put_owner prototypes, and the fact that we no longer need to use the i_lock to protect the i_flctx pointer" * tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linux: locks: use cmpxchg to assign i_flctx pointer locks: get rid of WE_CAN_BREAK_LSLK_NOW dead code locks: change lm_get_owner and lm_put_owner prototypes locks: don't allocate a lock context for an F_UNLCK request locks: Add lockdep assertion for blocked_lock_lock locks: remove extraneous IS_POSIX and IS_FLOCK tests locks: Remove unnecessary IS_POSIX test
2015-04-15Merge tag 'for-linus-4.1' of ↵Linus Torvalds3-70/+77
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - hostfs saw a face lifting - old/broken stuff was removed (SMP, HIGHMEM, SKAS3/4) - random cleanups and bug fixes * tag 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (26 commits) um: Print minimum physical memory requirement um: Move uml_postsetup in the init_thread stack um: add a kmsg_dumper x86, UML: fix integer overflow in ELF_ET_DYN_BASE um: hostfs: Reduce number of syscalls in readdir um: Remove broken highmem support um: Remove broken SMP support um: Remove SKAS3/4 support um: Remove ppc cruft um: Remove ia64 cruft um: Remove dead code from stacktrace hostfs: No need to box and later unbox the file mode hostfs: Use page_offset() hostfs: Set page flags in hostfs_readpage() correctly hostfs: Remove superfluous initializations in hostfs_open() hostfs: hostfs_open: Reset open flags upon each retry hostfs: Remove superfluous test in hostfs_open() hostfs: Report append flag in ->show_options() hostfs: Use __getname() in follow_link hostfs: Remove open coded strcpy() ...
2015-04-15Merge tag 'upstream-4.1-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds25-429/+436
Pull UBI/UBIFS updates from Richard Weinberger: "This pull request includes the following UBI/UBIFS changes: - powercut emulation for UBI - a huge update to UBI Fastmap - cleanups and bugfixes all over UBI and UBIFS" * tag 'upstream-4.1-rc1' of git://git.infradead.org/linux-ubifs: (50 commits) UBI: power cut emulation for testing UBIFS: fix output format of INUM_WATERMARK UBI: Fastmap: Fall back to scanning mode after ECC error UBI: Fastmap: Remove is_fm_block() UBI: Fastmap: Add blank line after declarations UBI: Fastmap: Remove else after return. UBI: Fastmap: Introduce may_reserve_for_fm() UBI: Fastmap: Introduce ubi_fastmap_init() UBI: Fastmap: Wire up WL accessor functions UBI: Add accessor functions for WL data structures UBI: Move fastmap specific functions out of wl.c UBI: Fastmap: Add new module parameter fm_debug UBI: Fastmap: Make self_check_eba() depend on fastmap self checking UBI: Fastmap: Add self check to detect absent PEBs UBI: Fix stale pointers in ubi->lookuptbl UBI: Fastmap: Enhance fastmap checking UBI: Add initial support for fastmap self checks UBI: Fastmap: Rework fastmap error paths UBI: Fastmap: Prepare for variable sized fastmaps UBI: Fastmap: Locking updates ...
2015-04-15Merge branch 'for-linus-2' of ↵Linus Torvalds60-820/+300
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs update from Al Viro: "Now that net-next went in... Here's the next big chunk - killing ->aio_read() and ->aio_write(). There'll be one more pile today (direct_IO changes and generic_write_checks() cleanups/fixes), but I'd prefer to keep that one separate" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) ->aio_read and ->aio_write removed pcm: another weird API abuse infinibad: weird APIs switched to ->write_iter() kill do_sync_read/do_sync_write fuse: use iov_iter_get_pages() for non-splice path fuse: switch to ->read_iter/->write_iter switch drivers/char/mem.c to ->read_iter/->write_iter make new_sync_{read,write}() static coredump: accept any write method switch /dev/loop to vfs_iter_write() serial2002: switch to __vfs_read/__vfs_write ashmem: use __vfs_read() export __vfs_read() autofs: switch to __vfs_write() new helper: __vfs_write() switch hugetlbfs to ->read_iter() coda: switch to ->read_iter/->write_iter ncpfs: switch to ->read_iter/->write_iter net/9p: remove (now-)unused helpers p9_client_attach(): set fid->uid correctly ...
2015-04-15VFS: assorted d_backing_inode() annotationsDavid Howells3-7/+7
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: fs/inode.c helpers: d_inode() annotationsDavid Howells1-3/+3
these should be used on objects already in top layer Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: fs/cachefiles: d_backing_inode() annotationsDavid Howells6-62/+62
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: fs library helpers: d_inode() annotationsDavid Howells2-18/+18
library helpers called by filesystem drivers on their own inodes Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: assorted weird filesystems: d_inode() annotationsDavid Howells3-11/+11
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells266-1351/+1350
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: Cachefiles should perform fs modifications on the top layer onlyDavid Howells2-28/+28
Cachefiles should perform fs modifications (eg. vfs_unlink()) on the top layer only and should not attempt to alter the lower layer. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inodeDavid Howells1-1/+1
Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: Make pathwalk use d_is_reg() rather than S_ISREG()David Howells1-1/+1
Make pathwalk use d_is_reg() rather than S_ISREG() to determine whether to honour O_TRUNC. Since this occurs after complete_walk(), the dentry type field cannot change and the inode pointer cannot change as we hold a ref on the dentry, so this should be safe. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()David Howells1-1/+1
Fix up debugfs to use d_is_dir(dentry) in place of S_ISDIR(dentry->d_inode->i_mode). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalkDavid Howells1-3/+3
Where we have: if (!dentry->d_inode || d_is_negative(dentry)) { type constructions in pathwalk we should be able to eliminate the check of d_inode and rely solely on the result of d_is_negative() or d_is_positive(). What we do have to take care to do is to read d_inode after calling a d_is_xxx() typecheck function to get the barriering right. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15NFS: Don't use d_inode as a variable nameDavid Howells1-4/+4
Don't use d_inode as a variable name as it now masks a function name. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15VFS: Impose ordering on accesses of d_inode and d_flagsDavid Howells1-8/+39
Impose ordering on accesses of d_inode and d_flags to avoid the need to do this: if (!dentry->d_inode || d_is_negative(dentry)) { when this: if (d_is_negative(dentry)) { should suffice. This check is especially problematic if a dentry can have its type field set to something other than DENTRY_MISS_TYPE when d_inode is NULL (as in unionmount). What we really need to do is stick a write barrier between setting d_inode and setting d_flags and a read barrier between reading d_flags and reading d_inode. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15nfs: generic_write_checks() shouldn't be done on swapout...Al Viro2-13/+10
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15aio: fix serial draining in exit_aio()Jens Axboe1-15/+30
exit_aio() currently serializes killing io contexts. Each context killing ends up having to do percpu_ref_kill(), which in turns has to wait for an RCU grace period. This can take a long time, depending on the number of contexts. And there's no point in doing them serially, when we could be waiting for all of them in one fell swoop. This patches makes my fio thread offload test case exit 0.2s instead of almost 6s. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds3-5/+18
Pull networking updates from David Miller: 1) Add BQL support to via-rhine, from Tino Reichardt. 2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers can support hw switch offloading. From Floria Fainelli. 3) Allow 'ip address' commands to initiate multicast group join/leave, from Madhu Challa. 4) Many ipv4 FIB lookup optimizations from Alexander Duyck. 5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel Borkmann. 6) Remove the ugly compat support in ARP for ugly layers like ax25, rose, etc. And use this to clean up the neigh layer, then use it to implement MPLS support. All from Eric Biederman. 7) Support L3 forwarding offloading in switches, from Scott Feldman. 8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed up route lookups even further. From Alexander Duyck. 9) Many improvements and bug fixes to the rhashtable implementation, from Herbert Xu and Thomas Graf. In particular, in the case where an rhashtable user bulk adds a large number of items into an empty table, we expand the table much more sanely. 10) Don't make the tcp_metrics hash table per-namespace, from Eric Biederman. 11) Extend EBPF to access SKB fields, from Alexei Starovoitov. 12) Split out new connection request sockets so that they can be established in the main hash table. Much less false sharing since hash lookups go direct to the request sockets instead of having to go first to the listener then to the request socks hashed underneath. From Eric Dumazet. 13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk. 14) Support stable privacy address generation for RFC7217 in IPV6. From Hannes Frederic Sowa. 15) Hash network namespace into IP frag IDs, also from Hannes Frederic Sowa. 16) Convert PTP get/set methods to use 64-bit time, from Richard Cochran. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits) fm10k: Bump driver version to 0.15.2 fm10k: corrected VF multicast update fm10k: mbx_update_max_size does not drop all oversized messages fm10k: reset head instead of calling update_max_size fm10k: renamed mbx_tx_dropped to mbx_tx_oversized fm10k: update xcast mode before synchronizing multicast addresses fm10k: start service timer on probe fm10k: fix function header comment fm10k: comment next_vf_mbx flow fm10k: don't handle mailbox events in iov_event path and always process mailbox fm10k: use separate workqueue for fm10k driver fm10k: Set PF queues to unlimited bandwidth during virtualization fm10k: expose tx_timeout_count as an ethtool stat fm10k: only increment tx_timeout_count in Tx hang path fm10k: remove extraneous "Reset interface" message fm10k: separate PF only stats so that VF does not display them fm10k: use hw->mac.max_queues for stats fm10k: only show actual queues, not the maximum in hardware fm10k: allow creation of VLAN on default vid fm10k: fix unused warnings ...
2015-04-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds24-134/+279
Merge first patchbomb from Andrew Morton: - arch/sh updates - ocfs2 updates - kernel/watchdog feature - about half of mm/ * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits) Documentation: update arch list in the 'memtest' entry Kconfig: memtest: update number of test patterns up to 17 arm: add support for memtest arm64: add support for memtest memtest: use phys_addr_t for physical addresses mm: move memtest under mm mm, hugetlb: abort __get_user_pages if current has been oom killed mm, mempool: do not allow atomic resizing memcg: print cgroup information when system panics due to panic_on_oom mm: numa: remove migrate_ratelimited mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE mm: split ET_DYN ASLR from mmap ASLR s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE mm: expose arch_mmap_rnd when available s390: standardize mmap_rnd() usage powerpc: standardize mmap_rnd() usage mips: extract logic for mmap_rnd() arm64: standardize mmap_rnd() usage x86: standardize mmap_rnd() usage arm: factor out mmap ASLR into mmap_rnd ...
2015-04-14mm, mempool: do not allow atomic resizingDavid Rientjes1-4/+2
Allocating a large number of elements in atomic context could quickly deplete memory reserves, so just disallow atomic resizing entirely. Nothing currently uses mempool_resize() with anything other than GFP_KERNEL, so convert existing callers to drop the gfp_mask. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Steffen Maier <maier@linux.vnet.ibm.com> [zfcp] Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZEKees Cook1-3/+1
The arch_randomize_brk() function is used on several architectures, even those that don't support ET_DYN ASLR. To avoid bulky extern/#define tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for the architectures that support it, while still handling CONFIG_COMPAT_BRK. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: split ET_DYN ASLR from mmap ASLRKees Cook2-17/+4
This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips, powerpc, and x86. The problem is that if there is a leak of ASLR from the executable (ET_DYN), it means a leak of shared library offset as well (mmap), and vice versa. Further details and a PoC of this attack is available here: http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html With this patch, a PIE linked executable (ET_DYN) has its own ASLR region: $ ./show_mmaps_pie 54859ccd6000-54859ccd7000 r-xp ... /tmp/show_mmaps_pie 54859ced6000-54859ced7000 r--p ... /tmp/show_mmaps_pie 54859ced7000-54859ced8000 rw-p ... /tmp/show_mmaps_pie 7f75be764000-7f75be91f000 r-xp ... /lib/x86_64-linux-gnu/libc.so.6 7f75be91f000-7f75beb1f000 ---p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb1f000-7f75beb23000 r--p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb23000-7f75beb25000 rw-p ... /lib/x86_64-linux-gnu/libc.so.6 7f75beb25000-7f75beb2a000 rw-p ... 7f75beb2a000-7f75beb4d000 r-xp ... /lib64/ld-linux-x86-64.so.2 7f75bed45000-7f75bed46000 rw-p ... 7f75bed46000-7f75bed47000 r-xp ... 7f75bed47000-7f75bed4c000 rw-p ... 7f75bed4c000-7f75bed4d000 r--p ... /lib64/ld-linux-x86-64.so.2 7f75bed4d000-7f75bed4e000 rw-p ... /lib64/ld-linux-x86-64.so.2 7f75bed4e000-7f75bed4f000 rw-p ... 7fffb3741000-7fffb3762000 rw-p ... [stack] 7fffb377b000-7fffb377d000 r--p ... [vvar] 7fffb377d000-7fffb377f000 r-xp ... [vdso] The change is to add a call the newly created arch_mmap_rnd() into the ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR, as was already done on s390. Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE, which is no longer needed. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Russell King <linux@arm.linux.org.uk> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "David A. Long" <dave.long@linaro.org> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Arun Chandran <achandran@mvista.com> Cc: Yann Droneaud <ydroneaud@opteya.com> Cc: Min-Hua Chen <orca.chen@gmail.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Alex Smith <alex@alex-smith.me.uk> Cc: Markos Chandras <markos.chandras@imgtec.com> Cc: Vineeth Vijayan <vvijayan@mvista.com> Cc: Jeff Bailey <jeffbailey@google.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Behan Webster <behanw@converseincode.com> Cc: Ismael Ripoll <iripoll@upv.es> Cc: Jan-Simon Mller <dl9pf@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14fs/binfmt_elf.c: fix bug in loading of PIE binariesMichael Davidson1-1/+8
With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down address allocation strategy, load_elf_binary() will attempt to map a PIE binary into an address range immediately below mm->mmap_base. Unfortunately, load_elf_ binary() does not take account of the need to allocate sufficient space for the entire binary which means that, while the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are that is supposed to be the "gap" between the stack and the binary. Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this means that binaries with large data segments > 128MB can end up mapping part of their data segment over their stack resulting in corruption of the stack (and the data segment once the binary starts to run). Any PIE binary with a data segment > 128MB is vulnerable to this although address randomization means that the actual gap between the stack and the end of the binary is normally greater than 128MB. The larger the data segment of the binary the higher the probability of failure. Fix this by calculating the total size of the binary in the same way as load_elf_interp(). Signed-off-by: Michael Davidson <md@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cleancache: remove limit on the number of cleancache enabled filesystemsVladimir Davydov1-1/+1
The limit equals 32 and is imposed by the number of entries in the fs_poolid_map and shared_fs_poolid_map. Nowadays it is insufficient, because with containers on board a Linux host can have hundreds of active fs mounts. These maps were introduced by commit 49a9ab815acb8 ("mm: cleancache: lazy initialization to allow tmem backends to build/run as modules") in order to allow compiling cleancache drivers as modules. Real pool ids are stored in these maps while super_block->cleancache_poolid points to an entry in the map, so that on cleancache registration we can walk over all (if there are <= 32 of them, of course) cleancache-enabled super blocks and assign real pool ids. Actually, there is absolutely no need in these maps, because we can iterate over all super blocks immediately using iterate_supers. This is not racy, because cleancache_init_ops is called from mount_fs with super_block->s_umount held for writing, while iterate_supers takes this semaphore for reading, so if we call iterate_supers after setting cleancache_ops, all super blocks that had been created before cleancache_register_ops was called will be assigned pool ids by the action function of iterate_supers while all newer super blocks will receive it in cleancache_init_fs. This patch therefore removes the maps and hence the artificial limit on the number of cleancache enabled filesystems. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14cleancache: zap uuid arg of cleancache_init_shared_fsVladimir Davydov1-1/+1
Use super_block->s_uuid instead. Every shared filesystem using cleancache must now initialize super_block->s_uuid before calling cleancache_init_shared_fs. The only one on the tree, ocfs2, already meets this requirement. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: copy fs uuid to superblockVladimir Davydov1-0/+2
Currently, maximal number of cleancache enabled filesystems equals 32, which is insufficient nowadays, because a Linux host can have hundreds of containers on board, each of which might want its own filesystem. This patch set targets at removing this limitation - see patch 4 for more details. Patches 1-3 prepare the code for this change. This patch (of 4): This will allow us to remove the uuid argument from cleancache_init_shared_fs. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14page_writeback: clean up mess around cancel_dirty_page()Konstantin Khlebnikov3-8/+3
This patch replaces cancel_dirty_page() with a helper function account_page_cleaned() which only updates counters. It's called from truncate_complete_page() and from try_to_free_buffers() (hack for ext3). Page is locked in both cases, page-lock protects against concurrent dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from ongoing truncation"). Delete_from_page_cache() shouldn't be called for dirty pages, they must be handled by caller (either written or truncated). This patch treats final dirty accounting fixup at the end of __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE() around it. If something removes dirty pages without proper handling that might be a bug and unwritten data might be lost. Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough here. cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is helper for nfs_invalidate_page() and it's called only in case complete invalidation. The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up and make try_to_free_buffers() not race with dirty pages") and 3e67c0987d75 ("truncate: clear page dirtiness before running try_to_free_buffers()") first was reverted right in v2.6.20 in commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3 data=journal"). Custom fixes were introduced between these points. NFS in v2.6.23, commit 1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()"). Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do dirty page accounting when removing a page from the page cache"). Since v2.6.25 all of them are redundant. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: make mlog_errno return the errnoAndrew Morton1-2/+3
ocfs2 does mlog_errno(v); return v; in many places. Change mlog_errno() so we can do return mlog_errno(v); For some weird reason this patch reduces the size of ocfs2 by 6k: akpm3:/usr/src/25> size fs/ocfs2/ocfs2.ko text data bss dec hex filename 1146613 82767 832192 2061572 1f7504 fs/ocfs2/ocfs2.ko-before 1140857 82767 832192 2055816 1f5e88 fs/ocfs2/ocfs2.ko-after [dan.carpenter@oracle.com: double evaluation concerns in mlog_errno()] Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: check if the ocfs2 lock resource has been initialized before calling ↵alex chen1-0/+5
ocfs2_dlm_lock If ocfs2 lockres has not been initialized before calling ocfs2_dlm_lock, the lock won't be dropped and then will lead umount hung. The case is described below: ocfs2_mknod ocfs2_mknod_locked __ocfs2_mknod_locked ocfs2_journal_access_di Failed because of -ENOMEM or other reasons, the inode lockres has not been initialized yet. iput(inode) ocfs2_evict_inode ocfs2_delete_inode ocfs2_inode_lock ocfs2_inode_lock_full_nested __ocfs2_cluster_lock Succeeds and allocates a new dlm lockres. ocfs2_clear_inode ocfs2_open_unlock ocfs2_drop_inode_locks ocfs2_drop_lock Since lockres has not been initialized, the lock can't be dropped and the lockres can't be migrated, thus umount will hang forever. Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14ocfs2: logging: remove static buffer, use vsprintf extension %pVJoe Perches1-15/+18
Use the vsprintf %pV extension to avoid using a static buffer and remove the now unnecessary buffer. Signed-off-by: Joe Perches <joe@perches.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>