path: root/mm/vmalloc.c
Age | Commit message | Author | Files | Lines
2015-04-15 | mm/vmalloc: get rid of dirty bitmap inside vmap_block structure | Roman Pen | 1 | -18/+17
In the original implementation of vm_map_ram by Nick Piggin there were two bitmaps: alloc_map and dirty_map. Neither was used for its apparent purpose, which is finding a suitable free hole for the next allocation in a block. vm_map_ram allocates space sequentially within a block and marks pages as dirty on free, so freed space cannot be reused. It would be interesting to know what those bitmaps were really meant for; maybe the implementation was incomplete.

Zhang Yanfei removed alloc_map long ago with these two commits:

    mm/vmalloc.c: remove dead code in vb_alloc
        3fcd76e8028e0be37b02a2002b4f56755daeda06
    mm/vmalloc.c: remove alloc_map from vmap_block
        b8e748b6c32999f221ea4786557b8e7e6c4e4e7a

In this patch I replace dirty_map with two range variables, dirty min and dirty max. They store the minimum and maximum position of dirty space in a block, since we only need to know the dirty range, not the exact position of each dirty page.

Why? Several reasons. First, at first glance it seems that the vm_map_ram allocator cares about fragmentation and therefore uses bitmaps to find a free hole, but that is not true. To avoid complexity it is better to use something simple, like min and max range values. Second, the code becomes simpler: no iteration over a bitmap, just comparing values with the min and max macros. Third, the bitmap occupies up to 1024 bits (4MB is the maximum size of a block); here I replace the whole bitmap with two longs.

Finally, vm_unmap_aliases should be slightly faster and the whole vmap_block structure occupies less memory.

Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
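
As a rough illustration of the dirty-range idea in the message above, here is a minimal user-space sketch in C. It is not the kernel code: struct block_model, BLOCK_SLOTS and the helper names are hypothetical, and the sketch assumes only what the message states, namely that a block needs the minimum and maximum dirty position rather than one bit per page.

```c
#include <assert.h>

/* Hypothetical block capacity in page slots (mirrors the "up to 1024 bits"). */
#define BLOCK_SLOTS 1024UL

/* Two longs replace the per-page dirty bitmap. */
struct block_model {
	unsigned long dirty_min;	/* first dirty slot, BLOCK_SLOTS if clean */
	unsigned long dirty_max;	/* one past the last dirty slot, 0 if clean */
};

static void block_init(struct block_model *b)
{
	/* An empty range is encoded as min > max. */
	b->dirty_min = BLOCK_SLOTS;
	b->dirty_max = 0;
}

/* Record that slots [off, off + nr) were freed: widen the range if needed. */
static void block_mark_dirty(struct block_model *b,
			     unsigned long off, unsigned long nr)
{
	assert(nr > 0 && off + nr <= BLOCK_SLOTS);
	if (off < b->dirty_min)
		b->dirty_min = off;
	if (off + nr > b->dirty_max)
		b->dirty_max = off + nr;
}

/* Return 1 and the range to flush if anything is dirty, 0 otherwise. */
static int block_dirty_range(const struct block_model *b,
			     unsigned long *start, unsigned long *end)
{
	if (b->dirty_min >= b->dirty_max)
		return 0;
	*start = b->dirty_min;
	*end = b->dirty_max;
	return 1;
}
```

Marking a range dirty costs two comparisons, and a consumer that only needs the span to flush reads two values instead of walking a bitmap. The trade-off is that clean slots between two dirty regions are reported as part of the range.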
2015-04-15 | mm/vmalloc: occupy newly allocated vmap block just after allocation | Roman Pen | 1 | -21/+37
The previous implementation allocates a new vmap block and then repeats the search for a free block from the very beginning, iterating over the CPU free list.

Why can this be done better?

1. Allocation can happen on one CPU, but the search can be done on another CPU. In the worst case we preallocate as many vmap blocks as there are CPUs on the system.

2. In the previous patch I added the newly allocated block to the tail of the free list, to avoid early exhaustion of virtual space and to give blocks that were allocated long ago a chance to be occupied. Thus, to find the newly allocated block the whole search sequence has to be repeated, which is not efficient.

In this patch the newly allocated block is occupied right away and the address of the virtual space is returned to the caller, so there is no need to repeat the search sequence: the allocation job is done.

Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 | mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram allocator | Roman Pen | 1 | -1/+1
Recently I came across high fragmentation of the vm_map_ram allocator: vmap_block has free space, but new blocks still continue to appear. Further investigation showed that certain mapping/unmapping sequences can exhaust vmalloc space. On small 32-bit systems that's not a big problem, because purging will be called soon, on the first allocation failure (alloc_vmap_area), but on 64-bit machines (e.g. x86_64 has 45 bits of vmalloc space) that can be a disaster.

1) I came up with a simple allocation sequence which exhausts virtual space very quickly:

    while (iters--) {

        /* Map/unmap big chunk */
        vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, 16);

        /* Map/unmap small chunks.
         *
         * -1 for hole, which should be left at the end of each block
         * to keep it partially used, with some free space available */
        for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
            vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 8);
        }
    }

The idea behind it is simple:

 1. We have to map a big chunk, e.g. 16 pages.

 2. Then we have to occupy the remaining space with smaller chunks, i.e. 8 pages. At the end a small hole should remain, to keep the block in the free list but not let a big chunk occupy the remaining space.

 3. Goto 1: an allocation request of 16 pages can't be completed (only 8 slots are left free in the block after step 2), so a new block will be allocated, and all further requests will land in the newly allocated block.

To have some measurement numbers for all further tests I set up ftrace and enabled 4 basic calls in a function profile:

    echo vm_map_ram       >  /sys/kernel/debug/tracing/set_ftrace_filter;
    echo alloc_vmap_area  >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo vm_unmap_ram     >> /sys/kernel/debug/tracing/set_ftrace_filter;
    echo free_vmap_block  >> /sys/kernel/debug/tracing/set_ftrace_filter;

So for this scenario I got these results:

BEFORE (all new blocks are put to the head of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram          126000    30683.30 us     0.243 us     30819.36 us
  vm_unmap_ram        126000    22003.24 us     0.174 us     340.886 us
  alloc_vmap_area       1000    4132.065 us     4.132 us     0.903 us

AFTER (all new blocks are put to the tail of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram          126000    28713.13 us     0.227 us     24944.70 us
  vm_unmap_ram        126000    20403.96 us     0.161 us     1429.872 us
  alloc_vmap_area        993    3916.795 us     3.944 us     29.370 us
  free_vmap_block        992    654.157 us      0.659 us     1.273 us

SUMMARY:

The most interesting numbers in those tables are the numbers of block allocations and deallocations: the alloc_vmap_area and free_vmap_block calls show that before the change blocks were not freed, and virtual space and physical memory (vmap_block structure allocations, etc.) were consumed.

The average time spent in vm_map_ram/vm_unmap_ram became slightly better. That can be explained by the reasonable number of blocks in the free list which we need to iterate to find a suitable free block.

2) Another scenario is a random allocation:

    while (iters) {

        /* Randomly take number from a range [1..32/64] */
        nr = rand(1, VMAP_MAX_ALLOC);
        vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
        vm_unmap_ram(vaddr, nr);
    }

I chose the Mersenne Twister PRNG to generate persistent random state, to guarantee that both runs have the same random sequence.

For each vm_map_ram call a random number from [1..32/64] was taken as the number of pages to map. I did 10'000 vm_map_ram calls and got these two tables:

BEFORE (all new blocks are put to the head of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram           10000    10170.01 us     1.017 us     993.609 us
  vm_unmap_ram         10000    5321.823 us     0.532 us     59.789 us
  alloc_vmap_area        420    2150.239 us     5.119 us     3.307 us
  free_vmap_block         37    159.587 us      4.313 us     134.344 us

AFTER (all new blocks are put to the tail of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram           10000    7745.637 us     0.774 us     395.229 us
  vm_unmap_ram         10000    5460.573 us     0.546 us     67.187 us
  alloc_vmap_area        414    2201.650 us     5.317 us     5.591 us
  free_vmap_block        412    574.421 us      1.394 us     15.138 us

SUMMARY:

The 'BEFORE' table shows that 420 blocks were allocated and only 37 were freed. The remaining 383 blocks are still in a free list, consuming virtual space and physical memory.

The 'AFTER' table shows that 414 blocks were allocated and 412 were really freed. 2 blocks remained in a free list.

So fragmentation was dramatically reduced. Why? Because when we put a newly allocated block at the head, all further requests will occupy the new block, regardless of the space remaining in other blocks. In this scenario all requests come randomly. Eventually the remaining free space will be less than the requested size, the free list will be iterated, and it is possible that nothing will be found there, so finally a new block will be created. So exhaustion in the random scenario happens for the maximum possible allocation size: 32 pages on a 32-bit system and 64 pages on a 64-bit system.

Also, the average cost of vm_map_ram was reduced from 1.017 us to 0.774 us. Again this can be explained by iteration through a smaller list of free blocks.

3) The next simple scenario is a sequential allocation, where the allocation order is increased for each block.
This scenario forces the allocator to reach the maximum number of partially free blocks in a free list:

    while (iters--) {

        /* Populate free list with blocks with remaining space */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            nr = VMAP_BBMAP_BITS / (1 << order);

            /* Leave a hole */
            nr -= 1;

            for (i = 0; i < nr; i++) {
                vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, (1 << order));
            }
        }

        /* Completely occupy blocks from a free list */
        for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
            vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, (1 << order));
        }
    }

Results which I got:

BEFORE (all new blocks are put to the head of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram         2032000    399545.2 us     0.196 us     467123.7 us
  vm_unmap_ram       2032000    363225.7 us     0.178 us     111405.9 us
  alloc_vmap_area       7001    30627.76 us     4.374 us     495.755 us
  free_vmap_block       6993    7011.685 us     1.002 us     159.090 us

AFTER (all new blocks are put to the tail of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function               Hit    Time            Avg          s^2
  --------               ---    ----            ---          ---
  vm_map_ram         2032000    394259.7 us     0.194 us     589395.9 us
  vm_unmap_ram       2032000    292500.7 us     0.143 us     94181.08 us
  alloc_vmap_area       7000    31103.11 us     4.443 us     703.225 us
  free_vmap_block       7000    6750.844 us     0.964 us     119.112 us

SUMMARY:

No surprises here, almost all numbers are the same.

While fixing this fragmentation problem I also made some improvements to the allocation logic of a new vmap block: occupy the block immediately and get rid of the extra search in a free list.

I also replaced the dirty bitmap with min/max dirty range values to make the logic simpler and slightly faster, since comparing two longs costs less than looping through a bitmap.

This patchset raises several questions:

Q: I think the problem you describe is already known; I wrote a comment about it: "it could consume lots of address space through fragmentation".
Could you tell me about your situation and the reason why it should be avoided?
    Gioh Kim

A: Indeed, there was commit 364376383 which adds an explicit comment about fragmentation. But the fragmentation described in that comment is caused by mixing long-lived and short-lived objects, when a whole block is pinned in memory because some page slots are still in use. Here I am talking about blocks which are free: nobody uses them, yet the allocator keeps them alive forever, continuously allocating new blocks.

Q: I think that if you put the newly allocated block at the tail of a free list, the example below would result in enormous performance degradation.

    new block: 1MB (256 pages)

    while (iters--) {
        vm_map_ram(3 or something else not dividable for 256) * 85
        vm_unmap_ram(3) * 85
    }

On every iteration it needs a newly allocated block, which is put at the tail of the free list, so finding it consumes a large amount of time.
    Joonsoo Kim

A: The second patch in the current patchset gets rid of the extra search in a free list, so a new block will be occupied immediately.

Also, the scenario above is impossible, because vm_map_ram allocates virtual ranges in orders, i.e. 2^n. Passing 3 to vm_map_ram allocates 4 slots in a block, and 256 slots (the capacity of a block) is of course divisible by 4, so the block will be completely occupied.

But there is a worst case which we can reach: each free block has a hole equal to an order size.

The maximum allocation size is 64 pages on a 64-bit system (if you try to map more, the original alloc_vmap_area will be called). So the maximum order is 6.
That means the worst case, before the allocator decides to allocate a new block, is iterating over 7 blocks:

    HEAD
      1st block - has 1  page slot  free (order 0)
      2nd block - has 2  page slots free (order 1)
      3rd block - has 4  page slots free (order 2)
      4th block - has 8  page slots free (order 3)
      5th block - has 16 page slots free (order 4)
      6th block - has 32 page slots free (order 5)
      7th block - has 64 page slots free (order 6)
    TAIL

So the worst scenario on a 64-bit system is that each CPU queue can have 7 blocks in its free list.

This can happen only if you allocate blocks with increasing order (as I did in the function written in the comment of the first patch). This is a weird and rare case, but still it is possible. Afterwards you will get 7 blocks in the list.

All further requests will be placed in a newly allocated block, or some free slots will be found in the free list. That does not look dramatically awful.

This patch (of 3):

If a suitable block can't be found, a new block is allocated and put at the head of the free list, so on the next iteration this new block will be found first.

That's bad, because old blocks in the free list will not get a chance to be fully used, thus fragmentation will grow.

Let's consider this simple example:

 #1 We have one block in the free list which is partially used, and where only one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

 #2 A new allocation request of order 1 (2 pages) comes. A new block is allocated since we do not have free space to complete this request. The new block is put at the head of the free list:

    HEAD |----------|xxxxxxxxx-| TAIL

 #3 Two pages were occupied in the newly found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^
          two pages mapped here

 #4 A new allocation request of order 0 (1 page) comes.
The block which was created in step #2 is located at the beginning of the free list, so it will be found first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here, but better to use this hole

It is obviously better to complete the request of step #4 using the old block, where free space is left, because otherwise fragmentation will increase sharply.

But fragmentation is not the only problem. The worst thing is that I can easily create a scenario in which the whole vmalloc space is exhausted by blocks which are not used, but are already dirty and have several free pages.

Let's consider this function, whose execution should be pinned to one CPU:

    static void exhaust_virtual_space(struct page *pages[16], int iters)
    {
        /* Firstly we have to map a big chunk, e.g. 16 pages.
         * Then we have to occupy the remaining space with smaller
         * chunks, i.e. 8 pages. At the end small hole should remain.
         * So at the end of our allocation sequence block looks like
         * this:
         *                XX  big chunk
         * |XXxxxxxxx-|    x  small chunk
         *                 -  hole, which is enough for a small chunk,
         *                    but is not enough for a big chunk
         */
        while (iters--) {
            int i;
            void *vaddr;

            /* Map/unmap big chunk */
            vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
            vm_unmap_ram(vaddr, 16);

            /* Map/unmap small chunks.
             *
             * -1 for hole, which should be left at the end of each block
             * to keep it partially used, with some free space available */
            for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, 8);
            }
        }
    }

On every iteration a new block (1MB of vm area in my case) will be allocated and then occupied, without any attempt to resolve the small allocation requests using previously allocated blocks in the free list.
In the case of random allocation (size randomly taken from the range [1..64] in the 64-bit case or [1..32] in the 32-bit case) the situation is the same: new blocks continue to appear whenever the maximum possible allocation size (32 or 64) is passed to the allocator, because none of the remaining blocks in the free list has enough free space to complete this allocation request.

In summary, if new blocks are put at the head of the free list, virtual space will eventually be exhausted.

In the current patch I simply put the newly allocated block at the tail of the free list, thus reducing fragmentation and giving a chance to resolve allocation requests using older blocks with possible holes left.

Signed-off-by: Roman Pen <r.peniaev@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: WANG Chao <chaowang@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
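
The head-versus-tail argument of this commit can be replayed with a small user-space model in C. This is only a sketch under simplifying assumptions, not the kernel allocator: struct sim, BLOCK_SLOTS, MAX_BLOCKS and all function names are hypothetical, there is a single free list (one CPU), every mapping is unmapped immediately as in the test function above, and purging is not modelled. A block is counted as released once all of its slots have been handed out, since by then they have all been unmapped as well.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SLOTS 256		/* hypothetical block capacity in page slots */
#define MAX_BLOCKS  4096	/* capacity of the model's free list */

enum placement { AT_HEAD, AT_TAIL };

struct sim {
	int free_slots[MAX_BLOCKS];	/* free list, index 0 is HEAD */
	int nr_listed;			/* blocks currently in the free list */
	int allocated;			/* blocks ever created */
	int released;			/* blocks fully used and given back */
	enum placement where;		/* where new blocks are inserted */
};

static void sim_init(struct sim *s, enum placement where)
{
	memset(s, 0, sizeof(*s));
	s->where = where;
}

/* Map and immediately unmap nr slots (nr must not exceed BLOCK_SLOTS). */
static void map_unmap(struct sim *s, int nr)
{
	int i;

	for (;;) {
		/* First fit, scanning from HEAD. */
		for (i = 0; i < s->nr_listed; i++)
			if (s->free_slots[i] >= nr)
				break;
		if (i < s->nr_listed)
			break;

		/* Nothing fits: add a new block, then repeat the search. */
		assert(s->nr_listed < MAX_BLOCKS);
		if (s->where == AT_HEAD) {
			memmove(&s->free_slots[1], &s->free_slots[0],
				(size_t)s->nr_listed * sizeof(int));
			s->free_slots[0] = BLOCK_SLOTS;
		} else {
			s->free_slots[s->nr_listed] = BLOCK_SLOTS;
		}
		s->nr_listed++;
		s->allocated++;
	}

	/* Space is handed out sequentially and is never reused. */
	s->free_slots[i] -= nr;
	if (s->free_slots[i] == 0) {
		/* Fully consumed and fully unmapped: drop it from the list. */
		memmove(&s->free_slots[i], &s->free_slots[i + 1],
			(size_t)(s->nr_listed - i - 1) * sizeof(int));
		s->nr_listed--;
		s->released++;
	}
}

/* The big-chunk/small-chunk sequence from the commit message. */
static void run_scenario(struct sim *s, int iters)
{
	while (iters--) {
		int i;

		map_unmap(s, 16);
		for (i = 0; i < (BLOCK_SLOTS - 16) / 8 - 1; i++)
			map_unmap(s, 8);
	}
}
```

In this model, head insertion leaves one stranded block with an 8-slot hole per iteration and never releases any, while tail insertion lets the small requests finish the older block first, so only a couple of blocks are ever alive at once.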
2015-04-14 | mm: change vunmap to tear down huge KVA mappings | Toshi Kani | 1 | -0/+4
Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA mappings when they are set. pud_clear_huge() and pmd_clear_huge() return zero when no-operation is performed, i.e. huge page mapping was not used. These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on the architecture. [akpm@linux-foundation.org: use consistent code layout] Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 | mm: change __get_vm_area_node() to use fls_long() | Toshi Kani | 1 | -1/+3
ioremap() and its related interfaces are used to create I/O mappings to memory-mapped I/O devices. The mapping sizes of traditional I/O devices are relatively small. Non-volatile memory (NVM), however, has many GB and is going to have TB soon. It is not very efficient to create large I/O mappings with 4KB pages.

This patchset extends the ioremap() interfaces to transparently create I/O mappings with huge pages whenever possible. ioremap() continues to use 4KB mappings when a huge page does not fit into a requested range. There is no change necessary to the drivers using ioremap(). A requested physical address must be aligned to a huge page size (1GB or 2MB on x86) to use a huge page mapping, though. The kernel huge I/O mapping will improve performance of NVM and other devices with large memory, and reduce the time to create their mappings as well.

On x86, MTRRs can override PAT memory types with a 4KB granularity. When using a huge page, MTRRs can override the memory type of the huge page, which may lead to a performance penalty. The processor can also behave in an undefined manner if a huge page is mapped to a memory range that MTRRs have mapped with multiple different memory types. Therefore, the mapping code falls back to a smaller page size, toward 4KB, when a mapping range is covered by a non-WB type of MTRRs. The WB type of MTRRs has no effect on the PAT memory types.

The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch supports huge KVA mappings for ioremap(). The user may specify a new kernel option "nohugeiomap" to disable the huge I/O mapping capability of ioremap() when necessary.

Patches 1-4 change common files to support huge I/O mappings. There is no change in functionality unless HAVE_ARCH_HUGE_VMAP is defined on the architecture of the system.

Patches 5-6 implement the HAVE_ARCH_HUGE_VMAP funcs on x86, and set HAVE_ARCH_HUGE_VMAP on x86.

This patch (of 6):

__get_vm_area_node() takes an unsigned long size, which is a 64-bit value on a 64-bit kernel. However, fls(size) simply ignores the upper 32 bits. Change to use fls_long() to handle the size properly.

Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Robert Elliott <Elliott@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
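
The truncation described above can be reproduced outside the kernel. The following is a user-space sketch in C, not the kernel helpers themselves: model_fls() and model_fls_long() are hypothetical stand-ins that follow the usual "find last set" convention (1-based index of the highest set bit, 0 for an input of 0), and uint64_t stands in for unsigned long on a 64-bit kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 32-bit fls(): upper bits of a wider value never get here. */
static int model_fls(uint32_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Stand-in for fls_long() on a 64-bit kernel: looks at all 64 bits. */
static int model_fls_long(uint64_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}
```

For a 4 GiB size (1 << 32), passing the value through the 32-bit helper drops the only set bit and the result is 0, whereas the 64-bit helper reports bit 33. For sizes below 4 GiB the two agree, which is presumably why the problem only becomes visible with very large mappings.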
2015-03-12 | kasan, module, vmalloc: rework shadow allocation for modules | Andrey Ryabinin | 1 | -0/+1
The current approach to handling shadow memory for modules is broken.

Shadow memory can be freed only after the memory it shadows is no longer used. vfree() called from interrupt context may use the memory it is freeing to store a 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                schedule_work(&p->wq);

Later this list node is used in free_work(), which actually frees the memory. Currently module_memfree() called in interrupt context will free the shadow before freeing the module's memory, which could provoke a kernel crash.

So shadow memory should be freed after the module's memory. However, such a deallocation order could race with kasan_module_alloc() in module_alloc().

Free the shadow right before releasing the vm area. At this point the vfree()'d memory is not used anymore and is not yet available for other allocations. The new VM_KASAN flag indicates that a vm area has dynamically allocated shadow memory, so kasan frees the shadow only if it was previously allocated.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 | mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() | Andrey Ryabinin | 1 | -4/+6
For instrumenting global variables KASan will shadow memory backing memory for modules. So on module loading we will need to allocate memory for shadow and map it at address in shadow that corresponds to the address allocated in module_alloc(). __vmalloc_node_range() could be used for this purpose, except it puts a guard hole after allocated area. Guard hole in shadow memory should be a problem because at some future point we might need to have a shadow memory at address occupied by guard hole. So we could fail to allocate shadow for module_alloc(). Now we have VM_NO_GUARD flag disabling guard page, so we need to pass into __vmalloc_node_range(). Add new parameter 'vm_flags' to __vmalloc_node_range() function. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 | mm: vmalloc: add flag preventing guard hole allocation | Andrey Ryabinin | 1 | -4/+2
For instrumenting global variables KASan will shadow memory backing memory for modules. So on module loading we will need to allocate memory for shadow and map it at address in shadow that corresponds to the address allocated in module_alloc(). __vmalloc_node_range() could be used for this purpose, except it puts a guard hole after allocated area. Guard hole in shadow memory should be a problem because at some future point we might need to have a shadow memory at address occupied by guard hole. So we could fail to allocate shadow for module_alloc(). Add a new vm_struct flag 'VM_NO_GUARD' indicating that vm area doesn't have a guard hole. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 | mm/vmalloc.c: fix memory ordering bug | Dmitry Vyukov | 1 | -2/+2
Read memory barriers must follow the read operations. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
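
The one-line rule above can be illustrated with a flag-then-data pattern. The sketch below is an assumption-laden user-space analogue in C11, not the code touched by the patch: struct area_model, its fields and the function names are hypothetical, and C11 release/acquire fences are used as rough stand-ins for the kernel's write and read barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object that is published by clearing a flag. */
struct area_model {
	int nr_pages;			/* payload, written before publication */
	atomic_bool uninitialized;	/* true until the payload is ready */
};

static void area_init(struct area_model *a)
{
	a->nr_pages = 0;
	atomic_init(&a->uninitialized, true);
}

/* Writer: fill in the payload, issue the barrier, then clear the flag. */
static void publish(struct area_model *a, int nr_pages)
{
	a->nr_pages = nr_pages;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&a->uninitialized, false, memory_order_relaxed);
}

/*
 * Reader: load the flag FIRST, then issue the read barrier, then touch the
 * payload. A barrier placed before the flag load would not order the flag
 * load against the later payload read, which is the bug class fixed here.
 */
static bool try_read(struct area_model *a, int *nr_pages)
{
	if (atomic_load_explicit(&a->uninitialized, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_acquire);
	*nr_pages = a->nr_pages;
	return true;
}
```

The reader-side fence pairs with the writer-side fence: once the reader has seen the cleared flag and passed its fence, it is guaranteed to see the payload written before the writer's fence.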
2014-12-10 | mm/vmalloc.c: replace printk with pr_warn | Pintu Kumar | 1 | -2/+1
This patch replaces printk(KERN_WARNING..) with pr_warn. It also saves one line, because the shorter call no longer needs to be wrapped. Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 | mm/vmalloc.c: use seq_open_private() instead of seq_open() | Rob Jones | 1 | -15/+5
Using seq_open_private() removes boilerplate code from vmalloc_open(). The resultant code is shorter and easier to follow. However, please note that seq_open_private() calls kzalloc() rather than kmalloc(), which may affect timing due to the memory initialisation overhead. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 | mm/vmalloc.c: clean up map_vm_area third argument | WANG Chao | 1 | -9/+5
Currently map_vm_area() takes (struct page *** pages) as its third argument and, after mapping, moves (*pages) to point to (*pages + nr_mapped_pages). This kind of increment looks useless to its callers these days. The callers don't care about the increment and actually try to avoid it by passing another copy to map_vm_area(). The caller can always guarantee that all the pages can be mapped into the vm_area specified in the first argument, and it only cares about whether map_vm_area() fails or not. This patch cleans up the pointer movement in map_vm_area() and updates its callers accordingly. Signed-off-by: WANG Chao <chaowang@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 | mm, vmalloc: constify allocation mask | David Rientjes | 1 | -4/+4
tmp_mask in the __vmalloc_area_node() iteration never changes so it can be moved into function scope and marked with const. This causes the movl and orl to only be done once per call rather than area->nr_pages times. nested_gfp can also be marked const. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 | mm/vmalloc.c: add a schedule point to vmalloc() | Eric Dumazet | 1 | -0/+2
It is not uncommon on busy servers to get stuck for hundreds of ms in vmalloc() calls (like file descriptor expansions). Add a cond_resched() to __vmalloc_area_node() to be gentle to other tasks. [akpm@linux-foundation.org: only do it for __GFP_WAIT, per David] Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 | vmalloc: use rcu list iterator to reduce vmap_area_lock contention | Joonsoo Kim | 1 | -3/+3
Richard Yao reported a month ago that his system has trouble with vmap_area_lock contention during performance analysis via /proc/meminfo. Andrew asked why his analysis reads /proc/meminfo so intensively, but he didn't answer.

https://lkml.org/lkml/2014/4/10/416

Although I'm not sure whether this is the right usage or not, there is a solution that reduces vmap_area_lock contention with no side effect: just use the rcu list iterator in get_vmalloc_info().

rcu can be used in this function because the RCU protocol is already respected by writers, since Nick Piggin's commit db64fe02258f1 ("mm: rewrite vmap layer") back in linux-2.6.28.

Specifically: insertions use list_add_rcu(), deletions use list_del_rcu() and kfree_rcu().

Note that the rb tree is not used from the rcu reader (it would not be safe); only the vmap_area_list has full RCU protection.

Note that __purge_vmap_area_lazy() already uses this rcu protection.

    rcu_read_lock();
    list_for_each_entry_rcu(va, &vmap_area_list, list) {
        if (va->flags & VM_LAZY_FREE) {
            if (va->va_start < *start)
                *start = va->va_start;
            if (va->va_end > *end)
                *end = va->va_end;
            nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
            list_add_tail(&va->purge_list, &valist);
            va->flags |= VM_LAZY_FREEING;
            va->flags &= ~VM_LAZY_FREE;
        }
    }
    rcu_read_unlock();

Peter:

: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.

Joonsoo:

: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.

[edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 | mm/vmalloc.c: export unmap_kernel_range() | Minchan Kim | 1 | -0/+1
zsmalloc needs exported unmap_kernel_range for building as a module. See https://lkml.org/lkml/2013/1/18/487 I didn't send a patch to make unmap_kernel_range exportable at that time because zram was staging stuff and I thought VM function exporting for staging stuff makes no sense. Now zsmalloc was promoted. If we can't build zsmalloc as module, it means we can't build zram as module, either. Additionally, buddy map_vm_area is already exported so let's export unmap_kernel_range to help his buddy. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/vmalloc.c: replace seq_printf by seq_putsFabian Frederick1-5/+5
Replace seq_printf where possible Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm: replace __get_cpu_var uses with this_cpu_ptrChristoph Lameter1-1/+1
Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm/vmalloc.c: enhance vm_map_ram() commentGioh Kim1-0/+6
vm_map_ram() has a fragmentation problem: it cannot purge a chunk (i.e., 4M of address space) if there is a pinning object in that address space. So it could easily consume all of the VMALLOC address space. We can fix the fragmentation problem by using vmap() instead of vm_map_ram(), but vmap() is known to be slow compared to vm_map_ram(). Minchan said vm_map_ram() is 5 times faster than vmap() in his tests. So I thought we should fix the fragmentation problem of vm_map_ram() because our proprietary GPU driver has used it heavily. On second thought, it's not easy, because we would have to reuse freed space to solve the problem, and that could cause more IPIs and bitmap operations for searching for a hole. It could undermine the API's goal, which is very fast mapping. And the fragmentation problem wouldn't even show up on a 64-bit machine. Another option is that the user should separate long-lived and short-lived objects and use vmap() for long-lived ones but vm_map_ram() for short-lived ones. If we inform the user about the characteristics of vm_map_ram(), the user can choose one according to the page lifetime. Let's add some notice messages for the user. [akpm@linux-foundation.org: tweak comment text] Signed-off-by: Gioh Kim <gioh.kim@lge.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm: use macros from compiler.h instead of __attribute__((...))Gideon Israel Dsouza1-1/+3
To increase compiler portability there is <linux/compiler.h> which provides convenience macros for various gcc constructs. Eg: __weak for __attribute__((weak)). I've replaced all instances of gcc attributes with the right macro in the memory management (/mm) subsystem. [akpm@linux-foundation.org: while-we're-there consistency tweaks] Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-27Revert "mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}"malc1-10/+10
Revert commit ece86e222db4, which was intended as a small performance improvement.

Despite the claim that the patch doesn't introduce any functional changes, in fact it does.

The "no page" path behaves differently now. Originally, vmalloc_to_page() might return NULL under some conditions; with the new implementation it returns pfn_to_page(0), which is not the same as NULL.

A simple test shows the difference.

test.c

	#include <linux/kernel.h>
	#include <linux/module.h>
	#include <linux/vmalloc.h>
	#include <linux/mm.h>

	int __init myi(void)
	{
		struct page *p;
		void *v;

		v = vmalloc(PAGE_SIZE);
		/* trigger the "no page" path in vmalloc_to_page */
		vfree(v);

		p = vmalloc_to_page(v);

		pr_err("expected val = NULL, returned val = %p", p);

		return -EBUSY;
	}

	void __exit mye(void)
	{

	}
	module_init(myi);
	module_exit(mye);

Before interchange:
expected val = NULL, returned val = (null)

After interchange:
expected val = NULL, returned val = c7ebe000

Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
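The behavioral difference described in this revert can be reproduced outside the kernel with a toy model. The sketch below is only an illustration under stated assumptions: `mem_map`, `toy_pte` and the three `to_*` functions are hypothetical names, not kernel APIs, and the "page table" is a plain array in which 0 means "no mapping". It shows why deriving the page from a pfn loses the "no page" case: an unmapped address yields pfn 0, which converts to a valid frame pointer rather than NULL.

```c
#include <stddef.h>

struct page { unsigned long flags; };

/* Toy frame table standing in for the kernel's mem_map (assumption). */
struct page mem_map[8];

/* Toy page table: index is a virtual page number, value is pfn + 1,
 * and 0 means "not mapped" (assumption made for this sketch). */
unsigned long toy_pte[4] = { 0, 3, 0, 6 };

/* Original shape: the walk itself produces the page, so a missing
 * mapping can be reported as NULL. */
struct page *to_page_original(unsigned int vpn)
{
	if (toy_pte[vpn] == 0)
		return NULL;
	return &mem_map[toy_pte[vpn] - 1];
}

/* Interchanged shape: the walk produces a pfn, and 0 doubles as both
 * "frame 0" and "no mapping". */
unsigned long to_pfn_interchanged(unsigned int vpn)
{
	return toy_pte[vpn] ? toy_pte[vpn] - 1 : 0;
}

/* Wrapping the pfn lookup turns "no mapping" into a pointer to frame 0. */
struct page *to_page_interchanged(unsigned int vpn)
{
	return &mem_map[to_pfn_interchanged(vpn)];
}
```

For mapped addresses both variants agree; only the unmapped case differs, which matches the "expected val = NULL" test output quoted in the commit message.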
2014-01-21mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}Jianyu Zhan1-10/+10
Currently we are implementing vmalloc_to_pfn() as a wrapper around vmalloc_to_page(), which is implemented as follows: 1. walks the page tables to generate the corresponding pfn, 2. then converts the pfn to struct page, 3. returns it. And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn. This seems too circuitous, so this patch reverses the way: implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient. No functional change. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: kmemleak: avoid false negatives on vmalloc'ed objectsCatalin Marinas1-4/+10
Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap blocks") had the side effect of making vmap_area.va_end member point to the next vmap_area.va_start. This was creating an artificial reference to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks. This patch marks the vmap_area containing pointers explicitly and reduces the min ref_count to 2 as vm_struct still contains a reference to the vmalloc'ed object. The kmemleak add_scan_area() function has been improved to allow a SIZE_MAX argument covering the rest of the object (for simpler calling sites). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13revert mm/vmalloc.c: emit the failure message before returnWanpeng Li1-1/+1
Don't warn twice in __vmalloc_area_node and __vmalloc_node_range if __vmalloc_area_node allocation failure. This patch reverts commit 46c001a2753f ("mm/vmalloc.c: emit the failure message before return"). Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"Wanpeng Li1-5/+5
The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm: avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to avoid accessing the pages field with unallocated page when show_numa_info() is called. This patch moves the check just before show_numa_info in order that some messages still can be dumped via /proc/vmallocinfo. This patch reverts commit d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"); Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/vmalloc: fix show vmap_area information race with vmap_area tear downWanpeng Li1-8/+5
There is a race window between vmap_area tear down and show vmap_area information.

	A                                       B

remove_vm_area
spin_lock(&vmap_area_lock);
va->vm = NULL;
va->flags &= ~VM_VM_AREA;
spin_unlock(&vmap_area_lock);
					spin_lock(&vmap_area_lock);
					if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEING))
						return 0;
					if (!(va->flags & VM_VM_AREA)) {
						seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
							(void *)va->va_start, (void *)va->va_end,
							va->va_end - va->va_start);
						return 0;
					}
free_unmap_vmap_area(va);
	flush_cache_vunmap
	free_unmap_vmap_area_noflush
		unmap_vmap_area
		free_vmap_area_noflush
			va->flags |= VM_LAZY_FREE

The assumption that !VM_VM_AREA represents a vm_map_ram allocation was introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()"). However, !VM_VM_AREA also means the vmap_area is being torn down in the race window mentioned above. This patch fixes it by not dumping any information for the !VM_VM_AREA case, and also removes the (VM_LAZY_FREE | VM_LAZY_FREEING) check since those flags are not possible for the !VM_VM_AREA case.

Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/vmalloc: don't set area->caller twiceWanpeng Li1-4/+3
The caller address has already been set in set_vmalloc_vm(), there's no need to set it again in __vmalloc_area_node. Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/vmalloc: use NUMA_NO_NODEJianguo Wu1-1/+1
Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)" Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm areaWanpeng Li1-6/+6
Use wrapper function get_vm_area_size to calculate size of vm area. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm, vmalloc: use well-defined find_last_bit() funcJoonsoo Kim1-9/+6
Our intention in here is to find last_bit within the region to flush. There is well-defined function, find_last_bit() for this purpose and its performance may be slightly better than current implementation. So change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm, vmalloc: remove useless variable in vmap_blockJoonsoo Kim1-2/+0
vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: fix an overflow bug in alloc_vmap_area()Zhang Yanfei1-3/+3
When searching for a vmap area in the vmalloc space, we check whether (addr + size - 1) is less than addr, which would indicate an overflow. But we assign (addr + size) to vmap_area->va_end. So if we come across the case below: (addr + size - 1) : no overflow (addr + size) : overflow we will assign an overflowed value (e.g. 0) to vmap_area->va_end, and this will trigger the BUG in __insert_vmap_area, causing a system panic. So checking (addr + size) for overflow is the correct behaviour, not (addr + size - 1). Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reported-by: Ghennadi Procopciuc <unix140@gmail.com> Tested-by: Daniel Baluta <dbaluta@ixiacom.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
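The corner case in this fix is easy to check with plain unsigned arithmetic. The following is a minimal userspace sketch, not the kernel code: the two function names are hypothetical, and a nonzero size is assumed. It contrasts a wrap test on the last byte (addr + size - 1) with a wrap test on the exclusive end (addr + size), which is the value that ends up stored in va_end.

```c
#include <limits.h>
#include <stdbool.h>

/* Old-style test (hypothetical name): did the last byte of the range wrap?
 * Misses the case where only the exclusive end wraps to 0. */
bool end_wraps_old_check(unsigned long addr, unsigned long size)
{
	return addr + size - 1 < addr;
}

/* New-style test (hypothetical name): did the exclusive end wrap?
 * This is the quantity that is later stored as the end of the area. */
bool end_wraps_new_check(unsigned long addr, unsigned long size)
{
	return addr + size < addr;
}
```

For a range that ends exactly at the top of the address space, such as addr = ULONG_MAX - 4095 with size = 4096, the last byte is ULONG_MAX (no wrap) while the exclusive end is 0 (wrap), so only the second test rejects it.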
2013-07-09vfree: don't schedule free_work() if llist_add() returns falseOleg Nesterov1-3/+2
vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise vfree_deferred->wq is already pending or it is running and didn't do llist_del_all() yet. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
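The idea relies on the push operation reporting whether the list was empty beforehand, so that only the first producer needs to kick the worker. Below is a userspace sketch using C11 atomics; `struct node`, `struct lhead`, `push()` and `take_all()` are hypothetical stand-ins written for this illustration and are not the kernel's llist implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *next; };
struct lhead { _Atomic(struct node *) first; };

/* Lock-free push. Returns true if the list was empty before the push,
 * i.e. the caller is the one who must schedule the deferred work. */
bool push(struct lhead *h, struct node *n)
{
	struct node *old = atomic_load(&h->first);

	do {
		n->next = old;	/* old is refreshed on each failed CAS */
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));

	return old == NULL;
}

/* Detach the whole list at once, as a worker would before freeing. */
struct node *take_all(struct lhead *h)
{
	return atomic_exchange(&h->first, (struct node *)NULL);
}
```

A caller following the pattern in the commit message would schedule the work only when `push()` returns true; later pushes land on a list whose worker is already pending or has not yet detached the list.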
2013-07-09mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_infoZhang Yanfei1-5/+5
We should check the VM_UNINITIALIZED flag in s_show(). If this flag is set, the vm_struct is not fully initialized, so it is unnecessary to try to show the information contained in vm_struct. We checked this flag in show_numa_info(), but I think it's better to check it earlier. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: rename VM_UNLIST to VM_UNINITIALIZEDZhang Yanfei1-9/+9
VM_UNLIST was used to indicate that the vm_struct is not listed in vmlist. But after commit 4341fa454796 ("mm, vmalloc: remove list management of vmlist after initializing vmalloc"), the meaning of this flag changed. It now means the vm_struct is not fully initialized. So renaming it to VM_UNINITIALIZED seems more reasonable. Also change clear_vm_unlist to clear_vm_uninitialized_flag. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: emit the failure message before returnZhang Yanfei1-1/+1
Use goto to jump to the fail label to give a failure message before returning NULL. This makes the failure handling in this function consistent. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: remove alloc_map from vmap_blockZhang Yanfei1-3/+0
As we have removed the dead code in the vb_alloc, it seems there is no place to use the alloc_map. So there is no reason to maintain the alloc_map in vmap_block. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: remove unused purge_fragmented_blocks_thiscpuZhang Yanfei1-5/+0
This function is nowhere used now, so remove it. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09mm/vmalloc.c: remove dead code in vb_allocZhang Yanfei1-15/+1
Space in a vmap block that was once allocated is considered dirty and not made available for allocation again before the whole block is recycled. The result is that free space within a vmap block is always contiguous. So if a vmap block has enough free space for allocation, the allocation is impossible to fail. Thus, the fragmented block purging was never invoked from vb_alloc(). So remove this dead code. [ Same patches also sent by: Chanho Min <chanho.min@lge.com> Johannes Weiner <hannes@cmpxchg.org> but git doesn't do "multiple authors" ] Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
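The argument that free space stays contiguous can be seen in a small model of a sequential allocator. This is a simplified sketch with assumed names and an assumed block size (`BLOCK_PAGES`, `struct block`, `block_alloc()`, `block_free()` are all hypothetical); the real vmap_block code works in power-of-two orders and takes locks, which are left out here.

```c
#include <stdbool.h>

#define BLOCK_PAGES 1024UL	/* pages per block, assumed for the sketch */

struct block {
	unsigned long free;	/* pages never handed out yet */
	unsigned long dirty;	/* pages handed out and since freed */
};

/* Allocate from the tail of the used region. Returns the page offset,
 * or -1 if the request does not fit in the remaining free pages. */
long block_alloc(struct block *b, unsigned long pages)
{
	unsigned long off;

	if (pages == 0 || pages > b->free)
		return -1;
	off = BLOCK_PAGES - b->free;
	b->free -= pages;
	return (long)off;
}

/* Freed pages only become dirty; they are never returned to "free".
 * Returns true once the whole block is dirty and can be recycled. */
bool block_free(struct block *b, unsigned long pages)
{
	b->dirty += pages;
	return b->dirty == BLOCK_PAGES;
}
```

Because `free` only ever shrinks, the free region is always the single run at the end of the block, so a request that fits in `free` cannot fail for fragmentation reasons. That is the property that made the purge path in vb_alloc() unreachable.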
2013-07-09mm/vmalloc.c: unbreak __vunmap()Dan Carpenter1-1/+1
There is an extra semi-colon so the function always returns. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm, vmalloc: use clamp() to simplify codeZhang Yanfei1-10/+2
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm, vmalloc: remove insert_vmalloc_vm()Zhang Yanfei1-7/+0
Now this function is nowhere used, we can remove it directly. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm, vmalloc: call setup_vmalloc_vm() instead of insert_vmalloc_vm()Zhang Yanfei1-2/+2
Here we pass flags with only VM_ALLOC bit set, it is unnecessary to call clear_vm_unlist to clear VM_UNLIST bit. So use setup_vmalloc_vm instead of insert_vmalloc_vm. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03mm, vmalloc: only call setup_vmalloc_vm() only in __get_vm_area_node()Zhang Yanfei1-10/+1
Now for insert_vmalloc_vm, it only calls the two functions: - setup_vmalloc_vm: fill vm_struct and vmap_area instances - clear_vm_unlist: clear VM_UNLIST bit in vm_struct->flags So in __get_vm_area_node(), if VM_UNLIST bit unset in flags, that is the else branch here, we don't need to clear VM_UNLIST bit for vm->flags since this bit is obviously not set. That is to say, we could only call setup_vmalloc_vm instead of insert_vmalloc_vm here. And then we could even remove the if test here. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmalloc: introduce remap_vmalloc_range_partialHATAYAMA Daisuke1-22/+45
We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc space and remap it to user-space in order to reduce the risk that memory allocation fails on system with huge number of CPUs and so with huge ELF note segment that exceeds 11-order block size. Although there's already remap_vmalloc_range for the purpose of remapping vmalloc memory to user-space, we need to specify user-space range via vma. Mmap on /proc/vmcore needs to remap range across multiple objects, so the interface that requires vma to cover full range is problematic. This patch introduces remap_vmalloc_range_partial that receives user-space range as a pair of base address and size and can be used for mmap on /proc/vmcore case. remap_vmalloc_range is rewritten using remap_vmalloc_range_partial. [akpm@linux-foundation.org: use PAGE_ALIGNED()] Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03vmalloc: make find_vm_area check in rangeHATAYAMA Daisuke1-1/+1
Currently, __find_vmap_area searches for the kernel VM area starting at a given address. This patch changes this behavior so that it searches for the kernel VM area to which the address belongs. This change is needed by remap_vmalloc_range_partial to be introduced in later patch that receives any position of kernel VM area as target address. This patch changes the condition (addr > va->va_start) to the equivalent (addr >= va->va_end) by taking advantage of the fact that each kernel VM area is non-overlapping. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
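The effect of the changed comparison is that any address inside an area, not just its start, finds that area. The sketch below illustrates the lookup rule on a sorted array instead of the kernel's rb tree, which is a deliberate simplification; `struct area` and `find_area()` are hypothetical names. Areas are treated as half-open ranges [start, end) and assumed to be sorted and non-overlapping.

```c
#include <stddef.h>

struct area { unsigned long start, end; };	/* covers [start, end) */

/* Binary search for the area that contains addr.
 * Go left when addr is below the area, go right when addr is at or past
 * its end, and otherwise addr lies inside the area. */
const struct area *find_area(const struct area *areas, size_t n,
			     unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < areas[mid].start)
			hi = mid;
		else if (addr >= areas[mid].end)
			lo = mid + 1;
		else
			return &areas[mid];
	}
	return NULL;
}
```

Comparing against the end rather than the start in the "go right" branch is what turns an exact-start match into a containment match; because areas never overlap, at most one area can satisfy it.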
2013-05-07mm/vmalloc.c: add vfree commentAndrew Morton1-0/+2
Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-5/+40
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-04-29kexec, vmalloc: export additional vmalloc layer informationAtsushi Kumagai1-11/+0
Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get the start address of vmalloc region (vmalloc_start). The address which contains vmalloc_start value is represented as below: vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start) However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list) aren't exported as VMCOREINFO. So this patch exports them externally with small cleanup. [akpm@linux-foundation.org: vmalloc.h should include list.h for list_head] Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
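The address expression in this message is the usual "list node back to containing struct, then forward to a member" computation. The sketch below mirrors it in userspace; the struct layout is a simplified stand-in (the real struct vmap_area has more members), and `first_va_start_addr()` is a hypothetical helper written only to show how the two offsets combine.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Simplified stand-in for struct vmap_area (assumed layout). */
struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
	unsigned long flags;
	struct list_head list;
};

/* Computes
 *   vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)
 * i.e. the address holding va_start of the first area on the list. */
const unsigned long *first_va_start_addr(const struct list_head *head)
{
	const char *entry = (const char *)head->next;

	return (const unsigned long *)(entry
				       - offsetof(struct vmap_area, list)
				       + offsetof(struct vmap_area, va_start));
}
```

A dump tool has only the raw address of the list head and the exported offsets, so it has to perform this arithmetic by hand; that is why both offsets need to be present in VMCOREINFO.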
2013-04-29mm, vmalloc: remove list management of vmlist after initializing vmallocJoonsoo Kim1-40/+12
Now, there is no need to maintain vmlist after initializing vmalloc. So remove related code and data structure. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm, vmalloc: export vmap_area_list, instead of vmlistJoonsoo Kim1-5/+6
Although our intention is to unexport the internal structure entirely, there is one exception for kexec. kexec dumps the address of vmlist and makedumpfile uses this information. We are about to remove vmlist, so makedumpfile needs another way to retrieve information about the vmalloc layer. For this purpose, we export vmap_area_list instead of vmlist. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()Joonsoo Kim1-13/+42
This patch is a preparatory step for removing vmlist entirely. For that purpose, we change the code that iterates vmlist to iterate vmap_area_list instead. It is a somewhat trivial change, but one thing should be noted. Using vmap_area_list in vmallocinfo() introduces an ordering problem on SMP systems. In s_show(), we retrieve some values from vm_struct. vm_struct's values are not fully set up when va->vm is assigned. Full setup is signalled by removing the VM_UNLIST flag without holding a lock. When we see that VM_UNLIST is removed, it is not guaranteed that vm_struct has proper values in the view of other CPUs. So we need smp_[rw]mb to ensure that proper values are assigned when we see that VM_UNLIST is removed. Therefore, this patch not only changes the iterated list, but also adds the appropriate smp_[rw]mb in the right places. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
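The ordering requirement described here is the classic publish/observe pattern: the writer fills in the fields, issues a write barrier, then clears the "not ready" flag; the reader checks the flag, issues a read barrier, then reads the fields. The sketch below expresses that pattern with C11 fences as a rough userspace analogue of smp_wmb()/smp_rmb(). It is an assumption-laden illustration: `struct vm`, `publish()`, `read_if_ready()` and the flag value are hypothetical and do not reproduce the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define VM_UNLIST 0x20u		/* illustrative flag value */

struct vm {
	unsigned long size;
	const void *caller;
	atomic_uint flags;	/* VM_UNLIST set while not fully set up */
};

/* Writer: fill in the fields first, then clear VM_UNLIST. The release
 * fence plays the role of the write barrier in the commit message. */
void publish(struct vm *vm, unsigned long size, const void *caller)
{
	vm->size = size;
	vm->caller = caller;
	atomic_thread_fence(memory_order_release);
	atomic_fetch_and_explicit(&vm->flags, ~VM_UNLIST,
				  memory_order_relaxed);
}

/* Reader: only touch the fields after seeing VM_UNLIST cleared. The
 * acquire fence plays the role of the read barrier. */
bool read_if_ready(struct vm *vm, unsigned long *size)
{
	if (atomic_load_explicit(&vm->flags, memory_order_relaxed) & VM_UNLIST)
		return false;
	atomic_thread_fence(memory_order_acquire);
	*size = vm->size;
	return true;
}
```

Without the pair of fences, a reader on another CPU could observe the cleared flag and still read stale field values, which is the problem the commit message describes for s_show().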
2013-04-29mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()Joonsoo Kim1-26/+30
This patch is a preparatory step for removing vmlist entirely. For that purpose, we change the code that iterates vmlist to iterate vmap_area_list instead. It is a somewhat trivial change, but one thing should be noted. vmlist lacks information about some areas in the vmalloc address space. For example, vm_map_ram() allocates an area in the vmalloc address space, but it doesn't link it into vmlist. Providing full information about the vmalloc address space is the better idea, so we don't use va->vm and use vmap_area directly. This makes get_vmalloc_info() more precise. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()Joonsoo Kim1-16/+32
Now, when we hold vmap_area_lock, va->vm can't be discarded. So we can safely access va->vm when iterating vmap_area_list while holding vmap_area_lock. With this property, change the code in vread/vwrite() that iterates vmlist to iterate vmap_area_list instead. There is a small difference related to locking, because vmlist_lock is a mutex but vmap_area_lock is a spin_lock. It may introduce spinning overhead while vread/vwrite() is executing. But these are debug-oriented functions, so this overhead is not a real problem in the common case. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm, vmalloc: protect va->vm by vmap_area_lockJoonsoo Kim1-0/+7
Inserting an entry into vmlist and removing one from it take linear time, so it is inefficient. The following patches will try to remove vmlist entirely. This patch is a preparatory step for that. To remove vmlist, the code that iterates vmlist should be changed to iterate vmap_area_list. Before implementing that, we should make sure that accessing va->vm while iterating vmap_area_list doesn't cause a race condition. This patch ensures that there is no race condition on accessing vm_struct when iterating vmap_area_list. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm, vmalloc: move get_vmalloc_info() to vmalloc.cJoonsoo Kim1-0/+44
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this code must live there: its implementation needs vmlist_lock and it iterates vmlist, which may be an internal data structure of vmalloc. It is preferable that vmlist_lock and vmlist are only used in vmalloc.c for maintainability. So move the code to vmalloc.c. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-10make vfree() safe to call from interrupt contextsAl Viro1-5/+40
A bunch of RCU callbacks want to be able to do vfree() and end up with rather kludgy schemes. Just let vfree() do the right thing - put the victim on llist and schedule actual __vunmap() via schedule_work(), so that it runs from non-interrupt context. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
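The deferral idea described above can be sketched in userspace C: a caller that must not block pushes the buffer onto a lock-free list, and a worker later detaches the whole list and frees it. This is only an illustration using C11 atomics, not the kernel's llist or workqueue code. The names sketch_node, sketch_pending, sketch_defer_free and sketch_drain are hypothetical, and the sketch assumes every buffer is at least pointer-sized so that the buffer itself can serve as the list node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; it is overlaid on the buffer being freed. */
struct sketch_node {
	struct sketch_node *next;
};

/* Head of the pending-free list (assumption: one global list). */
static _Atomic(struct sketch_node *) sketch_pending = NULL;

/* Queue a buffer for a later free. Takes no lock, so it is safe to
 * call from a context that must not sleep. */
static void sketch_defer_free(void *p)
{
	struct sketch_node *n = p;
	struct sketch_node *head = atomic_load(&sketch_pending);

	do {
		n->next = head;	/* head is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak(&sketch_pending, &head, n));
}

/* Worker side: detach the whole list at once, then free each entry.
 * Returns the number of buffers freed. */
static int sketch_drain(void)
{
	struct sketch_node *n = atomic_exchange(&sketch_pending, NULL);
	int freed = 0;

	while (n) {
		struct sketch_node *next = n->next;

		free(n);
		freed++;
		n = next;
	}
	return freed;
}
```

In the commit, the draining step corresponds to the work item scheduled via schedule_work(); here it is simply a function that the caller runs later from a context where freeing is allowed.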
2013-02-23mm: use NUMA_NO_NODEDavid Rientjes1-15/+18
Make a sweep through mm/ and convert code that uses -1 directly to using the more appropriate NUMA_NO_NODE. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILDKirill A. Shutemov1-2/+2
We don't need custom NUMA_BUILD anymore, since we have handy IS_ENABLED(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: use %pK for /proc/vmallocinfoKees Cook1-1/+1
In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel virtual addresses in /proc/vmallocinfo too. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Brad Spengler <spender@grsecurity.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_RESERVED and mm->reserved_vm counterKonstantin Khlebnikov1-2/+1
A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA. It has since lost its original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes the reserved_vm counter from mm_struct. Nobody seems to care about it: it is not exported to userspace directly, it only reduces the total_vm shown in proc. Thus VM_RESERVED can be replaced with VM_IO or with the pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: make vb_alloc() more foolproofJan Kara1-0/+8
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate 0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT and interesting stuff happens. So make debugging such problems easier and warn about 0-size allocation. [akpm@linux-foundation.org: use WARN_ON-return-value feature] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
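To see why a zero-size request misbehaves, the following userspace sketch mimics the arithmetic of a get_order()-style helper. It is an illustration under stated assumptions, not the kernel's implementation: it assumes a 64-bit size type and 4 KiB pages (a shift of 12), and the names sketch_fls64, sketch_get_order and SKETCH_PAGE_SHIFT are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* 1-based position of the highest set bit; 0 when x == 0. */
static int sketch_fls64(uint64_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Smallest order such that (page size << order) >= size.
 * For size == 0 the decrement wraps to all ones, so the result
 * is 64 - SKETCH_PAGE_SHIFT instead of something sensible. */
static int sketch_get_order(uint64_t size)
{
	size--;
	size >>= SKETCH_PAGE_SHIFT;
	return sketch_fls64(size);
}
```

With these assumptions a one-page request yields order 0, while a zero-byte request yields order 52, which is the BITS_PER_LONG minus page shift value the commit message mentions. Rejecting the zero size up front, as the patch does with a warning, avoids feeding that order to the allocator.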
2012-07-31vmalloc: walk vmap_areas by sorted list instead of rb_next()Hong zhi guo1-4/+4
A suitable hole is currently found by repeatedly calling rb_next(). This can simply be replaced by a walk over the sorted vmap_area_list, which is simpler and more efficient. Mutation of the list and the tree only happens in pairs, within __insert_vmap_area and __free_vmap_area, under the protection of vmap_area_lock. The patched code also runs under vmap_area_lock, so the list walk is safe and consistent with the tree walk. Tested on SMP by repeating batches of vmalloc and vfree with random sizes and rounds for hours. Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
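A first-fit hole search over areas kept in address order can be sketched as below. This is a simplified userspace illustration, not the kernel's alloc_vmap_area(): it uses a sorted array in place of the linked list, ignores address overflow, and returns 0 to signal failure (so it assumes 0 is never a valid start). The names sketch_area and sketch_find_hole are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocated range [start, end). */
struct sketch_area {
	unsigned long start;
	unsigned long end;
};

/* Return the lowest aligned address in [vstart, vend) at which size
 * bytes fit between the given areas, or 0 if there is no such hole.
 * areas[] must be sorted by start and must not overlap;
 * align must be a power of two. */
static unsigned long sketch_find_hole(const struct sketch_area *areas,
				      size_t n,
				      unsigned long vstart,
				      unsigned long vend,
				      unsigned long size,
				      unsigned long align)
{
	unsigned long addr;
	size_t i;

	assert(align != 0 && (align & (align - 1)) == 0);

	addr = (vstart + align - 1) & ~(align - 1);
	for (i = 0; i < n; i++) {
		if (areas[i].end <= addr)
			continue;	/* area lies below the candidate */
		if (addr + size <= areas[i].start)
			break;		/* hole before this area is big enough */
		/* Candidate collides: retry just past this area. */
		addr = (areas[i].end + align - 1) & ~(align - 1);
	}
	if (addr + size > vend)
		return 0;
	return addr;
}
```

Because the areas are visited in address order, each step is a plain next-element move, which is the property the commit relies on when it swaps rb_next() for a list walk.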
2012-07-30Merge branch 'for-linus-for-3.6-rc1' of ↵Linus Torvalds1-10/+18
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull DMA-mapping updates from Marek Szyprowski: "Those patches are continuation of my earlier work. They contains extensions to DMA-mapping framework to remove limitation of the current ARM implementation (like limited total size of DMA coherent/write combine buffers), improve performance of buffer sharing between devices (attributes to skip cpu cache operations or creation of additional kernel mapping for some specific use cases) as well as some unification of the common code for dma_mmap_attrs() and dma_mmap_coherent() functions. All extensions have been implemented and tested for ARM architecture." * 'for-linus-for-3.6-rc1' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: ARM: dma-mapping: add support for DMA_ATTR_SKIP_CPU_SYNC attribute common: DMA-mapping: add DMA_ATTR_SKIP_CPU_SYNC attribute ARM: dma-mapping: add support for dma_get_sgtable() common: dma-mapping: introduce dma_get_sgtable() function ARM: dma-mapping: add support for DMA_ATTR_NO_KERNEL_MAPPING attribute common: DMA-mapping: add DMA_ATTR_NO_KERNEL_MAPPING attribute common: dma-mapping: add support for generic dma_mmap_* calls ARM: dma-mapping: fix error path for memory allocation failure ARM: dma-mapping: add more sanity checks in arm_dma_mmap() ARM: dma-mapping: remove custom consistent dma region mm: vmalloc: use const void * for caller argument scatterlist: add sg_alloc_table_from_pages function
2012-07-30ARM: dma-mapping: remove custom consistent dma regionMarek Szyprowski1-1/+9
This patch changes dma-mapping subsystem to use generic vmalloc areas for all consistent dma allocations. This increases the total size limit of the consistent allocations and removes platform hacks and a lot of duplicated code. Atomic allocations are served from special pool preallocated on boot, because vmalloc areas cannot be reliably created in atomic context. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Minchan Kim <minchan@kernel.org>
2012-07-30mm: vmalloc: use const void * for caller argumentMarek Szyprowski1-9/+9
'const void *' is a safer type for caller function type. This patch updates all references to caller function type. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Minchan Kim <minchan@kernel.org>
2012-07-24vmalloc: remove KM_USER0 from commentsCong Wang1-6/+2
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-05-29mm: fix faulty initialization in vmalloc_init()KyongHo1-1/+2
The transfer of ->flags causes some of the static mapping virtual addresses to be prematurely freed (before the mapping is removed) because VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might cause subsequent vmalloc/ioremap calls to fail because it might allocate one of the freed virtual address ranges that aren't unmapped. va->flags has different types of flags from tmp->flags. If a region with VM_IOREMAP set is registered with vm_area_add_early(), it will be removed by __purge_vmap_area_lazy(). Fix vmalloc_init() to correctly initialize vmap_area for the given vm_struct. Also initialise va->vm. If it is not set, find_vm_area() for the early vm regions will always fail. Signed-off-by: KyongHo Cho <pullip.cho@samsung.com> Cc: "Olav Haugan" <ohaugan@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: use kcalloc() instead of kzalloc() to allocate arrayThomas Meyer1-2/+2
The advantage of kcalloc is, that will prevent integer overflows which could result from the multiplication of number of elements and size and it is also a bit nicer to read. The semantic patch that makes this change is available in https://lkml.org/lkml/2011/11/25/107 Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
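The overflow protection mentioned above amounts to checking the element count against the largest count whose product still fits in size_t. A minimal userspace sketch of that check follows; it uses malloc() and memset() rather than any kernel allocator, and sketch_alloc_array is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed array of n elements of the given size.
 * Returns NULL if n * size would wrap around size_t, instead of
 * silently allocating a buffer that is too small. */
static void *sketch_alloc_array(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the multiplication would overflow */

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

An open-coded allocation of n * size bytes has no such guard: if the product wraps, the allocation succeeds with a short buffer and later indexing runs past its end.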
2012-03-20mm: remove the second argument of k[un]map_atomic()Cong Wang1-4/+4
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-01-12mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error pathKautuk Consul1-5/+4
If either of the vas or vms arrays are not properly kzalloced, then the code jumps to the err_free label. The err_free label runs a loop to check and free each of the array members of the vas and vms arrays which is not required for this situation as none of the array members have been allocated till this point. Eliminate the extra loop we have to go through by introducing a new label err_free2 and then jumping to it. [akpm@linux-foundation.org: remove now-unneeded tests] Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10mm/vmalloc.c: change void* into explict vm_struct*Minchan Kim1-4/+4
vmap_area->private is a void*, but the field is not used for various purposes; it only ever points to a vm_struct. So change it to a vm_struct* with a matching name, to improve readability and type checking. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-05Merge branch 'devel-stable' into for-linusRussell King1-2/+27
Conflicts: arch/arm/kernel/setup.c arch/arm/mach-shmobile/board-kota2.c
2011-12-20mm/vmalloc.c: remove static declaration of va from __get_vm_area_nodeKautuk Consul1-1/+1
Static storage is not required for the struct vmap_area in __get_vm_area_node. Removing "static" to store this variable on the stack instead. Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09mm: vmalloc: check for page allocation failure before vmlist insertionMel Gorman1-0/+2
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after they are fully initialised. Unfortunately, it did not check that __vmalloc_area_node() successfully populated the area. In the event of allocation failure, the vmalloc area is freed but the pointer to freed memory is inserted into the vmlist, leading to a crash later in get_vmalloc_info(). This patch adds a check for __vmalloc_area_node() failure within __vmalloc_node_range. It does not use "goto fail" as in the previous error path, as a warning was already displayed by __vmalloc_area_node() before it called vfree in its failure path. Credit goes to Luciano Chavez for doing all the real work of identifying exactly where the problem was. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com> Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.1.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-05Merge branch 'vmalloc' of git://git.linaro.org/people/nico/linux into ↵Russell King1-2/+27
devel-stable
2011-11-18mm: add vm_area_add_early()Nicolas Pitre1-2/+27
The existing vm_area_register_early() allows for early vmalloc space allocation. However, upcoming cleanups in the ARM architecture require that some fixed locations in the vmalloc area also be reserved very early. The name "vm_area_register_early" would have been a good name for the reservation part without the allocation. Since it is already in use with different semantics, let's create vm_area_add_early() instead. Both vm_area_register_early() and vm_area_add_early() can be used together: the former is now implemented using the latter, which ensures that no conflicting areas are added, but no attempt is made to make the allocation scheme in vm_area_register_early() more sophisticated. After all, you must know what you're doing when using those functions. Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org
2011-11-16xen: map foreign pages for shared rings by updating the PTEs directlyDavid Vrabel1-14/+13
When mapping a foreign page with xenbus_map_ring_valloc() with the GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and pass a pointer to the PTE (in init_mm). After the page is mapped, the usual fault mechanism can be used to update additional MMs. This allows the vmalloc_sync_all() to be removed from alloc_vm_area(). Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> [v1: Squashed fix by Michal for no-mmu case] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Michal Simek <monstr@monstr.eu>
2011-10-31mm/vmalloc.c: report more vmalloc failuresJoe Perches1-3/+8
Some vmalloc failure paths do not report OOM conditions. Add warn_alloc_failed, which also does a dump_stack, to those failure paths. This allows more site specific vmalloc failure logging message printks to be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: neaten warn_alloc_failedJoe Perches1-2/+2
Add __attribute__((format (printf...) to the function to validate format and arguments. Use vsprintf extension %pV to avoid any possible message interleaving. Coalesce format string. Convert printks/pr_warning to pr_warn. [akpm@linux-foundation.org: use the __printf() macro] Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: avoid null pointer access in vm_struct via /proc/vmallocinfoMitsuo Hayasaka1-17/+48
/proc/vmallocinfo shows information about the vmalloc allocations in vmlist, which is a linked list of vm_structs. However, it may access the pages field of a vm_struct for which no pages have been allocated yet. This results in a null pointer access and leads to a kernel panic. Why this happens: in __vmalloc_node_range(), called from vmalloc(), the newly allocated vm_struct is added to vmlist in __get_vm_area_node(), and only afterwards are fields of the vm_struct such as nr_pages and pages set, in __vmalloc_area_node(). In other words, it is added to vmlist before it is fully initialized. If /proc/vmallocinfo is read at the same time, show_numa_info() accesses the pages field of the vm_struct according to its nr_pages field. Thus, a null pointer access happens. The patch adds the newly allocated vm_struct to the vmlist *after* it is fully initialized, so that show_numa_info() never sees a pages field that has not been allocated. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm: sync vmalloc address space page tables in alloc_vm_area()David Vrabel1-0/+8
Xen backend drivers (e.g., blkback and netback) would sometimes fail to map grant pages into the vmalloc address space allocated with alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could not find the page (in the L2 table) containing the PTEs it needed to update. (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000 netback and blkback were making the hypercall from a kernel thread where task->active_mm != &init_mm and alloc_vm_area() was only updating the page tables for init_mm. The usual method of deferring the update to the page tables of other processes (i.e., after taking a fault) doesn't work as a fault cannot occur during the hypercall. This would work on some systems depending on what else was using vmalloc. Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all() from alloc_vm_area()") and add a comment to explain why it's needed. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Keir Fraser <keir.xen@gmail.com> Cc: <stable@kernel.org> [3.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-14mm: fix wrong vmap address calculations with odd NR_CPUS valuesClemens Ladisch1-3/+4
Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does address calculations under the assumption that VMAP_BLOCK_SIZE is a power of two. However, this might not be true if CONFIG_NR_CPUS is not set to a power of two. Wrong vmap_block index/offset values could lead to memory corruption. However, this has never been observed in practice (or never been diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that checks for inconsistent vmap_block indices. To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572 Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz> Reported-by: Matias A. Fonzo <selk@dragora.org> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: 2.6.28+ <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
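The hazard can be shown with a generic userspace example: extracting an offset with a bit mask is only equivalent to a modulo when the block size is a power of two. This is an illustration of the arithmetic, not the kernel's vmap block code; the helper names are hypothetical, and the round-up helper assumes its argument is nonzero and small enough that the result fits in an unsigned long.

```c
#include <assert.h>

/* Round n up to the next power of two (n must be nonzero and the
 * result must be representable in an unsigned long). */
static unsigned long sketch_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	assert(n != 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Offset inside a block computed with a mask: only correct when
 * block_size is a power of two. */
static unsigned long sketch_offset_mask(unsigned long addr,
					unsigned long block_size)
{
	return addr & (block_size - 1);
}

/* Reference result, valid for any nonzero block_size. */
static unsigned long sketch_offset_mod(unsigned long addr,
				       unsigned long block_size)
{
	return addr % block_size;
}
```

For a block size of 48 and an address of 100 the mask yields 36 while the true offset is 4; after rounding the block size up to 64 both computations agree. Forcing the size to a power of two, as the fix does for VMAP_BLOCK_SIZE, keeps mask-based calculations valid.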
2011-07-26atomic: use <linux/atomic.h>Arun Sharma1-1/+1
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20vmalloc,rcu: Convert call_rcu(rcu_free_vb) to kfree_rcu()Lai Jiangshan1-8/+1
The rcu callback rcu_free_vb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(rcu_free_vb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Namhyung Kim <namhyung@gmail.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20vmalloc,rcu: Convert call_rcu(rcu_free_va) to kfree_rcu()Lai Jiangshan1-8/+1
The rcu callback rcu_free_va() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(rcu_free_va). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Namhyung Kim <namhyung@gmail.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-26Merge branch 'upstream/tidy-xen-mmu-2.6.39' of ↵Linus Torvalds1-4/+0
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen * 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen: xen: fix compile without CONFIG_XEN_DEBUG_FS Use arbitrary_virt_to_machine() to deal with ioremapped pud updates. Use arbitrary_virt_to_machine() to deal with ioremapped pmd updates. xen/mmu: remove all ad-hoc stats stuff xen: use normal virt_to_machine for ptes xen: make a pile of mmu pvop functions static vmalloc: remove vmalloc_sync_all() from alloc_vm_area() xen: condense everything onto xen_set_pte xen: use mmu_update for xen_set_pte_at() xen: drop all the special iomap pte paths.
2011-05-25mm: print vmalloc() state after allocation failuresDave Hansen1-2/+7
I was tracking down a page allocation failure that ended up in vmalloc(). Since vmalloc() uses 0-order pages, if somebody asks for an insane amount of memory, we'll still get a warning with "order:0" in it. That's not very useful. During recovery, vmalloc() also nicely frees all of the memory that it got up to the point of the failure. That is wonderful, but it also quickly hides any issues. We have a much different situation if vmalloc() repeatedly fails 10GB in to:

    vmalloc(100 * 1<<30);

versus repeatedly failing 4096 bytes in to a:

    vmalloc(8192);

This patch will print out messages that look like this:

    [ 68.123503] vmalloc: allocation failure, allocated 6680576 of 13426688 bytes
    [ 68.124218] bash: page allocation failure: order:0, mode:0xd2
    [ 68.124811] Pid: 3770, comm: bash Not tainted 2.6.39-rc3-00082-g85f2e68-dirty #333
    [ 68.125579] Call Trace:
    [ 68.125853] [<ffffffff810f6da6>] warn_alloc_failed+0x146/0x170
    [ 68.126464] [<ffffffff8107e05c>] ? printk+0x6c/0x70
    [ 68.126791] [<ffffffff8112b5d4>] ? alloc_pages_current+0x94/0xe0
    [ 68.127661] [<ffffffff8111ed37>] __vmalloc_node_range+0x237/0x290
    ...

The 'order' variable is added for clarity when calling warn_alloc_failed() to avoid having an unexplained '0' as an argument. The 'tmp_mask' is because adding an open-coded '| __GFP_NOWARN' would take us over 80 columns for the alloc_pages_node() call. If we are going to add a line, it might as well be one that makes the sucker easier to read. As a side issue, I also noticed that ctl_ioctl() does vmalloc() based solely on an unverified value passed in from userspace. Granted, it's under CAP_SYS_ADMIN, but it still frightens me a bit. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm/vmalloc: remove guard page from between vmap blocksJohannes Weiner1-3/+3
The vmap allocator is used to, among other things, allocate per-cpu vmap blocks, where each vmap block is naturally aligned to its own size. Obviously, leaving a guard page after each vmap area forbids packing vmap blocks efficiently and can make the kernel run out of possible vmap blocks long before overall vmap space is exhausted. The new interface to map a user-supplied page array into linear vmalloc space (vm_map_ram) insists on allocating from a vmap block (instead of falling back to a custom area) when the area size is below a certain threshold. With heavy users of this interface (e.g. XFS) and limited vmalloc space on 32-bit, vmap block exhaustion is a real problem. Remove the guard page from the core vmap allocator. vmalloc and the old vmap interface enforce a guard page on their own at a higher level. Note that without this patch, we had accidental guard pages after those vm_map_ram areas that happened to be at the end of a vmap block, but not between every area. This patch removes this accidental guard page only. If we want guard pages after every vm_map_ram area, this should be done separately. And just like with vmalloc and the old interface on a different level, not in the core allocator. Mel pointed out: "If necessary, the guard page could be reintroduced as a debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems reasonable." Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20vmalloc: remove vmalloc_sync_all() from alloc_vm_area()Jeremy Fitzhardinge1-4/+0
There's no need for it: it will get faulted into the current pagetable as needed. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
2011-03-22vmalloc: remove confusing comment on vwrite()Namhyung Kim1-2/+0
KM_USER1 is never used in the vwrite() path, so the caller doesn't need to guarantee that it is unused. The only thing the caller must guarantee is that KM_USER0 is not in use, and that is already stated in the comment. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22mm: vmap area cacheNick Piggin1-52/+104
Provide a free area cache for the vmalloc virtual address allocator, based on the algorithm used by the user virtual memory allocator. This reduces the number of rbtree operations and linear traversals over the vmap extents in order to find a free area, by starting off at the last point that a free area was found. The free area cache is reset if areas are freed behind it, or if we are searching for a smaller area or alignment than last time. So allocation patterns are not changed (verified by corner-case and random test cases in userspace testing).

This solves a regression caused by lazy vunmap TLB purging introduced in db64fe02 (mm: rewrite vmap layer). That patch will leave extents in the vmap allocator after they are vunmapped, and until a significant number accumulate that can be flushed in a single batch. So in a workload that vmalloc/vfree frequently, a chain of extents will build up from VMALLOC_START address, which have to be iterated over each time (giving an O(n) type of behaviour). After this patch, the search will start from where it left off, giving closer to an amortized O(1). This is verified to solve regressions reported by Steven in GFS2, and Avi in KVM.

Hugh's update:

: I tried out the recent mmotm, and on one machine was fortunate to hit the BUG_ON(first->va_start < addr) which seems to have been stalling your vmap area cache patch ever since May.
: I can get you addresses etc, I did dump a few out; but once I stared at them, it was easier just to look at the code: and I cannot see how you would be so sure that first->va_start < addr, once you've done that addr = ALIGN(max(...), align) above, if align is over 0x1000 (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).
: I originally got around it by just changing the
:     if (first->va_start < addr) {
: to
:     while (first->va_start < addr) {
: without thinking about it any further; but that seemed unsatisfactory, why would we want to loop here when we've got another very similar loop just below it?
: I am never going to admit how long I've spent trying to grasp your "while (n)" rbtree loop just above this, the one with the peculiar
:     if (!first && tmp->va_start < addr + size)
: in. That's unfamiliar to me, I'm guessing it's designed to save a subsequent rb_next() in a few circumstances (at risk of then setting a wrong cached_hole_size?); but they did appear few to me, and I didn't feel I could sign off something with that in when I don't grasp it, and it seems responsible for extra code and mistaken BUG_ON below it.
: I've reverted to the familiar rbtree loop that find_vma() does (but with va_end >= addr as you had, to respect the additional guard page): and then (given that cached_hole_size starts out 0) I don't see the need for any complications below it. If you do want to keep that loop as you had it, please add a comment to explain what it's trying to do, and where addr is relative to first when you emerge from it.
: Aren't your tests "size <= cached_hole_size" and "addr + size > first->va_start" forgetting the guard page we want before the next area? I've changed those.
: I have not changed your many "addr + size - 1 < addr" overflow tests, but have since come to wonder, shouldn't they be "addr + size < addr" tests - won't the vend checks go wrong if addr + size is 0?
: I have added a few comments - Wolfgang Wander's 2.6.13 description of 1363c3cd8603a913a27e2995dccbd70d5312d8e6 Avoiding mmap fragmentation helped me a lot, perhaps a pointer to that would be good too. And I found it easier to understand when I renamed cached_start slightly and moved the overflow label down.
: This patch would go after your mm-vmap-area-cache.patch in mmotm. Trivially, nobody is going to get that BUG_ON with this patch, and it appears to work fine on my machines; but I have not given it anything like the testing you did on your original, and may have broken all the performance you were aiming for. Please take a look and test it out integrate with yours if you're satisfied - thanks.

[akpm@linux-foundation.org: add locking comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reported-and-tested-by: Steven Whitehouse <swhiteho@redhat.com> Reported-and-tested-by: Avi Kivity <avi@redhat.com> Tested-by: "Barry J. Marson" <bmarson@redhat.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13Merge branch 'release' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits) ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework ACPI: fix resource check message ACPI / Battery: Update information on info notification and resume ACPI: Drop device flag wake_capable ACPI: Always check if _PRW is present before trying to evaluate it ACPI / PM: Check status of power resources under mutexes ACPI / PM: Rename acpi_power_off_device() ACPI / PM: Drop acpi_power_nocheck ACPI / PM: Drop acpi_bus_get_power() Platform / x86: Make fujitsu_laptop use acpi_bus_update_power() ACPI / Fan: Rework the handling of power resources ACPI / PM: Register power resource devices as soon as they are needed ACPI / PM: Register acpi_power_driver early ACPI / PM: Add function for updating device power state consistently ACPI / PM: Add function for device power state initialization ACPI / PM: Introduce __acpi_bus_get_power() ACPI / PM: Introduce function for refcounting device power resources ACPI / PM: Add functions for manipulating lists of power resources ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes ACPICA: Update version to 20101209 ...
2011-01-13vmalloc: remove redundant unlikely()Tobias Klauser1-1/+1
IS_ERR() already implies unlikely(), so it can be omitted here. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: unify module_alloc code for vmallocDavid Rientjes1-21/+29
Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for module_alloc(). Much of the code is duplicated and can be generalized in a globally accessible function, __vmalloc_node_range(). __vmalloc_node() now calls into __vmalloc_node_range() with a range of [VMALLOC_START, VMALLOC_END) for functionally equivalent behavior. Each architecture may then use __vmalloc_node_range() directly to remove the duplication of code. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: remove gfp mask from pcpu_get_vm_areasDavid Rientjes1-12/+9
pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t formal and use the mask internally. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: remove unused get_vm_area_nodeDavid Rientjes1-7/+0
get_vm_area_node() is unused in the kernel and can thus be removed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm: convert sprintf_symbol to %pSJoe Perches1-7/+2
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-12ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type supportHuang Ying1-0/+1
Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset). It works in so called "Firmware First" mode, that is, hardware errors are reported to firmware firstly, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware link can be checked by firmware to produce more valuable hardware error information for Linux. This patch adds POLL/IRQ/NMI notification types support. Because the memory area used to transfer hardware error information from BIOS to Linux can be determined only in NMI, IRQ or timer handler, but general ioremap can not be used in atomic context, so a special version of atomic ioremap is implemented for that. Known issue: - Error information can not be printed for recoverable errors notified via NMI, because printk is not NMI-safe. Will fix this via delay printing to IRQ context via irq_work or make printk NMI-safe. v2: - adjust printk format per comments. Signed-off-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-12-02vmalloc: eagerly clear ptes on vunmapJeremy Fitzhardinge1-11/+17
On stock 2.6.37-rc4, running: # mount lilith:/export /mnt/lilith # find /mnt/lilith/ -type f -print0 | xargs -0 file crashes the machine fairly quickly under Xen. Often it results in oops messages, but the couple of times I tried just now, it just hung quietly and made Xen print some rude messages: (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp 3000000000000000) for mfn 1d7058 (pfn 18fa7) (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp 1000000000000000) for mfn 1d2e04 (pfn 1d1fb) (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04 Which means the domain tried to map a pagetable page RW, which would allow it to map arbitrary memory, so Xen stopped it. This is because vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had finished with them, and those pages got recycled as pagetable pages while still having these RW aliases. Removing those mappings immediately removes the Xen-visible aliases, and so it has no problem with those pages being reused as pagetable pages. Deferring the TLB flush doesn't upset Xen because it can flush the TLB itself as needed to maintain its invariants. When unmapping a region in the vmalloc space, clear the ptes immediately. There's no point in deferring this because there's no amortization benefit. The TLBs are left dirty, and they are flushed lazily to amortize the cost of the IPIs. This specific motivation for this patch is an oops-causing regression since 2.6.36 when using NFS under Xen, triggered by the NFS client's use of vm_map_ram() introduced in 56e4ebf877b60 ("NFS: readdir with vmapped pages") . XFS also uses vm_map_ram() and could cause similar problems. 
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Bryan Schumaker <bjschuma@netapp.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Alex Elder <aelder@sgi.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: add vzalloc() and vzalloc_node() helpersDave Young1-2/+44
Add vzalloc() and vzalloc_node() to encapsulate the vmalloc-then-memset-zero operation. Use __GFP_ZERO to zero fill the allocated memory. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Greg Ungerer <gerg@snapgear.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26vmalloc: annotate lock context change on s_start/stop()Namhyung Kim1-0/+2
s_start() and s_stop() grab/release vmlist_lock but were missing proper annotations. Add them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26vmalloc: rename temporary variable in __insert_vmap_area()Namhyung Kim1-4/+4
Rename redundant 'tmp' to fix following sparse warnings: mm/vmalloc.c:296:34: warning: symbol 'tmp' shadows an earlier one mm/vmalloc.c:293:24: originally declared here Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-22Merge branch 'for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: update comments to reflect that percpu allocations are always zero-filled percpu: Optimize __get_cpu_var() x86, percpu: Optimize this_cpu_ptr percpu: clear memory allocated with the km allocator percpu: fix build breakage on s390 and cleanup build configuration tests percpu: use percpu allocator on UP too percpu: reduce PCPU_MIN_UNIT_SIZE to 32k vmalloc: pcpu_get/free_vm_areas() aren't needed on UP Fixed up trivial conflicts in include/linux/percpu.h
2010-09-17mm, x86: Saving vmcore with non-lazy freeing of vmasCliff Wickman1-0/+9
During the reading of /proc/vmcore the kernel is doing ioremap()/iounmap() repeatedly. And the buildup of un-flushed vm_area_struct's is causing a great deal of overhead. (rb_next() is chewing up most of that time). This solution is to provide function set_iounmap_nonlazy(). It causes a subsequent call to iounmap() to immediately purge the vma area (with try_purge_vmap_area_lazy()). With this patch we have seen the time for writing a 250MB compressed dump drop from 71 seconds to 44 seconds. Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: kexec@lists.infradead.org Cc: <stable@kernel.org> LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-08vmalloc: pcpu_get/free_vm_areas() aren't needed on UPTejun Heo1-0/+2
These functions are used only by the percpu memory allocator on SMP. Don't build them on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@kernel.dk> Reviewed-by: Christoph Lameter <cl@linux.com>
2010-08-12Merge branch 'stable/xen-swiotlb-0.8.6' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: x86: Detect whether we should use Xen SWIOTLB. pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions. swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough. xen/mmu: inhibit vmap aliases rather than trying to clear them out vmap: add flag to allow lazy unmap to be disabled at runtime xen: Add xen_create_contiguous_region xen: Rename the balloon lock xen: Allow unprivileged Xen domains to create iomap pages xen: use _PAGE_IOMAP in ioremap to do machine mappings Fix up trivial conflicts (adding both xen swiotlb and xen pci platform driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and include/xen/xen-ops.h
2010-08-09mm/vmalloc.c: check kmalloc() return valueKulikov Vasiliy1-1/+4
kmalloc() may fail, if so return -ENOMEM. Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09mm: use ERR_CASTJulia Lawall1-1/+1
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
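The error-pointer idiom that this change tidies up can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming a GCC- or Clang-compatible compiler for __builtin_expect; the helpers mirror the shape of the kernel's ERR_PTR(), PTR_ERR(), IS_ERR() and ERR_CAST(), while struct area, struct block, alloc_area() and new_block() are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace re-creation of the kernel helpers, for illustration only. */
#define MAX_ERRNO 4095
#define unlikely(x) __builtin_expect(!!(x), 0)

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }

/* The branch hint lives inside IS_ERR(), so callers need not add their own. */
static inline bool IS_ERR(const void *ptr)
{
	return unlikely((unsigned long)ptr >= (unsigned long)-MAX_ERRNO);
}

/* Re-type an error pointer without decoding and re-encoding the errno. */
static inline void *ERR_CAST(const void *ptr) { return (void *)ptr; }

/* Hypothetical types and functions, not taken from mm/vmalloc.c. */
struct area { int id; };
struct block { int id; };

static struct area *alloc_area(int fail)
{
	static struct area a = { 1 };

	return fail ? ERR_PTR(-ENOMEM) : &a;
}

static struct block *new_block(int fail)
{
	static struct block b;
	struct area *va = alloc_area(fail);

	if (IS_ERR(va))
		return ERR_CAST(va);	/* instead of ERR_PTR(PTR_ERR(va)) */
	b.id = va->id + 1;
	return &b;
}
```

A caller would test the result with IS_ERR(new_block(...)) and, on failure, recover the errno with PTR_ERR(). ERR_CAST() makes it explicit that only the pointer type changes, which is the readability point the commit makes.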
2010-07-27vmap: add flag to allow lazy unmap to be disabled at runtimeJeremy Fitzhardinge1-0/+4
Add a flag to force lazy_max_pages() to zero to prevent any outstanding mapped pages. We'll need this for Xen. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Nick Piggin <npiggin@suse.de>
2010-07-09x86, ioremap: Fix incorrect physical address handling in PAE modeKenji Kaneshige1-1/+1
Current x86 ioremap() doesn't handle physical address higher than 32-bit properly in X86_32 PAE mode. When physical address higher than 32-bit is passed to ioremap(), higher 32-bits in physical address is cleared wrongly. Due to this bug, ioremap() can map wrong address to linear address space. In my case, 64-bit MMIO region was assigned to a PCI device (ioat device) on my system. Because of the ioremap()'s bug, wrong physical address (instead of MMIO region) was mapped to linear address space. Because of this, loading ioatdma driver caused unexpected behavior (kernel panic, kernel hangup, ...). Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> LKML-Reference: <4C1AE680.7090408@jp.fujitsu.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2010-02-02mm: purge fragmented percpu vmap blocksNick Piggin1-11/+81
Improve handling of fragmented per-CPU vmaps. Previously, we didn't free up a per-CPU map until all of its addresses had been used and freed. So fragmented blocks could fill up vmalloc space even if they actually had no active vmap regions within them. Add some logic to allow all CPUs to have these blocks purged in the case of failure to allocate a new vm area, and also add some logic to trim such blocks on the current CPU if we hit them in the allocation path (so as to avoid a large build up of them). Christoph reported some vmap allocation failures when using the per CPU vmap APIs in XFS, which cannot be reproduced after this patch and the previous bug fix. Cc: linux-mm@kvack.org Cc: stable@kernel.org Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Nick Piggin <npiggin@suse.de> -- Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02mm: percpu-vmap fix RCU list walkingNick Piggin1-14/+6
RCU list walking of the per-cpu vmap cache was broken. It did not use RCU primitives, and also the union of free_list and rcu_head is obviously wrong (because free_list is indeed the list we are RCU walking). While we are there, remove a couple of unused fields from an earlier iteration. These APIs aren't actually used anywhere, because of problems with the XFS conversion. Christoph has now verified that the problems are solved with these patches. Also it is an exported interface, so I think it will be good to be merged now (and Christoph wants to get the XFS changes into their local tree). Cc: stable@kernel.org Cc: linux-mm@kvack.org Tested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Nick Piggin <npiggin@suse.de> -- Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-21vmalloc: remove BUG_ON due to racy counting of VM_LAZY_FREEYongseok Koh1-3/+1
In free_unmap_area_noflush(), va->flags is marked as VM_LAZY_FREE first, and then vmap_lazy_nr is increased atomically. But in __purge_vmap_area_lazy(), while traversing vmap_area_list, nr is counted by checking whether VM_LAZY_FREE is set in va->flags. After counting nr, the kernel reads vmap_lazy_nr atomically and checks a BUG_ON condition (whether nr is greater than vmap_lazy_nr) to prevent vmap_lazy_nr from going negative. The problem is that, if interrupted right after marking VM_LAZY_FREE, the increment of vmap_lazy_nr can be delayed. Consequently, the BUG_ON condition can be met because nr is counted as more than vmap_lazy_nr. This is highly probable when vmalloc/vfree are called frequently. This scenario has been verified by adding a delay between marking VM_LAZY_FREE and increasing vmap_lazy_nr in free_unmap_area_noflush(). Although vmap_lazy_nr is used for checking the high watermark, it was never a strict watermark. And although the BUG_ON condition is meant to prevent vmap_lazy_nr from going negative, vmap_lazy_nr is a signed variable, so it can go negative temporarily. Consequently, removing the BUG_ON condition is proper. A possible BUG_ON message is like the below. kernel BUG at mm/vmalloc.c:517! 
invalid opcode: 0000 [#1] SMP EIP: 0060:[<c04824a4>] EFLAGS: 00010297 CPU: 3 EIP is at __purge_vmap_area_lazy+0x144/0x150 EAX: ee8a8818 EBX: c08e77d4 ECX: e7c7ae40 EDX: c08e77ec ESI: 000081fe EDI: e7c7ae60 EBP: e7c7ae64 ESP: e7c7ae3c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Call Trace: [<c0482ad9>] free_unmap_vmap_area_noflush+0x69/0x70 [<c0482b02>] remove_vm_area+0x22/0x70 [<c0482c15>] __vunmap+0x45/0xe0 [<c04831ec>] vmalloc+0x2c/0x30 Code: 8d 59 e0 eb 04 66 90 89 cb 89 d0 e8 87 fe ff ff 8b 43 20 89 da 8d 48 e0 8d 43 20 3b 04 24 75 e7 fe 05 a8 a5 a3 c0 e9 78 ff ff ff <0f> 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c6 b8 ac a5 a3 c0 31 EIP: [<c04824a4>] __purge_vmap_area_lazy+0x144/0x150 SS:ESP 0068:e7c7ae3c [ See also http://marc.info/?l=linux-kernel&m=126335856228090&w=2 ] Signed-off-by: Yongseok Koh <yongseok.koh@samsung.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
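The interleaving described in this commit can be replayed deterministically in userspace. The sketch below is a single-threaded model, not the kernel code: struct sim and its three functions are hypothetical names, and plain fields stand in for the VM_LAZY_FREE flag and the atomic vmap_lazy_nr counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one vmap area plus the global lazy counter. */
struct sim {
	bool lazy_free;	/* stands in for VM_LAZY_FREE in va->flags */
	long lazy_nr;	/* stands in for vmap_lazy_nr (signed) */
};

/* Free path, step 1: mark the area as lazily freed. */
static void mark_lazy_free(struct sim *s)
{
	s->lazy_free = true;
}

/* Free path, step 2: account the pages. An interrupt can delay this. */
static void account_lazy_free(struct sim *s, long pages)
{
	s->lazy_nr += pages;
}

/* Purge path: count flagged pages, then subtract them from the counter. */
static long purge(struct sim *s, long pages)
{
	long nr = 0;

	if (s->lazy_free) {
		nr = pages;
		s->lazy_free = false;
	}
	/* The removed check was the equivalent of BUG_ON(nr > s->lazy_nr). */
	s->lazy_nr -= nr;
	return nr;
}
```

Calling mark_lazy_free(), then purge(), then account_lazy_free() replays the race: the purge sees a flagged area whose pages are not yet accounted, so the counter dips below zero and returns to zero once the delayed increment lands. In this model, a transiently negative counter does no harm, which is the argument the commit makes for dropping the BUG_ON.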
2009-12-15vmalloc(): adjust gfp mask passed on nested vmalloc() invocationJan Beulich1-4/+3
- avoid wasting more precious resources (DMA or DMA32 pools), when being called through vmalloc_32{,_user}() - explicitly allow using high memory here even if the outer allocation request doesn't allow it Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-14Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c
2009-10-29vmalloc: fix use of non-existent percpu variable in put_cpu_var()Tejun Heo1-2/+2
vmalloc used non-existent percpu variable vmap_cpu_blocks instead of the intended vmap_block_queue. This went unnoticed because put_cpu_var() didn't evaluate the parameter. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-10-11headers: remove sched.h from interrupt.hAlexey Dobriyan1-0/+1
After m68k's task_thread_info() doesn't refer to current, it's possible to remove sched.h from interrupt.h and not break m68k! Many thanks to Heiko Carstens for allowing this. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-10-08Merge branch 'sparc-perf-events-fixes-for-linus' of ↵Linus Torvalds1-22/+26
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA perf_event: Provide vmalloc() based mmap() backing
2009-10-08mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBADavid Miller1-22/+26
When a vmalloc'd area is mmap'd into userspace, some kind of co-ordination is necessary for this to work on platforms with cpu D-caches which can have aliases. Otherwise kernel side writes won't be seen properly in userspace and vice versa. If the kernel side mapping and the user side one have the same alignment, modulo SHMLBA, this can work as long as VM_SHARED is set on the VMA, and for all current users this is true. VM_SHARED will force SHMLBA alignment of the user side mmap on platforms where D-cache aliasing matters. The bulk of this patch is just making it so that a specific alignment can be passed down into __get_vm_area_node(). All existing callers pass in '1' which preserves existing behavior. vmalloc_user() gives SHMLBA for the alignment. As a side effect this should get the video media drivers and other vmalloc_user() users into more working shape on such systems. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <200909211922.n8LJMYjw029425@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
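The "same alignment, modulo SHMLBA" condition can be made concrete with a little arithmetic. This is a rough userspace sketch under stated assumptions: EXAMPLE_SHMLBA is a made-up power-of-two value (the real SHMLBA is architecture specific), and align_up() and same_cache_colour() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical aliasing granule; the real SHMLBA depends on the platform. */
#define EXAMPLE_SHMLBA 0x4000UL

/* Round addr up to a power-of-two alignment; align == 1 changes nothing. */
static unsigned long align_up(unsigned long addr, unsigned long align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Two mappings land on the same cache lines when their offsets agree. */
static bool same_cache_colour(unsigned long kaddr, unsigned long uaddr)
{
	return (kaddr & (EXAMPLE_SHMLBA - 1)) == (uaddr & (EXAMPLE_SHMLBA - 1));
}
```

With an alignment of 1 the chosen kernel address is left as found, which matches the "existing callers pass in '1'" behavior. Aligning the kernel address to the granule gives it offset zero, the same offset that a granule-aligned user mapping has.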
2009-10-08mm: includecheck fix: vmalloc.cJaswinder Singh Rajput1-1/+0
fix the following 'make includecheck' warning: mm/vmalloc.c: linux/highmem.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register module area in generic wayKAMEZAWA Hiroyuki1-1/+1
Some archs define MODULES_VADDR/MODULES_END, which is not in the VMALLOC area. This is handled only on x86-64. This patch makes it more generic, so that vread/vwrite can be used to access the area. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22mm: replace various uses of num_physpages by totalram_pagesJan Beulich1-2/+2
Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22kcore: fix vread/vwrite to be aware of holesKAMEZAWA Hiroyuki1-23/+176
vread/vwrite access the vmalloc area without checking whether a page is present. In most cases, this works well. In the old days, the only caller of get_vm_area() was IOREMAP, and there was no memory hole within vm_struct's [addr...addr + size - PAGE_SIZE] (-PAGE_SIZE is for a guard page). After the per-cpu-alloc patch, get_vm_area() is used to reserve contiguous virtual addresses which are remapped _later_. So there tend to be holes in valid vmalloc areas on the vm_struct list, and skipping the holes (unmapped pages) is necessary. This patch updates vread/vwrite() to avoid memory holes. Routines which access the vmalloc area without knowing what an address is used for are - /proc/kcore - /dev/kmem kcore checks IOREMAP, /dev/kmem doesn't. After this patch, IOREMAP is checked and /dev/kmem will avoid reading/writing it. Fixes to /proc/kcore will be in the next patch in the series. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmalloc: unmap vmalloc area after hiding itKAMEZAWA Hiroyuki1-5/+9
vmap area should be purged after vm_struct is removed from the list because vread/vwrite etc...believes the range is valid while it's on vm_struct list. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mike Smith <scgtrp@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22vmalloc.c: fix double error checkingFigo.zhang1-3/+1
There is no need for double error checking. Signed-off-by: Figo.zhang <figo1802@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-14vmalloc: implement pcpu_get_vm_areas()Tejun Heo1-0/+293
To directly use spread NUMA memories for percpu units, percpu allocator will be updated to allow sparsely mapping units in a chunk. As the distances between units can be very large, this makes allocating a single vmap area for each chunk undesirable. This patch implements pcpu_get_vm_areas() and pcpu_free_vm_areas() which allocate and free sparse congruent vmap areas. pcpu_get_vm_areas() takes @offsets and @sizes arrays which define distances and sizes of vmap areas. It scans down from the top of the vmalloc area looking for the top-most address which can accommodate all the areas. The top-down scan is to avoid interacting with regular vmallocs which can push these congruent areas up little by little, ending up wasting address space and page table. To speed up the top-down scan, the highest possible address hint is maintained. Although the scan is linear from the hint, given the usual large holes between memory addresses between NUMA nodes, the scanning is highly likely to finish after finding the first hole for the last unit which is scanned first. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-08-14vmalloc: separate out insert_vmalloc_vm()Tejun Heo1-21/+24
Separate out insert_vmalloc_vm() from __get_vm_area_node(). insert_vmalloc_vm() initializes vm_struct from vmap_area and inserts it into vmlist. insert_vmalloc_vm() only initializes fields which can be determined from @vm, @flags and @caller The rest should be initialized by the caller. For __get_vm_area_node(), all other fields just need to be cleared and this is done by using kzalloc instead of kmalloc. This will be used to implement pcpu_get_vm_areas(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <npiggin@suse.de>
2009-06-11Merge branch 'for-linus' of git://linux-arm.org/linux-2.6Linus Torvalds1-3/+27
* 'for-linus' of git://linux-arm.org/linux-2.6: kmemleak: Add the corresponding MAINTAINERS entry kmemleak: Simple testing module for kmemleak kmemleak: Enable the building of the memory leak detector kmemleak: Remove some of the kmemleak false positives kmemleak: Add modules support kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash kmemleak: Add the vmalloc memory allocation/freeing hooks kmemleak: Add the slub memory allocation/freeing hooks kmemleak: Add the slob memory allocation/freeing hooks kmemleak: Add the slab memory allocation/freeing hooks kmemleak: Add documentation on the memory leak detector kmemleak: Add the base support Manual conflict resolution (with the slab/earlyboot changes) in: drivers/char/vt.c init/main.c mm/slab.c
2009-06-11vmalloc: use kzalloc() instead of alloc_bootmem()Pekka Enberg1-2/+1
We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of the bootmem allocator when initializing vmalloc data structures. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-06-11kmemleak: Add the vmalloc memory allocation/freeing hooksCatalin Marinas1-3/+27
This patch adds the callbacks to kmemleak_(alloc|free) functions from vmalloc/vfree. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-05-06alloc_vmap_area: fix memory leakRalph Wuerthner1-0/+1
If alloc_vmap_area() fails the allocated struct vmap_area has to be freed. Signed-off-by: Ralph Wuerthner <ralphw@linux.vnet.ibm.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01vmap: remove needless lock and list in vmapMinChan Kim1-16/+3
vmap's dirty_list is unused. It's for optimizing flushing, but Nick hasn't written the code yet, so we don't need it until such time as it is needed. This patch removes vmap_block's dirty_list and the code related to it. Signed-off-by: MinChan Kim <minchan.kim@gmail.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-04Merge branch 'x86/core' into core/percpuIngo Molnar1-1/+12
2009-03-01Merge branch 'x86/urgent' into x86/patIngo Molnar1-1/+9
2009-02-27mm: fix lazy vmap purging (use-after-free error)Vegard Nossum1-1/+2
I just got this new warning from kmemcheck: WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60) a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000 f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f ^ Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230) EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0 EIP is at __purge_vmap_area_lazy+0x117/0x140 EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66 ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: 00004000 DR7: 00000000 [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70 [<c1096f6a>] remove_vm_area+0x2a/0x70 [<c1097025>] __vunmap+0x45/0xe0 [<c10970de>] vunmap+0x1e/0x30 [<c1008ba5>] text_poke+0x95/0x150 [<c1008ca9>] alternatives_smp_unlock+0x49/0x60 [<c171ef47>] alternative_instructions+0x11b/0x124 [<c171f991>] check_bugs+0xbd/0xdc [<c17148c5>] start_kernel+0x2ed/0x360 [<c171409e>] __init_begin+0x9e/0xa9 [<ffffffff>] 0xffffffff It happened here: $ addr2line -e vmlinux -i c1096df7 mm/vmalloc.c:540 Code: list_for_each_entry(va, &valist, purge_list) __free_vmap_area(va); It's this instruction: mov 0x20(%ebx),%edx Which corresponds to a dereference of va->purge_list.next: (gdb) p ((struct vmap_area *) 0)->purge_list.next Cannot access memory at address 0x20 It seems that we should use "safe" list traversal here, as the element is freed inside the loop. Please verify that this is the right fix. Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
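The bug class reported here, reading an element's next pointer after the element has been freed, is easy to reproduce in miniature. The following userspace sketch uses a hypothetical singly linked struct node rather than the kernel's list_head. It shows the "safe" traversal pattern, where the successor is saved before the current element is released; the kernel's list_for_each_entry_safe() macro follows the same idea.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node, standing in for entries on a purge list. */
struct node {
	struct node *next;
	int value;
};

/* Push a new node on the front of the list and return the new head. */
static struct node *push(struct node *head, int value)
{
	struct node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->value = value;
	n->next = head;
	return n;
}

/* Free every node; returns how many were freed. */
static int free_all(struct node *head)
{
	struct node *n, *next;
	int freed = 0;

	for (n = head; n; n = next) {
		next = n->next;	/* read the link before the node goes away */
		free(n);
		freed++;
	}
	return freed;
}
```

An unsafe variant would advance with n = n->next after free(n), which is the same kind of read from freed memory that kmemcheck flagged in the report.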
2009-02-27mm: vmap fix overflowNick Piggin1-0/+7
The new vmap allocator can wrap the address and get confused in the case of large allocations or VMALLOC_END near the end of address space. Problem reported by Christoph Hellwig on a 32-bit XFS workload. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: Christoph Hellwig <hch@lst.de> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
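The wrap-around described here comes from unsigned arithmetic: when addr + size exceeds the largest representable address, the sum silently becomes a small number and passes a naive "fits below the end" test. A minimal sketch of a guarded range check follows, assuming a hypothetical helper range_fits(); it illustrates the kind of check involved and is not the code from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does [addr, addr + size) fit below vend?
 * Unsigned overflow is well defined in C, so a wrapped sum shows up
 * as a result smaller than the starting address.
 */
static bool range_fits(unsigned long addr, unsigned long size,
		       unsigned long vend)
{
	if (addr + size < addr)
		return false;	/* the sum wrapped past the top of the address space */
	return addr + size <= vend;
}
```

Without the first test, a request starting near ULONG_MAX would wrap to a low address and appear to fit, which matches the confusion the commit describes for large allocations or a VMALLOC_END near the end of the address space.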
2009-02-26Merge branches 'x86/apic', 'x86/defconfig', 'x86/memtest', 'x86/mm' and ↵Ingo Molnar1-0/+3
'linus' into x86/core
2009-02-25x86: make vmap yell louder when it is used under irqs_disabled()Peter Zijlstra1-0/+3
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-24Merge branch 'tj-percpu' of ↵Ingo Molnar1-3/+91
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu Conflicts: arch/x86/include/asm/pgtable.h
2009-02-24vmalloc: add @align to vm_area_register_early()Tejun Heo1-4/+7
Impact: allow larger alignment for early vmalloc area allocation Some early vmalloc users might want larger alignment, for example, for custom large page mapping. Add @align to vm_area_register_early(). While at it, drop docbook comment on non-existent @size. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
2009-02-20vmalloc: call flush_cache_vunmap() from unmap_kernel_range()Tejun Heo1-0/+2
Impact: proper vcache flush on unmap_kernel_range() flush_cache_vunmap() should be called before pages are unmapped. Add a call to it in unmap_kernel_range(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-20vmalloc: add un/map_kernel_range_noflush()Tejun Heo1-3/+64
Impact: two more public map/unmap functions Implement map_kernel_range_noflush() and unmap_kernel_range_noflush(). These functions respectively map and unmap an address range in the kernel VM area but don't do any vcache or TLB flushing. These will be used by the new percpu allocator. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au>
2009-02-20vmalloc: implement vm_area_register_early()Tejun Heo1-0/+24
Impact: allow multiple early vm areas There are places where a kernel VM area needs to be allocated before vmalloc is initialized. This is done by allocating a static vm_struct, initializing several fields and linking it to vmlist; later, vmalloc initialization picks these up from vmlist. This is currently done manually, and if there is more than one such area, there's no defined way to arbitrate who gets which address. This patch implements vm_area_register_early(), which takes a vm_area struct with flags and size initialized, assigns an address to it and puts it on the vmlist. This way, multiple early vm areas can determine which addresses they should use. The only current user - alpha mm init - is converted to use it. Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20vmalloc: call flush_cache_vunmap() from unmap_kernel_range()Tejun Heo1-0/+2
Impact: proper vcache flush on unmap_kernel_range() flush_cache_vunmap() should be called before pages are unmapped. Add a call to it in unmap_kernel_range(). Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-18vmalloc: add __get_vm_area_caller()Benjamin Herrenschmidt1-0/+8
We have get_vm_area_caller() and __get_vm_area() but not __get_vm_area_caller(). On powerpc, I use __get_vm_area() to separate the ranges of addresses given to vmalloc vs. ioremap (various good reasons for that), so in order to be able to implement the new caller tracking in /proc/vmallocinfo, I need a "_caller" variant of it. (akpm: needed for ongoing powerpc development, so merge it early) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15revert "mm: vmalloc use mutex for purge"Andrew Morton1-5/+4
Revert commit e97a630eb0f5b8b380fd67504de6cedebb489003 ("mm: vmalloc use mutex for purge")

Bryan Donlan reports:

: After testing 2.6.29-rc1 on xen-x86 with a btrfs root filesystem, I
: got the OOPS quoted below and a hard freeze shortly after boot.
: Boot messages and config are attached.
:
: ------------[ cut here ]------------
: Kernel BUG at c05ef80d [verbose debug info unavailable]
: invalid opcode: 0000 [#1] SMP
: last sysfs file: /sys/block/xvdc/size
: Modules linked in:
:
: Pid: 0, comm: swapper Not tainted (2.6.29-rc1 #6)
: EIP: 0061:[<c05ef80d>] EFLAGS: 00010087 CPU: 2
: EIP is at schedule+0x7cd/0x950
: EAX: d5aeca80 EBX: 00000002 ECX: 00000000 EDX: d4cb9a40
: ESI: c12f5600 EDI: d4cb9a40 EBP: d6033fa4 ESP: d6033ef4
: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
: Process swapper (pid: 0, ti=d6032000 task=d6020b70 task.ti=d6032000)
: Stack:
: 000d85bc 00000000 000186a0 00000000 0dd11410 c0105417 c12efe00 0dc367c3
: 00000011 c0105d46 d5a5d310 deadbeef d4cb9a40 c07cc600 c05f1340 c12e0060
: deadbeef d6020b70 d6020d08 00000002 c014377d 00000000 c12f5600 00002c22
: Call Trace:
: [<c0105417>] xen_force_evtchn_callback+0x17/0x30
: [<c0105d46>] check_events+0x8/0x12
: [<c05f1340>] _spin_unlock_irqrestore+0x20/0x40
: [<c014377d>] hrtimer_start_range_ns+0x12d/0x2e0
: [<c014c4f6>] tick_nohz_restart_sched_tick+0x146/0x160
: [<c0107485>] cpu_idle+0xa5/0xc0

and bisected it to this commit.

Let's remove it now while we have a think about the problem.

Reported-by: Bryan Donlan <bdonlan@gmail.com>
Tested-by: Christophe Saout <christophe@saout.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15alpha: fix vmalloc breakageIvan Kokshaysky1-0/+11
On alpha, we have to map some stuff in the VMALLOC space very early in the boot process (to make SRM console callbacks work and so on, see arch/alpha/mm/init.c). For old VM allocator, we just manually placed a vm_struct onto the global vmlist and this worked for ages. Unfortunately, the new allocator isn't aware of this, so it constantly tries to allocate the VM space which is already in use, making vmalloc on alpha defunct. This patch forces KVA to import vmlist entries on init. [akpm@linux-foundation.org: remove unneeded check (per Johannes)] Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Nick Piggin <npiggin@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc make lazy unmapping configurableNick Piggin1-0/+24
Lazy unmapping in the vmalloc code has now opened the possibility for use after free bugs to go undetected. We can catch those by forcing an unmap and flush (which is going to be slow, but that's what happens). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc use mutex for purgeNick Piggin1-4/+5
The vmalloc purge lock can be a mutex so we can sleep while a purge is going on (purge involves a global kernel TLB invalidate, so it can take quite a while). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc improve vmallocinfoGlauber Costa1-4/+8
If we do that, output of files like /proc/vmallocinfo will show things like "vmalloc_32", "vmalloc_user", or whoever the caller was as the caller. This info is not as useful as the real caller of the allocation. So the proposal is to call __vmalloc_node directly, with matching parameters, to save the caller information. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: vmalloc tweak failure printkGlauber Costa1-2/+3
If we can't service a vmalloc allocation, show size of the allocation that actually failed. Useful for debugging. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-04vmalloc.c: fix flushing in vmap_page_range()Adam Lackorzynski1-2/+3
The flush_cache_vmap in vmap_page_range() is called with the end of the range twice. The following patch fixes this for me. Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
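One plausible way to end up flushing "the end of the range twice" is a loop that advances the start cursor to the end before the flush is issued. The sketch below illustrates that pattern and the usual remedy (saving the start first). It is an assumption about the shape of the bug, not the actual patch; struct flush_range and map_then_flush_range() are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical description of the range handed to a cache flush. */
struct flush_range {
    unsigned long start;
    unsigned long end;
};

/*
 * Walk [addr, end) one page at a time and return the range that
 * should be flushed afterwards.
 */
struct flush_range map_then_flush_range(unsigned long addr,
                                        unsigned long end,
                                        unsigned long page_size)
{
    unsigned long start = addr; /* saved before the loop advances addr */

    assert(page_size > 0 && addr < end);
    assert(end <= ULONG_MAX - page_size); /* keep the walk free of wrap-around */

    while (addr < end)
        addr += page_size; /* stands in for filling one page-table entry */

    /* Returning { addr, end } here would describe an empty range. */
    return (struct flush_range){ start, end };
}
```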
2008-12-10KSYM_SYMBOL_LEN fixesHugh Dickins1-1/+1
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 ("sprint_symbol(): use less stack") exposing a bug in slub's list_locations() - kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of the page provided. The 100 slop which list_locations() allows at end of page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before. Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them. [akpm@linux-foundation.org: ftrace.h needs module.h] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Miles Lane <miles.lane@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01mm: vmalloc fix lazy unmapping cache aliasingNick Piggin1-4/+16
Jim Radford has reported that the vmap subsystem rewrite was sometimes causing his VIVT ARM system to behave strangely (seemed like going into infinite loops trying to fault in pages to userspace). We determined that the problem was most likely due to a cache aliasing issue. flush_cache_vunmap was only being called at the moment the page tables were to be taken down, however with lazy unmapping, this can happen after the page has subsequently been freed and allocated for something else. The dangling alias may still have dirty data attached to it. The fix for this problem is to do the cache flushing when the caller has called vunmap -- it would be a bug for them to write anything else to the mapping at that point. That appeared to solve Jim's problems. Reported-by: Jim Radford <radford@blackbean.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc search restart fixGlauber Costa1-2/+2
Currently, vmalloc restarts the search for a free area in case we can't find one. The reason is that there are areas which are lazily freed, and could possibly be freed now. However, the current implementation starts searching the tree from the last failing address, which is pretty much by definition at the end of the address space. So, we fail. The proposal of this patch is to restart the search from the beginning of the requested vstart address. This fixes the regression in running KVM virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349, caused by commit db64fe02258f1507e13fe5212a989922323685ce. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc failure flush fixNick Piggin1-2/+13
An initial vmalloc failure should start off a synchronous flush of lazy areas, in case someone is in progress flushing them already, which could cause us to return an allocation failure even if there is plenty of KVA free. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc allocator off by oneNick Piggin1-1/+1
Fix off by one bug in the KVA allocator that can leave gaps in the address space. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-07vmap: cope with vm_unmap_aliases before vmalloc_init()Jeremy Fitzhardinge1-0/+7
Xen can end up calling vm_unmap_aliases() before vmalloc_init() has been called. In this case it's safe to make it a simple no-op. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Linux Memory Management List <linux-mm@kvack.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06[ARM] fix naming of MODULE_START / MODULE_ENDRussell King1-1/+1
As of 73bdf0a60e607f4b8ecc5aec597105976565a84f, the kernel needs to know where modules are located in the virtual address space. On ARM, we located this region between MODULE_START and MODULE_END. Unfortunately, everyone else calls it MODULES_VADDR and MODULES_END. Update ARM to use the same naming, so is_vmalloc_or_module_addr() can work properly. Also update the comment on mm/vmalloc.c to reflect that ARM also places modules in a separate region from the vmalloc space. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2008-10-30mm: fix kernel-doc function notationRandy Dunlap1-1/+2
Delete excess kernel-doc notation in mm/ subdirectory. Actually this is a kernel-doc notation fix. Warning(/var/linsrc/linux-2.6.27-git10//mm/vmalloc.c:902): Excess function parameter or struct member 'returns' description in 'vm_map_ram' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23proc: move /proc/vmallocinfo to mm/vmalloc.cAlexey Dobriyan1-1/+32
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org>
2008-10-20mm: remove duplicated #include'sHuang Weiyi1-1/+0
Removed duplicated #include <linux/vmalloc.h> in mm/vmalloc.c and "internal.h" in mm/memory.c. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-20Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tipLinus Torvalds1-2/+16
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86 ACPI: fix breakage of resume on 64-bit UP systems with SMP kernel
  Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUAL
2008-10-20mm: rewrite vmap layerNick Piggin1-133/+842
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. 
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on a 4 core, 2 socket Opteron. Results are in nanoseconds per map+touch+unmap.

threads  vanilla  vmap rewrite
1        14700    2900
2        33600    3000
4        49500    2800
8        70631    2900

So with 8 cores, the rewritten version is already roughly 24x faster.

In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles.
Before:
 1352059 total                      0.1401
  798784 _write_lock             8320.6667 <- vmlist_lock
  529313 default_idle            1181.5022
   15242 smp_call_function         15.8771 <- vmap tlb flushing
    2472 __get_vm_area_node         1.9312 <- vmap
    1762 remove_vm_area             4.5885 <- vunmap
     316 map_vm_area                0.2297 <- vmap
     312 kfree                      0.1950
     300 _spin_lock                 3.1250
     252 sn_send_IPI_phys           0.4375 <- tlb flushing
     238 vmap                       0.8264 <- vmap
     216 find_lock_page             0.5192
     196 find_next_bit              0.3603
     136 sn2_send_IPI               0.2024
     130 pio_phys_write_mmr         2.0312
     118 unmap_kernel_range         0.1229

After:
   78406 total                      0.0081
   40053 default_idle              89.4040
   33576 ia64_spinlock_contention 349.7500
    1650 _spin_lock                17.1875
     319 __reg_op                   0.5538
     281 _atomic_dec_and_lock       1.0977
     153 mutex_unlock               1.5938
     123 iget_locked                0.1671
     117 xfs_dir_lookup             0.1662
     117 dput                       0.1406
     114 xfs_iget_core              0.0268
      92 xfs_da_hashname            0.1917
      75 d_alloc                    0.0670
      68 vmap_page_range            0.0462 <- vmap
      58 kmem_cache_alloc           0.0604
      57 memset                     0.0540
      52 rb_next                    0.1625
      50 __copy_user                0.0208
      49 bitmap_find_free_region    0.2188 <- vmap
      46 ia64_sn_udelay             0.1106
      45 find_inode_fast            0.1406
      42 memcmp                     0.2188
      42 finish_task_switch         0.1094
      42 __d_lookup                 0.0410
      40 radix_tree_lookup_slot     0.1250
      37 _spin_unlock_irqrestore    0.3854
      36 xfs_bmapi                  0.0050
      36 kmem_cache_free            0.0256
      35 xfs_vn_getattr             0.0322
      34 radix_tree_lookup          0.1062
      33 __link_path_walk           0.0035
      31 xfs_da_do_buf              0.0091
      30 _xfs_buf_find              0.0204
      28 find_get_page              0.0875
      27 xfs_iread                  0.0241
      27 __strncpy_from_user        0.2812
      26 _xfs_buf_initialize        0.0406
      24 _xfs_buf_lookup_pages      0.0179
      24 vunmap_page_range          0.0250 <- vunmap
      23 find_lock_page             0.0799
      22 vm_map_ram                 0.0087 <- vmap
      20 kfree                      0.0125
      19 put_page                   0.0330
      18 __kmalloc                  0.0176
      17 xfs_da_node_lookup_int     0.0086
      17 _read_lock                 0.0885
      17 page_waitqueue             0.0664

vmap has gone from being in the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
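The lazy unmapping idea in this message (defer the expensive global flush and let one flush cover many unmaps) can be sketched in userspace C as follows. Everything here is a simplification under stated assumptions: struct lazy_purge, lazy_unmap(), lazy_purge_now() and the LAZY_MAX_RANGES threshold are hypothetical, the flush is only counted rather than performed, and the real allocator's bookkeeping is not reproduced.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define LAZY_MAX_RANGES 4 /* hypothetical threshold, for illustration only */

struct lazy_purge {
    unsigned long start; /* lowest address among pending unmaps */
    unsigned long end;   /* highest end address among pending unmaps */
    size_t pending;      /* unmaps not yet covered by a flush */
    size_t flushes;      /* number of simulated global flushes */
};

void lazy_init(struct lazy_purge *lp)
{
    lp->start = ULONG_MAX;
    lp->end = 0;
    lp->pending = 0;
    lp->flushes = 0;
}

/* Flush everything pending with a single (simulated) global flush. */
void lazy_purge_now(struct lazy_purge *lp)
{
    if (lp->pending == 0)
        return;
    /* A real implementation would flush [lp->start, lp->end) here. */
    lp->flushes++;
    lp->pending = 0;
    lp->start = ULONG_MAX;
    lp->end = 0;
}

/* Record an unmapped range; only flush once enough have accumulated. */
void lazy_unmap(struct lazy_purge *lp, unsigned long start, unsigned long end)
{
    assert(start < end);
    if (start < lp->start)
        lp->start = start;
    if (end > lp->end)
        lp->end = end;
    lp->pending++;
    if (lp->pending >= LAZY_MAX_RANGES)
        lazy_purge_now(lp);
}
```

In this sketch, eight unmaps cost two flushes instead of eight, and a caller that cannot tolerate stale aliases can force the flush early with lazy_purge_now(), loosely mirroring the role the message gives to vm_unmap_aliases().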
2008-10-16Introduce is_vmalloc_or_module_addr() and use with DEBUG_VIRTUALLinus Torvalds1-2/+16
Impact: crash on module insertion with CONFIG_DEBUG_VIRTUAL We would incorrectly BUG due to: VIRTUAL_BUG_ON(!is_vmalloc_addr(vmalloc_addr) && !is_module_address(addr)); ... because, at least on x86-64, is_module_address() doesn't do what it should. This patch introduces is_vmalloc_or_module_addr(), which is what we really want anyway, and uses it instead. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
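A rough userspace sketch of the combined check this message describes: an address counts as valid if it falls in either the module region or the vmalloc region. The struct kva_layout type and both function names are hypothetical, and the real region bounds are architecture specific and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of two kernel virtual address regions. */
struct kva_layout {
    uint64_t vmalloc_start, vmalloc_end; /* [start, end) */
    uint64_t modules_start, modules_end; /* [start, end) */
};

bool in_vmalloc(const struct kva_layout *l, uint64_t addr)
{
    return addr >= l->vmalloc_start && addr < l->vmalloc_end;
}

/* Accept addresses from the module region as well as from vmalloc space. */
bool in_vmalloc_or_module(const struct kva_layout *l, uint64_t addr)
{
    assert(l->modules_start < l->modules_end);
    if (addr >= l->modules_start && addr < l->modules_end)
        return true;
    return in_vmalloc(l, addr);
}
```

A check built only on in_vmalloc() would reject module addresses whenever modules live outside the vmalloc range, which is the situation the message reports for x86-64.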
2008-10-12Merge branches 'x86/xen', 'x86/build', 'x86/microcode', 'x86/mm-debug-v2', 'x86/memory-corruption-check', 'x86/early-printk', 'x86/xsave', 'x86/ptrace-v2', 'x86/quirks', 'x86/setup', 'x86/spinlocks' and 'x86/signal' into x86/core-v2Ingo Molnar1-0/+7
2008-07-26Use WARN() in mm/vmalloc.cArjan van de Ven1-4/+2
Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes part of the warning section for better reporting/collection. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24vmallocinfo: add NUMA informationEric Dumazet1-0/+20
Christoph recently added the /proc/vmallocinfo file to get information about vmalloc allocations.

This patch adds NUMA specific information, giving the number of pages allocated on each memory node. This should help to check that vmalloc() is able to respect NUMA policies.

Example of output on a four-node machine (one cpu per node):

1) network hash tables are evenly spread on four nodes (OK) (same point for inodes and dentries hash tables)
2) iptables tables (x_tables) are correctly allocated on each cpu node (OK).
3) sys_swapon() allocates its memory from one node only.
4) each loaded module is using memory on one node.

Sysadmins could tune their setup to change points 3) and 4) if necessary.

grep "pages=" /proc/vmallocinfo

0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128
0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc2000031a000-0xffffc2000031d000 12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1
0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3
0xffffc2000033e000-0xffffc20000341000 12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2
0xffffc20000341000-0xffffc20000344000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2
0xffffc20000344000-0xffffc20000347000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2
0xffffc20000347000-0xffffc2000034a000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2
0xffffc2000034a000-0xffffc2000034d000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2
0xffffc20004381000-0xffffc20004402000 528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32
0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256
0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64
0xffffc20004904000-0xffffc20004bec000 3047424 sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743
0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14
0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4
0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2
0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10
0xffffffffa0022000-0xffffffffa0028000 24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5
0xffffffffa0028000-0xffffffffa0050000 163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39
0xffffffffa0050000-0xffffffffa0052000 8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1
0xffffffffa0052000-0xffffffffa0056000 16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3
0xffffffffa0056000-0xffffffffa0081000 176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42
0xffffffffa0081000-0xffffffffa00ae000 184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44
0xffffffffa00ae000-0xffffffffa00b1000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00b1000-0xffffffffa00b9000 32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7
0xffffffffa00b9000-0xffffffffa00c4000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10
0xffffffffa00c6000-0xffffffffa00e0000 106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25
0xffffffffa00e0000-0xffffffffa00f1000 69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16
0xffffffffa00f1000-0xffffffffa00f4000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2
0xffffffffa00f4000-0xffffffffa00f7000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
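The per-node suffix in the output above (for example "N1=1 N2=1") amounts to counting pages per node and printing only the non-zero counters. A small userspace sketch follows, assuming the node of each page is already known; format_node_counts(), node_of_page and EX_MAX_NODES are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define EX_MAX_NODES 4 /* hypothetical node count, matching the example machine */

/*
 * Count pages per node and write "N<node>=<count>" pairs for the
 * non-empty nodes into buf, separated by single spaces.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int format_node_counts(const int *node_of_page, size_t nr_pages,
                       char *buf, size_t len)
{
    unsigned int counts[EX_MAX_NODES] = { 0 };
    size_t used = 0;

    assert(len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < nr_pages; i++) {
        assert(node_of_page[i] >= 0 && node_of_page[i] < EX_MAX_NODES);
        counts[node_of_page[i]]++;
    }

    for (int n = 0; n < EX_MAX_NODES; n++) {
        int w;

        if (counts[n] == 0)
            continue; /* nodes holding no pages are omitted */
        w = snprintf(buf + used, len - used, "%sN%d=%u",
                     used ? " " : "", n, counts[n]);
        if (w < 0 || (size_t)w >= len - used)
            return -1; /* output would have been truncated */
        used += (size_t)w;
    }
    return (int)used;
}
```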
2008-06-19x86, MM: virtual address debug, cleanupsIngo Molnar1-2/+4
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-19MM: virtual address debugJiri Slaby1-0/+5
Add some (configurable) expensive sanity checking to catch wrong address translations on x86.

- create linux/mmdebug.h file to be able to include this file in asm headers without getting unsolvable loops in header files
- __phys_addr on x86_32 became a function in ioremap.c since PAGE_OFFSET, is_vmalloc_addr and VMALLOC_* non-constants are undefined if declared in page_32.h
- add __phys_addr_const for initializing doublefault_tss.__cr3

Tested on 386, 386pae, x86_64 and x86_64 numa=fake=2.

Contains Andi's enable numa virtual address debug patch.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-05-01docbook: fix vmalloc missing parameter notationRandy Dunlap1-0/+1
Fix vmalloc kernel-doc warning: Warning(linux-2.6.25-git14//mm/vmalloc.c:555): No description found for parameter 'caller' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30infrastructure to debug (dynamic) objectsThomas Gleixner1-0/+2
We can see an ever repeating problem pattern with objects of any kind in the kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot is kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic.

While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code.

The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter.

The problems described above are not restricted to timers, but timers tend to expose them, usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug.

Instead of creating specialized debugging code for the timer subsystem, a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble.

The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations, and provides additional checks whenever kernel memory is freed.

The tracked object operations are:
- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed, and the subsystem specific code can provide a fixup function which makes it possible to prevent the damage of the operation. When the sanity check triggers, a warning message and a stack trace are printed.

The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers).

The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has its own spinlock to avoid contention on a global lock.

The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code.

Thanks to Ingo Molnar for review, suggestions and cleanup patches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
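Deriving the hash bucket from the object's address can be sketched as below. This is an illustrative assumption, not the debugobjects implementation: the chunk size, the bucket count and the multiplicative hash are all hypothetical choices, and bucket_index() is an invented name.

```c
#include <assert.h>
#include <stdint.h>

#define EX_CHUNK_SHIFT 12u /* hypothetical: addresses are grouped in 4 KiB chunks */
#define EX_HASH_BITS 6u    /* hypothetical: 64 buckets */
#define EX_NBUCKETS (1u << EX_HASH_BITS)

/*
 * Map an object address to a bucket index. All addresses inside one
 * chunk share a bucket, so a freed memory range can be checked by
 * visiting one bucket per chunk it covers.
 */
unsigned int bucket_index(uintptr_t addr)
{
    uint64_t chunk = (uint64_t)addr >> EX_CHUNK_SHIFT;
    /* Multiplicative (Fibonacci) hashing: keep the top EX_HASH_BITS bits. */
    unsigned int idx = (unsigned int)((chunk * UINT64_C(0x9E3779B97F4A7C15))
                                      >> (64 - EX_HASH_BITS));

    assert(idx < EX_NBUCKETS);
    return idx;
}
```

With one lock per bucket, lookups for objects that hash to different buckets do not contend with each other, which is the motivation the message gives for avoiding a single global lock.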
2008-04-28vmallocinfo: add caller informationChristoph Lameter1-18/+47
Add caller information so that /proc/vmallocinfo shows where the allocation request for a slice of vmalloc memory originated. Results in output like this:

0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
0xffffc20000801000-0xffffc20000806000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
0xffffc20000c07000-0xffffc20000c0a000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
0xffffc20000c0a000-0xffffc20000c0c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c0c000-0xffffc20000c0f000 12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
0xffffc20000c10000-0xffffc20000c15000 20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
0xffffc20000c16000-0xffffc20000c18000 8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
0xffffc20000c18000-0xffffc20000c1a000 8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
0xffffc20000c1a000-0xffffc20000c1c000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c1c000-0xffffc20000c1e000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c1e000-0xffffc20000c20000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c20000-0xffffc20000c22000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c22000-0xffffc20000c24000 8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c24000-0xffffc20000c26000 8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
0xffffc20000c26000-0xffffc20000c28000 8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
0xffffc20000c28000-0xffffc20000c2d000 20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
0xffffc20000c2d000-0xffffc20000c31000 16384 tcp_init+0xd5/0x31c pages=3 vmalloc
0xffffc20000c31000-0xffffc20000c34000 12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
0xffffc20000c34000-0xffffc20000c36000 8192 init_vdso_vars+0xde/0x1f1
0xffffc20000c36000-0xffffc20000c38000 8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
0xffffc20000c38000-0xffffc20000c3a000 8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
0xffffc20000c3a000-0xffffc20000c3e000 16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
0xffffc20000c40000-0xffffc20000c61000 135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
0xffffc20000c61000-0xffffc20000c6a000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c6a000-0xffffc20000c73000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c73000-0xffffc20000c7c000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c7c000-0xffffc20000c7f000 12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
0xffffc20002204000-0xffffc2000220d000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc2000220d000-0xffffc20002216000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002216000-0xffffc2000221f000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc2000221f000-0xffffc20002228000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002228000-0xffffc20002231000 36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002231000-0xffffc20002234000 12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
0xffffc20002240000-0xffffc20002261000 135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
0xffffffffa0000000-0xffffffffa0022000 139264 module_alloc+0x4f/0x55 pages=33 vmalloc
0xffffffffa0022000-0xffffffffa0029000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc
0xffffffffa002b000-0xffffffffa0034000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
0xffffffffa0034000-0xffffffffa003d000 36864 module_alloc+0x4f/0x55 pages=8 vmalloc
0xffffffffa003d000-0xffffffffa0049000 49152 module_alloc+0x4f/0x55 pages=11 vmalloc
0xffffffffa0049000-0xffffffffa0050000 28672 module_alloc+0x4f/0x55 pages=6 vmalloc

[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
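The output format above (address range, size in bytes, then the caller symbol) can be consumed from userspace with a small parser. The following is a minimal sketch, not part of the patch: the struct `vminfo_entry` and the function `parse_vmallocinfo_line` are hypothetical names, and the code assumes the fourth whitespace-separated token is the caller, as in the sample lines of this commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one /proc/vmallocinfo line (not a kernel type). */
struct vminfo_entry {
	unsigned long long start;	/* first address of the area */
	unsigned long long end;		/* one past the last address */
	unsigned long long size;	/* size in bytes as printed */
	char caller[128];		/* fourth token, empty if the line ends early */
};

/*
 * Parse "0xSTART-0xEND SIZE [CALLER] ..." into *e.
 * Returns 0 on success, -1 if the three leading fields are missing.
 * Assumption: the token after SIZE is the caller symbol; trailing
 * fields such as "pages=6" or "vmalloc" are ignored here.
 */
static int parse_vmallocinfo_line(const char *line, struct vminfo_entry *e)
{
	int n;

	memset(e, 0, sizeof(*e));
	/* %llx accepts an optional "0x" prefix, like strtoull with base 16. */
	n = sscanf(line, "%llx-%llx %llu %127s",
		   &e->start, &e->end, &e->size, e->caller);
	return n < 3 ? -1 : 0;
}
```

Fed the last sample line above, this yields a range whose length equals the printed size (28672 bytes) and the caller string `module_alloc+0x4f/0x55`.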
2008-04-28vmalloc: show vmalloced areas via /proc/vmallocinfoChristoph Lameter1-1/+75
Implement a new proc file that displays the currently allocated vmalloc memory, so that the users of vmalloc can be seen. That is important if vmalloc space is scarce (i386, for example), and it is going to be important for the compound page fallback to vmalloc. Many of the current users can be switched to use compound pages with fallback. This means that the number of users of vmalloc is reduced and page tables are no longer necessary to access the memory. /proc/vmallocinfo makes it possible to review how that reduction occurs. If memory becomes fragmented and larger-order allocations are no longer possible, then /proc/vmallocinfo shows which compound page allocations fell back to virtual compound pages. That is important for new users of virtual compound pages, such as order-1 stack allocations, which may fall back to virtual compound pages in the future. /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible information leakage. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: CONFIG_MMU=n build fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19mm: fix various kernel-doc commentsRandy Dunlap1-2/+4
Fix various kernel-doc notation in mm/:

filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08CONFIG_HIGHPTE vs. sub-page page tables.Martin Schwidefsky1-1/+1
Background: I've implemented 1K/2K page tables for s390. These sub-page page tables are required to properly support the s390 virtualization instruction with KVM. The SIE instruction requires that the page tables have 256 page table entries (pte) followed by 256 page status table entries (pgste). The pgstes are only required if the process is using the SIE instruction. The pgstes are updated by the hardware and by the hypervisor for a number of reasons, one of them is dirty and reference bit tracking. To avoid wasting memory the standard pte table allocation should return 1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means the s390 version for pte_alloc_one cannot return a pointer to a struct page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one cannot return a pointer to a pte either, since that would require more than 32 bit for the return value of pte_alloc_one (and the pte * would not be accessible since it's not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a later patch. For everybody else it will be a (struct page *). The additional problem with the initialization of the ptl lock and the NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and a destructor pgtable_page_dtor. The page table allocation and free functions need to call these two whenever a page table page is allocated or freed. pmd_populate will get a pgtable_t instead of a struct page pointer. To get the pgtable_t back from a pmd entry that has been installed with pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page call in free_pte_range and apply_to_pte_range.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05mm: don't allow ioremapping of ranges larger than vmalloc spaceRobert Bragg1-0/+4
When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the vmlist search routine in __get_vm_area_node can mistakenly allow a driver to ioremap a range larger than vmalloc space. If at the time of the ioremap all existing vmlist areas sit below the determined alignment then the search routine continues past all entries and exits the for loop - straight into the found: label - without ever testing for integer wrapping or that the requested size fits. We were seeing a driver successfully ioremap 128M of flash even though there was only 120M of vmalloc space. From that point the system was left with the remainder of the first 16M of space to vmalloc/ioremap within. Signed-off-by: Robert Bragg <robert@sixbynine.org> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
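The missing test described above has two parts: the sum of address and size must not wrap, and the resulting range must fit inside the vmalloc window. A minimal userspace sketch of such a check follows; `range_fits` and its parameters are illustrative names, not kernel identifiers, and this is not the code the patch adds to `__get_vm_area_node`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [addr, addr + size) lies inside the window
 * [start, end) and addr + size does not wrap around.
 * Hypothetical helper for illustration only.
 */
static bool range_fits(uintptr_t addr, uintptr_t size,
		       uintptr_t start, uintptr_t end)
{
	if (end < start || size == 0 || addr < start)
		return false;
	/* Integer wrap: addr + size would exceed UINTPTR_MAX. */
	if (size > UINTPTR_MAX - addr)
		return false;
	/* The request must end at or before the end of the window. */
	return addr + size <= end;
}
```

With a 120M window, a 128M request is rejected no matter where the search loop left `addr`, which is the case the commit message reports as slipping through.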
2008-02-05make __vmalloc_area_node() staticAdrian Bunk1-2/+2
__vmalloc_area_node() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmalloc: clean up page array indexingChristoph Lameter1-5/+11
The page array is repeatedly indexed both in vunmap and vmalloc_area_node(). Add a temporary variable to make it easier to read (and easier to patch later). Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05vmalloc: add const to void* parametersChristoph Lameter1-8/+8
Make vmalloc functions work the same way as kfree() and friends that take a const void * argument. [akpm@linux-foundation.org: fix consts, coding-style] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Move vmalloc_to_page() to mm/vmalloc.Christoph Lameter1-0/+38
We already have page table manipulation for vmalloc in vmalloc.c. Move the vmalloc_to_page() function there as well. Move the definitions for vmalloc related functions in mm.h to a newly created section. A better place would be vmalloc.h but mm.h is basic and may depend on these functions. An alternative would be to include vmalloc.h in mm.h (like done for vmstat.h). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-20spelling fixes: mm/Simon Arlott1-3/+3
Spelling fixes in mm/. Signed-off-by: Simon Arlott <simon@fire.lp0.eu> Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-16Categorize GFP flagsChristoph Lameter1-2/+3
The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags:

GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.
GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.
GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.

These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c.

The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator.

Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context.

[akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
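The effect of splitting the flags into categories can be sketched in userspace as a simple mask operation. Everything prefixed `DEMO_` below is a made-up stand-in with arbitrary bit values, and `metadata_flags` is a hypothetical helper; the sketch only assumes, as the message states, that constraint flags such as __GFP_THISNODE and __GFP_HARDWALL are not passed through for allocator metadata.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Arbitrary demo bit values; the real kernel values differ. */
#define DEMO_GFP_WAIT		(1u << 0)
#define DEMO_GFP_IO		(1u << 1)
#define DEMO_GFP_FS		(1u << 2)
#define DEMO_GFP_HARDWALL	(1u << 3)
#define DEMO_GFP_THISNODE	(1u << 4)
#define DEMO_GFP_HIGHMEM	(1u << 5)

/* Flags that control reclaim behavior (subset, for illustration). */
#define DEMO_RECLAIM_MASK	(DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)
/* Flags that limit where an allocation may be placed. */
#define DEMO_CONSTRAINT_MASK	(DEMO_GFP_HARDWALL | DEMO_GFP_THISNODE)

/*
 * Flags to use for allocator metadata: keep the caller's reclaim
 * behavior, drop placement constraints and everything else.
 */
static gfp_t metadata_flags(gfp_t caller_flags)
{
	return caller_flags & DEMO_RECLAIM_MASK;
}
```

Keeping only the reclaim category means the metadata allocation may still sleep or not, exactly as the caller requested, while being free to fall back to another node.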
2007-07-19lguest: export symbols for lguest as a moduleRusty Russell1-0/+2
lguest does some fairly lowlevel things to support a host, which normal modules don't need:

math_state_restore: When the guest triggers a Device Not Available fault, we need to be able to restore the FPU

__put_task_struct: We need to hold a reference to another task for inter-guest I/O, and put_task_struct() is an inline function which calls __put_task_struct.

access_process_vm: We need to access another task for inter-guest I/O.

map_vm_area & __get_vm_area: We need to map the switcher shim (ie. monitor) at 0xFFC01000.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19vmalloc_32 should use GFP_KERNELBenjamin Herrenschmidt1-2/+2
I've noticed lots of failures of vmalloc_32 on machines where it shouldn't have failed unless it was doing an atomic operation. Looking closely, I noticed that:

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA32
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA
#else
#define GFP_VMALLOC32 GFP_KERNEL
#endif

Which seems to be incorrect, it should always -or- in the DMA flags on top of GFP_KERNEL, thus this patch. This fixes frequent errors launching X with the nouveau DRM for example.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
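Why a bare zone flag behaves like an atomic allocation can be shown with a small userspace model. The `DEMO_` constants use arbitrary bit values and `may_sleep` is a hypothetical helper; the sketch assumes only what the message implies, namely that GFP_KERNEL carries the bit that permits sleeping and a zone flag on its own does not.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Arbitrary demo bit values; not the kernel's definitions. */
#define DEMO_GFP_WAIT	(1u << 0)	/* allocation may sleep */
#define DEMO_GFP_IO	(1u << 1)
#define DEMO_GFP_FS	(1u << 2)
#define DEMO_GFP_DMA32	(1u << 3)	/* zone selector only */

#define DEMO_GFP_KERNEL	(DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/* Before: the zone flag replaces GFP_KERNEL entirely. */
#define DEMO_VMALLOC32_OLD	DEMO_GFP_DMA32
/* After: the zone flag is OR-ed on top of GFP_KERNEL. */
#define DEMO_VMALLOC32_NEW	(DEMO_GFP_DMA32 | DEMO_GFP_KERNEL)

/* True if an allocation with these flags is allowed to sleep. */
static bool may_sleep(gfp_t flags)
{
	return (flags & DEMO_GFP_WAIT) != 0;
}
```

In this model the old definition cannot sleep and so cannot wait for reclaim, which matches the failures reported above; the new definition keeps the zone restriction and regains normal GFP_KERNEL behavior.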
2007-07-18Allocate and free vmalloc areasJeremy Fitzhardinge1-0/+53
Allocate/release a chunk of vmalloc address space: alloc_vm_area reserves a chunk of address space, and makes sure all the pagetables are constructed for that address range - but no pages. free_vm_area releases the address space range. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Ian Pratt <ian.pratt@xensource.com> Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: "Jan Beulich" <JBeulich@novell.com> Cc: "Andi Kleen" <ak@muc.de>
2007-07-17Slab allocators: Replace explicit zeroing with __GFP_ZEROChristoph Lameter1-3/+3
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing variant in the past. But with __GFP_ZERO it is possible now to do zeroing while allocating. Use __GFP_ZERO to remove the explicit clearing of memory via memset wherever we can. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-14[POWERPC] unmap_vm_area becomes unmap_kernel_range for the publicBenjamin Herrenschmidt1-4/+9
This makes unmap_vm_area static and a wrapper around a new exported unmap_kernel_range that takes an explicit range instead of a vm_area struct. This makes it more versatile for code that wants to play with kernel page tables outside of the standard vmalloc area. (One example is some rework of the PowerPC PCI IO space mapping code that depends on that patch and removes some code duplication and horrible abuse of forged struct vm_struct). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-17Make __vunmap staticBenjamin Herrenschmidt1-1/+1
__vunmap doesn't seem to be used outside of mm/vmalloc.c, and has no prototype in any header so let's make it static Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08move die notifier handling to common codeChristoph Hellwig1-0/+7
This patch moves the die notifier handling to common code. Previously, various architectures had exactly the same code for it. Note that the new code is compiled unconditionally; this should be understood as an appeal to the other architecture maintainers to implement support for it as well (aka sprinkling a notify_die or two in the proper place). arm had a notify_die that did something totally different; I renamed it to arm_notify_die as part of the patch and made it static to the file it's declared and used in. avr32 used to pass slightly less information through this interface and I brought it into line with the other architectures. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: fix vmalloc_sync_all bustage] [bryan.wu@analog.com: fix vmalloc_sync_all in nommu] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Bryan Wu <bryan.wu@analog.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-02[PATCH] x86-64: Fix vmalloc_32 to really allocate <4GB on 64bit platformsAndi Kleen1-3/+11
Ugly ifdef, but should handle all 64bit platforms that have suitable zones. On some like Altix it's probably impossible without IOMMU use to get memory <4GB this way, but they have to live with that. Signed-off-by: Andi Kleen <ak@suse.de>
2007-02-11[PATCH] Numerous fixes to kernel-doc info in source files.Robert P. J. Day1-1/+1
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in source files, including:

* make multi-line initial descriptions single line
* denote some function names, constants and structs as such
* change erroneous opening '/*' to '/**' in a few places
* reword some text for clarity

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-11-16[PATCH] Fix strange size check in __get_vm_area_node()OGAWA Hirofumi1-3/+2
Recently, __get_vm_area_node() was changed like following

	if (unlikely(!area))
		return NULL;

-	if (unlikely(!size)) {
-		kfree (area);
+	if (unlikely(!size))
		return NULL;
-	}

It is leaking `area', also original code seems strange already. Probably, we wanted to do this patch.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13[PATCH] vmalloc: optimization, cleanup, bugfixesEric Dumazet1-13/+13
- reorder 'struct vm_struct' to speedup lookups on CPUS with small cache lines. The fields 'next,addr,size' should be now in the same cache line, to speedup lookups.

- One minor cleanup in __get_vm_area_node()

- Bugfixes in vmalloc_user() and vmalloc_32_user() NULL returns from __vmalloc() and __find_vm_area() were not tested.

[akpm@osdl.org: remove redundant BUG_ONs] Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29[PATCH] Fix GFP_HIGHMEM slab panicGiridhar Pemmasani1-1/+1
As reported by Martin J. Bligh <mbligh@google.com>, we let through some non-slab bits to slab allocation through __get_vm_area_node when doing a vmalloc. I haven't been able to reproduce this, although I understand why it happens: vmalloc allocates memory with GFP_KERNEL | __GFP_HIGHMEM and commit 52fd24ca1db3a741f144bbc229beefe044202cac resulted in the same flags being passed down to cache_alloc_refill, causing the BUG. The following patch fixes it. Note that when calling kmalloc_node, I am masking off __GFP_HIGHMEM with GFP_LEVEL_MASK, whereas __vmalloc_area_node does the same with ~(__GFP_HIGHMEM | __GFP_ZERO). IMHO, using GFP_LEVEL_MASK is preferable, but either should fix this problem. Signed-off-by: Giridhar Pemmasani (pgiri@yahoo.com) Cc: Martin J. Bligh <mbligh@google.com> Cc: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28[PATCH] __vmalloc with GFP_ATOMIC causes 'sleeping from invalid context'Giridhar Pemmasani1-7/+11
If __vmalloc is called to allocate memory with GFP_ATOMIC in atomic context, the chain of calls results in __get_vm_area_node allocating memory for vm_struct with GFP_KERNEL, causing the 'sleeping from invalid context' warning. This patch fixes it by passing the gfp flags along so __get_vm_area_node allocates memory for vm_struct with the same flags. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17[PATCH] vmalloc(): don't pass __GFP_ZERO to slabAndrew Morton1-2/+5
A recent change to the vmalloc() code accidentally resulted in us passing __GFP_ZERO into the slab allocator. But we only wanted __GFP_ZERO for the actual pages which are being vmalloc()ed, and passing __GFP_ZERO into slab is not a rational thing to ask for. Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03Spelling fix: "control" instead of "cotrol"Michael Opdenacker1-3/+3
This patch fixes a spelling mistake ("control" instead of "cotrol"). Signed-off-by: Michael Opdenacker <michael@free-electrons.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-09-27[PATCH] Mark __remove_vm_area() staticRolf Eike Beer1-1/+1
The function is exported but not used from anywhere else. It's also marked as "not for driver use" so no one out there should really care. Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>