In original code, we kmalloc a lot of memory when we open cgroup.tasks.
_This memory is allocated by kmalloc_, So this memory is not statistical
for this process or this user. This is a critical bug.
An unprivileged user(attacker) may use all his available processes
to enlarge cgroup.tasks file and use all his available open-files
to open cgroup.tasks many many times. Thus he will _steal_ a lot of
memory.
If the size of memory that he steal is large than system's memory,
the system will oops.
This script will show how many memory be used when open a file 1000 times
--------------------
#!/bin/sh
file=$1
use_mem=$(free -k | awk '{if(NR==2) print $3}')
i=24
while ((i<1024))
do
eval "exec $i<$file"
((i++))
done
use_mem_afer_open=$(free -k | awk '{if(NR==2) print $3}')
echo Use about $((use_mem_afer_open - use_mem))K physical memories \
for open $file 1000 times
-----------------------
$ ./memory.sh /proc/cpuinfo
Use about 156K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 0K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about -36K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 0K physical memories for open /proc/cpuinfo 1000 times
$ ./memory.sh /proc/cpuinfo
Use about 56K physical memories for open /proc/cpuinfo 1000 times
----------------------
$ wc -l /dev/cpuset/tasks
1163
$ ./memory.sh /dev/cpuset/tasks
Use about 7896K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8008K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8104K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 7756K physical memories for open /dev/cpuset/tasks 1000 times
$ ./memory.sh /dev/cpuset/tasks
Use about 8104K physical memories for open /dev/cpuset/tasks 1000 times
-----------------------
This patch will fix ...It's true that this does represent a way for an untrusted user to consume a big chunk of memory on a machine. It's been around since cpusets was introduced. I'm not sure that it's more serious than say, a fork bomb or allocating lots of memory via lots of sockets with huge network buffers. I don't see how this can cause an oops - slab.c and slub.c both appear to just return NULL in that case, from a quick inspection. But improving the current code has been on my todo list for a while. My thought on the best way to fix it is to use a proper seq_file - that would remove the issues that you have currently with the "unconsumed" portion of the output. Using a seq_file iterator with your single pid array per cgroup might work quite nicely - but isn't there still a problem that with a really large cgroup we can overflow the kmalloc() limit for a single pid array? My original thoughts for solving this involved using a prio_heap to ensure that every task in the cgroup was seen once: 1) similar to cgroup_scan_tasks, fill a prio_heap with the first N (by task start time) tasks in the cgroup 2) sort these tasks by pid 3) report these by seq_file 4) once all the pids in the heap have been reported, update the minimum task start time to be greater than the last task start time in the reported group, and goto 1 This way we never have to allocate more than a single prio_heap per open file; the downside is that we have more overhead when reporting on really large cgroups since we have to make multiple scans of the cgroup's task list. Paul --
IMO This is very different. The number of creating processes is statistical,
The number of opening sockets is statistical, The kernel(OOM-killer ...)
or system manager(use ulimit or other tools) can control these behaviors.
But kmalloc a big chunk of memory is not statistical.
I failed to try my best to rescue my system in this condition in my testes.
(But this is true: my skill is so few that I can not rescue any system)
oops - I means that the system is very hard to be rescued. The kernel
do not know which process should be killed for no statistical result.
system manager may know which process should be killed(use lsof or other tools)
but he has no memory to do any operation.
oops - My system was not responded after some printk's messages for other
But for my poor knowledge we always have to handle "unconsumed portion"
for nonseekable file in all different way. In my binary search based
solving, I have to handle unconsumed part for a pid and bring in
It's complicated than necessary and change too much code IMO.
This solving is nonseekable also(for multi-scanning).
And the result of tasks file is not sorted(for multi-scanning).
Steps in my solving.
1) deal with the unconsumed part of the just read pid
2) binary search
3) deal with the unconsumed part of file(unconsumed pids).
step 2 ensure that every task in the cgroup was seen once and
ensure that we do not need scan again, the result of tasks file
is sorted.
Thank you for reviewing and comments.
Lai
--
I meant that the seq_file code could handle the "unconsumed" portion for you, rather than you having to manage it explicitly. Paul --
What about the problem that maintaining a single pid array can still fail for a really large cgroup? I guess we could just say "don't create such large cgroups" but someone's bound to want to do that. Perhaps use an array of pages rather than a single large kmalloc? Paul --
Actually, I had a plan to write such a patch: [RFC PATCH] cgroup,cpuset: use alternative malloc instead of kmalloc The main idea is: when allocate size >= PAGE_SIZE, vmalloc will be used instead. This will reduce the stress when continuous pages are few. Alternative malloc is used for cgroup_tasks_open() and update_tasks_nodemask(). And vmalloc can malloc larger memory than kmalloc, is vmalloc() enough? If not, I think using an array of pages is the best choice. [There are several subsystem who use alternative malloc. kernel/relay.c for example. relay.c is also using an array of pages for relay buffer. ] Lai --
vmalloc would be simpler, certainly, but it has a higher overhead. And since we're dealing with arrays of integers here, it's not too hard to manage multiple arrays. Oh, except for sorting them, which would be more of a pain. So yes, maybe vmalloc() would be a better choice at first. Paul --
Yep, these are hard. Which method is your favorite?
My original purpose was to fix a bug as I described.
This bug and the problem that offering big enough array for a huge
cgroup are orthogonal!
Could you consider/test that is it a bug as I described(and is it as
critical as I described, maybe I was too nervous)?
And this is also a problem: opening a cgroup.tasks twice or will waste
a lot of _physical_ memory.
Thanks! Lai
--
Hi Lai,
Sorry for the delay, I've been away on vacation.
You're right. So solving them separately seems fine.
It's definitely a bug that can usefully be fixed, but I think it's not quite as critical as you suggest.
I think the simplest approach is to use your basic idea of a shared pid array, but using a standard seq_file rather than hand-coded logic for handling partial reads. Something like the patch below.
Other possible extensions might include:
- delay generating the pid array until the first read
- allocate an array of single pages rather than a single kmalloc() region
Paul
Convert cgroup tasks file to use a seq_file with shared pid array
Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace. All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array. By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.
[ Based on a patch by Lai Jiangshan, extended to use seq_file ]
Signed-off-by: Paul Menage <menage@google.com>
---
include/linux/cgroup.h | 10 ++
kernel/cgroup.c | 226 ++++++++++++++++++++++++++++++-------------------
2 files changed, 151 insertions(+), 85 deletions(-)
Index: pids-2.6.27-rc1-mm1/include/linux/cgroup.h
===================================================================
--- pids-2.6.27-rc1-mm1.orig/include/linux/cgroup.h
+++ pids-2.6.27-rc1-mm1/include/linux/cgroup.h
@@ -15,6 +15,7 @@
#include <linux/rcupdate.h>
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
+#include <linux/rwsem.h>
#ifdef CONFIG_CGROUPS
@@ -137,6 +138,15 @@ struct cgroup {
* release_list_lock
*/
struct list_head release_list;
+
+ /* pids_mutex protects the fields below */
+ struct rw_semaphore ...It's great, Thanks! Lai Jiangshan. Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> --
Given the "if" test directly above, those two are equivalent. Paul --
On Thu, 04 Sep 2008 22:34:47 -0700 kmalloc becomes more unreliable above 32 kbytes and 100% unreliable above MAX_ORDER. --
Agreed, but that's something to be fixed in a different patch - the existing cgroups code (and cpusets originally) has had this kind of kmalloc call in it. I think it should be reasonably straightforward to replace it with an array of page allocations. Paul --
