Thursday, October 26, 2006

understand blackfin MM - 1026

Linux Virtual Memory

* bootmem
* credits
* home
* kernalloc
* kmap
* mmap
* numa
* pagecache
* swap
* vm_paulwilson
* vminit
* vmnotes
* vmoutline
* vmpolicy
* zonealloc


Swapping and the Page


Pretty much every page
of user process RAM is
kept in either the page
cache or the swap cache
(the swap cache is just
the part of the page
cache associated with
swap devices). Most user
pages are added to the
cache when they are
initially mapped into
process VM, and remain
there until they are
reclaimed for use by
either another process
or by the kernel itself.
The purpose of the cache
is simply to keep as
much useful data in
memory as possible, so
that page faults may be
serviced quickly. Pages
in different parts of
their life cycle must be
managed in different
ways by the VM system,
and likewise pages that
are mapped into process
VM in different ways.

The cache is a layer
between the kernel
memory management code
and the disk I/O code.
When the kernel swaps
pages out of a task,
they do not get written
immediately to disk, but
rather are added to the
cache. The kernel then
writes the cache pages
out to disk as necessary
in order to create free

The kernel maintains a
number of page lists
which collectively
comprise the page cache.
The active_list, the
inactive_dirty_list, and
the inactive_clean_list
are used to maintain a
sorting of user pages
(the page replacement
policy actually
implemented is something
like "not recently
used", since Linux is
fairly unconcerned about
keeping a strict LRU
ordering of pages).
Furthermore, each page
of an executable image
or mmap()ed file is
associated with a
per-inode cache,
allowing the disk file
to be used as backing
storage for the page.
Finally, anonymous pages
(those without a disk
file to serve as backing
storage - pages of
malloc()'d memory, for
example) are assigned an
entry in the system
swapfile, and those
pages are maintained in
the swap cache.

Note that anonymous
pages don't get added to
the swap cache - and
don't have swap space
reserved - until the
first time they are
evicted from a process's
memory map, whereas
pages mapped from files
begin life in the page
cache. Thus, the
character of the swap
cache is different than
that of the page cache,
and it makes sense to
make the distinction.
However, the cache code
is mostly generic, and I
won't be too concerned
about the differences
between mapped pages and
swap pages here.

The general
characteristics of pages
on the LRU page lists
are as follows:

* active_list:
pages on the
active list have
page->age > 0,
may be clean or
dirty, and may
be (but are not
mapped by

* inactive_dirty_list: pages on this list have page->age == 0, may be clean or dirty, and are not mapped by any process PTE.

* inactive_clean_list: each zone has its own inactive_clean_list, which contains clean pages with age == 0, not mapped by any process PTE.

During page-fault
handling, the kernel
looks for the faulting
page in the page cache.
If it's found, it can be
moved to the active_list
(if it's not already
there) and used
immediately to service
the fault.

Life Cycle of a User
I'll present the common
case of a page (call it
P) that's part of an
mmap()ed data file.
(Executable text pages
have a similar life
cycle, except they never
get dirtied and thus are
never written out to

1. The page is read
from disk into
memory and added
to the page
cache. This can
happen in a
number of
different ways:
* Process
page P;
it is
read in
by the
page-fault handler for the process's VM area corresponding to the mapped file and added to the page cache, and to the process page tables. The page starts its life on the inode cache for the file's inode, and on the active_list of the LRU, where it remains while it is actively being used.


* Page P
is read
during a
readahead operation, and added to the page cache. In this case, the reason the page is read in is simply that it is part of the cluster of blocks on disk that happens to be easy to read; we don't necessarily know the page will be needed, but it's cheap to read a bunch of pages that are sequential on the disk - and the cost of throwing away those pages if it turns out we don't need them is trivial, since they can be immediately reclaimed if they're never referenced. Such pages start life in the swap cache, and on the active_list. (Actually this will never be the case for an mmapped page - such pages are never written to swap.)


* Page P
is read
during a
readahead operation, in which case a sequence of adjacent pages following the faulting page in an mmapped file is read. Such pages start their life in the page cache associated with the mmapped file, and on the active list.

2. P is written by
the process, and
thus dirtied. At
this point P is
still on the

3. P is not used
for a while.
invocations of
the kernel swap
daemon kswapd()
will gradually
reduce the
page->age count.
kswapd() wakes
up more
frequently as
memory pressure
increases. P's
age will
gradually decay
to 0 if it is
not referenced,
due to periodic
invocations of
refill_inactive() by kswapd.

4. If memory is
swap_out() will
eventually be
called by
kswapd() to try
to evict pages
from Process A's
virtual address
space. Since
page P hasn't
been referenced
and has age 0,
the PTE will be
dropped, and the
only remaining
reference to P
is the one
resulting from
its presence in
the page cache
(assuming, of
course, that no
other process
has mapped the
file in the
swap_out() does
not actually
swap the page
out; rather, it
simply removes
the process's
reference to the
page, and
depends upon the
page cache and
swap machinery
to ensure the
page gets
written to disk
if necessary.
(If a PTE has
been referenced
when swap_out()
examines it, the
mapped page is
aged up - made
younger - rather
than being

5. Time passes... a
little or a lot,
depending on
memory demand.

6. refill_inactive_scan() comes along, trying to find pages that can be moved to the inactive_dirty list. Since P is not mapped by any process and has age 0, it is moved from the active_list to the inactive_dirty list.

7. Process A
attempts to
access P, but
it's not present
in the process
VM since the PTE
has been cleared
by swap_out().
The fault
handler calls
__find_page_nolock() to try to locate P in the page cache, and lo and behold, it's there, so the PTE can be immediately restored, and P is moved to the active_list, where it remains as long as it is actively used by the process.

8. More time
clears Process
A's PTE for page
refill_inactive_scan() deactivates P, moving it to the inactive_dirty list.

9. More time
passes... memory
gets low.

10. page_launder()
is invoked to
clean some dirty
pages. It finds
P on the
inactive_dirty_list, notices that it's actually dirty, and attempts to write it out to the disk. When the page has been written, it can then be moved to the inactive_clean_list. The following sequence of events occurs when page_launder() actually decides to write out a page:
* Lock the

* We
determine the page needs to be written, so we call the writepage method of the page's mapping. That call invokes some filesystem-specific code to perform an asynchronous write to disk with the page locked. At that point, page_launder() is finished with the page: it remains on the inactive_dirty_list, and will be unlocked once the async write completes.

* Next
page_launder() is called it will find the page clean and move it to the inactive_clean_list, assuming no process has found it in the pagecache and started using it in the meantime.

11. page_launder()
runs again,
finds the page
unused and
clean, and moves
it to the
inactive_clean_list of the page's zone.

12. An attempt is
made by someone
to allocate a
single free page
from P's zone.
Since the
request is for a
single page, it
can be satisfied
by reclaiming an
page; P is
chosen for
removes P from
the page cache
ensuring that no
other process
will be able to
gain a reference
to it during
page fault
handling), and
it is given to
the caller as a
free page.


kreclaimd comes
along trying to
create free
memory. It
reclaims P and
then frees it.

Note that this is only
one possible sequence of
events: a page can live
in the page cache for a
long time, aging, being
deactivated, being
recovered by processes
during page fault
handling and thereby
reactivated, aging,
being deactivated, being
laundered, being
recovered and

Pages can be recovered
from the inactive_clean
and active lists as well
as from the
inactive_dirty list.
Read-only pages,
naturally, are never
dirty, so page_launder()
can move them from the
inactive_dirty_list to
the inactive_clean_list
"for free," so to speak.

Pages on the
inactive_clean list are
periodically examined by
the kreclaimd kernel
thread and freed. The
purpose of this is to
try to produce larger
contiguous free memory
blocks, which are needed
in some situations.

Finally, note that P is
in essence a logical
page, though of course
it is instantiated by
some particular physical

Usage Notes
When a page is in the
page cache, the cache
holds a reference to the
page. That is, any code
that adds a page to the
page cache must
increment page->count,
and any code that
removes a page from the
page cache must
decrement page->count.
Failure to honor these
rules will cause Bad
Thingstm to happen,
since the page
reclamation code expects
cached pages to have a
reference count of
exactly 1 (or 2 if the
page is also in the
buffer cache) in order
to be reclaimed.

The existing public
interface to the page
etc) already handles
page reference counting
properly. You shouldn't
attempt to add pages to
the cache in any other

Page Cache Data
The basic data
structures involved in
the page cache are

* The LRU list of
page structs:
* The
active_list, containing active pages (mapped by process page tables).

* The
inactive_dirty_list, containing pages that are not mapped by any process page tables, but which may be dirty.

* The
inactive_clean_list, containing freeable pages that are clean and not mapped by any process page tables. There is one inactive_clean_list per zone, so that allocations from each zone can be fulfilled by reclaiming appropriate inactive_clean pages.

* The struct
which represents
the virtual
memory image of
a file. Each
mmap()ed file,
executables and
files which are
mmap()ed by a
program, is
represented by
an address_space
struct which
holds on to the
inode for the
file, a list of
all the VM
mappings of the
file, and a
collection of
pointers to
functions needed
to handle
various VM tasks
on the file,
such as reading
in and writing
out pages. No
matter how many
processes have
mapped a
particular file,
there is only
struct for the
file. Swap files
are also
represented by
structs, which
allows the page
cache code to be
generic for
pages and
mmapped file

* The ubiquitous
page struct -
these are the
items that make
up the LRU lists
and inode

* The page hash
queues - lists
of pages with
the same hash
values. The hash
queues are used
when looking for
mapped pages in
the page cache
based on the
file and the
offset of the
page within the
Page Cache Code
add_to_page_cache( page
struct * page, struct
address_space * mapping,
unsigned long offset )
is the public interface
by which pages are added
to the page cache. It
adds the given page to
the inode and hash
queues for the specified
address_space, and to
the LRU lists.

The actual cache-add
operation is performed
by the helper function
in mm/filemap.c on line

* Line 506 if the
page is locked,
it's a bug. (The
page may, for
example, be in
the process of
being written
out to disk.)

* Lines 509 and
510: clear the
error, dirty,
uptodate, and
(architecture-specific?) bits, and set the locked bit in the page frame.

* Line 511 take a
reference to the
page. This is
the cache's
reference, and
is critical for
the proper
operation of
other page-cache
code (notably

* The remainder of
the code in the
function adds
the page to the
mapping, to the
page hash queue,
and to the LRU

This is the same as
except that it first
checks to be sure the
page being added isn't
already in the cache.

page struct *
__find_page_nolock(struct address_space *mapping, unsigned long offset, page struct *page) is the basic cache-search function. It looks at the pages in the hash queue given by the page argument, examining only those that are associated with the given mapping (address_space), and returns the one with the matching offset, if such exists. The code is "interesting", but pretty straightforward.

As an example of how
__find_page_nolock() is
used, consider the
kernel's activity when
handling a page fault on
an executable text page:

* The fault
handler uses the
fault address to
find the
corresponding to
the fault
address. It
calls the
*nopage member
of the
vm_ops, which in
the case of a
text area will

* filemap_nopage()
tracks down the
via the
vm_area_struct->file member. It computes the page's offset into the file using the vm_area_struct's base address and the fault address. It computes the hash value of the faulting page, and calls __find_get_page() to search the page cache, giving it the address_space mapping, offset, and hash queue.

* __find_get_page() calls __find_page_nolock() to search the hash queue for the page at the given offset in the proper mapping. If found, it takes an additional reference to the page on behalf of the faulting process, and returns the page. If the page isn't found, filemap_nopage() reads the page into the page cache and starts over from the beginning.

file * file, unsigned
long offset) checks to
see if a page
corresponding to the
given file and offset is
already in the page
cache. If it's not, it
reads in the page and
adds it to the cache.

Well, actually the other
way around:

* Line 557 is the
page in the
cache? If so, do

* Line 562
allocate a page
(page_cache_alloc() == get_free_page()).

* Line 566 add the
page to the page
associating it
with the
specified file
and offset.

* Line 567 do
whatever is
necessary to
read the page.
Note that adding
the page to the
page cache has
locked it;
anyone who wants
to use the page
must wait for it
to be unlocked
The page will be
unlocked when
its read
completes. [How
does this

void wait_on_page(page
struct *page) simply
checks to see if the
given page is locked,
and if it is, waits for
it to be unlocked. The
actual waiting is done
in ___wait_on_page() in

* Line 609 declare
a waitqueue
entry and add it
to the page's
wait queue.
We're declaring
the waitqueue
entry on the
stack; that's
OK, since this
instance of
___wait_on_page() won't return until the page is unlocked.

* The loop on line
612 is executed
while the page
is locked.
* On line
613 we
try to
any I/O
on the
page to

* We then
set the
state to
uninterruptible sleep, check to see if the page has been unlocked in the meantime, and if not, run the disk I/O task queue and schedule another process. [sync_page() seems to normally do run_task_queue(&tq_disk), via block_sync_page(). So we're forcing I/O twice?]

* We'll
wake up
schedule() as soon as the page gets unlocked and the wait queue is awoken.

When we need to write a
page from the page cache
into backing storage, in
page_launder() for
example, we invoke the
writepage member of the
page mapping's
structure. In most cases
this will ultimately
__block_write_full_page(struct inode *inode, page struct *page, get_block_t *get_block), defined in fs/buffer.c on line 1492 __b_w_f_p() writes out the given page via the buffer cache. Essentially, all we do here is create buffers for the page if it hasn't been done yet, map them to disk blocks, and submit them to the disk I/O subsystem to be physically written out asynchronously. We set the I/O completion handler to end_buffer_io_async(), which is a special handler that knows how to deal with page-cache read and write completions.

kswapd() is the kernel
swap thread. It simply
loops, deactivating and
laundering pages as
necessary. If memory is
tight, it is more
aggressive, but it will
always trickle a few
inactive pages per
minute back into the
free pool, if there are
any inactive pages to be

When there's nothing to
do, kswapd() sleeps. It
can be woken explicitly
by tasks that need
memory using
wakeup_kswapd(); a task
that wakes kswapd() can
either let it carry on
asynchronously, or wait
for it to free some
pages and go to sleep

try_to_free_pages() does
almost the same thing
that kswapd() does, but
it does it synchronously
without invoking kswapd.
Some tasks might
deadlock with kswapd()
if, for example, they
are holding kernel locks
that kswapd() needs in
order to operate; such
processes call
instead of
wakeup_kswapd(). [An
example would be good

Most of kswapd's work is
done by

do_try_to_free_pages(unsigned int gfp_mask, int user) takes a number of actions in order to try to create free (or freeable) pages.

* Line 921 move
pages to the
list, and if
launder some
pages by
starting writes
to the disk.

* Line 931 attempt
to move some
active pages to

* Line 935 free
unused slab

int gfp_mask, int user)
is invoked to try to
deactivate active pages.
It tries to create
enough freeable pages to
erase any existing free
memory shortage.

* The loop
beginning on
line 849 loops
over priority
until it reaches
0. Each time
through the loop
we decrement the
priority; lower
indicates higher
memory pressure,
and will cause
us to take more
action to try to
free pages.

* On line 857 we
attempt to
refill the
list by
deactivating as
many unused
active pages as
possible, using
refill_inactive_scan(). If this works well, it's all we need, and the loop will exit.

* On line 868 we
shrink the inode
cache and the
dentry cache, if

* Then, on line
874 we attempt
to forcibly
remove process
page mappings.
The swap_out()
function is
mis-named: it
doesn't actually
do any swapping
out. Rather, it
tries to evict
pages from
process PTE
mappings. If it
can eliminate
all process
mappings of the
page, the page
becomes eligible
for deactivation
and laundering.

* If we've
eliminated the
memory shortage,
exit the loop.
If we haven't
been able to
free any pages,
do the loop
again, more
Note that the
loop is certain
to exit
eventually by
eliminating the
memory shortage,
or failing to
make progress
even at priority
0. (Unless
you're one of
those lucky
souls with an
infinite number
of freeable
pages on your

looks at each page on
the active list, aging
the pages up if they've
been referenced since
last time and down if
they haven't (this is
counterintuitive - older
pages have lower age

If a page's age is 0,
and the only remaining
references to the page
are those of the buffer
cache and/or page cache,
then the page is
deactivated (moved to

There is some misleading
commentary in this
function. It doesn't
actually use
age_page_down() to age
down the page and
deactivate it; rather,
it ages the page down,
and then additionally
checks the reference
count to be sure it is
really unreferenced by
process PTEs before
actually deactivating
the page.

int swap_out(unsigned
int priority, int
gfp_mask) scans process
page tables and attempts
to unmap pages from
process VM. It computes
the per-task swap_cnt
value - basically a
"kick me" rating, higher
values make a process
more likely to be
victimized - and uses it
to decide which process
to try to "swap out"
first. Larger processes
which have not been
swapped out in a while
have the best chance of
being chosen.

The actual unmapping is
done by swap_out_mm(),
which tries to unmap
pages from a process's
memory map;
swap_out_vma(), which
tries to unmap pages
from a particular VM
area within a process by
walking the part of the
page directory
corresponding to the VM
area looking for page
middle directory entries
to unmap; swap_out_pgd()
which walks page middle
directory entries
looking for page tables
to unmap; swap_out_pmd()
which walks page tables
looking for page table
entries to unmap; and
try_to_swap_out(), where
all the interesting
stuff happens.

mm_struct * mm, struct
vm_area_struct* vma,
unsigned long address,
pte_t * page_table, int
gfp_mask) attempts to
unmap the page indicated
by the given pte_t from
the given
vm_area_struct. It
returns 0 if the
higher-level swap-out
code should continue to
scan the current
process, 1 if the
current scan should be
aborted and another swap
victim chosen. It's a
bit complicated:

* Line 46 if the
given page isn't
present, return
0 to continue
the scan.

* Do some paranoia
checks to be
sure the page is
in a sane

* Line 52 if the
process's memory
map has been
victimized by
the swap code,
return 1 to
force a new
process to be
reduce the
process's "good
victim" rating.

* Line 57 check if
the page is

* Line 59 age the
page up (make it
younger -
someone didn't
have enough
caffeine when
they wrote the
code :-) if it's
been referenced

* Line 65 if the
page is not
active, age the
page down, but
do not move it
to the
inactive_dirty_list because (by virtue of us looking at the page in try_to_swap_out()) it's still mapped and thus not freeable. That's the meaning of the mysterious comment on line 64

* Line 72 if the
page is young,
don't unmap it.
Note that under
high memory
load, both
swap_out() and
refill_inactive_scan() will be called more frequently (often by processes attempting to allocate free pages), thus pages will age more quickly, and we have a reasonable chance to find young pages in swap_out().

* Line 75 lock the
page frame so
happens to it
while we're
molesting it. If
we can't lock
it, continue to
scan the VMA.

* Line 83 clear
the PTE, flush
the TLB (so the
CPU doesn't
continue to
believe the page
is mapped).

* Line 94 if the
page is already
in the swap
cache (that is,
already in the
page cache and
associated with
swap space
rather than with
a regular file
mapping), we
just replace the
pte with the
swapfile entry
that will allow
us to swap the
page in again
when required.
We also mark the
page dirty, if
necessary, to
force it to be
written out next
looks at it.

* Line 102 -
drop_pte: we
jump to this
label from a
number of places
in order to
finish the
Unlock the page,
attempt to
deactivate it
(this will only
succeed if no
other process is
using it), and
drop the
Return 0 to
continue the

* Line 124 We get
here if the page
wasn't already
in the swap
cache (it's
either an
anonymous page
that's never
been swapped out
before, or else
it's an mmapped
page with
backing storage
on a regular
file and a valid
If the page is
clean we just
goto drop_pte,
since the page
can be recovered
by paging it in
from its backing
store, if any.
Note that if
it's a clean
anonymous page,
and it's not in
the swap cache,
then no process
has ever written
data to the page
- so it can be
replaced, if
necessary, by
allocating a
free page.

* Line 133 the
page is
dirty. If the
page has a
mapping, we're
cool, because
will write it
out if
necessary, so we
can just goto
drop_pte and be

* Line 144 The
page had no
mapping, so we
need to allocate
a swap page for
it and add it to
the swap cache.
(The swap cache
is just that
part of the page
cache that's
associated with
mapping rather
than with a
regular file
mapping.) If the
attempt to
allocate a swap
page fails, we
must restore the
PTE by jumping

* Line 149 swap
space allocated
successfully. We
add the page to
the page cache,
set it dirty so
knows to write
it out, and goto
set_swap_pte to
replace the PTE
with the swap
deactivate the
page, and
continue the
Simple, eh?

int page_launder(int
gfp_mask, int sync) is
responsible for moving
inactive_dirty pages to
the inactive_clean list,
writing them out to
backing disk if
necessary. It's pretty
straightforward. It
makes at most two loops
over the
inactive_dirty_list. On
the first loop it simply
moves as many clean
pages as it can to the
inactive_clean_list. On
the second loop it
starts asynchronous
writes for any dirty
pages it finds, after
locking the pages; but
it leaves those pages on
the inactive_dirty list.
Eventually the writes
will complete and the
pages will be unlocked,
at which point future
invocations of
page_launder() can move
them to the

page_launder() may also
do synchronous writes if
it's invoked by a user
process that's trying to
free up enough memory to
complete an allocation.
In this case we start
the async writes as
above (mm/vmscan.c line
561), but then wait for
them to complete, by
with wait > 1 (vmscan.c,
line 595 - line 603).
Starting the page write
causes the page to be
added to the buffer
attempts to remove the
page from the buffer
cache by writing all its
buffers to disk, if

page_struct *
reclaim_page(zone_t *
zone) attempts to
reclaim an
inactive_clean page from
the given zone. It
examines each page in
the zone's
moving referenced pages
to the active list, and
dirty pages to the
inactive_dirty list.
(Both of those tests are
basically paranoid and
should never happen,
since code that remaps
an inactive page in the
cache also moves it to
the active list.) When
it finds a genuine clean
page, which should
happen immediately, it
removes the page from
the page cache and
returns it as a free

Note the code from line
431 to line 439 every
inactive_clean page is
either in the swap cache
(if it was an anonymous
page) or else it's in
the page cache
associated with an
mmapped file
(page->mapping && !
Exactly one of those
conditions will be true.

Note also line 455 this
is just another paranoia
check. Pages on the
inactive_clean list are
no longer in the buffer
cache (since they've
been written out and
their buffers freed in
page_launder), so every
reclaimed page will have
a reference count of
exactly 1 (the page
cache's reference), in
the absence of broken
kernel modules or

deactivate_page() is
called within
try_to_swap_out() to try
to deactivate a page. It
for the interesting
bits. It checks that the
page is unused by anyone
except the caller (which
presumably is about to
release a reference) and
the page and buffer
caches, then makes the
page old by setting its
age to 0 and moves it to
the inactive_dirty_list,
but only if it was on
the active list. We
never want to deactivate
a page that's not in the
page cache, since we
have no idea what else
such pages may be being
used for, and the page
cache makes some pretty
strong assumptions about
what it can do with the
pages under its control.



Linux MM



Questions and comments
to Joe Knapka

The links in this page were produced by lxrreplace.tcl, which is available for free.


Last changed:
01-25-06 10:11:22

This page was rendered by LittleSite.
All content Copyright (c) 2005 by J.Knapka.
Questions and comments to JK

No comments:

Blog Archive