Monday, October 30, 2006

vi and cscope

The Vim/Cscope tutorial

Cscope is a very handy tool, but it's even better when you don't ever have to leave the comfort of your favorite editor (i.e. Vim) to use it. Fortunately, Cscope support has been built into Vim.

This tutorial introduces you both to Vim's built-in Cscope support, and to a set of maps that make searching more convenient.

It is assumed you know the basics of using a vi-style editor, but you don't need any particular knowledge about Vim (where Vim-specific features--like multiple windows--are used, a working knowledge of the features is briefly introduced). You also don't need to know anything about Cscope: the basics are introduced as we go along.

In a nutshell, Vim's Cscope support is very similar to Vim's ctags features, in case you've used those. But since Cscope has more search types than ctags, there are a few differences.

This is a hands-on tutorial, so open up a shell, and follow these steps:

1. Get and install Cscope if you don't have it already on your machine. Ideally, you will also have Vim 6.x, but you can get most of the functionality with later versions of Vim 5 (vertical splits don't work, but horizontal splits will work if you modify the maps as described in the file's comments).

Note: If your version of Vim wasn't compiled with '--enable-cscope', you will need to reconfigure and recompile Vim with that flag. Most Vim binaries that ship with Linux distributions have the Cscope plugin enabled.

2. Download the cscope_maps.vim file, and arrange for it to be read by Vim at startup time. If you are using Vim 6.x, stick the file in your $HOME/.vim/plugin directory (or in any other 'plugin' subdirectory in your 'runtimepath'). If you are using Vim 5.x, you can either cut and paste the entire contents of the cscope_maps file into your $HOME/.vimrc file, or stick a "source cscope_maps.vim" line into your .vimrc file.
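For Vim 6.x, the install amounts to copying the file into the plugin directory; a minimal sketch, assuming you've downloaded cscope_maps.vim into the current directory:

```shell
# Install cscope_maps.vim where Vim 6.x auto-loads plugins.
mkdir -p "$HOME/.vim/plugin"
if [ -f cscope_maps.vim ]; then
    cp cscope_maps.vim "$HOME/.vim/plugin/"
fi
```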

3. Go into a directory with some C code in it, and enter 'cscope -R' (the '-R' makes Cscope parse all subdirectories, not just the current directory). Since we aren't passing the '-b' flag (which tells Cscope to just build the database, then exit), you will also find yourself inside Cscope's curses-based GUI. Try a couple of searches (hint: you use the arrow keys to move around between search types, and 'tab' to switch between the search types and your search results). Hit the number at the far left of a search result, and Cscope will open Vim right to that location (unless you've set your EDITOR environment variable to something besides Vim). Exit Vim, and you'll be right back in the Cscope GUI where you left off. Nifty.

Alas, the Cscope interface has one big problem: you need to exit Vim each time you want to do a new search. That's where the Vim plugin comes in. Hit CTRL-D to exit Cscope.

4. Start up Vim. If you want, you can start it with a C symbol (ex: 'vim -t main'), and you should hop right to the definition of that symbol in your code.

5. Put the cursor over a C symbol that is used in several places in your program. Type "CTRL-\ s" (Control-backslash, then just 's') in quick succession, and you should see a menu at the bottom of your Vim window showing you all the uses of the symbol in the program. Select one of them and hit enter, and you'll jump to that use. As with ctags, you can hit "CTRL-t" to jump back to your original location before the search (and you can nest searches and CTRL-t will unwind them one at a time).

Mnemonic: the '\' key is right next to the ']' key, which is used for ctags searches.

6. Try the same search, but this time via "CTRL-spacebar s". This time, your Vim window will split in two horizontally, and the Cscope search result will be put in the new window. [if you've never used multiple Vim windows before: move between windows via 'CTRL-W w' (or CTRL-W arrow key, or CTRL-W h/j/k/l for left/up/down/right), close a window via 'CTRL-W c' (or good old ':q'), make the current window the only one via 'CTRL-W o', split a window into two via 'CTRL-W s' (or 'CTRL-W v' for a vertical split), open a file in a new window via ':spl[it] filename']

Mnemonic: there's now a big, spacebar-like bar across the middle of your screen separating your Vim windows.

7. Now try the same search via "CTRL-spacebar CTRL-spacebar s" (just hold down the CTRL key and tap the spacebar twice). If you have trouble hitting the keys fast enough for this to work, go into the cscope_maps.vim script and change Vim's timeout settings as described in the comments [actually, I generally recommend that you turn off Vim's timeouts]. This time your Vim window will be split vertically (note: this doesn't work with Vim 5.x, as vertical splits are new with Vim 6.0).

8. Up to now we've only been using the keystroke maps from 'cscope_maps.vim', which all do a search for the term that happens to be under your cursor in Vim. To do Cscope searches the old-fashioned way (using Vim's built-in Cscope support), enter ":cscope find symbol foo" (or, more tersely, ":cs f s foo"). To do the horizontal split version, use ":scscope" (or just ":scs") instead (Vim 6.x only). While it's easier to use the maps if the word you want to search for is under your cursor, the command line interface lets you go to any symbol you type in, so you'll definitely want to use it at times.

9. So far we've only been doing one kind of search: 's', for 'find all uses of symbol X'. Try doing one of Cscope's other searches by using a different letter: 'g' finds the global definition(s) of a symbol, 'c' finds all calls to a function, 'f' opens the filename under the cursor (note: since Cscope by default parses all C header files it finds in /usr/include, you can open up most standard include files with this). Those are the ones I use most frequently, but there are others (look in the cscope_maps.vim file for all of them, and/or read the Cscope man page).

10. Although Cscope was originally intended only for use with C code, it's actually a very flexible tool that works well with languages like C++ and Java. You can think of it as a generic 'grep' database, with the ability to recognize certain additional constructs like function calls and variable definitions. By default Cscope only parses C, lex, and yacc files (.c, .h, .l, .y) in the current directory (and subdirectories, if you pass the -R flag), and there's currently no way to change that list of file extensions (yes, we ought to change that). So instead you have to make a list of the files that you want to parse, and call it 'cscope.files' (you can call it anything you want if you invoke 'cscope -i foofile'). An easy (and very flexible) way to do this is via the trusty Unix 'find' command:

find . -name '*.java' > cscope.files
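The same trick works for a mixed C/C++ tree; the extension list below is illustrative, so adjust it to your project's conventions:

```shell
# Collect C and common C++ sources into cscope.files.
find . -name '*.c' -o -name '*.h' \
       -o -name '*.cc' -o -name '*.cpp' -o -name '*.hpp' \
       > cscope.files
```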

Now run 'cscope -b' to rebuild the database (the -b just builds the database without launching the Cscope GUI), and you'll be able to browse all the symbols in your Java files. Apparently there are folks out there using Cscope to browse and edit large volumes of documentation files, which shows how flexible Cscope's parser is.

For larger projects, you may additionally need to use the -q flag, and/or use a more sophisticated 'find' command. See our tutorial on using Cscope with large projects for more info.

11. Try setting the $CSCOPE_DB environment variable to point to a Cscope database you create, so you won't always need to launch Vim in the same directory as the database. This is particularly useful for projects where code is split into multiple subdirectories. Note: for this to work, you should build the database with absolute pathnames: cd to /, and do

find /my/project/dir -name '*.c' -o -name '*.h' > /foo/cscope.files

Then run Cscope in the same directory as the cscope.files file (or use 'cscope -i /foo/cscope.files'), then set and export the $CSCOPE_DB variable, pointing it to the cscope.out file that results:

cd /foo
cscope -b
CSCOPE_DB=/foo/cscope.out; export CSCOPE_DB

(The last command above is for Bourne/Korn/Bash shells: I've forgotten how to export variables in csh-based shells, since I avoid them like the plague).

You should now be able to run 'vim -t foo' in any directory on your machine and have Vim jump right to the definition of 'foo'. I tend to write little shell scripts (that just define and export CSCOPE_DB) for all my different projects, which lets me switch between them with a simple 'source projectA' command.
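A per-project script of the sort described above can be as small as two lines; the path here is just an example, so substitute your own:

```shell
# projectA -- 'source projectA' points Cscope and Vim at
# project A's database (example path; substitute your own).
CSCOPE_DB=/home/jru/projectA/cscope.out
export CSCOPE_DB
```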
BUG: in versions of Cscope prior to 15.4, there is a silly bug that may cause Vim to freeze when you do this unless you call your database something other than the default 'cscope.out': use '-f foo' in your Cscope invocation to name your database 'foo' instead, and you'll be OK.

12. That's it! Use ":help cscope" (in Vim) and/or "man cscope" (from your shell) if you've got questions, and to learn the fine points.

Using Cscope on large projects (example: the Linux kernel)
Cscope can be a particularly useful tool if you need to wade into a large code base. You can save yourself a lot of time by being able to do fast, targeted searches rather than randomly grepping through the source files by hand (especially since grep starts to take a while with a truly large code base).

In this tutorial you'll learn how to set up Cscope with a large project. We'll use as our example the Linux kernel source code, but the basic steps are the same for any other large project, including C++ or Java projects.

1. Get the source. You can download the Linux kernel source from kernel.org. For the rest of this tutorial, I'll assume you've downloaded Linux 2.4.18 and installed it into /home/jru/linux-2.4.18.

Note: Make sure you've got enough disk space: the kernel tarball alone is 30 MB, it expands into 150 MB of source code, and the Cscope database we'll generate will gobble up another 20-100+ MB (depending on how much of the kernel code you decide to include in the database). You can put the Cscope database on a different disk partition than the source code if you need to.

2. Figure out where you want to put your Cscope database files. I'll assume you'll use /home/jru/cscope as the directory to store your database and associated files.

3. Generate cscope.files with a list of files to be scanned. For some projects, you may want to include every C source file in the project's directories in your Cscope database. In that case you can skip this step, and just use 'cscope -R' in the project's top-level directory to build your Cscope database. But if there's some code that you wish to exclude, and/or your project contains C++ or Java source code (by default Cscope only parses files with the .c, .h, .y, or .l extensions), you'll need to generate a file called cscope.files, which should contain the name of all files that you wish to have Cscope scan (one file name per line).

You'll probably want to use absolute paths (at least if you're planning to use the Cscope database within an editor), so that you can use the database from directories other than the one in which you created it. The commands I show will first cd to root, so that find prints out absolute paths.

For many projects, your find command may be as simple as

cd /
find /my/project/dir -name '*.java' >/my/cscope/dir/cscope.files

For the Linux kernel, it's a little trickier, since we want to exclude all the code in the docs and scripts directories, plus all of the architecture and assembly code for all chips except for the beloved Intel x86 (which I'm guessing is the architecture you're interested in). Additionally, I'm excluding all kernel driver code in this example (they more than double the amount of code to be parsed, which bloats the Cscope database, and they contain many duplicate definitions, which often makes searching harder. If you are interested in the driver code, omit the relevant line below, or modify it to print out only the driver files you're interested in):

cd /
find $LNX -path "$LNX/arch/*" ! -path "$LNX/arch/i386*" -prune -o -path "$LNX/include/asm-*" ! -path "$LNX/include/asm-i386*" -prune -o -path "$LNX/tmp*" -prune -o -path "$LNX/Documentation*" -prune -o -path "$LNX/scripts*" -prune -o -path "$LNX/drivers*" -prune -o -name "*.[chxsS]" -print >/home/jru/cscope/cscope.files

While find commands can be a little tricky to write, for large projects they are much easier than editing a list of files manually, and you can also cut and paste a solution from someone else.

4. Generate the Cscope database. Now it's time to generate the Cscope database:

cd /home/jru/cscope # the directory with 'cscope.files'
cscope -b -q -k

The -b flag tells Cscope to just build the database, and not launch the Cscope GUI. The -q causes an additional, 'inverted index' file to be created, which makes searches run much faster for large databases. Finally, -k sets Cscope's 'kernel' mode--it will not look in /usr/include for any header files that are #included in your source files (this is mainly useful when you are using Cscope with operating system and/or C library source code, as we are here).

On my 900 MHz Pentium III system (with a standard IDE disk), parsing this subset of the Linux source takes only 12 seconds, and results in 3 files (cscope.out, cscope.in.out, and cscope.po.out) that take up a total of 25 megabytes.

5. Using the database. If you like to use vim or emacs/xemacs, I recommend that you learn how to run Cscope within one of these editors, which will allow you to run searches easily within your editor. We have a tutorial for Vim, and emacs users will of course be clever enough to figure everything out from the helpful comments in the cscope/contrib/xcscope/ directory of the Cscope distribution.

Otherwise, you can use the standalone Cscope curses-based GUI, which lets you run searches, then launch your favorite editor (i.e., whatever $EDITOR is set to in your environment, or 'vi' by default) to open on the exact line of the search result.

If you use the standalone Cscope browser, make sure to invoke it via

cscope -d

This tells Cscope not to regenerate the database. Otherwise you'll have to wait while Cscope checks for modified files, which can take a while for large projects, even when no files have changed. If you accidentally run 'cscope' without any flags, you will also cause the database to be recreated from scratch without the fast index or kernel modes being used, so you'll probably need to rerun your original cscope command above to correctly recreate the database.

6. Regenerating the database when the source code changes.

If there are new files in your project, rerun your 'find' command to update cscope.files if you're using it.

Then simply invoke cscope the same way (and in the same directory) as you did to generate the database initially (i.e., cscope -b -q -k).
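The refresh-and-rebuild steps can be wrapped in one small function so no flag gets forgotten; this is a sketch (the function name and the '*.[ch]' pattern are mine, not from the steps above):

```shell
# rebuild_cscope DBDIR SRCDIR -- refresh cscope.files from
# SRCDIR, then rebuild the database in DBDIR with the same
# flags used above (-b -q -k).
rebuild_cscope() {
    db_dir=$1
    src_dir=$2
    cd "$db_dir" || return 1
    find "$src_dir" -name '*.[ch]' > cscope.files
    cscope -b -q -k
}

# Example: rebuild_cscope /home/jru/cscope /home/jru/linux-2.4.18
```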

Friday, October 27, 2006


Tips & Tricks
Featured Article: /proc/meminfo Explained

March 2003

"Free," "buffer," "swap," "dirty." What does it all mean? If you said,
"something to do with the Summer of '68", you may need a primer on

The entries in the /proc/meminfo can help explain what's going on with
your memory usage, if you know how to read it.

Example of "cat /proc/meminfo":

          total:       used:       free:  shared:   buffers:     cached:
Mem:  1055760384  1041887232    13873152        0  100417536   711233536
Swap: 1077501952     8540160  1068961792

MemTotal: 1031016 kB
MemFree: 13548 kB
MemShared: 0 kB
Buffers: 98064 kB
Cached: 692320 kB
SwapCached: 2244 kB
Active: 563112 kB
Inact_dirty: 309584 kB
Inact_clean: 79508 kB
Inact_target: 190440 kB
HighTotal: 130992 kB
HighFree: 1876 kB
LowTotal: 900024 kB
LowFree: 11672 kB
SwapTotal: 1052248 kB
SwapFree: 1043908 kB
Committed_AS: 332340 kB

The information comes in the form of both high-level and low-level
statistics. At the top you see a quick summary of the most common values
people would like to look at. Below you find the individual values we
will discuss. First we will discuss the high-level statistics.

High-Level Statistics
* MemTotal: Total usable RAM (i.e. physical RAM minus a few
reserved bits and the kernel binary code).
* MemFree: The sum of LowFree + HighFree (overall stat).
* MemShared: 0; present for compatibility reasons, but always zero.
* Buffers: Memory in the buffer cache; mostly useless as a metric.
* Cached: Memory in the pagecache (diskcache), minus SwapCached.
* SwapCached: Memory that once was swapped out and has been swapped
back in, but is still also in the swapfile. (If that memory is needed,
it doesn't need to be swapped out AGAIN, because it is already in
the swapfile. This saves I/O.)
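As a quick sanity check on these numbers, the kB fields can be summed with awk. This sketch (the helper name is mine) adds MemFree + Buffers + Cached, a rough estimate of memory that is free or easily reclaimable:

```shell
# effective_free: sum MemFree + Buffers + Cached from
# /proc/meminfo-style input on stdin, e.g.:
#   effective_free < /proc/meminfo
effective_free() {
    awk '/^(MemFree|Buffers|Cached):/ { kb += $2 }
         END { print kb " kB" }'
}
```

Fed the example output above, it would report 13548 + 98064 + 692320 = 803932 kB.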
Detailed Level Statistics
VM Statistics
The VM splits the cache pages into "active" and "inactive" memory. The idea
is that if you need memory and some cache needs to be sacrificed for
that, you take it from inactive, since that's expected to be not used.
The VM checks what is used on a regular basis and moves stuff around.

When you use memory, the CPU sets a bit in the pagetable, and the VM
checks that bit occasionally; based on that, it can move pages back
to active. And within active there's an order of "longest ago not
used" (roughly; it's a little more complex in reality). The longest-ago
used ones can get moved to inactive. Inactive is split in two in the
above kernel (2.4.18-24.8.0); some kernels split it in three.

* Active: Memory that has been used more recently; usually not
reclaimed unless absolutely necessary.
* Inact_dirty: Dirty means "might need writing to disk or swap,"
which takes more work to free. Examples might be files that have
not been written to yet. They aren't written to disk too soon in
order to keep the I/O down. For instance, if you're writing
logs, it might be better to wait until you have a complete log
ready before sending it to disk.
* Inact_clean: Assumed to be easily freeable. The kernel will try
to always keep some clean stuff around to have a bit of
breathing room.
* Inact_target: Just a goal metric the kernel uses to make sure
there are enough inactive pages around. When exceeded, the
kernel will not do work to move pages from active to inactive. A
page can also become inactive in a few other ways, e.g. if you do a
long sequential I/O, the kernel assumes you're not going to use
that memory and makes it inactive preventively. So you can get
more inactive pages than the target, because the kernel marks
some cache as "more likely to be never used" and lets it cheat
in the "last used" order.
Memory Statistics
* HighTotal: The total amount of memory in the high region.
Highmem is all memory above (approximately) 860 MB of physical RAM.
The kernel uses indirect tricks to access the high memory region.
Data cache can go in this memory region.
* LowTotal: The total amount of non-highmem memory.
* LowFree: The amount of free memory in the low memory region.
This is the memory the kernel can address directly. All kernel
data structures need to go into low memory.
* SwapTotal: Total amount of physical swap memory.
* SwapFree: Total amount of swap memory free.
* Committed_AS: An estimate of how much RAM you would need to make
a 99.99% guarantee that there never is OOM (out of memory) for
this workload. Normally the kernel will overcommit memory. That
means, say you do a 1 GB malloc: nothing happens, really. Only
when you start USING that malloc'd memory will you get real memory,
on demand, and just as much as you use. So you sort of take a
mortgage and hope the bank doesn't go bust. Other cases might
include mmap'ing a file that's shared only when you write to it:
you get a private copy of that data, while normally it's shared
between processes. Committed_AS is a guesstimate of how much
RAM/swap you would need in the worst case.

Thursday, October 26, 2006

understand blackfin MM - 1026

Linux Virtual Memory



Swapping and the Page Cache


Pretty much every page of user process RAM is kept in either the page cache or the swap cache (the swap cache is just the part of the page cache associated with swap devices). Most user pages are added to the cache when they are initially mapped into process VM, and remain there until they are reclaimed for use by either another process or by the kernel itself. The purpose of the cache is simply to keep as much useful data in memory as possible, so that page faults may be serviced quickly. Pages in different parts of their life cycle must be managed in different ways by the VM system, and likewise pages that are mapped into process VM in different ways.

The cache is a layer between the kernel memory management code and the disk I/O code. When the kernel swaps pages out of a task, they do not get written immediately to disk, but rather are added to the cache. The kernel then writes the cache pages out to disk as necessary in order to create free memory.

The kernel maintains a number of page lists which collectively comprise the page cache. The active_list, the inactive_dirty_list, and the inactive_clean_list are used to maintain a sorting of user pages (the page replacement policy actually implemented is something like "not recently used", since Linux is fairly unconcerned about keeping a strict LRU ordering of pages). Furthermore, each page of an executable image or mmap()ed file is associated with a per-inode cache, allowing the disk file to be used as backing storage for the page. Finally, anonymous pages (those without a disk file to serve as backing storage - pages of malloc()'d memory, for example) are assigned an entry in the system swapfile, and those pages are maintained in the swap cache.

Note that anonymous pages don't get added to the swap cache - and don't have swap space reserved - until the first time they are evicted from a process's memory map, whereas pages mapped from files begin life in the page cache. Thus, the character of the swap cache is different than that of the page cache, and it makes sense to make the distinction. However, the cache code is mostly generic, and I won't be too concerned about the differences between mapped pages and swap pages here.

The general characteristics of pages on the LRU page lists are as follows:

* active_list: pages on the active list have page->age > 0, may be clean or dirty, and may be (but are not necessarily) mapped by process PTEs.

* inactive_dirty_list: pages on this list have page->age == 0, may be clean or dirty, and are not mapped by any process PTE.

* inactive_clean_list: each zone has its own inactive_clean_list, which contains clean pages with age == 0, not mapped by any process PTE.

During page-fault handling, the kernel looks for the faulting page in the page cache. If it's found, it can be moved to the active_list (if it's not already there) and used immediately to service the fault.

Life Cycle of a User Page
I'll present the common case of a page (call it P) that's part of an mmap()ed data file. (Executable text pages have a similar life cycle, except they never get dirtied and thus are never written out to disk.)
1. The page is read from disk into memory and added to the page cache. This can happen in a number of different ways:

* Process A faults on page P; it is read in by the page-fault handler for the process's VM area corresponding to the mapped file and added to the page cache, and to the process page tables. The page starts its life on the inode cache for the file's inode, and on the active_list of the LRU, where it remains while it is actively being used.


* Page P is read during a readahead operation, and added to the page cache. In this case, the reason the page is read in is simply that it is part of the cluster of blocks on disk that happens to be easy to read; we don't necessarily know the page will be needed, but it's cheap to read a bunch of pages that are sequential on the disk - and the cost of throwing away those pages if it turns out we don't need them is trivial, since they can be immediately reclaimed if they're never referenced. Such pages start life in the swap cache, and on the active_list. (Actually this will never be the case for an mmapped page - such pages are never written to swap.)


* Page P is read during a readahead operation, in which case a sequence of adjacent pages following the faulting page in an mmapped file is read. Such pages start their life in the page cache associated with the mmapped file, and on the active list.

2. P is written by the process, and thus dirtied. At this point P is still on the active_list.

3. P is not used for a while. Periodic invocations of the kernel swap daemon kswapd() will gradually reduce the page->age count. kswapd() wakes up more frequently as memory pressure increases. P's age will gradually decay to 0 if it is not referenced, due to periodic invocations of refill_inactive() by kswapd.

4. If memory is tight, swap_out() will eventually be called by kswapd() to try to evict pages from Process A's virtual address space. Since page P hasn't been referenced and has age 0, the PTE will be dropped, and the only remaining reference to P is the one resulting from its presence in the page cache (assuming, of course, that no other process has mapped the file in the meantime). Note that swap_out() does not actually swap the page out; rather, it simply removes the process's reference to the page, and depends upon the page cache and swap machinery to ensure the page gets written to disk if necessary. (If a PTE has been referenced when swap_out() examines it, the mapped page is aged up - made younger - rather than being unmapped.)

5. Time passes... a little or a lot, depending on memory demand.

6. refill_inactive_scan() comes along, trying to find pages that can be moved to the inactive_dirty list. Since P is not mapped by any process and has age 0, it is moved from the active_list to the inactive_dirty list.

7. Process A attempts to access P, but it's not present in the process VM since the PTE has been cleared by swap_out(). The fault handler calls __find_page_nolock() to try to locate P in the page cache, and lo and behold, it's there, so the PTE can be immediately restored, and P is moved to the active_list, where it remains as long as it is actively used by the process.

8. More time passes; swap_out() clears Process A's PTE for page P, and refill_inactive_scan() deactivates P, moving it to the inactive_dirty list.

9. More time passes... memory gets low.

10. page_launder() is invoked to clean some dirty pages. It finds P on the inactive_dirty_list, notices that it's actually dirty, and attempts to write it out to the disk. When the page has been written, it can then be moved to the inactive_clean_list. The following sequence of events occurs when page_launder() actually decides to write out a page:

* Lock the page.

* We determine the page needs to be written, so we call the writepage method of the page's mapping. That call invokes some filesystem-specific code to perform an asynchronous write to disk with the page locked. At that point, page_launder() is finished with the page: it remains on the inactive_dirty_list, and will be unlocked once the async write completes.

* The next time page_launder() is called, it will find the page clean and move it to the inactive_clean_list, assuming no process has found it in the pagecache and started using it in the meantime.

11. page_launder() runs again, finds the page unused and clean, and moves it to the inactive_clean_list of the page's zone.

12. An attempt is made by someone to allocate a single free page from P's zone. Since the request is for a single page, it can be satisfied by reclaiming an inactive_clean page; P is chosen for reclamation. P is removed from the page cache (ensuring that no other process will be able to gain a reference to it during page fault handling), and it is given to the caller as a free page.

Alternatively, kreclaimd comes along trying to create free memory. It reclaims P and then frees it.

Note that this is only one possible sequence of events: a page can live in the page cache for a long time, aging, being deactivated, being recovered by processes during page fault handling and thereby reactivated, aging, being deactivated, being laundered, being recovered and reactivated, and so on.

Pages can be recovered from the inactive_clean and active lists as well as from the inactive_dirty list. Read-only pages, naturally, are never dirty, so page_launder() can move them from the inactive_dirty_list to the inactive_clean_list "for free," so to speak.

Pages on the inactive_clean list are periodically examined by the kreclaimd kernel thread and freed. The purpose of this is to try to produce larger contiguous free memory blocks, which are needed in some situations.

Finally, note that P is in essence a logical page, though of course it is instantiated by some particular physical page frame.

Usage Notes
When a page is in the page cache, the cache holds a reference to the page. That is, any code that adds a page to the page cache must increment page->count, and any code that removes a page from the page cache must decrement page->count. Failure to honor these rules will cause Bad Things(tm) to happen, since the page reclamation code expects cached pages to have a reference count of exactly 1 (or 2 if the page is also in the buffer cache) in order to be reclaimed.

The existing public interface to the page cache (add_to_page_cache(), etc.) already handles page reference counting properly. You shouldn't attempt to add pages to the cache in any other way.
Page Cache Data Structures
The basic data structures involved in the page cache are:
* The LRU lists of page structs:

  * The active_list, containing active pages (mapped by process page tables).

  * The inactive_dirty_list, containing pages that are not mapped by any process page tables, but which may be dirty.

  * The inactive_clean_list, containing freeable pages that are clean and not mapped by any process page tables. There is one inactive_clean_list per zone, so that allocations from each zone can be fulfilled by reclaiming appropriate inactive_clean pages.

* The struct address_space, which represents the virtual memory image of a file. Each mmap()ed file (executables and files which are mmap()ed by a program) is represented by an address_space struct, which holds on to the inode for the file, a list of all the VM mappings of the file, and a collection of pointers to functions needed to handle various VM tasks on the file, such as reading in and writing out pages. No matter how many processes have mapped a particular file, there is only one address_space struct for the file. Swap files are also represented by address_space structs, which allows the page cache code to be generic for swap pages and mmapped file pages.
* The ubiquitous page struct - these are the items that make up the LRU lists and inode lists.

* The page hash queues - lists of pages with the same hash values. The hash queues are used when looking for mapped pages in the page cache, based on the file and the offset of the page within the file.

Page Cache Code
add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset) is the public interface by which pages are added to the page cache. It adds the given page to the inode and hash queues for the specified address_space, and to the LRU lists.

The actual cache-add operation is performed by a helper function in mm/filemap.c:

* Line 506: if the page is locked, it's a bug. (The page may, for example, be in the process of being written out to disk.)

* Lines 509 and 510: clear the error, dirty, uptodate, and (architecture-specific?) bits, and set the locked bit in the page frame.

* Line 511: take a reference to the page. This is the cache's reference, and is critical for the proper operation of other page-cache code (notably the page reclamation code).

* The remainder of the code in the function adds the page to the mapping, to the page hash queue, and to the LRU lists.

A second entry point is the same as add_to_page_cache(), except that it first checks to be sure the page being added isn't already in the cache.

struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page) is the basic cache-search function. It looks at the pages in the hash queue given by the page argument, examining only those that are associated with the given mapping (address_space), and returns the one with the matching offset, if such exists. The code is "interesting", but pretty straightforward.

As an example of how __find_page_nolock() is used, consider the kernel's activity when handling a page fault on an executable text page:

* The fault handler uses the fault address to find the vm_area_struct corresponding to the fault address. It calls the *nopage member of the vm_ops, which in the case of a text area will be filemap_nopage().

* filemap_nopage() tracks down the backing file via the vm_area_struct->file member. It computes the page's offset into the file using the vm_area_struct's base address and the fault address. It computes the hash value of the faulting page, and calls __find_get_page() to search the page cache, giving it the address_space mapping, offset, and hash queue.

* __find_get_page() calls __find_page_nolock() to search the hash queue for the page at the given offset in the proper mapping. If found, it takes an additional reference to the page on behalf of the faulting process, and returns the page. If the page isn't found, filemap_nopage() reads the page into the page cache and starts over from the beginning.

page_cache_read(struct file * file, unsigned long offset) checks to see if a page corresponding to the given file and offset is already in the page cache. If it's not, it reads in the page and adds it to the cache.

Well, actually the other way around:

* Line 557: is the page in the cache? If so, we're done.

* Line 562: allocate a page (page_cache_alloc() == get_free_page()).

* Line 566: add the page to the page cache, associating it with the specified file and offset.

* Line 567: do whatever is necessary to read the page. Note that adding the page to the page cache has locked it; anyone who wants to use the page must wait for it to be unlocked. The page will be unlocked when its read completes. [How does this happen?]
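The find-or-read shape of those steps can be sketched like this. It is a toy model in user-space C, with a fixed-size cache array and a counter standing in for real disk I/O; none of these names are the kernel's:

```c
/* Toy find-or-read, the shape of page_cache_read(): return the
   cached page if present; otherwise add a *locked* page to the
   cache and "start" the read (counted instead of performed). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 8

struct toy_page { int file_id; long offset; bool locked; bool in_use; };

static struct toy_page cache[CACHE_SLOTS];
static int reads_issued;

static struct toy_page *cache_lookup(int file_id, long offset)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].in_use && cache[i].file_id == file_id &&
            cache[i].offset == offset)
            return &cache[i];
    return NULL;
}

static struct toy_page *toy_page_cache_read(int file_id, long offset)
{
    struct toy_page *p = cache_lookup(file_id, offset);
    if (p)
        return p;                  /* already cached: done */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].in_use) {
            /* Added locked: the page stays locked until the read
               completes, so readers must wait_on_page() first.  */
            cache[i] = (struct toy_page){ file_id, offset, true, true };
            reads_issued++;
            return &cache[i];
        }
    }
    return NULL;                   /* no free slot in this toy */
}
```

The second call for the same (file, offset) pair hits the cache and issues no new read, which is the whole point of the cache.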

void wait_on_page(struct page *page) simply checks to see if the given page is locked, and if it is, waits for it to be unlocked. The actual waiting is done in ___wait_on_page() in mm/filemap.c:

* Line 609: declare a waitqueue entry and add it to the page's wait queue. We're declaring the waitqueue entry on the stack; that's OK, since this instance of ___wait_on_page() won't return until the page is unlocked.

* The loop on line 612 is executed while the page is locked.
  * On line 613 we try to force any pending I/O on the page to complete.
  * We then set the state to uninterruptible sleep, check to see if the page has been unlocked in the meantime, and if not, run the disk I/O task queue and schedule another process. [sync_page() seems to normally do run_task_queue(&tq_disk), via block_sync_page(). So we're forcing I/O twice?]
  * We'll wake up in schedule() as soon as the page gets unlocked and the wait queue is awoken.
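The loop above has a simple shape: keep kicking I/O until someone clears the lock bit. A user-space caricature (no real sleeping or wait queues; the "interrupt handler" is simulated by a pending-I/O counter, and all names are invented):

```c
/* Toy caricature of ___wait_on_page(): spin while the page is
   locked, pushing queued "I/O" forward each time around. In the
   kernel the loop sleeps in schedule() instead of spinning. */
#include <assert.h>
#include <stdbool.h>

struct toy_page { bool locked; int pending_io; };

/* Stand-in for sync_page()/run_task_queue(&tq_disk): complete one
   unit of queued I/O; the last completion unlocks the page, as
   end_buffer_io_async() would. */
static void toy_sync_page(struct toy_page *p)
{
    if (p->pending_io > 0 && --p->pending_io == 0)
        p->locked = false;
}

static int toy_wait_on_page(struct toy_page *p)
{
    int iterations = 0;
    while (p->locked) {
        toy_sync_page(p);  /* force pending I/O toward completion */
        /* real code: set TASK_UNINTERRUPTIBLE, re-check the lock,
           then schedule() until the wait queue is awoken */
        iterations++;
    }
    return iterations;
}
```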

When we need to write a page from the page cache into backing storage, in page_launder() for example, we invoke the writepage member of the page mapping's address_space_operations structure. In most cases this will ultimately call __block_write_full_page(struct inode *inode, struct page *page, get_block_t *get_block), defined in fs/buffer.c on line 1492. __b_w_f_p() writes out the given page via the buffer cache. Essentially, all we do here is create buffers for the page if it hasn't been done yet, map them to disk blocks, and submit them to the disk I/O subsystem to be physically written out asynchronously. We set the I/O completion handler to end_buffer_io_async(), which is a special handler that knows how to deal with page-cache read and write completions.

kswapd() is the kernel swap thread. It simply loops, deactivating and laundering pages as necessary. If memory is tight, it is more aggressive, but it will always trickle a few inactive pages per minute back into the free pool, if there are any inactive pages to be found.

When there's nothing to do, kswapd() sleeps. It can be woken explicitly by tasks that need memory using wakeup_kswapd(); a task that wakes kswapd() can either let it carry on asynchronously, or wait for it to free some pages and go to sleep.

try_to_free_pages() does almost the same thing that kswapd() does, but it does it synchronously without invoking kswapd. Some tasks might deadlock with kswapd() if, for example, they are holding kernel locks that kswapd() needs in order to operate; such processes call try_to_free_pages() instead of wakeup_kswapd(). [An example would be good here.]
Most of kswapd's work is done by do_try_to_free_pages().

do_try_to_free_pages(unsigned int gfp_mask, int user) takes a number of actions in order to try to create free (or freeable) pages.

* Line 921: move pages to the inactive list, and if necessary launder some pages by starting writes to the disk.

* Line 931: attempt to move some active pages to the inactive list.

* Line 935: free unused slab memory.

refill_inactive(unsigned int gfp_mask, int user) is invoked to try to deactivate active pages. It tries to create enough freeable pages to erase any existing free memory shortage.

* The loop beginning on line 849 loops over priority until it reaches 0. Each time through the loop we decrement the priority; lower priority indicates higher memory pressure, and will cause us to take more action to try to free pages.

* On line 857 we attempt to refill the inactive list by deactivating as many unused active pages as possible, using refill_inactive_scan(). If this works well, it's all we need, and the loop will exit.

* On line 868 we shrink the inode cache and the dentry cache, if necessary.

* Then, on line 874 we attempt to forcibly remove process page mappings. The swap_out() function is mis-named: it doesn't actually do any swapping out. Rather, it tries to evict pages from process PTE mappings. If it can eliminate all process mappings of the page, the page becomes eligible for deactivation and laundering.

* If we've eliminated the memory shortage, exit the loop. If we haven't been able to free any pages, do the loop again, more aggressively. Note that the loop is certain to exit eventually, either by eliminating the memory shortage, or by failing to make progress even at priority 0. (Unless you're one of those lucky souls with an infinite number of freeable pages on your system.)
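The escalation loop can be sketched as follows. This is a toy in user-space C, not the kernel's code; `toy_scan_at()` and the shortage bookkeeping are invented stand-ins for page_launder()/refill_inactive_scan() and free_shortage():

```c
/* Toy sketch of the do_try_to_free_pages() escalation loop: start
   at a relaxed priority and work toward 0, scanning harder (and
   hopefully freeing more) each time around. */
#include <assert.h>

#define DEF_PRIORITY 6

static int shortage;                 /* pages we still need to free */

static int toy_scan_at(int priority)
{
    /* Pretend harder scans (lower priority) free more pages. */
    return DEF_PRIORITY - priority + 1;
}

static int toy_try_to_free_pages(void)
{
    int priority = DEF_PRIORITY;
    while (shortage > 0 && priority >= 0) {
        shortage -= toy_scan_at(priority);
        priority--;                  /* escalate the memory pressure */
    }
    return shortage <= 0;            /* 1 = shortage erased */
}
```

Exactly as the text says, the loop terminates either by erasing the shortage or by running out of priorities to escalate to.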

refill_inactive_scan() looks at each page on the active list, aging the pages up if they've been referenced since last time and down if they haven't (this is counterintuitive - older pages have lower age values).

If a page's age is 0, and the only remaining references to the page are those of the buffer cache and/or page cache, then the page is deactivated (moved to the inactive_dirty list).

There is some misleading commentary in this function. It doesn't actually use age_page_down() to age down the page and deactivate it; rather, it ages the page down, and then additionally checks the reference count to be sure it is really unreferenced by process PTEs before actually deactivating the page.
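Per page, the aging logic looks roughly like this. A toy sketch, assuming made-up aging constants (the kernel's actual increments, clamps, and list handling differ by version):

```c
/* Toy sketch of one page's treatment in refill_inactive_scan():
   referenced pages age UP (get "younger"), idle pages age DOWN,
   and a page is deactivated only when its age hits 0 AND nothing
   but the page/buffer caches still references it (count == 1). */
#include <assert.h>
#include <stdbool.h>

struct toy_page {
    int age;          /* counterintuitive: 0 = old, ripe for eviction */
    bool referenced;  /* touched since the last scan?                 */
    int count;        /* reference count                              */
    bool active;      /* still on the active list?                    */
};

static void toy_scan_one(struct toy_page *p)
{
    if (p->referenced) {
        p->age += 3;             /* illustrative PAGE_AGE_ADV-style bump */
        if (p->age > 20)
            p->age = 20;         /* illustrative clamp */
        p->referenced = false;
    } else {
        p->age /= 2;             /* decay toward "old" */
    }
    if (p->age == 0 && p->count == 1)
        p->active = false;       /* deactivate: move to inactive_dirty */
}
```

Note the extra `count == 1` check: an old page that is still mapped by a process PTE stays on the active list, which is exactly the subtlety the commentary in the real function obscures.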

int swap_out(unsigned int priority, int gfp_mask) scans process page tables and attempts to unmap pages from process VM. It computes the per-task swap_cnt value - basically a "kick me" rating; higher values make a process more likely to be victimized - and uses it to decide which process to try to "swap out" first. Larger processes which have not been swapped out in a while have the best chance of being chosen.

The actual unmapping is done by swap_out_mm(), which tries to unmap pages from a process's memory map; swap_out_vma(), which tries to unmap pages from a particular VM area within a process by walking the part of the page directory corresponding to the VM area looking for page middle directory entries to unmap; swap_out_pgd(), which walks page middle directory entries looking for page tables to unmap; swap_out_pmd(), which walks page tables looking for page table entries to unmap; and try_to_swap_out(), where all the interesting stuff happens.

int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, int gfp_mask) attempts to unmap the page indicated by the given pte_t from the given vm_area_struct. It returns 0 if the higher-level swap-out code should continue to scan the current process, 1 if the current scan should be aborted and another swap victim chosen. It's a bit complicated:

* Line 46: if the given page isn't present, return 0 to continue the scan.

* Do some paranoia checks to be sure the page is in a sane state.

* Line 52: if the process's memory map has been victimized by the swap code, return 1 to force a new process to be chosen, and reduce the process's "good victim" rating.

* Line 57: check whether the page has been referenced.

* Line 59: age the page up (make it younger - someone didn't have enough caffeine when they wrote the code :-) if it's been referenced.

* Line 65: if the page is not active, age the page down, but do not move it to the inactive_dirty_list, because (by virtue of us looking at the page in try_to_swap_out()) it's still mapped and thus not freeable. That's the meaning of the mysterious comment on line 64.

* Line 72: if the page is young, don't unmap it. Note that under high memory load, both swap_out() and refill_inactive_scan() will be called more frequently (often by processes attempting to allocate free pages), thus pages will age more quickly, and we have a reasonable chance to find young pages in swap_out().

* Line 75: lock the page frame so nothing happens to it while we're molesting it. If we can't lock it, continue to scan the VMA.

* Line 83: clear the PTE and flush the TLB (so the CPU doesn't continue to believe the page is mapped).

* Line 94: if the page is already in the swap cache (that is, already in the page cache and associated with swap space rather than with a regular file mapping), we just replace the pte with the swapfile entry that will allow us to swap the page in again when required. We also mark the page dirty, if necessary, to force it to be written out the next time page_launder() looks at it.

* Line 102 - drop_pte: we jump to this label from a number of places in order to finish the job. Unlock the page, attempt to deactivate it (this will only succeed if no other process is using it), and drop the reference to the page. Return 0 to continue the scan.

* Line 124: We get here if the page wasn't already in the swap cache (it's either an anonymous page that's never been swapped out before, or else it's an mmapped page with backing storage on a regular file and a valid mapping). If the page is clean we just goto drop_pte, since the page can be recovered by paging it in from its backing store, if any. Note that if it's a clean anonymous page, and it's not in the swap cache, then no process has ever written data to the page - so it can be replaced, if necessary, by allocating a free page.

* Line 133: the page is dirty. If the page has a mapping, we're cool, because page_launder() will write it out if necessary, so we can just goto drop_pte and be done.

* Line 144: The page had no mapping, so we need to allocate a swap page for it and add it to the swap cache. (The swap cache is just that part of the page cache that's associated with a swap mapping rather than with a regular file mapping.) If the attempt to allocate a swap page fails, we must restore the PTE and bail out.

* Line 149: swap space allocated successfully. We add the page to the page cache, set it dirty so page_launder() knows to write it out, and goto set_swap_pte to replace the PTE with the swap entry, deactivate the page, and continue the scan. Simple, eh?
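The decision tree in those bullets boils down to four outcomes. Here it is as a user-space sketch (a toy classifier, not the kernel's code; the enum and field names are invented):

```c
/* Toy sketch of the decision at the heart of try_to_swap_out(),
   once the page is old enough to evict and its PTE is cleared. */
#include <assert.h>
#include <stdbool.h>

enum pte_action {
    SET_SWAP_PTE,   /* already in swap cache: point the PTE at swap  */
    DROP_PTE,       /* recoverable from backing store: just drop it  */
    ADD_TO_SWAP,    /* dirty anonymous page: allocate swap, then drop */
    RESTORE_PTE     /* couldn't allocate swap: put the PTE back      */
};

struct toy_page {
    bool in_swap_cache;
    bool dirty;
    bool has_mapping;   /* backed by a regular-file address_space */
};

static enum pte_action classify(const struct toy_page *p, bool swap_alloc_ok)
{
    if (p->in_swap_cache)
        return SET_SWAP_PTE;            /* the "line 94" case         */
    if (!p->dirty)
        return DROP_PTE;                /* clean: re-read on demand   */
    if (p->has_mapping)
        return DROP_PTE;                /* page_launder writes it out */
    return swap_alloc_ok ? ADD_TO_SWAP  /* dirty and anonymous        */
                         : RESTORE_PTE;
}
```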

int page_launder(int gfp_mask, int sync) is responsible for moving inactive_dirty pages to the inactive_clean list, writing them out to backing disk if necessary. It's pretty straightforward. It makes at most two loops over the inactive_dirty_list. On the first loop it simply moves as many clean pages as it can to the inactive_clean_list. On the second loop it starts asynchronous writes for any dirty pages it finds, after locking the pages; but it leaves those pages on the inactive_dirty list. Eventually the writes will complete and the pages will be unlocked, at which point future invocations of page_launder() can move them to the inactive_clean_list.
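The two-pass structure can be sketched in a few lines of user-space C. A toy model only: the lists are a flat array with flags, and "starting a write" is just a counter; all names are illustrative:

```c
/* Toy sketch of page_launder()'s two passes: pass 1 moves clean,
   unlocked pages to inactive_clean; pass 2 locks dirty pages and
   starts async writes, leaving them on inactive_dirty until the
   write completes and unlocks them. */
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4

struct toy_page { bool dirty; bool locked; bool on_inactive_dirty; };

static struct toy_page pages[NPAGES];
static int moved_clean, writes_started;

static void toy_page_launder(void)
{
    for (int i = 0; i < NPAGES; i++) {          /* pass 1 */
        struct toy_page *p = &pages[i];
        if (p->on_inactive_dirty && !p->dirty && !p->locked) {
            p->on_inactive_dirty = false;       /* -> inactive_clean */
            moved_clean++;
        }
    }
    for (int i = 0; i < NPAGES; i++) {          /* pass 2 */
        struct toy_page *p = &pages[i];
        if (p->on_inactive_dirty && p->dirty && !p->locked) {
            p->locked = true;                   /* lock for I/O         */
            writes_started++;                   /* async write "starts" */
        }
    }
}
```

A later invocation, after the "writes" complete and clear the dirty and locked flags, would sweep the laundered pages across in its own first pass.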

page_launder() may also do synchronous writes if it's invoked by a user process that's trying to free up enough memory to complete an allocation. In this case we start the async writes as above (mm/vmscan.c line 561), but then wait for them to complete, by calling try_to_free_buffers() with wait > 1 (vmscan.c, line 595 - line 603). Starting the page write causes the page to be added to the buffer cache; try_to_free_buffers() attempts to remove the page from the buffer cache by writing all its buffers to disk, if necessary.

struct page * reclaim_page(zone_t * zone) attempts to reclaim an inactive_clean page from the given zone. It examines each page in the zone's inactive_clean list, moving referenced pages to the active list, and dirty pages to the inactive_dirty list. (Both of those tests are basically paranoid and should never happen, since code that remaps an inactive page in the cache also moves it to the active list.) When it finds a genuine clean page, which should happen immediately, it removes the page from the page cache and returns it as a free page.

Note the code from line 431 to line 439: every inactive_clean page is either in the swap cache (if it was an anonymous page) or else it's in the page cache associated with an mmapped file (page->mapping && !PageSwapCache(page)). Exactly one of those conditions will be true.

Note also line 455: this is just another paranoia check. Pages on the inactive_clean list are no longer in the buffer cache (since they've been written out and their buffers freed in page_launder), so every reclaimed page will have a reference count of exactly 1 (the page cache's reference), in the absence of broken kernel modules or bugs.

deactivate_page() is called within try_to_swap_out() to try to deactivate a page. It calls deactivate_page_nolock() for the interesting bits. It checks that the page is unused by anyone except the caller (which presumably is about to release a reference) and the page and buffer caches, then makes the page old by setting its age to 0, and moves it to the inactive_dirty_list, but only if it was on the active list. We never want to deactivate a page that's not in the page cache, since we have no idea what else such pages may be being used for, and the page cache makes some pretty strong assumptions about what it can do with the pages under its control.






Questions and comments
to Joe Knapka

The links in this page were produced by lxrreplace.tcl, which is available for free.


Last changed: 01-25-06 10:11:22

This page was rendered by LittleSite.
All content Copyright (c) 2005 by J.Knapka.
