Friday, January 23, 2009

grub usage one case

Originally grub is installed in MRB of /dev/sda0. Since I want to replace the disk of /dev/sd0, need to install the same grub in /dev/sda1.

#sudo grub
grub> setup (hd1)
grub> quit

Install Grub
In order to install GRUB as your boot loader, you need to first install the GRUB system and utilities under your UNIX-like operating system (see Obtaining and Building GRUB). You can do this either from the source tarball, or as a package for your OS.

After you have done that, you need to install the boot loader on a drive (floppy or hard disk). There are two ways of doing that - either using the utility grub-install (see Invoking grub-install) on a UNIX-like OS, or by running GRUB itself from a floppy. These are quite similar, however the utility might probe a wrong BIOS drive, so you should be careful.

Also, if you install GRUB on a UNIX-like OS, please make sure that you have an emergency boot disk ready, so that you can rescue your computer if, by any chance, your hard drive becomes unusable (unbootable).

GRUB comes with boot images, which are normally put in the directory /usr/lib/grub/i386-pc. If you do not use grub-install, then you need to copy the files stage1, stage2, and *stage1_5 to the directory /boot/grub, and run the grub-set-default (see Invoking grub-set-default) if you intend to use `default saved' (see default) in your configuration file. Hereafter, the directory where GRUB images are initially placed (normally /usr/lib/grub/i386-pc) will be called the image directory, and the directory where the boot loader needs to find them (usually /boot/grub) will be called the boot directory.

Installing GRUB natively

Caution: Installing GRUB's stage1 in this manner will erase the normal boot-sector used by an OS.

GRUB can currently boot GNU Mach, Linux, FreeBSD, NetBSD, and OpenBSD directly, so using it on a boot sector (the first sector of a partition) should be okay. But generally, it would be a good idea to back up the first sector of the partition on which you are installing GRUB's stage1. This isn't as important if you are installing GRUB on the first sector of a hard disk, since it's easy to reinitialize it (e.g. by running `FDISK /MBR' from DOS).

If you decide to install GRUB in the native environment, which is definitely desirable, you'll need to create a GRUB boot disk, and reboot your computer with it. Otherwise, see Installing GRUB using grub-install.

Once started, GRUB will show the command-line interface (see Command-line interface). First, set the GRUB's root device4 to the partition containing the boot directory, like this:

grub> root (hd0,0)

If you are not sure which partition actually holds this directory, use the command find (see find), like this:

grub> find /boot/grub/stage1

This will search for the file name /boot/grub/stage1 and show the devices which contain the file.

Once you've set the root device correctly, run the command setup (see setup):

grub> setup (hd0)

This command will install the GRUB boot loader on the Master Boot Record (MBR) of the first drive. If you want to put GRUB into the boot sector of a partition instead of putting it in the MBR, specify the partition into which you want to install GRUB:

grub> setup (hd0,0)

If you install GRUB into a partition or a drive other than the first one, you must chain-load GRUB from another boot loader. Refer to the manual for the boot loader to know how to chain-load GRUB.

After using the setup command, you will boot into GRUB without the GRUB floppy. See the chapter Booting to find out how to boot your operating systems from GRUB.


This description details the structure at the start of a hard disk in a PC, and how GRUB fits into that structure.
The PC hard disk layout
The diagram below is the disk layout of my linux laptop which is very standard. There are 4 primary partitions whose size and function are labeled, but this document is going to concentrate on the expanded area in the diagram at the start of the disk. I will describe its structure and how GRUB fits in. Note the structure is the same for windows systems, only the content (grub) is different.

PC hard disk layout
The Master Boot Record
The first sector (512 bytes) is the MBR and consists of, 446 bytes bootloader, 64 bytes partition table and a 2 byte signature (0xAA55). It get its name because it is the first boot record that is loaded, which can then load other boot records from various locations on the disk. To give a hexdump of this first sector of the disk you can do dd bs=512 count=1 if=/dev/hda | od -Ax -tx1z -v (changing /dev/hda as appropriate for your setup):

000000 eb 48 90 d0 bc 00 7c fb 50 07 50 1f fc be 1b 7c >.H....|.P.P....|<
000010 bf 1b 06 50 57 b9 e5 01 f3 a4 cb be be 07 b1 04 >...PW...........<
000020 38 2c 7c 09 75 15 83 c6 10 e2 f5 cd 18 8b 14 8b >8,|.u...........<
000030 ee 83 c6 10 49 74 16 38 2c 74 f6 be 10 07 03 02 >....It.8,t......<
000040 80 00 00 80 b8 85 64 00 00 08 fa 80 ca 80 ea 53 >......d........S<
000050 7c 00 00 31 c0 8e d8 8e d0 bc 00 20 fb a0 40 7c >|..1....... ..@|<
000060 3c ff 74 02 88 c2 52 be 79 7d e8 34 01 f6 c2 80 ><.t...R.y}.4....<
000070 74 54 b4 41 bb aa 55 cd 13 5a 52 72 49 81 fb 55 >tT.A..U..ZRrI..U<
000080 aa 75 43 a0 41 7c 84 c0 75 05 83 e1 01 74 37 66 >.uC.A|..u....t7f<
000090 8b 4c 10 be 05 7c c6 44 ff 01 66 8b 1e 44 7c c7 >.L...|.D..f..D|.<
0000a0 04 10 00 c7 44 02 01 00 66 89 5c 08 c7 44 06 00 >....D...f.\..D..<
0000b0 70 66 31 c0 89 44 04 66 89 44 0c b4 42 cd 13 72 >pf1..D.f.D..B..r<
0000c0 05 bb 00 70 eb 7d b4 08 cd 13 73 0a f6 c2 80 0f >...p.}....s.....<
0000d0 84 f0 00 e9 8d 00 be 05 7c c6 44 ff 00 66 31 c0 >........|.D..f1.<
0000e0 88 f0 40 66 89 44 04 31 d2 88 ca c1 e2 02 88 e8 >..@f.D.1........<
0000f0 88 f4 40 89 44 08 31 c0 88 d0 c0 e8 02 66 89 04 >..@.D.1......f..<
000100 66 a1 44 7c 66 31 d2 66 f7 34 88 54 0a 66 31 d2 >f.D|f1.f.4.T.f1.<
000110 66 f7 74 04 88 54 0b 89 44 0c 3b 44 08 7d 3c 8a >f.t..T..D.;D.}<.<
000120 54 0d c0 e2 06 8a 4c 0a fe c1 08 d1 8a 6c 0c 5a >T.....L......l.Z<
000130 8a 74 0b bb 00 70 8e c3 31 db b8 01 02 cd 13 72 >.t...p..1......r<
000140 2a 8c c3 8e 06 48 7c 60 1e b9 00 01 8e db 31 f6 >*....H|`......1.<
000150 31 ff fc f3 a5 1f 61 ff 26 42 7c be 7f 7d e8 40 >1.....a.&B|..}.@<
000160 00 eb 0e be 84 7d e8 38 00 eb 06 be 8e 7d e8 30 >.....}.8.....}.0<
000170 00 be 93 7d e8 2a 00 eb fe 47 52 55 42 20 00 47 >...}.*...GRUB .G<
000180 65 6f 6d 00 48 61 72 64 20 44 69 73 6b 00 52 65 >eom.Hard Disk.Re<
000190 61 64 00 20 45 72 72 6f 72 00 bb 01 00 b4 0e cd >ad. Error.......<
0001a0 10 ac 3c 00 75 f4 c3 00 00 00 00 00 00 00 00 00 >..<.u...........<
0001b0 00 00 00 00 00 00 00 00 5f 00 5f 00 00 00 00 01 >........_._.....<
0001c0 01 00 0b fe 7f 97 3f 00 00 00 59 03 64 00 00 00 >......?...Y.d...<
0001d0 41 98 83 fe 7f a4 98 03 64 00 cd 2f 03 00 00 00 >A.......d../....<
0001e0 41 a5 83 fe ff ff 65 33 67 00 fc 08 fa 00 80 fe >A.....e3g.......<
0001f0 ff ff 0f fe ff ff 61 3c 61 01 1f ed f2 00 55 aa >......a
Note there is no mystery about or dissassembly required of the blue section above as GRUB is open source and one can download, inspect and modify its stage 1 source code.
The DOS compatibility region
This region is optional as far as linux is concerned at least, but is added by default by most partition managers. To understand why this region was required we need to describe how disks used to be addressed. Generally now disks are addressed in LBA mode which allows for greater capacity disks while abstracting software away from the specifics of the disk itself. Previously though disks were addressed in CHS mode, which represented the physical construction of the disk as can be seen in the diagram below:

Cylinders Heads Sectors

DOS had the requirement that its image did not span across cylinders, and so this region was added by partition managers so that the first partition was aligned on a cylinder boundary. Therefore this region's size is determined by the number of sectors (512 bytes) per cylinder. The maximum (and usual given todays disk sizes and LBA) sectors per cylinder is 63, which leaves 62 sectors free after the MBR (31,744 bytes).

GRUB uses this region to store its stage 1.5, which is filesystem specific code used to find the operating system image on the "boot" filesystem. Currently GRUB does not need all this space as can be seen for the copies of all the stage 1.5 files in the boot partition:

$ ls -lS /boot/grub/*stage1_5
-rw-r--r-- 1 root root 9428 Mar 8 14:27 /boot/grub/reiserfs_stage1_5
-rw-r--r-- 1 root root 9308 Mar 8 14:27 /boot/grub/xfs_stage1_5
-rw-r--r-- 1 root root 8448 Mar 8 14:27 /boot/grub/jfs_stage1_5
-rw-r--r-- 1 root root 7956 Mar 8 14:27 /boot/grub/e2fs_stage1_5
-rw-r--r-- 1 root root 7684 Mar 8 14:27 /boot/grub/fat_stage1_5
-rw-r--r-- 1 root root 7272 Mar 8 14:27 /boot/grub/ufs2_stage1_5
-rw-r--r-- 1 root root 7188 Mar 8 14:27 /boot/grub/minix_stage1_5
-rw-r--r-- 1 root root 7028 Mar 8 14:27 /boot/grub/iso9660_stage1_5
-rw-r--r-- 1 root root 6996 Mar 8 14:27 /boot/grub/ffs_stage1_5
-rw-r--r-- 1 root root 6612 Mar 8 14:27 /boot/grub/vstafs_stage1_5

Therefore one could theoretically use the last 43 sectors of this region (22,016 bytes) for anything. There is no point in creating a filesystem in here as the overhead would be too much, but you could dd stuff in and out like:
dd bs=512 seek=20 count=43 if=myfile of=/dev/hda
dd bs=512 skip=20 count=43 if=/dev/hda of=myfile
Note be very sure you know what your doing before running these commands.

Note on my laptop the last sector of this region contains the first sector of a MSWIN4.1 boot record (which I understand is 3 sectors in total), as it came with winxp installed (in a FAT32 partition). GRUB makes this sector redundant even in a dual boot situation, so don't worry about overwriting it. You can inspect this sector using: dd bs=512 skip=62 count=1 if=/dev/hda | od -Ax -tx1z -v
GRUB 1 boot process
GRUB or the GRand Unified Bootloader is the usual bootloader used on linux systems, and resides on the system as described above. The boot process with GRUB is as follows:

* BIOS code reads MBR from a disk (looks at last 2 bytes to verify if MBR).
* Starts executing bootloader code (GRUB stage 1).
* Bootloader jumps to location (sector num) of next stage. This sector num is stored at a particular location in the bootloader "code" at GRUB install time and usually points to a stage 1.5 located in the "DOS compat space" immediately after the MBR.
* Stage 1.5 knows about the boot filesystem so it opens the filesystem on the specified (at install time) partition. It looks for the stage 2 executable here and executes it. Note since stage 1.5 knows about the boot filesystem it gives much greater flexibility in upgrading stage 2 and the kernel etc. as their new locations don't need to be written to the earlier GRUB stages.
* Stage 2 contains most of the GRUB logic. It loads the menu.lst file and executes the statements, usually providing a menu to the user etc.

Subsequent steps are distro specific but for completeness I'll describe them for my fedora linux distribution:

* When GRUB starts booting one of the entries, it reads the initial ramdisk and starts the kernel running telling it about the ramdisk.
* In the initial ramdisk, the nash shell is run to parse the /linuxrc file. It essentially finds the location of the filesystem it itself is on and passes that to the kernel as its root filesystem. This allows for greater flexibility of the devices the kernel resides on.
* The kernel reads its root filesystem and executes /bin/init by default. This in turn parses /etc/inittab which usually sets up the login consoles and starts executing the scripts in /etc/init.d
* These scripts start various subsystems, culminating in starting X. X in turn starts the display manager which gives a login prompt.

GRUB 2 boot process
The structure of GRUB has changed quite a bit with version 2, (which is still in development at the time of writing). Instead of stage 1, stage 1.5 and stage 2 images it has a collection of modules that can be combined in different ways. To mirror the functionality the original GRUB's stages as described above, you would have for example:

stage 1 = boot.img
stage 1.5 = diskboot.img+kernel.img+pc.mod+ext2.mod (the core image)
stage 2 = normal.mod+_chain.mod

The boot process for GRUB 2 then would be:

* BIOS code reads MBR from a disk (looks at last 2 bytes to verify if MBR).
* Starts executing bootloader code (boot.img).
* This loads the first sector of the core image (diskboot.img), whose location was stored at a particular location in boot.img at GRUB install time and usually points to the core image located in the "DOS compat space" immediately after the MBR. diskboot.img in turn loads and executes the rest of the core image.
* These core image modules know about the boot filesystem and can open the filesystem on the specified (at install time) partition. It looks for and loads modules there, including normal.mod
* normal.mod loads the grub.cfg file and executes the statements, usually providing a menu to the user etc.

This is a more flexible mechanism. For example one can prepend pxeboot.img to the core image instead of diskboot.img. This will then load the whole core image from the network and then start executing kernel.img.

Thursday, January 15, 2009

I/O scheduler

Although most Linux users are familiar with the role of process
schedulers, such as the new O(1) scheduler, many users are not so
familiar with the role of I/O schedulers. I/O schedulers are similar in
some aspects to process schedulers; for instance, both schedule some resource
among multiple users. A process scheduler virtualizes the resource of
processor time among multiple executing processes on the system. So,
what does an I/O scheduler schedule?

A naïve system would not even include an I/O scheduler. Unlike the process
scheduler, the I/O scheduler is not a mandatory component of the operating
system. Instead, performance is the I/O scheduler's sole
raison d'être.

To understand the role of an I/O scheduler, let's go over some background
information and then look at how a system behaves without an I/O
scheduler. Hard disks address their data using the familiar geometry-based
addressing of cylinders, heads and sectors. A hard
drive is composed of multiple platters, each consisting of a single
disk, spindle and read/write head. Each platter is divided further into
circular ring-like tracks, similar to a CD or record. Finally,
each track is composed of some integer number of sectors.

To locate a specific unit of data in a hard drive, the drive's logic
requires three pieces of information: the cylinder, the head and the
sector. The cylinder specifies the track on which the data resides. If
you lay the platters on top of one another (as they are in a hard disk),
a given track forms a cylinder through each platter. The head then
identifies the exact read/write head (and thus the exact platter)
in question. The search now is narrowed down to a single track on a
single platter. Finally, the sector value denotes the exact sector on
the track. The search is complete: the hard disk knows what sector,
on what track, on what platter the data resides. It can position the
read/write head of the correct platter over the correct track and read
the proper sector.

Thankfully, modern hard disks do not force computers to communicate
with them in terms of cylinders, heads and sectors. Instead, modern
hard drives map a unique block number over each cylinder/head/sector
triplet. The unique number identifies a specific cylinder/head/sector
value. Modern operating systems then can address hard drives using
this block number—called logical block addressing—and the
hard drive translates the block number into the correct
cylinder/head/sector value.

One thing of note about this block number: although nothing guarantees
it, the physical mapping tends to be sequential. That is, logical
block n tends to be physically adjacent to logical block
n+1. We discuss why that is important later on.

Now, let's consider the typical UNIX system. Applications as varied
as databases, e-mail clients, Web servers and text editors issue I/O
requests to the disk, such as read this block and write to that
block. The blocks tend to be located physically all over the disk. The
e-mail spool may be located in an entirely different region of the disk
from the Web server's HTML data or the text editor's configuration
file. Indeed, even a single file can be strewn all over the disk if the
file is fragmented, that is, not laid out in sequential blocks. Because the files
are broken down into individual blocks, and hard drives are addressed by
block and not the much more abstract concepts of files, reading or writing
file data is broken down into a stream of many individual I/O requests,
each to a different block. With luck, the blocks are sequential or at
least physically close together. If the blocks are not near one
another, the disk head must move to another location on the disk. Moving
the disk head is called seeking, and it is one of the most expensive
operations in a computer. The seek time on modern hard drives is measured
in the tens of milliseconds. This
is one reason why defragmented files are a good thing.

Unfortunately, it does not matter if the files are defragmented because
the system is generating I/O requests for multiple files, all over the
disk. The e-mail client wants a little from here and the Web server wants
a little from there—but wait, now the text editor wants to read a
file. The net effect is that the disk head is made to jump around the
disk. In a worst-case scenario, with interleaved I/O requests to multiple
files, the head can spend all of its time jumping around from
one location to another—not a good thing for overall system performance.

This is where the I/O scheduler comes in. The I/O scheduler schedules
the pending I/O requests in order to minimize the time spent moving the
disk head. This, in turn, minimizes disk seek time and maximizes hard
disk throughput.

This magic is accomplished through two main actions, sorting
and merging. First, the I/O scheduler keeps the list of pending
I/O requests sorted by block number. When a new I/O request is issued,
it is inserted, block-wise, into the list of pending requests. This
prevents the drive head from seeking all around the disk to service I/O
requests. Instead, by keeping the list sorted, the disk head moves in
a straight line around the disk. If the hard drive is busy servicing
a request at one part of the disk, and a new request comes in to the
same part of the disk, that request can be serviced before moving off
to other parts of the disk.

Merging occurs when an I/O request is issued to an identical or adjacent
region of the disk. Instead of issuing the new request on its own, it
is merged into the identical or adjacent request. This minimizes
the number of outstanding requests.

Let's look at an example. Consider the case where two applications issue requests
to the following block numbers, such that they arrived in the kernel in
this order: 10, 500, 12, 502, 14, 504 and 12. The I/O scheduler-less
approach would service these blocks in the given order. That is seven
long seeks, back and forth between two parts of the disk. What a waste!
If the kernel sorted and merged these requests, however, and serviced
them in that order, the results would be much different: 10, 12, 14, 500,
502 and 504. Only a single far-off seek and one less request overall.

In this manner, an I/O scheduler virtualizes the resources of
disk I/O among multiple I/O requests in order to maximize global
throughput. Because I/O throughput is so crucial to system performance
and because seek time is so horribly slow, the job of an I/O scheduler
is important.

The Linus Elevator

The I/O scheduler found in the 2.4 Linux kernel is named the Linus
Elevator. I/O schedulers often are called elevator algorithms,
because they tackle a problem similar to that of keeping an elevator
moving smoothly in a large building. The Linus Elevator functions almost
exactly like the classic I/O scheduler described above. For the most part,
this was great because simplicity is a good thing and the 2.4 kernel's
I/O scheduler just worked. Unfortunately, in the I/O scheduler's quest
to maximize global I/O throughput, a trade-off was made: local
fairness—in particular, request latency—can go easily out the window. Let's
look at an example.

Consider a stream of requests to logical disk blocks such as 20, 30,
700 and 25. The I/O scheduler's sorting algorithm would queue and issue
the requests in the following order (assuming the disk head currently
is at the logical start of the disk): 20, 25, 30 and 700. This is expected
and indeed correct. Assume, however, that halfway through servicing
the request to block 25, another request comes in to the same part of
the disk. And then another. And yet another. It is entirely feasible
that the request way over to block 700 is not serviced for a relatively
long time.

Worse, what if the request was to read a disk block? Read requests
generally are synchronous. When an application issues a request to
read some data, it typically blocks and waits until the kernel
returns the data. The application must sit and wait, twiddling
its thumbs, until that request way over at block 700 finally
is serviced. Writes, on the other hand, typically are not
are asynchronous. When an application issues a write, the kernel
copies the data and metadata into the kernel, prepares a buffer to hold
the data and returns to the application. The application does not really
care or even know when the data actually hits the disk.

It gets worse for our friend the read request, however. Because writes
are asynchronous, writes tend to stream. That is, it is common
for a large writeback of a lot of data to occur. This implies that many
individual write requests are submitted to a close area of the hard
disk. As an example, consider saving a large file. The application dumps
write requests on the system and hard drive as fast as it is scheduled.

Read requests, conversely, usually do not stream. Instead, applications
submit read requests in small one-by-one chunks, with each chunk dependent
on the last. Consider reading all of the files in a directory. The
application opens the first file, issues a read request for a suitable
chunk of the file, waits for the returned data, issues a read request for
the next chunk, waits and continues likewise until the entire file is
read. Then the file is closed, the next file is opened and the process
repeats. Each subsequent request has to wait for the previous,
which means substantial delays to this application if the requests are to
far-off disk blocks. The phenomenon of streaming write requests starving
dependent read requests is called writes-starving-reads (see Sidebar
“Test 1. Writes-Starving-Reads”).

The possibility of not servicing some requests in a reasonable amount
of time is called starvation. Request starvation results in
unfairness. In the case of I/O schedulers, the system explicitly
has decided to trade fairness for improved global throughput. That is,
the system attempts to improve the overall performance of the system
at the possible expense of any one specific I/O request. This is
accepted and, indeed, desired—except that prolonged starvation is
bad. Starving read requests for even moderate lengths of time results in
high application latency for applications issuing read requests during
other disk activity. This high read latency adversely affects
system performance and feel (see Sidebar “Test 2. Effects of High
Read Latency”).

The Deadline I/O Scheduler

Preventing the starvation of requests in general, and read requests in
particular, was a goal of the new 2.6 I/O schedulers.

The Deadline I/O Scheduler was introduced to solve the starvation issue
surrounding the 2.4 I/O scheduler and traditional elevator algorithms
in general. As discussed, the Linus Elevator maintains the sorted list
of pending I/O requests in a single queue. The I/O request at the head of
the queue is the next one to be serviced. The Deadline I/O Scheduler continues to
keep this queue, but kicks things up a notch by introducing two additional
queues—the read FIFO queue and the write FIFO queue. The
Deadline I/O Scheduler keeps the items in each of these queues sorted by
submission time (effectively, first in is first out). The read FIFO
queue, as its name suggests, contains only read requests. The write FIFO
queue, likewise, contains only write requests. Each FIFO queue is assigned
an expiration value. The read FIFO queue has an expiration time
of 500 milliseconds. The write FIFO queue has an expiration time of five seconds.

When a new I/O request is submitted, it is insertion-sorted into the
standard queue and placed at the tail of its respective (either read
or write) FIFO queue. Normally, the hard drive is sent I/O requests from
the head of the standard sorted queue. This maximizes global throughput
by minimizing seeks, as the normal queue is sorted by block number,
as with the Linus Elevator.

When the item at the head of one of the FIFO queues, however, grows
older than the expiration value associated with its queue, the I/O
scheduler stops dispatching I/O requests from the standard queue.
Instead, it services the I/O request at the head of the FIFO queue, plus a
couple extra for good measure. The I/O scheduler needs to check and
handle only the requests at the head of the FIFO queues, as those are the
oldest requests in the queue.

Remember our old friend, the request to block 700? Despite the flood of
write requests to the faraway blocks, after 500 milliseconds the Deadline
I/O Scheduler would stop servicing those requests and service the read
over at block 700. The disk would seek to block 700, service the read
request and then continue servicing the other outstanding requests.

In this manner, the Deadline I/O Scheduler can enforce a soft deadline on
I/O requests. Although it makes no promise that an I/O request is serviced
before the expiration time, the I/O scheduler generally services requests
near their expiration times. Thus, the Deadline I/O Scheduler continues
to provide good global throughput without starving any one request for
an unacceptably long time. Because read requests are given short
expiration times, the writes-starving-reads problem is minimized.

Anticipatory I/O Scheduler

This is all well and good, but it's not a perfect solution. Consider what
happens with our fictional request to block 700, which presumably is the
first of many dependent reads to that area of the disk. After servicing
the read request, the Deadline I/O Scheduler continues servicing the
write requests to the earlier blocks. This is fine, until the reading
application submits its next read request (say, to block 710). In 500
milliseconds, that request expires and the disk seeks over
to block 710, services the request, seeks back to where it was before
and continues servicing the streaming write. And then another read

The problem again stems from those darn dependent reads. Because reads
are issued in dependent chunks, the application issues the next read
only when the previous is returned. But by the time the application
receives the read data, is scheduled to run and submits the next read,
the I/O scheduler has moved on and begun servicing some
other requests. This results in a wasted pair of seeks for each read:
seek to the read, service it and seek back. If only there was some way
for the I/O scheduler to know—nay, to anticipate—that another
read would soon be submitted to the same part of the disk. Instead
of seeking back and forth, it could wait in anticipation for the next
read. Saving those awful seeks certainly is worth a few milliseconds of
waiting; we save two seeks.

This is, of course, exactly what the Anticipatory I/O Scheduler
does. It began as the Deadline I/O Scheduler; it implements the
same deadline-based scheduling. But it was gifted with the addition
of an anticipation mechanism. When a read request is submitted,
the Anticipatory I/O Scheduler services it within its deadline, as
usual. Unlike the Deadline I/O Scheduler, however, the Anticipatory
I/O Scheduler then sits and waits, doing nothing, for up to six
milliseconds. Chances are good that the
application will issue another read to the same part of the filesystem
during those six milliseconds. If
so, that request is serviced immediately, and the Anticipatory I/O
Scheduler waits some more. If six milliseconds go by without a read
request, the Anticipatory I/O Scheduler guessed wrong and returns to
whatever it was doing before.

If even a moderate number of requests are anticipated correctly, a great
deal of time (two expensive seeks, each) is saved (Table 1). Because most reads are
dependent, the anticipation pays off most of the time. To further improve
the odds of a correct anticipation, the Anticipatory I/O Scheduler uses a
heuristic to better guess for which processes to wait. To
this end, the scheduler maintains I/O statistics about each process to
keep track of its behavior. Because of these statistics and intelligent
heuristics, the Anticipatory I/O Scheduler correctly anticipates the
actions of applications a sufficiently large amount of the time to be
well worth the overhead.

Table 1. The Results

I/O Scheduler and KernelTest 1Test 2
Linus Elevator on 2.445 seconds30 minutes, 28 seconds
Deadline I/O Scheduler on 2.640 seconds3 minutes, 30 seconds
Anticipatory I/O Scheduler on 2.64.6 seconds15 seconds

By minimizing unneeded seeks and more quickly servicing read requests,
in many workloads the Anticipatory I/O Scheduler provides both improved
request latency and global throughput over the Deadline I/O Scheduler
and the Linus Elevator. Unsurprisingly, the Anticipatory I/O Scheduler
is the default I/O scheduler in the 2.6 kernel. And rightly so, it rocks.


Andrea Arcangeli and Jens Axboe are the primary authors of the
Linus Elevator. Jens Axboe is the primary author of the Deadline I/O
Scheduler. Nick Piggin is the primary author of the Anticipatory I/O

Robert Love ( is a kernel hacker at MontaVista Software
and a student at the University of Florida. He is the author of Linux
Kernel Development
. Robert loves O.A.R. and lives in Gainesville,

Tuesday, January 13, 2009

Gnu Makefile by example



# tells make to export all variables to child processes by default. make adds the variable and
# its value to the environment for running each command. The sub-make, in turn, uses the
# environment to initialize its table of variable values.

include $(ROOTDIR)/vendors/config/common/config.arch

# The makefiles need to know how to do things in different contexts
# To save some pain we put it all here
# First settings we always want for all builds

MACHINE = bfin # this is for uClibc
ARCH = blackfin # this is for the kernel
ENDIAN = little

CONFIGURE_HOST := bfin-linux-uclibc
CONFIGURE_SHARED_ENABLE := --enable-shared
CONFIGURE_SHARED_WITH := --with-shared
CONFIGURE_HOST := bfin-uclinux
CONFIGURE_SHARED_ENABLE := --disable-shared
CONFIGURE_SHARED_WITH := --without-shared
KERNEL_CROSS_COMPILE := bfin-uclinux-

# ":=" - Simple Expanded Variable. The value of a simply expanded variable is scanned once and for all, expanding any references to other variables and functions, when the variable is defined. This makes Makefile programing simple.

CONFIG_SITE = $(ROOTDIR)/vendors/config/


MKFS_CRAMFS = $(ROOTDIR)/user/cramfs/host_build/mkcramfs
MKFS_EXT2 = $(ROOTDIR)/user/genext2fs/build-host/genext2fs
MKFS_JFFS2 = $(ROOTDIR)/user/mtd-utils/mkfs.jffs2
MTD_SUMTOOL = $(ROOTDIR)/user/mtd-utils/sumtool
MKFS_ROMFS = $(ROOTDIR)/user/genromfs/genromfs
MKFS_UBIFS = $(ROOTDIR)/user/mtd-utils/mkfs.ubifs
MKFS_YAFFS = $(ROOTDIR)/user/yaffs-utils/mkfs.yaffs
MKFS_YAFFS2 = $(ROOTDIR)/user/yaffs-utils/mkfs.yaffs2

# Settings for configure / autotools

CONFIGURE_BUILD := $(shell sh $(ROOTDIR)/tools/config.guess)

# Here "shell" is a function - The shell function performs the same function that backquotes (``') perform in most shells: it does command expansion.

--host=$(CONFIGURE_HOST) \
--build=$(CONFIGURE_BUILD) \
--prefix=/usr \
--sysconfdir=/etc \
--datadir=/usr/share \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--localstatedir=/var/lib \
--disable-dependency-tracking \
ifneq ($(findstring s,$(MAKEFLAGS)),)

# $(findstring find,text) - Locate find in text.
# MAKEFLAGS - The flags given to make. You can set this in the environment or a makefile to set flags.
# "-s" : silent, quiet

cpu-$(CONFIG_BF512) := bf512
cpu-$(CONFIG_BF514) := bf514
cpu-$(CONFIG_BF516) := bf516
cpu-$(CONFIG_BF518) := bf518
cpu-$(CONFIG_BF522) := bf522
cpu-$(CONFIG_BF523) := bf523
cpu-$(CONFIG_BF524) := bf524
cpu-$(CONFIG_BF525) := bf525
cpu-$(CONFIG_BF526) := bf526
cpu-$(CONFIG_BF527) := bf527
cpu-$(CONFIG_BF531) := bf531
cpu-$(CONFIG_BF532) := bf532
cpu-$(CONFIG_BF533) := bf533
cpu-$(CONFIG_BF534) := bf534
cpu-$(CONFIG_BF536) := bf536
cpu-$(CONFIG_BF537) := bf537
cpu-$(CONFIG_BF538) := bf538
cpu-$(CONFIG_BF539) := bf539
cpu-$(CONFIG_BF542) := bf542
cpu-$(CONFIG_BF544) := bf544
cpu-$(CONFIG_BF547) := bf547
cpu-$(CONFIG_BF548) := bf548
cpu-$(CONFIG_BF549) := bf549
cpu-$(CONFIG_BF561) := bf561

rev-$(CONFIG_BF_REV_0_0) := 0.0
rev-$(CONFIG_BF_REV_0_1) := 0.1
rev-$(CONFIG_BF_REV_0_2) := 0.2
rev-$(CONFIG_BF_REV_0_3) := 0.3
rev-$(CONFIG_BF_REV_0_4) := 0.4
rev-$(CONFIG_BF_REV_0_5) := 0.5
rev-$(CONFIG_BF_REV_0_6) := 0.6
rev-$(CONFIG_BF_REV_ANY) := any
rev-$(CONFIG_BF_REV_NONE) := none
ifeq ($(rev-y),)
$(warning )
$(warning The Blackfin Silicon Revision you are targetting is not known.)
$(warning Please file a bug report about this.)
$(warning )
rev-y := any

ifeq ($(cpu-y),)
$(warning )
$(warning The Blackfin CPU you are targetting is not known.)
$(warning Please file a bug report about this.)
$(warning )
CPUFLAGS = -mcpu=$(cpu-y)-$(rev-y)

# Set up all our fun CFLAGS/CPPFLAGS/LDFLAGS. Normalize our
# settings so we don't differentiate between user and lib.


CFLAGS-y := -pipe -Wall
CXXFLAGS-y := -pipe -Wall
CPPFLAGS-y := -DEMBED -D__uClinux__ -I$(ROOTDIR)


ifeq ($(CONFIG_ALL_DEBUG),y)
CFLAGS-y += -O0 -g
CFLAGS-y += $(subst ",,$(strip $(CONFIG_USER_CFLAGS)))

# $(strip string) - Remove excess whitespace characters from string.
# $(subst from,to,text) - Replace from with to in text.




CFLAGS-$(CONFIG_FMT_USE_SHARED_FLAT) += -mid-shared-library
LDFLAGS-$(CONFIG_FMT_USE_SHARED_FLAT) += -Wl,-shared-lib-id,0 -mid-shared-library
LDFLAGS-$(CONFIG_FMT_USE_SEP_DATA) += -Wl,-shared-lib-id,0 -msep-data

# This will prevent building up of code in lib/ as shared libs.
# You will need to build/link those by hand anyways ...
CFLAGS-$(CONFIG_FMT_USE_SHARED_FLAT) += -mshared-library-id=0
FLAT_LIBC := $(shell $(CC) $(CPUFLAGS) -mid-shared-library -print-file-name=libc.gdb)



Blog Archive