adam's home: 2009

Wednesday, June 24, 2009

0624


1. GDB:

"display" in three flavors:

display expr
    Add the expression expr to the list of expressions to display each time your program stops. See section Expressions.

    display does not repeat if you press RET again after using it.

display/fmt expr
    For fmt specifying only a display format and not a size or count, add the expression expr to the auto-display list but arrange to display it each time in the specified format fmt. See section Output Formats.

display/fmt addr
    For fmt `i' or `s', or including a unit-size or a number of units, add the expression addr as a memory address to be examined each time your program stops. Examining means in effect doing `x/fmt addr'. See section Examining Memory. 


Example for the 3rd usage: 
 display/1xh 0xffc00fc0

This is useful for watching a 16-bit memory-mapped register: the value at 0xffc00fc0 is shown as one hex halfword each time the program stops.

2. /proc/devices: shows the major numbers of the registered character and block devices

root:/> cat /proc/devices 
Character devices:
  1 mem
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
 10 misc
 13 input
128 ptm
136 pts
150 rtpipe
204 ttyBF
254 rtc

Block devices:
  1 ramdisk
259 blkext
 31 mtdblock
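
A driver or test tool sometimes needs the major number for a given name in the listing above. Here is a minimal sketch in C, assuming the listing has already been read into a string; find_major() is a hypothetical helper for this note, not a kernel or libc function.

```c
#include <stdio.h>
#include <string.h>

/* Return the major number listed under `name` in a /proc/devices-style
 * text, or -1 if the name does not appear.
 * Header lines such as "Character devices:" do not match the
 * "<number> <name>" pattern and are skipped. */
static int find_major(const char *text, const char *name)
{
    const char *line = text;

    while (*line) {
        int major;
        char dev[64];

        if (sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;

        /* advance to the start of the next line */
        line = strchr(line, '\n');
        if (!line)
            break;
        line++;
    }
    return -1;
}
```

The sketch does not distinguish the character section from the block section; it returns the first entry whose name matches.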

3. About Trace Buffer

Blackfin records every change in program sequence (discontinuity) in the trace buffer, if the trace buffer is enabled. The TBUF register table has 16 entries.

- A bit (TBUFOVF) in the trace buffer control register enables an exception when the TBUF overflows (more than 16 entries). The exception handler can read the TBUF into memory, so that more than 16 traces can be recorded.
- The trace buffer can be configured to omit the recording of changes in program flow that match either the last entry or one of the last two entries.
- If TBUFOVF = 1, the Trace Unit does not record discontinuities in the exception, NMI, and reset routines, because the trace buffer itself triggers an exception.
- Setting TBUFOVF has an impact on performance: every 16 sequence changes trigger an exception.
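
The overflow scheme above can be modelled on a host machine. The following is only a simulation sketch in C: it touches no Blackfin registers, and every name in it (trace_unit, tbuf_record, tbuf_overflow_handler, LOG_CAPACITY) is made up for illustration. It assumes the overflow path runs when a 17th entry arrives.

```c
#include <stddef.h>

#define TBUF_DEPTH   16    /* hardware trace buffer entries (from the note above) */
#define LOG_CAPACITY 256   /* size of the in-memory log; arbitrary for this sketch */

struct trace_entry { unsigned long src, dst; };

struct trace_unit {
    struct trace_entry hw[TBUF_DEPTH];     /* models the 16-entry TBUF */
    size_t hw_count;
    struct trace_entry log[LOG_CAPACITY];  /* filled by the "exception handler" */
    size_t log_count;
    unsigned overflow_exceptions;          /* how often the overflow path ran */
};

/* Models the overflow exception handler: drain the TBUF into memory. */
static void tbuf_overflow_handler(struct trace_unit *tu)
{
    for (size_t i = 0; i < tu->hw_count && tu->log_count < LOG_CAPACITY; i++)
        tu->log[tu->log_count++] = tu->hw[i];
    tu->hw_count = 0;
    tu->overflow_exceptions++;
}

/* Models the hardware recording one change of program flow. */
static void tbuf_record(struct trace_unit *tu, unsigned long src, unsigned long dst)
{
    if (tu->hw_count == TBUF_DEPTH)        /* buffer full: take the "exception" */
        tbuf_overflow_handler(tu);
    tu->hw[tu->hw_count].src = src;
    tu->hw[tu->hw_count].dst = dst;
    tu->hw_count++;
}
```

Recording 40 discontinuities in this model takes the overflow path twice, which illustrates the performance cost mentioned in the last bullet.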

Wednesday, April 15, 2009

bfin spi device


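Since childhood I have had an intense sense of dread. In space I am infinitesimally small; in time I am a meteor's flash that can never be repeated. The universe without "me" goes on as before, while "I" end up as nothing at all, without a trace or a shadow. All of this is cruel, yet it is a reality that cannot be avoided.

Thinking of this, all things are empty and every hope turns to ash.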


-------------------------------------------------------------------------------

struct spi_device {
 struct device  dev;
 struct spi_master *master;
 u32   max_speed_hz;
 u8   chip_select;
 u8   mode;
#define SPI_CPHA 0x01   /* clock phase */
#define SPI_CPOL 0x02   /* clock polarity */
#define SPI_MODE_0 (0|0)   /* (original MicroWire) */
#define SPI_MODE_1 (0|SPI_CPHA)
#define SPI_MODE_2 (SPI_CPOL|0)
#define SPI_MODE_3 (SPI_CPOL|SPI_CPHA)
#define SPI_CS_HIGH 0x04   /* chipselect active high? */
#define SPI_LSB_FIRST 0x08   /* per-word bits-on-wire */
#define SPI_3WIRE 0x10   /* SI/SO signals shared */
#define SPI_LOOP 0x20   /* loopback mode */
 u8   bits_per_word;
 int   irq;
 void   *controller_state;
 void   *controller_data;
 const char  *modalias;
};

-------------------------------------------------------------------------------
struct bfin5xx_spi_chip {
 u16 ctl_reg;
 u8 enable_dma;
 u8 bits_per_word;
 u8 cs_change_per_word;
 u16 cs_chg_udelay; /* Some devices require 16-bit delays */
 u32 cs_gpio;
 /* Value to send if no TX value is supplied, usually 0x0 or 0xFFFF */
 u16 idle_tx_val;
};


#if defined(CONFIG_ENC28J60) || defined(CONFIG_ENC28J60_MODULE)
static struct bfin5xx_spi_chip enc28j60_spi_chip_info = {
 .enable_dma = 1,
 .bits_per_word = 8,
 .cs_gpio = GPIO_PF10,
};
#endif



------------------------------------------------------------------------------

/*
 * These structures are used in two places.  Their primary role is to
 * be stored in tables of board-specific device descriptors, which are
 * declared early in board initialization and then used (much later) to
 * populate a controller's device tree after the that controller's driver
 * initializes.  A secondary (and atypical) role is as a parameter to
 * spi_new_device() call, which happens after those controller drivers
 * are active in some dynamic board configuration models.
 */
struct spi_board_info {
 /* the device name and module name are coupled, like platform_bus;
  * "modalias" is normally the driver name.
  *
  * platform_data goes to spi_device.dev.platform_data,
  * controller_data goes to spi_device.controller_data,
  * irq is copied too
  */
 char  modalias[32];
 const void *platform_data;
 void  *controller_data;
 int  irq;

 /* slower signaling on noisy or low voltage boards */
 u32  max_speed_hz;


 /* bus_num is board specific and matches the bus_num of some
  * spi_master that will probably be registered later.
  *
  * chip_select reflects how this chip is wired to that master;
  * it's less than num_chipselect.
  */
 u16  bus_num;
 u16  chip_select;

 /* mode becomes spi_device.mode, and is essential for chips
  * where the default of SPI_CS_HIGH = 0 is wrong.
  */
 u8  mode;

 /* ... may need additional spi_device chip config data here.
  * avoid stuff protocol drivers can set; but include stuff
  * needed to behave without being bound to a driver:
  *  - quirks like clock rate mattering when not selected
  */
};

static struct spi_board_info bfin_spi_board_info[] __initdata = {

[snip]

#if defined(CONFIG_ENC28J60) || defined(CONFIG_ENC28J60_MODULE)
 {
  .modalias = "enc28j60",
  .max_speed_hz = 20000000,     /* max spi clock (SCK) speed in HZ */
  .irq = IRQ_PF6,
  .bus_num = 0,
  .chip_select = 0, /* GPIO controlled SSEL */
  .controller_data = &enc28j60_spi_chip_info,
  .mode = SPI_MODE_0,
 },
#endif

[snip]

};

static int __init stamp_init(void)
{
 printk(KERN_INFO "%s(): registering device resources\n", __func__);
 i2c_register_board_info(0, bfin_i2c_board_info,
    ARRAY_SIZE(bfin_i2c_board_info));
 bfin_plat_nand_init();
 platform_add_devices(stamp_devices, ARRAY_SIZE(stamp_devices));
 spi_register_board_info(bfin_spi_board_info, ARRAY_SIZE(bfin_spi_board_info));

[snip]
}

Later, when the controller registers, the call chain is:
int spi_register_master(struct spi_master *master) -->
        static void scan_boardinfo(struct spi_master *master) -->
               struct spi_device *spi_new_device(struct spi_master *master, struct spi_board_info *chip)

In spi_new_device(), the fields of spi_board_info are copied into a new struct spi_device.

spi_board_info.controller_data (i.e. &enc28j60_spi_chip_info) is copied to spi_device.controller_data.

In bfin_spi_setup(), spi_set_ctldata(spi, chip) then sets spi_device.controller_state.
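
The two steps above (copy the board info, then attach driver state) can be sketched with cut-down structures. This is an illustration only: the *_sketch types and functions are hypothetical stand-ins that keep just the fields discussed in this note, and they are not the kernel's definitions.

```c
#include <string.h>

/* Cut-down stand-in for struct spi_board_info. */
struct board_info_sketch {
    char           modalias[32];
    void          *controller_data;
    int            irq;
    unsigned       max_speed_hz;
    unsigned short chip_select;
    unsigned char  mode;
};

/* Cut-down stand-in for struct spi_device. */
struct spi_device_sketch {
    unsigned       max_speed_hz;
    unsigned char  chip_select;
    unsigned char  mode;
    int            irq;
    void          *controller_state;  /* set later by the controller's setup step */
    void          *controller_data;   /* copied from the board info */
    const char    *modalias;
};

/* Mirrors the copy step described for spi_new_device(). */
static void new_device_sketch(struct spi_device_sketch *spi,
                              const struct board_info_sketch *chip)
{
    memset(spi, 0, sizeof(*spi));
    spi->max_speed_hz    = chip->max_speed_hz;
    spi->chip_select     = (unsigned char)chip->chip_select;
    spi->mode            = chip->mode;
    spi->irq             = chip->irq;
    spi->controller_data = chip->controller_data;
    spi->modalias        = chip->modalias;
}

/* Mirrors the note on bfin_spi_setup(): driver-private state is
 * attached through controller_state. */
static void setup_sketch(struct spi_device_sketch *spi, void *driver_state)
{
    spi->controller_state = driver_state;
}
```

The point of the split is that controller_data comes from the board file, while controller_state belongs to the controller driver and is filled in only once setup runs.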

--------------------------------------------------------------------------

Tuesday, March 17, 2009

0317 log


How do I remove a file whose name begins with a "-" ?

rm ./-filename

or

rm -- -filename

http://www.faqs.org/faqs/unix-faq/faq/part2/section-1.html

Wednesday, March 11, 2009

bfin generic irq handling


Docs:
================
linux/Documentation/DocBook/genericirq (make htmldocs)

LWN: A new generic IRQ layer - http://lwn.net/Articles/184750/

Summary:
================

"struct irq_desc" represents an IRQ; "struct irq_chip" provides the chip-level handling.
Arch code needs to define its own "irq_chip" structures. See below.

All the chip-level initialization is done in init_arch_irq(), which calls set_irq_chip() and set_irq_handler() to initialize each "irq_desc" structure.
This sets up the generic irq framework.

A driver needs to call request_irq() to register an irq handler.

When an irq happens, the arch code calls desc->handle_irq() (this flow handler is installed in init_arch_irq()), which then walks the list of irqaction entries
and calls each handler that a driver registered.
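
The summary above can be condensed into a toy model. This is a sketch with hypothetical names (desc_sketch, action_sketch, request_irq_sketch and so on), not the kernel API; it only shows the shape of "descriptor holds a flow handler, flow handler walks the action chain".

```c
#include <stddef.h>

typedef int (*handler_fn)(int irq, void *dev_id);   /* returns 1 if handled */

struct action_sketch {                 /* stands in for struct irqaction */
    handler_fn handler;
    void *dev_id;
    struct action_sketch *next;
};

struct desc_sketch {                   /* stands in for struct irq_desc */
    void (*handle_irq)(int irq, struct desc_sketch *desc);  /* flow handler */
    struct action_sketch *action;      /* handlers registered by drivers */
    unsigned handled;                  /* counter used only by this sketch */
};

/* Flow handler: walk the action chain and call every driver handler. */
static void handle_event_sketch(int irq, struct desc_sketch *desc)
{
    for (struct action_sketch *a = desc->action; a; a = a->next)
        desc->handled += (unsigned)a->handler(irq, a->dev_id);
}

#define NR_IRQS_SKETCH 8
static struct desc_sketch irq_table[NR_IRQS_SKETCH];

/* Arch init: give every descriptor a flow handler. */
static void init_irq_sketch(void)
{
    for (int i = 0; i < NR_IRQS_SKETCH; i++)
        irq_table[i].handle_irq = handle_event_sketch;
}

/* Driver side: register a handler (prepends for brevity). */
static int request_irq_sketch(int irq, struct action_sketch *act)
{
    if (irq < 0 || irq >= NR_IRQS_SKETCH)
        return -1;
    act->next = irq_table[irq].action;
    irq_table[irq].action = act;
    return 0;
}

/* Arch entry point: dispatch through the descriptor. */
static void do_irq_sketch(int irq)
{
    irq_table[irq].handle_irq(irq, &irq_table[irq]);
}

/* Example driver handler: counts how often it ran. */
static int count_handler(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
    return 1;
}
```

A driver in this model fills in an action_sketch with count_handler and a pointer to its counter, calls request_irq_sketch(), and every later do_irq_sketch() on that number bumps the counter.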

irq chips:
======================
static struct irq_chip bfin_core_irqchip = {
 .name = "CORE",
 .ack = bfin_ack_noop,
 .mask = bfin_core_mask_irq,
 .unmask = bfin_core_unmask_irq,
};

static struct irq_chip bfin_internal_irqchip = {
 .name = "INTN",
 .ack = bfin_ack_noop,
 .mask = bfin_internal_mask_irq,
 .unmask = bfin_internal_unmask_irq,
 .mask_ack = bfin_internal_mask_irq,
 .disable = bfin_internal_mask_irq,
 .enable = bfin_internal_unmask_irq,
#ifdef CONFIG_PM
 .set_wake = bfin_internal_set_wake,
#endif
};
static struct irq_chip bfin_generic_error_irqchip = {
 .name = "ERROR",
 .ack = bfin_ack_noop,
 .mask_ack = bfin_generic_error_mask_irq,
 .mask = bfin_generic_error_mask_irq,
 .unmask = bfin_generic_error_unmask_irq,
};

static struct irq_chip bfin_gpio_irqchip = {
 .name = "GPIO",
 .ack = bfin_gpio_ack_irq,
 .mask = bfin_gpio_mask_irq,
 .mask_ack = bfin_gpio_mask_ack_irq,
 .unmask = bfin_gpio_unmask_irq,
 .disable = bfin_gpio_mask_irq,
 .enable = bfin_gpio_unmask_irq,
 .set_type = bfin_gpio_irq_type,
 .startup = bfin_gpio_irq_startup,
 .shutdown = bfin_gpio_irq_shutdown,
#ifdef CONFIG_PM
 .set_wake = bfin_gpio_set_wake,
#endif
};

struct irq_desc:
====================
Each interrupt is described by an interrupt descriptor structure irq_desc

/**
 * struct irq_desc - interrupt descriptor
 * @irq:  interrupt number for this descriptor
 * @handle_irq:  highlevel irq-events handler [if NULL, __do_IRQ()]
 * @chip:  low level interrupt hardware access
 * @msi_desc:  MSI descriptor
 * @handler_data: per-IRQ data for the irq_chip methods
 * @chip_data:  platform-specific per-chip private data for the chip
 *   methods, to allow shared chip implementations
 * @action:  the irq action chain
 * @status:  status information
 * @depth:  disable-depth, for nested irq_disable() calls
 * @wake_depth:  enable depth, for multiple set_irq_wake() callers
 * @irq_count:  stats field to detect stalled irqs
 * @irqs_unhandled: stats field for spurious unhandled interrupts
 * @last_unhandled: aging timer for unhandled count
 * @lock:  locking for SMP
 * @affinity:  IRQ affinity on SMP
 * @cpu:  cpu index useful for balancing
 * @pending_mask: pending rebalanced interrupts
 * @dir:  /proc/irq/ procfs entry
 * @name:  flow handler name for /proc/interrupts output
 */
struct irq_desc {
 unsigned int  irq;
 irq_flow_handler_t handle_irq;
 struct irq_chip  *chip;
 struct msi_desc  *msi_desc;
 void   *handler_data;
 void   *chip_data;
 struct irqaction *action; /* IRQ action list */
 unsigned int  status;  /* IRQ status */

 unsigned int  depth;  /* nested irq disables */
 unsigned int  wake_depth; /* nested wake enables */
 unsigned int  irq_count; /* For detecting broken IRQs */
 unsigned int  irqs_unhandled;
 unsigned long  last_unhandled; /* Aging timer for unhandled count */
 spinlock_t  lock;
#ifdef CONFIG_SMP
 cpumask_t  affinity;
 unsigned int  cpu;
#endif
#ifdef CONFIG_GENERIC_PENDING_IRQ
 cpumask_t  pending_mask;
#endif
#ifdef CONFIG_PROC_FS
 struct proc_dir_entry *dir;
#endif
 const char  *name;
} ____cacheline_internodealigned_in_smp;


struct irq_chip:
===================================================
/**
 * struct irq_chip - hardware interrupt chip descriptor
 *
 * @name:  name for /proc/interrupts
 * @startup:  start up the interrupt (defaults to ->enable if NULL)
 * @shutdown:  shut down the interrupt (defaults to ->disable if NULL)
 * @enable:  enable the interrupt (defaults to chip->unmask if NULL)
 * @disable:  disable the interrupt (defaults to chip->mask if NULL)
 * @ack:  start of a new interrupt
 * @mask:  mask an interrupt source
 * @mask_ack:  ack and mask an interrupt source
 * @unmask:  unmask an interrupt source
 * @eoi:  end of interrupt - chip level
 * @end:  end of interrupt - flow level
 * @set_affinity: set the CPU affinity on SMP machines
 * @retrigger:  resend an IRQ to the CPU
 * @set_type:  set the flow type (IRQ_TYPE_LEVEL/etc.) of an IRQ
 * @set_wake:  enable/disable power-management wake-on of an IRQ
 *
 * @release:  release function solely used by UML
 * @typename:  obsoleted by name, kept as migration helper
 */
struct irq_chip {
 const char *name;
 unsigned int (*startup)(unsigned int irq);
 void  (*shutdown)(unsigned int irq);
 void  (*enable)(unsigned int irq);
 void  (*disable)(unsigned int irq);

 void  (*ack)(unsigned int irq);
 void  (*mask)(unsigned int irq);
 void  (*mask_ack)(unsigned int irq);
 void  (*unmask)(unsigned int irq);
 void  (*eoi)(unsigned int irq);

 void  (*end)(unsigned int irq);
 void  (*set_affinity)(unsigned int irq, cpumask_t dest);
 int  (*retrigger)(unsigned int irq);
 int  (*set_type)(unsigned int irq, unsigned int flow_type);
 int  (*set_wake)(unsigned int irq, unsigned int on);

 /* Currently used only by UML, might disappear one day.*/
#ifdef CONFIG_IRQ_RELEASE_METHOD
 void  (*release)(unsigned int irq, void *dev_id);
#endif
 /*
  * For compatibility, ->typename is copied into ->name.
  * Will disappear.
  */
 const char *typename;
};


handle_IRQ_event():
===================================================
irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
{
 irqreturn_t ret, retval = IRQ_NONE;
 unsigned int status = 0;

 if (!(action->flags & IRQF_DISABLED))
  local_irq_enable_in_hardirq();

 do {
  ret = action->handler(irq, action->dev_id);
  if (ret == IRQ_HANDLED)
   status |= action->flags;
  retval |= ret;
  action = action->next;
 } while (action);

 if (status & IRQF_SAMPLE_RANDOM)
  add_interrupt_randomness(irq);
 local_irq_disable();

 return retval;
}


irq flow
=======================================
Low-level arch code ends up calling
"irq_desc[irq].handle_irq()":

asm_do_IRQ() -->
    generic_handle_irq() -->
       desc->handle_irq() --> (or __do_IRQ() if desc->handle_irq == NULL)
       (In init_arch_irq(), desc->handle_irq is set to one of the default flow handlers: handle_simple_irq(), handle_level_irq(), etc.)
              --> handle_IRQ_event() (handles the irqaction chain).


static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
#ifdef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
 desc->handle_irq(irq, desc);
#else
 if (likely(desc->handle_irq))
  desc->handle_irq(irq, desc);
 else
  __do_IRQ(irq);
#endif
}


Register an IRQ:
=======================================================================
request_irq() -->
      __setup_irq() -->
         __irq_set_trigger() -->
             chip->set_type(irq, flags & IRQF_TRIGGER_MASK) -->
                  bfin_gpio_irq_type() [mach-common/ints-priority.c]

static int
__setup_irq(unsigned int irq, struct irq_desc * desc, struct irqaction *new)
{
 struct irqaction *old, **p;
 const char *old_name = NULL;
 unsigned long flags;
 int shared = 0;
 int ret;

 if (!desc)
  return -EINVAL;

 if (desc->chip == &no_irq_chip)
  return -ENOSYS;
 /*
  * Some drivers like serial.c use request_irq() heavily,
  * so we have to be careful not to interfere with a
  * running system.
  */
 if (new->flags & IRQF_SAMPLE_RANDOM) {
  /*
   * This function might sleep, we want to call it first,
   * outside of the atomic block.
   * Yes, this might clear the entropy pool if the wrong
   * driver is attempted to be loaded, without actually
   * installing a new handler, but is this really a problem,
   * only the sysadmin is able to do this.
   */
  rand_initialize_irq(irq);
 }

 /*
  * The following block of code has to be executed atomically
  */
 spin_lock_irqsave(&desc->lock, flags);
 p = &desc->action;
 old = *p;
 if (old) {
  /*
   * Can't share interrupts unless both agree to and are
   * the same type (level, edge, polarity). So both flag
   * fields must have IRQF_SHARED set and the bits which
   * set the trigger type must match.
   */
  if (!((old->flags & new->flags) & IRQF_SHARED) ||
      ((old->flags ^ new->flags) & IRQF_TRIGGER_MASK)) {
   old_name = old->name;
   goto mismatch;
  }

#if defined(CONFIG_IRQ_PER_CPU)
  /* All handlers must agree on per-cpuness */
  if ((old->flags & IRQF_PERCPU) !=
      (new->flags & IRQF_PERCPU))
   goto mismatch;
#endif

  /* add new interrupt at end of irq queue */
  do {
   p = &old->next;
   old = *p;
  } while (old);
  shared = 1;
 }

 if (!shared) {
  irq_chip_set_defaults(desc->chip);

  /* Setup the type (level, edge polarity) if configured: */
  if (new->flags & IRQF_TRIGGER_MASK) {
   ret = __irq_set_trigger(desc, irq, new->flags);

   if (ret) {
    spin_unlock_irqrestore(&desc->lock, flags);
    return ret;
   }
  } else
   compat_irq_chip_set_default_handler(desc);
#if defined(CONFIG_IRQ_PER_CPU)
  if (new->flags & IRQF_PERCPU)
   desc->status |= IRQ_PER_CPU;
#endif

  desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING |
      IRQ_INPROGRESS | IRQ_SPURIOUS_DISABLED);

  if (!(desc->status & IRQ_NOAUTOEN)) {
   desc->depth = 0;
   desc->status &= ~IRQ_DISABLED;
   desc->chip->startup(irq);
  } else
   /* Undo nested disables: */
   desc->depth = 1;

  /* Exclude IRQ from balancing if requested */
  if (new->flags & IRQF_NOBALANCING)
   desc->status |= IRQ_NO_BALANCING;

  /* Set default affinity mask once everything is setup */
  do_irq_select_affinity(irq, desc);

 } else if ((new->flags & IRQF_TRIGGER_MASK)
   && (new->flags & IRQF_TRIGGER_MASK)
    != (desc->status & IRQ_TYPE_SENSE_MASK)) {
  /* hope the handler works with the actual trigger mode... */
  pr_warning("IRQ %d uses trigger mode %d; requested %d\n",
    irq, (int)(desc->status & IRQ_TYPE_SENSE_MASK),
    (int)(new->flags & IRQF_TRIGGER_MASK));
 }

 *p = new;

 /* Reset broken irq detection when installing new handler */
 desc->irq_count = 0;
 desc->irqs_unhandled = 0;

 /*
  * Check whether we disabled the irq via the spurious handler
  * before. Reenable it and give it another chance.
  */
 if (shared && (desc->status & IRQ_SPURIOUS_DISABLED)) {
  desc->status &= ~IRQ_SPURIOUS_DISABLED;
  __enable_irq(desc, irq);
 }

 spin_unlock_irqrestore(&desc->lock, flags);

 new->irq = irq;
 register_irq_proc(irq, desc);
 new->dir = NULL;
 register_handler_proc(irq, new);

 return 0;

mismatch:
#ifdef CONFIG_DEBUG_SHIRQ
 if (!(new->flags & IRQF_PROBE_SHARED)) {
  printk(KERN_ERR "IRQ handler type mismatch for IRQ %d\n", irq);
  if (old_name)
   printk(KERN_ERR "current handler: %s\n", old_name);
  dump_stack();
 }
#endif
 spin_unlock_irqrestore(&desc->lock, flags);
 return -EBUSY;
}
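
The part of __setup_irq() that is easiest to misread is the pointer-to-pointer walk that finds the tail of the action chain. Below is a reduced sketch of just that idiom plus the sharing check; struct act, SHARED_FLAG and chain_append() are hypothetical names for this note, and locking, trigger types and per-CPU checks are left out.

```c
#include <stddef.h>

#define SHARED_FLAG 0x1   /* stands in for IRQF_SHARED */

struct act {
    unsigned flags;
    struct act *next;
};

/* Append new_act to the chain at *head.
 * Returns 0 on success, or -1 if the chain is non-empty and the
 * existing handler and the new one did not both ask for sharing. */
static int chain_append(struct act **head, struct act *new_act)
{
    struct act **p = head;   /* always points at the link to be written */
    struct act *old = *p;

    if (old) {
        if (!((old->flags & new_act->flags) & SHARED_FLAG))
            return -1;

        /* walk to the end: p ends up at the last ->next field */
        do {
            p = &old->next;
            old = *p;
        } while (old);
    }

    new_act->next = NULL;
    *p = new_act;            /* works for the empty and non-empty case alike */
    return 0;
}
```

Because p points at a link rather than at a node, the final store needs no special case for an empty chain, which is the same reason the kernel code can end with a plain "*p = new;".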

Tuesday, March 10, 2009

linker - relocation


http://www.airs.com/blog/archives/category/programming/page/14/

Basic Linker Data Types

The linker operates on a small number of basic data types: symbols, relocations, and contents. These are defined in the input object files. Here is an overview of each of these.

A symbol is basically a name and a value. Many symbols represent static objects in the original source code–that is, objects which exist in a single place for the duration of the program. For example, in an object file generated from C code, there will be a symbol for each function and for each global and static variable. The value of such a symbol is simply an offset into the contents. This type of symbol is known as a defined symbol. It’s important not to confuse the value of the symbol representing the variable my_global_var with the value of my_global_var itself. The value of the symbol is roughly the address of the variable: the value you would get from the expression &my_global_var in C.

Symbols are also used to indicate a reference to a name defined in a different object file. Such a reference is known as an undefined symbol. There are other less commonly used types of symbols which I will describe later.

During the linking process, the linker will assign an address to each defined symbol, and will resolve each undefined symbol by finding a defined symbol with the same name.

A relocation is a computation to perform on the contents. Most relocations refer to a symbol and to an offset within the contents. Many relocations will also provide an additional operand, known as the addend. A simple, and commonly used, relocation is “set this location in the contents to the value of this symbol plus this addend.” The types of computations that relocations do are inherently dependent on the architecture of the processor for which the linker is generating code. For example, RISC processors which require two or more instructions to form a memory address will have separate relocations to be used with each of those instructions; for example, “set this location in the contents to the lower 16 bits of the value of this symbol.”

During the linking process, the linker will perform all of the relocation computations as directed. A relocation in an object file may refer to an undefined symbol. If the linker is unable to resolve that symbol, it will normally issue an error (but not always: for some symbol types or some relocation types an error may not be appropriate).

The contents are what memory should look like during the execution of the program. Contents have a size, an array of bytes, and a type. They contain the machine code generated by the compiler and assembler (known as text). They contain the values of initialized variables (data). They contain static unnamed data like string constants and switch tables (read-only data or rdata). They contain uninitialized variables, in which case the array of bytes is generally omitted and assumed to contain only zeroes (bss). The compiler and the assembler work hard to generate exactly the right contents, but the linker really doesn’t care about them except as raw data. The linker reads the contents from each file, concatenates them all together sorted by type, applies the relocations, and writes the result into the executable file.

Basic Linker Operation

At this point we already know enough to understand the basic steps used by every linker.

    * Read the input object files. Determine the length and type of the contents. Read the symbols.
    * Build a symbol table containing all the symbols, linking undefined symbols to their definitions.
    * Decide where all the contents should go in the output executable file, which means deciding where they should go in memory when the program runs.
    * Read the contents data and the relocations. Apply the relocations to the contents. Write the result to the output file.
    * Optionally write out the complete symbol table with the final values of the symbols.
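
To make the "symbol plus addend" relocation concrete, here is a small sketch of the resolve-and-apply steps from the list above. It is a toy under stated assumptions: the struct layouts and the names lookup() and apply_relocs() are invented for this note, only one relocation type (a 32-bit absolute store) is handled, and values are written in host byte order.

```c
#include <stdint.h>
#include <string.h>

struct symbol {
    const char *name;
    uint32_t    value;     /* address assigned by the "linker" */
    int         defined;   /* 0 for an undefined reference */
};

struct reloc {
    size_t      offset;    /* where in the contents to store the result */
    const char *sym_name;  /* symbol the relocation refers to */
    int32_t     addend;
};

/* Find a defined symbol by name, or return NULL. */
static const struct symbol *lookup(const struct symbol *tab, size_t n,
                                   const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].defined && strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Apply "set this location to symbol value + addend" to every entry.
 * Returns 0 on success, -1 on an unresolved symbol or a bad offset. */
static int apply_relocs(uint8_t *contents, size_t size,
                        const struct reloc *rel, size_t nrel,
                        const struct symbol *tab, size_t nsym)
{
    for (size_t i = 0; i < nrel; i++) {
        const struct symbol *s = lookup(tab, nsym, rel[i].sym_name);
        uint32_t v;

        if (!s || size < sizeof(v) || rel[i].offset > size - sizeof(v))
            return -1;

        v = s->value + (uint32_t)rel[i].addend;
        memcpy(contents + rel[i].offset, &v, sizeof(v));
    }
    return 0;
}
```

With my_global_var defined at 0x1000 and a relocation at offset 4 with addend 8, the four bytes at offset 4 end up holding 0x1008; a relocation against a symbol that is never defined makes apply_relocs() fail, which corresponds to the linker's "undefined symbol" error.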


In the common case, code may be compiled in two different modes. By default, code is position dependent. Putting position dependent code into a shared library will cause the program linker to generate a lot of relocation information, and cause the dynamic linker to do a lot of processing at runtime. Code may also be compiled in position independent mode, typically with the -fpic option. Position independent code is slightly slower when it calls a non-static function or refers to a global or static variable. However, it requires much less relocation information, and thus the dynamic linker will start the program faster.

Position independent code will call non-static functions via the Procedure Linkage Table or PLT. This PLT does not exist in .o files. In a .o file, use of the PLT is indicated by a special relocation. When the program linker processes such a relocation, it will create an entry in the PLT. It will adjust the instruction such that it becomes a PC-relative call to the PLT entry. PC-relative calls are inherently position independent and thus do not require a relocation entry themselves. The program linker will create a relocation for the PLT entry which tells the dynamic linker which symbol is associated with that entry. This process reduces the number of dynamic relocations in the shared library from one per function call to one per function called.


A detailed example can be found here:

http://bottomupcs.sourceforge.net/csbu/x3867.htm

Libraries
The Procedure Lookup Table

Libraries may contain many functions, and a program may end up including many libraries to get its work done. A program may only use one or two functions from each library of the many available, and depending on the run-time path through the code may use some functions and not others.

As we have seen, the process of dynamic linking is a fairly computationally intensive one, since it involves looking up and searching through many tables. Anything that can be done to reduce the overheads will increase performance.

The Procedure Lookup Table (PLT) facilitates what is called lazy binding in programs. Binding is synonymous with the fix-up process described above for variables located in the GOT. When an entry has been "fixed-up" it is said to be "bound" to its real address.

As we mentioned, sometimes a program will include a function from a library but never actually call that function, depending on user input. The process of binding this function is quite intensive, involving loading code, searching through tables and writing memory. To go through the process of binding a function that is not used is simply a waste of time.

Lazy binding defers this expense until the actual function is called by using a PLT.

Each library function has an entry in the PLT, which initially points to some special dummy code. When the program calls the function, it actually calls the PLT entry (in the same way as variables are referenced through the GOT).

This dummy function will load a few parameters that need to be passed to the dynamic linker for it to resolve the function and then call into a special lookup function of the dynamic linker. The dynamic linker finds the real address of the function, and writes that location into the calling binary over the top of the dummy function call.

Thus, the next time the function is called the address can be loaded without having to go back into the dynamic loader again. If a function is never called, then the PLT entry will never be modified but there will be no runtime overhead.
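
The lazy-binding idea described above can be mimicked in plain C with a function pointer that starts out aimed at a stub. This is only an analogy, not how a real PLT is laid out; all names (square_slot, square_stub, resolve_square, call_square) are invented for this sketch.

```c
#include <stddef.h>

typedef int (*func_t)(int);

/* The "library" function the program really wants. */
static int real_square(int x) { return x * x; }

/* Counts trips into the pretend dynamic linker. */
static unsigned resolver_calls;

static int square_stub(int x);   /* the dummy code, defined below */

/* One PLT-like slot; it initially points at the dummy code. */
static func_t square_slot = square_stub;

/* Stands in for the dynamic linker's lookup function. */
static func_t resolve_square(void)
{
    resolver_calls++;
    return real_square;
}

/* Dummy code: resolve, patch the slot, then complete the original call. */
static int square_stub(int x)
{
    square_slot = resolve_square();
    return square_slot(x);
}

/* Callers always go through the slot, never to real_square directly. */
static int call_square(int x)
{
    return square_slot(x);
}
```

The first call_square() pays for the lookup and rewrites the slot; later calls jump straight to real_square. If call_square() is never invoked, the resolver never runs.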

The PLT in action

Things start to get a bit hairy here! If nothing else, you should begin to appreciate that there is a fair bit of work in resolving a dynamic symbol!

Let us consider the simple "hello World" application. This will only make one library call to printf to output the string to the user.

Example 9-8. Hello World PLT example

    $ cat hello.c
    #include <stdio.h>
    
    int main(void)
    {
            printf("Hello, World!\n");
            return 0;
    }
    
    $ gcc -o hello hello.c
    
    $ readelf --relocs ./hello
    
    Relocation section '.rela.dyn' at offset 0x3f0 contains 2 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    6000000000000ed8  000700000047 R_IA64_FPTR64LSB  0000000000000000 _Jv_RegisterClasses + 0
    6000000000000ee0  000900000047 R_IA64_FPTR64LSB  0000000000000000 __gmon_start__ + 0
    
    Relocation section '.rela.IA_64.pltoff' at offset 0x420 contains 3 entries:
      Offset          Info           Type           Sym. Value    Sym. Name + Addend
    6000000000000f10  000200000081 R_IA64_IPLTLSB    0000000000000000 printf + 0
    6000000000000f20  000800000081 R_IA64_IPLTLSB    0000000000000000 __libc_start_main + 0
    6000000000000f30  000900000081 R_IA64_IPLTLSB    0000000000000000 __gmon_start__ + 0

We can see above that we have a R_IA64_IPLTLSB relocation for our printf symbol. This is saying "put the address of symbol printf into memory address 0x6000000000000f10". We have to start digging deeper to find the exact procedure that gets us the function.

Below we have a look at the disassembly of the main() function of the program.

Example 9-9. Hello world main()

    4000000000000790 <main>:
    4000000000000790:       00 08 15 08 80 05       [MII]       alloc r33=ar.pfs,5,4,0
    4000000000000796:       20 02 30 00 42 60                   mov r34=r12
    400000000000079c:       04 08 00 84                         mov r35=r1
    40000000000007a0:       01 00 00 00 01 00       [MII]       nop.m 0x0
    40000000000007a6:       00 02 00 62 00 c0                   mov r32=b0
    40000000000007ac:       81 0c 00 90                         addl r14=72,r1;;
    40000000000007b0:       1c 20 01 1c 18 10       [MFB]       ld8 r36=[r14]
    40000000000007b6:       00 00 00 02 00 00                   nop.f 0x0
    40000000000007bc:       78 fd ff 58                         br.call.sptk.many b0=4000000000000520 <_init+0xb0>
    40000000000007c0:       02 08 00 46 00 21       [MII]       mov r1=r35
    40000000000007c6:       e0 00 00 00 42 00                   mov r14=r0;;
    40000000000007cc:       01 70 00 84                         mov r8=r14
    40000000000007d0:       00 00 00 00 01 00       [MII]       nop.m 0x0
    40000000000007d6:       00 08 01 55 00 00                   mov.i ar.pfs=r33
    40000000000007dc:       00 0a 00 07                         mov b0=r32
    40000000000007e0:       1d 60 00 44 00 21       [MFB]       mov r12=r34
    40000000000007e6:       00 00 00 02 00 80                   nop.f 0x0
    40000000000007ec:       08 00 84 00                         br.ret.sptk.many b0;;

The call to 0x4000000000000520 must be us calling the printf function. We can find out where this is by looking at the sections with readelf.

Example 9-10. Hello world sections

    $ readelf --sections ./hello
    There are 40 section headers, starting at offset 0x25c0:
    
    Section Headers:
  5   [Nr] Name              Type             Address           Offset
           Size              EntSize          Flags  Link  Info  Align
      [ 0]                   NULL             0000000000000000  00000000
           0000000000000000  0000000000000000           0     0     0
    ...
 10   [11] .plt              PROGBITS         40000000000004c0  000004c0
           00000000000000c0  0000000000000000  AX       0     0     32
      [12] .text             PROGBITS         4000000000000580  00000580
           00000000000004a0  0000000000000000  AX       0     0     32
      [13] .fini             PROGBITS         4000000000000a20  00000a20
 15        0000000000000040  0000000000000000  AX       0     0     16
      [14] .rodata           PROGBITS         4000000000000a60  00000a60
           000000000000000f  0000000000000000   A       0     0     8
      [15] .opd              PROGBITS         4000000000000a70  00000a70
           0000000000000070  0000000000000000   A       0     0     16
 20   [16] .IA_64.unwind_inf PROGBITS         4000000000000ae0  00000ae0
           00000000000000f0  0000000000000000   A       0     0     8
      [17] .IA_64.unwind     IA_64_UNWIND     4000000000000bd0  00000bd0
           00000000000000c0  0000000000000000  AL      12     c     8
      [18] .init_array       INIT_ARRAY       6000000000000c90  00000c90
 25        0000000000000018  0000000000000000  WA       0     0     8
      [19] .fini_array       FINI_ARRAY       6000000000000ca8  00000ca8
           0000000000000008  0000000000000000  WA       0     0     8
      [20] .data             PROGBITS         6000000000000cb0  00000cb0
           0000000000000004  0000000000000000  WA       0     0     4
 30   [21] .dynamic          DYNAMIC          6000000000000cb8  00000cb8
           00000000000001e0  0000000000000010  WA       5     0     8
      [22] .ctors            PROGBITS         6000000000000e98  00000e98
           0000000000000010  0000000000000000  WA       0     0     8
      [23] .dtors            PROGBITS         6000000000000ea8  00000ea8
 35        0000000000000010  0000000000000000  WA       0     0     8
      [24] .jcr              PROGBITS         6000000000000eb8  00000eb8
           0000000000000008  0000000000000000  WA       0     0     8
      [25] .got              PROGBITS         6000000000000ec0  00000ec0
           0000000000000050  0000000000000000 WAp       0     0     8
 40   [26] .IA_64.pltoff     PROGBITS         6000000000000f10  00000f10
           0000000000000030  0000000000000000 WAp       0     0     16
      [27] .sdata            PROGBITS         6000000000000f40  00000f40
           0000000000000010  0000000000000000 WAp       0     0     8
      [28] .sbss             NOBITS           6000000000000f50  00000f50
 45        0000000000000008  0000000000000000  WA       0     0     8
      [29] .bss              NOBITS           6000000000000f58  00000f50
           0000000000000008  0000000000000000  WA       0     0     8
      [30] .comment          PROGBITS         0000000000000000  00000f50
           00000000000000b9  0000000000000000           0     0     1
 50   [31] .debug_aranges    PROGBITS         0000000000000000  00001010
           0000000000000090  0000000000000000           0     0     16
      [32] .debug_pubnames   PROGBITS         0000000000000000  000010a0
           0000000000000025  0000000000000000           0     0     1
      [33] .debug_info       PROGBITS         0000000000000000  000010c5
           00000000000009c4  0000000000000000           0     0     1
      [34] .debug_abbrev     PROGBITS         0000000000000000  00001a89
           0000000000000124  0000000000000000           0     0     1
      [35] .debug_line       PROGBITS         0000000000000000  00001bad
           00000000000001fe  0000000000000000           0     0     1
      [36] .debug_str        PROGBITS         0000000000000000  00001dab
           00000000000006a1  0000000000000001  MS       0     0     1
      [37] .shstrtab         STRTAB           0000000000000000  0000244c
           000000000000016f  0000000000000000           0     0     1
      [38] .symtab           SYMTAB           0000000000000000  00002fc0
           0000000000000b58  0000000000000018          39    60     8
      [39] .strtab           STRTAB           0000000000000000  00003b18
           0000000000000479  0000000000000000           0     0     1
    Key to Flags:
      W (write), A (alloc), X (execute), M (merge), S (strings)
      I (info), L (link order), G (group), x (unknown)
      O (extra OS processing required) o (OS specific), p (processor specific)

That address is (unsurprisingly) in the .plt section. So there we have our call into the PLT! But we're not satisfied with that; let's keep digging to see what we can uncover. We disassemble the .plt section to see what that call actually does.

Example 9-11. Hello world PLT

    40000000000004c0 <.plt>:
    40000000000004c0:       0b 10 00 1c 00 21       [MMI]       mov r2=r14;;
    40000000000004c6:       e0 00 08 00 48 00                   addl r14=0,r2
    40000000000004cc:       00 00 04 00                         nop.i 0x0;;
    40000000000004d0:       0b 80 20 1c 18 14       [MMI]       ld8 r16=[r14],8;;
    40000000000004d6:       10 41 38 30 28 00                   ld8 r17=[r14],8
    40000000000004dc:       00 00 04 00                         nop.i 0x0;;
    40000000000004e0:       11 08 00 1c 18 10       [MIB]       ld8 r1=[r14]
    40000000000004e6:       60 88 04 80 03 00                   mov b6=r17
    40000000000004ec:       60 00 80 00                         br.few b6;;
    40000000000004f0:       11 78 00 00 00 24       [MIB]       mov r15=0
    40000000000004f6:       00 00 00 02 00 00                   nop.i 0x0
    40000000000004fc:       d0 ff ff 48                         br.few 40000000000004c0 <_init+0x50>;;
    4000000000000500:       11 78 04 00 00 24       [MIB]       mov r15=1
    4000000000000506:       00 00 00 02 00 00                   nop.i 0x0
    400000000000050c:       c0 ff ff 48                         br.few 40000000000004c0 <_init+0x50>;;
    4000000000000510:       11 78 08 00 00 24       [MIB]       mov r15=2
    4000000000000516:       00 00 00 02 00 00                   nop.i 0x0
    400000000000051c:       b0 ff ff 48                         br.few 40000000000004c0 <_init+0x50>;;
    4000000000000520:       0b 78 40 03 00 24       [MMI]       addl r15=80,r1;;
    4000000000000526:       00 41 3c 70 29 c0                   ld8.acq r16=[r15],8
    400000000000052c:       01 08 00 84                         mov r14=r1;;
    4000000000000530:       11 08 00 1e 18 10       [MIB]       ld8 r1=[r15]
    4000000000000536:       60 80 04 80 03 00                   mov b6=r16
    400000000000053c:       60 00 80 00                         br.few b6;;
    4000000000000540:       0b 78 80 03 00 24       [MMI]       addl r15=96,r1;;
    4000000000000546:       00 41 3c 70 29 c0                   ld8.acq r16=[r15],8
    400000000000054c:       01 08 00 84                         mov r14=r1;;
    4000000000000550:       11 08 00 1e 18 10       [MIB]       ld8 r1=[r15]
    4000000000000556:       60 80 04 80 03 00                   mov b6=r16
    400000000000055c:       60 00 80 00                         br.few b6;;
    4000000000000560:       0b 78 c0 03 00 24       [MMI]       addl r15=112,r1;;
    4000000000000566:       00 41 3c 70 29 c0                   ld8.acq r16=[r15],8
    400000000000056c:       01 08 00 84                         mov r14=r1;;
    4000000000000570:       11 08 00 1e 18 10       [MIB]       ld8 r1=[r15]
    4000000000000576:       60 80 04 80 03 00                   mov b6=r16
    400000000000057c:       60 00 80 00                         br.few b6;;

Let us step through the instructions of the first stub, the one at 0x4000000000000520. Firstly, we add 80 to the value in r1, storing it in r15. We know from before that r1 will be pointing to the GOT, so this is saying "store in r15 the address 80 bytes past the start of the GOT". The next thing we do is load into r16 the value stored at this location, and post-increment the value in r15 by 8 bytes. We then store r1 (the location of the GOT) in r14 and set r1 to be the value in the 8 bytes that r15 now points to. Then we branch to r16.

In the previous chapter we discussed how functions are actually called through a function descriptor which contains the function address and the address of the global pointer. Here we can see that the PLT entry is first loading the function value, moving on 8 bytes to the second part of the function descriptor and then loading that value into the gp register before calling the function.

But what exactly are we loading? We know that r1 will be pointing to the GOT. We go 80 bytes (0x50) past the start of the GOT.

Example 9-12. Hello world GOT

    $ objdump --disassemble-all ./hello 
    Disassembly of section .got:
    
    6000000000000ec0 <.got>:
            ...
    6000000000000ee8:       80 0a 00 00 00 00                   data8 0x02a000000
    6000000000000eee:       00 40 90 0a                         dep r0=r0,r0,63,1
    6000000000000ef2:       00 00 00 00 00 40       [MIB] (p20) break.m 0x1
    6000000000000ef8:       a0 0a 00 00 00 00                   data8 0x02a810000
    6000000000000efe:       00 40 50 0f                         br.few 6000000000000ef0 <_GLOBAL_OFFSET_TABLE_+0x30>
    6000000000000f02:       00 00 00 00 00 60       [MIB] (p58) break.m 0x1
    6000000000000f08:       60 0a 00 00 00 00                   data8 0x029818000
    6000000000000f0e:       00 40 90 06                         br.few 6000000000000f00 <_GLOBAL_OFFSET_TABLE_+0x40>
    Disassembly of section .IA_64.pltoff:

    6000000000000f10 <.IA_64.pltoff>:
    6000000000000f10:       f0 04 00 00 00 00       [MIB] (p39) break.m 0x0
    6000000000000f16:       00 40 c0 0e 00 00                   data8 0x03b010000
    6000000000000f1c:       00 00 00 60                         data8 0xc000000000
    6000000000000f20:       00 05 00 00 00 00       [MII] (p40) break.m 0x0
    6000000000000f26:       00 40 c0 0e 00 00                   data8 0x03b010000
    6000000000000f2c:       00 00 00 60                         data8 0xc000000000
    6000000000000f30:       10 05 00 00 00 00       [MIB] (p40) break.m 0x0
    6000000000000f36:       00 40 c0 0e 00 00                   data8 0x03b010000
    6000000000000f3c:       00 00 00 60                         data8 0xc000000000

0x6000000000000ec0 + 0x50 = 0x6000000000000f10, or the .IA_64.pltoff section. Now we're starting to get somewhere!
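
The arithmetic is easy to check mechanically. Below is a small Python sketch using the section addresses from the listings above. The helper name `pltoff_slot` is made up for this note, and the 16-byte entry size is an assumption based on a function descriptor being two 8 byte values.

```python
# Addresses copied from the readelf/objdump listings above.
GOT_BASE = 0x6000000000000EC0      # .got, which is also what r1 (gp) points to
PLTOFF_BASE = 0x6000000000000F10   # .IA_64.pltoff
DESCRIPTOR_SIZE = 16               # assumed: function address + gp, 8 bytes each


def pltoff_slot(gp_offset):
    """Return (absolute address, entry index) for an offset from gp.

    Hypothetical helper: it only mirrors what `addl r15=<offset>,r1` computes.
    """
    address = GOT_BASE + gp_offset
    index = (address - PLTOFF_BASE) // DESCRIPTOR_SIZE
    return address, index


# The three stubs in the PLT listing use offsets 80, 96 and 112.
for offset in (80, 96, 112):
    address, index = pltoff_slot(offset)
    print(f"gp+{offset}: {address:#x} -> .IA_64.pltoff entry {index}")
```

So each of the three stubs reads its own 16-byte entry in .IA_64.pltoff.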

We can decode the objdump output so we can see exactly what is being loaded here. Swapping the byte order of the first 8 bytes f0 04 00 00 00 00 00 40 we end up with 0x40000000000004f0. Now that address looks familiar! Looking back up at the disassembly of the PLT we see that address.
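
The byte swap can be done with Python's `struct` module rather than by hand. This is only a sketch: the 16 bytes are copied from the first three lines of the .IA_64.pltoff dump above, and reading them as two little-endian 64-bit words is my assumption about the layout (function address first, then global pointer).

```python
import struct

# First 16 bytes of .IA_64.pltoff, copied from the objdump listing above.
raw = bytes.fromhex("f0 04 00 00 00 00 00 40 c0 0e 00 00 00 00 00 60")

# "<QQ" = two unsigned 64-bit integers, little-endian.
function_address, global_pointer = struct.unpack("<QQ", raw)

print(f"function address: {function_address:#018x}")
print(f"global pointer:   {global_pointer:#018x}")
```

The second word decodes to the address of the .got section, which fits the idea that the entry is a function descriptor.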

The code at 0x40000000000004f0 firstly puts a zero value into r15, and then branches back to 0x40000000000004c0. Wait a minute! That's the start of our PLT section.

We can trace this code through too. Firstly we copy the global pointer that the stub saved in r14 into r2, then we load three 8 byte values into r16, r17 and finally, r1. We then branch to the address in r17. What we are seeing here is the actual call into the dynamic linker!

We need to delve into the ABI to understand exactly what is being loaded at this point. The ABI says two things: dynamically linked programs must have a special section (called the DT_IA_64_PLT_RESERVE section) that can hold three 8 byte values, and there is a pointer to this reserved area in the dynamic segment of the binary.

Example 9-13. Dynamic Segment

    Dynamic segment at offset 0xcb8 contains 25 entries:
      Tag        Type                         Name/Value
     0x0000000000000001 (NEEDED)             Shared library: [libc.so.6.1]
     0x000000000000000c (INIT)               0x4000000000000470
     0x000000000000000d (FINI)               0x4000000000000a20
     0x0000000000000019 (INIT_ARRAY)         0x6000000000000c90
     0x000000000000001b (INIT_ARRAYSZ)       24 (bytes)
     0x000000000000001a (FINI_ARRAY)         0x6000000000000ca8
     0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
     0x0000000000000004 (HASH)               0x4000000000000200
     0x0000000000000005 (STRTAB)             0x4000000000000330
     0x0000000000000006 (SYMTAB)             0x4000000000000240
     0x000000000000000a (STRSZ)              138 (bytes)
     0x000000000000000b (SYMENT)             24 (bytes)
     0x0000000000000015 (DEBUG)              0x0
     0x0000000070000000 (IA_64_PLT_RESERVE)  0x6000000000000ec0 -- 0x6000000000000ed8
     0x0000000000000003 (PLTGOT)             0x6000000000000ec0
     0x0000000000000002 (PLTRELSZ)           72 (bytes)
     0x0000000000000014 (PLTREL)             RELA
     0x0000000000000017 (JMPREL)             0x4000000000000420
     0x0000000000000007 (RELA)               0x40000000000003f0
     0x0000000000000008 (RELASZ)             48 (bytes)
     0x0000000000000009 (RELAENT)            24 (bytes)
     0x000000006ffffffe (VERNEED)            0x40000000000003d0
     0x000000006fffffff (VERNEEDNUM)         1
     0x000000006ffffff0 (VERSYM)             0x40000000000003ba
     0x0000000000000000 (NULL)               0x0

Do you notice anything about the IA_64_PLT_RESERVE entry? It starts at the same address as the GOT. This means that the first three 8 byte entries in the GOT are actually the reserved area, and thus will always be pointed to by the global pointer.

When the dynamic linker starts it is its duty to fill these values in. The ABI says that the first value will be filled in by the dynamic linker giving this module a unique ID. The second value is the address of the function that finds and fixes up the symbol, and the third value is the global pointer value for the dynamic linker.

Example 9-14. Code in the dynamic linker for setting up special values (from libc sysdeps/ia64/dl-machine.h)

    /* Set up the loaded object described by L so its unrelocated PLT
       entries will jump to the on-demand fixup code in dl-runtime.c.  */
    
    static inline int __attribute__ ((unused, always_inline))
    elf_machine_runtime_setup (struct link_map *l, int lazy, int profile)
    {
      extern void _dl_runtime_resolve (void);
      extern void _dl_runtime_profile (void);
    
      if (lazy)
        {
          register Elf64_Addr gp __asm__ ("gp");
          Elf64_Addr *reserve, doit;
    
          /*
           * Careful with the typecast here or it will try to add l-l_addr
           * pointer elements
           */
          reserve = ((Elf64_Addr *)
                     (l->l_info[DT_IA_64 (PLT_RESERVE)]->d_un.d_ptr + l->l_addr));
          /* Identify this shared object.  */
          reserve[0] = (Elf64_Addr) l;
    
          /* This function will be called to perform the relocation.  */
          if (!profile)
            doit = (Elf64_Addr) ((struct fdesc *) &_dl_runtime_resolve)->ip;
          else
            {
              if (GLRO(dl_profile) != NULL
                  && _dl_name_match_p (GLRO(dl_profile), l))
                {
                  /* This is the object we are looking for.  Say that we really
                     want profiling and the timers are started.  */
                  GL(dl_profile_map) = l;
                }
              doit = (Elf64_Addr) ((struct fdesc *) &_dl_runtime_profile)->ip;
            }
    
          reserve[1] = doit;
          reserve[2] = gp;
        }
    
      return lazy;
    }

We can see how this gets set up by the dynamic linker by looking at the function that does this for the binary. The reserve variable is set from the PLT_RESERVE section pointer in the binary. The unique value (put into reserve[0]) is the address of the link map for this object. Link maps are the internal representation within glibc for shared objects. We then put the address of _dl_runtime_resolve into the second value (assuming we are not using profiling). reserve[2] is finally set to gp, which the register declaration with __asm__ ("gp") binds to the global pointer register (r1).

Looking back at the ABI, we see that the relocation index for the entry must be placed in r15 and the unique identifier must be passed in r16.

r15 has previously been set in the stub code, before we jumped back to the start of the PLT. Have a look down the entries, and notice how each PLT entry loads r15 with an incremented value? It should come as no surprise, if you look at the relocations, that the printf relocation is number zero.

r16 we load up from the values that have been initialised by the dynamic linker, as previously discussed. Once that is ready, we can load the function address and global pointer and branch into the function.

What happens at this point is the dynamic linker function _dl_runtime_resolve is run. It finds the relocation; remember how the relocation specified the name of the symbol? It uses this name to find the right function; this might involve loading the library from disk if it is not already in memory, or otherwise sharing the code.

The relocation record provides the dynamic linker with the address it needs to "fix up"; remember it was in the GOT and loaded by the initial PLT stub? This means that after the first time the function is called, the second time it is loaded it will get the direct address of the function; short circuiting the dynamic linker.
Summary

You've seen the exact mechanism behind the PLT, and consequently the inner workings of the dynamic linker. The important points to remember are:

    * Library calls in your program actually call a stub of code in the PLT of the binary.
    * That stub code loads an address and jumps to it.
    * Initially, that address points to a function in the dynamic linker which is capable of looking up the "real" function, given the information in the relocation entry for that function.
    * The dynamic linker re-writes the address that the stub code reads, so that the next time the function is called it will go straight to the right address.
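
To make those four points concrete for myself, here is a toy model of lazy binding in Python. Nothing in it is real linker code: `LIBRARY`, `RELOCATIONS`, `slots`, `resolve` and `call_via_plt` are all invented names, and a Python list stands in for the table of addresses that the stubs read.

```python
# A toy model of lazy binding; every name here is hypothetical.
LIBRARY = {"printf": lambda text: f"printf called with {text!r}"}  # stand-in shared library
RELOCATIONS = ["printf"]   # relocation index -> symbol name
resolver_calls = []        # records every trip through the resolver


def resolve(index):
    """Play the role of the resolver: find the symbol, then patch the slot."""
    name = RELOCATIONS[index]
    resolver_calls.append(name)
    target = LIBRARY[name]
    slots[index] = target  # re-write the address the stub reads
    return target


def make_lazy_slot(index):
    """Initial slot contents: a trampoline that enters the resolver."""
    def trampoline(*args):
        return resolve(index)(*args)
    return trampoline


slots = [make_lazy_slot(i) for i in range(len(RELOCATIONS))]


def call_via_plt(index, *args):
    """The PLT stub: load whatever is in the slot and jump to it."""
    return slots[index](*args)


print(call_via_plt(0, "hello"))  # first call goes through the resolver
print(call_via_plt(0, "again"))  # second call goes straight to the function
print(resolver_calls)            # the resolver ran only once
```

The point of the sketch is the last line: the resolver runs only on the first call, because it overwrites the slot it was reached through.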

Thursday, February 12, 2009

2009-02-12 log


cat /home/adam/Desktop/work_log/0212.log
svn export - exports a clean tree from a repository or a working copy.

svn branches:

- create a branch:

$ svn copy http://svn.example.com/repos/calc/trunk \
           http://svn.example.com/repos/calc/branches/my-calc-branch \
      -m "Creating a private branch of /calc/trunk."

Committed revision 341.

This command causes a near-instantaneous commit in the repository, creating a new directory in revision 341. The new directory is a copy of /calc/trunk. This is shown in Figure 4.3, “Repository with new copy”. While it's also possible to create a branch by using svn copy to duplicate a directory within the working copy, this technique isn't recommended. It can be quite slow, in fact! Copying a directory on the client side is a linear-time operation, in that it actually has to duplicate every file and subdirectory on the local disk. Copying a directory on the server, however, is a constant-time operation, and it's the way most people create branches.

- run svn log -r 9238 to read about the exact changeset that fixed the bug, and run svn diff -c 9238 to see the patch itself.

GDB:

It is often useful to do `display/i $pc' when stepping by machine instructions. This makes GDB automatically display the next instruction to be executed, each time your program stops. See section Automatic Display. 

GCC manual on inline asm:

5.37 Assembler Instructions with C Expression Operands
http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Extended-Asm.html#Extended-Asm


patch:
       -N  or  --forward
          Ignore patches that seem to be reversed or already applied.  See also -R.

Monday, February 09, 2009

plat_nand.c: NAND flash driver for the BF537-STAMP:

http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:drivers:bfin_async_nand

Clockevents and dyntick
http://lwn.net/Articles/223185/
linux/Documentation/timers/
http://kernelnewbies.org/Linux_2_6_21#head-8547911895fda9cdff32a94771c8f5706d66bba0

Monday, February 02, 2009

fstab

Key points:

Besides the device node, a device can also be mounted by LABEL or UUID.

More information can be found at http://ubuntuforums.org/showthread.php?t=283131


Understanding fstab

Sorry this is such a long post.

I added much of this information to the Ubuntu wiki.

Ubuntu Wiki : fstab

There are essentially 5 sections:

   1. Introduction / Mount.
   2. "fstab syntax" - Syntax and fstab options.
   3. How to label, FAT and Linux file systems.
   4. Examples, FAT and Linux native file systems.
   5. References


Scroll down to the section you need.

Introduction

/etc/fstab is a system configuration file and is used to tell the Linux kernel which partitions (file systems) to mount and where on the file system tree.

/etc/mtab is an index of all mounted partitions/file systems.

Note: See references section at the end of this how to for useful links.

How to mount

The mount command and fstab go hand in hand:

   1. Options for mount and fstab are similar.
   2. If a device/partition is not listed in fstab ONLY ROOT may mount the device/partition.
   3. Users can mount a removable device using pmount.
   4. Users may mount a device/partition if the device is in fstab with the proper options.


How to mount
Mount Partitions Automatically (At BOOT).
Filesystems and Mounting Thanks Hermanzone

mount has a multitude of options. Manpage: man mount

pmount: Pmount allows a user to mount removable media.
pmount uses /media/ as the mount point.

Syntax:
Quote:
pmount <device> <label>
Example:
Code:

pmount /dev/sda1 data

This creates a directory "data" in /media (mount point is /media/data) and mounts your removable device there.

To unmount:
Code:

pumount <device>

Note: pmount does not like to mount to an existing directory in /media.

    * For example, if you have a directory /media/usb ; pmount /dev/sda1 usb may fail.
    * If you are having problems with gnome-volume-manager or pmount check the contents of /media and delete directories as needed.
    * Obviously do not delete a directory in /media if a device is mounted to this mount point.


Configure pmount for internal drives

To show your partitions/usb devices, first plug in your usb card.

To list your mounted partitions:
Code:

mount

To list all your partitions, mounted or not:
Code:

sudo fdisk -l

To list all your partitions by UUID:
First connect all your devices, then:
Code:

ls /dev/disk/by-uuid -alh

============== END OF INTRODUCTION ===============


fstab Syntax

Quote:
[Device] [Mount Point] [File_system] [Options] [dump] [fsck order]
Device = Physical location.

/dev/hdxy or /dev/sdxy.

x will be a letter starting with a, then b,c,....
y will be a number starting with 1, then 2,3,....

Thus hda1 = First partition on the master HD.

    See Basic partitioning for more information

Note: zip discs are always numbered "4".
Example: USB Zip = /dev/sda4.
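
The naming scheme is regular enough to pick apart with a regular expression. A minimal Python sketch, assuming only the hdxy/sdxy pattern described above (the function name `parse_device` is made up for this note, and other naming schemes are not handled):

```python
import re

# Matches the classic hdxy / sdxy names: a letter for the disk, a number for the partition.
DEVICE_RE = re.compile(r"^/dev/(hd|sd)([a-z])(\d+)$")


def parse_device(path):
    """Split e.g. /dev/hda1 into (prefix, disk number, partition number)."""
    match = DEVICE_RE.match(path)
    if match is None:
        raise ValueError(f"not an hdxy/sdxy device node: {path}")
    prefix, letter, partition = match.groups()
    disk_number = ord(letter) - ord("a") + 1  # a -> 1, b -> 2, ...
    return prefix, disk_number, int(partition)


print(parse_device("/dev/hda1"))  # first partition on the first hd disk
print(parse_device("/dev/sda4"))  # the USB Zip example
```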

Note: You can also identify a device by udev, volume label (AKA LABEL), or uuid.

These fstab techniques are helpful for removable media because the device (/dev/sdxy) may change. For example, sometimes the USB device will be assigned /dev/sda1, other times /dev/sdb1. This depends on what order you connect USB devices, and where (which USB slot) you use to connect. This can be a major aggravation as you must identify the device before you can mount it. fstab does not work well if the device name keeps changing.

To list your devices, first connect your USB device (it does not need to be mounted).
By volume label:
Code:

ls /dev/disk/by-label -lah

By id:
Code:

ls /dev/disk/by-id -lah

By uuid:
Code:

ls /dev/disk/by-uuid -lah

IMO, LABEL is easiest to use as you can set a label and it is human readable.

The format to use instead of the device name in the fstab file is:

LABEL=<label> (Where <label> is the volume label name, ex. "data").

UUID=<uuid> (Where <uuid> is some alphanumeric (hex) like fab05680-eb08-4420-959a-ff915cdfcb44).

Again, IMO, using a label has a strong advantage with removable media (flash drives).

See How to use Labels below.
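
As a rough illustration of how the LABEL= and UUID= forms relate to the /dev/disk listings above, here is a Python sketch. The function `spec_to_path` is hypothetical; it assumes the usual /dev/disk/by-label and /dev/disk/by-uuid symlinks exist, and it ignores the escaping that may be applied to labels with unusual characters.

```python
import os


def spec_to_path(spec):
    """Map an fstab device field to the path it is normally reachable under.

    Assumption: the by-label / by-uuid symlink directories listed above.
    """
    if spec.startswith("LABEL="):
        return os.path.join("/dev/disk/by-label", spec[len("LABEL="):])
    if spec.startswith("UUID="):
        return os.path.join("/dev/disk/by-uuid", spec[len("UUID="):])
    return spec  # already a device node such as /dev/sda1


for spec in ("LABEL=data", "UUID=fab05680-eb08-4420-959a-ff915cdfcb44", "/dev/sda1"):
    # realpath follows the symlink to the real device node when the link exists
    print(spec, "->", os.path.realpath(spec_to_path(spec)))
```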

For udev: udev does the same thing as LABEL, but I find it more complicated.
See How to udev for a very nice how to on udev.


Mount point.
This is where the partition is mounted or accessed within the "tree" (ie /mnt/hda1).
You can use any name you like.
In general

   1. /mnt Typically used for fixed hard drives HD/SCSI.
   2. /media Typically used for removable media (CD/DVD/USB/Zip).


Examples:

   1. /mnt/windows
   2. /mnt/data
   3. /media/usb


To make a mount point:
Code:

sudo mkdir /media/usb

File types:

auto: The file system type (ext3, iso9660, etc) is detected automatically. Usually works. Used for removable devices (CD/DVD, Floppy drives, or USB/Flash drives) as the file system may vary on these devices.


Linux file systems: ext2, ext3, jfs, reiserfs, reiser4, xfs, swap.

Windows:
vfat = FAT 32, FAT 16
ntfs= NTFS

Note: For NTFS read/write access, use ntfs-3g.

CD/DVD/iso: iso9660

    To mount an iso image (*.iso NOT CD/DVD device):
    Code:

    sudo mount -t iso9660 -o ro,loop=/dev/loop0 <iso_file> <mount_point>

Network file systems: This section assumes the server and client are already setup.


nfs Example:
Quote:
server:/shared_directory /mnt/nfs nfs <options> 0 0

    More detailed information on nfs

smb (samba) : Samba mounts can be performed very easily via gui tools (See Ubuntu Wiki Setting up Samba). If you mount a samba share with the gui tools it will be placed in ~/.gvfs , a hidden directory in your home directory.

This section is limited to fstab and you will need a fstab entry to mount samba shares at boot.

smbfs is now deprecated in favor of cifs : http://linux-cifs.samba.org/

cifs still uses a credentials file to avoid the need to enter a password. If you do not use a credentials file, you will mount a samba share with sudo and enter your username and password in a terminal.

Quote:
//Server/share /mnt/samba cifs users,auto,credentials=/path/credentials_file,noexec 0 0

    * Server = Name (if in /etc/hosts) or IP Address of samba server.
    * share = Name of shared directory (folder).
    * /mnt/samba = your desired mount point.
    * /path/credentials_file = full path to your credentials file. A credentials file should be owned by root (permissions 400) and contain two lines :
      Quote:
      username = samba_user
      password = samba_user_password

        samba_user = samba user (on server).

        samba_user_password = samba user password (on server).

    * noexec for security (it can be bypassed ...).
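
A small Python sketch of creating such a credentials file with mode 400. The function name and the demo path are made up, the lines are written without spaces around `=` (my choice, not something the post prescribes), and root ownership is not handled here: the real file would have to be created as root.

```python
import os
import stat
import tempfile


def write_credentials(path, username, password):
    """Create a credentials file that only its owner can read (mode 400)."""
    # O_EXCL refuses to overwrite an existing file; 0o600 so it can be written first.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"username={username}\n")
        handle.write(f"password={password}\n")
    os.chmod(path, 0o400)  # read-only for the owner, nothing for group/others


# Demonstration in a scratch directory rather than a real system path.
with tempfile.TemporaryDirectory() as scratch:
    demo = os.path.join(scratch, "credentials_file")
    write_credentials(demo, "samba_user", "samba_user_password")
    print(oct(stat.S_IMODE(os.stat(demo).st_mode)))
```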


smbfs : deprecated, but similar.

Quote:
//win_box/shared_folder /mnt/samba smbfs rw,credentials=/home/user_name/winbox-credentials.txt 0 0
And from Buck2348:

    Quote:
    I don't mount any vfat shares but uid and gid work with smbfs shares. I might have to try out your syntax.

    I could not automount my smbfs shares at boot until I found this fix in the Forums. I hope it will help someone else. I think the problem was related to the fact that I don't use a username and password in the Windows systems. All I had to do was add
    Code:

    username=share,password=

    to the options list in the fstab line for these shares.

    More detailed information on samba

sshfs : Network shares over ssh

http://ubuntu.wordpress.com/2005/10/...m-using-sshfs/

Code:

sshfs#user@server:/share <mount_point> fuse  user,allow_other  0  0

    * "Server" = SSH server (by IP or name if you have an entry for the server in your hosts file)
    * "share" = name of the shared directory



Options:

Ubuntu 8.04 now defaults to "relatime". For a discussion of this option see : http://lwn.net/Articles/244829/

defaults = rw, suid, dev, exec, auto, nouser, and async.

Options for a separate /home : nodev,nosuid,relatime

My recommended options for removable (USB) drives are in green.

auto= mounted at boot
noauto= not mounted at boot

user= when mounted the mount point is owned by the user who mounted the partition
users= when mounted the mount point is owned by the user who mounted the partition and the group users

ro= read only
rw= read/write

VFAT/NTFS:

Ownership and permissions of vfat / ntfs are set at the time of mounting. This is often a source of confusion.

uid= Sets owner. Syntax: may use user_name or user ID #.
gid= sets group ownership of mount point. Again may use group_name or GID #.

umask can be used to set permissions if you wish to change the default.
Syntax is "odd" at first.
To set a permissions of 777, umask=000
To set permissions of 700, umask=077

Best is to set directories with executable permissions and files with read/write only. To do this, use fmask and dmask (rather than umask):
dmask=027
fmask=137

With these options files are not executable (all colored green in a terminal w/ ls)
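
The "odd" syntax is just bit clearing: every bit set in the mask is removed from the permissions. A quick Python check, assuming the starting point is 777 for both files and directories (the helper name `masked_mode` is made up):

```python
def masked_mode(mask):
    """Permission bits left over once the mask bits have been cleared."""
    return 0o777 & ~mask


examples = {
    "umask=000": 0o000,
    "umask=077": 0o077,
    "dmask=027": 0o027,  # applies to directories
    "fmask=137": 0o137,  # applies to regular files
}
for option, mask in examples.items():
    print(f"{option} -> {masked_mode(mask):03o}")
```

With dmask=027 directories come out as 750 (still searchable by owner and group), and with fmask=137 files come out as 640, i.e. no execute bit anywhere.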


Linux native file systems: Use defaults or users. To change ownership and permissions, mount the partition, then use chown and chmod.

Note: Warning re: sync and flash devices:
Warning

Additional Options: (From wiki.linuxquestions.org/wiki/Fstab):

    * sync/async - All I/O to the file system should be done (a)synchronously.
    * auto - The filesystem can be mounted automatically (at bootup, or when mount is passed the -a option). This is really unnecessary as this is the default action of mount -a anyway.
    * noauto - The filesystem will NOT be automatically mounted at startup, or when mount passed -a. You must explicitly mount the filesystem.
    * dev/nodev - Interpret/Do not interpret character or block special devices on the file system.
    * exec / noexec - Permit/Prevent the execution of binaries from the filesystem.
    * suid/nosuid - Permit/Block the operation of suid, and sgid bits.
    * ro - Mount read-only.
    * rw - Mount read-write.
    * user - Permit any user to mount the filesystem. This automatically implies noexec, nosuid,nodev unless overridden.
    * nouser - Only permit root to mount the filesystem. This is also a default setting.
    * defaults - Use default settings. Equivalent to rw, suid, dev, exec, auto, nouser, async.
    * _netdev - this is a network device, mount it after bringing up the network. Only valid with fstype nfs.



Dump
Dump: The dump field sets whether the backup utility dump will back up the file system. If set to "0" the file system is ignored; if set to "1" the file system is backed up.



Fsck order
Fsck: The fsck order field tells fsck in what order to check the file systems. If set to "0" the file system is not checked.

See also: Tuning the Filesystem Check at Bootup
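
To tie the six fields together, here is a minimal Python sketch of splitting one fstab line. `FstabEntry` and `parse_fstab_line` are names invented for this note; the sketch treats dump and fsck order as optional (defaulting to 0) and does not handle escaped spaces in mount points.

```python
from collections import namedtuple

FstabEntry = namedtuple(
    "FstabEntry", "device mount_point fs_type options dump fsck_order"
)


def parse_fstab_line(line):
    """Parse one fstab line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    fields += ["0"] * (6 - len(fields))  # fill in missing dump / fsck order
    device, mount_point, fs_type, options, dump, fsck_order = fields[:6]
    return FstabEntry(device, mount_point, fs_type, options.split(","),
                      int(dump), int(fsck_order))


entry = parse_fstab_line(
    "UUID=413eee0c-61ff-4cb7-a299-89d12b075093 /home ext3 nodev,nosuid,relatime 0 2"
)
print(entry.mount_point, entry.options, entry.fsck_order)
```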

Fstab Examples


Quote:
/dev/sda14 /mnt/zen ext3 relatime 0 2

# Usb device (assuming vfat)
/dev/sdb1 /media/usb auto users,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0

#Data partition
LABEL=data /mnt/usr_data ext3 auto,users,rw,relatime 0 0

# Flash drive By UUID
UUID=fab05680-eb08-4420-959a-ff915cdfcb44 /media/flash vfat user,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0


/dev/disk/by-id/usb-IOMEGA_ZIP_250_059B00301400B0F1-part4 /mnt/zip vfat users,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0

/dev/hda1 /mnt/windows ntfs-3g auto,users,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0

# VFAT
# FAT ~ (Linux calls FAT file systems vfat)
# /dev/hda1
UUID=12102C02102CEB83 /media/windows vfat auto,users,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0

# NTFS ~ Use ntfs-3g for write access (rw)
# /dev/hda1
UUID=12102C02102CEB83 /media/windows ntfs-3g auto,users,uid=1000,gid=100,utf8,dmask=027,fmask=137 0 0


# Separate Home
# /dev/sda7
UUID=413eee0c-61ff-4cb7-a299-89d12b075093 /home ext3 nodev,nosuid,relatime 0 2

# Samba
//server/share /media/samba cifs user=user,uid=1000,gid=100 0 0
# "Server" = Samba server (by IP or name if you have an entry for the server in your hosts file)
# "share" = name of the shared directory
# "user" = your samba user
# This set up will ask for a password when mounting the samba share. If you do not want to enter a password, use a credentials file.
# replace "user=user" with "credentials=/etc/samba/credentials" In the credentials file put two lines
# user=user
# password=password
# make the file owned by root and ro by root (sudo chown root.root /etc/samba/credentials && sudo chmod 400 /etc/samba/credentials)

# NFS
Server:/share /media/nfs nfs rsize=8192,wsize=8192,noexec,nosuid
# "Server" = NFS server (by IP or name if you have an entry for the server in your hosts file)
# "share" = name of the shared directory

#SSHFS
sshfs#user@server:/share <mount_point> fuse user,allow_other 0 0
# "Server" = SSH server (by IP or name if you have an entry for the server in your hosts file)
# "share" = name of the shared directory
=========== End of fstab =============


How to Label

Linux: How the label and the UUID are set depends on the file system type used. They can normally be set when creating/formatting the file system, and the file system type usually has some tool to change them later on (e.g. tune2fs, xfs_admin, reiserfstune, etc.)

Labels

Mke2fs/e2label/tune2fs:

Note: For either ext2 or ext3 file systems.

WARNING: mke2fs will reformat your partition and set a label at the same time. This will delete any data on the target partition.

To set a label without reformatting use e2label or tune2fs

   1. Make a label:
      Code:

      mke2fs -L <label> <device>

      OR
      Code:

      e2label <device> <label>

      OR
      Code:

      tune2fs -L <label> <device>

      Examples:
      Quote:
      mke2fs -L data /dev/hda3
      OR
      Quote:
      e2label /dev/hda3 data
      OR
      Quote:
      tune2fs -L data /dev/hda3
   2. Create a mount point:
      Code:

      sudo mkdir /media/data

   3. Add an entry to /etc/fstab:
      Quote:
      LABEL=data /media/data ext3 defaults 0 0
   4. To mount:
      Code:

      sudo mount LABEL=data
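
The fstab entry from step 3 can be generated with a small helper. This is a sketch only: fstab_label_entry is a hypothetical name, and the length check reflects the 16-character limit on ext2/ext3 volume labels.

```shell
#!/bin/sh
# fstab_label_entry LABEL MOUNT_POINT FSTYPE  (hypothetical helper)
# Prints an fstab line that mounts a file system by its label.
fstab_label_entry() {
    label=$1
    # ext2/ext3 volume labels hold at most 16 characters
    if [ "${#label}" -gt 16 ]; then
        echo "label '$label' is longer than 16 characters" >&2
        return 1
    fi
    printf 'LABEL=%s %s %s defaults 0 0\n' "$label" "$2" "$3"
}

# Same values as the example above
fstab_label_entry data /media/data ext3
```

The printed line can be reviewed and then appended to /etc/fstab by hand.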



ReiserFS:
Use reiserfstune:
Code:

reiserfstune -l <label> <device>

    Note:That is a small "L" and not the number 1.

JFS:
Use jfs_tune:
Code:

jfs_tune -L <label> <device>

To show the label:
Code:

jfs_tune -l <device>

    Note:That is a small "L" and not the number 1.

XFS:
Use xfs_admin:
Code:

sudo xfs_admin -L <label> <device>

To show the label:
Code:

xfs_admin -l <device>

    Note:That is a small "L" and not the number 1.

FAT (Windows partitions):

Use mtools to label a FAT partition:

   1. Install mtools:
      Code:

      sudo aptitude install mtools

   2. Copy the mtools configuration file to ~:
      Code:

      cp /etc/mtools.conf ~/.mtoolsrc

      Note: ~ is shorthand for /home/user_name.
   3. Mount your flash drive.
   4. Edit ~/.mtoolsrc:
      Code:

      gedit ~/.mtoolsrc

   5. Add these lines to the end of ~/.mtoolsrc:
      Quote:
      drive i: file="<device>"
      mtools_skip_check=1
      Where <device> is the device assigned to your mounted USB device/flash drive (e.g. /dev/sda1, /dev/sdb1, ...).
      Note: You can do this from the command line:
      Code:

      echo 'drive i: file="<device>"' >> ~/.mtoolsrc
      echo mtools_skip_check=1 >> ~/.mtoolsrc

          Although you will need to edit ~/.mtoolsrc for each new device if the device assignment changes.

      Example: drive i: file="/dev/sda1"
   6. Change to drive i:
      Code:

      mcd i:

   7. Check the current label:
      Code:

      mlabel -s i:

   8. Change the current label:
      Code:

      sudo mlabel -s i:DATA

   9. Or
      Code:

      sudo mlabel i:DATA

          pieroxy reports the -s flag did not work, thanks pieroxy

          Note: mlabel USES ALL CAPS.

  10. Add an entry to fstab:
      Quote:
      LABEL=DATA <mount_point> vfat defaults 0 0
      Note: You can also mount the usb device with:
      Code:

      mount LABEL=<label>


NTFS (Windows partitions):

    Thanks to rudyj for pointing out the oversight.

Use ntfsprogs:

First install ntfsprogs:
Code:

sudo aptitude install ntfsprogs

Or use Synaptic.

Then:

   1. Show label:
      Code:

      ntfslabel <device>

   2. Change label:
      Code:

      ntfslabel <device> <new_label>

          Where:
              * <new_label> = your new label
              * <device> = your partition to label (/dev/hda1 perhaps)

   3. Add an entry to fstab:
      Quote:
      LABEL=DATA <mount_point> ntfs(or ntfs-3g) defaults 0 0
      Note: You can also mount the usb device with:
      Code:

      mount LABEL=<label>



============== END OF LABEL ===============


Examples of fstab options


********* FAT **********

FAT partitions are easy to share between Linux and Windows as both OS will read FAT "out of the box" without additional installation or configuration.

In this example I will use /mnt/data as my mount point.

Code:

sudo mkdir /mnt/data

fstab:
Quote:
LABEL=data /mnt/data vfat <options> 0 0
Default permissions of /mnt/data:
Quote:
drwxr-xr-x 2 root root

   1. fstab options: defaults
      mount /mnt/data yields: mount: only root can mount /dev/sdb1 on /mnt/data

      sudo mount /mnt/data mounts the device.
      Permissions:
      Quote:
      drwxr-xr-x 7 root root
      Note: ONLY ROOT has rw permissions.
   2. fstab options: users,noauto,rw
      mount /mnt/data mounts the partition.
      Permissions:
      Quote:
      drwxr-xr-x 7 bodhi adm
      Note: The user can mount the device and has rw permissions.
      Note: The ownership and permissions of the mount point have changed !
   3. fstab options: users,noauto,gid=100,umask=007
      mount /mnt/data mounts the partition.
      Permissions:
      Quote:
      drwxrwx--- 7 bodhi users
      Note: The user can mount the device and now both the user and the users group have rw permissions.
      Note: The ownership and permissions of the mount point have changed again !
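
The permission strings in these examples follow from the umask option: the mode shown is 777 with the umask bits cleared. A minimal sketch of that arithmetic follows (mode_from_umask is a hypothetical helper; the 022 value is an assumption about the umask in effect when no umask= option is given):

```shell
#!/bin/sh
# mode_from_umask UMASK  (hypothetical helper)
# Prints the octal directory mode a vfat mount shows for a given umask.
mode_from_umask() {
    # The leading 0 makes the shell read both numbers as octal
    printf '%03o\n' $(( 0777 & ~0$1 ))
}

mode_from_umask 022   # 755, i.e. drwxr-xr-x
mode_from_umask 007   # 770, i.e. drwxrwx---
```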



********* Linux Native File Systems **********

In this example I will use ext3, but this holds true for ext2, reiserfs, jfs, and xfs.

Code:

sudo mkdir /mnt/ext3

fstab:
Quote:
LABEL=ext3 /mnt/ext3 auto <options> 0 0

   1. fstab options: defaults
      mount /mnt/ext3 yields: mount: only root can mount LABEL=ext3 on /mnt/ext3

      sudo mount /mnt/ext3 mounts the device.
      Permissions:
      Quote:
      bodhi@Arch:~$ls -l /mnt | grep ext3
      drwxr-xr-x 3 bodhi users 1024 2006-11-07 17:26 ext3
      Note: Ownership has changed ! owner=bodhi, group=users, however ONLY USER (and root of course) has rw permissions.
   2. fstab options: users,noauto

      mount /mnt/ext3 mounts the partition.
      Permissions:
      Quote:
      bodhi@Arch:~$mount /mnt/ext3/
      bodhi@Arch:~$ls -l /mnt | grep ext3
      drwxr-xr-x 3 bodhi users 1024 2006-11-07 17:26 ext3
      Note: The user can mount the device and has rw permissions.
      Note: Ownership remains bodhi:users

      Note: ext2 and ext3 do not take uid=xxx, gid=xxx, or umask=xxx

      To set group rw permissions:
      fstab options: users,noauto
         1. mount the partition: mount /mnt/ext3
         2. Set permissions of the mount point: chmod 777 /mnt/ext3

      The set ownership and permissions will remain in effect with un-mount and re-boot.

      Example:
      Quote:
      bodhi@Arch:~$chmod 777 /mnt/ext3
      bodhi@Arch:~$ls -l /mnt | grep ext3
      drwxrwxrwx 3 bodhi users 1024 2006-11-07 17:51 ext3
      bodhi@Arch:~$umount /mnt/ext3/
      bodhi@Arch:~$ls -l /mnt | grep ext3
      drwxr-xr-x 2 root root 4096 2006-11-07 17:28 ext3
      bodhi@Arch:~$mount /mnt/ext3/
      bodhi@Arch:~$ls -l /mnt | grep ext3
      drwxrwxrwx 3 bodhi users 1024 2006-11-07 17:51 ext3
      bodhi@Arch:~$
      Note: The permissions revert when the partition is un-mounted (the drwxr-xr-x root root listing after umount).
      Note: The permissions remain rw when the partition is re-mounted (the drwxrwxrwx listing after the second mount).
      Note: The user can mount the device and now both the user and the users group have rw permissions.



============== END OF EXAMPLES ===============



bodhi.zazen

Friday, January 23, 2009

grub usage one case

Originally GRUB was installed in the MBR of /dev/sda0. Since I want to replace the /dev/sda0 disk, I need to install the same GRUB on /dev/sda1.

#sudo grub
grub> setup (hd1)
grub> quit


Install Grub
------------------
In order to install GRUB as your boot loader, you need to first install the GRUB system and utilities under your UNIX-like operating system (see Obtaining and Building GRUB). You can do this either from the source tarball, or as a package for your OS.

After you have done that, you need to install the boot loader on a drive (floppy or hard disk). There are two ways of doing that - either using the utility grub-install (see Invoking grub-install) on a UNIX-like OS, or by running GRUB itself from a floppy. These are quite similar, however the utility might probe a wrong BIOS drive, so you should be careful.

Also, if you install GRUB on a UNIX-like OS, please make sure that you have an emergency boot disk ready, so that you can rescue your computer if, by any chance, your hard drive becomes unusable (unbootable).

GRUB comes with boot images, which are normally put in the directory /usr/lib/grub/i386-pc. If you do not use grub-install, then you need to copy the files stage1, stage2, and *stage1_5 to the directory /boot/grub, and run the grub-set-default (see Invoking grub-set-default) if you intend to use `default saved' (see default) in your configuration file. Hereafter, the directory where GRUB images are initially placed (normally /usr/lib/grub/i386-pc) will be called the image directory, and the directory where the boot loader needs to find them (usually /boot/grub) will be called the boot directory. 

 Installing GRUB natively
-------------------------------

Caution: Installing GRUB's stage1 in this manner will erase the normal boot-sector used by an OS.

GRUB can currently boot GNU Mach, Linux, FreeBSD, NetBSD, and OpenBSD directly, so using it on a boot sector (the first sector of a partition) should be okay. But generally, it would be a good idea to back up the first sector of the partition on which you are installing GRUB's stage1. This isn't as important if you are installing GRUB on the first sector of a hard disk, since it's easy to reinitialize it (e.g. by running `FDISK /MBR' from DOS).

If you decide to install GRUB in the native environment, which is definitely desirable, you'll need to create a GRUB boot disk, and reboot your computer with it. Otherwise, see Installing GRUB using grub-install.

Once started, GRUB will show the command-line interface (see Command-line interface). First, set GRUB's root device to the partition containing the boot directory, like this:

     grub> root (hd0,0)

If you are not sure which partition actually holds this directory, use the command find (see find), like this:

     grub> find /boot/grub/stage1

This will search for the file name /boot/grub/stage1 and show the devices which contain the file.

Once you've set the root device correctly, run the command setup (see setup):

     grub> setup (hd0)

This command will install the GRUB boot loader on the Master Boot Record (MBR) of the first drive. If you want to put GRUB into the boot sector of a partition instead of putting it in the MBR, specify the partition into which you want to install GRUB:

     grub> setup (hd0,0)

If you install GRUB into a partition or a drive other than the first one, you must chain-load GRUB from another boot loader. Refer to the manual for the boot loader to know how to chain-load GRUB.

After using the setup command, you will boot into GRUB without the GRUB floppy. See the chapter Booting to find out how to boot your operating systems from GRUB.
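
The root/setup sequence above can also be scripted. The sketch below only prints the commands; grub_setup_commands is a hypothetical helper, and piping its output into grub --batch assumes the GRUB Legacy shell is the installed grub binary.

```shell
#!/bin/sh
# grub_setup_commands BOOT_PARTITION INSTALL_DEVICE  (hypothetical helper)
# Prints the GRUB shell commands used in the walkthrough above.
grub_setup_commands() {
    printf 'root %s\nsetup %s\nquit\n' "$1" "$2"
}

# Dry run: show what would be sent to GRUB
grub_setup_commands '(hd0,0)' '(hd0)'

# To really install (as root; this overwrites the boot sector):
#   grub_setup_commands '(hd0,0)' '(hd0)' | grub --batch
```

Printing first and piping second makes it easy to check the device names before anything is written.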

grub



Abstract

This description details the structure at the start of a hard disk in a PC, and how GRUB fits into that structure.

The PC hard disk layout

The diagram below is the disk layout of my linux laptop which is very standard. There are 4 primary partitions whose size and function are labeled, but this document is going to concentrate on the expanded area in the diagram at the start of the disk. I will describe its structure and how GRUB fits in. Note the structure is the same for windows systems, only the content (grub) is different.



PC hard disk layout

The Master Boot Record

The first sector (512 bytes) is the MBR and consists of a 446-byte bootloader, a 64-byte partition table and a 2-byte signature (0xAA55). It gets its name because it is the first boot record that is loaded, which can then load other boot records from various locations on the disk. To get a hexdump of this first sector of the disk, run dd bs=512 count=1 if=/dev/hda | od -Ax -tx1z -v (changing /dev/hda as appropriate for your setup):





000000 eb 48 90 d0 bc 00 7c fb 50 07 50 1f fc be 1b 7c  >.H....|.P.P....|<

000010 bf 1b 06 50 57 b9 e5 01 f3 a4 cb be be 07 b1 04  >...PW...........<

000020 38 2c 7c 09 75 15 83 c6 10 e2 f5 cd 18 8b 14 8b  >8,|.u...........<

000030 ee 83 c6 10 49 74 16 38 2c 74 f6 be 10 07 03 02  >....It.8,t......<

000040 80 00 00 80 b8 85 64 00 00 08 fa 80 ca 80 ea 53  >......d........S<

000050 7c 00 00 31 c0 8e d8 8e d0 bc 00 20 fb a0 40 7c  >|..1....... ..@|<

000060 3c ff 74 02 88 c2 52 be 79 7d e8 34 01 f6 c2 80  ><.t...R.y}.4....<

000070 74 54 b4 41 bb aa 55 cd 13 5a 52 72 49 81 fb 55  >tT.A..U..ZRrI..U<

000080 aa 75 43 a0 41 7c 84 c0 75 05 83 e1 01 74 37 66  >.uC.A|..u....t7f<

000090 8b 4c 10 be 05 7c c6 44 ff 01 66 8b 1e 44 7c c7  >.L...|.D..f..D|.<

0000a0 04 10 00 c7 44 02 01 00 66 89 5c 08 c7 44 06 00  >....D...f.\..D..<

0000b0 70 66 31 c0 89 44 04 66 89 44 0c b4 42 cd 13 72  >pf1..D.f.D..B..r<

0000c0 05 bb 00 70 eb 7d b4 08 cd 13 73 0a f6 c2 80 0f  >...p.}....s.....<

0000d0 84 f0 00 e9 8d 00 be 05 7c c6 44 ff 00 66 31 c0  >........|.D..f1.<

0000e0 88 f0 40 66 89 44 04 31 d2 88 ca c1 e2 02 88 e8  >..@f.D.1........<

0000f0 88 f4 40 89 44 08 31 c0 88 d0 c0 e8 02 66 89 04  >..@.D.1......f..<

000100 66 a1 44 7c 66 31 d2 66 f7 34 88 54 0a 66 31 d2  >f.D|f1.f.4.T.f1.<

000110 66 f7 74 04 88 54 0b 89 44 0c 3b 44 08 7d 3c 8a  >f.t..T..D.;D.}<.<

000120 54 0d c0 e2 06 8a 4c 0a fe c1 08 d1 8a 6c 0c 5a  >T.....L......l.Z<

000130 8a 74 0b bb 00 70 8e c3 31 db b8 01 02 cd 13 72  >.t...p..1......r<

000140 2a 8c c3 8e 06 48 7c 60 1e b9 00 01 8e db 31 f6  >*....H|`......1.<

000150 31 ff fc f3 a5 1f 61 ff 26 42 7c be 7f 7d e8 40  >1.....a.&B|..}.@<

000160 00 eb 0e be 84 7d e8 38 00 eb 06 be 8e 7d e8 30  >.....}.8.....}.0<

000170 00 be 93 7d e8 2a 00 eb fe 47 52 55 42 20 00 47  >...}.*...GRUB .G<

000180 65 6f 6d 00 48 61 72 64 20 44 69 73 6b 00 52 65  >eom.Hard Disk.Re<

000190 61 64 00 20 45 72 72 6f 72 00 bb 01 00 b4 0e cd  >ad. Error.......<

0001a0 10 ac 3c 00 75 f4 c3 00 00 00 00 00 00 00 00 00  >..<.u...........<

0001b0 00 00 00 00 00 00 00 00 5f 00 5f 00 00 00 00 01  >........_._.....<

0001c0 01 00 0b fe 7f 97 3f 00 00 00 59 03 64 00 00 00  >......?...Y.d...<

0001d0 41 98 83 fe 7f a4 98 03 64 00 cd 2f 03 00 00 00  >A.......d../....<

0001e0 41 a5 83 fe ff ff 65 33 67 00 fc 08 fa 00 80 fe  >A.....e3g.......<

0001f0 ff ff 0f fe ff ff 61 3c 61 01 1f ed f2 00 55 aa  >......a<a.....U.<

Note there is no mystery about, or disassembly required of, the bootloader code above (shown in blue in the original), as GRUB is open source and one can download, inspect and modify its stage 1 source code.
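
The 446 + 64 + 2 byte layout can be checked without touching a real disk. The sketch below builds a blank 512-byte image in a temporary file (a stand-in for /dev/hda), stamps the signature at offset 510 and reads it back; mbr_signature is a hypothetical helper name.

```shell
#!/bin/sh
# Build a blank one-sector image and write the boot signature into it
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
# Octal escapes for 0x55 0xAA: the word 0xAA55 is stored low byte first
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

# mbr_signature FILE: print the last two bytes of the first sector as hex
mbr_signature() {
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
    echo
}

echo "layout: 446 + 64 + 2 = $((446 + 64 + 2)) bytes"
mbr_signature "$img"
```

Run as root, mbr_signature /dev/hda would read the same two bytes from the real disk; they match the 55 aa at the end of the hexdump above.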

The DOS compatibility region

This region is optional as far as linux is concerned at least, but is added by default by most partition managers. To understand why this region was required we need to describe how disks used to be addressed. Generally now disks are addressed in LBA mode which allows for greater capacity disks while abstracting software away from the specifics of the disk itself. Previously though disks were addressed in CHS mode, which represented the physical construction of the disk as can be seen in the diagram below:



Cylinders Heads Sectors



DOS had the requirement that its image did not span across cylinders, and so partition managers added this region to keep the first partition aligned to the disk geometry: the MBR takes the first sector of the first track, and the first partition starts on the next track. Therefore this region's size is determined by the number of sectors (512 bytes each) per track. The maximum (and usual, given today's disk sizes and LBA) is 63 sectors per track, which leaves 62 sectors free after the MBR (31,744 bytes).



GRUB uses this region to store its stage 1.5, which is filesystem specific code used to find the operating system image on the "boot" filesystem. Currently GRUB does not need all this space as can be seen for the copies of all the stage 1.5 files in the boot partition:



$ ls -lS /boot/grub/*stage1_5

-rw-r--r--  1 root root 9428 Mar  8 14:27 /boot/grub/reiserfs_stage1_5

-rw-r--r--  1 root root 9308 Mar  8 14:27 /boot/grub/xfs_stage1_5

-rw-r--r--  1 root root 8448 Mar  8 14:27 /boot/grub/jfs_stage1_5

-rw-r--r--  1 root root 7956 Mar  8 14:27 /boot/grub/e2fs_stage1_5

-rw-r--r--  1 root root 7684 Mar  8 14:27 /boot/grub/fat_stage1_5

-rw-r--r--  1 root root 7272 Mar  8 14:27 /boot/grub/ufs2_stage1_5

-rw-r--r--  1 root root 7188 Mar  8 14:27 /boot/grub/minix_stage1_5

-rw-r--r--  1 root root 7028 Mar  8 14:27 /boot/grub/iso9660_stage1_5

-rw-r--r--  1 root root 6996 Mar  8 14:27 /boot/grub/ffs_stage1_5

-rw-r--r--  1 root root 6612 Mar  8 14:27 /boot/grub/vstafs_stage1_5



Therefore one could theoretically use the last 43 sectors of this region (22,016 bytes) for anything. There is no point in creating a filesystem in here as the overhead would be too much, but you could dd stuff in and out like:

dd bs=512 seek=20 count=43 if=myfile of=/dev/hda

dd bs=512 skip=20 count=43 if=/dev/hda of=myfile

Note: be very sure you know what you're doing before running these commands.
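
The "last 43 sectors" figure comes from rounding the largest stage 1.5 file up to whole sectors. A sketch of the arithmetic, using the 9428-byte reiserfs_stage1_5 size from the listing above (sectors_needed is a hypothetical helper):

```shell
#!/bin/sh
SECTOR=512
REGION_SECTORS=62   # sectors between the MBR and the first partition

# sectors_needed BYTES: round a byte count up to whole sectors
sectors_needed() {
    echo $(( ($1 + SECTOR - 1) / SECTOR ))
}

largest=9428   # reiserfs_stage1_5, the biggest file in the listing
used=$(sectors_needed "$largest")
free=$(( REGION_SECTORS - used ))
echo "stage 1.5 needs $used sectors; $free sectors ($(( free * SECTOR )) bytes) remain"
```

With the MBR in sector 0 and stage 1.5 in sectors 1 to 19, the free area starts at sector 20, which is where the seek=20 and skip=20 values in the dd commands come from.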



Note on my laptop the last sector of this region contains the first sector of a MSWIN4.1 boot record (which I understand is 3 sectors in total), as it came with winxp installed (in a FAT32 partition). GRUB makes this sector redundant even in a dual boot situation, so don't worry about overwriting it. You can inspect this sector using: dd bs=512 skip=62 count=1 if=/dev/hda | od -Ax -tx1z -v

GRUB 1 boot process

GRUB or the GRand Unified Bootloader is the usual bootloader used on linux systems, and resides on the system as described above. The boot process with GRUB is as follows:



    * BIOS code reads MBR from a disk (looks at last 2 bytes to verify if MBR).

    * Starts executing bootloader code (GRUB stage 1).

    * Bootloader jumps to location (sector num) of next stage. This sector num is stored at a particular location in the bootloader "code" at GRUB install time and usually points to a stage 1.5 located in the "DOS compat space" immediately after the MBR.

    * Stage 1.5 knows about the boot filesystem so it opens the filesystem on the specified (at install time) partition. It looks for the stage 2 executable here and executes it. Note since stage 1.5 knows about the boot filesystem it gives much greater flexibility in upgrading stage 2 and the kernel etc. as their new locations don't need to be written to the earlier GRUB stages.

    * Stage 2 contains most of the GRUB logic. It loads the menu.lst file and executes the statements, usually providing a menu to the user etc.



      Subsequent steps are distro specific but for completeness I'll describe them for my fedora linux distribution:



    * When GRUB starts booting one of the entries, it reads the initial ramdisk and starts the kernel running telling it about the ramdisk.

    * In the initial ramdisk, the nash shell is run to parse the /linuxrc file. It essentially finds the location of the filesystem it itself is on and passes that to the kernel as its root filesystem. This allows for greater flexibility of the devices the kernel resides on.

    * The kernel reads its root filesystem and executes /sbin/init by default. This in turn parses /etc/inittab, which usually sets up the login consoles and starts executing the scripts in /etc/init.d

    * These scripts start various subsystems, culminating in starting X. X in turn starts the display manager which gives a login prompt.



GRUB 2 boot process

The structure of GRUB has changed quite a bit with version 2 (which is still in development at the time of writing). Instead of stage 1, stage 1.5 and stage 2 images, it has a collection of modules that can be combined in different ways. To mirror the functionality of the original GRUB's stages as described above, you would have for example:



stage 1 = boot.img

stage 1.5 = diskboot.img+kernel.img+pc.mod+ext2.mod (the core image)

stage 2 = normal.mod+_chain.mod



The boot process for GRUB 2 then would be:



    * BIOS code reads MBR from a disk (looks at last 2 bytes to verify if MBR).

    * Starts executing bootloader code (boot.img).

    * This loads the first sector of the core image (diskboot.img), whose location was stored at a particular location in boot.img at GRUB install time and usually points to the core image located in the "DOS compat space" immediately after the MBR. diskboot.img in turn loads and executes the rest of the core image.

    * The core image modules know about the boot filesystem and can open the filesystem on the specified (at install time) partition. They look for and load modules there, including normal.mod

    * normal.mod loads the grub.cfg file and executes the statements, usually providing a menu to the user etc.



This is a more flexible mechanism. For example one can prepend pxeboot.img to the core image instead of diskboot.img. This will then load the whole core image from the network and then start executing kernel.img.

Thursday, January 15, 2009

I/O scheduler

Although most Linux users are familiar with the role of process
schedulers, such as the new O(1) scheduler, many users are not so
familiar with the role of I/O schedulers. I/O schedulers are similar in
some aspects to process schedulers; for instance, both schedule some resource
among multiple users. A process scheduler virtualizes the resource of
processor time among multiple executing processes on the system. So,
what does an I/O scheduler schedule?

A naïve system would not even include an I/O scheduler. Unlike the process
scheduler, the I/O scheduler is not a mandatory component of the operating
system. Instead, performance is the I/O scheduler's sole
raison d'être.

To understand the role of an I/O scheduler, let's go over some background
information and then look at how a system behaves without an I/O
scheduler. Hard disks address their data using the familiar geometry-based
addressing of cylinders, heads and sectors. A hard
drive is composed of multiple platters, each consisting of a single
disk, spindle and read/write head. Each platter is divided further into
circular ring-like tracks, similar to a CD or record. Finally,
each track is composed of some integer number of sectors.

To locate a specific unit of data in a hard drive, the drive's logic
requires three pieces of information: the cylinder, the head and the
sector. The cylinder specifies the track on which the data resides. If
you lay the platters on top of one another (as they are in a hard disk),
a given track forms a cylinder through each platter. The head then
identifies the exact read/write head (and thus the exact platter)
in question. The search now is narrowed down to a single track on a
single platter. Finally, the sector value denotes the exact sector on
the track. The search is complete: the hard disk knows what sector,
on what track, on what platter the data resides. It can position the
read/write head of the correct platter over the correct track and read
the proper sector.

Thankfully, modern hard disks do not force computers to communicate
with them in terms of cylinders, heads and sectors. Instead, modern
hard drives map a unique block number over each cylinder/head/sector
triplet. The unique number identifies a specific cylinder/head/sector
value. Modern operating systems then can address hard drives using
this block number—called logical block addressing—and the
hard drive translates the block number into the correct
cylinder/head/sector value.
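
The translation between the two addressing schemes is plain arithmetic. As
a sketch, the conventional formula is LBA = (C * heads + H) * sectors + (S - 1),
where sectors are numbered from 1. The geometry of 255 heads and 63 sectors
per track used below is an assumption for illustration, and chs_to_lba is a
hypothetical helper name.

```shell
#!/bin/sh
# chs_to_lba CYL HEAD SECTOR HEADS_PER_CYL SECTORS_PER_TRACK
# Sectors are numbered from 1, cylinders and heads from 0.
chs_to_lba() {
    echo $(( ($1 * $4 + $2) * $5 + ($3 - 1) ))
}

chs_to_lba 0 0 1 255 63   # first sector of the disk
chs_to_lba 0 1 1 255 63   # first sector of the second track
chs_to_lba 1 0 1 255 63   # first sector of the second cylinder
```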

One thing of note about this block number: although nothing guarantees
it, the physical mapping tends to be sequential. That is, logical
block n tends to be physically adjacent to logical block
n+1. We discuss why that is important later on.

Now, let's consider the typical UNIX system. Applications as varied
as databases, e-mail clients, Web servers and text editors issue I/O
requests to the disk, such as read this block and write to that
block. The blocks tend to be located physically all over the disk. The
e-mail spool may be located in an entirely different region of the disk
from the Web server's HTML data or the text editor's configuration
file. Indeed, even a single file can be strewn all over the disk if the
file is fragmented, that is, not laid out in sequential blocks. Because the files
are broken down into individual blocks, and hard drives are addressed by
block and not the much more abstract concepts of files, reading or writing
file data is broken down into a stream of many individual I/O requests,
each to a different block. With luck, the blocks are sequential or at
least physically close together. If the blocks are not near one
another, the disk head must move to another location on the disk. Moving
the disk head is called seeking, and it is one of the most expensive
operations in a computer. The seek time on modern hard drives is measured
in the tens of milliseconds. This
is one reason why defragmented files are a good thing.

Unfortunately, it does not matter if the files are defragmented because
the system is generating I/O requests for multiple files, all over the
disk. The e-mail client wants a little from here and the Web server wants
a little from there—but wait, now the text editor wants to read a
file. The net effect is that the disk head is made to jump around the
disk. In a worst-case scenario, with interleaved I/O requests to multiple
files, the head can spend all of its time jumping around from
one location to another—not a good thing for overall system performance.

This is where the I/O scheduler comes in. The I/O scheduler schedules
the pending I/O requests in order to minimize the time spent moving the
disk head. This, in turn, minimizes disk seek time and maximizes hard
disk throughput.

This magic is accomplished through two main actions, sorting
and merging. First, the I/O scheduler keeps the list of pending
I/O requests sorted by block number. When a new I/O request is issued,
it is inserted, block-wise, into the list of pending requests. This
prevents the drive head from seeking all around the disk to service I/O
requests. Instead, by keeping the list sorted, the disk head moves in
a straight line around the disk. If the hard drive is busy servicing
a request at one part of the disk, and a new request comes in to the
same part of the disk, that request can be serviced before moving off
to other parts of the disk.

Merging occurs when an I/O request is issued to an identical or adjacent
region of the disk. Instead of issuing the new request on its own, it
is merged into the identical or adjacent request. This minimizes
the number of outstanding requests.

Let's look at an example. Consider the case where two applications issue requests
to the following block numbers, such that they arrived in the kernel in
this order: 10, 500, 12, 502, 14, 504 and 12. The I/O scheduler-less
approach would service these blocks in the given order. That is seven
long seeks, back and forth between two parts of the disk. What a waste!
If the kernel sorted and merged these requests, however, and serviced
them in that order, the results would be much different: 10, 12, 14, 500,
502 and 504. Only a single far-off seek and one less request overall.
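
The example can be replayed with standard tools. This is a rough sketch, not
the kernel's algorithm: it treats the block number as the head position,
starts the head at block 0, and merges only identical requests (a real
scheduler also merges adjacent ones). The function names are made up for
illustration.

```shell
#!/bin/sh
# sort_and_merge: read block numbers, one per line; emit them sorted,
# with duplicate requests merged into one
sort_and_merge() {
    sort -n | uniq
}

# head_travel: total distance the head moves when servicing blocks in
# input order, starting from block 0
head_travel() {
    awk 'BEGIN { pos = 0 }
         { d = $1 - pos; if (d < 0) d = -d; total += d; pos = $1 }
         END { print total + 0 }'
}

requests='10 500 12 502 14 504 12'
# $requests is left unquoted on purpose so printf emits one block per line
printf '%s\n' $requests | head_travel                    # arrival order
printf '%s\n' $requests | sort_and_merge | head_travel   # sorted and merged
printf '%s\n' $requests | sort_and_merge | paste -s -d ' ' -
```

Comparing the two travel figures shows how much head movement the sorted
order saves for the same set of requests.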

In this manner, an I/O scheduler virtualizes the resources of
disk I/O among multiple I/O requests in order to maximize global
throughput. Because I/O throughput is so crucial to system performance
and because seek time is so horribly slow, the job of an I/O scheduler
is important.

The Linus Elevator

The I/O scheduler found in the 2.4 Linux kernel is named the Linus
Elevator. I/O schedulers often are called elevator algorithms,
because they tackle a problem similar to that of keeping an elevator
moving smoothly in a large building. The Linus Elevator functions almost
exactly like the classic I/O scheduler described above. For the most part,
this was great because simplicity is a good thing and the 2.4 kernel's
I/O scheduler just worked. Unfortunately, in the I/O scheduler's quest
to maximize global I/O throughput, a trade-off was made: local
fairness—in particular, request latency—can go easily out the window. Let's
look at an example.

Consider a stream of requests to logical disk blocks such as 20, 30,
700 and 25. The I/O scheduler's sorting algorithm would queue and issue
the requests in the following order (assuming the disk head currently
is at the logical start of the disk): 20, 25, 30 and 700. This is expected
and indeed correct. Assume, however, that halfway through servicing
the request to block 25, another request comes in to the same part of
the disk. And then another. And yet another. It is entirely feasible
that the request way over to block 700 is not serviced for a relatively
long time.

Worse, what if the request was to read a disk block? Read requests
generally are synchronous. When an application issues a request to
read some data, it typically blocks and waits until the kernel
returns the data. The application must sit and wait, twiddling
its thumbs, until that request way over at block 700 finally
is serviced. Writes, on the other hand, typically are not
synchronous—they
are asynchronous. When an application issues a write, the kernel
copies the data and metadata into the kernel, prepares a buffer to hold
the data and returns to the application. The application does not really
care or even know when the data actually hits the disk.

It gets worse for our friend the read request, however. Because writes
are asynchronous, writes tend to stream. That is, it is common
for a large writeback of a lot of data to occur. This implies that many
individual write requests are submitted to a close area of the hard
disk. As an example, consider saving a large file. The application dumps
write requests on the system and hard drive as fast as it is scheduled.

Read requests, conversely, usually do not stream. Instead, applications
submit read requests in small one-by-one chunks, with each chunk dependent
on the last. Consider reading all of the files in a directory. The
application opens the first file, issues a read request for a suitable
chunk of the file, waits for the returned data, issues a read request for
the next chunk, waits and continues likewise until the entire file is
read. Then the file is closed, the next file is opened and the process
repeats. Each subsequent request has to wait for the previous,
which means substantial delays to this application if the requests are to
far-off disk blocks. The phenomenon of streaming write requests starving
dependent read requests is called writes-starving-reads (see Sidebar
“Test 1. Writes-Starving-Reads”).

Test 1. Writes-Starving-Reads

In the background, perform a streaming write, such as:


while true
do
        dd if=/dev/zero of=file bs=1M
done

Meanwhile, time how long a simple read of a 200MB file takes:


time cat 200mb-file > /dev/null

This test demonstrates the writes-starving-reads problem.

The possibility of not servicing some requests in a reasonable amount
of time is called starvation. Request starvation results in
unfairness. In the case of I/O schedulers, the system explicitly
has decided to trade fairness for improved global throughput. That is,
the system attempts to improve the overall performance of the system
at the possible expense of any one specific I/O request. This is
accepted and, indeed, desired—except that prolonged starvation is
bad. Starving read requests for even moderate lengths of time results in
high application latency for applications issuing read requests during
other disk activity. This high read latency adversely affects
system performance and feel (see Sidebar “Test 2. Effects of High
Read Latency”).

Test 2. Effects of High Read Latency

Start a streaming read in the background:


while true
do
        cat big-file > /dev/null
done

Meanwhile, measure how long it takes for a read of every file in the kernel
source tree to complete:


time find . -type f -exec cat '{}' ';' > /dev/null

This measures the performance of a series of small dependent reads during a
large streaming read.

The Deadline I/O Scheduler

Preventing the starvation of requests in general, and read requests in
particular, was a goal of the new 2.6 I/O schedulers.

The Deadline I/O Scheduler was introduced to solve the starvation issue
surrounding the 2.4 I/O scheduler and traditional elevator algorithms
in general. As discussed, the Linus Elevator maintains the sorted list
of pending I/O requests in a single queue. The I/O request at the head of
the queue is the next one to be serviced. The Deadline I/O Scheduler continues to
keep this queue, but kicks things up a notch by introducing two additional
queues—the read FIFO queue and the write FIFO queue. The
Deadline I/O Scheduler keeps the items in each of these queues sorted by
submission time (effectively, first in is first out). The read FIFO
queue, as its name suggests, contains only read requests. The write FIFO
queue, likewise, contains only write requests. Each FIFO queue is assigned
an expiration value. The read FIFO queue has an expiration time
of 500 milliseconds. The write FIFO queue has an expiration time of five seconds.

When a new I/O request is submitted, it is insertion-sorted into the
standard queue and placed at the tail of its respective (either read
or write) FIFO queue. Normally, the hard drive is sent I/O requests from
the head of the standard sorted queue. This maximizes global throughput
by minimizing seeks, as the normal queue is sorted by block number,
as with the Linus Elevator.

When the item at the head of one of the FIFO queues, however, grows
older than the expiration value associated with its queue, the I/O
scheduler stops dispatching I/O requests from the standard queue.
Instead, it services the I/O request at the head of the FIFO queue, plus a
couple of extra requests for good measure. The I/O scheduler needs to check and
handle only the requests at the head of the FIFO queues, as those are the
oldest requests in each queue.
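
The three-queue arrangement described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the kernel's implementation: the class name DeadlineSketch, its method names and the tuple layout are made up for illustration, expired reads are checked before expired writes as an arbitrary choice, and the "couple of extra" requests dispatched after an expiry are left out.

```python
import bisect
from collections import deque

READ_EXPIRE_S = 0.5   # read FIFO expiration from the text (500 ms)
WRITE_EXPIRE_S = 5.0  # write FIFO expiration from the text (5 s)


class DeadlineSketch:
    """Toy model: one block-sorted queue plus read and write FIFO queues."""

    def __init__(self):
        self.sorted_queue = []  # requests ordered by block number
        self.fifo = {"read": deque(), "write": deque()}  # ordered by submission
        self._seq = 0  # tie-breaker so equal blocks keep submission order

    def submit(self, op, block, now):
        """Queue a request; each request is (block, seq, op, deadline)."""
        self._seq += 1
        expire = READ_EXPIRE_S if op == "read" else WRITE_EXPIRE_S
        req = (block, self._seq, op, now + expire)
        bisect.insort(self.sorted_queue, req)  # insertion-sort by block number
        self.fifo[op].append(req)              # tail of its own FIFO queue

    def _take(self, req):
        """Remove a request from both queues it lives in and return it."""
        self.sorted_queue.remove(req)
        self.fifo[req[2]].remove(req)
        return req

    def dispatch(self, now):
        """Return the next request to service, or None if nothing is pending."""
        # Only the FIFO heads need checking: they are the oldest requests.
        # Assumption: expired reads are looked at before expired writes.
        for op in ("read", "write"):
            queue = self.fifo[op]
            if queue and queue[0][3] <= now:
                return self._take(queue[0])
        # Nothing has expired: take the head of the block-sorted queue.
        if self.sorted_queue:
            return self._take(self.sorted_queue[0])
        return None
```

Submitting writes to blocks 10, 20 and 30 and a read to block 700 at time 0, then calling dispatch() at 0.1 and 0.6 seconds, returns the write to block 10 first and the expired read to block 700 second, ahead of the remaining writes.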

Remember our old friend, the request to block 700? Despite the flood of
write requests to the faraway blocks, after 500 milliseconds the Deadline
I/O Scheduler would stop servicing those requests and service the read
over at block 700. The disk would seek to block 700, service the read
request and then continue servicing the other outstanding requests.

In this manner, the Deadline I/O Scheduler can enforce a soft deadline on
I/O requests. Although it makes no promise that an I/O request is serviced
before the expiration time, the I/O scheduler generally services requests
near their expiration times. Thus, the Deadline I/O Scheduler continues
to provide good global throughput without starving any one request for
an unacceptably long time. Because read requests are given short
expiration times, the writes-starving-reads problem is minimized.

The Anticipatory I/O Scheduler

This is all well and good, but it's not a perfect solution. Consider what
happens with our fictional request to block 700, which presumably is the
first of many dependent reads to that area of the disk. After servicing
the read request, the Deadline I/O Scheduler continues servicing the
write requests to the earlier blocks. This is fine, until the reading
application submits its next read request (say, to block 710). Up to 500
milliseconds later, that request expires and the disk seeks over
to block 710, services the request, seeks back to where it was before
and continues servicing the streaming write. And then another read
arrives.

The problem again stems from those darn dependent reads. Because reads
are issued in dependent chunks, the application issues the next read
only when the previous is returned. But by the time the application
receives the read data, is scheduled to run and submits the next read,
the I/O scheduler has moved on and begun servicing some
other requests. This results in a wasted pair of seeks for each read:
seek to the read, service it and seek back. If only there were some way
for the I/O scheduler to know—nay, to anticipate—that another
read would soon be submitted to the same part of the disk. Instead
of seeking back and forth, it could wait in anticipation for the next
read. Avoiding that pair of expensive seeks certainly is worth a few
milliseconds of waiting.
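
The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the function name seek_overhead_ms is made up, the seek and wait times passed in are hypothetical, and the anticipating case pessimistically charges a full wait for every read, including the final wait that turns out to be wasted.

```python
def seek_overhead_ms(n_reads, seek_ms, wait_ms=0.0, anticipate=False):
    """Time spent not transferring data while servicing n dependent reads.

    Without anticipation, every read costs a seek over and a seek back.
    With anticipation, the disk seeks over once, waits up to wait_ms after
    each read for the next one, and seeks back once at the end.
    """
    if anticipate:
        return 2 * seek_ms + n_reads * wait_ms
    return 2 * n_reads * seek_ms
```

For 100 dependent reads with a hypothetical 8 ms seek, seeking back and forth costs 1600 ms of overhead, while waiting a hypothetical 3 ms after each read costs 316 ms.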

This is, of course, exactly what the Anticipatory I/O Scheduler
does. It began as the Deadline I/O Scheduler; it implements the
same deadline-based scheduling. But it was gifted with the addition
of an anticipation mechanism. When a read request is submitted,
the Anticipatory I/O Scheduler services it within its deadline, as
usual. Unlike the Deadline I/O Scheduler, however, the Anticipatory
I/O Scheduler then sits and waits, doing nothing, for up to six
milliseconds. Chances are good that the
application will issue another read to the same part of the disk
during those six milliseconds. If
so, that request is serviced immediately, and the Anticipatory I/O
Scheduler waits some more. If six milliseconds go by without a read
request, the Anticipatory I/O Scheduler guessed wrong and returns to
whatever it was doing before.
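
The wait-and-see loop just described can be sketched as follows. This is an illustration only: the function name and the idea of feeding it a list of application "think times" (the gap between one read completing and the next being submitted) are assumptions made for the example, and the sketch ignores deadlines on other pending requests.

```python
ANTIC_WAIT_S = 0.006  # six-millisecond anticipation window from the text


def reads_serviced_in_one_visit(think_times_s, wait_s=ANTIC_WAIT_S):
    """Count the reads serviced before the scheduler gives up waiting.

    think_times_s lists, in order, how long the application takes to
    submit each follow-up read after the previous one completes.
    """
    serviced = 1  # the read that brought the disk head here
    for gap in think_times_s:
        if gap > wait_s:
            break  # guessed wrong: return to whatever was being done before
        serviced += 1  # the read arrived in time and is serviced immediately
    return serviced
```

With think times of 1 ms, 2 ms and 10 ms, the first two follow-up reads arrive inside the window and the third does not, so three reads are serviced in one visit.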

If even a moderate number of requests are anticipated correctly, a great
deal of time is saved: two expensive seeks for each correct anticipation
(Table 1). Because most reads are dependent, the anticipation pays off most
of the time. To further improve the odds of a correct anticipation, the
Anticipatory I/O Scheduler uses a heuristic to guess which processes are
worth waiting for. To this end, the scheduler maintains I/O statistics about
each process to keep track of its behavior. Because of these statistics and
heuristics, the Anticipatory I/O Scheduler correctly anticipates the
actions of applications often enough to be well worth the overhead.

Table 1. The Results

I/O Scheduler and Kernel             Test 1         Test 2
Linus Elevator on 2.4                45 seconds     30 minutes, 28 seconds
Deadline I/O Scheduler on 2.6        40 seconds     3 minutes, 30 seconds
Anticipatory I/O Scheduler on 2.6    4.6 seconds    15 seconds

By minimizing unneeded seeks and more quickly servicing read requests,
in many workloads the Anticipatory I/O Scheduler provides both improved
request latency and global throughput over the Deadline I/O Scheduler
and the Linus Elevator. Unsurprisingly, the Anticipatory I/O Scheduler
is the default I/O scheduler in the 2.6 kernel. And rightly so: it rocks.

Acknowledgement

Andrea Arcangeli and Jens Axboe are the primary authors of the
Linus Elevator. Jens Axboe is the primary author of the Deadline I/O
Scheduler. Nick Piggin is the primary author of the Anticipatory I/O
Scheduler.

Robert Love (rml@tech9.net) is a kernel hacker at MontaVista Software
and a student at the University of Florida. He is the author of Linux
Kernel Development. Robert loves O.A.R. and lives in Gainesville,
Florida.
