The process of booting a computer system over the network is well understood, and it's been around for donkey's ages. Basically, the way it works is that a computer system requests an IP address from a BOOTP/DHCP server, obtains the name of a bootstrap program (e.g. PXELINUX) it should load from a TFTP server, and subsequently uses that to boot the machine. This is used extensively when installing operating systems onto a number of machines. I've been wanting to avoid using TFTP because:

  • The first T in TFTP stands for trivial; TFTP is unreliable and error-prone and won't work over wide area networks. Ideally, PXE systems would implement alternative protocols but most don't.
  • TFTP is an all-or-nothing proposition: there's no access control to the content of the server's directory. (There is at least one server that includes libwrap capabilities.)
  • Configuration files for PXELINUX (i.e. the things that live in its pxelinux.cfg directory) cannot be created on demand. I can pre-create a file and save it in the required directory for TFTP to send out, but files must exist by the time PXELINUX asks for them.

Earlier this year I mentioned I was setting up lots of bare metal, and I mentioned iPXE (formerly gPXE, formerly Etherboot). iPXE is a network boot loader which provides a full PXE implementation with some exciting features: it can boot via HTTP (and from an iSCSI SAN), and I can control the boot process with a script. Ideally, the network cards (NICs) we use would have iPXE burnt in (which can be done), but in this project we haven't yet evaluated what that would mean in terms of hardware.

In the following discussion I assume you've downloaded a copy of the iPXE source code, unpacked it, and run make in the src directory. This first make takes a bit of time; it creates all of iPXE's target formats. Later on I'll show you how to embed a script, and the make for that takes a second or two.

Three scenarios

iPXE can be used in a variety of ways, but I'll concentrate on three scenarios in the following diagram:

[Diagram: ipxe]

The three machines boot as follows:

  1. machine1 sends out a PXE request which is answered by a near-by DHCP server. It then loads iPXE as undionly.kpxe from the TFTP server, and the rest happens over HTTP. undionly.kpxe is created with make bin/undionly.kpxe, and I drop that file into my TFTP root directory and then have my DHCP server give this file as boot file to my clients, ensuring I break the infinite loop that would result. (My dhcpd.conf is below.)

  2. machine2 boots with a customized iPXE script, either from a modified network ROM or via, say, a CD-ROM. It obtains its network address via DHCP and can then directly "speak" to an HTTP server. To create a customized boot loader with an embedded script (e.g. jpmens.ipxe), I invoke make bin/undionly.kpxe EMBED=jpmens.ipxe and store the resulting file on a bootable floppy or burn it onto a CD-ROM, etc. The embedded script uses iPXE commands to obtain DHCP parameters when it starts, or I can hard-code IP address, net mask, etc., and I can use iPXE settings in the script, as we'll see for machine3.

  3. In the case of machine3, I've created a custom iPXE image with which the machine boots. The script contains hard-coded network addresses, and it should be straight-forward to mass-create custom images with a bit of sh and make. This is interesting if there is no DHCP server (or relay) close to (network-wise) the node.

DHCP, TFTP, and HTTP

machine1 uses DHCP and a TFTP server to load iPXE's undionly.kpxe, after which the latter takes over. The DHCP server configuration I'm using is:

option space ipxe;
option ipxe-encap-opts code 175 = encapsulate ipxe;
option ipxe.priority code 1 = signed integer 8;
option ipxe.keep-san code 8 = unsigned integer 8;
option ipxe.skip-san-boot code 9 = unsigned integer 8;
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.bus-id code 177 = string;
option ipxe.bios-drive code 189 = unsigned integer 8;
option ipxe.username code 190 = string;
option ipxe.password code 191 = string;
option ipxe.reverse-username code 192 = string;
option ipxe.reverse-password code 193 = string;
option ipxe.version code 235 = string;
option iscsi-initiator-iqn code 203 = string;
option ipxe.pxeext code 16 = unsigned integer 8;
option ipxe.iscsi code 17 = unsigned integer 8;
option ipxe.aoe code 18 = unsigned integer 8;
option ipxe.http code 19 = unsigned integer 8;
option ipxe.https code 20 = unsigned integer 8;
option ipxe.tftp code 21 = unsigned integer 8;
option ipxe.ftp code 22 = unsigned integer 8;
option ipxe.dns code 23 = unsigned integer 8;
option ipxe.bzimage code 24 = unsigned integer 8;
option ipxe.multiboot code 25 = unsigned integer 8;
option ipxe.slam code 26 = unsigned integer 8;
option ipxe.srp code 27 = unsigned integer 8;
option ipxe.nbi code 32 = unsigned integer 8;
option ipxe.pxe code 33 = unsigned integer 8;
option ipxe.elf code 34 = unsigned integer 8;
option ipxe.comboot code 35 = unsigned integer 8;
option ipxe.efi code 36 = unsigned integer 8;
option ipxe.fcoe code 37 = unsigned integer 8;
option ipxe.no-pxedhcp 1;

authoritative;

ddns-update-style interim;
ignore client-updates;

allow booting;
allow bootp;

set vendorclass = option vendor-class-identifier;

subnet 10.0.12.0 netmask 255.255.254.0 {
   option routers         10.0.12.4;
   option subnet-mask      255.255.254.0;

   option domain-name      "jpmens.net";
   option domain-name-servers   10.1.1.1;

   set clIP = binary-to-ascii(10, 8, ".", leased-address);
   set clHW = concat (
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 1, 1))),2), ":",
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 2, 1))),2), ":",
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 3, 1))),2), ":",
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 4, 1))),2), ":",
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 5, 1))),2), ":",
      suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 6, 1))),2));

   default-lease-time 21600;
   max-lease-time 43200;
   next-server 10.0.12.249;

   if exists user-class and option user-class = "iPXE" {
      set uri = concat("http://${next-server}/netboot.php?MAC=", clHW);
      filename = uri;
   } else {
      filename = "undionly.kpxe";
   }
}
host machine1 {
   hardware ethernet 00:50:56:9a:00:1d;
   fixed-address 10.0.12.251;
}

When the machine (node) boots, it fires off its first PXE request. Our DHCP server receives the request and gives it an IP address, netmask, etc., as well as the boot filename undionly.kpxe. The node then retrieves undionly.kpxe via TFTP, loads it, and executes it. iPXE (undionly.kpxe) then issues a DHCP request of its own. Without the if exists user-class magic we'd enter an endless loop in which iPXE would load itself, then load itself, ad nauseam. The if ensures that when iPXE issues a DHCP request, it is instead given a filename that points at netboot.php, which resides on an HTTP server. From this point onwards, everything happens over HTTP!
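
The decision the DHCP server makes can be restated in a few lines of Python. This is only an illustrative sketch and not part of my setup: the function names format_mac and boot_filename are made up, the user class is assumed to arrive as a plain string, and the sketch fills in the server address directly, whereas my dhcpd.conf passes ${next-server} through literally.

```python
def format_mac(hardware: bytes) -> str:
    """Format a 6-byte hardware address as zero-padded, colon-separated hex.

    This mirrors the concat/suffix expression in dhcpd.conf, which pads
    every byte to two hex digits.
    """
    return ":".join(f"{byte:02x}" for byte in hardware)


def boot_filename(user_class, hardware, next_server="10.0.12.249"):
    """Return the boot filename the DHCP server would hand out."""
    if user_class == "iPXE":
        # Second request: iPXE is already running, so point it at HTTP.
        return f"http://{next_server}/netboot.php?MAC={format_mac(hardware)}"
    # First request: plain PXE firmware, so hand out iPXE itself via TFTP.
    return "undionly.kpxe"


mac = bytes([0x00, 0x50, 0x56, 0x9A, 0x00, 0x1D])
print(boot_filename(None, mac))    # what the PXE firmware is told
print(boot_filename("iPXE", mac))  # what iPXE is told
```

The second branch is what breaks the loop: a client that already identifies itself as iPXE is never handed undionly.kpxe again.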

The file name iPXE chains into is an HTTP URL which, in my case, creates an on-the-fly configuration script for iPXE. (The strange-looking concat business in dhcpd.conf is to ensure the hardware address is correctly formatted.) To make things easier, I'll omit showing the code the iPXE script is generated from (basically a database access and some Mustache); instead, here is its output:

#!ipxe
echo +----- NETBOOT ----------------------------------------------
echo |hostname: ${hostname}, next-server: ${next-server}
echo |mac.....: ${net0/mac} / 
echo +------------------------------------------------------------
echo .
kernel http://10.0.12.1/sw/linux root=/dev/ram0 load_ramdisk=1 initrd=initrd showopts ramdisk_size=65535  install=http://10.0.12.1/sw/iso textmode=1 autoyast=http://10.0.12.1/sw/baremetal.php?MAC=00:50:56:9a:00:1d
initrd http://10.0.12.1/sw/initrd
boot ||
shell

The echo prints information to the screen, using some of iPXE's settings. Apart from that, a kernel is loaded together with an initrd image, and we attempt to boot that. If that fails, we fall back into iPXE's shell.
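
To give an idea of how such a script can be produced on demand, here is a minimal Python sketch. It is not the code behind netboot.php (which uses a database and Mustache); render_ipxe_script is a hypothetical helper, and the server address and kernel arguments are simply copied from the output above.

```python
def render_ipxe_script(mac: str, server: str = "10.0.12.1") -> str:
    """Build an iPXE script for one node (hypothetical stand-in for netboot.php)."""
    base = f"http://{server}/sw"
    kernel_args = " ".join([
        "root=/dev/ram0", "load_ramdisk=1", "initrd=initrd", "showopts",
        "ramdisk_size=65535", f"install={base}/iso", "textmode=1",
        f"autoyast={base}/baremetal.php?MAC={mac}",
    ])
    lines = [
        "#!ipxe",
        # ${...} is written literally so that iPXE expands it at run time.
        "echo hostname: ${hostname}, next-server: ${next-server}",
        f"kernel {base}/linux {kernel_args}",
        f"initrd {base}/initrd",
        "boot ||",
        "shell",
    ]
    return "\n".join(lines) + "\n"


print(render_ipxe_script("00:50:56:9a:00:1d"))
```

Only the MAC address in the autoyast URL is node-specific here; a real generator would presumably look up the remaining parameters per node.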

Statically dynamic

The configurations for machine2 and machine3 differ only slightly in that the former lets iPXE obtain network parameters via DHCP, and the latter has them embedded in the script. I can test with a VirtualBox client which boots from an ISO image created with one of the iPXE make targets. I created a script called jpstatic.ipxe and then built the ISO image I attached to VirtualBox with

cd ipxe/src
make bin/ipxe.iso EMBED=../../jpstatic.ipxe
cp bin/ipxe.iso /tmp/ipxe.iso

The file jpstatic.ipxe is an iPXE script which defines network addresses for the machine and subsequently chains to the boot file.

#!ipxe
# by JPM
echo +----- STATIC (embedded) -------------------------
ifopen net0
set net0/ip 192.168.1.201
set net0/netmask 255.255.255.0
set net0/gateway 192.168.1.1
set net0/dns 192.168.1.20
set net0/domain mens.de
set filename http://bootr.${domain}/node.ipxe 
chain ${filename} ||
echo Booting ${filename} failed, dropping to shell
shell

When I launch the virtual machine, it boots from the ISO image containing iPXE. iPXE initializes its network stack and proceeds to run the embedded script. Note how the chain command loads a script or image from the specified HTTP server and then boots into that.
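
Mass-creating such static scripts is mostly a matter of templating. The following Python sketch shows one way it might be done; the node inventory, the helper names static_ipxe_script and build_command, and the file layout are all assumptions for illustration. It only prints the make invocations rather than running them.

```python
def static_ipxe_script(ip, netmask, gateway, dns, domain, boot_url):
    """Return an embedded iPXE script with hard-coded network settings."""
    return "\n".join([
        "#!ipxe",
        "ifopen net0",
        f"set net0/ip {ip}",
        f"set net0/netmask {netmask}",
        f"set net0/gateway {gateway}",
        f"set net0/dns {dns}",
        f"set net0/domain {domain}",
        f"set filename {boot_url}",
        # ${filename} is expanded by iPXE, not by Python.
        "chain ${filename} ||",
        "echo Booting ${filename} failed, dropping to shell",
        "shell",
    ]) + "\n"


def build_command(script_path, target="bin/ipxe.iso"):
    """Return the make invocation (as an argument list) that embeds a script."""
    return ["make", target, f"EMBED={script_path}"]


# Hypothetical inventory: node name -> IP address
nodes = {"node1": "192.168.1.201", "node2": "192.168.1.202"}
for name, ip in nodes.items():
    script = static_ipxe_script(ip, "255.255.255.0", "192.168.1.1",
                                "192.168.1.20", "mens.de",
                                "http://bootr.${domain}/node.ipxe")
    print(f"# {name}.ipxe")
    print(script)
    print(" ".join(build_command(f"../../{name}.ipxe")))
```

Since every build writes to the same bin/ipxe.iso, each image would have to be copied away under a node-specific name before the next make runs.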

The node.ipxe script I'm chaining into doesn't do much except print out some of iPXE's variable values, obtained via DHCP or hard-coded into the script, and it then launches the iPXE shell:

#!ipxe
echo mac...............: ${mac}
echo ip................: ${ip}
echo netmask...........: ${netmask}
echo gateway...........: ${gateway}
echo dns...............: ${dns}
echo domain............: ${domain}
echo dhcp-server.......: ${dhcp-server}
echo syslog............: ${syslog}
echo filename..........: ${filename}
echo next-server.......: ${next-server}
echo hostname..........: ${hostname}
echo uuid..............: ${uuid}
echo serial............: ${serial}
echo .
shell

From the iPXE shell, I can chain into whatever I want to, say, the demo image. I enter the chain command with the URL; the kernel and initrd are then loaded from the iPXE HTTP server and booted:

iPXE> chain http://boot.ipxe.org/demo/boot.php

PXELINUX over HTTP

To be as flexible as possible with regard to booting different types of images, allowing boot menus, etc., I'm adding a level of indirection. PXELINUX versions >= 3.70 can boot over HTTP. (I tried with the latest version (4.04) but that failed, so I fell back to using version 3.86.) I installed nasm and built the code from a SYSLINUX distribution:

make
cp core/pxelinux.0  $httproot/pxelinux.0

Take note that I'm copying pxelinux.0 to the HTTP document root, and not the TFTP root. I then changed my netboot.php to return the following iPXE script:

#!ipxe
imgfree
set 210:string http://10.0.12.249/pxe/
set 209:string http://10.0.12.249/pxelinux.php?MAC=${net0/mac}&ip=${ip}
set filename ${210:string}pxelinux.0
chain ${filename} ||
echo Netboot failed
shell

The two DHCP options define the configuration file for PXELINUX (209) and the path prefix, i.e. the URL of the root directory on the HTTP server (210), respectively. Without option 209, when PXELINUX is loaded it will attempt to retrieve its configuration (via HTTP) from the following URLs:

GET /pxe/pxelinux.0 HTTP/1.1" 200 26582 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/421a7b8d-c336-ce6f-8dcc-5178ff8b8c7e HTTP/1.1" 404 328 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/01-00-50-56-9a-00-1d HTTP/1.1" 404 312 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000CFB HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000CF HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000C HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A00 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A0 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/default HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
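
The search order visible in this log can be reproduced with a short Python sketch: the client UUID first, then the hardware type and MAC address, then the IP address in hexadecimal with one digit dropped per attempt, and finally default. The function name pxelinux_config_candidates is my own invention; the UUID, MAC, and IP address are those of machine1.

```python
def pxelinux_config_candidates(uuid, mac, ip, prefix="pxelinux.cfg/"):
    """List the configuration file names PXELINUX tries, in order."""
    candidates = [uuid.lower()]
    # ARP hardware type 01 (Ethernet), then the MAC address with dashes.
    candidates.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as eight uppercase hex digits, shortened one digit at a time.
    hex_ip = "".join(f"{int(octet):02X}" for octet in ip.split("."))
    candidates.extend(hex_ip[:length] for length in range(len(hex_ip), 0, -1))
    candidates.append("default")
    return [prefix + name for name in candidates]


for path in pxelinux_config_candidates(
        "421a7b8d-c336-ce6f-8dcc-5178ff8b8c7e",
        "00:50:56:9a:00:1d",
        "10.0.12.251"):
    print(path)
```

Each of these names is resolved relative to the path prefix from option 210, which is why the requests in the log all start with /pxe/.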

Instead of using static files I create PXELINUX configuration on the fly. For example, if pxelinux.php outputs

PROMPT 1
DISPLAY bootmsg.txt
LABEL centos
  KERNEL centos/vmlinuz
  APPEND initrd=centos/initrd.img

the node would boot CentOS, whereas if it, instead, output

DEFAULT chain.c32 hd0 0

then the machine boots from the first hard disk. It is important to realize that all paths I've used (e.g. bootmsg.txt, centos/vmlinuz, chain.c32 (also from SYSLINUX)) are relative to the HTTP root we specified as option 210 above. (Keep an eye on your HTTP access log when experimenting with this.)
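
As a rough illustration of what a generator like pxelinux.php might decide, here is a Python sketch that picks between the two configurations shown above. The action names, the NODE_ACTIONS table (standing in for a database lookup), and the function names are hypothetical.

```python
def render_pxelinux_config(action: str) -> str:
    """Return a PXELINUX configuration for the given (hypothetical) action."""
    if action == "install":
        lines = [
            "PROMPT 1",
            "DISPLAY bootmsg.txt",
            "LABEL centos",
            "  KERNEL centos/vmlinuz",
            "  APPEND initrd=centos/initrd.img",
        ]
    elif action == "localboot":
        lines = ["DEFAULT chain.c32 hd0 0"]
    else:
        raise ValueError(f"unknown action: {action!r}")
    return "\n".join(lines) + "\n"


# Hypothetical stand-in for a database lookup keyed on the MAC address.
NODE_ACTIONS = {"00:50:56:9a:00:1d": "install"}


def config_for(mac: str) -> str:
    # Unknown nodes fall back to booting from their first hard disk.
    return render_pxelinux_config(NODE_ACTIONS.get(mac.lower(), "localboot"))


print(config_for("00:50:56:9a:00:1d"))
```

Defaulting unknown nodes to a local boot is a design choice of this sketch; it keeps a machine that PXE boots by accident from being reinstalled.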

dnsmasq as a DHCP server

If you use dnsmasq as your DHCP server, you can also do this. Here's a snippet from my dnsmasq.conf:

# Enable dnsmasq's built-in TFTP server to serve undionly.kpxe
enable-tftp
# Set the root directory for files available via TFTP.
# this is where I place undionly.kpxe
tftp-root=/c/tftpd

dhcp-range=192.168.1.180,192.168.1.220,255.255.255.0,24h
dhcp-authoritative
log-dhcp
log-queries

# iPXE sends option 175; make a rule named IPXEBOOT to match requests
dhcp-match=IPXEBOOT,175

# if the request does not (#) match the IPXEBOOT rule tell the client
# (most likely standard PXE client) to boot iPXE

dhcp-boot=net:#IPXEBOOT,undionly.kpxe

# Everything else (i.e. iPXE itself) is told to boot from this URL
dhcp-boot=http://192.168.1.10/ipxe/boot.php

# Here I define clients and their names with optional [,a.b.c.d] address
dhcp-host=00:0c:29:9c:60:d3,virt1

Summary

To summarize, I need a DHCP server and a TFTP server close by the machines (nodes) I'll be booting this way, unless I go the extra mile and create custom undionly.kpxe images that can be booted from local media. When nodes boot they go through the following chain of events:

  1. Machine boots.
    1. If configured to use local boot media, loads iPXE from that.
    2. Otherwise:
      1. Hardware does a PXE boot and sends out a DHCP request.
      2. DHCP server returns reply and boot filename undionly.kpxe.
      3. Node requests file from TFTP server.
  2. undionly.kpxe (iPXE) loads and optionally issues another DHCP request, and then
  3. chains (boots) into the script returned by netboot.php.
  4. Node loads pxelinux.0 via HTTP.
  5. pxelinux.0 loads configuration file specified in option 209. (pxelinux.php)
  6. pxelinux.0 loads further kernel via HTTP depending on configuration.

This sounds quite convoluted, and it is rather, but we gain a lot of functionality:

  • Nodes can boot over WAN links (e.g. the Internet).
  • If necessary, we can use caching HTTP proxies to reduce the volume of data transferred from the deployment server to groups of nodes.
  • We can apply granular access-controls to the HTTP server, something very difficult (or impossible?) to do with TFTP.
  • We are highly flexible in how we create configuration for clients; we can use database queries to provision boot scripts to individual nodes or groups of nodes.
  • Client nodes can be set to always PXE boot, and we can remote-control what they do when they're power-cycled: install, boot from disk, show menu, etc.
Network, Boot, HTTP, PXE, Pxelinux, and iPXE :: 18 Jul 2011 :: e-mail
