The process of booting a computer system over the network is well understood, and it's been around for donkey's ages. Basically, the way it works is that a computer system requests an IP address from a BOOTP/DHCP server, obtains the name of a bootstrap program (e.g. PXELINUX) it should load from a TFTP server, and subsequently uses that to boot the machine. This is used extensively when installing operating systems onto a number of machines. I've been wanting to avoid using TFTP because:
- The first T in TFTP stands for trivial; TFTP is unreliable and error-prone and won't work over wide area networks. Ideally, PXE systems would implement alternative protocols but most don't.
- TFTP is an all-or-nothing proposition: there's no access control to the content of the server's directory. (There is at least one server that includes libwrap capabilities.)
- Configuration files for PXELINUX (i.e. the things that live in its pxelinux.cfg directory) cannot be created on demand. I can pre-create a file and save it in the required directory for TFTP to send out, but files must exist by the time PXELINUX asks for them.
Earlier this year I mentioned I was setting up lots of bare metal, and I mentioned iPXE (formerly gPXE, formerly Etherboot). iPXE is a network boot loader which provides a full PXE implementation with some exciting features: it can boot via HTTP (and from an iSCSI SAN), and I can control the boot process with a script. Ideally, the network cards (NICs) we use would have iPXE burnt in (which can be done), but in this project we haven't yet evaluated what that would mean in terms of hardware.
In the following discussion I assume you've downloaded a copy of the iPXE source
code, unpacked it, and run make in the src directory.
This first make takes a bit of time; it creates all of iPXE's target formats. Later on
I'll show you how to embed a script, and the make for that takes a second or two.
Three scenarios
iPXE can be used in a variety of ways, but I'll concentrate on three scenarios in the following diagram:

The three machines boot as follows:
- machine1 sends out a PXE request which is answered by a nearby DHCP server. It then loads iPXE as undionly.kpxe from the TFTP server, and the rest happens over HTTP. undionly.kpxe is created with make bin/undionly.kpxe; I drop that file into my TFTP root directory and have my DHCP server give it to my clients as boot file, ensuring I break the infinite loop that would otherwise result. (My dhcpd.conf is below.)
- machine2 boots with a customized iPXE script, either from a modified network ROM or via, say, a CD-ROM. It obtains its network address via DHCP and can then directly "speak" to an HTTP server. To create a customized boot loader with an embedded script (e.g. jpmens.ipxe), I invoke make bin/undionly.kpxe EMBED=jpmens.ipxe and store the resulting file on a bootable floppy or burn it onto a CD-ROM, etc. The embedded script uses iPXE commands to obtain DHCP parameters when it starts, or I can hard-code IP address, netmask, etc., and I can use iPXE settings in the script, as we'll see for machine3.
- In the case of machine3, I've created a custom iPXE image with which the machine boots. The script contains hard-coded network addresses, and it should be straightforward to mass-create custom images with a bit of sh and make. This is interesting if there is no DHCP server (or relay) close to the node, network-wise.
DHCP, TFTP, and HTTP
machine1 uses DHCP and a TFTP server to load iPXE's undionly.kpxe, after
which the latter takes over. The DHCP server configuration I'm using is:
option space ipxe;
option ipxe-encap-opts code 175 = encapsulate ipxe;
option ipxe.priority code 1 = signed integer 8;
option ipxe.keep-san code 8 = unsigned integer 8;
option ipxe.skip-san-boot code 9 = unsigned integer 8;
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.bus-id code 177 = string;
option ipxe.bios-drive code 189 = unsigned integer 8;
option ipxe.username code 190 = string;
option ipxe.password code 191 = string;
option ipxe.reverse-username code 192 = string;
option ipxe.reverse-password code 193 = string;
option ipxe.version code 235 = string;
option iscsi-initiator-iqn code 203 = string;
option ipxe.pxeext code 16 = unsigned integer 8;
option ipxe.iscsi code 17 = unsigned integer 8;
option ipxe.aoe code 18 = unsigned integer 8;
option ipxe.http code 19 = unsigned integer 8;
option ipxe.https code 20 = unsigned integer 8;
option ipxe.tftp code 21 = unsigned integer 8;
option ipxe.ftp code 22 = unsigned integer 8;
option ipxe.dns code 23 = unsigned integer 8;
option ipxe.bzimage code 24 = unsigned integer 8;
option ipxe.multiboot code 25 = unsigned integer 8;
option ipxe.slam code 26 = unsigned integer 8;
option ipxe.srp code 27 = unsigned integer 8;
option ipxe.nbi code 32 = unsigned integer 8;
option ipxe.pxe code 33 = unsigned integer 8;
option ipxe.elf code 34 = unsigned integer 8;
option ipxe.comboot code 35 = unsigned integer 8;
option ipxe.efi code 36 = unsigned integer 8;
option ipxe.fcoe code 37 = unsigned integer 8;
option ipxe.no-pxedhcp 1;
authoritative;
ddns-update-style interim;
ignore client-updates;
allow booting;
allow bootp;
set vendorclass = option vendor-class-identifier;
subnet 10.0.12.0 netmask 255.255.254.0 {
option routers 10.0.12.4;
option subnet-mask 255.255.254.0;
option domain-name "jpmens.net";
option domain-name-servers 10.1.1.1;
set clIP = binary-to-ascii(10, 8, ".", leased-address);
set clHW = concat (
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 1, 1))),2), ":",
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 2, 1))),2), ":",
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 3, 1))),2), ":",
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 4, 1))),2), ":",
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 5, 1))),2), ":",
suffix (concat ("0", binary-to-ascii (16, 8, "", substring(hardware, 6, 1))),2));
default-lease-time 21600;
max-lease-time 43200;
next-server 10.0.12.249;
if exists user-class and option user-class = "iPXE" {
set uri = concat("http://${next-server}/netboot.php?MAC=", clHW);
filename = uri;
} else {
filename = "undionly.kpxe";
}
}
host machine1 {
hardware ethernet 00:50:56:9a:00:1d;
fixed-address 10.0.12.251;
}
When the machine (node) boots, it fires off its first PXE request. Our DHCP server
receives the request and gives it an IP address, netmask, etc., as well as the
boot filename undionly.kpxe. The node then retrieves undionly.kpxe via TFTP,
loads it and executes it. iPXE (undionly.kpxe) then issues a DHCP request of its own.
Without the if exists user-class magic we'd enter an endless loop in which
iPXE would load itself, then load itself, ad nauseam. The if ensures that when iPXE
issues a DHCP request, it is given as filename the URL of netboot.php, which resides
on an HTTP server. From this point onwards, everything happens over HTTP!
The file name iPXE chains into is an HTTP URL which, in my case, creates an
on-the-fly configuration script for iPXE. (The strange-looking concat
business in dhcpd.conf is to ensure the hardware address is correctly formatted.)
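To make the two pieces of dhcpd.conf logic a little more tangible, here is a rough Python sketch of what they compute. It is not something dhcpd runs; the function names format_mac and boot_filename are made up for illustration. I also assume the six address bytes are passed in without the leading hardware-type byte (which is why the dhcpd expression starts at substring index 1), and I fill in the server address directly, whereas the real configuration leaves ${next-server} for iPXE to expand.

```python
from typing import Optional


def format_mac(hardware: bytes) -> str:
    """Format raw hardware-address bytes as zero-padded hex pairs."""
    # Same effect as the concat/suffix/binary-to-ascii chain in dhcpd.conf:
    # every byte becomes exactly two hex digits, joined with colons.
    return ":".join(f"{byte:02x}" for byte in hardware)


def boot_filename(user_class: Optional[str], next_server: str, mac: str) -> str:
    """Pick the boot file name the way the if/else block in dhcpd.conf does."""
    if user_class == "iPXE":
        # Second DHCP request, sent by iPXE itself: hand out the HTTP URL.
        return f"http://{next_server}/netboot.php?MAC={mac}"
    # First DHCP request, sent by the NIC's PXE ROM: hand out iPXE via TFTP.
    return "undionly.kpxe"


mac = format_mac(bytes.fromhex("0050569a001d"))
print(boot_filename(None, "10.0.12.249", mac))
print(boot_filename("iPXE", "10.0.12.249", mac))
```

The first call stands for the request from the PXE ROM, the second for the request iPXE issues after it has been loaded; only the second one is pointed at the HTTP server, which is what breaks the loop.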
To make things easier, I'll
omit showing the code the iPXE script is generated from (basically a database
access and some Mustache); instead, here is its output:
#!ipxe
echo +----- NETBOOT ----------------------------------------------
echo |hostname: ${hostname}, next-server: ${next-server}
echo |mac.....: ${net0/mac} /
echo +------------------------------------------------------------
echo .
kernel http://10.0.12.1/sw/linux root=/dev/ram0 load_ramdisk=1 initrd=initrd showopts ramdisk_size=65535 install=http://10.0.12.1/sw/iso textmode=1 autoyast=http://10.0.12.1/sw/baremetal.php?MAC=00:50:56:9a:00:1d
initrd http://10.0.12.1/sw/initrd
boot ||
shell
The echo prints information to the screen, using some of iPXE's
settings. Apart from that, a kernel is loaded together with
an initrd image, and we attempt to boot that. If that fails, we
fall back into iPXE's shell.
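For readers who want a feel for how such a script can be produced on the fly, here is a minimal sketch in Python. It is not my netboot.php: the NODES dictionary is a hypothetical stand-in for the database lookup, render_netboot is a made-up name, and plain string formatting replaces Mustache. The echo lines are trimmed to two for brevity.

```python
# Hypothetical stand-in for the database: MAC address -> install parameters.
NODES = {
    "00:50:56:9a:00:1d": {"server": "10.0.12.1", "profile": "baremetal.php"},
}


def render_netboot(mac: str) -> str:
    """Return an iPXE script for the node with this MAC address."""
    mac = mac.lower()
    node = NODES.get(mac)
    if node is None:
        # Unknown node: say so and drop into the iPXE shell.
        return "#!ipxe\necho Unknown node " + mac + "\nshell\n"
    base = "http://" + node["server"] + "/sw"
    lines = [
        "#!ipxe",
        # ${hostname} and friends stay literal here; iPXE expands them.
        "echo |hostname: ${hostname}, next-server: ${next-server}",
        "echo |mac.....: ${net0/mac}",
        f"kernel {base}/linux root=/dev/ram0 load_ramdisk=1 initrd=initrd "
        f"showopts ramdisk_size=65535 install={base}/iso textmode=1 "
        f"autoyast={base}/{node['profile']}?MAC={mac}",
        f"initrd {base}/initrd",
        "boot ||",
        "shell",
    ]
    return "\n".join(lines) + "\n"


print(render_netboot("00:50:56:9a:00:1d"))
```

The point of the exercise is that only the kernel line depends on the node; everything else is boilerplate, and nodes the server doesn't know simply end up in the shell.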
Statically dynamic
The configurations for machine2 and machine3 differ only slightly:
the former lets iPXE obtain network parameters via DHCP, and the latter has them
embedded in the script. I can test with a VirtualBox client which boots
from an ISO image created with one of the iPXE make targets. I created
a script called jpstatic.ipxe and then built the ISO
image I attached to VirtualBox with:
cd ipxe/src
make bin/ipxe.iso EMBED=../../jpstatic.ipxe
cp bin/ipxe.iso /tmp/ipxe.iso
The file jpstatic.ipxe is an iPXE script which defines network addresses
for the machine and subsequently chains to the boot file.
#!ipxe
# by JPM
echo +----- STATIC (embedded) -------------------------
ifopen net0
set net0/ip 192.168.1.201
set net0/netmask 255.255.255.0
set net0/gateway 192.168.1.1
set net0/dns 192.168.1.20
set net0/domain mens.de
set filename http://bootr.${domain}/node.ipxe
chain ${filename} ||
echo Booting ${filename} failed, dropping to shell
shell
When I launch the virtual machine, it boots from the ISO image containing iPXE.
iPXE initializes its network stack and proceeds to run the embedded script. Note
how the chain command loads a script or image from the specified HTTP server
and then boots into that.
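Mass-producing scripts like jpstatic.ipxe for a batch of machines is mostly a templating job. The following Python sketch shows one way it might be done; the names static_script and build_command are hypothetical, and validating the addresses with the ipaddress module is my own addition to catch typos before they get baked into an image.

```python
import ipaddress

# Doubled braces survive str.format() as the single braces iPXE expects.
STATIC_TEMPLATE = """#!ipxe
ifopen net0
set net0/ip {ip}
set net0/netmask {netmask}
set net0/gateway {gateway}
set net0/dns {dns}
set net0/domain {domain}
set filename http://bootr.${{domain}}/node.ipxe
chain ${{filename}} ||
echo Booting ${{filename}} failed, dropping to shell
shell
"""


def static_script(ip, netmask, gateway, dns, domain):
    """Render an iPXE script with hard-coded network settings."""
    for value in (ip, netmask, gateway, dns):
        # Raises ValueError on anything that is not a dotted-quad address.
        ipaddress.IPv4Address(value)
    return STATIC_TEMPLATE.format(
        ip=ip, netmask=netmask, gateway=gateway, dns=dns, domain=domain
    )


def build_command(script_path):
    """Return the make invocation that embeds the script into an ISO."""
    return ["make", "bin/ipxe.iso", f"EMBED={script_path}"]


print(static_script("192.168.1.201", "255.255.255.0",
                    "192.168.1.1", "192.168.1.20", "mens.de"))
print(" ".join(build_command("../../jpstatic.ipxe")))
```

Each rendered script would be written to its own file and handed to make in ipxe/src, copying bin/ipxe.iso away after every run since the target name stays the same.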

The node.ipxe script I'm chaining into doesn't do much except print out the
values of some of iPXE's settings, obtained via DHCP or hard-coded into the script,
and it then launches the iPXE shell:
#!ipxe
echo mac...............: ${mac}
echo ip................: ${ip}
echo netmask...........: ${netmask}
echo gateway...........: ${gateway}
echo dns...............: ${dns}
echo domain............: ${domain}
echo dhcp-server.......: ${dhcp-server}
echo syslog............: ${syslog}
echo filename..........: ${filename}
echo next-server.......: ${next-server}
echo hostname..........: ${hostname}
echo uuid..............: ${uuid}
echo serial............: ${serial}
echo .
shell
From the iPXE shell, I can chain into whatever I want to, say, the demo image. I enter the chain command with the URL; the kernel and initrd are then loaded from the iPXE project's HTTP server, and the image is booted:
iPXE> chain http://boot.ipxe.org/demo/boot.php

PXELINUX over HTTP
To be as flexible as possible with regard to booting different types of images, allowing boot menus, etc., I'm adding a level of indirection. PXELINUX versions >= 3.70 can boot over HTTP. (I tried with the latest version (4.04) but that failed, so I fell back to using version 3.86.) I installed nasm and built the code from a SYSLINUX distribution:
make
cp core/pxelinux.0 $httproot/pxelinux.0
Take note that I'm copying pxelinux.0 to the HTTP document root, and not the
TFTP root. I then changed my netboot.php to return the following iPXE
script:
#!ipxe
imgfree
set 210:string http://10.0.12.249/pxe/
set 209:string http://10.0.12.249/pxelinux.php?MAC=${net0/mac}&ip=${ip}
set filename ${210:string}pxelinux.0
chain ${filename} ||
echo Netboot failed
shell
The two DHCP options define the HTTP URL of the directory PXELINUX loads its files from (210) and the URL of the configuration file for PXELINUX (209) respectively. Without option 209, when PXELINUX is loaded it will attempt to retrieve its configuration (via HTTP) from the following URLs:
GET /pxe/pxelinux.0 HTTP/1.1" 200 26582 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/421a7b8d-c336-ce6f-8dcc-5178ff8b8c7e HTTP/1.1" 404 328 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/01-00-50-56-9a-00-1d HTTP/1.1" 404 312 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000CFB HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000CF HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000C HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A000 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A00 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A0 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0A HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/0 HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
GET /pxe/pxelinux.cfg/default HTTP/1.1" 404 300 "-" "iPXE/1.0.0+"
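The sequence in that log follows a fixed pattern: the client UUID, then the hardware address prefixed with 01 (the ARP type for Ethernet), then the IP address in upper-case hexadecimal with one digit dropped per attempt, and finally default. A small Python sketch reproduces it; the function name is made up, and the paths are relative to the /pxe/ prefix set in option 210.

```python
import ipaddress


def pxelinux_config_candidates(uuid: str, mac: str, ip: str) -> list:
    """List the configuration file names in the order PXELINUX tries them."""
    names = [uuid.lower()]
    # Hardware type 01 (Ethernet) followed by the MAC with dashes.
    names.append("01-" + mac.lower().replace(":", "-"))
    # IPv4 address as eight upper-case hex digits, e.g. 10.0.12.251 -> 0A000CFB,
    # then the same string shortened by one digit at a time.
    hex_ip = f"{int(ipaddress.IPv4Address(ip)):08X}"
    names.extend(hex_ip[:length] for length in range(8, 0, -1))
    names.append("default")
    return ["pxelinux.cfg/" + name for name in names]


for path in pxelinux_config_candidates(
    "421a7b8d-c336-ce6f-8dcc-5178ff8b8c7e", "00:50:56:9a:00:1d", "10.0.12.251"
):
    print(path)
```

That makes eleven requests per boot, all of which end in a 404 here, which is one more reason to point option 209 at a single URL.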
Instead of using static files I create PXELINUX configuration on the fly. For example, if pxelinux.php outputs
PROMPT 1
DISPLAY bootmsg.txt
LABEL centos
KERNEL centos/vmlinuz
APPEND initrd=centos/initrd.img
the node would boot CentOS, whereas if it instead output
DEFAULT chain.c32 hd0 0
then the machine boots from the first hard disk. It is important to realize
that all paths I've used (e.g. bootmsg.txt, centos/vmlinuz, chain.c32
(also from SYSLINUX)) are relative to the HTTP root we specified as option 210
above. (Keep an eye on your HTTP access log when experimenting with this.)
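What pxelinux.php does can be reduced to a lookup followed by a choice between the two configurations shown. Here is a hedged Python sketch of that idea; it is not the PHP I actually run, the ACTIONS table is a hypothetical replacement for the database, and sending unknown nodes to the local disk is an assumption I've made for the example.

```python
# Hypothetical per-node state: MAC address -> what to do on next boot.
ACTIONS = {
    "00:50:56:9a:00:1d": "install",
}

INSTALL_CONFIG = "\n".join([
    "PROMPT 1",
    "DISPLAY bootmsg.txt",
    "LABEL centos",
    "KERNEL centos/vmlinuz",
    "APPEND initrd=centos/initrd.img",
]) + "\n"

# chain.c32 hands control to the first hard disk.
LOCALBOOT_CONFIG = "DEFAULT chain.c32 hd0 0\n"


def pxelinux_config(mac: str) -> str:
    """Return the PXELINUX configuration for the node with this MAC."""
    # Assumption: nodes without an entry just boot from their disk.
    action = ACTIONS.get(mac.lower(), "localboot")
    if action == "install":
        return INSTALL_CONFIG
    return LOCALBOOT_CONFIG


print(pxelinux_config("00:50:56:9a:00:1d"))
print(pxelinux_config("00:50:56:9a:00:ff"))
```

Flipping a node from install back to local boot is then a matter of changing one value on the server; nothing needs to be touched in a pxelinux.cfg directory.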
To summarize, I need a DHCP server and a TFTP server close by the machines
(nodes) I'll be booting this way, unless I go the extra mile and create custom
undionly.kpxe images that can be booted from local media. When nodes boot they go
through the following chain of events:
- Machine boots.
- If configured to use local boot media, loads iPXE from that.
- Otherwise:
  - Hardware does a PXE boot and sends out a DHCP request.
  - DHCP server returns reply and boot filename undionly.kpxe.
  - Node requests file from TFTP server.
- undionly.kpxe (iPXE) loads and optionally issues another DHCP request, and then chains (boots) into the script returned by netboot.php.
- Node loads pxelinux.0 via HTTP.
- pxelinux.0 loads configuration file specified in option 209 (pxelinux.php).
- pxelinux.0 loads further kernel via HTTP depending on configuration.
This sounds quite convoluted, and it rather is, but we gain a lot of functionality:
- Nodes can boot over WAN links (e.g. the Internet).
- If necessary, we can use caching HTTP proxies to reduce the volume of data transferred from the deployment server to groups of nodes.
- We can apply granular access-controls to the HTTP server, something very difficult (or impossible?) to do with TFTP.
- We are highly flexible in how we create configuration for clients; we can use database queries to provision boot scripts to individual nodes or groups of nodes.
- Client nodes can be set to always PXE boot, and we can remote-control what they do when they're power-cycled: install, boot from disk, show menu, etc.