This is highly embarrassing for me to confess, but let me tell you anyway: yesterday afternoon I successfully completed the biggest #fail in my career: I screwed up close on four hundred (400) machines at a customer site. Just like that, and it only took a minute or two to do.
The site makes heavy use of sudo, a program which, when correctly configured, allows a specified set of users to obtain superuser rights on Un*x systems without knowledge of a machine’s root password.
At this installation, whenever somebody requires sudo rights, a site administrator used to log on to the particular server, edit /etc/sudoers, and that was that. The result: a large number of disparate, undocumented, out-of-date /etc/sudoers files strewn all over the show.
A few months ago, I proposed to implement a slight improvement: they should maintain a centrally managed sudoers file and distribute that out to all machines. Said and done, that was implemented using Ansible as the configuration management tool (but that is irrelevant for this story), and it worked like this:
- Distribute a copy of the file into /etc/sudoers
- Distribute a copy of the file into, say, /etc/sudoers.safe
- Set up a cron job which would overwrite /etc/sudoers from /etc/sudoers.safe sometime during the night in order to undo any (temporary) changes made by an administrator during the day.
I set that up (it took but a few minutes), tested it, and it went into “production”, as they say. All fine and dandy; everybody is satisfied.
Enter yesterday.
(By this time, some of you already know where this is going.)
It was time for me to drive home. It’s a longish drive, so I wanted to get a head-start before the bulk of the traffic started. Somebody says: will you please add user “jane” for me. Instead of wasting time asking why he doesn’t do that himself, I add user “jane” to the /etc/sudoers file. The pertinent bit of the file looks like this:
User_Alias ADMINS = john, jane,
ADMINS ALL=(ALL) NOPASSWD: ALL
I have three good reasons why I didn’t test that, but be that as it may, it doesn’t matter. The important bit of the story is: I didn’t test that.
I then instructed Ansible to do what I had originally configured it to do: it should apply the three steps enumerated above. And let me tell you one thing: it does it marvelously.
I’ll spell it out for you:
- Ansible connected to all servers, did its sudo thing and installed the sudoers file in /etc/sudoers.
- Ansible connected to all servers, did its sudo thing (for the second task, i.e. installing /etc/sudoers.safe), and it . . . failed.
Without flinching, I immediately knew what had happened: a syntax error in the sudoers template caused sudo to fail. Manually logging into a target server, I could verify that:
$ sudo id
>>> /etc/sudoers: syntax error near line 90 <<<
sudo: parse error in /etc/sudoers near line 90
sudo: no valid sudoers sources found, quitting
The “syntax error” in question is, of course, the trailing comma behind “jane”.
I won’t recount here my very first thoughts about the person who decided to let sudo fail on a syntax error, but fair is fair: he (or she) was right to let sudo fail, at least from a security perspective.
To cut this long story short, we fixed it by obtaining root passwords for the machines, using su, etc.; the operation took four hours to complete.
Now for a bit of irony.
The original intention of sudoers.safe was to overwrite changes made to sudoers. For the record, this is what the Ansible playbook looked like:
---
- hosts: all
  sudo: True
  tasks:
  - action: template src=sudoers.j2 dest=/etc/sudoers.safe
  - action: template src=sudoers.j2 dest=/etc/sudoers
  - action: cron name="sudoerscopy"
            hour="21"
            minute="01"
            backup=yes
            state=present
            user=root
            job="install -m 440 /etc/sudoers.safe /etc/sudoers"
If (I say if) I had reversed the distribution of the files in the first two actions (i.e. swapped numbers 1 and 2 in the above enumeration), I could have gone home: the system would have repaired itself at the specified time!
For the record: none of the mentioned software components are at fault. This is definitely a case of (bilingual pun follows) “MENSliches Versagen”. ;)
Update: Florian and Miek both point out visudo, which I knew about, but Miek nails it with an extra option:
“a visudo -c -f _file_ might be a good addition to the ansible manifest”
Done. And there was cake.
Update
And henceforth there was a validate switch implemented on template et al.
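For the curious, here is a minimal sketch of how that validate switch could be used for the sudoers template. It is not the playbook from this site: it assumes a current Ansible (hence become instead of sudo, and the ansible.builtin.template module name) and assumes visudo lives at /usr/sbin/visudo on the target machines. Ansible substitutes %s with the path of the rendered temporary file and only moves the file into place if the command exits successfully.

```yaml
---
# Sketch only: module name, become, and the visudo path are assumptions,
# not taken from the original playbook.
- hosts: all
  become: true
  tasks:
    - name: Install /etc/sudoers only if visudo accepts the rendered template
      ansible.builtin.template:
        src: sudoers.j2
        dest: /etc/sudoers
        mode: "0440"
        # %s is replaced with the temporary file; a syntax error
        # (such as a trailing comma) makes the task fail before
        # /etc/sudoers is touched.
        validate: /usr/sbin/visudo -cf %s
```

With a check like this, the trailing comma behind “jane” would have failed the task on every host and left the existing /etc/sudoers alone.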