This is highly embarrassing for me to confess, but let me tell you anyway: yesterday afternoon I successfully completed the biggest #fail in my career: I screwed up close on four hundred (400) machines at a customer site. Just like that, and it only took a minute or two to do.
The site makes heavy use of sudo, a program which, when correctly configured, allows a specified set of users to obtain superuser rights on Un*x systems without knowledge of a machine’s root password.
At this installation, whenever somebody required sudo rights, a site administrator would log on to the particular server, edit /etc/sudoers, and that was that. The result: a large number of disparate, undocumented, out-of-date /etc/sudoers files strewn all over the show.
A few months ago, I proposed a slight improvement: they should maintain a centrally managed sudoers file and distribute it to all machines. Said and done, that was implemented using Ansible as the configuration management tool (but that is irrelevant for this story), and it worked like this:
- Distribute a copy of the file into
- Distribute a copy of the file into, say,
- Set up a cron job which would overwrite
/etc/sudoers.safe sometime during the night in order to overwrite any (temporary) changes made by an administrator during the day.
I set that up (it took but a few minutes), tested it, and it went into “production”, as they say. All fine and dandy; everybody is satisfied.
(By this time, some of you already know where this is going.)
It was time for me to drive home. It’s a longish drive, so I wanted to get a head start before the bulk of the traffic started. Somebody says: will you please add user “jane” for me. Instead of wasting time asking why he doesn’t do that himself, I add user “jane” to the /etc/sudoers. The pertinent bit of the file looks like this:
I have three good reasons why I didn’t test that, but be that as it may, it doesn’t matter. The important bit of the story is: I didn’t test that.
I then instructed Ansible to do what I had originally configured it to do: it should apply the three steps enumerated above. And let me tell you one thing: it does it marvelously.
I’ll spell it out for you:
- Ansible connected to all servers, did its sudo thing and installed the
- Ansible connected to all servers, did its sudo thing (for the second task, i.e. installing
/etc/sudoers.safe), and it . . . failed.
Without flinching, I immediately knew what had happened: a syntax error in the sudoers template caused sudo to fail. Manually logging into a target server, I could verify that:
The “syntax error” in question is, of course, the trailing comma behind “jane”.
I won’t recount here what my very first thoughts were about the person who decided to let sudo fail on a syntax error, but fair is fair: he (or she) was right to let sudo fail – at least from a security perspective.
To cut this long story short, we fixed it by obtaining root passwords for the machines, using su, etc.; the operation took four hours to complete.
Now for a bit of irony.
The original intention of sudoers.safe was to overwrite changes made to sudoers. For the record, this is what the Ansible playbook looked like:
If (I say if) I had reversed the distribution of the files in the first two actions (i.e. swapped numbers 1 and 2 in the above enumeration), I could have gone home: the system would have repaired itself at the specified time!
For the record: none of the mentioned software components are at fault. This is definitely a case of (bilingual pun follows) “MENSliches Versagen”. ;)
Update: Florian and Miek both point out visudo, which I knew, but Miek nails it with an extra option:
visudo -c -f _file_ might be a good addition to the ansible manifest
Done. And there was cake.
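For anybody who wants the same safety net outside of Ansible, here is a minimal sketch of how the visudo check could gate an install, written as a shell function. The function name install_sudoers and the VISUDO override variable are illustrative assumptions of this sketch, not part of the setup described above; only visudo -c -f comes from Miek's suggestion.

```shell
#!/bin/sh
# Hypothetical helper: install a candidate sudoers file only if it parses.
# VISUDO can be overridden (e.g. for testing); it defaults to the real visudo.
VISUDO="${VISUDO:-visudo}"

install_sudoers() {
    candidate="$1"   # file to validate
    target="$2"      # where it should end up, e.g. /etc/sudoers

    # -c: check syntax only, change nothing; -f: check this file, not the default.
    if ! "$VISUDO" -c -f "$candidate" >/dev/null 2>&1; then
        echo "refusing to install $candidate: syntax check failed" >&2
        return 1
    fi

    # Copy next to the target, then rename, so the target is never half-written.
    tmp="$target.new.$$"
    cp "$candidate" "$tmp" && chmod 0440 "$tmp" && mv "$tmp" "$target"
}
```

Called as install_sudoers sudoers.new /etc/sudoers (as root), a file with a stray trailing comma should be rejected before it ever touches the live copy. Ansible's copy and template modules offer a validate option that serves the same purpose, for example validate: 'visudo -cf %s'.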