Sometime around 1997 I read a very amusing story about a “magic” switch (it is hilarious and well worth reading) attached to an old PDP system. I now know that the story is probably true, because some time ago I was confronted with a similar, if software-only, situation.
At the time, our Exim servers on the Internet gateways processed in excess of thirty thousand messages per day, without a single hitch. At the beginning of January 2005, I was informed that a very small number of users were repeatedly getting the same message delivered to their Lotus Notes inboxes, all apparently from the same sender.
After a little monitoring, I determined that the problem was real and seemed to be limited to messages from a single domain. Watching the situation for some time longer showed that only certain messages from that domain were affected…
The message was being delivered correctly to our outside gateway servers, which indicated that the sending equipment was not at fault. Debugging Exim, both on the Internet gateway and on the inside (where two separate MXen are located), showed that such messages had attempted delivery at the first MX, apparently successfully, but the MX disconnected suddenly, leaving the external gateway no option but to attempt delivery at the second MX, where exactly the same thing happened. Both MXen actually received the message correctly and then delivered it to its final destination.
Unfortunately, the outside Exim never saw the SMTP QUIT command issued by the internal MXen. TCP dumps performed on the firewalls didn’t reveal any strange occurrences, although the missing QUIT was confirmed. Why were the internal Exim servers not sending out a QUIT? And why only for this domain?
I asked for help on the Exim mailing list. Knowledgeable people there strongly suggested it must be the firewall. But why only for these messages? Even after I temporarily disabled all unnecessary features on the internal Exims (content checking (exiscan), LDAP, etc.), the problem still manifested itself.
Now, remember: not all messages; just some of them. Independent of size, recipient, etc. I was going bonkers.
After four days of deliberation, I gave up.
I solved the problem, but unfortunately only via a work-around. For old times’ sake, an ASCII drawing will depict the process:
The external gateway now receives messages from the offending domain and batches them (RFC 2442; The Batch SMTP Media Type). Every fifteen minutes, a cron job submits the individually batched messages via HTTP to the internal MX and the receiving CGI program simply submits them to the internal Exim.
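The work-around can be sketched roughly as follows. This is not the code that runs on the gateways, just a minimal Python illustration of the idea: wrap a message in a batched-SMTP envelope, POST it to the inside, and hand it to Exim there. The function names, the URL passed to post_batch, and the Exim binary path are all hypothetical; the only things taken from real software are the application/batch-SMTP media type from RFC 2442 and Exim's -bS option for batched SMTP input.

```python
import subprocess
import urllib.request


def make_bsmtp(helo, sender, recipients, message):
    """Wrap one message in a batched-SMTP envelope (commands only, no replies)."""
    lines = [f"HELO {helo}", f"MAIL FROM:<{sender}>"]
    lines += [f"RCPT TO:<{rcpt}>" for rcpt in recipients]
    lines.append("DATA")
    for line in message.splitlines():
        # Dot-stuffing: a leading "." would otherwise end the DATA section early.
        lines.append("." + line if line.startswith(".") else line)
    lines += [".", "QUIT"]
    return "\r\n".join(lines) + "\r\n"


def post_batch(url, batch):
    """Cron side: POST one batch to the (hypothetical) receiving CGI URL."""
    request = urllib.request.Request(
        url,
        data=batch.encode("utf-8"),
        headers={"Content-Type": "application/batch-smtp"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def submit_to_exim(batch, exim="/usr/sbin/exim"):
    """CGI side: feed the batch to the local Exim as batched SMTP input."""
    subprocess.run([exim, "-bS"], input=batch.encode("utf-8"), check=True)
```

The point of the detour is that the message never crosses the firewall as a live SMTP conversation; it travels as an opaque HTTP body and only becomes SMTP again once it is safely inside.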
It has been working that way for over a year. In spite of upgrades to all of the components involved during that time, these “mellon” messages still fail to be correctly submitted via SMTP.
I’ve decided that I will no longer touch this running system. ;-)