Just Some Idle Anti-Spam Thoughts

Posted by Toby Sun, 09 Dec 2007 00:01:00 GMT

Earlier this year, I came off a five-year stint in the anti-spam industry. As such, I tend to try to keep up on the latest in that world. Tonight, I was talking with an acquaintance of mine and the topic of spam came up. I was telling him about the Storm Worm and the relatively new wave of stock pump-and-dump spams with professional-looking PDF attachments as payload. He’s an options trader at SIG and he had a different take on that kind of spam: he told me he had looked at getting into that market! Not spam, mind you, but rather shorting the stocks that are featured in these pump-and-dump spams. Obviously, this is how the originating spammers themselves make money, but I never figured a legitimate outfit would get involved with stuff like that (they decided not to, he said). Now, I don’t know if this is true, but he also told me that there are entire hedge funds that just watch the mail streams and look for these pump-and-dumps to make short moves on. I found this fascinating for some reason.

This was on my mind as I was driving home and I wandered a bit in my thinking. I started to think about the work I had done and how I really enjoyed the anti-spam challenge. It seems to me that the fundamental challenges in anti-spam are twofold:

  • Your opponents are highly motivated because the profit margins are so high (spam is very cheap to create/send)
  • Commercial filters are bound to increase their effectiveness even in the face of ever-higher volumes of spam every year

I started thinking harder about that last part and wondered, “hmmm… what else has a super low signal:noise ratio?” Then I thought of astronomical observation data. I was talking with someone at Amazon about that very thing and he was telling me that astronomical observation data tends to be huge (~2GB per second, raw) and that it contains almost no signal. I wonder what techniques they are using to sift out the meat there and whether or not any of them are applicable to spam filtration. Has anybody looked into the parallels between these two before?

Finally, I got to thinking about some of the techniques that are in use today and I hit upon an interesting thought I’d never come across before. Distributed reputation services are pretty much the de facto standard these days for all commercial vendors. Every vendor has a different pretentious name for these things but, basically, they constantly update centralized databases of sender reputation in near real-time based on the information about emails flowing into their edge systems. These edge systems can be desktop spam filters (e.g. Cloudmark) or big, honkin’ border MTAs like the SMS 8300 or something in between.

I was thinking, “it’s easy to do that on one MTA”. Just have the MTA keep a database of reputation information and update it incrementally as new mails flow in and the filter renders a verdict. Hell, if you log your filter verdicts, you could just run through that log every hour or so if you were willing to accept staler data.
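
To make the single-MTA idea concrete, here is a minimal sketch of an on-box reputation table that is updated as each verdict comes in. Everything in it is an assumption for illustration: the class name, keying on sender IP, and the smoothed spam ratio used as a score are not taken from any vendor's product.

```python
from collections import defaultdict


class LocalReputation:
    """Per-sender spam/ham tallies kept on one MTA (hypothetical design)."""

    def __init__(self):
        # sender_ip -> [spam_count, ham_count]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, sender_ip, is_spam):
        """Fold one filter verdict into the sender's tally."""
        self.counts[sender_ip][0 if is_spam else 1] += 1

    def score(self, sender_ip):
        """Return a spam ratio in (0, 1); unseen senders get a neutral 0.5."""
        spam, ham = self.counts.get(sender_ip, (0, 0))
        # Add-one smoothing keeps a single verdict from pinning the score.
        return (spam + 1) / (spam + ham + 2)


rep = LocalReputation()
verdicts = [("192.0.2.1", True), ("192.0.2.1", True), ("198.51.100.7", False)]
for ip, is_spam in verdicts:
    rep.record(ip, is_spam)
```

The hourly batch variant would be the same loop run over a logged verdict file instead of live traffic.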

However, then I thought, “hmmm… I wonder what the marginal value of a distributed reputation service is?” Meaning: what’s the difference in value between a purely on-box reputation database versus one that takes feeds (albeit updated more slowly) from a network of border MTAs? Given that all of these services (with the exception of Vipul’s Razor) are commercial, I would have expected to see a showdown in some magazine by now. I guess since they are just pieces of a larger product they are not usually featured on their own (except SenderBase, which IronPort can’t shut up about).

Still, I think that’d be an interesting analysis to really profile the ROI and particulars on a distributed reputation service like that versus purely local reputation information. One could find the theoretical optimal number of nodes in said network and find out the scaling model of a service like that along a bunch of different metrics, etc. You could even throw some agent-based modeling at it and simulate Internet conditions to see more pros/cons of each. Has anyone else ever heard of an analysis like this?

Then, I was thinking, “dood it’s late…”

Junk Fax 2.0

Posted by Toby Sun, 18 Nov 2007 17:39:00 GMT

Having just come off of 5 years in anti-spam, an idea like this one strikes me as hilariously dumb. Here’s how I see this one playing out:

Printing noise from the kitchen during Grandma’s weekly bridge game…

Grandma: Ooh, I must be getting some new pictures from my daughter of the grandkids on their vacation to DisneyWorld!

Four women cozy into the kitchen to see a piece of porn image spam printing out…

Grandma’s friend: DisneyWorld sure has changed since last time I was there…

Economics, people. Economics.

MIT Spam Conference 2006

Posted by Toby Wed, 29 Mar 2006 18:57:00 GMT

UPDATE: The papers and slides are now available for download.

Overview

Just got done with the MIT Spam Conference 2006 and let me tell you, it was much better than last year. This might have been the best one yet. Most of the talks were pretty good and there was a definite blunting of the Bayes daisho they had been wielding in previous years. This year, the conference was held in March, so it wasn’t anywhere near as cold as it was in previous years (when it was held in early January). And the next one will be around the same time, so that’s even better!

CipherTrust sponsored the meetup the night before and it was a fun time. I talked to a bunch of people that night, including Jon Zdziarski and Matt Sergeant. It was held at the Cambridge Brewing Company, which cabbies apparently have a tough time finding, but I made it there alright. We ended up going there again for dinner the second night and they had some good food.

The conference itself was very punctual this year. In past years, there had been problems with that, but Bill ran a pretty tight ship this year. There weren’t a lot of people there at all (it was packed last year) and that’s a shame. They missed out. I had some more pictures, but the RAZR doesn’t take very good pictures in low light, so I’m only putting up the ones that came out.

Presentations

First, Tobias Eggendorfer gave a high-level overview of HTTP and SMTP tarpits. As a former TurnTide employee, I was a little disappointed that he seemed to ignore that approach, but overall it was a fairly good overview. One important point that he brought up was that SMTP tarpit effectiveness is local to the protected network whereas HTTP tarpits can be used to slow down and trap spammers as they attempt to harvest email addresses. The difference here is that one protects you (SMTP tarpits) and the other protects all in a strictly weaker capacity by hindering spammers at the start of the journey (HTTP tarpits).

Phil Raymond from Vanquish then got up and gave what amounted to a sales pitch for his new email reputation system. He tried pretty hard to coin the term “personal interrupt value” in describing how he wanted to create a “fluid market for your moment-by-moment attention”. It’s not just that I’ve heard this before, but I’ve heard it from every email reputation service vendor. Goodmail, Bonded Sender, smtpRM... they all say the same exact things. How are we to differentiate? In case you don’t know, the basic idea is that senders put up something of value in bond, and then that bond is debited should there be a conflict. The twist with Vanquish is that if you decide that you know a person or that you like the email, the sender doesn’t pay. That does neatly handle the case of friends and family not having to pay to play, but there are a number of other problems. It hit a couple of my FUSSP sensors: it requires (or at least substantially benefits from) a “flag day”, they didn’t have a good answer for zombies (“end users will have so little at risk, it won’t matter”), etc. Basically, I wasn’t impressed. If all these guys got together and standardized something, that might move me.

There’s another important point about email reputation systems that a lot of people seem to be missing. This trial that Goodmail has going on with AOL and Yahoo!? That’s the entire industry’s big break. If Goodmail fails to deliver, then BS, smtpRM, Vanquish and any others are going to be set back years, as well. I understand their willingness to get in the game before it’s truly established, but they’re a niche within a niche. A big setback at either AOL or Yahoo! will not incite other ISPs or enterprises to knock down any of their doors.

As an aside, I have a feeling that no one from AOL or Yahoo! showed up this year in large part because they didn’t want the shitstorm from the audience about their decision to go with Goodmail.

The next talk was very cool: the guys from BitDefender talked about using adaptive neural networks for email classification. Specifically, filtering spam, but its uses are more general than that. Basically, they applied Adaptive Resonance Theory to a hierarchy of these neural networks and got some pretty promising results. The big thing here is that you don’t need to retrain it with the entire corpus; it can learn new heuristics without forgetting the old ones. The heuristics still have to be created by some other means, of course, but this was a pretty interesting technique, nonetheless.

Next, Giovanni Donelli from the University of Bologna talked about a technique he called “Email Interferometry”. The idea boils down to monitoring a set of related accounts (called an “e-pool”) looking for the same spam messages to come into multiple accounts in the pool. He posited that it might not work well on large scales and did not itself indicate a filtering/classification technology. Didn’t seem too promising.
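
As I understood it, the core of the e-pool idea can be sketched like this: group the incoming messages across the pool and flag any message that lands in more than one account. This is only an illustration under my own assumptions. The talk did not say how messages were matched, so exact matching on a hash of the whitespace- and case-normalized body is a stand-in, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(body):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(body.lower().split())


def shared_messages(pool, min_accounts=2):
    """pool maps account name -> list of message bodies.

    Returns {digest: sorted account names} for every message body that
    appears in at least min_accounts distinct accounts of the pool.
    """
    seen = defaultdict(set)
    for account, bodies in pool.items():
        for body in bodies:
            digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
            seen[digest].add(account)
    return {d: sorted(a) for d, a in seen.items() if len(a) >= min_accounts}


pool = {
    "alice": ["Buy  NOW, limited offer", "Lunch on Friday?"],
    "bob": ["buy now, limited offer"],
    "carol": ["Quarterly report attached"],
}
hits = shared_messages(pool)
```

Note that this only surfaces overlap; as the speaker said, it is not a classifier by itself.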

Bill Yerazunis talked about sorting spam with k-nearest neighbor and his new Hyperspace classifier in CRM114. I missed the first half of this talk, but I think he used kNN to attempt to match or exceed the quality of Markov chaining. Since I missed the meat of this one, I’m not going to comment further, but you can try it today by telling crm114 to use “hyperspace”.

Now, as far as the paper with the biggest potential for impact, I’m going to say it was Kang Li et al.’s Towards a Ham Archive. Anyone who works on anti-spam software knows that we can get spam any time we want, in any quantity. The problem is getting a source of quality ham. There is the SA ham corpus and the TREC corpus, but not much else. This is a problem because without a large, quality source of ham all of our effectiveness statistics are eternally suspect. Li has thought of a method that might work for creating a large public corpus of ham without exposing the actual message data. Simply hash bigrams of the message in a sliding window and insert the digest values into a vector. The vector is then the quantity which is published in a public archive. The cool thing is, statistical filters already work on “tokens”, which are currently some number of words from the message. The digests in the digest vector could easily be used in the same capacity. But, since the messages are being digested in bigrams by a sliding window, the original message cannot be reproduced, so users can have confidence in releasing their ham to a public corpus. It wasn’t clear how large of a tradeoff there would be between protecting the privacy of the messages’ authors and effectiveness of the filters trained using digest vectors, but I think it’s definitely a much-needed advance for a tough problem.
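
The sliding-window digest idea is simple enough to sketch. The version below is my own guess at the shape of it, not the paper's implementation: splitting on whitespace, using SHA-256, and truncating each digest to 4 bytes are all assumptions, and `digest_vector` is a hypothetical name.

```python
import hashlib


def digest_vector(message, digest_bytes=4):
    """Hash each adjacent word pair (bigram) and return the list of digests.

    Only the digests would be published; the words themselves are not kept.
    """
    words = message.lower().split()
    vector = []
    # zip(words, words[1:]) slides a two-word window across the message.
    for first, second in zip(words, words[1:]):
        bigram = f"{first} {second}".encode("utf-8")
        # hexdigest() yields two hex characters per byte.
        vector.append(hashlib.sha256(bigram).hexdigest()[: digest_bytes * 2])
    return vector


vector = digest_vector("Lunch at noon tomorrow")
```

A statistical filter could then count these digests the same way it counts word tokens today. How short the digests can be before collisions hurt accuracy is exactly the tradeoff that was left open.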

Reflexion then talked about their Supplemental Address Management System, which as far as I can tell, differs little from TitanKey other than the fact that it doesn’t employ challenge-response. They have some theory about how their system is better than disposable email addresses, but frankly, I couldn’t see a qualitative difference.

Mr. Palla then talked about how to detect phishes in email. I had some high hopes for this talk, because I talked to him and his wife for at least an hour the night before, but he blazed through those slides way too fast. The slides themselves were also far too dense for the time allotted. From what I gather, he was analysing the headers for rDNS information and also checking the recipient’s sent folder for matching addresses. Andrew from MessageLabs commented that they were getting better effectiveness from rDNS inspection alone than what Palla reported in his talk. Oh well.

Here’s the biggest travesty of the day: we came back just in time for Jon Zdziarski’s last slide for his talk about probabilistic digital fingerprinting techniques as applied to phishing detection. That sucked. I really wanted to see this one because his talk from last year was the best by far. Still, I talked to him the night before about it and he told me that it was basically building fingerprints of the pages that are linked to in email messages. The fingerprints were then correlated to find pages that have a large number of fingerprints in common, so that they can distinguish which of those uncommon fingerprints would be replicated across multiple emails. This would then indicate a set of fingerprints for an author of a phish. I may be messing up the details somewhat, so I will redo this section after I read the paper.

Fidelis Assis then phoned in a talk about “Exponential Differential Document Count” from Brazil. It was somewhat hard to understand over the phone over the loudspeaker, but the EDDC technique attempts to replicate what humans do when reading mail by picking out strong features and lessening the importance of ones that occur about equally in both ham and spam. I wrote down that I should get the paper, but it appears to increase effectiveness in CRM114. In the meantime, you can check out the code here or here.

Aaron Kornblum then talked about how Microsoft’s team let a PC get infected with a zombie and checked to see what it did. They didn’t let any email out from it, but it got a huge number of connection attempts and tried to send a ton of email. They then used that info to file suits against the zombie controllers. Cool stuff.

Jon Praed kept up his streak by talking at this one, this time about CAN-SPAM and some problems it has. Spammers are doing what he calls “microbranding”, which is keeping a low enough profile to appear small while still getting enough volume to be profitable. This entails starting a bunch of shell companies and not spamming the biggest ISPs. Spammers are also fleeing offshore, but the fact that they are US citizens poses both problems and solutions for the authorities. Jon then indicated that the costs of CAN-SPAM are not known, but that it was basically really good for ensuring that legit mailers comply while not having much of an effect on spammers. He posed an interesting alternative to CAN-SPAM modeled after 18 USC 2257, which is the regulation that says all adult performers need to have their age and info recorded by someone and available upon request (to prevent another Traci Lords). Good talk, as usual.

Keynotes

Eric Allman, creator of Sendmail, gave the first keynote. He was advocating using Sender Domain Authentication (i.e. DKIM). Mostly it was an overview, but he indicated that there were definitely some rich research topics to explore in this area and that a lot of work was left to do to work out the IETF standards. Benefits listed were making whitelists more reliable, displaying auth results to user, etc. He concluded that it was a valuable tool for the anti-spam toolkit and that authentication was required to achieve a full ID suite for Internet communications.

Barry Shein, President and CEO of The World, gave the second keynote. It was pretty funny, if a little disconnected. He talked about all of this stuff.

Overall, a very good conference and I was glad I decided to go again this year, especially after last year’s debacle. I plan on attending again next year.

MIT Spam Conference 2005

Posted by Toby Sun, 23 Jan 2005 10:36:00 GMT

I just got back from the MIT Spam Conference 2005, which is put on by Paul Graham and Gilberte Houbart.

Jonathan Zdziarski’s talk on Bayesian Noise Reduction was easily the most technically interesting talk of the entire conference. The idea is to find contexts in the message so that out-of-context text can be ignored with respect to the classification of said message. I highly recommend reading the paper or at least checking out the presentation.

Also interesting was the dissertation on the trial of ROKSO-listed spammer Jeremy Jaynes (a.k.a. Gaven Stubberfield) by Jon Praed. Project Honeypot also seems to be coming along quite well under the direction of Matthew Prince. Finally, John Graham-Cumming discussed the interesting, if somewhat informal, results of his People and Spam survey. An interesting point from that: he found that 1% of the men who had used computers for 10+ years had also bought something from spam.

Brian McWilliams gave a pretty interesting look at his book. MailFrontier got two talks in this year’s list, and both were somewhat disappointing. The first was about using Bayesian classification applied to phishing, but the results were somewhat obvious. The main thing that came from the talk was that one should have a corpus for spam and another for phish, as opposed to having an integrated corpus for both. The audience decided that spam is to ham as phish is to “phowl”, which was pretty funny. The second talk was weaker, as the speaker misspelled the title of his own talk and also refused to give details on their implementation. It was about using lexicographical distance to normalize tokens. For example, there are 6 quintillion+ ways to form the word “Viagra” in a recognizable manner; it would be useful to see all of these tokens as “Viagra” for the purposes of token frequency counts. So, the idea is to make edits to a token up to a certain threshold of edits; if by then it resembles a known token, it is treated as that token; if not, it’s not.
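
Since the speaker would not give implementation details, the following is only my guess at what token normalization by edit distance could look like. It swaps in plain Levenshtein distance, which may not be what MailFrontier used; the function names and the threshold of 2 edits are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def normalize_token(token, known_tokens, max_edits=2):
    """Map token to the closest known token within max_edits, else keep it."""
    best, best_dist = token, max_edits + 1
    for known in known_tokens:
        dist = edit_distance(token.lower(), known.lower())
        if dist < best_dist:
            best, best_dist = known, dist
    return best


known = ["viagra", "mortgage"]
normalized = [normalize_token(t, known) for t in ["V1agra", "Vi@gr@", "hello"]]
```

With this, "V1agra" and "Vi@gr@" would both be counted as "viagra", while an unrelated word passes through unchanged.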

John Graham-Cumming then stated during the QA session that, in POPfile, they are sorting the letters of words with 6+ letters and using the sorted result as the token. It turns out, he said, that there is a very small chance of collision using that technique given the low probability of those words having anagrams in English and doing so defeats attempts at misdirection via misspelling words.
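
The sorted-letters trick as described fits in a few lines. This is a sketch from the Q&A description only, not POPFile's actual code; lowercasing the word first and leaving shorter words otherwise untouched are my assumptions, and `anagram_token` is a made-up name.

```python
def anagram_token(word, min_length=6):
    """Return the word's letters in sorted order if it is long enough,
    otherwise the (lowercased) word itself."""
    word = word.lower()
    if len(word) >= min_length:
        return "".join(sorted(word))
    return word


tokens = [anagram_token(w) for w in ["Viagra", "Vaigra", "spam"]]
```

Here "Viagra" and the scrambled "Vaigra" collapse to the same token. Note that this only undoes letter transpositions; swapping in a different character (say, "V1agra") would still produce a distinct token.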

The IBM and Cisco talks were not very interesting: a form of both talks had been at either or both of the previous Spam Conferences. Cisco is basically advocating that everyone use DomainKeys and IBM has figured out how to put a bunch of filtering technologies together to create a more accurate filter. I really wonder why either of these talks was accepted…

Gordon Cormack came down from the University of Waterloo to talk about Standardized Filter Evaluation. It was a fairly obvious intersection of testing methodologies and anti-spam filters and there were some obvious problems with it, as well.

The rest of the talks were pretty bad.

Dave Mazières described MailAvenger, an MTA that is both individual-user-configurable and also does SMTP callback verification (CBV); that didn’t go so well for him. It does lots more than that, but he chose to focus on that and got the results that you’d expect. While standing in the back, I personally brought up to Paul the fact that Verizon does this, but I was surprised that no one in the audience told the speaker about Verizon’s use of this technology and the problems it causes. Paul commented that whatever heat Verizon got for doing that obviously wasn’t enough to make them stop, to which I could only shrug agreement.

Rui Dai gave a talk about Regulation, where they proposed that spam was such a problem because spammers could not effectively find targets and thus needed to blast all of their spam runs to everyone on their lists. He then proposed a third-party arbiter that would help spammers target likely recipients better and this would in theory reduce the overall volume of spam. As expected, this was met with more than a little trepidation from the audience, and wasn’t described very well in the talk to begin with.

Oscar Boykin discussed using social network graphs for determining spam, but failed to realize that this is almost exactly what PageRank™ did for Google and further failed to realize why it could be just as easily killed.

The Distributed Quota Management talk never came off, as the speakers could not be found.

I’ve been to every one of the MIT Spam Conferences, but I’m not sure if I’m going to go next year. The palpable undercurrent of the conference was that Bayesian filtering was the solution to spam (John Graham-Cumming basically came out and said this during the final QA session at day’s end), and that’s just not true. Furthermore, I know for a fact that they will reject papers that stray too far from the topic of filtering in general, thus lessening the potential worth of the conference as a place for the discovery of new ideas.

Now, the muffins and coffee are very good and it’s good to meet people, and if I lived in Boston or near there I would probably continue to go. But for me to attend next year, I need to see a list of very good papers on topics that are not necessarily related to statistical filters. This will probably only happen when someone other than Paul Graham or one of his fans selects the papers for this thing. As it stood this year, I had already read Jonathan’s paper on Bayesian Noise Reduction before I came, and that pretty much made my trip up there almost entirely unnecessary.