Blocking and filtering each have their advocates and advantages. While both reduce the amount of spam delivered to users' mailboxes, blocking does much more to alleviate the bandwidth cost of spam, since spam can be rejected before the message is transmitted to the recipient's mail server. Filtering tends to be more thorough, since it can examine all the details of a message. Many modern spam filtering systems take advantage of machine learning techniques, which vastly improve their accuracy over manual methods. However, some people find filtering intrusive to privacy, and many mail administrators prefer blocking to deny access to their systems from sites tolerant of spammers.
For history and details on DNSBLs, see DNSBL.
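As a rough illustration of how a mail server might consult a DNSBL, the sketch below builds the DNS name to query and interprets the answer. It assumes the common convention of reversing the IPv4 octets, appending the list's zone, and treating an answer in the 127.0.0.0/8 range as "listed". The zone name `dnsbl.example.org` and the injected `resolve` callable are hypothetical stand-ins, not a real list or resolver API.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve) -> bool:
    """Return True if the DNSBL answers with an address in 127.0.0.0/8.

    `resolve` is a hypothetical callable: it takes a DNS name and returns
    a list of A-record strings, or an empty list when the name does not exist.
    """
    answers = resolve(dnsbl_query_name(ip, zone))
    return any(answer.startswith("127.") for answer in answers)


# A fake resolver standing in for real DNS, so the example runs offline.
FAKE_ZONE_DATA = {"99.2.0.192.dnsbl.example.org": ["127.0.0.2"]}


def fake_resolve(name: str) -> list:
    return FAKE_ZONE_DATA.get(name, [])
```

With the fake data above, `is_listed("192.0.2.99", "dnsbl.example.org", fake_resolve)` reports the address as listed, while any other address is not. A real deployment would replace `fake_resolve` with an actual DNS lookup.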
Content-based filtering can also filter on content other than the words and phrases that make up the text of the message. Primarily, this means looking at the headers of the email: the part of the message that contains information about the message, rather than the text of the message itself. Spammers will often spoof headers in order to hide their identities, or to try to make the email look more legitimate than it is; many of these spoofing methods can be detected. Also, spam-sending software often produces headers that violate the RFC 2822 standard on how email headers are supposed to be formed.
Disadvantages of this static filtering are threefold: First, it is time-consuming to maintain. Second, it is prone to false positives. Third, these false positives are not equally distributed: manual content filtering is prone to reject legitimate messages on topics related to products advertised in spam. A system administrator who attempts to reject spam messages which advertise mortgage refinancing may easily inadvertently block legitimate mail on the same subject.
Finally, spammers can change the phrases and spellings they use, or employ methods to try to trip up phrase detectors. This means more work for the administrator. However, it also has some advantages for the spam fighter. If the spammer starts spelling "Viagra" as "V1agra" (see leet) or "Via_gra", it makes it harder for the spammer's intended audience to read their messages. If they try to trip up the phrase detector by, for example, inserting an invisible-to-the-user HTML comment in the middle of a word ("Via<!---->gra"), this sleight of hand is itself easily detectable, and is a good indication that the message is spam. And if they send spam that consists entirely of images, so that anti-spam software can't analyze the words and phrases in the message, the fact that it is image-only can be detected.
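The HTML-comment trick can be detected with a very small pattern match. The following is a minimal sketch, assuming standard `<!-- ... -->` comment syntax; the function names are illustrative and not taken from any particular filter.

```python
import re

# Any HTML comment, including an empty one such as <!---->.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# A comment with a word character directly on both sides, i.e. splitting a word.
COMMENT_INSIDE_WORD = re.compile(r"\w<!--.*?-->\w", re.DOTALL)


def has_comment_inside_word(html: str) -> bool:
    """Flag markup where an HTML comment is wedged into the middle of a word."""
    return COMMENT_INSIDE_WORD.search(html) is not None


def strip_comments(html: str) -> str:
    """Remove comments so that a phrase detector sees what the reader sees."""
    return HTML_COMMENT.sub("", html)
```

A filter could use both functions together: strip the comments before phrase matching, and treat a positive result from `has_comment_inside_word` as evidence of spam in its own right.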
Statistical filtering, once set up, requires no maintenance per se: instead, users mark messages as spam or nonspam and the filtering software learns from these judgments. Thus, a statistical filter does not reflect its author's or administrator's biases as to content, but it does reflect the user's biases as to content; a biochemist who is researching Viagra won't have messages containing the word "Viagra" flagged as spam, because "Viagra" will show up often in his or her legitimate messages. It can also respond quickly to changes in spam content, without administrative intervention.
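The learning behaviour described above can be sketched with a small naive Bayes classifier. This is a generic textbook formulation with add-one smoothing, not the exact method of any filter named in this article; the class name, tokenizer, and training messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Toy statistical filter trained on messages a user has marked."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record one message the user marked as 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Log of P(spam | text) / P(ham | text), with add-one smoothing."""
        spam, ham = self.word_counts["spam"], self.word_counts["ham"]
        vocab_size = len(set(spam) | set(ham)) or 1
        spam_total, ham_total = sum(spam.values()), sum(ham.values())
        score = math.log(
            (self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1)
        )
        for token in tokenize(text):
            p_spam = (spam[token] + 1) / (spam_total + vocab_size)
            p_ham = (ham[token] + 1) / (ham_total + vocab_size)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0


# The biochemist's mailbox: "viagra" appears in legitimate mail.
biochemist = NaiveBayesFilter()
biochemist.train("viagra study results attached for the lab meeting", "ham")
biochemist.train("please review the viagra dosage data", "ham")
biochemist.train("buy cheap watches now", "spam")
biochemist.train("cheap loans approved now click here", "spam")
```

Because the filter is trained only on this user's own judgments, a message such as "viagra dosage results" scores as legitimate for the biochemist, while "cheap watches now" scores as spam.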
Spammers have attempted to fight statistical filtering by invisibly inserting many random but valid words into their messages, making it more likely that the filter will classify the message as neutral; they make the words invisible by giving them a very tiny font, by making the words the same color as the background, or both. However, these evasion attempts seem to have been largely ineffective.
Software programs that implement statistical filtering include Bogofilter, the e-mail programs Mozilla and the soon to be released Mozilla Thunderbird, and later revisions of SpamAssassin.
The advantage of this type of filtering is that it lets ordinary users help identify spam, and not just administrators, thus vastly increasing the pool of spam fighters. The disadvantage is that spammers can insert unique invisible gibberish -- known as hashbusters -- into the middle of each of their messages, thus making each message unique and giving it a different checksum. This leads to an arms race between the developers of the checksum software and the developers of the spam-generating software.
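A minimal sketch of the checksum idea, and of why hashbusters defeat its naive form, might look like the following. The normalization steps (lowercasing, removing email addresses and the recipient's name, collapsing whitespace) and the use of SHA-256 are assumptions chosen for illustration; real checksum systems use their own, more robust schemes.

```python
import hashlib
import re

# Deliberately simple address pattern, good enough for this illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def message_checksum(body: str, recipient_name: str = "") -> str:
    """Strip the parts that vary per recipient, then hash what remains."""
    text = EMAIL.sub("", body.lower())
    if recipient_name:
        text = text.replace(recipient_name.lower(), "")
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical shared database of checksums that users reported as spam.
known_spam = set()


def report_spam(body: str, recipient_name: str = "") -> None:
    known_spam.add(message_checksum(body, recipient_name))


def is_known_spam(body: str, recipient_name: str = "") -> bool:
    return message_checksum(body, recipient_name) in known_spam


to_alice = "Dear Alice, buy now! Reply to alice@example.com"
to_bob = "Dear Bob, buy now! Reply to bob@example.net"
report_spam(to_alice, "Alice")
```

After Alice's report, Bob's copy matches the database even though the name and address differ. Appending a random token to Bob's copy, as a hashbuster would, changes the checksum and the match is lost, which is the arms race the paragraph above describes.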
Checksum based filtering methods include:
A number of proposals and specifications have been written to extend the SMTP protocol to avoid spam, including:
Organizations that implement such systems include:
One tarpit design is the teergrube, whose name is simply German for "tarpit." This is an ordinary SMTP server which intentionally responds very slowly to commands. Such a system will bog down SMTP client software, as further commands cannot be sent until the server acknowledges the earlier ones. Several SMTP MTAs, including Postfix, have a teergrube capacity built in: when confronted with a client session which causes errors such as spam rejections, they will slow down their responding. [1] [1]
Another design for tarpits directly controls the TCP/IP protocol stack, holding the spammer's network socket open without allowing any traffic over it. By reducing the TCP window size to zero, but continuing to acknowledge packets, the spammer's process may be tied up indefinitely. This design is more difficult to implement than the former. Aside from anti-spam purposes, it has also been used to absorb attacks from network worms. [1]
A third design is simply an imitation MTA which gives the appearance of being an open mail relay. Spammers who probe systems for open relay will find such a host and attempt to send mail through it, wasting their time. Such a system may simply discard the spam attempts, submit them to DNSBLs, or store them for analysis. It may also selectively deliver relay test messages to give a stronger appearance of open relay. SMTP honeypots of this sort have been suggested as a way that end-users can interfere with spammers' activities. [1] [1]
Another method which may be used by internet service providers (or by specialized services) to combat spam is to require unknown senders to pass various tests before their messages are delivered. These strategies are termed challenge-response systems or C/R, and are currently controversial among email programmers and system administrators.
One example of a challenge-response system is a "captcha" test, in which a mail sender is required to view an image containing a word or phrase, and respond with that word or phrase in text. The purpose of this is to ensure that automated systems (incapable of reading the image) cannot transmit email.
Critics of C/R systems have raised several issues regarding their usefulness as an email defense:
Address munging does not, however, evade so-called "dictionary attacks" in which the spammer generates a number of likely-to-exist addresses out of names and common words. For instance, if there is someone with the address adam@example.com, where 'example.com' is a popular ISP or mail provider, it is likely that he frequently receives spam.
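As a minimal sketch of address munging, the functions below insert a "NOSPAM" marker around the @ sign, so that joe@example.net becomes joeNOS@PAM.example.net, and reverse the change. The function names and the fixed marker are illustrative assumptions; in practice users pick their own variations so that harvesters cannot undo them automatically.

```python
MARKER = "NOS@PAM."  # the "@" inside the marker replaces the real one


def munge(address: str) -> str:
    """Turn local@domain into localNOS@PAM.domain."""
    local, sep, domain = address.partition("@")
    if not (sep and local and domain):
        raise ValueError(f"not an email address: {address!r}")
    return f"{local}{MARKER}{domain}"


def unmunge(munged: str) -> str:
    """What a human reader does by eye: drop the NOSPAM marker."""
    return munged.replace(MARKER, "@", 1)
```

Note that none of this helps against a dictionary attack, which guesses addresses rather than harvesting them.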
Users can defend against these methods by using mail clients which do not display HTML or attachments, or by configuring their clients not to display these by default.
In Usenet, it is widely considered even more important to avoid responding to spam. Many ISPs have software that seeks out and destroys duplicate messages. Often someone sees a spam message and responds to it before it is cancelled by their server. This can have the effect of reposting the spammer's spam for them, and since the reply is not just a duplicate, this reposted copy will actually last longer.
In late 2003, the FCC launched a public relations campaign to encourage email users to simply never respond to a spam email -- ever. This campaign stemmed from the tendency of casual email users to reply to spam, in order to complain about the spam and ask the spammer to stop sending spam. This has the effect of alerting spammers to the existence of a person who actually reads spam email, and it has the effect of increasing spam rather than stopping it.
Two such online tools are SpamCop and Network Abuse Clearinghouse. Both provide automated or semi-automated means to report spam to ISPs. Some spam-fighters regard them as inaccurate compared to what an expert in the email system can do; however, most email users are not experts.
In the past several years, scores of worm programs have used email systems as a conduit for infection. The worm program transmits itself in an email message, usually as a MIME attachment. In order to infect a computer, the executable worm attachment must be opened. In almost all cases, this means the user must click on the attachment. The worm also requires a software environment compatible with its programming.
Email users can defend against worms in a number of ways, including:
Defense against spam
There are a number of services and software systems that mail sites and users can use to reduce the load of spam on their systems and mailboxes. Some of these depend upon rejecting email from Internet sites known or likely to send spam. Others rely on automatically analyzing the content of email messages and weeding out those which resemble spam. These two approaches are sometimes termed blocking and filtering.
Spam blocking and filtering techniques
DNSBLs
DNS-based Blackhole Lists, or DNSBLs, are a blocking technique, whereby a site publishes lists of IP addresses via the DNS, in such a way that mail servers can easily be set to reject mail from those addresses. There are literally scores of DNSBLs, each of which reflects different policies: some list sites known to emit spam; others list open mail relays or proxies; others, such as SPEWS, list ISPs known to support spam.
Content-based filtering
Until recently, content filtering techniques relied on mail administrators specifying lists of words or regular expressions disallowed in mail messages. Thus, if a site receives spam advertising "herbal Viagra", the administrator might place these words in the filter configuration. The mail server would then reject any message containing the phrase.
Statistical filtering
Statistical filtering was first proposed in 1998 by Mehran Sahami, et al., at the AAAI-98 Workshop on Learning for Text Categorization. A statistical filter is a kind of text classification system, and a number of machine learning researchers have turned their attention to the problem. Statistical filtering was popularized by Paul Graham's influential 2002 article, which used Naive Bayesian classification to predict whether messages are spam or not -- based on collections of spam and nonspam ("ham") email submitted by users. [1] [1]
Checksum-based filtering
Checksum-based filtering takes advantage of the fact that, for any individual spammer, the messages he or she sends out will be mostly identical, the only differences being web bugs and any places where the text of the message contains the recipient's name or email address. Checksum-based filters strip out everything that might vary between messages, reduce what remains to a checksum, and compare it to a database which collects the checksums of messages that email recipients consider to be spam (some people have a button on their email client which they can click to nominate a message as being spam); if the checksum is in the database, the message is likely to be spam.
Protocol extensions
Messages certified as not being spam
There are several third-party organizations which guarantee that certain messages aren't spam, and have the means to prevent spammers from fraudulently using their system, by fining or suing them, for example. Administrators can use this to let through messages that would otherwise be filtered or blocked as spam, thus reducing the false positive rate.
Heuristic filtering
Heuristic filtering, such as is implemented in the program SpamAssassin, uses some or all of the various tests for spam mentioned above, and assigns a numerical score to each test. Each message is scanned for these patterns, and the applicable scores tallied up. If the total is above a fixed value, the message is rejected or flagged as spam. By ensuring that no single spam test by itself can flag a message as spam, the false positive rate can be greatly reduced. [1]
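The scoring scheme can be sketched as a list of weighted tests and a threshold. The rule names, weights, and threshold below are invented for illustration and are not SpamAssassin's actual rules or configuration; the only property carried over from the description above is that every individual weight is below the threshold.

```python
import re

COMMENT_INSIDE_WORD = re.compile(r"\w<!--.*?-->\w", re.DOTALL)

# (rule name, score, test) -- all hypothetical values.
RULES = [
    ("spam_phrase", 2.5, lambda msg: "herbal viagra" in msg.lower()),
    ("comment_inside_word", 3.0,
     lambda msg: COMMENT_INSIDE_WORD.search(msg) is not None),
    ("many_exclamations", 1.0, lambda msg: msg.count("!") >= 3),
]

# No single rule reaches this value on its own.
THRESHOLD = 5.0


def score_message(msg: str):
    """Return the total score and the names of the rules that matched."""
    hits = [(name, weight) for name, weight, test in RULES if test(msg)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


def is_spam(msg: str) -> bool:
    total, _ = score_message(msg)
    return total > THRESHOLD
```

A legitimate message that merely mentions "herbal Viagra" trips one rule and stays under the threshold, while a message that also hides a comment inside a word and uses a run of exclamation marks accumulates enough score to be flagged.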
Tarpits and Honeypots
A tarpit is any server software which intentionally responds pathologically slowly to client commands. A honeypot is a server which attempts to attract attacks. Some mail administrators operate tarpits to impede spammers' attempts at sending messages, and honeypots to detect the activity of spammers. By running a tarpit which appears to be an open mail relay, or which treats acceptable mail normally and known spam slowly, a site can slow down the rate at which spammers can inject messages into the mail facility.
Challenge-response systems
Spam tips for users
Aside from installing client-side filtering software, end users can protect themselves from the brunt of spam's impact in numerous other ways.
Address munging
One way that spammers obtain email addresses to target is to trawl the Web and Usenet for strings which look like addresses. Thus, if one's address is never listed on these fora, they cannot find it. Posting anonymously, or with an entirely faked name and address, is one way to avoid this "address harvesting". Users who want to receive legitimate email regarding their posts or Web sites can alter their addresses in some way that humans can figure out but spammers haven't (yet). For instance, joe@example.net might post as joeNOS@PAM.example.net, or display his email address as an image instead of text. This is called address munging, from the jargon word "mung" meaning to break.
Defeating Web bugs and JavaScript
Many modern mail programs incorporate Web browser functionality, such as the display of HTML and images. This can easily expose the user to pornographic or otherwise offensive images in spam. In addition, spam written in HTML can contain JavaScript programs to direct the user's Web browser to an advertised page, or to make the spam message difficult or impossible to close or delete. In some cases, spam messages have contained attacks upon security vulnerabilities in the HTML renderer, using these holes to install spyware. (Some computer viruses are borne by the same mechanisms.)
Avoiding responding to spam
It is well established that some spammers regard responses to their messages -- even responses which say "Don't spam me" -- as confirmation that an email address refers validly to a reader. Likewise, many spam messages contain Web links or addresses which the user is directed to follow to be removed from the spammer's mailing list. In several cases, spam-fighters have tested these links and addresses and confirmed that they do not lead to the recipient address's removal -- if anything, they lead to more spam.
Reporting spam
The majority of ISPs explicitly forbid their users from spamming, and eject from their service users who are found to have spammed. Tracking down a spammer's ISP and reporting the offense often leads to the spammer's service being terminated. Unfortunately, it can be difficult to track down the spammer -- and while there are some online tools to assist, they are not always accurate.
Defense against email worms