API Query Info
This page discusses some of the considerations involved in using the different kinds of queries available in the BotScout API. The information on this page applies to both the Standard and XML API functions.
- The "MAIL" Query
The MAIL query is one of the simplest and most useful query types in the BotScout API. It takes an email address and looks for matches in the BotScout database, searching only in the email field. The email address is, statistically, the best indicator of whether or not a potential "user" is, in fact, a bot. This is because virtually all forums and other web services now require a valid email address when signing up or registering to gain "post" access. Post access is the level of user access required to (for example) leave a message on a message board, fill out a classified ad, or enter data into a database.
The valid email address is used to confirm receipt of the registration credentials so that bots can't flood a site with thousands (or tens of thousands) of fake user accounts. Since a working email address is required, this provides a credible pointer to a given user (or bot). Although it's common for a single bot to make use of many different email addresses, the available addresses are often used and re-used. Email addresses used by a bot are almost never used by an actual human (a "real" user) so the chance of a collision or false positive is very, very small.
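As a rough illustration of how a MAIL query might be built and its reply read, here is a minimal Python sketch. The endpoint URL, the `mail` and `key` parameter names, and the pipe-delimited `Y|MAIL|3` reply shape are assumptions made for this example and are not stated on this page; the function names are hypothetical. Check the API reference before relying on any of them.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names are illustrative, not confirmed here.
BOTSCOUT_URL = "https://botscout.com/test/"


def build_mail_query(email, api_key=None):
    """Return the URL for a MAIL query on one email address."""
    params = {"mail": email}
    if api_key:
        params["key"] = api_key
    # urlencode percent-escapes characters such as "@" for us.
    return BOTSCOUT_URL + "?" + urlencode(params)


def parse_single_reply(reply):
    """Split an ASSUMED 'Y|MAIL|3' style reply into (matched, field, count)."""
    parts = reply.strip().split("|")
    if len(parts) != 3 or parts[0] not in ("Y", "N"):
        raise ValueError(f"unexpected reply: {reply!r}")
    return parts[0] == "Y", parts[1], int(parts[2])
```

Keeping URL construction and reply parsing as separate pure functions makes both easy to test without touching the network.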
- The "IP" Query
The IP query takes an IP address and looks for matches in the BotScout database, searching only in the IP address field. IP addresses by themselves are reasonably good indicators of bot activity. However, since many bots operate from computers that have been infected by malware and joined into a botnet, it is possible (although very unlikely in practical terms) that a given IP address in the database will also belong to a "real" user.
To get an idea of the collision rate for a given IP address, consider the following:
There are 4,294,967,296 (4.3 billion) possible IP addresses, of which 2,147,483,648 (2.1 billion) are normally available to typical service-level users (and bots).
Suppose the BotScout database had 100,000,000 (100 million) stored IPs and 100,000 new users tried to sign up to a BotScout-screened service tomorrow. 100 million IPs is roughly 4.7% of the 2.1 billion available addresses (about 2.3% of all possible addresses), and 100,000 is one one-thousandth of 100 million. In simple terms, even with a database that large, the chance that a given IP will be shared by a bot AND by a user who wants to sign up to your forum or web service is, for practical purposes, small.
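The figures in this example can be checked with a few lines of Python. The sketch only reproduces the arithmetic from the numbers given above; the variable names are illustrative, and treating the stored IPs as evenly spread over the address space is a simplifying assumption.

```python
TOTAL_IPV4 = 2 ** 32           # 4,294,967,296 possible addresses
AVAILABLE = 2_147_483_648      # addresses available to typical users (from the text)
STORED = 100_000_000           # hypothetical number of IPs in the database
NEW_USERS = 100_000            # hypothetical sign-ups in one day

share_of_available = STORED / AVAILABLE   # fraction of available addresses stored
share_of_total = STORED / TOTAL_IPV4      # fraction of all addresses stored
signups_vs_stored = NEW_USERS / STORED    # sign-ups relative to stored IPs

print(f"{share_of_available:.2%}")  # prints 4.66%
print(f"{share_of_total:.2%}")      # prints 2.33%
print(signups_vs_stored)            # prints 0.001
```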
- The "NAME" Query
The NAME query takes a given user name and looks for matches in the BotScout database, searching only in the NAME field. By itself, the NAME query is much less reliable than an IP or MAIL query and should not be relied upon for bot screening. It should be used ONLY as a secondary indicator, and even that is of dubious value. Name collisions are not uncommon, since bots use nonsense names as well as "real" names when they run. The NAME query by itself is next to useless; if used, it should always be coupled with a MAIL or IP query for reliability. The false positive rate of using the NAME query alone is abysmally high. Using the NAME result to fail a form submission (even in conjunction with MAIL and IP) is not recommended.
- The "ALL" Query
The ALL query takes a given data item (IP, NAME, or MAIL) and looks for matches in the BotScout database, searching against all of the database fields. This query has limited usefulness, but was provided primarily because some bots will use an email address as the user name. Sometimes this is by design; sometimes it is likely due to operator error or misconfiguration of the bot's operating parameters.
- The "MULTI" Query
The MULTI query is the most effective query: it provides a very high level of detection with a very low false positive rate. It requires more involved parsing on the requester side, and (optionally) some decision-making capability built into the processing code.
The MULTI query takes all three data items (an IP address, a name, and an email address) and looks for matches in the BotScout database, checking each item only against its own field. That is, names will be compared to the NAME fields, IPs will be compared to the IP fields, and the email address will be compared to the MAIL fields.
A composite set of matches with occurrence numbers is returned for all of the items, whether they matched or not. If, for example, the IP and the email matched, they will have numbers showing the times each item was found in the database. The name field would also be returned, but would show zero matches.
The MULTI query can be used to give a reasonable statistical certainty as to whether the items submitted constitute bot activity or not. All of the plugins available for BotScout make use of the MULTI query.
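The parsing step for a MULTI reply might look something like the sketch below. The reply shape used here (`Y|MULTI|IP|4|MAIL|26|NAME|0`, a pipe-delimited flag followed by field/count pairs) is an assumption for illustration and is not specified on this page, and `parse_multi_reply` is a hypothetical helper name.

```python
def parse_multi_reply(reply):
    """Parse an ASSUMED 'Y|MULTI|IP|4|MAIL|26|NAME|0' reply.

    Returns (matched, counts), where counts maps each field name to the
    number of times that item was found in the database.
    """
    parts = reply.strip().split("|")
    if len(parts) < 2 or parts[0] not in ("Y", "N") or parts[1] != "MULTI":
        raise ValueError(f"unexpected reply: {reply!r}")
    pairs = parts[2:]
    if len(pairs) % 2 != 0:
        raise ValueError(f"unpaired field/count in reply: {reply!r}")
    # Walk the remainder two items at a time: field name, then its count.
    counts = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return parts[0] == "Y", counts


matched, counts = parse_multi_reply("Y|MULTI|IP|4|MAIL|26|NAME|0")
print(matched, counts)  # prints True {'IP': 4, 'MAIL': 26, 'NAME': 0}
```

Items that did not match still appear in the result with a count of zero, which mirrors the behavior described above.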
Summary
In general, the email address is the single most reliable indicator of a bot: 99.9% or better. The IP address is the second most reliable indicator: 95% or better. The name is the least reliable indicator. It varies too much to be assigned a real percentage; possibly 30%.
Because bots routinely change their IPs, names, and emails, there's a very good chance that requiring a match on a unique three-item combination won't return a match, and the registration would go through, which is exactly what we don't want. In fact, each additional item you require to match radically reduces the likelihood of catching them.
If you test for a match on any one of those items, you stand a very good chance of catching them. Requiring a positive match on two items drops that to ~25% or so, based on all of the unique bot signatures in our database so far. Requiring a positive match on all three items causes the probability to drop way, way down. A match on three items would be a guaranteed bot, but you'd be very lucky to catch them the first time through, and all they need is one successful registration.
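One way to turn this guidance into processing code is sketched below: reject when either the email or the IP matched, and never reject on the name alone. The `counts` dictionary (field name to match count) and the function name `looks_like_bot` are hypothetical, and the rule itself is only one reasonable reading of the advice on this page, not an official policy.

```python
def looks_like_bot(counts):
    """Flag a submission if MAIL or IP matched at least once.

    NAME is deliberately ignored as a failure condition, since the name
    is the least reliable indicator.
    """
    return counts.get("MAIL", 0) > 0 or counts.get("IP", 0) > 0


print(looks_like_bot({"IP": 0, "MAIL": 12, "NAME": 0}))  # prints True
print(looks_like_bot({"IP": 0, "MAIL": 0, "NAME": 30}))  # prints False
```

Using "or" rather than "and" reflects the point made above: a match on any single reliable item catches far more bots than requiring several items to match at once.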