
Author Topic: API Query Discussion & Info  (Read 11392 times)

Mike

  • Administrator
  • Sr. Member
  • *****
  • Posts: 300
API Query Discussion & Info
« on: January 29, 2009, 06:41:24 PM »
Original article: http://botscout.com/api_queries.htm

API Query Info

This page discusses some of the considerations involved in using the different kinds of queries available in the BotScout API.

    * The "MAIL" Query

      The MAIL query is one of the simplest and most useful query types in the BotScout API. It takes an email address and looks for matches in the BotScout database, searching only in the email field. The email address is, statistically, the best indicator of whether or not a potential "user" is, in fact, a bot. This is because virtually all forums and other web services now require that a valid email be used when signing up or registering to gain "post" access. Post access is the level of user access required to (for example) leave a message on a message board, fill out a classified ad, or enter data into a database.

      The valid email address is used to confirm receipt of the registration credentials so that bots can't flood a site with thousands (or tens of thousands) of fake user accounts. Since a working email address is required, this provides a credible pointer to a given user (or bot). Although it's common for a single bot to make use of many different email addresses, the available addresses are often used and re-used. Email addresses used by a bot are almost never used by an actual human (a "real" user) so the chance of a collision or false positive is very, very small.

    * The "IP" Query

      The IP query takes an IP address and looks for matches in the BotScout database, searching only in the IP address field. IP addresses by themselves are reasonably good indicators of bot activity. However, since many bots operate from computers that have been infected by malware and joined into a botnet, it is possible (although very unlikely in practical terms) that a given bot IP address will also be a "real" user's IP address.

      To get an idea of the collision rate for a given IP address, consider the following:

      There are 4,294,967,296 (4.3 billion) possible IP addresses, of which 2,147,483,648 (2.1 billion) are normally available to typical service-level users (and bots).

      If the BotScout database had 100,000,000 (100 million) stored IPs and there were 100,000 new users that tried to sign up to a BotScout-screened service tomorrow, the odds of a collision for any one of those users would still be small. 100 million IPs is roughly 4.7% of the 2.1 billion available addresses (about 2.3% of all possible addresses), and that is a hypothetical database far larger than the almost 95,000 unique bot signatures recorded so far. 100,000 is one one-thousandth of 100 million. In simple terms, the chance that a given IP will be shared by a bot AND by a user who wants to sign up to your forum or web service is, in practical terms, very small.

    * The "NAME" Query

      The NAME query takes a given user name and looks for matches in the BotScout database, searching only in the NAME field. By itself, the NAME query is much less reliable than an IP or MAIL query and should not be relied upon for bot screening. It should be used ONLY as a secondary indicator, and even that is of dubious value. Name collisions are not uncommon since bots use nonsense names as well as "real" names when they run. The NAME query by itself is next to useless; if used, it should always be coupled with a MAIL or IP query for reliability. The false positive rate of using the NAME query alone is abysmally high.

    * The "ALL" Query

      The ALL query takes a given data item (IP, NAME, or MAIL) and looks for matches in the BotScout database, searching against all of the database fields. This query has limited usefulness, but was provided primarily because some bots will use an email address as the user name. Sometimes this is by design; sometimes it is likely due to operator error or misconfiguration of the bot's operating parameters.

    * The "MULTI" Query

      The MULTI query is a specialized type of query that, when used correctly, can provide a very high level of detection with a very low false positive rate. It requires more involved parsing on the requester side, and (optionally) some decision-making capability built into the processing code. It is the recommended default for querying the BotScout database because it returns the most usable data for a single query.

      The MULTI query takes all three data items (an IP address, a name, and an email address) and looks for matches in the BotScout database, searching against all of the database fields uniquely. That is, names will be compared to the NAME fields, IPs will be compared to the IP fields, and the email address will be compared to the MAIL fields.

      A composite set of matches with occurrence numbers is returned for all of the items, whether they matched or not. If, for example, the IP and the email matched, they will have numbers showing the times each item was found in the database. The name field would also be returned, but would show zero matches.

      The MULTI query can be used to give a reasonable statistical certainty as to whether the items submitted constitute bot activity or not. The phpBB plugin available for BotScout makes use of the MULTI query.
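
The "decision making capability" mentioned for the MULTI query might look something like the following minimal sketch. It assumes the MULTI response has already been parsed into per-item occurrence counts. The function name, the dictionary layout, and the policy itself are illustrative assumptions, not part of the BotScout API or the phpBB plugin. The policy treats a match on the email or the IP as sufficient, and a name match alone only as a reason for review, in line with the reliability ordering described in this post.

```python
from typing import Mapping


def classify_multi(counts: Mapping[str, int]) -> str:
    """Return 'bot', 'suspect', or 'clean' from MULTI occurrence counts.

    `counts` is a hypothetical, already-parsed result such as
    {"IP": 4, "MAIL": 26, "NAME": 0}; missing keys count as zero.
    """
    mail = counts.get("MAIL", 0)
    ip = counts.get("IP", 0)
    name = counts.get("NAME", 0)

    # Email and IP are the strong indicators: a match on either is enough.
    if mail > 0 or ip > 0:
        return "bot"
    # A name match alone is too unreliable to block on; flag it for review.
    if name > 0:
        return "suspect"
    return "clean"


# Hypothetical parsed results, for illustration only.
examples = [
    {"IP": 4, "MAIL": 26, "NAME": 0},
    {"IP": 0, "MAIL": 0, "NAME": 12},
    {"IP": 0, "MAIL": 0, "NAME": 0},
]
results = [classify_multi(c) for c in examples]
print(results)
```

A stricter or looser policy (for example, requiring a minimum occurrence count) would only change the comparisons inside the function.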

Summary
In general, the email address is the single most reliable indicator of a bot: 99.9% or better. The IP address is the second most reliable indicator: 95% or better. The name is the least reliable indicator. It varies too much to be assigned a real percentage, but it is possibly around 30%.

Because bots routinely change their IPs, names, and emails, there's a very good chance that testing for a unique 3-item combination won't return a match and the registration would go through, which is exactly what we don't want. In fact, you radically reduce the likelihood of catching them with each additional item you require a match on.

If you test for any one of those items, you stand a very good chance of catching them. Testing for two items drops that to ~25% or so, based on all of the unique bot signatures in our database so far (almost 95,000). Testing for a positive match on all three items causes the probability to drop way, way down. A match on three items would be a guaranteed bot, but you'd be very lucky if you nailed them the first time through...and all they need is one successful registration.
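
The effect described above can be illustrated with a little probability. This is a sketch only: the per-item match probabilities are made-up placeholders, and treating the three items as independent is a simplifying assumption. Neither comes from BotScout data.

```python
from math import prod


def catch_if_any(probs):
    """Chance that at least one tested item matches (independence assumed)."""
    return 1 - prod(1 - p for p in probs)


def catch_if_all(probs):
    """Chance that every tested item matches (independence assumed)."""
    return prod(probs)


# Placeholder probabilities that a rotating bot's current email, IP, and
# name are already in the database. Illustrative values, not measurements.
p_mail, p_ip, p_name = 0.6, 0.5, 0.2

# Columns: number of items tested, catch rate if any match counts,
# catch rate if every item must match.
for items in ([p_mail], [p_mail, p_ip], [p_mail, p_ip, p_name]):
    print(len(items), round(catch_if_any(items), 3), round(catch_if_all(items), 3))
```

Requiring every item to match multiplies the probabilities together, so each added item can only lower the catch rate, while accepting a match on any single item can only raise it.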

« Last Edit: February 11, 2009, 05:43:22 AM by Mike »
Please don't PM me for assistance- post your questions in the forum where others can see them.

Mike

Re: API Query Discussion & Info
« Reply #1 on: January 29, 2009, 06:50:25 PM »
Additional Note:

Testing an email or IP address all by itself will give you a very good likelihood of detection with a very, very low false positive rate (well under 1% according to my partner). It's what we recommend for most users.

Testing a second item really doesn't buy you that much; in fact, in some cases it actually confuses the issue. It's like having two watches that disagree: which one is right? Let's say the IP matches but the email doesn't (or vice versa)...now what do you do?