Web Hosting Info

Search:

featured partner

The IP to Country Database

  Forum Topics : Development / Much better performance (30x average speedup) with MySQL
Submitted by pagesweb on Fri, 09/30/2005 - 15:17.
I found a simple way to significantly speedup MySQL query based IP 2 Country.

It involves adding an indexed partial IP field to the database.

Since field equality is much faster to process, the numeric comparisons are narrowed down to the few records that share the same partial IP value instead of being carried out on the whole database. The bigger the partial IP field, the faster, but the database size will also increase. I found out that dividing the IP space in 16384 parts is a good compromise, giving significant speedup while adding about 10000 records to the table. Table size, including index is about 2.5Mo

Speedup is at least 5x (in some very crowded areas) and reaches more than 60x on large blocks (most traffic comes from large blocks). It's not as fast as some 'C' implementations, but it can be implemented very easily as an upgrade for people already using MySQL and does not require additional RAM.

Here is the modified database (zipped) : http://pages-web.com/data/ip-to-country-s.zip (based on the Sept 13, 2005 release)

Here is the table structure :

CREATE TABLE `ip2c` (
`IP8` smallint(5) unsigned zerofill NOT NULL default '00000',
`IP_FROM` int(10) unsigned zerofill NOT NULL default '0000000000',
`IP_TO` int(10) unsigned zerofill NOT NULL default '0000000000',
`COUNTRY_CODE2` char(2) NOT NULL default '??',
`COUNTRY_CODE3` char(3) NOT NULL default '???',
`COUNTRY_NAME` varchar(50) NOT NULL default '?',
KEY `IP8` (`IP8`)
) TYPE=MyISAM;

- the field was called IP8 since the original implementation used the 8 most significant bits

- use IP8 as an index, else no speedup at all !

Here is the unoptimized (normal) query :

SELECT COUNTRY_NAME FROM ip2c WHERE IP_FROM <= IP_Address and IP_TO >= IP_Address

On my 2.4G Server, execution time is 45 msec / query (a sluggish 22 queries/sec)

Here is the optimized query using the additional field :

SELECT COUNTRY_NAME FROM ip2c WHERE IP8 = IP_Address>>18 and IP_FROM <= IP_Address and IP_TO >= IP_Address

With the same machine, execution time varies from 0.8 msec to 7 msec (in some extreme cases). On a representative set of IP addresses from North America and Europe, average performance is 1.6 msec / query (about 600 queries / sec)

The index size was chosen as a compromise between database size increase (67369 records instead of the original value of 58850) and speedup.

A 100K records database would cut the worst case time in half (but effect on average speed is much lower).

The updating tool (Windows based) that creates the extra field is herein placed in the public domain :

http://pages-web.com/data/geo_ip.exe

It must be placed in the same directory as the ip-to-country.csv file and it will create a file called ip-to-country-s.csv

The file can be used directly to fill the table with phpMyAdmin.

If there is enough interest, a version that generates other formats (including SQL commands) could be made.

Comment viewing options:
Select your preferred way to display the comments and click 'Save settings' to submit your changes.
Nice...
Posted by TheJohnDoe2005 on Tue, 10/04/2005 - 09:43.
My method of indexing is almost similar to yours but I use the 1st 8 bits of the IP Address instead of the 1st 14 bits (2^14) probably because I didn’t want to generate another additional few thousands entries. Yes it is true if you have indexed in your way, the searching of the IP address to the Country would be more efficient and faster. The 14 bits would also be my cut off point for the amount of indexes to be generated too if I want to. (Anything more than 14 bits seems too much…) (Of course you could do the whole 32 bits if you have the storage space to generate all the IP Address from 0.0.0.0 to 255.255.255.255 …but that wouldn’t be practical at all!). Your method includes shifting of 18 bits to the right which the IP_ADDRESS (i.e. 171.0.0.1) must be converted to binary values first and then convert to decimal value for comparison with the IP8. (Correct me if I am wrong…if it’s true…then you would have to do some additional mathematical calculations for conversion??). I prefer to keep mine as simple as possible.

First side issue is that the *.CSV (Comma Separated Values File) file as posted is separated by semi-colon which some excel have problem opening correctly. You might want to change it to comma instead. While opening excel, it would only display up to 65565 entries. It will not display the entire 67369 entries (as listed above)…but reassure that all the entries are all there.

Original Data
"129";"33996344";"33996351";"GB";"GBR";"UNITED KINGDOM"

Change To
"129","33996344","33996351","GB","GBR","UNITED KINGDOM"
PHP-ized script to create the enhanced tables
Posted by ChanceZeus on Tue, 10/04/2005 - 20:01.
Since I had some time to spare and wanted to update my bosses stats counter I have created an php-script that can be run from the command line to download the zip-file, extract it, read in the csv file and while reading it, adding the enhanced data to a database table...
Config is quite simple, you only have to enter the mysql database data and change the amount of significant bits (if required).
The table layout is almost the same as previously suggested, except that the index field carries the amount of bits in its name (i.e. 14 bits means IP14 as the extra index field, 8=>IP8)...
The top line (with #! in front) tells the shell what interpreter to use, if your php installation is installed elsewhere, you need to change this line too...
The script expects it is running on a linux machine (though it might be rewritten to support windows)...
Well I think I've covered all the basics so heres where to get it:
http://chancezeus.student.utwente.nl/ip2country.zip
Using the IP14
Posted by dupex on Fri, 10/28/2005 - 07:32.
Hi!

I'm using the php script created ip14 database,
and notice a very nice speed improvement!

I use this command when searching:
SELECT COUNTRY_NAME FROM ip2c WHERE IP14 = IP_Address>>18 and IP_FROM <= IP_Address and IP_TO >= IP_Address

But I don't understand what this does:
IP_Address>>18

and if it has to be changed when using IP14
(the code were meant for IP8)

So I would be grateful if anyone could clarify the number 18,
and if I need to change it when using 14 bits - tnx!
WHERE IP8 = IP_Address>>18
Posted by dupex on Fri, 10/28/2005 - 15:47.
According to this:
	/**
	 * Creator:   Mark van Beek
	 * Filename:  get_country_list.php
	**/
	/*
	Special script to convert and then import the ip2country database
	Enhanced speed through creation of indexes that contain the first 14bits of the ip-address
	Using this index speeds the search process up to 60x (average 30x)
	SELECT COUNTRY_NAME FROM `ip2c` WHERE
		`IP14` = IP_Address>>18 and `IP_FROM` <= IP_Address and `IP_TO` >= IP_Address
	*/
	
	//Default amount of bits to use of the ip-address 0 to 32 (0 is no change, al indexes will be 1)
	//higher=faster searching but bigger table
	//lower=smaller table, but longer search times
	$ip_bits=14;
	$shiftbits=32-$ip_bits;
the "WHERE IP8 = IP_Address>>18" in the original post is wrong, and should be "WHERE IP8 = IP_Address>>24".

since
$shiftbits=32-$ip_bits;

Efficient Storage
Posted by McAfee on Sat, 12/10/2005 - 13:32.
First of all, it's a good idea to normalize your table. This will make the data a lot smaller. Later on you can use JOINs to retrieve the data. Basicly you'll want to store the 2 char CC on the range table, and another table holding the rest of the codes. The code table will usually have less than 250 rows. This also makes the main table Static, instead of Dynamic.

Here's the MySQL script I use to load the CSV. The script will load all the data as normal, and then split the main table by normalizing it. Two columns will be removed from the main table afterwards.
(adjust the script to your needs)
USE `rads`;
DROP TABLE IF EXISTS `iptoc`;
DROP TABLE IF EXISTS `iptoc_codes`;

CREATE TABLE `iptoc` (
  `ip_from` int(10) unsigned NOT NULL default '0',
  `ip_to` int(10) unsigned NOT NULL default '0',
  `country_code2` char(2) NOT NULL default '',
  `country_code3` char(3) NOT NULL default '',
  `country_name` varchar(50) NOT NULL default '',
  PRIMARY KEY  (`ip_from`,`ip_to`),
  UNIQUE KEY `SECONDARY` (`ip_to`,`ip_from`),
  KEY `index_cc2` (`country_code2`)
) ENGINE=MyISAM;

LOAD DATA LOCAL INFILE 'C:/download/httpserver/ip-to-country/ip-to-country.csv'
 REPLACE INTO TABLE `iptoc`
 FIELDS
  TERMINATED BY ','
  ENCLOSED BY '"'
  ESCAPED BY '\\'
 LINES
  STARTING BY ''
  TERMINATED BY '\r\n';

CREATE TABLE `iptoc_codes` (
  PRIMARY KEY (`country_code2`)
) ENGINE=MyISAM
SELECT `country_code2`,`country_code3`,`country_name` FROM `iptoc` GROUP BY `country_code2` ORDER BY `country_name` ASC;

ALTER TABLE `iptoc` DROP COLUMN `country_code3`, DROP COLUMN `country_name`;

OPTIMIZE TABLE `iptoc`,`iptoc_codes`;

That's what I've been using for a long time now, but now I've started experimenting with the tips stated on this forum topic.

I took the efficient storage approach. I decided to store the first 16 bits in the column `ip16`, and re-adjust the `ip_from and `ip_to` columns so they are 16 bits in size.

So eventhough my row count went from 62448 to 98355 (+57.5%) my data_length went from 686928 to 885195 (+28.9%). Speed, on the other hand, improved considerably. (Statistics based on Nov15 IPTOC table, Both tables already normalized with the script above.)

Here's my optimized table definition:
CREATE TABLE `iptoc_opt` (
  `ip16` smallint(5) unsigned NOT NULL default '0',
  `ip_from` smallint(5) unsigned NOT NULL default '0',
  `ip_to` smallint(5) unsigned NOT NULL default '0',
  `country_code2` char(2) NOT NULL default '',
  PRIMARY KEY  (`ip16`,`ip_from`,`ip_to`),
  UNIQUE KEY `SECONDARY` (`ip16`,`ip_to`,`ip_from`)
) ENGINE=MyISAM

In both the optimized table and the old one, note the use of 2 unique keys (primary and "secondary"). These take advantage of how MySQL Range optimization works. MySQL will use the key that will provide the smalles range. The 3rd column specification is rarely used, and it exists mainly to guarantee uniqueness.

I used a PHP script to generate the new data. But this script is not very solid, and it's not the most reliable way to generate data. So I don't think it's safe to provide the PHP script (for now).
 
Efficient Storage, Side note
Posted by McAfee on Sat, 12/10/2005 - 14:40.
Same experiment using a smaller `IP8` column and and 24 bits for the range (tinyint, mediumint), actually resulted in a slightly smaller data_length and index_length, compared to a regular normalized table. Row count increased by less than 1000.

Speed improved considerably as well, but not as much as with the previous format. So I think the wasted space of ip16 is worth it. IP16 just takes an extra 200 kbytes of storage space, compared to a regular normalized table.

IP8 is not worth it, when compared to IP16. Maybe IP24 could be another solution. That would be using mediumint, and then tiny for the ranges. Speed will surely keep improving, but that will really increase the required storage. Haven't tested it, but will probably go over 16 million rows.
 
Java Tool
Posted by McAfee on Wed, 12/14/2005 - 21:42.
This java tool will generate the new CSV file.
http://mcafee.servegame.com/download/optimize.jar (21.2k)

To use it, use this command on the console:
java -jar optimize.jar

Another option is to extract it (JAR uses ZIP compression) and run this command:
java optimize

The tool will provide help when used with no parameters. You can specify how many bits you want for the new column. Output can be sent to a file or to the console. And, you can choose to output normalized data.

The normalized option will not output the country name nor the CC3, but should make the output a lot smaller. You will need to create the "codes" table using another method. An easy way to do it is thru mysql script posted before.

Consider it beta! The script does not output pretty errors by design. I let java handle all the errors. This script was designed with jdk1.5.0_06, but compiled for 1.4. Tested on Windows XP and linux (knoppix).

CAUTION: bit values above 16 will take a large ammount of storage space, and long processing time. But you could always cancel the process with Ctrl+C or Ctrl+Break.

---

In regards to the keys I was using before, I've re-thinked it and decided they were not very efficient. For both examples, remove the last column on the primary and "secondary" keys.
Another New Approach
Posted by McAfee on Fri, 12/16/2005 - 01:08.
I've started working on another new approach. The base of it is ignoring the "ip_to" column entirely. Basicly I changed the 2nd WHERE condition, to the combination of ORDER BY and LIMIT. The select statement is changed to something like this:
SELECT * FROM ip2c WHERE ip_from <= IP_Address ORDER BY ip_from DESC LIMIT 1
How it works?
The WHERE will take care of listing all rows who's IP from is lower or equal than the one you provided. From this list it takes the last one, which is the country you want.

Pros:
* Speed is improved considerably.
* No redundant data has to be added.
* You can actually use the exact same database.

Cons: (Only 1, and fixable)
* A match might be returned in situations where it used to return an empty set. Example: 127.0.0.1 will return "Japan" instead of an empty set.

The Fix:
We won't use the "ip_to" column for the query, but we will use it as data. Within the program, we could compare the resulting range values with the IP.

If it's in range the country is valid.
If NOT, the database actually doesn't contain a match.

The 2nd statement can be proved with a couple of Logic Statements, but I'm not going into details about that.

PS: Remember to use KEYS (INDEXES) to improve performance. And normalization always helps.
 
Same thing, tweaked for MySQL
Posted by McAfee on Sat, 06/17/2006 - 03:57.
Has anyone analized this method yet? I get a performance boost, but would like it if a DB expert gave his opinion. Basicly, I would like to know if there are any drawbacks. I still haven't found any other Query that matches this speed, while still using the same compacted DB.

Anyways, here's an updated way to do the same thing. If you look at the "old method", you'll notice extra steps had to be done for the ip_to column. In this "updated method", everything is done withing MySQL:
SELECT * FROM (SELECT * FROM ip2c WHERE ip_from <= IP_Address ORDER BY ip_from DESC LIMIT 1) AS `a`
WHERE ip_to >= IP_Address
NOTE: Where "IP_Address" is actually an integer (longip)
Why database?
Posted by Daath on Mon, 02/06/2006 - 07:21.
Why use a MySQL database? That will only slow things down ;) If you haven't seen it, try our implementation here!
The data is based on this site's CSV, and all code is open :)

-
Any technology distinguishable from magic, is insufficiently advanced.
 
Any comments on this method?
Posted by fambi on Sat, 02/18/2006 - 01:23.
Does anyone have anything to say about this method?

We are looking to speed up our ip-country lookups at our sms gateway, because our current lookups are taxing the db unnecesarily.

We thought about moving the relevant tables to an external db, but would rather not!

Any comments would be nice?
Loading into RAM
Posted by raddanesh on Tue, 10/17/2006 - 13:15.
Why using a MySQL database?
another way is to load all data into the RAM... then you will have the maximum speed of Response in your machine!
I've written a Windows Service in C# which reads the whole csv file into the memory at startup.

you can call a method in your Win Service using TCP/IP and a port which is defined in your Code.
The method of finding the Country code is very simple. it is exactly based on .NET SortedList class and hopefully it is the best!!!

I can send you my project via Email. (contact@mypardis.com)