Web Hosting Info

The IP to Country Database

  Forum Topics : Development / Non-Mysql Country Detection
Submitted by drterm on Sat, 07/12/2003 - 22:47.
Hello. This code reads the country database into memory and uses a binary search algorithm (I think that is what it's called).

On my p2 233, with 128 megs of ram, it takes 0.7 seconds.
On my friend's hosted site at hostrocket.com, it takes 0.2 seconds.

I'll give you guys the full code, complete with execution time checking. It's simple to change the function around how you see fit.

#exec time begin.
$timeparts = explode(" ",microtime());
$starttime = $timeparts[1].substr($timeparts[0],1);

function findcountry($ip){
    $countrydb = file("ip-to-country.txt");
    $ip_number = sprintf("%u",ip2long($ip));

    #binary search attempt
    $low = 0;
    $high = count($countrydb) -1;

    $count = 0;
    while($low <= $high) {
        $count++;
        $mid = (int) floor(($low + $high) / 2);  // "/" yields a float in PHP, so floor it and cast to int for use as an array index
        $num1 = substr($countrydb[$mid], 1, 10);
        $num2 = substr($countrydb[$mid], 14, 10);
        if($num1 <= $ip_number && $ip_number <= $num2){
            #start at 27 go 2
            #substr($$countrydb[$mid], 27, 2);
            print "Found your country: " . substr($countrydb[$mid], 27, 2);
            break;
        } else {
            if ($ip_number < $num1) {
                $high = $mid - 1;
            } else {
                $low = $mid + 1;
            }
        }
    }
    print "<br>\nlines checked:$count total lines in file:"
    .count($countrydb)
    ." filesize: ".filesize("ip-to-country.txt")." bytes";
}

findcountry($_SERVER['REMOTE_ADDR']);
#findcountry("150.101.177.134");

#exec time end.
$timeparts = explode(" ",microtime());
$endtime = $timeparts[1].substr($timeparts[0],1);
echo "<br>exec time: ".round($endtime - $starttime, 5);
The PHP.net way of doing it!
Posted by sandeep on Mon, 07/14/2003 - 03:05.
The PHP.net guys use a special format of the IP-to-Country Database along with an index file that they create. This means they don't have to read the whole database into memory. Here is a brief explanation of the format and the index file along with code to query the database.

Format
The format of the database is like this:
It has three fields:
  • start of the IP range
  • end of the IP range
  • three letter country code
The first two fields are left-padded with 0 to 10 bytes so that PHP recognizes them as numbers if they are used in a comparison. Therefore, the fixed size of each record is (2*10+3+1) = 24 bytes. There is no separator, since the fixed width of each field makes one unnecessary, and each line is terminated with a '\n'.

The database looks something like this:
00339963440033996351GBR
00503316480083886079USA
00945854240094585439SWE
01006632960121195295USA
...
If you have imported the IP-to-Country Database into a RDBMS, you can simply export it in the above format.
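To make the fixed-width layout concrete, here is a minimal sketch of splitting one record into its fields. The helper name parse_record() and the RECORD_SIZE constant are made up for illustration, and the integer casts assume a 64-bit PHP build (on 32-bit builds the larger IP numbers would not fit in an int).

```php
<?php
// Hypothetical helper, not part of the PHP.net code.
// Layout taken from the description above: 10-digit range start,
// 10-digit range end, 3-letter country code, then "\n".
const RECORD_SIZE = 24; // 2*10 + 3 + 1

function parse_record(string $record): array
{
    return [
        'start'   => (int) substr($record, 0, 10),
        'end'     => (int) substr($record, 10, 10),
        'country' => substr($record, 20, 3),
    ];
}

// First sample line from the database excerpt above.
$fields = parse_record("00339963440033996351GBR\n");
printf("%d-%d %s\n", $fields['start'], $fields['end'], $fields['country']);
```

Because every record has the same length, record number n (counting from 0) always starts at byte n * RECORD_SIZE, which is what makes fseek() usable later on.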

Index
The index tells us where to start searching for an IP number, so that we can fseek() to that position and then perform a normal linear search. This is made possible by the fixed size of each record: (2*10+3+1) bytes.

The index file looks something like this:
10000000
0,0
3,1
5,2
9,3
10,4
...
The first line of the index is the granularity (The code below will explain what it is).


Here's the code to create the index:
This is based on the code written by Gabor Hojtsy [goba@php.net]
for the PHP.net website and mirror sites in 2003.

// Step value used to create the index
define("IDX_GRANULARITY", 10000000);
define("IDX_LENGTH", strlen(IDX_GRANULARITY));

// Filenames
define("IPDB", "ip-to-country.db");
define("IPIDX", "ip-to-country.idx");

// Index the ip-to-country database with the given
// granularity, and save the index in a file
function indexer()
{
// Last indexed number and last record number
$lastidx = 0; $recnum = 0;
 
// We store the IDX in a PHP array temporarily

$idx_list = array("0,0");

// Open database for reading
$ipdb = fopen(IPDB, "r");

// Return with error in case of we cannot open the db
if (!$ipdb) { return FALSE; }

// While we can read the file
while (!feof($ipdb)) {
    
    // Get one record
    $record = fread($ipdb, 24);
    
    // Unable to read a record and not at end => error
    if (strlen($record) != 24 && !feof($ipdb)) { return FALSE; }
    
    // This is a new record
    $recnum++;
    
    // Get the start of the range for this record

    $range_start = (float) substr($record, 0, 10);
    
    // If this range starts a new step with our granularity,
    // add a new element to the index array
    if (intval($range_start / IDX_GRANULARITY) > $lastidx) {
	$lastidx = intval($range_start / IDX_GRANULARITY);
	$idx_list[] = "$lastidx,$recnum";
    }
}

// Close the database file
fclose($ipdb);

// Write out index to file
$idx = fopen(IPIDX, "w");
if (!$idx) { return FALSE; }
fwrite($idx, join("\n", $idx_list));
fclose($idx);

// Success
return TRUE;
}

Querying the database
The code to query the database is in the PHP.net CVS:
http://cvs.php.net/cvs.php/phpweb/include

Go through the file ip-to-country.inc to see how it's done.
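For readers who just want the general idea, the following is a rough sketch of an index-then-linear-scan lookup. It is not the PHP.net code: build_index() and lookup() are hypothetical names, the index is kept in a PHP array instead of a file, record numbers are counted from 0, and a 64-bit PHP build is assumed. The scan deliberately starts one record before the indexed position, because a range that starts in an earlier granularity step can still contain the IP.

```php
<?php
const RECORD_SIZE = 24; // 10 + 10 + 3 + "\n"

// Map each granularity step to the 0-based number of the first record
// whose range starts inside that step.
function build_index(string $dbFile, int $granularity): array
{
    $index = [];
    $db = fopen($dbFile, 'r');
    if (!$db) { return $index; }
    for ($n = 0; strlen($record = (string) fread($db, RECORD_SIZE)) === RECORD_SIZE; $n++) {
        $step = intdiv((int) substr($record, 0, 10), $granularity);
        if (!isset($index[$step])) {
            $index[$step] = $n;
        }
    }
    fclose($db);
    return $index;
}

// Return the country code for $ip (as an unsigned number), or null if no range matches.
function lookup(string $dbFile, array $index, int $granularity, int $ip): ?string
{
    // Find the nearest indexed step at or below the step the IP falls in.
    $step = intdiv($ip, $granularity);
    while ($step > 0 && !isset($index[$step])) {
        $step--;
    }
    // Start one record earlier: that range may spill over into this step.
    $first = max(0, ($index[$step] ?? 0) - 1);

    $db = fopen($dbFile, 'r');
    if (!$db) { return null; }
    fseek($db, $first * RECORD_SIZE);
    $country = null;
    while (strlen($record = (string) fread($db, RECORD_SIZE)) === RECORD_SIZE) {
        if ((int) substr($record, 0, 10) > $ip) {
            break; // ranges are sorted, so the IP is not in the database
        }
        if ($ip <= (int) substr($record, 10, 10)) {
            $country = substr($record, 20, 3);
            break;
        }
    }
    fclose($db);
    return $country;
}

// Demo data: the four sample records shown in the Format section.
$dbFile = tempnam(sys_get_temp_dir(), 'ipdb');
file_put_contents($dbFile, implode("\n", [
    '00339963440033996351GBR',
    '00503316480083886079USA',
    '00945854240094585439SWE',
    '01006632960121195295USA',
]) . "\n");

$granularity = 10000000;
$index = build_index($dbFile, $granularity);
echo lookup($dbFile, $index, $granularity, ip2long('4.1.1.1')), "\n"; // prints "USA"
```

A larger granularity gives a smaller index but a longer linear scan per lookup; a smaller one does the opposite.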

 
Re: The PHP.net way of doing it!
Posted by Frans on Wed, 08/13/2003 - 16:38.
Thanx for this insight Sandeep.
Anybody any details about speed?
 
Posted by sandeep on Thu, 08/14/2003 - 09:17.
Haven't checked it myself, but from what I remember, the PHP.net guys reported about 0.03-0.05 seconds for a query on a Windows machine.
 
I tested the PHP.net way.
Posted by Frans on Thu, 08/21/2003 - 07:58.
I tested the PHP.net way.

Here are my results

query time: 0.01 - 0.05
GRANULARITY: 2097152 (32 * 256 * 256)
db-size: 952.066 bytes

hardware
AMD K6-3 450Mhz
256MB
60 gig Samsung HDD

Query time differs a lot, it depends on where the wanted ip is in the range, selected by the granularity number.

example
200.31.255.255 -> 0.05393 sec
(at the end of index and at the end of a range)
4.1.1.1 -> 0.00965 sec
(at start of index and at start of range)

I'm looking for ways to improve these results.
One of the ideas is to start at the end of a range and loop backwards if the long_ip > start_range + (granularity / 2).
This must lower the number of freads.

Of course, this can be done with the index too.

Anybody have more ideas?
 
Conversion...
Posted by koko on Fri, 08/22/2003 - 07:14.
A few lines that turn the .csv file directly into sandeep's database format shown above (you don't need an RDBMS):
<?php
set_time_limit(800); /* wouldn't work in safe mode */
$a=file("ip-to-country.csv") or die("no db");

for($i=0;$i<count($a);$i++){

$a[$i]=str_replace("\"","",$a[$i]);
$b=explode(",",$a[$i]);

while( strlen($b[0])<10 ){ $b[0]='0'.$b[0]; } 
while( strlen($b[1])<10 ){ $b[1]='0'.$b[1]; } 

$a[$i]=$b[0].$b[1].$b[3];
} 
$a=implode("\n",$a); 
$fd=fopen("ip-to-country.txt","w") or die("no permission to write");
$fout=fwrite($fd,"$a\n");fclose($fd);

print "ok";
?>
 
Obviously there are different
Posted by strummer on Sat, 08/23/2003 - 04:44.
Obviously there are different ways of doing this. Here is another way:
...

define('IPCSV', 'ip-to-country.csv');
define('IPDB', 'ip-to-country.db');

if(!$ipcsv = fopen(IPCSV, 'r'))
{
    // couldnt open csv file. die or something.
    die('unable to open ' . IPCSV . ' for reading');
}

if(!$ipdb = fopen(IPDB, 'w'))
{
    // couldnt open db file. die or something.
    die('unable to open ' . IPDB . ' for writing');
}

//note: using line length of 150. More than enough, for now.
while($fields_in = fgetcsv($ipcsv, 150, ','))
{
    $fields_in[0] = str_pad($fields_in[0], 10, '0', STR_PAD_LEFT);
    $fields_in[1] = str_pad($fields_in[1], 10, '0', STR_PAD_LEFT);            
    // Index 3 is the three-letter code, which keeps each record at 24 bytes
    $record = $fields_in[0] . $fields_in[1] . $fields_in[3] . "\n";
    fwrite($ipdb, $record);//could concatenate here but maybe not so clear
}

fclose($ipcsv);
fclose($ipdb);

 
Yep
Posted by koko on Sat, 08/23/2003 - 07:03.
and works faster than mine :)
That's good stuff, thanks guy
Posted by Benja on Fri, 08/15/2003 - 13:27.
That's good stuff, thanks guys.

I'm quite surprised by the execution time of the first script. 0.7 seconds ? It's fucking long...
Why all those steps?
Posted by mdsjack on Sun, 11/02/2003 - 04:29.
Hi, I don't really consider myself a dev, just a user...
I don't understand why you would do all those steps to get an execution time of 0.01 to 0.05 seconds if the function below gives me 0.05 (min - avg)...
	function ip2country()
	{
		global $config;
		$ip		= sprintf('%u', ip2long($_SERVER['REMOTE_ADDR']));
		$csv	= fopen($config->ip2country, 'r');
		while ($line = fgetcsv($csv, 1024))
		{
			list ($from, $to, $code1, $code2, $code3) = $line;
			if (($from <= $ip) and ($to >= $ip))
			{
				fclose($csv);
				return strtolower($code1);
			}
		}
		fclose($csv);
		return '';
	}
 
Because it is more elegant.
Posted by strummer on Mon, 11/03/2003 - 14:10.
Personally I always disregard any benchmark figures when I see them unless the testing environment has been fully documented, using a proven benchmark framework etc etc. Even then I will never be certain unless I actually try it myself.

The scripts here detail methods that can be used to create an index from the csv file from which subsequent methods can directly locate a country code using this index. This will always be more efficient and less resource intensive than reading a file, line by line, potentially until eof on a file that is now ~2MB. As your script works for you then fine but if you are using it on a busy website you will eventually find that it is a bottleneck.

It is not clear what object $config is and it is obviously set elsewhere. Anyone uncertain what to do could replace fopen($config->ip2country, 'r') with fopen('/path/to/csv/file', 'r') and remove the global $config; line. The function could be called with $country_code = ip2country();

mdsjack: stay away from globalising variables where possible. IMO you should remove that global $config line and pass $config to this function as a reference ie. function ip2country(&$config) and then call with something like $country_code = ip2country($config)
Also, you cannot rely on key REMOTE_ADDR to have the ip address and should be checking keys HTTP_CLIENT_IP and HTTP_X_FORWARDED_FOR too.
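A minimal sketch of that header check follows. The function name client_ip() is made up for illustration, and it only accepts IPv4 values because the database is IPv4-only. Keep in mind that HTTP_CLIENT_IP and HTTP_X_FORWARDED_FOR are filled in from request headers that the client or any proxy can set to anything, so the result is fine for guessing a country but should not be trusted for anything security-related.

```php
<?php
// Hypothetical helper: return the first syntactically valid IPv4 address
// found in the forwarding headers, falling back to REMOTE_ADDR.
function client_ip(array $server): string
{
    foreach (['HTTP_CLIENT_IP', 'HTTP_X_FORWARDED_FOR'] as $key) {
        if (empty($server[$key])) {
            continue;
        }
        // X-Forwarded-For may hold a comma-separated chain of addresses.
        foreach (explode(',', $server[$key]) as $candidate) {
            $candidate = trim($candidate);
            if (filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
                return $candidate;
            }
        }
    }
    return $server['REMOTE_ADDR'] ?? '';
}

// Typical call inside a web request:
// $ip = client_ip($_SERVER);
```

Passing the array in as a parameter instead of reading $_SERVER directly follows the same advice as above about avoiding globals, and makes the function easy to try out with made-up input.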

----------
Q. Should I put a witty comment below my post?
A. No.
 
Thanx
Posted by mdsjack on Mon, 12/15/2003 - 04:44.
thanx for answering. couldn't check the forum sooner, sorry.
thanx for the tips, too... i'm aware this is not a masterpiece in php programming but it works quite fine also on slow websites and lets me upload the new csv as is. so if you don't mind i will keep using my function, i'm not sure i can manage to have the other method working.

i didn't know about HTTP_CLIENT_IP and HTTP_X_FORWARDED_FOR, i'm going to check right now how they work.

thanx again.
jack.

ps. forgive my english
 
PHP.net way returns many NA country names
Posted by afroken9 on Mon, 12/15/2003 - 11:03.
I tried using php.net way. Basically it's a success and I am greatly impressed :), but I got quite a number of false NA countries(got USA when using the WebHosting.Info demo using the tracked ip's). I am using the most current ip to country database from WebHosting.Info.

Managed to track the problematic code down here:

while (!feof($ipdb) && !($range_start <= $ip && $range_end >= $ip)) {

// We had run out of the indexed region,
// where we expected to find the IP
if ($idx[1] != -1 && $idx[0] > $idx[1]) {
$country = "NA"; break;
}

put down some example ip's causing the false "NA" country report: 63.202.49.254
66.194.6.75
68.66.216.133
68.134.166.229
68.111.138.35

Anybody know how to correct this? Thanks.
 
False 'NA's
Posted by smith99 on Mon, 09/13/2004 - 06:31.
There are at least 3 problems with accessing the IP to Country DB the PHP.net way.

Problem 1.
The structure of the database is not suited to the index method. The indexes are generated by dividing IP range boundaries by 10000000 (or whatever granularity you choose). This generates indexes that assume that each range in the DB beginning with the indexed value starts on a new line, which is not how it is. Example from the real database:
Line 29: 0214858672  0226293055  US  USA  UNITED STATES
Line 30: 0226293056  0226293119  NL  NLD  NETHERLANDS
In this example, the index file will indicate that the search for an address that starts with 0226 should begin at line 30. If your address happens to be in the range 0226000000 - 0226293055 you will get a false 'NA'.
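The arithmetic behind that example can be checked with a few lines. This is only a worked illustration using the two database lines quoted above and the default granularity; the variable names are made up.

```php
<?php
$granularity = 10000000;

$line29Start = 214858672; // US range, runs up to 226293055
$line30Start = 226293056; // NL range
$ip          = 226000000; // an address inside the US range on line 29

$step29 = intdiv($line29Start, $granularity); // 21
$step30 = intdiv($line30Start, $granularity); // 22
$stepIp = intdiv($ip, $granularity);          // 22

// Line 30 is the first record that starts in step 22, so the index entry
// for step 22 points at line 30, one line past the record that holds $ip.
printf("line 29 -> step %d, line 30 -> step %d, IP -> step %d\n", $step29, $step30, $stepIp);
```

The IP lands in the same step as line 30 even though it belongs to the range on line 29, so a forward-only scan from the indexed line never sees the matching record.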

This can be fixed by pre-processing the database to ensure that each individual range (with respect to the value used in the index process) starts on a new line. The code below will do that (based on code from one of the previous posts, above). Note, this code is designed to operate on the CSV file as it is downloaded from this site. Also, ensure that the 'granularity' you specify in pre-processing the file is the same as the granularity value you use to index the file.
if(!$csvFh = fopen("$csvFile", 'r')) {
        die("Unable to open $csvFile for reading.");
}
if(!$dbFh = fopen("$dbFile.tmp", 'w')) {
        die("Unable to open $dbFile.tmp for writing.");
}
$gran = 10000000;
//note: using line length of 150. More than enough, for now.
while( $fields_in = fgetcsv($csvFh, 150, ',') ) {
        $fields_in[0] = str_pad($fields_in[0], 10, '0', STR_PAD_LEFT);
        $fields_in[1] = str_pad($fields_in[1], 10, '0', STR_PAD_LEFT);
        $from = intval($fields_in[0] / $gran);
        $to = intval($fields_in[1] / $gran);
        if ( $from < $to ) {
                $diff = $to - $from;
                $r = $fields_in[0] .
                     str_pad(($from * $gran + $gran - 1), 10, '0', STR_PAD_LEFT) .
                     $fields_in[3]. "\n";
                fwrite($dbFh, $r);
                for ( $i = 1; $i < $diff; $i++ ) {
                        $r =  str_pad((($from + $i) * $gran), 10, '0', STR_PAD_LEFT) .
                              str_pad((($from + $i) * $gran + $gran - 1), 10, '0', STR_PAD_LEFT) .
                              $fields_in[3] . "\n";
                        fwrite($dbFh, $r);
                }
                $r = str_pad(($to * $gran), 10, '0', STR_PAD_LEFT) . 
                     $fields_in[1] . $fields_in[3]. "\n";
                fwrite($dbFh, $r);
        } else {
                $record = $fields_in[0] . $fields_in[1] . $fields_in[3] . "\n";
                fwrite($dbFh, $record);
        }
}
fclose($csvFh);
fclose($dbFh);
A (probably better) alternative would be to change the index code (see one of the previous posts). Look for the section:
// Get the start of the range for this record
$range_start = (float) substr($record, 0, 10);

// If this range starts a new step with our granularity,
// add a new element to the index array

if (intval($range_start / IDX_GRANULARITY) > $lastidx) {
    $lastidx = intval($range_start / IDX_GRANULARITY);
    $idx_list[] = "$lastidx,$recnum";
}
And change to:
// Get the start and end boundaries for this record
$range_start = (float) substr($record, 0, 10);
$range_end = (float) substr($record, 10, 10);

// If this range starts a new step with our granularity,
// add a new element to the index array
if (intval($range_start / IDX_GRANULARITY) > $lastidx) {
    $lastidx = intval($range_start / IDX_GRANULARITY);
    $idx_list[] = "$lastidx,$recnum";
} elseif (intval($range_end/IDX_GRANULARITY) > $lastidx) {
    $lastidx = intval($range_end / IDX_GRANULARITY);
    $idx_list[] = "$lastidx,$recnum";
}
Problem 2.
If you are using the search code from the PHP.net site http://www.php.net/source.php?url=/include/ip-to-country.inc there is an error in the function named 'i2c_search_in_db($ip, $idx)'. Replace the code:
// Jump to record $idx
fseek($ipdb, $idx[0]*24);
which will fseek 1 line too far, with:
// Jump to record $idx
fseek($ipdb, ($idx[0] - 1) * 24);
Problem 3.
Going through some proxies (mine at least) will cause a value of 'unknown' to be the first value in $_SERVER['HTTP_X_FORWARDED_FOR'] which will break another piece of the PHP.net search code in the function i2c_realip().
if (!eregi ("^(10|172\.16|192\.168)\.", $ips[$i])) {
...
Change to :
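// The dot after each private prefix stays mandatory, so "10" cannot also match public addresses such as 104.x.x.x
if (!preg_match('/^(unknown$|(10|172\.16|192\.168)\.)/i', $ips[$i])) {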
...
That's it.
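A quick way to sanity-check a filter like the one in Problem 3 is to run it against a handful of addresses. The sketch below wraps an anchored pattern in a hypothetical helper, is_unusable(); the pattern requires the dot after each private prefix so that a public address such as 104.16.0.1 is not caught by the "10" alternative. Like the original, it only covers 172.16.x.x and not the rest of the 172.16-172.31 private block.

```php
<?php
// Hypothetical helper: true when an X-Forwarded-For entry should be skipped,
// either because it is the literal "unknown" or because it is a private address.
function is_unusable(string $entry): bool
{
    return preg_match('/^(unknown$|(10|172\.16|192\.168)\.)/i', $entry) === 1;
}

$samples = ['unknown', '10.0.0.1', '172.16.4.4', '192.168.1.5', '104.16.0.1', '150.101.177.134'];
foreach ($samples as $entry) {
    printf("%-16s %s\n", $entry, is_unusable($entry) ? 'skip' : 'use');
}
```

With this pattern only the last two sample addresses are reported as usable.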
 
You might want to check this out
Posted by Daath on Wed, 10/13/2004 - 10:56.
You guys might want to check out my open source project, Weird Silence IP to Country - There are solutions for C, C#, Python, PHP. It's extremely fast, while not using a lot of memory.

The pure C version peaks at 3,2 million IP to country lookups per second on my Pentium 4 2 GHz running Windows XP, and 1,3 million lookups on my C3 1GHz linux server. There is a pure PHP version, and a PHP-module written in C...

Anyway, it's free and open source, so go grab it and check it out :)

-
Any technology distinguishable from magic, is insufficiently advanced.
 
GREAT!
Posted by tecM0 on Wed, 11/09/2005 - 10:44.
I fixed my implementation with your suggestions and get no false positives any more! Great!

But I have traced that the code needs 1 to 1000 (average ~90) sequential reads of the dat file to find the IP. My idea is to use a binary search on the dat file (with fixed line length). Then it should be possible to find the IP in at most 16 reads for a file with around 64000 entries, and creating/parsing the index file becomes unnecessary. That may be much faster, no?
Anyway... I'll try it out and report the result here!
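The figure of 16 follows from the standard worst case for binary search, floor(log2(n)) + 1 probes for n sorted records. A short check, using the approximate entry count mentioned above:

```php
<?php
$entries  = 64000; // approximate number of records in the dat file
$maxReads = (int) floor(log($entries, 2)) + 1;

echo "worst case: $maxReads reads\n"; // prints "worst case: 16 reads"
```

Doubling the size of the database adds only one more read to the worst case.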

again: thx a lot for the algorithm and the bugfixes...well done!

--
Sven Truemper
Development and Research
sven@siteforum.com
http://www.siteforum.com
_____________________________________________________
SITEFORUM GROUP - Realtime Business Portals
 
Binary search...faster?!
Posted by tecM0 on Thu, 11/10/2005 - 06:09.
i have tried out my idea and here are the results.

first some results of the idx/dat method with different granularities:

CPU: "Intel(R) Pentium(R) 4 CPU 2.40GHz"
SF Server Version: 4.0.60
OS: SuSE Linux 9.0 (i586)

granularity 100000
total IPs 10000
pass1 failed: 44
false positives: 0
total sequential file reads: 58472
average sequential file reads: 6
average time: 33 ms.
total time: 311.57 s

granularity 1000000
total IPs 10000
pass1 failed: 44
false positives: 0
total sequential file reads: 418673
average sequential file reads: 148
average time: 16 ms.
total time: 99.994 s

granularity 10000000
total IPs 10000
pass1 failed: 44
false positives: 0
total sequential file reads: 4301881
average sequential file reads: 2191
average time: 145 ms.
total time: 332.108 s

granularity 2097152
total IPs 10000
pass1 failed: 44
false positives: 0
total sequential file reads: 1082409
average sequential file reads: 115
average time: 12 ms.
total time: 118.056 s.
A granularity of 2097152 works quite well.
The average time of 12 ms per lookup is OK compared to the previously posted PHP results. The function contains a lot of debug/benchmark code, so the average lookup time will be lower in the production version.

and here is the first result of my quite simple binary search method:

no granularity (binary search)
total IPs 10000
pass1 failed: 44
false positives: 0
total sequential file reads: 157340
average sequential file reads: 15
average time: 2 ms.
total time: 20.782 s.

As you can see, it's much faster, with fewer sequential reads!
The result is also somewhat skewed by debug/benchmark code, so the final version runs in under 2 ms per lookup on average.

The function needs only the same dat file as the idx/dat method, which should look like this:

00339963440033996351gb
00503316480069956103us
00699561040069956111bm
00699561120083886079us
00945854240094585439se
01006632960121195295us
01211952960121195327it
01211953280152305663us
01523056640152338431gb
...


this is the code of the cleaned lookup function:

$*IP2Country_dev{_ip}
{
	$set{IPnum}{$$ip2long{$get{_ip}}}
	$set{matchedISO}{-1}
	$set{datSize}{$io.getSize{protected://global_files/ip2country.dat}}
	$set{startIndex}{0}
	$set{endIndex}{$toInteger{$divide{$get{datSize}}{23}}}

	$while{1}
	{
		$set{midIndex}{$toInteger{$divide{$add{$get{startIndex}}{$get{endIndex}}}{2}}}
		
		$if{$or{$equals{$get{midIndex}}{$get{startIndex}}}{$equals{$get{midIndex}}{$get{endIndex}}}}
		{
			$break{}
		}
		{
			$set{seqReadS}{$toInteger{$multiply{$get{midIndex}}{23}}}
			$set{seqline}{$io.read{protected://global_files/ip2country.dat}{1}{$get{seqReadS}}{22}}

			$if{$and{$moreOrEqual{$get{IPnum}}{$substring{$get{seqline}}{0}{10}}}{$lessOrEqual{$get{IPnum}}{$substring{$get{seqline}}{10}{20}}}}
			{
				$set{matchedISO}{$substring{$get{seqline}}{20}{22}}  $// .o GOT ISO!
				$break{} 
			} 
			{
				$if{$less{$get{IPnum}}{$substring{$get{seqline}}{0}{10}}}
				{
					$set{endIndex}{$get{midIndex}}	
				}
				{
					$set{startIndex}{$get{midIndex}}	
				}
			}
		}
	}
	{1024} $// theor. max = 32 (255.255.255.255 -> 4294967295)?

	$return{$get{matchedISO}}
}


Questions about the script? Just ask, or get your own evaluation version of the server from our website.

Maybe somebody can confirm my results with a pure PHP version.
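For anyone who wants to try that, here is a rough pure-PHP sketch of the same idea: a binary search done directly on the fixed-width dat file with fseek(), without an index. It is not a port of the script above and has not been benchmarked. The function name ip2country_bsearch() and the DAT_RECORD constant are made up, the record layout (20 digits, 2-letter code, "\n") is taken from the sample lines above, and a 64-bit PHP build is assumed.

```php
<?php
const DAT_RECORD = 23; // 10 + 10 + 2-letter code + "\n"

// Return the 2-letter code for a dotted IPv4 address, or null if not found.
function ip2country_bsearch(string $datFile, string $ip): ?string
{
    $long = ip2long($ip);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    $ipNum = (int) sprintf('%u', $long);

    $fh = fopen($datFile, 'rb');
    if (!$fh) {
        return null;
    }
    $low  = 0;
    $high = intdiv(fstat($fh)['size'], DAT_RECORD) - 1;
    $iso  = null;

    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        fseek($fh, $mid * DAT_RECORD);           // jump straight to record $mid
        $rec   = (string) fread($fh, DAT_RECORD - 1);
        $start = (int) substr($rec, 0, 10);
        $end   = (int) substr($rec, 10, 10);

        if ($ipNum < $start) {
            $high = $mid - 1;                    // continue in the lower half
        } elseif ($ipNum > $end) {
            $low = $mid + 1;                     // continue in the upper half
        } else {
            $iso = substr($rec, 20, 2);          // range contains the IP
            break;
        }
    }
    fclose($fh);
    return $iso;
}

// Demo data: the sample dat lines shown above.
$datFile = tempnam(sys_get_temp_dir(), 'dat');
file_put_contents($datFile, implode("\n", [
    '00339963440033996351gb', '00503316480069956103us', '00699561040069956111bm',
    '00699561120083886079us', '00945854240094585439se', '01006632960121195295us',
    '01211952960121195327it', '01211953280152305663us', '01523056640152338431gb',
]) . "\n");

echo ip2country_bsearch($datFile, '4.1.1.1'), "\n"; // prints "us"
```

Each loop iteration costs one fseek() and one fread(), so the number of file reads grows with the logarithm of the record count rather than with the granularity.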

best regards,

--
Sven Truemper