WhoisCrawler

Whois data harvesting system

Whois Crawler

WhoisCrawler is an open-source project dedicated to building a Unix application that harvests data from whois servers and outputs a database that can be queried much faster than the servers themselves. The Python scripts output a pickled radix tree, which can then be loaded by our other open-source project, CIDR-db Search. The user obtains a pickle file in the supported format and imports the search module in order to query the database by IP.
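
As an illustration, assuming the pickle was produced with the py-radix library (whose Radix objects support pickling), a lookup against it might look like the sketch below; the exact API exposed by CIDR-db Search may differ:

import pickle

# Assumes the pickle is a py-radix Radix tree (py-radix must be installed so
# that pickle can reconstruct it); CIDRS.p is the final output described below.
tree = pickle.load(open('CIDRS.p', 'rb'))

# Longest-prefix match: returns the covering CIDR and the whois data stored on it.
node = tree.search_best('8.8.8.8')
if node is not None:
    print(node.prefix, node.data['asn'], node.data['cc'],
          node.data['reg'], node.data['isp'])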

Data Collection

The current version queries the Team Cymru whois server (cymru-services), which has proven to be the most stable free service available. The crawler harvests data about the CIDRs allocated by the RIRs. Specifically, for each IP it stores the first IP of the network the IP belongs to, the corresponding network (CIDR), the ISP to which it is registered, the country code, the RIR and the AS number.
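
For reference, a single verbose query against the Team Cymru service can be made over the plain whois protocol (TCP port 43). The sketch below only illustrates that interface; it is not the crawler's actual query code, which may use the DNS/dig interface instead:

import socket

def cymru_whois(ip):
    # Open a whois (TCP/43) connection and send one verbose query.
    # Verbose output columns: AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name
    s = socket.create_connection(('whois.cymru.com', 43))
    try:
        s.sendall(('-v %s\r\n' % ip).encode())
        response = b''
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            response += chunk
    finally:
        s.close()
    return response.decode()

print(cymru_whois('8.8.8.8'))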

Operation

The current version takes a distributed approach, creating ten subprocesses of the crawling instance that start querying the server. Every ten minutes the subprocesses exit, each one first creating an individual pickle of the data it has stored. The main process then fires a thread that merges these individual pickles into a larger one and restarts the subprocesses so they keep querying, unless they have already finished.
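
The sketch below illustrates what one crawling subprocess might do during such a ten-minute window; the helper names query_ip and next_ip are hypothetical and stand in for the project's actual lookup and skip logic:

import pickle
import time

import radix  # py-radix, assumed

def crawl_window(worker_id, start_ip, last_ip, minutes):
    # Collect whois data into an individual radix tree until the time limit.
    tree = radix.Radix()
    deadline = time.time() + minutes * 60
    ip = start_ip
    while time.time() < deadline and ip != last_ip:
        record = query_ip(ip)                 # hypothetical: one whois/dig lookup
        if record is not None:
            node = tree.add(record['cidr'])
            node.data.update(asn=record['asn'], cc=record['cc'],
                             reg=record['reg'], isp=record['isp'])
        ip = next_ip(ip, record)              # hypothetical: next IP, or skip the /24
    # Dump this window's results as an individual pickle, e.g. cidrs_3.
    pickle.dump(tree, open('cidrs_%d' % worker_id, 'wb'))
    return ip                                 # the master resumes from here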

Why 10 Minutes?

After a lot of time spent testing different techniques (threaded queries, splitting IP ranges, etc.), every approach showed a steady decline in the number of queries completed per second. We are still trying to explain this; it may be related to how Unix handles Python threads. When many separate processes are invoked to do the job, the Unix kernel manages the dig queries being performed, which turns out to be far more efficient. If a single process starts querying at a rate of 100,000 queries per hour, within a few hours it drops to about 20,000. Restarting that same process with a timeout of a couple of seconds, however, keeps the query rate much steadier.

Invoke Procedure

The Python scripts that crawl the IPs are started with four command-line arguments. The main Python process invokes them using the subprocess module, appending the appropriate arguments like:

python crawl.py subprocess_id starting_ip last_ip time_limit

The master process is initialised with an array containing the IP ranges the crawlers should crawl. Every time a subprocess stops because of the short time frame, it returns the IP at which it stopped. The main process then invokes a new subprocess to continue from that IP until either the last IP in the range or the time limit is reached.
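
A minimal sketch of such an invocation, using the argument order shown above and values taken from the log excerpt further down (the real MasterCrawler code is not reproduced here):

import subprocess

def start_worker(subprocess_id, starting_ip, last_ip, time_limit):
    # Launch crawl.py with the four positional arguments described above.
    return subprocess.Popen([
        'python', 'crawl.py',
        str(subprocess_id), starting_ip, last_ip, str(time_limit),
    ])

# Example: restart worker 0 from the IP it last reported, for ten more minutes.
proc = start_worker(0, '11.199.83.0', '15.0.0.0', 10.0)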

Subprocesses & Main Process

The subprocesses communicate with the main process through a Python socket on port 6667. When the main process has received 10 connections (one per subprocess), it starts the merging thread, which gathers all the individual CIDR pickles into a single one containing the data stored over the past ten minutes, like:

import glob
import os
import pickle
import radix  # py-radix

tree = radix.Radix()  # merged tree for this ten-minute window

# Load each worker's individual pickle (cidrs_0 ... cidrs_9) and merge it in.
for indTree in glob.glob('cidrs_?'):
    temp = pickle.load(open(indTree, 'rb'))
    for n in temp:
        node = tree.add(n.prefix)
        node.data['asn'] = n.data['asn']
        node.data['cc'] = n.data['cc']
        node.data['reg'] = n.data['reg']
        node.data['isp'] = n.data['isp']
    os.remove(indTree)  # the fragment has been merged, drop it

The same process takes place after all subprocesses have finished crawling the IP ranges defined at the start, but this time it takes all the smaller fragments (created every ten minutes) and outputs one large radix tree, named CIDRS.p, containing all the whois data.

Logging

The crawler stores a lot of information, including errors, notifications and warnings, in a log file named logfile.log in the working directory. This file records the uptime, the main process and subprocess IDs (in case they need to be killed manually) and everything else related to the application's behaviour. Its format follows this template:

2014-01-02 16:20:36 GMT [28788]<Notice>: Worker(7)::createPickles(): Creating Queue pickle file queue_7.
2014-01-02 16:20:36 GMT [28866]<Notice>: Worker(8)::createPickles(): Creating CIDRS pickle file cidrs_8.
2014-01-02 16:20:36 GMT [28866]<Notice>: Worker(8)::createPickles(): Creating Queue pickle file queue_8.
2014-01-02 16:20:36 GMT [14346]<Benchmark>: Total queries: 2870112
2014-01-02 16:20:36 GMT [14346]<Benchmark>: Queries in the past ten minutes: 44619
2014-01-02 16:20:36 GMT [14346]<Benchmark>: Total CIDRs found: 225019
2014-01-02 16:20:36 GMT [14346]<Notice>: MasterCrawler::pickleManager(30): Pickling process started.
2014-01-02 16:20:36 GMT [14346]<Warning>: MasterCrawler::crawlersManager(30): Subrocces 0 restarted.
2014-01-02 16:20:36 GMT [30014]<Notice>: Worker(0)::start(): Started crawling from 11.199.83.0 to 15.0.0.0 for 10.0 minutes.
2014-01-02 16:20:37 GMT [14346]<Warning>: MasterCrawler::crawlersManager(30): Subrocces 1 restarted.
2014-01-02 16:20:37 GMT [14346]<Notice>: MasterCrawler::pickleManager(30): Individual pickles collected. Files removed.
2014-01-02 16:20:37 GMT [30028]<Notice>: Worker(1)::start(): Started crawling from 26.177.100.0 to 30.0.0.0 for 10.0 minutes.
2014-01-02 16:20:37 GMT [14346]<Warning>: MasterCrawler::crawlersManager(30): Subrocces 2 restarted.
2014-01-02 16:20:37 GMT [30047]<Notice>: Worker(2)::start(): Started crawling from 42.138.184.0 to 45.0.0.0 for 10.0 minutes.
2014-01-02 16:20:37 GMT [14346]<Notice>: MasterCrawler::pickleManager(30): Pickling process complete. Files tree_state_30 and queue_state_30 created successfully.
2014-01-02 16:20:37 GMT [14346]<Notice>: MasterCrawler::pickleManager(30): Pickle manager out!
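
A logging setup along the following lines would produce entries in roughly this layout (the project's actual logging code is not shown, so treat this as an assumption):

import logging
import time

# Map the standard level names onto the tags used in the excerpt above;
# a custom <Benchmark> level would need its own addLevelName call.
logging.addLevelName(logging.INFO, 'Notice')
logging.addLevelName(logging.WARNING, 'Warning')

formatter = logging.Formatter(
    fmt='%(asctime)s GMT [%(process)d]<%(levelname)s>: %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S')
formatter.converter = time.gmtime            # timestamps in GMT, as above

handler = logging.FileHandler('logfile.log')
handler.setFormatter(formatter)

log = logging.getLogger('WhoisCrawler')
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('MasterCrawler::pickleManager(30): Pickling process started.')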

After Crawling

After all subprocesses have finished querying their ranges and the main process has created the big tree, it restarts the whole procedure, keeping an endless loop of queries (sorry Cymru) in order to pick up any changes in the database. When the new run completes, it compares the new pickled output with the previous one. If there are any changes, the old pickle is archived and the new one becomes the current database; if nothing has changed, the new pickle is discarded.
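
One way such a comparison could be implemented is sketched below; the file name CIDRS_new.p and the archiving scheme are assumptions, not the project's actual behaviour:

import os
import pickle
import time

def trees_differ(old_tree, new_tree):
    # Compare the two radix trees by their prefixes and stored whois data.
    old = dict((n.prefix, n.data) for n in old_tree)
    new = dict((n.prefix, n.data) for n in new_tree)
    return old != new

old_tree = pickle.load(open('CIDRS.p', 'rb'))
new_tree = pickle.load(open('CIDRS_new.p', 'rb'))   # hypothetical name for the fresh run

if trees_differ(old_tree, new_tree):
    os.rename('CIDRS.p', 'CIDRS_%d.p' % int(time.time()))  # archive the old pickle
    os.rename('CIDRS_new.p', 'CIDRS.p')                     # the new one is now current
else:
    os.remove('CIDRS_new.p')                                # nothing changed, discard it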

Queue

Unfortunately, not all IPs belong to a specific CIDR. There are roughly two billion such IPs, so crawling through them one by one would be very time consuming. To avoid wasting too much time on them, if the server does not respond with a valid CIDR for the IP being queried, the system skips the rest of that /24 block. For example, if the process queries 1.1.0.0, the response the Cymru ASN service returns is:

AS      | IP               | BGP Prefix          | CC | Registry | Allocated  | AS Name
NA      | 1.1.0.0          | NA                  | CN | apnic    | 2011-04-14 | NA

In this case, the system will query the first IP after 1.1.0.0/24, i.e. 1.1.1.0. This may introduce inaccuracies into the database, since an IP such as 1.1.0.200 could still return a valid response; but because querying the remaining 255 IPs in every such /24 block would take far too long, we decided to skip them. We assume nobody wants to dedicate resources to such an exhaustive procedure. For those who do want a complete database, however, the system does not discard the IP but stores it in a queue. This queue follows the same pickling rules as the CIDRs and is available for the user to crawl. Another project, coming soon, will take this queue as input and dedicate ten more distributed workers to crawl it and output a new CIDRs file.
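
For illustration, the /24 skip can be computed with the standard-library ipaddress module (the crawler's own IP arithmetic may be implemented differently):

import ipaddress

def next_slash_24(ip):
    # Return the first address of the /24 block following the one containing ip.
    block = ipaddress.ip_network('%s/24' % ip, strict=False)
    return str(block.network_address + 256)

print(next_slash_24('1.1.0.0'))    # -> 1.1.1.0
print(next_slash_24('1.1.0.200'))  # -> 1.1.1.0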

Benchmarking

Coming soon…

What's Next

There are numerous ideas for extending the application, and it has been designed with extensibility in mind!


This project is built and maintained by George Louloudakis (georgelouloudakis@gmail.com) and Panagiotis Kintis (panagiotious@gmail.com). Feel free to contact us with any suggestions, bugs or ideas! http://panagiotious.github.io/WhoisCrawler