by Himanshu Damle
DATACRYPTO is a web crawler and scraper, a class of software that systematically archives websites and extracts information from them. Once a cryptomarket has been identified, DATACRYPTO is set up to log in to the market and download its contents, beginning at a web page specified by the researchers (typically the homepage). After downloading that page, DATACRYPTO parses it for hyperlinks to other pages hosted on the same market and follows each one, adding any new hyperlinks it encounters to its list and visiting and downloading those pages in turn, until no new pages are found. This process is referred to as web crawling. DATACRYPTO then switches from crawler to scraper mode, extracting information from the pages it has downloaded into a single database.
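The crawl loop described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not DATACRYPTO's actual code: the names `LinkParser` and `crawl`, the `market.example` addresses, and the in-memory `SITE` dictionary that stands in for real downloads are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl of a single site; returns {url: html}."""
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:  # stops once no new pages are found
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            # Follow only links hosted on the same site, and only once.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site held in memory in place of real downloads.
SITE = {
    "http://market.example/":
        '<a href="/listing/1">item</a> <a href="/vendor/x">vendor</a>',
    "http://market.example/listing/1":
        '<a href="/">home</a> <a href="http://other.example/">external</a>',
    "http://market.example/vendor/x":
        '<a href="/listing/1">item</a>',
}

pages = crawl("http://market.example/", SITE.__getitem__)
```

Passing the download step in as the `fetch` argument keeps the traversal logic separate from the network layer, so the same loop could in principle be pointed at a real HTTP client routed through Tor.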
One challenge in crawling cryptomarkets arises when, despite appearances to the contrary, the crawler has indexed only a subset of a marketplace’s web pages. This problem is exacerbated by sluggish download speeds on the Tor network which, combined with marketplace downtime, may prevent DATACRYPTO from completing the crawl of a cryptomarket. DATACRYPTO was designed to prevent partial marketplace crawls through its ‘state-aware’ capability, meaning that the result of each page request is analysed and logged by the software. In the event of service disruptions on the marketplace or on the Tor network, DATACRYPTO pauses and then attempts to continue its crawl a few minutes later. If a request for a page returns a different page (e.g. asking for a listing page and receiving the home page of the cryptomarket), the request is marked as failed, and each crawl keeps a tally of its failed page requests.
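A rough sketch of what such ‘state-aware’ behaviour might look like follows. The names `CrawlLog`, `fetch_with_state` and `looks_right`, the retry limit of three, and the choice to mark a request as failed once retries run out are assumptions for illustration; the pause defaults to zero only so that the example runs instantly, whereas the text above describes a pause of a few minutes.

```python
import time


class CrawlLog:
    """Records the outcome of every page request made during a crawl."""

    def __init__(self):
        self.results = {}  # url -> "ok" or "failed"

    def failed_count(self):
        return sum(1 for status in self.results.values() if status == "failed")


def fetch_with_state(url, fetch, looks_right, log, retries=3, pause=0.0):
    """Fetch a page, pausing and retrying on outages, and log the result."""
    for _ in range(retries):
        try:
            html = fetch(url)
        except ConnectionError:
            time.sleep(pause)  # service disruption: wait, then try again
            continue
        if looks_right(url, html):
            log.results[url] = "ok"
            return html
        break  # a different page came back: do not retry
    log.results[url] = "failed"
    return None


# Hypothetical market: the first request hits an outage, and a request for
# listing 2 is answered with the home page instead of the listing.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated outage")
    if url.endswith("/listing/2"):
        return "<title>Home</title>"
    return "<title>Listing</title>"


def looks_right(url, html):
    # A listing URL should return a page that identifies itself as a listing.
    return "/listing/" not in url or "<title>Listing</title>" in html


log = CrawlLog()
first = fetch_with_state("http://market.example/listing/1", flaky_fetch, looks_right, log)
second = fetch_with_state("http://market.example/listing/2", flaky_fetch, looks_right, log)
```

Here the first request succeeds on its second attempt, while the second request is logged as failed because the page returned was not the page requested.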
For each market, DATACRYPTO is programmed to extract relevant information about listings and vendors, which is then collected into a single database.
DATACRYPTO is not the first crawler to mirror the dark web, but is novel in its ability to pull information from a variety of cryptomarkets at once, despite differences in page structure and naming conventions across sites. For example, “$…” on one market may give you the price of a listing. On another market, price might be signified by “VALUE…” or “PRICE…” instead.
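One common way to cope with such differences is to keep a separate extraction rule for each market and normalise the results into one shared record format. The sketch below assumes two invented markets, `market_a` and `market_b`, and invented page text; the regular expressions are hypothetical and do not describe any real market's layout or DATACRYPTO's own rules.

```python
import re

# Hypothetical per-market price patterns; each real market would need its
# own rule, written after studying that market's pages.
PRICE_PATTERNS = {
    "market_a": re.compile(r"\$\s*(\d+(?:\.\d+)?)"),
    "market_b": re.compile(r"(?:VALUE|PRICE)\s*:?\s*(\d+(?:\.\d+)?)"),
}


def extract_price(market, page_text):
    """Return the listing price as a float, or None if no price is found."""
    match = PRICE_PATTERNS[market].search(page_text)
    return float(match.group(1)) if match else None


# Invented page text for two markets that label the price differently.
rows = [
    {"market": "market_a", "price": extract_price("market_a", "Item X $ 25.00")},
    {"market": "market_b", "price": extract_price("market_b", "Item X PRICE: 25.00")},
]
```

Both rows end up with the same `price` field regardless of how the source page labelled it, which is what allows listings from different markets to sit side by side in one database.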
Researchers who want to create a similar tool to gather data by crawling the web should specify exactly which information they would like to extract. When building a web crawler it is, for example, very important to study carefully the structure and characteristics of the websites to be mirrored. Before setting the crawler loose, ensure that it extracts and parses correct and complete information. Because building a crawler tool like DATACRYPTO can be costly and time-consuming, it is also important to anticipate future data needs and to build in the capability to extract that kind of data later on, so that no large modifications are necessary in the future.
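A simple completeness check, run on a small test crawl before the full one, is one way to confirm that the scraper is returning what was specified. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS` and the sample records are assumptions, and a real project would substitute the fields it has decided to collect.

```python
# Hypothetical list of fields the researchers have decided they need.
REQUIRED_FIELDS = ("title", "price", "vendor")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def validation_report(records):
    """Map the index of each incomplete record to its missing fields."""
    report = {}
    for index, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            report[index] = missing
    return report


# Invented sample output from a test crawl.
sample = [
    {"title": "Item X", "price": 25.0, "vendor": "seller1"},
    {"title": "Item Y", "price": None, "vendor": "seller2"},
    {"title": "", "vendor": "seller3"},
]

report = validation_report(sample)
```

An empty report suggests the extraction rules cover every required field on the sampled pages; a non-empty one points to the records and fields that need attention before the full crawl.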
Building a complex tool like DATACRYPTO is no easy feat. The crawler needs to be able to copy pages, but also to get past CAPTCHAs unobtrusively and log itself in to the market's server on Tor. Because of the volume of requests they make, web crawlers can place a heavy burden on a website’s server, and they are easily detected by their repetitive pattern of movement between pages. Site administrators therefore do not hesitate to IP-ban badly designed crawlers from their sites.