Assignment 2: Multi-threaded Web Crawler

Deadline: Monday 9th March 1:20 PM

High Level Description

You are to write a Java program WebCrawler that takes a URL as a command line argument and prints the outgoing URLs from the webpage at that URL. For each of those URLs, the second-level outgoing links are printed with a 2-space indent; for each second-level URL, the third-level URLs are printed with a 4-space indent, and so on. URLs that are not traversed should have the reason for not crawling in parentheses in front of them (ROBOTS EXCLUSION, TOTAL LINKS LIMIT REACHED, LINKS PER DOMAIN REACHED, ERROR WHILE READING).

Do all console printing at the end, after everything has been crawled and the worker threads have terminated. This is important; otherwise very little of the work would actually run in parallel. This means you will accumulate the crawl tree in memory. (Real applications generate console output per thread and then aggregate it later to avoid accumulating everything in memory, but that is not required in this assignment.)
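For illustration, here is a minimal sketch of how the crawl tree could be held in memory and printed once all threads have finished. The class name, fields, and print layout are assumptions for illustration, not a required interface.

```java
import java.util.ArrayList;
import java.util.List;

// One node of the in-memory crawl tree, printed only after all workers terminate.
class CrawlNode {
    final String url;
    final String skipReason;               // null if the page was actually crawled
    final List<CrawlNode> children = new ArrayList<>();

    CrawlNode(String url, String skipReason) {
        this.url = url;
        this.skipReason = skipReason;
    }

    // Prints this URL at the given depth (2 spaces per level), then its children.
    void print(int depth) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) line.append("  ");
        if (skipReason != null) line.append("(").append(skipReason).append(") ");
        line.append(url);
        System.out.println(line.toString());
        for (CrawlNode child : children) child.print(depth + 1);
    }
}
```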

Multithreading, correct synchronization, git branches, and coding quality (as discussed in Code Complete 2) are the prime focus. Performance is not a focus of this assignment.

Submission is via your git repository as for Assignment 1.

Step 1A: Robots.txt parsing

Every domain can (optionally) host a robots.txt file in its root directory, e.g. at http://www.google.com/robots.txt. Every web crawler is required to obey the rules set in this file. You are to create a new git branch called robots. In this branch, implement the functionality to parse robots.txt (look up the format of this file online) and return a boolean telling whether you are allowed to crawl a particular URL. It is fine if you only parse the Disallow lines; you will then be more restrictive than necessary, but that is okay. If robots.txt does not exist, you are allowed to crawl any page. You should cache the results in a HashMap so that only one network request is made per root domain.
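A minimal sketch of the idea, assuming you only honour Disallow lines (ignoring User-agent groups, which makes you more restrictive) and cache one entry per protocol + host; the class and method names are illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RobotsCache {
    // Site root (e.g. "http://www.google.com") -> disallowed path prefixes.
    private final Map<String, List<String>> disallowedByHost = new HashMap<>();

    synchronized boolean isAllowed(URL url) {
        String host = url.getProtocol() + "://" + url.getHost();
        List<String> disallowed = disallowedByHost.get(host);
        if (disallowed == null) {                       // first time we see this host
            disallowed = fetchDisallowedPrefixes(host);
            disallowedByHost.put(host, disallowed);
        }
        String path = url.getPath().isEmpty() ? "/" : url.getPath();
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    private List<String> fetchDisallowedPrefixes(String host) {
        List<String> prefixes = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(host + "/robots.txt").openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring("disallow:".length()).trim();
                    if (!prefix.isEmpty()) prefixes.add(prefix);
                }
            }
        } catch (Exception e) {
            // No robots.txt (or an error fetching it): treat every page as allowed.
        }
        return prefixes;
    }
}
```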

Step 1B: Redirect handling

Create a new git branch called redirect and implement in this branch support for redirects. A redirect is a 301 response from a server indicating that the content for the requested URL actually lives at another URL, e.g. http://google.com/robots.txt sends a redirect to http://www.google.com/robots.txt.
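A minimal sketch of detecting a 301 manually, so that your crawler (rather than the URL library) decides what to do with the new location; the class and method names are assumptions:

```java
import java.net.HttpURLConnection;
import java.net.URL;

class RedirectResolver {
    // Returns the URL the server redirects to, or the original URL if there is no
    // redirect. A loop with an upper bound would handle chains of redirects.
    static URL resolveOnce(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false);          // we want to see the 301 ourselves
        int code = conn.getResponseCode();
        if (code == HttpURLConnection.HTTP_MOVED_PERM) { // 301; 302 could be handled similarly
            String location = conn.getHeaderField("Location");
            if (location != null) {
                return new URL(url, location);           // resolves relative Location headers too
            }
        }
        return url;
    }
}
```

Note that the Location header may be relative, which is why the two-argument URL constructor is used above.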

Step 1C: Blocking Priority Queue

Create a new branch called queue and implement a blocking priority queue using synchronized methods and wait/notify. Do not use a built-in class for this purpose; we implemented essentially this in class. You can implement the priority queue using an unsorted list, a sorted list, or a heap, since performance is not a marking criterion in this assignment.
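A minimal sketch using an unsorted list, the simplest of the three options; the generic element type and the Comparator-based priority are assumptions about how you might wire it into your crawler:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BlockingPriorityQueue<T> {
    private final List<T> items = new ArrayList<>();
    private final Comparator<T> byPriority;   // larger means higher priority

    BlockingPriorityQueue(Comparator<T> byPriority) {
        this.byPriority = byPriority;
    }

    synchronized void put(T item) {
        items.add(item);
        notifyAll();                           // wake any thread blocked in take()
    }

    // Blocks until an element is available, then removes and returns the
    // highest-priority element (linear scan of the unsorted list).
    synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        int best = 0;
        for (int i = 1; i < items.size(); i++) {
            if (byPriority.compare(items.get(i), items.get(best)) > 0) best = i;
        }
        return items.remove(best);
    }

    synchronized boolean isEmpty() {
        return items.isEmpty();
    }
}
```

Note that put() calls notifyAll() so that a thread blocked in take() wakes up as soon as work arrives.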

Back in the master branch, implement support for crawling all links found on the page at the original URL and printing all second-level URLs. So if you find 10 URLs on the given page, you print each of the 10 URLs, crawl it, and print the outgoing links from that page with a 2-space indent.
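A minimal sketch of pulling outgoing links out of a page with a simple href regex; a real HTML parser would be more robust, but something like this is sufficient for the assignment. The class name and error handling are illustrative assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<URL> outgoingLinks(URL page) {
        List<URL> links = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(page.openConnection().getInputStream()))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) html.append(line).append('\n');
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    links.add(new URL(page, m.group(1)));   // resolves relative hrefs
                } catch (Exception ignored) {
                    // Skip malformed URLs.
                }
            }
        } catch (Exception e) {
            // Pages that cannot be read yield no links here; the caller can
            // record ERROR WHILE READING for them.
        }
        return links;
    }
}
```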

Step 2: Merging code

The above steps can be done in parallel, so you should know how to switch between branches in your repository and work on several branches at once. After the above steps, merge the code from the robots, redirect, and queue branches into the master branch. Up to this point the four features must be implemented in separate branches; after this step you are free to use branches as you need. The merge will be easier if all features are implemented in well-designed classes. Reach this point early, as merging is a difficult process to get your head around.

Step 3: Multithreaded crawler

Create 10 crawler threads, each of which takes a URL from the blocking queue and adds all outgoing URLs back to the blocking queue. The starting URL is given on the command line. You must not violate robots.txt (do not crawl unless you have implemented robots.txt handling), you must crawl at most 5 pages per domain (google.com and www.google.com count as different domains), and you must stop after crawling 100 web pages (not counting redirects and robots.txt requests). For testing, you can reduce these limits. Any URL that gives an error should be ignored. Absolutely NO URL should be crawled twice, even when a new URL returns a redirect response and the redirect target has already been crawled. You will need to ensure proper synchronization for robots.txt lookups (since you are using a shared HashMap) and for the data about which domains/URLs have been crawled.

The priority of a URL in the queue is

  Links_to_that_URL_seen_so_far_from_pages_on_other_domains / (1 + Pages_already_indexed_from_domain_of_the_URL)

Ensure that URLs are processed in order of priority. You will also have to be careful with program termination: when the queue is empty and no thread has any work left, your program should stop.
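One common way to detect termination is sketched below, under the assumption of a shared counter of URLs that have been enqueued but not yet fully processed; the class name, methods, and the way workers consult isFinished() are illustrative assumptions, not required structure. The priority formula is included as a small helper.

```java
// Shared crawler state guarded by this object's monitor.
class CrawlerState {
    private int tasksInFlight = 0;     // URLs enqueued but not yet fully processed
    private boolean finished = false;

    // Call this every time a URL is added to the shared queue.
    synchronized void taskAdded() {
        tasksInFlight++;
    }

    // Call this once a worker has fully processed a URL (its links enqueued or skipped).
    synchronized void taskDone() {
        tasksInFlight--;
        if (tasksInFlight == 0) {      // queue is empty and nobody is expanding a page
            finished = true;
            notifyAll();
        }
        // Note: workers blocked inside the queue's take() also need to be woken,
        // e.g. by enqueueing sentinel entries or by having take() use a timed wait.
    }

    synchronized boolean isFinished() {
        return finished;
    }

    // Priority from the formula above.
    static double priority(int linksFromOtherDomains, int pagesAlreadyIndexedFromDomain) {
        return linksFromOtherDomains / (double) (1 + pagesAlreadyIndexedFromDomain);
    }
}
```

For example, a URL that has been linked to 3 times from pages on other domains, and whose own domain already has 2 indexed pages, gets priority 3 / (1 + 2) = 1.0.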

Warnings

Taking any code from the Internet or from each other, helping each other debug, or having access to each other's code is strictly prohibited. Any perpetrators will be forwarded for strict action. Any guidance on design, or even examples you study on the Internet, should be cited in your README file.