Assignment 1: UrlExtractor

Deadline: Sunday 15th Feb 23:59

High level description

You are to write a Java program, UrlExtractor, that takes a URL as a command-line argument and prints all outgoing URLs from the web page at that URL. For example, if we run

java UrlExtractor "http://lums.edu.pk/sse/cs"

then the output should be a list of outgoing URLs from that web page.

Detailed instructions

You need to use the Socket class to connect to a web server and speak HTTP to it. You can assume that the URL will always be valid and will result in a successful response (status code 200).
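As a rough illustration of the connection step, the sketch below builds a plain GET request and sends it over a Socket. The class name HttpRequestSketch, its method names, the choice of HTTP/1.0, and port 80 are all assumptions made for this example, not requirements of the assignment.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class HttpRequestSketch {
    // Builds a minimal GET request. HTTP/1.0 is assumed here so that the
    // server closes the connection once the response has been sent.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Connects to port 80 (an assumption), sends the request, and returns the
    // open socket so the caller can read the response from its input stream.
    static Socket open(String host, String path) throws IOException {
        Socket socket = new Socket(host, 80);
        OutputStream out = socket.getOutputStream();
        out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
        out.flush();
        return socket;
    }
}
```

The response begins with a status line and headers, followed by a blank line; the HTML starts only after that blank line.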

You should not read the whole HTML response at once. Instead, use a fixed-size buffer of 100 bytes. Once you have processed these bytes, you can read the next 100 bytes at most. The last read will, of course, usually return fewer.
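A minimal sketch of such a read loop follows. The names ChunkedReadSketch and readInChunks are hypothetical, and the method only records the size of each chunk where a real program would process the bytes. Note that InputStream.read may return fewer bytes than requested on any call, not only the last one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ChunkedReadSketch {
    static final int BUFFER_SIZE = 100;

    // Reads the stream in chunks of at most BUFFER_SIZE bytes and returns
    // the size of each chunk in the order it was read.
    static List<Integer> readInChunks(InputStream in) {
        byte[] buffer = new byte[BUFFER_SIZE];
        List<Integer> sizes = new ArrayList<>();
        try {
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = in.read(buffer, 0, BUFFER_SIZE)) != -1) {
                // buffer[0..n) now holds fresh bytes; process them here.
                sizes.add(n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }
}
```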

You need to implement a state machine to process the HTML. You only need enough states to parse out the URLs, but you need to be aware of comments and scripts in the HTML. So from your start state, if you see “<script” (case insensitive), you change into a state where you discard everything until you see “/script>”. The same applies to comments, which lie between “<!--” and “-->”. However, if you see “<a”, you enter a state that watches for “href="” and then change to a state where you save every character. When you see the closing “"”, you call a function and pass it the URL. The closing “>” takes you back to the start state. Your parser should never crash, but the output can be wrong if the HTML is malformed. Take extra care near the 100-byte boundary, where you may see only half a tag. One way to handle this is to copy the leftover characters to the start of your buffer and then fill the rest of the buffer with new bytes.
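The states described above can be sketched as follows. This is one possible shape, not the expected design: the class HrefScannerSketch and all of its members are hypothetical. Instead of copying leftover characters to the front of the buffer, this sketch swaps in a different technique: it consumes one character at a time and remembers the last few characters in a short tail, so a marker split across two reads is still recognised. It also assumes “<a” is followed by whitespace, so that tags such as “<abbr” are not mistaken for links.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner, for illustration only.
class HrefScannerSketch {
    enum State { START, SCRIPT, COMMENT, ANCHOR, URL }

    private State state = State.START;
    private final StringBuilder tail = new StringBuilder(); // last few chars, lowercased
    private final StringBuilder url = new StringBuilder();  // URL being collected
    final List<String> found = new ArrayList<>();

    // Consumes one character; all state survives between calls, so the
    // caller may split the input at any point.
    void feed(char c) {
        if (state == State.URL) {
            if (c == '"') { found.add(url.toString()); enter(State.ANCHOR); }
            else url.append(c);
            return;
        }
        tail.append(Character.toLowerCase(c));
        if (tail.length() > 8) tail.deleteCharAt(0); // longest marker is 8 chars
        String t = tail.toString();
        switch (state) {
            case START:
                if (t.endsWith("<script")) enter(State.SCRIPT);
                else if (t.endsWith("<!--")) enter(State.COMMENT);
                else if (t.length() >= 3 && Character.isWhitespace(c)
                        && t.startsWith("<a", t.length() - 3)) enter(State.ANCHOR);
                break;
            case SCRIPT:
                if (t.endsWith("/script>")) enter(State.START);
                break;
            case COMMENT:
                if (t.endsWith("-->")) enter(State.START);
                break;
            case ANCHOR:
                if (t.endsWith("href=\"")) { url.setLength(0); enter(State.URL); }
                else if (c == '>') enter(State.START);
                break;
            default:
                break;
        }
    }

    // Feeds the first 'length' bytes of a buffer, treating each byte as a character.
    void feed(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) feed((char) (buffer[i] & 0xFF));
    }

    // Changes state and clears the tail so an old marker cannot match again.
    private void enter(State next) { state = next; tail.setLength(0); }
}
```

A caller would create one scanner per page, call feed(buffer, n) after every read, and take the collected links from found at the end.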

Once you have the URLs, convert the relative URLs to absolute URLs. Remove arguments (everything from the “?” sign onward). Add any missing “http:” and then filter out duplicates. Also remove URLs pointing to the same page (those starting with “#”). This final list is your output. Print one URL per line and nothing else so that your program can be auto-checked.
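The clean-up steps might look roughly like the sketch below. It leans on java.net.URI to resolve relative links, which is an assumption about what is acceptable here; the class UrlCleanupSketch and its clean method are hypothetical names. Edge cases, such as a page URL with no path at all, are not handled carefully.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class UrlCleanupSketch {
    // Returns the cleaned links in first-seen order with duplicates removed.
    static Set<String> clean(String pageUrl, List<String> rawLinks) {
        URI base = URI.create(pageUrl);
        Set<String> result = new LinkedHashSet<>(); // a set drops duplicates
        for (String link : rawLinks) {
            if (link.startsWith("#")) continue;          // same-page link
            int q = link.indexOf('?');
            if (q >= 0) link = link.substring(0, q);     // drop arguments
            try {
                // resolve() turns relative links into absolute ones and
                // supplies the scheme for links that begin with "//".
                result.add(base.resolve(link).toString());
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than crash.
            }
        }
        return result;
    }
}
```

Printing the result is then a matter of writing each element of the returned set on its own line.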

Code Quality

Correctness alone is not sufficient. For maximum marks, the design of your algorithms (state machine, buffer management, URL handling) and of your classes and functions (which class is responsible for which task, and which functions have a clearly defined single purpose with appropriate arguments) needs to be very good. There are hundreds of good ways to design each of these. I recommend that you think of your own designs. Also use good variable names, reasonable comments, and a good comment with every checkin.

Warnings

Taking any code from the Internet or from each other, helping each other debug, or having access to each other’s code is strictly prohibited. Any perpetrators will be forwarded for strict action.

Any guidance on design, or even examples (like using the Socket class), that you study from the Internet should be cited in your README file.

Git Setup

You should work in your git repository from the first day. Please checkin frequently. Fewer, bigger checkins will not get full credit. However, you can keep committing on your local machine (e.g. when working offline) and “push” less frequently to the server; all your checkins will then go to the server. REMEMBER “push” SENDS YOUR COMMITS TO THE SERVER, i.e. SUBMITS YOUR WORK FOR GRADING. Always check online what has been updated in the repository before the deadline. Here are the two steps to set up git.

  1. Go to http://git.junaid.name and log in through the LDAP tab with your LUMS login (the username is the part before @lums.edu.pk in your email address). Then follow the steps on http://doc.gitlab.com/ce/ssh/ssh.html to generate an SSH key and add it to your account.

  2. Now go to http://git.junaid.name/cs300-sp15/<YOUR-ROLLNO> and follow the instructions for git global setup and then for creating a new repository on your local machine. This will be your directory for all assignments in this course. Make a urlextractor folder inside this directory and put the code for this assignment in there. Do not commit any binaries. Only your source .java files and a README file can be part of this folder.