Project 1: Repairing XML

This is the archived website of IC 312 from the Fall 2014 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

1 Introduction

XML ("extensible markup language") is a really really common file format for storing pretty much any kind of data in plain text files, particularly stuff that might get transmitted over the web. XML is used to specify podcasts, math formulas, even Microsoft Office documents. In fact, many web pages that you see are formatted in XML (or at least in its less-strict older cousin, HTML). For example, here is the markup that made this paragraph:

<p>
  XML ("extensible markup language") is a really 
  <em>really</em> common file format for storing 
  pretty much any kind of data in plain text files,
  particularly stuff that might get transmitted 
  over the web.  XML is used to specify 
  <a href="http://en.wikipedia.org/wiki/RSS">podcasts</a>,
  <a href="http://www.w3.org/Math/">math formulas</a>, 
  even Microsoft Office documents. In fact, many web pages 
  that you see are formatted in XML (or at least in its 
  less-strict older cousin, HTML).  For example, here is 
  the markup that made this paragraph:
</p>

The main structure of an XML document is determined by tags such as <this> or </that>. Every tag has a name and is either an open tag or a close tag. So for example <head> is an open tag with name "head", and </button> is a close tag with name "button". For XML to be valid, every open tag must be followed by a matching close tag with the same name, and they must be properly nested.

There's a lot more to XML besides the tags (tags can have attributes, and there is usually data stuffed between the tags), but for this project there will only be "bare" open and close tags, without any attributes. You are going to write a program that uses stacks and queues in order to read in a bunch of open/close XML tags and outputs valid XML where all the tags match. If the input is already valid, your program shouldn't make any changes, but if the input is invalid, your goal will be to add missing open/close tags (as few as possible) in order to produce valid XML.

The project is split into three main "components". Some of it has been written for you, but you will have to structure and write most of the code yourself. You should give yourself more time to complete this project than you've needed in previous classes, because I'm giving you much more freedom (and less guidance) on exactly how you need to structure your code. That's exciting for you because you have more flexibility and creative opportunity, but it also means that you have more decisions to make. You might find yourself writing some code, then deciding that didn't work the way you wanted and you have to go back and re-do it in a different way. If that happens, it's a good thing and it means you're getting better as a programmer.

There are a total of 105 points available, which means that there are 5 built-in bonus points if you want to go for it. Not everything has to be done sequentially, and I've tried to point out times when you can (and should) skip to the next task.

2 First component: Reading Tags (up to 3 points)

Most of this has been written for you. There is a class Tag that represents a single opening or closing XML tag, along with its name. You can add more methods to this class if you want. There is also a class TagScanner that can be used to read Tag objects from an input stream. You use the method hasNext() to determine whether there is another tag on the input stream, and next() to read and return the next Tag object.

Task 1.1: Add attributes and contents (3 points)

You should skip this and come back to it after you have everything else..

Right now the input can only contain bare open and close tags, without any attributes like <tag anAttribute="attribute value"> or contents like <tag>Contents of the tag</tag>. Modify the Tag and TagScanner classes so that tags can have attributes and contents, and make these attributes and contents print out with the toString method. The tag's contents should just be a string, and should not include the other tags that are "nested" inside. Oh, and you should also allow "self-closing" tags such as <br />that are simultaneously open AND close tags.

3 Second component: Stacks and Queues (up to 50 points)

The real point of this project is for you to learn and understand stacks and queues. So you have to implement a stack and a queue! My starter code includes two interfaces Stack.java and Queue.java that specify how your stack and queue should work, but as always you can add more methods to these classes if you need to.

What you are allowed to use: Stacks and queues are very important, basic data structures, and so there are probably thousands of stack and queue classes written in Java that you can find online. Don't use them, because then you wouldn't learn what it takes to write it yourself. Don't use Java's built-in Stack and Queue either. In fact, you may not use anything from java.util except possibly for java.util.Arrays for this project. In fact, my Queue and Stack interfaces would conflict with the queue and stack that are defined in java.util, so you definitely don't want to include java.util.*.

Task 2.1: Implement a stack of tags (15 points)

Create a class TagStack that implements the interface Stack<Tag>. You get to choose the data structure (linked list or array).

Task 2.2: Implement a queue of Objects (15 points)

Create a class MyQueue that implements the interface Queue. Since there is no generic type parameter, this will be a queue of Objects, which could store anything (not just Tag instances).

Task 2.3: Use linked lists and arrays (15 points)

You can skip this task and come back to it later.

Modify your TagStack or MyQueue class so that one of them uses a linked list and the other one uses and array for storage. If it's an array, you should use the repeated doubling idea so that each operation is amortized O(1). I don't care which one is array-based and which one is linked list-based.

Task 2.4: Make the MyQueue class generic (5 points)

You can skip this task and come back to it later.

Modify your MyQueue class so that it has a generic type parameter T and it implements Queue<T>. This way, you can still use your MyQueue instances to store any kind of object, but the code that uses your queue won't have to contstantly be typecasting the results of dequeue() and peek() methods.

If you use an array for your queue, things get a little kooky because the way Java is compiled means that you can't do something like create a new T[100] array. Instead, you have to create an array of Objects and then cast it to T[], like

T[] myArrayOfT = (T[]) new Object[100];
That will work, except you will get some warnings about unsafe casts. To get rid of those warnings and tell Java that you really do know what you're doing, add the annotation @SuppressWarnings("unchecked") immediately before any constructor or method that does that kind of array casting.

4 Third component: Repairing XML (up to 52 points)

This is what this project is supposed to be all about - repairing invalid XML. You will write a class XMLRepair with a main method that reads in XML tags from stdin, then tries to produce valid XML (with matching tags) from that input. The output should first contain an integer indicating the total number of tags that follow, followed by the tags that represent valid XML and that contain all the input tags, in the same order. Note that your repair process is only allowed to add missing tags, and cannot remove any tags from the input.

If you are lucky and the input is already valid XML, then the number of output tags should exactly equal the number of input tags. Otherwise, the size of the output will be somewhat larger than the input size. The best possible repairer would keep this number as small as possible.

The three tasks below are not really separate from each other, but rather they build on each other in terms of the quality of the output, meaning that as you complete task your repairer will get "better" and will produce smaller valid XML output. You could technically skip straight to Task 3.4 if you are very ambitious, but I don't recommend it.

What you are allowed to use: Java contains a lot of stuff for dealing with XML files. In fact, there are at least three incompatible ways to read and XML under javax.xml, called "SAX", "DOM", and "StAX". I don't think any of that would be helpful for this project, because it's really all far more complicated than what we're doing here. You can also find tons of code by searching Google for "XML validation", but again it's mostly about checking whether XML is valid (not repairing it), and it's going to be far more complicated than here (where we are just worrying about tags and not schemas or that other stuff).

Task 3.1: Duplicate every tag (10 points)

The easiest way to "fix" any XML is to immediately output a closing tag after every opening tag, and immediately output an opening tag before every closing tag. Of course this is not a very sophisticated way to do it, but it will help get us started.

You should only need your MyQueue class to get this working. Write a function that takes in a TagScanner instance and returns a MyQueue of Tags. Every time you read the next tag from the scanner, check if it's an open or a close tag. Then add two tags to the output queue - the original tag that was scanned in, and the matching opening/closing tag with the same name, in the proper order (open, then close).

Your program should now work like this:

$ java XMLRepair <<< "<a></b>"
4
<a>
</a>
<b>
</b>

Task 3.2: Only duplicate if invalid (15 points)

The previous repairer is not that great because it will duplicate every tag, even if the original XML was totally valid and could be returned as-is. For this task, you will first check if the input is valid XML, and only run the duplication repair method if the input is NOT valid.

To check for valid XML, you have to use a Stack of tags. The procedure works roughly like this: Start with an empty stack. As you read in each Tag, check whether it is an open or a close tag. If it's an open tag, add it to the stack. If it's a close tag AND it has the same name as the tag on the top of the stack, pop off the top of the stack and keep going. Otherwise (if it's a close tag that doesn't match the top of the stack), then you have invalid XML and have to repair it somehow.

Make sure you understand the stack procedure before continuing! This is what's known as a "parser", and it's very similar to a famous parsing problem called "parenthesis matching" (or sometimes "brackets matching"). You can probably find some helpful examples online of how to do parentheses matching using a stack.

By the way, you will probably have to use multiple queues to keep track of the original input so that you can run the duplicate-repairer if it turns out to be invalid XML. The basic outline might be to first read in the input and add each tag to 2 different queues, then run the stack algorithm on one queue to check whether it's valid, then finally run the duplication repairer on the other queue if the input turns out to be invalid XML.

Your program should now be able to do:

$ java XMLRepair <<< "<a><b></b></a>"
4
<a>
<b>
</b>
</a>

Task 3.3: Add missing closing tags (15 points)

OK, now you're going to make your XML repairer a little more sophisticated. You will run the same parsing algorithm as before with a single stack of tags, but this time you are not just checking if the input is valid XML, you are actually repairing the XML as you go.

Just like with the last task, each time you encounter an open tag you add it to the stack. You should also add it to an output queue of tags. When you encounter a close tag that matches the top of the stack, you pop the stack and add the close tag to the output queue. But if the close tag doesn't match, you should keep popping from the stack and adding corresponding close tags, until you find the matching open tag for the current close tag. At the end (when the input is exhausted), you have to add another loop to add more closing tags for any open tags that remain on the stack.

By the way, this task is very similar to how HTML is usually handled. Web browsers generally assume that the programmer might have forgotten some closing tags like </p> or </td> and adds them as necessary to fix up the HTML.

Your program should now be able to do:

$ java XMLRepair <<< "<a><b></a>"
4
<a>
<b>
</b>
</a>

Task 3.4: Add missing opening tags too (10 points)

This is the ultimate task: given any sequence of input tags, find the smallest possible additions that will make the input valid XML. You will have to keep track of many "possibilities", each of which has its own stack for matching opening/closing tags, as well as a queue of output tags. I recommend making a class to represent a "possibility", and then you will make a queue of possibilities. So yes, it will be a queue of stack+queue objects - awesome!

There are multiple ways to accomplish this, but here is one way that works:

create a queue of possibilities.
add one "empty" possibility
for each tag in the input:
    if it's an open tag:
        for each possibility:
            add the open tag to the possibility
            put the possibility back in the queue
            make a copy of the possibility and add the matching close tag
            put that new possibility in the queue as well
    else if it's a close tag:
        for each possibility:
            if the tag at the top of the stack matches, pop it.
            add the new close tag to the possibility
            put the possibility back in the queue
for each possibility:
    add new closing tags to match anything left on the stack
return the smallest queue of tags out of all possibilities

Note that a correct program for this step will be pretty slow - exponential-time in fact! Having an efficient Queue and Stack class will help with this, but I promise not to test your program on any input with too many open tags (no, I'm not going to say how many).

Your program should now be able to do:

$ java XMLRepair <<< "<a></b></a>"
4
<a>
<b>
</b>
</a>

Task 3.5: Pretty print output (2 points)

You can do this even if you don't complete everything else.

Right now you are probably printing every open or close tag on separate lines. Change this behavior so that an open/close tag pair with no tags in between prints on the same line. And while you're at it, indent each line according to the level of nesting.

$ java XMLRepair <<< "<a><b></b><b><c></c></b></a>"
8
<a>
  <b></b>
  <b>
    <c></c>
  </b>
</a>

5 Starter code

You can get all these files at once by downloading the file proj1.tar.gz and running tar xzvf proj1.tar.gz

6 Submission and grading

Your program will be run by an auto-testing script. It must compile and work as specified; otherwise you will not get full credit for your work. You are encouraged to submit early and submit often. I strongly suggest submitting after every task you complete.

If you are only partially able to complete some task, but it does not actually work, you can leave that code commented out to show that you did some work on it. It is more important that your code works!

You must submit a README.txt file. You can put any notes to me in there, but at the very least your README.txt should list which tasks you completed, as well as any sources you need to cite.