This is the archived website of IC 312 from the Fall 2014 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.
312 proj 1.
XML ("extensible markup language") is a really really common file format for storing pretty much any kind of data in plain text files, particularly stuff that might get transmitted over the web. XML is used to specify podcasts, math formulas, even Microsoft Office documents. In fact, many web pages that you see are formatted in XML (or at least in its less-strict older cousin, HTML). For example, here is the markup that made this paragraph:
<p> XML ("extensible markup language") is a really <em>really</em> common file format for storing pretty much any kind of data in plain text files, particularly stuff that might get transmitted over the web. XML is used to specify <a href="http://en.wikipedia.org/wiki/RSS">podcasts</a>, <a href="http://www.w3.org/Math/">math formulas</a>, even Microsoft Office documents. In fact, many web pages that you see are formatted in XML (or at least in its less-strict older cousin, HTML). For example, here is the markup that made this paragraph: </p>
The main structure of an XML document is determined
by tags such as
</that>. Every tag has a name and
is either an open tag or a close tag. So for example
<head> is an open tag with name "head",
</button> is a close tag with name
"button". For XML to be valid, every open tag must be
followed by a matching close tag with the same name, and they must be
There's a lot more to XML besides the tags (tags can have attributes, and there is usually data stuffed between the tags), but for this project there will only be "bare" open and close tags, without any attributes. You are going to write a program that uses stacks and queues in order to read in a bunch of open/close XML tags and outputs valid XML where all the tags match. If the input is already valid, your program shouldn't make any changes, but if the input is invalid, your goal will be to add missing open/close tags (as few as possible) in order to produce valid XML.
The project is split into three main "components". Some of it has been written for you, but you will have to structure and write most of the code yourself. You should give yourself more time to complete this project than you've needed in previous classes, because I'm giving you much more freedom (and less guidance) on exactly how you need to structure your code. That's exciting for you because you have more flexibility and creative opportunity, but it also means that you have more decisions to make. You might find yourself writing some code, then deciding that didn't work the way you wanted and you have to go back and re-do it in a different way. If that happens, it's a good thing and it means you're getting better as a programmer.
There are a total of 105 points available, which means that there are 5 built-in bonus points if you want to go for it. Not everything has to be done sequentially, and I've tried to point out times when you can (and should) skip to the next task.
Most of this has been written for you. There is a class
Tag that represents a single opening or closing XML tag, along with
its name. You can add more methods to this class if you want. There is also a class
TagScanner that can be used to read
Tag objects from an
input stream. You use the method
hasNext() to determine whether there is
another tag on the input stream, and
next() to read and return the
You should skip this and come back to it after you have everything else..
Right now the input can only contain bare open and close tags, without
any attributes like
<tag anAttribute="attribute value"> or
<tag>Contents of the tag</tag>.
TagScanner classes so that tags can
have attributes and contents, and make these attributes and contents print out
toString method. The tag's contents should just be a string,
and should not include the other tags that are "nested" inside.
Oh, and you should also allow "self-closing" tags such as
<br />that are simultaneously open AND close tags.
The real point of this project is for you to learn and understand stacks and queues. So you have to implement a stack and a queue! My starter code includes two interfaces Stack.java and Queue.java that specify how your stack and queue should work, but as always you can add more methods to these classes if you need to.
What you are allowed to use: Stacks and queues are very important,
basic data structures, and so there are probably thousands of stack and queue
classes written in Java that you can find online. Don't use them, because then
you wouldn't learn what it takes to write it yourself. Don't use
Java's built-in Stack and Queue either. In fact, you may not use anything
java.util except possibly for
for this project. In fact, my
interfaces would conflict with the queue and stack that are defined in
java.util, so you definitely don't want to
Create a class
TagStack that implements
Stack<Tag>. You get to choose
the data structure (linked list or array).
Create a class
MyQueue that implements the
Queue. Since there is no generic type
parameter, this will be a queue of
could store anything (not just
You can skip this task and come back to it later.
MyQueue class so that one of
them uses a linked list and the other one uses and array for storage. If it's an
array, you should use the repeated doubling idea so that each operation is
amortized O(1). I don't care which one is array-based and which one is
You can skip this task and come back to it later.
MyQueue class so that it has a generic type
T and it implements
This way, you can still use your MyQueue instances to store any kind of
object, but the code that uses your queue won't have to contstantly be
typecasting the results of
If you use an array for your queue, things get a little kooky because
the way Java is compiled means that you can't do something like create a
new T array. Instead, you have to create an
Objects and then cast it to
T myArrayOfT = (T) new Object;
@SuppressWarnings("unchecked")immediately before any constructor or method that does that kind of array casting.
This is what this project is supposed to be all about - repairing
invalid XML. You will write a class
XMLRepair with a
main method that reads in XML tags from stdin, then tries to
produce valid XML (with matching tags) from that input. The output should
first contain an integer indicating the total number of tags that follow,
followed by the tags that represent valid XML and that contain all the
input tags, in the same order.
Note that your repair process is only allowed to add missing
tags, and cannot remove any tags from the input.
If you are lucky and the input is already valid XML, then the number of output tags should exactly equal the number of input tags. Otherwise, the size of the output will be somewhat larger than the input size. The best possible repairer would keep this number as small as possible.
The three tasks below are not really separate from each other, but rather they build on each other in terms of the quality of the output, meaning that as you complete task your repairer will get "better" and will produce smaller valid XML output. You could technically skip straight to Task 3.4 if you are very ambitious, but I don't recommend it.
What you are allowed to use: Java contains a lot of stuff for
dealing with XML files. In fact, there are at least three incompatible
ways to read and XML under
javax.xml, called "SAX", "DOM", and
"StAX". I don't think any of that would be helpful for this project,
because it's really all far more complicated than what we're doing here.
You can also find tons of code by searching Google for "XML validation", but
again it's mostly about checking whether XML is valid (not repairing it), and
it's going to be far more complicated than here (where we are just worrying
about tags and not schemas or that other stuff).
The easiest way to "fix" any XML is to immediately output a closing tag after every opening tag, and immediately output an opening tag before every closing tag. Of course this is not a very sophisticated way to do it, but it will help get us started.
You should only need your
MyQueue class to get this working.
Write a function that takes in a
TagScanner instance and returns
MyQueue of Tags. Every time you read the next tag from the scanner,
check if it's an open or a close tag. Then add two tags to the output queue -
the original tag that was scanned in, and the matching opening/closing tag with
the same name, in the proper order (open, then close).
Your program should now work like this:
$ java XMLRepair <<< "<a></b>" 4 <a> </a> <b> </b>
The previous repairer is not that great because it will duplicate every tag, even if the original XML was totally valid and could be returned as-is. For this task, you will first check if the input is valid XML, and only run the duplication repair method if the input is NOT valid.
To check for valid XML, you have to use a Stack of tags. The procedure works roughly like this: Start with an empty stack. As you read in each Tag, check whether it is an open or a close tag. If it's an open tag, add it to the stack. If it's a close tag AND it has the same name as the tag on the top of the stack, pop off the top of the stack and keep going. Otherwise (if it's a close tag that doesn't match the top of the stack), then you have invalid XML and have to repair it somehow.
Make sure you understand the stack procedure before continuing! This is what's known as a "parser", and it's very similar to a famous parsing problem called "parenthesis matching" (or sometimes "brackets matching"). You can probably find some helpful examples online of how to do parentheses matching using a stack.
By the way, you will probably have to use multiple queues to keep track of the original input so that you can run the duplicate-repairer if it turns out to be invalid XML. The basic outline might be to first read in the input and add each tag to 2 different queues, then run the stack algorithm on one queue to check whether it's valid, then finally run the duplication repairer on the other queue if the input turns out to be invalid XML.
Your program should now be able to do:
$ java XMLRepair <<< "<a><b></b></a>" 4 <a> <b> </b> </a>
OK, now you're going to make your XML repairer a little more sophisticated. You will run the same parsing algorithm as before with a single stack of tags, but this time you are not just checking if the input is valid XML, you are actually repairing the XML as you go.
Just like with the last task, each time you encounter an open tag you add it to the stack. You should also add it to an output queue of tags. When you encounter a close tag that matches the top of the stack, you pop the stack and add the close tag to the output queue. But if the close tag doesn't match, you should keep popping from the stack and adding corresponding close tags, until you find the matching open tag for the current close tag. At the end (when the input is exhausted), you have to add another loop to add more closing tags for any open tags that remain on the stack.
By the way, this task is very similar to how HTML is usually handled. Web
browsers generally assume that the programmer might have forgotten some closing
</td> and adds them as necessary
to fix up the HTML.
Your program should now be able to do:
$ java XMLRepair <<< "<a><b></a>" 4 <a> <b> </b> </a>
This is the ultimate task: given any sequence of input tags, find the smallest possible additions that will make the input valid XML. You will have to keep track of many "possibilities", each of which has its own stack for matching opening/closing tags, as well as a queue of output tags. I recommend making a class to represent a "possibility", and then you will make a queue of possibilities. So yes, it will be a queue of stack+queue objects - awesome!
There are multiple ways to accomplish this, but here is one way that works:
create a queue of possibilities. add one "empty" possibility for each tag in the input: if it's an open tag: for each possibility: add the open tag to the possibility put the possibility back in the queue make a copy of the possibility and add the matching close tag put that new possibility in the queue as well else if it's a close tag: for each possibility: if the tag at the top of the stack matches, pop it. add the new close tag to the possibility put the possibility back in the queue for each possibility: add new closing tags to match anything left on the stack return the smallest queue of tags out of all possibilities
Note that a correct program for this step will be pretty slow - exponential-time in fact! Having an efficient Queue and Stack class will help with this, but I promise not to test your program on any input with too many open tags (no, I'm not going to say how many).
Your program should now be able to do:
$ java XMLRepair <<< "<a></b></a>" 4 <a> <b> </b> </a>
You can do this even if you don't complete everything else.
Right now you are probably printing every open or close tag on separate lines. Change this behavior so that an open/close tag pair with no tags in between prints on the same line. And while you're at it, indent each line according to the level of nesting.
$ java XMLRepair <<< "<a><b></b><b><c></c></b></a>" 8 <a> <b></b> <b> <c></c> </b> </a>
You can get all these files at once by downloading the file
tar xzvf proj1.tar.gz
Your program will be run by an auto-testing script. It must compile and work as specified; otherwise you will not get full credit for your work. You are encouraged to submit early and submit often. I strongly suggest submitting after every task you complete.
If you are only partially able to complete some task, but it does not actually work, you can leave that code commented out to show that you did some work on it. It is more important that your code works!
You must submit a README.txt file. You can put any notes to me in there,
but at the very least your
README.txt should list which tasks
you completed, as well as any sources you need to cite.