Next Project ▶

Java: TextDigestor

This is a program for analyzing the words in a plain-text file. The project focused on a design that makes additional text analysis tools quick to add: each one just implements the Analyzer interface. Controlling the individual analyzers is simple because the controlling code only needs to call the interface methods.

Let's begin with the Analyzer interface.

Show - Not much to see here.
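For readers who can't expand the source, here is a rough sketch of what an interface like this might look like. Only the method names processToken and writeOutputFile come from this write-up. The parameter types and the TokenCountAnalyzer class are assumptions made up for illustration, not the project's actual code.

```java
import java.io.PrintWriter;

// Hypothetical shape of the interface. Only the two method names are from the project.
interface Analyzer {
    // Called once for every token read from the input file.
    void processToken(String token);

    // Called once after the whole input has been read.
    // Taking a PrintWriter is an assumption that keeps the sketch easy to test.
    void writeOutputFile(PrintWriter out);
}

// A trivial analyzer invented for this sketch (not one of the six in the package).
class TokenCountAnalyzer implements Analyzer {
    private int count;

    @Override
    public void processToken(String token) {
        count++;
    }

    @Override
    public void writeOutputFile(PrintWriter out) {
        out.print("tokens: " + count + "\n");
    }
}
```

The controlling code only ever sees the Analyzer type, so adding a new analyzer means writing one new class and nothing else.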

We'll start skipping to the more interesting stuff now. We'll ignore the AnalyzerDriver class, which is just a main method that starts up an instance of the much more interesting AnalyzeFile class and passes it all the command-line arguments.

The AnalyzeFile object takes the location of the input file as an argument and opens the file for processing. It then creates an instance of each analyzer to be used. Each line of the input file is read and split on non-word characters with a regex to create individual text tokens. Every token is passed to all of the analyzers through the processToken method. When processing is complete, the writeOutputFile method is called on every analyzer and we're done.

Show - This is where most of the work happens.
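The processing loop described above can be sketched roughly like this. It is a minimal illustration, not the real AnalyzeFile class: the class name AnalyzeFileSketch, the nested stand-in interface, the TokenCounter helper, the run method, and the exact regex "\\W+" are all assumptions. The sketch reads from a BufferedReader so it can be fed a file or a plain string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.List;

class AnalyzeFileSketch {

    // Stand-in for the project's interface; the signatures are assumed.
    interface Analyzer {
        void processToken(String token);
        void writeOutputFile(PrintWriter out);
    }

    // Tiny example analyzer invented for this sketch.
    static class TokenCounter implements Analyzer {
        private int count;
        public void processToken(String token) { count++; }
        public void writeOutputFile(PrintWriter out) { out.print("tokens: " + count + "\n"); }
    }

    // Reads every line, splits it into tokens, and feeds each token to every analyzer.
    static void run(BufferedReader input, List<? extends Analyzer> analyzers, PrintWriter out) {
        try {
            String line;
            while ((line = input.readLine()) != null) {
                for (String token : line.split("\\W+")) {
                    // split() yields an empty string for blank lines or leading punctuation
                    if (token.isEmpty()) {
                        continue;
                    }
                    for (Analyzer analyzer : analyzers) {
                        analyzer.processToken(token);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Once the input is exhausted, let every analyzer write its report.
        for (Analyzer analyzer : analyzers) {
            analyzer.writeOutputFile(out);
        }
    }
}
```

Note the empty-token check: splitting a blank line still returns one empty string, which would otherwise be counted as a word.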

Now let's take a look at a couple of the more interesting analyzers. There are six in the package, but we'll just cover the good ones.

I wanted to really stress the program, so I tested it using the complete works of Edgar Rice Burroughs duplicated three times in a single file. It worked out to about 20GB. Thanks, Project Gutenberg! I thought Burroughs would have some interesting words and proper names to examine.

First up is the KeywordAnalyzer. It records the numerical position of every occurrence of each keyword in a list. The keyword list is stored in its own file, which is referenced in the properties file. Each keyword is stored in a map, paired with a List of Integers that records every position where it appears in the sequence of words.
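The map-of-lists idea can be sketched like this. The class name KeywordPositionSketch, the 1-based position counter, and the exact-match (case-sensitive) lookup are assumptions for illustration. The real analyzer loads its keywords from a file, which is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeywordPositionSketch {
    // Each keyword maps to the positions at which it was seen.
    private final Map<String, List<Integer>> positions = new LinkedHashMap<>();
    // Running count of tokens seen so far (positions are assumed to be 1-based).
    private int tokenIndex = 0;

    KeywordPositionSketch(List<String> keywords) {
        for (String keyword : keywords) {
            positions.put(keyword, new ArrayList<>());
        }
    }

    // Every token advances the position counter; only keywords are recorded.
    void processToken(String token) {
        tokenIndex++;
        List<Integer> hits = positions.get(token);
        if (hits != null) {
            hits.add(tokenIndex);
        }
    }

    Map<String, List<Integer>> getPositions() {
        return positions;
    }
}
```

A LinkedHashMap is used so the keywords come back out in the order they were listed, which seemed like a reasonable default for a strictly formatted report.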

The output had to follow a very strict format. This is where I encountered the most troubling bug in the entire project. I ended up calling it the triceratops bug. When formatting the output lines we were required to maintain a specific line length. When processing the positions I appended the final partial line to the list of complete lines for each keyword. Unfortunately, I didn't account for words where the list ended with a line of exactly the correct length. During initial testing everything worked fine because almost all words ended on a partial line. I continued adding words to the keyword list and testing. All of a sudden I was getting IndexOutOfBoundsExceptions. The last word I had added to the list was triceratops. Triceratops occurs exactly the right number of times in the text to end on a perfectly formed line. It turns out I was trying to add the closing ] to a List entry that didn't exist. Very bad, I know. This is what I took away from the whole thing:

  • Always consider edge cases.
  • Testing is good.
  • Don't assume you know something when you can test for it. I could have checked the length of the List easily.
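Here is a small sketch of a formatter that handles the triceratops case. It is not the project's code, and the output format is invented: the real requirement was a fixed line length, while this sketch simplifies that to a fixed number of positions per line. The class name PositionFormatterSketch and the perLine parameter are hypothetical. The point is that the closing ] goes on whatever the last line actually is, found by checking the size of the List.

```java
import java.util.ArrayList;
import java.util.List;

class PositionFormatterSketch {

    // Breaks the positions into lines of at most perLine entries and wraps them in [ ].
    static List<String> format(String keyword, List<Integer> positions, int perLine) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int onLine = 0;

        for (int position : positions) {
            if (onLine > 0) {
                current.append(", ");
            }
            current.append(position);
            onLine++;
            if (onLine == perLine) {
                // The line is full: store it and start a fresh one.
                lines.add(current.toString());
                current.setLength(0);
                onLine = 0;
            }
        }

        // Only add a partial line if one exists. A keyword whose count is an
        // exact multiple of perLine has no partial line at all.
        if (onLine > 0) {
            lines.add(current.toString());
        }
        // A keyword that never occurred still needs one line to hold the brackets.
        if (lines.isEmpty()) {
            lines.add("");
        }

        lines.set(0, keyword + " = [" + lines.get(0));
        // Ask the List where its last entry is instead of assuming.
        int last = lines.size() - 1;
        lines.set(last, lines.get(last) + "]");
        return lines;
    }
}
```

With two positions per line, a keyword with four occurrences ends on a full line and a keyword with three ends on a partial one. Both paths close the bracket correctly, and so does a keyword with no occurrences.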

Show - Home of the triceratops bug.

Show keyword_locations.txt - A pared down version of the KeywordAnalyzer output.

Finally we have the TokenSizeAnalyzer. This one tallies the number of words by length. The report it outputs was the most interesting part. It lists the number of occurrences for each word length. The fun part was graphing the lengths in plain text, in both a horizontal and a vertical orientation. For whatever reason I really enjoyed writing this one.

Show - Tallies words by length.

Show token_size.txt - TokenSizeAnalyzer report output.
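The tally-and-graph idea can be sketched as follows. The class name TokenSizeSketch, the perStar scaling parameter, and the row layout are assumptions. The real report format is in token_size.txt and is not reproduced here. Only the horizontal graph is shown.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenSizeSketch {
    // Word length -> number of tokens of that length, kept sorted by length.
    private final Map<Integer, Integer> countsByLength = new TreeMap<>();

    void processToken(String token) {
        countsByLength.merge(token.length(), 1, Integer::sum);
    }

    // One row per word length. Each '*' stands for perStar tokens,
    // with at least one star so rare lengths still show up.
    String horizontalHistogram(int perStar) {
        StringBuilder report = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : countsByLength.entrySet()) {
            int stars = Math.max(1, entry.getValue() / perStar);
            report.append(String.format("%2d | ", entry.getKey()));
            for (int i = 0; i < stars; i++) {
                report.append('*');
            }
            report.append('\n');
        }
        return report.toString();
    }
}
```

A vertical graph could be built from the same map by printing one row per count level, from the tallest bar down, and one column per length.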

I learned a lot on this project. The class it was completed for was my favorite so far. I really tightened up my use of methods. This is the class that taught me to think of my methods like sentences: each method should have a single clear idea and purpose. Thanks for scrolling all the way down here. You can see more of my Java projects by hitting the Next Project button. Or you could just go ahead and download my resume below. Thanks again.
