Java: TextDigestor
This is a program for analyzing the words in a plain text file. The project focused on a design that allows new text analysis tools to be added quickly by implementing the Analyzer interface. Controlling individual analyzers is simple because the controlling code only needs to interact with the interface methods.
Let's begin with the Analyzer interface.
Show Analyzer.java - Not much to see here.
package java112.analyzer;

/**
 * An Analyzer is a routine for analyzing text tokens and writing
 * a report to file.
 * @author Brian Manning
 */
public interface Analyzer {

    /**
     * Processes text tokens for analysis.
     * @param token String - A single text token for processing.
     */
    void processToken(String token);

    /**
     * Writes a report to file with analysis of all processed tokens.
     * @param inputFilePath String - Path of file to be analyzed.
     */
    void writeOutputFile(String inputFilePath);
}
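To show how a new tool would drop in, here is a hypothetical analyzer that just counts every token it sees. TokenTallyAnalyzer is an invented name for this illustration, not part of the project; the interface is repeated so the sketch compiles on its own.

```java
import java.io.*;

// The Analyzer interface from above, repeated so this sketch compiles alone.
interface Analyzer {
    void processToken(String token);
    void writeOutputFile(String inputFilePath);
}

// Hypothetical example analyzer: tallies every token it is handed.
public class TokenTallyAnalyzer implements Analyzer {

    private int total;

    public void processToken(String token) {
        total++;
    }

    public int getTotal() {
        return total;
    }

    public void writeOutputFile(String inputFilePath) {
        try (PrintWriter out = new PrintWriter(new FileWriter("tally.txt"))) {
            out.println(inputFilePath + ": " + total + " tokens");
        } catch (IOException exception) {
            exception.printStackTrace();
        }
    }

    public static void main(String[] args) {
        TokenTallyAnalyzer tally = new TokenTallyAnalyzer();
        tally.processToken("John");
        tally.processToken("Carter");
        System.out.println(tally.getTotal()); // prints 2
    }
}
```

The controlling code never needs to know what a tally is; it just calls the two interface methods.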
We'll skip ahead to the more interesting stuff now. We'll ignore the AnalyzerDriver class, which is just a main method that starts up an instance of the much more interesting AnalyzeFile class and passes it all of the command line arguments.
The AnalyzeFile object takes the input file's location as an argument and opens it for processing. It then creates an instance of each analyzer to be used. Each line of the input file is read and split on regex non-word characters to create individual text tokens. The tokens are passed to every analyzer through the processToken method. When processing is complete, the writeOutputFile method is called on every analyzer and we're done.
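The splitting step described above can be sketched in isolation. SplitDemo and tokenize are names invented for this example; the split on "\\W" can produce empty strings between consecutive non-word characters, which is why they get filtered out.

```java
import java.util.*;

// Demonstrates the tokenizing step: split a line on regex non-word
// characters and drop the empty strings the split can produce.
public class SplitDemo {

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split("\\W")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Consecutive delimiters (" -- ") yield empty strings; they are dropped.
        System.out.println(tokenize("A Princess of Mars -- 1912!"));
        // prints [A, Princess, of, Mars, 1912]
    }
}
```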
Show AnalyzeFile.java - This is where most of the work happens.
package java112.analyzer;

import java.io.*;
import java.util.*;

/**
 * Analyzes text files by tokenizing and passing to specific analyzers.
 * Manages all analyzers. Reads the input file one line at a time and
 * splits each line into individual text tokens on regex non-word
 * characters. Calls each analyzer's processing method for each
 * collected token. Calls the output report method for all analyzers.
 * @author Brian Manning
 */
public class AnalyzeFile {

    // Constant, stores the expected number of command line arguments.
    private static final int CORRECT_ARGUMENT_NUMBER = 2;
    // Stores all analyzers to be run.
    private List<Analyzer> analyzers;
    // Stores a string representing the path of the file being analyzed.
    private String inputFilePath;
    // Properties file for all analyzers.
    private Properties properties;

    /**
     * Manages all processing. Tests for expected command line arguments,
     * halts on failure. Opens the input file and splits it into individual
     * text tokens. Iterates through all tokens and calls the processing
     * methods for each analyzer. When processing is complete, calls the
     * output report methods for each analyzer.
     * @param arguments String[] - command line arguments passed
     * by AnalyzerDriver containing the path of the input file.
     */
    public void runAnalysis(String[] arguments) {
        if (arguments.length != CORRECT_ARGUMENT_NUMBER) {
            System.out.println(
                    "Enter path to input file, path to properties file");
            return;
        }
        inputFilePath = arguments[0];
        loadProperties(arguments[1]);
        startAnalyzers();
        openInputFile();
        writeAllOutputFiles();
    }

    /**
     * Loads the properties file for use with all analyzers.
     * @param propertiesFilePath String - path to the properties file.
     */
    public void loadProperties(String propertiesFilePath) {
        properties = new Properties();
        try {
            properties.load(
                    this.getClass().getResourceAsStream(propertiesFilePath));
        } catch (IOException ioe) {
            System.out.println("Can't load the properties file");
            ioe.printStackTrace();
        } catch (Exception e) {
            System.out.println("Problem: " + e);
            e.printStackTrace();
        }
    }

    /**
     * Opens the input file for processing.
     */
    private void openInputFile() {
        BufferedReader inputFile = null;
        try {
            inputFile = new BufferedReader(new FileReader(inputFilePath));
            readInputLine(inputFile);
        } catch (java.io.FileNotFoundException exception) {
            exception.printStackTrace();
        } catch (java.io.IOException exception) {
            exception.printStackTrace();
        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            try {
                if (inputFile != null) {
                    inputFile.close();
                }
            } catch (java.io.IOException exception) {
                exception.printStackTrace();
            } catch (Exception exception) {
                exception.printStackTrace();
            }
        }
    }

    /**
     * Iterates through all strings in the list of text tokens, tests for
     * empty strings, passes each token to the analyzers' processing methods.
     * @param tokenList String[] - List of all individual tokens
     * in the input line.
     */
    private void processTokenList(String[] tokenList) {
        for (String token : tokenList) {
            if (!token.isEmpty()) {
                processTokenForAllAnalyzers(token);
            }
        }
    }

    /**
     * Iterates through all analyzers and calls processing for the token.
     * @param token String - An individual text token.
     */
    private void processTokenForAllAnalyzers(String token) {
        for (Analyzer analyzer : analyzers) {
            analyzer.processToken(token);
        }
    }

    /**
     * Reads the input file one line at a time and passes lines to
     * splitLineOnWord.
     * @param inputFile BufferedReader - Contents of the input file.
     */
    private void readInputLine(BufferedReader inputFile)
            throws IOException, Exception {
        String inputLine;
        while (inputFile.ready()) {
            inputLine = inputFile.readLine();
            splitLineOnWord(inputLine);
        }
    }

    /**
     * Splits a single line from the input file on regex non-word characters.
     * Passes to processTokenList for processing by the analyzers.
     * @param line String - A single line of the input file.
     */
    private void splitLineOnWord(String line) {
        String[] lineList = line.split("\\W");
        processTokenList(lineList);
    }

    /**
     * Instantiates all analyzers and adds them to the list.
     */
    private void startAnalyzers() {
        analyzers = new ArrayList<Analyzer>();
        analyzers.add(new UniqueTokenAnalyzer(properties));
        analyzers.add(new SummaryReport(properties));
        analyzers.add(new BigWordAnalyzer(properties));
        analyzers.add(new TokenCountAnalyzer(properties));
        analyzers.add(new KeywordAnalyzer(properties));
        analyzers.add(new TokenSizeAnalyzer(properties));
    }

    /**
     * Calls the output report methods for all analyzers.
     */
    private void writeAllOutputFiles() {
        for (Analyzer analyzer : analyzers) {
            analyzer.writeOutputFile(inputFilePath);
        }
    }
}
Now let's take a look at a couple of the more interesting analyzers. There are a total of six in the package but we'll just cover the good ones.
I wanted to really stress the program so I tested using the complete works of Edgar Rice Burroughs duplicated three times in a single file. It worked out to about 20GB. Thanks Project Gutenberg! I thought Burroughs would have some interesting words and proper names to examine.
First up is the KeywordAnalyzer. It records the numerical position of every occurrence of a set of keywords. The keyword list is stored in its own file, which is referenced in the properties file. Each keyword is stored in a map paired with a List of Integers recording every position at which it appears in the sequence of words.
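That bookkeeping can be sketched like this. KeywordPositionsDemo and recordPositions are names invented for the example; the real analyzer builds the map from the keyword file and counts positions as tokens stream in one at a time.

```java
import java.util.*;

// Sketch of the KeywordAnalyzer bookkeeping: each keyword maps to a
// growing list of the positions at which it appears in the token stream.
public class KeywordPositionsDemo {

    public static Map<String, List<Integer>> recordPositions(
            List<String> keywords, List<String> tokens) {
        Map<String, List<Integer>> keywordMap = new TreeMap<>();
        for (String keyword : keywords) {
            keywordMap.put(keyword, new ArrayList<>());
        }
        int position = 0;
        for (String token : tokens) {
            position++; // 1-based position in the sequence of all tokens
            if (keywordMap.containsKey(token)) {
                keywordMap.get(token).add(position);
            }
        }
        return keywordMap;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> map = recordPositions(
                List.of("Mars", "Barsoom"),
                List.of("John", "Carter", "of", "Mars", "called", "it", "Barsoom"));
        System.out.println(map); // prints {Barsoom=[7], Mars=[4]}
    }
}
```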
The output had to follow a very strict format, and this is where I encountered the most troubling bug in the entire project. I ended up calling it the triceratops bug. When formatting the output lines we were required to maintain a specific line length. When processing the positions I appended the final partial line to the list of complete lines for each keyword. Unfortunately, I didn't account for words where the list ended with a line of exactly the correct length. During initial testing everything worked fine because almost all words ended on a partial line. I continued adding words to the keyword list and testing. All of a sudden I was getting IndexOutOfBoundsExceptions. The last word I added to the list was triceratops. Triceratops occurs exactly the right number of times in the text to end on a perfectly full line. It turned out I was trying to add the closing ] to a List entry that didn't exist. Very bad, I know. This is what I took away from the whole thing:
- Always consider edge cases.
- Testing is good.
- Don't assume you know something when you can test for it. I could have checked the length of the List easily.
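The guard I ended up needing can be sketched in isolation. ClosingBracketDemo and closeList are invented names for this example; the real method appears below as correctPositionListLastLine.

```java
import java.util.*;

// Sketch of the fix: before replacing the trailing comma with "]",
// check for the empty entry that a perfectly full final line leaves
// behind -- the triceratops case.
public class ClosingBracketDemo {

    public static List<String> closeList(List<String> lines) {
        String lastLine = lines.get(lines.size() - 1);
        if (lastLine.isEmpty()) {
            // The keyword ended on a perfectly full line, so the last
            // entry is an empty string; drop it before patching.
            lines.remove(lines.size() - 1);
            lastLine = lines.get(lines.size() - 1);
        }
        lastLine = lastLine.substring(0, lastLine.length() - 1) + "]";
        lines.set(lines.size() - 1, lastLine);
        return lines;
    }

    public static void main(String[] args) {
        // A position list that ended on a full line leaves a trailing
        // empty entry behind.
        List<String> lines = new ArrayList<>(List.of("[1, 2,", ""));
        System.out.println(closeList(lines).get(0)); // prints [1, 2]
    }
}
```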
Show KeywordAnalyzer.java - Home of the triceratops bug.
package java112.analyzer;

import java.util.*;
import java.io.*;

/**
 * Processes individual text tokens. Compares tokens to a list of keywords.
 * Records the position in the token sequence of occurrences of keywords.
 * Writes a report of positions for each keyword.
 * @author Brian Manning
 */
public class KeywordAnalyzer implements Analyzer {

    // Properties file for this analyzer.
    private Properties properties;
    // Stores the position of the token in the sequence of all tokens.
    private int tokenOccurence;
    // Stores a map of keyword tokens paired with lists of positions.
    private Map<String, List<Integer>> keywordMap;

    /**
     * Instantiates a TreeMap object as keywordMap on creation.
     */
    public KeywordAnalyzer() {
        keywordMap = new TreeMap<String, List<Integer>>();
    }

    /**
     * Constructor for the KeywordAnalyzer analyzer.
     * @param properties Properties - Properties file values for this analyzer.
     */
    public KeywordAnalyzer(Properties properties) {
        this();
        this.properties = properties;
        openKeywordFile();
    }

    /**
     * Accessor for the keywordMap instance variable.
     * @return keywordMap Map - A map of all keyword tokens paired with a
     * list of positions.
     */
    public Map<String, List<Integer>> getKeywordMap() {
        return keywordMap;
    }

    /**
     * Processes a single token. Tests for presence in the map, adds the
     * current position to the list if present.
     * @param token String - a single text token.
     */
    public void processToken(String token) {
        tokenOccurence++;
        if (keywordMap.containsKey(token)) {
            List<Integer> positions = keywordMap.get(token);
            positions.add(tokenOccurence);
        }
    }

    /**
     * Writes a text file listing each keyword and a list of the positions
     * in which it occurs.
     * @param inputFilePath String - Path of the file being analyzed.
     */
    public void writeOutputFile(String inputFilePath) {
        PrintWriter tokenReport = null;
        try {
            tokenReport = new PrintWriter(
                    new BufferedWriter(new FileWriter(getOutputFilePath())));
            writeKeywordList(tokenReport);
        } catch (java.io.FileNotFoundException exception) {
            exception.printStackTrace();
        } catch (java.io.IOException exception) {
            exception.printStackTrace();
        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            if (tokenReport != null) {
                tokenReport.close();
            }
        }
    }

    /**
     * Retrieves values from properties for the path to the output file.
     * @return outputFilePath String - Path of the output file.
     */
    public String getOutputFilePath() {
        return properties.getProperty("output.dir")
                + properties.getProperty("output.file.keyword");
    }

    /**
     * Loops through all entries in the keywordMap map and writes each
     * keyword followed by a list of all the positions in which it occurs.
     * @param keywordReport PrintWriter - Output file for the report.
     */
    private void writeKeywordList(PrintWriter keywordReport) {
        for (Map.Entry<String, List<Integer>> entry : keywordMap.entrySet()) {
            String keyword = entry.getKey();
            keywordReport.println(keyword + " =");
            List<String> lines = processKeywordPositions(entry.getValue());
            writeKeywordListLines(keywordReport, lines);
            keywordReport.println("");
        }
    }

    /**
     * Loops through all entries in the lines list and writes each line to
     * the report.
     * @param keywordReport PrintWriter - Output file for the report.
     * @param lines List - List of formatted token position report lines.
     */
    private void writeKeywordListLines(PrintWriter keywordReport,
            List<String> lines) {
        for (String line : lines) {
            keywordReport.println(line);
        }
    }

    /**
     * Tests for an empty positions list and returns a correctly formatted
     * empty list report. If not empty, passes the positions list to
     * processPositionListLines for formatting.
     * @param positions List - List of token positions.
     * @return lines List - List of formatted token position report lines.
     */
    private List<String> processKeywordPositions(List<Integer> positions) {
        List<String> lines = null;
        if (positions.isEmpty()) {
            lines = new ArrayList<String>();
            lines.add("[]");
            return lines;
        }
        lines = processPositionListLines(positions);
        return lines;
    }

    /**
     * Formats the keyword positions for report output. Begins with an
     * opening bracket on the first line. Adds positions followed by a comma
     * and space to the line until MAX_LINE_LENGTH is exceeded. When
     * MAX_LINE_LENGTH is exceeded, trims the trailing space, adds the line
     * to the lines list, and starts a new line. After all positions have
     * been processed, trims the trailing space from the final line and adds
     * it to the lines list. Finally passes to correctPositionListLastLine
     * for correction of the final line of the position list.
     * @param positions List - List of token positions.
     * @return lines List - List of formatted token position report lines.
     */
    private List<String> processPositionListLines(List<Integer> positions) {
        final int MAX_LINE_LENGTH = 75;
        List<String> lines = new ArrayList<String>();
        String line = "[";
        for (Integer position : positions) {
            line += position + ", ";
            if (line.length() > MAX_LINE_LENGTH) {
                lines.add(line.trim());
                line = "";
            }
        }
        lines.add(line.trim());
        lines = correctPositionListLastLine(lines);
        return lines;
    }

    /**
     * Replaces the trailing comma with a closing bracket on the last line
     * of the position list. Handles an extra empty last line by removing
     * it from the list.
     * @param lines List - List of formatted token position report lines.
     * @return lines List - List of formatted token position report lines.
     */
    private List<String> correctPositionListLastLine(List<String> lines) {
        String lastLine = lines.get(lines.size() - 1);
        if (lastLine.isEmpty()) {
            lines.remove(lines.size() - 1);
            lastLine = lines.get(lines.size() - 1);
        }
        lastLine = lastLine.substring(0, lastLine.length() - 1) + "]";
        lines.set(lines.size() - 1, lastLine);
        return lines;
    }

    /**
     * Opens the keyword file for processing.
     */
    private void openKeywordFile() {
        BufferedReader inputFile = null;
        try {
            inputFile = new BufferedReader(new FileReader(
                    properties.getProperty("file.path.keywords")));
            readInputLine(inputFile);
        } catch (java.io.FileNotFoundException exception) {
            exception.printStackTrace();
        } catch (java.io.IOException exception) {
            exception.printStackTrace();
        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            try {
                if (inputFile != null) {
                    inputFile.close();
                }
            } catch (java.io.IOException exception) {
                exception.printStackTrace();
            } catch (Exception exception) {
                exception.printStackTrace();
            }
        }
    }

    /**
     * Reads the keyword file one line at a time and adds entries to
     * keywordMap.
     * @param inputFile BufferedReader - Contents of the input file.
     */
    private void readInputLine(BufferedReader inputFile)
            throws IOException, Exception {
        String line = null;
        while (inputFile.ready()) {
            line = inputFile.readLine();
            if (!line.isEmpty()) {
                keywordMap.put(line, new ArrayList<Integer>());
            }
        }
    }
}
Show keyword_locations.txt - A pared down version of the KeywordAnalyzer output.
Barsoom =
[79551, 79731, 80543, 80656, 81653, 81808, 82411, 82505, 82812, 82870,
82929, 82977, 83938, 84041, 84653, 84727, 84737, 84775, 86178, 86567,
86682, 86958, 88127, 88251, 89465, 89517, 90671, 91342, 92780, 92958,
95615, 96174, 97549,
[...]
10623694, 10624948, 10625484, 10625575, 10627102, 10631989, 10632748,
10632794, 10633855, 10635111, 10637639, 10637859, 10638141, 10638836,
10640472, 10641004, 10643738, 10644087, 10644819, 10647552, 10648846,
10648851, 10649409]

Mars =
[78665, 78721, 79184, 79220, 79526, 79634, 79661, 79771, 80846, 82882,
83033, 84683, 86882, 93160, 105398, 110460, 115069, 120513, 126770,
151547, 162038, 167898, 168106, 168452, 895156, 936264, 1103908, 1103964,
1105269, 1105289,
[...]
10611346, 10613002, 10616233, 10620078, 10620272, 10626272, 10631844,
10637036, 10637272, 10638431, 10639562, 10643823, 10646536, 10648558,
10648773, 10648956, 10649227, 10650188]

disproportionately =
[1066486, 2922489, 4778492, 6634495, 8490498, 10346501]

expressionlessness =
[1012402, 2868405, 4724408, 6580411, 8436414, 10292417]

khsdfugiasf =
[]

triceratops =
[352958, 352970, 354181, 355923, 359253, 423766, 2208961, 2208973,
2210184, 2211926, 2215256, 2279769, 4064964, 4064976, 4066187, 4067929,
4071259, 4135772, 5920967, 5920979, 5922190, 5923932, 5927262, 5991775,
7776970, 7776982, 7778193, 7779935, 7783265, 7847778, 9632973, 9632985,
9634196, 9635938, 9639268, 9703781]
Finally we have the TokenSizeAnalyzer. This one tallies the number of words of each length. The report it outputs was the most interesting part. It lists the number of occurrences for each word length. The fun part was graphing the lengths in plain text in both horizontal and vertical orientations. For whatever reason I really enjoyed writing this one.
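The tallying idea is simple enough to sketch on its own. LengthTallyDemo and tallyLengths are invented names for this example; the real analyzer does the same thing one token at a time as they stream in.

```java
import java.util.*;

// Sketch of the TokenSizeAnalyzer tally: map each token length to the
// number of tokens with that length. A TreeMap keeps lengths sorted
// for the report.
public class LengthTallyDemo {

    public static Map<Integer, Integer> tallyLengths(List<String> tokens) {
        Map<Integer, Integer> tokenSizes = new TreeMap<>();
        for (String token : tokens) {
            // merge inserts 1 for a new length, otherwise adds 1.
            tokenSizes.merge(token.length(), 1, Integer::sum);
        }
        return tokenSizes;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> sizes =
                tallyLengths(List.of("a", "of", "to", "Mars", "ape"));
        System.out.println(sizes); // prints {1=1, 2=2, 3=1, 4=1}
    }
}
```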
Show TokenSizeAnalyzer.java - Tallies words by length.
package java112.analyzer;

import java.util.*;
import java.io.*;

/**
 * Processes individual text tokens and tallies the number of tokens for
 * each distinct token length. Writes a report listing each token length
 * followed by the number of occurrences. Draws a simple text histogram of
 * length frequency in horizontal and vertical orientations.
 * @author Brian Manning
 */
public class TokenSizeAnalyzer implements Analyzer {

    // Map that stores each token length and the number of occurrences of
    // that length.
    private Map<Integer, Integer> tokenSizes;
    // Stores properties for this analyzer.
    private Properties properties;
    // Stores the size of the largest token.
    private int maximumSize;

    /**
     * Constructor for TokenSizeAnalyzer. Instantiates the
     * tokenSizes Map as a TreeMap.
     */
    public TokenSizeAnalyzer() {
        tokenSizes = new TreeMap<Integer, Integer>();
    }

    /**
     * Constructor for TokenSizeAnalyzer.
     * @param properties Properties - Properties file values for this analyzer.
     */
    public TokenSizeAnalyzer(Properties properties) {
        this();
        this.properties = properties;
    }

    /**
     * Accessor for the tokenSizes instance variable.
     * @return tokenSizes Map - Map of all token sizes and numbers of
     * occurrences.
     */
    public Map<Integer, Integer> getTokenSizes() {
        return tokenSizes;
    }

    /**
     * Accessor for the maximumSize instance variable.
     * @return maximumSize int - The size of the largest token.
     */
    public int getMaximumSize() {
        return maximumSize;
    }

    /**
     * Processes a single token. Tests for the presence of the token size
     * in the map, adds it if not present, increments it if present.
     * @param token String - a single text token.
     */
    public void processToken(String token) {
        if (tokenSizes.containsKey(token.length())) {
            int value = tokenSizes.get(token.length());
            tokenSizes.put(token.length(), ++value);
        } else {
            tokenSizes.put(token.length(), 1);
        }
    }

    /**
     * Writes a text file listing each length, quantity pair on a new line,
     * followed by the horizontal and vertical histograms.
     * @param inputFilePath String - Path of the file being analyzed.
     */
    public void writeOutputFile(String inputFilePath) {
        PrintWriter tokenReport = null;
        try {
            tokenReport = new PrintWriter(
                    new BufferedWriter(new FileWriter(getOutputFilePath())));
            writeTokenSizeReport(tokenReport);
            tokenReport.println("");
            writeTokenHistogram(tokenReport);
            tokenReport.println("");
            writeVerticalTokenHistogram(tokenReport);
        } catch (java.io.FileNotFoundException exception) {
            exception.printStackTrace();
        } catch (java.io.IOException exception) {
            exception.printStackTrace();
        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            if (tokenReport != null) {
                tokenReport.close();
            }
        }
    }

    /**
     * Writes each length, quantity pair on a new line.
     * @param tokenReport PrintWriter - Output file for the report.
     */
    private void writeTokenSizeReport(PrintWriter tokenReport) {
        for (Map.Entry<Integer, Integer> entry : tokenSizes.entrySet()) {
            tokenReport.println(entry.getKey() + " " + entry.getValue());
        }
    }

    /**
     * Writes a histogram composed of asterisks reflecting the values for
     * each key in the tokenSizes map.
     * @param tokenReport PrintWriter - Output file for the report.
     */
    private void writeTokenHistogram(PrintWriter tokenReport) {
        // Value for the number of columns in the report.
        final int MAX_LINE_LENGTH = 75;
        // Value for the number of spaces before the graph starts.
        final int WHITESPACE_BUFFER = 4;
        double mostTokens = Collections.max(tokenSizes.values());
        double scale = mostTokens / MAX_LINE_LENGTH;
        for (Map.Entry<Integer, Integer> entry : tokenSizes.entrySet()) {
            String line = addCharacters(
                    entry.getKey().toString(),
                    WHITESPACE_BUFFER - entry.getKey().toString().length(),
                    ' ');
            line = addCharacters(
                    line, (int) Math.ceil(entry.getValue() / scale), '*');
            tokenReport.println(line);
        }
    }

    /**
     * Adds the specified character the specified number of times to
     * the input string.
     * @param input String - string to add characters to.
     * @param quantity int - number of characters to add.
     * @param character char - character to add.
     * @return input String - The input string with the specified
     * characters added.
     */
    private static String addCharacters(String input, int quantity,
            char character) {
        char[] repeat = new char[quantity];
        Arrays.fill(repeat, character);
        input += new String(repeat);
        return input;
    }

    /**
     * Writes a histogram composed of asterisks reflecting the values for
     * each key in the tokenSizes map in vertical orientation.
     * @param tokenReport PrintWriter - Output file for the report.
     */
    private void writeVerticalTokenHistogram(PrintWriter tokenReport) {
        // Value for the number of rows in the report.
        final int MAX_COLUMN_HEIGHT = 34;
        // Value for the number of spaces for each row entry.
        final int WHITESPACE_BUFFER = 3;
        double mostTokens = Collections.max(tokenSizes.values());
        double scale = mostTokens / MAX_COLUMN_HEIGHT;
        int currentRow = (int) Math.ceil(mostTokens / scale);
        while (currentRow > 0) {
            tokenReport.println(buildVerticalHistogramRow(
                    currentRow, scale, WHITESPACE_BUFFER));
            currentRow--;
        }
        tokenReport.println(buildVerticalHistogramFooter(WHITESPACE_BUFFER));
    }

    /**
     * Loops through the tokenSizes map and creates a row for the vertical
     * histogram by assigning an asterisk or space to the position for that
     * token length. Draws the histogram row passed in by currentRow.
     * @param currentRow int - the current row of the vertical histogram,
     * starting from the top.
     * @param scale double - the scale factor used to adjust the raw number
     * of token occurrences.
     * @param entryLength int - the exact character size for each entry in
     * the row.
     * @return line String - The completed row of the histogram.
     */
    private String buildVerticalHistogramRow(int currentRow, double scale,
            int entryLength) {
        String line = "";
        String marker = "*";
        for (int value : tokenSizes.values()) {
            if ((int) Math.ceil(value / scale) >= currentRow) {
                line += addCharacters(
                        marker, entryLength - marker.length(), ' ');
            } else {
                line = addCharacters(line, entryLength, ' ');
            }
        }
        return line;
    }

    /**
     * Loops through all the keys in the tokenSizes map and formats the
     * token length values as labels for the vertical histogram.
     * @param entryLength int - exact character size for each entry in
     * the row.
     * @return line String - The completed footer of the histogram.
     */
    private String buildVerticalHistogramFooter(int entryLength) {
        String line = "";
        for (Integer key : tokenSizes.keySet()) {
            line += addCharacters(
                    key.toString(), entryLength - key.toString().length(), ' ');
        }
        return line;
    }

    /**
     * Retrieves values from properties for the path to the output file.
     * @return outputFilePath String - Path of the output file.
     */
    public String getOutputFilePath() {
        return properties.getProperty("output.dir")
                + properties.getProperty("output.file.token.size");
    }
}
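The scaling behind the histograms can be shown on its own: the biggest count gets the full line width and every other count is scaled to match. HistogramScaleDemo and bar are names invented for this example.

```java
// Sketch of the histogram scaling: the largest tally fills the maximum
// line length, and every other tally is divided by the same scale factor.
public class HistogramScaleDemo {

    public static String bar(int count, int mostTokens, int maxLineLength) {
        double scale = (double) mostTokens / maxLineLength;
        // Math.ceil guarantees every nonzero count draws at least one star.
        int stars = (int) Math.ceil(count / scale);
        return "*".repeat(stars);
    }

    public static void main(String[] args) {
        System.out.println(bar(75, 75, 75).length()); // prints 75
        System.out.println(bar(10, 75, 75).length()); // prints 10
    }
}
```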
Show token_size.txt - TokenSizeAnalyzer report output.
1 462306
2 1810674
3 2748972
4 2004060
5 1288062
6 1004826
7 738930
8 442902
9 325824
10 171984
11 75378
12 38838
13 14940
14 5880
15 1926
16 444
17 60
18 12

1   *************
2   **************************************************
3   ***************************************************************************
4   *******************************************************
5   ************************************
6   ****************************
7   *********************
8   *************
9   *********
10  *****
11  ***
12  **
13  *
14  *
15  *
16  *
17  *
18  *

      *
      *
      *
      *
      *
      *
      *
      *
      *
      *  *
      *  *
   *  *  *
   *  *  *
   *  *  *
   *  *  *
   *  *  *
   *  *  *
   *  *  *
   *  *  *  *
   *  *  *  *
   *  *  *  *
   *  *  *  *  *
   *  *  *  *  *
   *  *  *  *  *
   *  *  *  *  *  *
   *  *  *  *  *  *
   *  *  *  *  *  *
   *  *  *  *  *  *
*  *  *  *  *  *  *  *
*  *  *  *  *  *  *  *  *
*  *  *  *  *  *  *  *  *
*  *  *  *  *  *  *  *  *  *
*  *  *  *  *  *  *  *  *  *
*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18

I learned a lot on this project. The class it was completed for was my favorite so far. I really tightened up my use of methods. This is the class that taught me to think of my methods like a sentence. Each method should have a single clear idea and purpose. Thanks for scrolling all the way down here. You can see more of my Java projects by hitting the Next Project button. Or you could just go ahead and download my resume below. Thanks again.