Table of Contents
Requirements
Grading
Getting Started
For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, builds an in-memory inverted index from that text, and outputs it JSON format.
The process of stemming reduces a word to a base form (or “stem”), so that words like interesting
, interested
, and interests
all map to the stem interest
. Stemming is a common preprocessing step in many search engines.
An inverted index is a nested data structure that stores the mapping from words to the documents and positions within those documents where those words were found. It is a common in-memory data structure used by many search engines.
Requirements
The following detail the functionality requirements that must be implemented for this project.
Input Requirements
Your main
method must be placed in a class named Driver
and must process the following command-line arguments:
-
-text [path]
where the flag-text
indicates the next argument[path]
is a path to either a single file or a directory.If the value is a single file (regardless of its extension), open and add that file to an inverted index. If the value is a directory, find and add all of the text files (with
.txt
and.text
extensions) in that directory and its subdirectories to an inverted index.See the “Build Requirements” below for details on how to process the files and the “Data Requirements” for how to store that information in an inverted index.
-
-index [path]
where the flag-index
is flag that indicates the inverted index should be output to a JSON file, and the next optional argument[path]
is the path to use for the output file. If the optional path argument is not provided, useindex.json
as the default output path.If the
-index
flag is provided, always generate output even if it is empty. If the-index
flag is not provided, do not produce an output file.See the “Output Requirements” below for details.
The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main()
method output a stack trace to the user!
Build Requirements
If the appropriate command-line arguments are provided, process and add files to the inverted index data structure from the provided input path. Specifically:
-
If a single file is provided as input, add that individual file regardless of its extension to the inverted index.
-
If a directory is provided as input, add all text files within that directory and all subdirectories to the inverted index. Any files that end in the
.text
or.txt
extension (case insensitive) should be considered a text file. -
Use the
UTF-8
character encoding and efficient buffered reading for all file processing.
When adding a file to the inverted index, clean and parse the text into word stems by removing any non-letter symbols (including digits, punctuation, accents, special characters), convert the remaining alphabetic characters to lowercase, split the text into words by whitespace, and then stem the word using the Apache OpenNLP toolkit. Specifically:
-
Use the regular expression
(?U)[^\\p{Alpha}\\p{Space}]+
to remove special characters from text. -
Use the regular expression
(?U)\\p{Space}+
to split text into words by whitespace. -
Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
Data Requirements
Your code should use with an inverted index data structure to store processed files. The inverted index must store a mapping from a word to the location(s) the word was found, and the position(s) in that file the word is located. The positions should start at 1.
For example, suppose we have the following mapping stored in our inverted index:
{
"capybara": {
"input/mammals.txt": [
11
]
},
"platypus": {
"input/dangerous/venomous.txt": [
2
],
"input/mammals.txt": [
3,
8
]
}
}
This indicates that the word capybara
is found in the file input/mammals.html
in position 11
. The word platypus
is found in two files, input/mammals.html
and input/dangerous/venomous.html
. In the file input/mammals.html
, the word platypus
appears twice in positions 3
and 8
. In file input/dangerous/venomous.html
, the word platypus
is in position 2
in the file.
This will require nesting multiple built-in data structures.
Output Requirements
If the appropriate command-line arguments are provided, the contents of your inverted index should be output in alphabetically sorted order as a nested JSON object using a “pretty” format using two
space characters for indentation.
According to the JSON standard, numbers like integers should never be quoted. Any string or object key, however, should always be surrounded by "
quotes. Objects (similar to maps) should use {
and }
curly braces, and arrays should use [
and ]
square brackets. Make sure there are no trailing commas after the last element. For example:
{
"capybara": {
"input/mammals.txt": [
11
]
},
"platypus": {
"input/dangerous/venomous.txt": [
2
],
"input/mammals.txt": [
3,
8
]
}
}
The paths should be output in the form they were originally provided. The tests use normalized relative paths, so the output should also be normalized relative paths. As long as command-line parameters are not converted to absolute form, this should be the default output provided by the path object.
Use the UTF-8
character encoding and efficient buffered writing to produce the necessary output.
The project tests account for different path separators (forward slash /
for Linux/Mac systems, and backward slash \
for Windows systems). Your code does not have to convert between the two!
Run Examples
The following are a few non-comprehensive examples to illustrate the usage of the command-line arguments that can be passed to your Driver
class via a “Run Configuration” in Eclipse, assuming you set the working directory to the project-tests
directory.
Consider the following example:
-text "input/text/simple/hello.txt"
-index "actual/index-simple-hello.json"
The above arguments indicate that Driver
should build an inverted index from the single hello.txt
file in the input/text/simple
subdirectory of the current working directory, and output the inverted index as JSON to the index-simple-hello.json
file in the actual
subdirectory.
-text "input/text/simple/" -index
The above arguments indicate that Driver
should build an inverted index from all of the text files found in the text/simple
subdirectory of the current working directory, and output the inverted index as JSON to the default path index.json
in the current working directory.
-text "input/text/simple/"
The above arguments indicate that Driver
should build an inverted index from all of the text files found in the input/text/simple
subdirectory of the current working directory, but it should not produce an output file. It must still build the inverted index however! (This will be useful in the future when we add the ability to search.)
-index
The above arguments indicate that Driver
should output the inverted index as JSON to the default path index.json
in the current working directory. The inverted index will be empty, since no -path
flag argument was provided.
Grading
This project grade is split into the following assignments:
Assignment | Points | Deadline | Release | Prerequisites |
---|---|---|---|---|
Project 1 Tests | 100 | Sep 21, 2022 | v1.0.Z |
Project1Test.java , Test Checks |
Project 1 Review 1 (30 min) | 30 | Oct 12, 2022 | v1.1.Z |
Project 1 Tests, Review Checks |
Project 1 Review 2 (15 min)1 | 20 | Oct 19, 2022 | v1.2.Z |
Project 1 Review 1, Review Checks |
Project 1 Review Y (15 min)2 | 0 | Oct 26, 2022 | v1.Y.Z |
Project 1 Review 2, Review Checks |
Project 1 Design | 50 | Oct 26, 2022 | v1.Y.Z |
Project 1 Review 2, Review Checks, Passing Pull Request |
1 You should start on project 2 after receiving a grade for Project 1 Review 2—do NOT wait to finish this project to start the next one!
2 It takes approximately 4 to 5 code reviews to pass the design of this project. Keep in mind you may have only 30 minutes worth of code review appointments every 5 days. Due to weekends, that works out to approximately 3 weeks or more of code reviews.
See the Project Grading guide for details on how projects are graded.
Getting Started
The following sections may be useful for getting started on this project.
Related Homework
The following homework assignments may be useful for this project:
-
The ArgumentParser homework is useful for processing the command-line arguments. This homework can be used directly for your project.
-
The WordCleaner homework is useful for converting text files into word stems. This homework can be used directly for your project.
-
The WordIndex homework is useful as an example for designing a data structure like an inverted index.
This homework should NOT be used directly for your project; a forward index has a different underlying structure than an inverted index!
-
The PrettyJsonWriter homework is useful for producing pretty JSON output. This homework can be used directly and extended as needed for your project.
-
The FileFinder homework is useful for listing all of the text files in a directory using functional programming. This can also be done without functional programming using examples from the file IO lectures.
You can modify homework assignments as necessary for this project. However, make sure your code is passing all of the tests before using.
You should NOT wait until you have completed all of the associated homework assignments to start the project. You should develop the project iteratively as you progress throughout the semester, integrating assignments one at a time into your project code.
Related Content
The following lecture content may be useful for this project:
-
The lectures on exception handling may be useful for interacting with the user.
-
The lectures on file IO may be useful for reading and writing large files efficiently.
-
The lectures on data structures may be useful for creating and iterating through data structures in an efficient and generalized way.
-
The lectures on object-oriented programming may be useful for designing reusable and generalized code.
-
The lectures on inheritance and functional programming may also be helpful, but are NOT required for core functionality.
You can use and modify lecture code as necessary for this project. However, make sure you understand the concepts before using the code.
You should NOT wait until you have covered all of the associated lecture content to start the project. You should develop the project iteratively as you progress throughout the semester, integrating concepts one at a time into your project code.
Suggestions
Your goal should be to get to testable code as quickly as possible first, and then developing iteratively to pass the functionality tests. One possible breakdown of tasks are:
-
Figure out how to run tests individually and focus only on the first test method until that test passes. Do NOT run all of the tests at once!
-
Create code that handles parsing command-line arguments for the first test. Output the parsed arguments to the console.
-
Create code that is able to parse the single text file specified by the first test into word stems. Output the stems to the console.
-
Create code that is able to write the word stems to the JSON file specified by the first test. Do NOT worry about the data structure or format of the output file yet!
-
Compare the actual output file produced by your code to the expected file. While the output format will not match, you want to manually check the same stems are found in both files.
You now have testable code at this point. Once you have testable code, try working on the following:
-
Create the data structure that will store your inverted index.
-
Create code to output your data structure to file, but not yet in the right format. Manually check that the word stems, path locations, and word positions match the expected output.
-
Create code to output your data structure to file in proper “pretty” JSON format.
Start trying to pass all of the single file tests. Do not move on until the tests for single files are passing. Then, try the following:
-
Create code to find all of the text files in a directory and output those to the console. Manually check that it is finding the right set of text files.
-
Create code (ideally reusing what you already have) to process the text files into your inverted index data structure.
Then, start trying to pass the directory tests. Do not move on until those tests are passing. Finally, work on the exception tests last after everything else is passing.
It is important to get started early so you have plenty of time to think about how you want to approach the project and start coding iteratively. Planning to complete the code in too large of a chunk is a recipe to get stuck and fall behind!
These hints may or may not be useful depending on your approach. Do not be overly concerned if you do not find these hints helpful for your approach for this project.