CS 272 Software Development

CS 272-03 • Fall 2022

Project 4 Search Engine

Table of Contents

  • Requirements
  • Extra Features
  • Grading
  • Getting Started


For this project, you will extend your previous project to create a multithreaded search engine web interface that allows users to enter search queries into an HTML form and get back search results as a dynamically generated web page using embedded Jetty and servlets. This project will be graded and reviewed during finals week only.

This writeup is for the search engine functionality only. See the general Project 4 Writeup for more details.

Requirements

The following sections detail the functionality requirements that must be implemented for this project.

Input Requirements

Your main method must be placed in a class named Driver and must process the following command-line arguments:

  • -server [port] where the flag -server indicates to launch a multithreaded search engine web server, and the next optional argument [port] is the port the web server should use to accept socket connections. Use 8080 as the default value if it is not provided.

    If the -server flag is provided, your code should enable multithreading with the default number of worker threads even if the -threads flag is not provided.
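One way to handle the optional [port] value is sketched below. The class and method names (ServerFlag, hasServer, parsePort) are hypothetical and not part of any provided code. Falling back to 8080 when the next argument is another flag or an unusable number is an assumption; the requirements only state the default for a missing value.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for reading the optional port after -server. */
class ServerFlag {
    /** Default port stated in the requirements. */
    static final int DEFAULT_PORT = 8080;

    /** Returns true if the -server flag appears anywhere in the arguments. */
    static boolean hasServer(String[] args) {
        return Arrays.asList(args).contains("-server");
    }

    /**
     * Returns the port following -server, or the default if the flag is
     * missing, is the last argument, or is followed by another flag or an
     * out-of-range value (the last two cases are assumptions).
     */
    static int parsePort(String[] args) {
        List<String> list = Arrays.asList(args);
        int index = list.indexOf("-server");

        if (index < 0 || index + 1 >= list.size()) {
            return DEFAULT_PORT;
        }

        try {
            int port = Integer.parseInt(list.get(index + 1));
            return (port > 0 && port <= 65535) ? port : DEFAULT_PORT;
        }
        catch (NumberFormatException e) {
            // the next argument is not a number, e.g. another flag like -threads
            return DEFAULT_PORT;
        }
    }
}
```

With these assumptions, the arguments "-server 9090" yield port 9090, while "-server -threads 3" and a bare "-server" both yield 8080.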

The command-line flag/value pairs may be provided in any order or not at all. Do not convert paths to absolute form when processing command-line input!

Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!

Your code should support all of the command-line arguments from the previous project as well.

Core Functionality (20 Points)

The functionality for this project is broken into 2 parts: core functionality (20 points) and extra features (30 points). You must complete the core functionality before extra features.

The core functionality includes, in addition to maintaining the functionality of the previous project, the following for a total of 20 points:

Points Functionality Description
5 Web Form: Display a web page with a form that includes (at a minimum) a text box where users may enter a multi-word search query and a button to submit that query to the web server.
5 Query Processing: When the web form button is clicked, send the queries to a Jetty servlet and process those queries to match how the data is stored by your inverted index.
5 Partial Search: After processing the queries, the servlet should retrieve the partial search results of those queries from the index generated by the Driver class. The servlet implementation should be thread-safe and multi-user friendly.
5 Search Results: The servlet should return the partial search results to the client (or web browser) as dynamically generated HTML with sorted (most relevant first) and clickable hyperlinks.
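The Query Processing step can be sketched roughly as follows. This assumes the inverted index stores lowercase words made of ASCII letters only; if your index also stems words, the same stemming must be applied here, and it is left out of this sketch. QueryCleaner is a hypothetical name.

```java
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical query normalizer; adapt it to match how your index stores words. */
class QueryCleaner {
    /**
     * Lowercases the raw form input, removes every character that is not an
     * ASCII letter or whitespace, and returns the unique words in sorted
     * order. Stemming is intentionally omitted from this sketch.
     */
    static TreeSet<String> clean(String raw) {
        TreeSet<String> words = new TreeSet<>();

        if (raw == null) {
            return words; // the parameter was missing from the request
        }

        String cleaned = raw.toLowerCase(Locale.ROOT)
            .replaceAll("[^\\p{Alpha}\\s]+", "");

        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }

        return words;
    }
}
```

Because the method works only on its parameter and local variables, it can be called from several request threads at once without sharing state between users.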

You cannot earn credit for extra features until the core functionality is working properly.

Extra Features (30 Points)

Once the core functionality is complete, you may implement 30 points of extra features. These features are broken into several categories. You may choose any combination of features from these categories. See the Extra Features section below for options.

You may also complete more than 30 points for extra credit. See the Extra Credit section for details.

Output Requirements

The output of this project should be the same as in the previous project, except that search results will primarily be output as dynamically generated HTML pages instead of JSON files. The search result output should include a clickable HTML link to the web page and must be presented in the same order as previous projects, but otherwise can have any formatting desired.
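As a rough illustration of the HTML output, the sketch below turns an already sorted list of result URLs into an ordered list of clickable links. ResultRenderer and its methods are hypothetical names, and the markup is only one possible format. It builds the page with a StringBuilder and escapes each value before placing it in the HTML.

```java
import java.util.List;

/** Hypothetical renderer that turns an ordered list of result URLs into HTML. */
class ResultRenderer {
    /** Escapes the five characters that matter inside HTML text and attributes. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an ordered list of links, keeping the order of the given results. */
    static String render(List<String> urls) {
        StringBuilder html = new StringBuilder("<ol>\n");
        for (String url : urls) {
            String safe = escape(url);
            html.append("  <li><a href=\"").append(safe).append("\">")
                .append(safe).append("</a></li>\n");
        }
        html.append("</ol>\n");
        return html.toString();
    }
}
```

The caller is expected to pass the results most relevant first; the renderer does not reorder them.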

As before, your code should only generate output files if the necessary flags are provided. If the correct flags are provided, your code should perform the indexing and search operations even if file output is not being generated.

Run Examples

The following are a few examples (non-comprehensive) to illustrate the usage of the command-line arguments that can be passed to your Driver class via a “Run Configuration” in Eclipse, assuming you set the working directory to the project-tests directory.

Consider the following example:

-html "https://usf-cs272-fall2022.github.io/project-web/input/simple/" -max 15 -threads 3 -server 8080

The above arguments behave the same as in Project 3, except that they also start a web server on port 8080 for the user to interface with the search engine. No file output will be generated in this example.

Extra Features

Once the core functionality is complete, you may implement 30 points of extra features. These features are broken into several categories. You may choose any combination of features from these categories.

Have a feature idea? You can propose an extra feature in a public post on the course forums. If approved, the instructor will post the number of points that feature will be worth on the final project.

User Tracking Features

The following features require your search engine to track user data. There are two implementation options (choose one):

  1. Base Functionality: Implemented by storing data in memory; only supports a single user.

  2. Extra Functionality: Implemented by storing data using session tracking or cookies; supports multiple users.

Ideally, you should use the same implementation option for all features in this subcategory. For example, if you implement search history using sessions, you should also implement visited results using sessions.

The possible features are:

Base Extra Description
5 10 Search History: Store a history of all search queries. Allow users to view and clear that history.
5 10 Visited Results: Store a history of all visited search results (i.e. results clicked on). Allow users to view and clear that history.
5 10 Favorite Results: Allow users to save favorite search results. Allow users to view and clear those favorites.
5 5 Time Stamps: Add timestamps to each item stored. Implement this for all related features to earn full credit.
5 5 Private Search: Allow users to set an option that turns off all tracking of user data. Implement this for all related features to earn full credit.
5 5 Last Visit Time: Track and display the last time a user visited your search engine. This is NOT the current time that the page was generated!

There are 30 to 45 points possible in this category, depending on whether you choose to implement base or extra functionality.
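A minimal sketch of the base (in-memory, single-user) option for Search History, combined with Time Stamps and Private Search, might look like the following. SearchHistory and its members are hypothetical names. The extra option would keep one such object per user, for example in a session, which is not shown here.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory search history (base functionality, single user). */
class SearchHistory {
    /** One stored query plus the time it was recorded. */
    static final class Entry {
        final String query;
        final Instant when;

        Entry(String query, Instant when) {
            this.query = query;
            this.when = when;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private boolean privateMode = false;

    /** Turns tracking off (true) or back on (false). */
    synchronized void setPrivate(boolean enabled) {
        privateMode = enabled;
    }

    /** Records a query with the current time unless private mode is on. */
    synchronized void add(String query) {
        if (!privateMode) {
            entries.add(new Entry(query, Instant.now()));
        }
    }

    /** Returns a copy so callers cannot modify the internal list. */
    synchronized List<Entry> view() {
        return new ArrayList<>(entries);
    }

    /** Removes every stored query. */
    synchronized void clear() {
        entries.clear();
    }
}
```

Every method is synchronized because a web server may handle several requests at once, even when only one user is supported.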

Metadata Features

The following features require your search engine to track search metadata (not specific to users). There are two implementation options:

  1. Base Functionality: Track metadata in memory (non-persistent).

  2. Extra Functionality: Track metadata in the on-campus SQL database (persistent).

Ideally, you should use the same implementation option for all features in this subcategory. For example, if you implement page snippets using a database, you should also implement popular queries using a database.

The possible features are:

Base Extra Description
10 20 Page Snippets: When a web page is crawled, store a short snippet of the page. Display the snippet whenever that page is returned as a result.
10 20 Page Statistics: When a web page is crawled, store the page title (via the <title> tag in HTML), content length (via the Content-Length HTTP header), and timestamp of the crawl. Display these statistics whenever that page is returned as a result.
5 10 Most Visited Results: Track the number of times a page has been visited by any user. Allow users to see the top 5 visited pages.
5 10 Most Searched Queries: Track the number of times a multi-word query has been searched for. Allow users to see the top 5 most popular queries.
5 10 Reset Metadata: Allow users with an administrator password to clear all the metadata stored.

Some features require others to be implemented first. For example, Reset Metadata cannot be implemented until at least one of the other features that stores metadata is implemented.

There are 35 to 70 points possible in this category, depending on whether you choose to implement base or extra functionality.
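For the base (in-memory) option, Most Searched Queries could be tracked along these lines. QueryCounter is a hypothetical name, and breaking ties alphabetically is an assumption, since the feature description does not say how ties are ordered.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical in-memory counter for the Most Searched Queries feature. */
class QueryCounter {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    /** Atomically adds one to the count for this query. */
    void record(String query) {
        counts.merge(query, 1, Integer::sum);
    }

    /** Returns up to limit queries, most searched first, ties in alphabetical order. */
    List<String> top(int limit) {
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(limit)
            .map(entry -> entry.getKey())
            .collect(Collectors.toList());
    }

    /** Clears all counts, as a Reset Metadata action might. */
    void reset() {
        counts.clear();
    }
}
```

Calling top(5) gives the list needed for the "top 5" display. The same counting approach would apply to Most Visited Results with URLs as keys.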

Extendable Features

The following features have base functionality that can be extended with additional functionality. The base functionality must be implemented first.

Base Extra Base Functionality / Extended Functionality
5 10 New Seed: Allow a user to specify a new seed URL that should be added to the existing inverted index. If the URL has already been crawled, skip crawling that URL and output a warning to the user.
     Extended (Max Support): In addition to entering a new seed URL, allow the user to also specify a maximum number of pages to crawl. This is the maximum number of new pages to crawl in addition to the pages already crawled. URLs that are already included in the inverted index should be skipped and should not contribute to this maximum count.
5 10 Index Browser: Allow users to browse your inverted index as an HTML page with all of the words stored, clickable links to all of the indexed URLs for those words, and the number of positions stored for that word and location (but not list all of the positions).
     Extended (Subindex Browser): Allow users to enter a specific word and display the data stored in your inverted index for that specific word.
5 10 Location Browser: Allow users to browse all of the locations and their word counts stored by your inverted index as an HTML page with clickable links to all of the indexed URLs.
     Extended (Partial Location Search): Allow users to browse all of the locations and their word counts for locations that start with the same text. For example, browse all locations that start with “https://www.cs.usfca.edu/~cs272”.
5 10 Index JSON File: Allow users to download a JSON file of your inverted index by browsing to a specific endpoint on your web server. For example, if users visit “/download”, it returns an index.json file they can download to their system.
     Extended (Alternative Format): Allow users to download a file in another structured standardized file format (XML, YAML, etc.) by browsing to a specific endpoint on your web server. For example, if users visit “/download?file=index&type=yaml”, it returns an index.yaml file they can download to their system.

For example, if you implement base functionality for New Seed, you will earn 5 points. If you implement both base and extra functionality for New Seed (including Max Support), you will earn 10 points instead.

There are 20 to 40 points possible in this category (four features at 5 or 10 points each), depending on whether you choose to implement base or extra functionality.
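Partial Location Search can take advantage of sorted storage. The sketch below assumes word counts are kept in a TreeMap keyed by location. LocationCounts is a hypothetical class, and a real index shared between threads would still need its own locking, which is not shown.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical word-count store used to illustrate prefix lookups on locations. */
class LocationCounts {
    private final TreeMap<String, Integer> counts = new TreeMap<>();

    /** Stores the word count for a location. */
    void put(String location, int words) {
        counts.put(location, words);
    }

    /** Returns the locations (and word counts) that start with the given prefix. */
    SortedMap<String, Integer> startsWith(String prefix) {
        TreeMap<String, Integer> matches = new TreeMap<>();

        // In sorted order, all matches form one contiguous run starting at the prefix.
        for (Map.Entry<String, Integer> entry : counts.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break; // past the last possible match
            }
            matches.put(entry.getKey(), entry.getValue());
        }

        return matches;
    }
}
```

Starting at tailMap(prefix, true) and stopping at the first non-match avoids scanning every location when only a few share the prefix.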

Miscellaneous Features

The following miscellaneous features may also be implemented:

Points Description
5 Graceful Shutdown: Allow an administrator to trigger a graceful shutdown of your search engine without calling System.exit().
5 Search Statistics: Display the total number of results along with the time it took to calculate and fetch those results, and display the score and number of matches per search result listed.
5 Server Statistics: As a footer on every page, display the server uptime (i.e. time since the server was started), total number of words stored, total number of locations stored, and total number of queries conducted. This information can be stored in memory by the server.
5 Quick Search: Add a new button to your search form (in addition to the normal search button) that automatically redirects the user to the first search result instead of listing all of the search results. This is similar to the Google Search “I’m Feeling Lucky” button. Output a warning if there are no search results.
5 Reverse Sort Order: Allow the user to select an option to reverse the sort order of the search results using a checkbox on the search form.
5 Partial/Exact Search Toggle: Allow the user to toggle on/off partial versus exact search using a checkbox on the search form.
5 Web Framework: Design a search engine using any popular CSS/style framework to create a consistent style for all the web pages. For example, consider using Bulma, Bootstrap (Twitter), Pure.css, Material (Google), Semantic UI, and many more.
5 Search Brand: Design a search engine with a distinct brand, logo, and tagline. This includes creating a logo and tagline, and including it on all of the web pages. Do not use unlicensed unattributed media on your website.
5 Light/Dark Mode Toggle: Allow users to toggle between light mode (light colored background with dark text) and dark mode (dark colored background with light text) styles for your website.

There are 45 points possible in this category.
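For Server Statistics, the uptime shown in the footer can be derived from a start time recorded when the server launches. The sketch below is one possible approach; Uptime is a hypothetical class, and the "d h m s" output format is an arbitrary choice.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical uptime tracker for a Server Statistics footer. */
class Uptime {
    private final Instant started;

    /** Records when the server was started. */
    Uptime(Instant started) {
        this.started = started;
    }

    /** Formats the time between server start and the given instant. */
    String format(Instant now) {
        long seconds = Duration.between(started, now).getSeconds();
        return String.format("%dd %02dh %02dm %02ds",
            seconds / 86400,
            (seconds % 86400) / 3600,
            (seconds % 3600) / 60,
            seconds % 60);
    }
}
```

A server would typically create one Uptime with Instant.now() at startup and call format(Instant.now()) each time a page is generated.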

Grading

The final search engine project is graded differently from the previous projects.

If your project is eligible, it will be graded in a final code review with the instructor during finals week. It is possible to lose points due to poor code design, and possible to earn extra credit by completing extra features.

See below for details.

Eligibility

To be eligible for the Project 4 Review grade (associated with the final search engine project), you must meet the following criteria:

  • You must have a non-zero grade for the Project 2+3 Design assignment in Canvas.
  • You must have a non-zero grade for the Project 4 Tests assignment in Canvas and your code must still pass the associated tests.
  • Your code must additionally pass the code review checks.
  • You must attend a final code review with the instructor during finals week.

You must also complete the core functionality to be eligible for extra features. For example, you cannot earn points for using a web framework if you have not fully implemented core functionality.

Potential Deductions

It is possible to lose points earned for extra features if your implementations have any of the following issues:

Points Description
-10 Multi-User Support: Deducted if your code does not support multi-user search. Users conducting search simultaneously should see results relevant to their own queries only.
-5 Thread Safety: Deducted if your code is not thread-safe. In-memory data accessed by different threads should be properly protected.
-5 Cross-Site Scripting (XSS) Vulnerabilities: Deducted if your code does not protect against cross-site scripting (XSS) attacks. Escape or sanitize any data from a user (either via the HTTP request or a database) prior to using it on an HTML page.
-5 SQL Injection Vulnerabilities: Deducted if your code does not protect against SQL injection attacks. Use prepared statements where appropriate any time your code accesses a database.
-5 Excessive String Concatenation: Deducted if your code uses excessive String concatenation (within a loop) to generate any of the HTML output.
-5 Poor Encapsulation: Deducted if your code breaks encapsulation.
-5 Poor Code Style: Deducted if your code is not professional. Use professional formatting, variable names, Javadoc, exception handling, and address all compiler warnings.

These deductions will only come out of the points earned for extra features—they will not impact points earned for core functionality.

While there are many ways to lose points, the total possible deduction is capped such that no more than 20 points total will be removed from your project grade due to the above issues.

Extra Credit

You may complete additional extra features as extra credit. You can earn up to 120% on this project assignment to earn back points lost due to late deductions.

You cannot earn over 100% in any grade category. This extra credit can only help make up for points lost within the project category due to late deductions.

Getting Started

The following sections may be useful for getting started on this project.

The same homework assignments useful for Project 4a Web Crawler are useful for this project.

The following lecture content may be useful for this project:

  • The Jetty and servlets lecture code illustrates how to use Jetty and servlets. This includes the Servlet Basics, Servlet Data, and Sessions sections. The ReverseServer example is a good starting point.

  • The Databases and JDBC lecture code illustrates how to use a database with servlets, which is useful for some of the extra features.

You can use and modify lecture code as necessary for this project. However, make sure you understand the concepts before using the code.

You should NOT wait until you have covered all of the associated lecture content to start the project. You should develop the project iteratively as you progress throughout the semester, integrating concepts one at a time into your project code.

Suggestions

Your goal should be to get to testable code as quickly as possible first, and then develop iteratively to pass the functionality tests.

Some hints that may be helpful include:

  • Start with using GET requests for basic search functionality.

  • POST requests are usually only useful for user tracking features, and it is possible to implement those features using only GET as well.

  • To ensure multi-user support, avoid static and instance members for storing anything related to search queries and results in your servlets.

  • For visited and favorite results, modify the search result links to direct back to your search engine, so that it may first store that the link was clicked on and then redirect as necessary.

  • For crawl metadata, modify the web crawler to store more information per crawl (instead of just which unique URLs have been crawled).

  • For graceful shutdown, you will need to create a special servlet combined with the ShutdownHandler in Jetty.
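The hint about visited and favorite results can be sketched as a pair of helpers that route a click through the search engine. The /visit endpoint, the url parameter name, and the VisitLinks class are all hypothetical. A servlet behind that endpoint would record the click and then send a redirect, ideally only to URLs it already knows from the index.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for routing result clicks through the search engine. */
class VisitLinks {
    /** Wraps a result URL so a click goes to a (hypothetical) /visit endpoint first. */
    static String wrap(String resultUrl) {
        return "/visit?url=" + URLEncoder.encode(resultUrl, StandardCharsets.UTF_8);
    }

    /** Recovers the original URL from the still-encoded value of the url parameter. */
    static String unwrap(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }
}
```

Request parameters read through the servlet API are normally decoded already, so unwrap is only needed when working with a raw query string.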

It is important to get started early so you have plenty of time to think about how you want to approach the project and start coding iteratively. Planning to complete the code in too large of a chunk is a recipe to get stuck and fall behind!

These hints may or may not be useful depending on your approach. Do not be overly concerned if you do not find them helpful for this project.