The WebCrawler is a tool for searching for Web documents. It constructs a
database by traversing the Internet using a Web Robot and then
indexing the full text with a simple filtering mechanism. The search engine
processes each user request by evaluating each document against the
keywords to compute a weighted sum, then returns a sorted list of matching
documents.
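The request processing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source does not say how the weights are chosen, so the sketch assumes each keyword contributes its weight times the number of times it occurs in the document. The names `score`, `rank`, `weights`, and `limit` are hypothetical and not taken from WebCrawler itself.

```python
from collections import Counter


def score(doc_words, keywords, weights=None):
    """Weighted sum over the keywords (assumed form of the weighting).

    Each keyword contributes weight * occurrences in the document.
    Keywords without an explicit weight default to 1.0.
    """
    counts = Counter(doc_words)
    weights = weights or {}
    return sum(weights.get(k, 1.0) * counts[k] for k in keywords)


def rank(docs, keywords, weights=None, limit=10):
    """Return up to `limit` URLs of matching documents, best score first.

    `docs` maps a URL to the list of indexed words for that document.
    Documents that score zero are treated as non-matching.
    """
    scored = [(score(words, keywords, weights), url) for url, words in docs.items()]
    hits = [(s, url) for s, url in scored if s > 0]
    # Sort by descending score; break ties by URL so the order is stable.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in hits[:limit]]


if __name__ == "__main__":
    docs = {
        "a": ["web", "robot", "robot"],
        "b": ["robot"],
        "c": ["cat"],
    }
    print(rank(docs, ["robot"]))
```

Scoring every document on each request, as the description suggests, is simple but scales linearly with the size of the database.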
The WebCrawler was recently acquired by America Online, which has promised to
support it as an Internet service without censoring its content.
Key Links
URL for Front Page: http://webcrawler.com/WebCrawler/Home.html
URL for Forms Search Page: http://webcrawler.com/
URL for Non-Forms Search Page: http://webcrawler.com/cgi-bin/WebQuery
URL for FAQ Page: http://webcrawler.com/WebCrawler/FAQ.html
URL for Help Page: http://webcrawler.com/WebCrawler/Examples.html
URL for Author's Page: http://www.cs.washington.edu/homes/bp/bp.html
Home Organization: America Online, Inc.
Organization
- WebCrawler is an exclusively searchable database of Web documents,
built on a custom software engine written by the author in C.
- Features and Limitations:
- Supports simple Boolean OR (by default) or Boolean AND (by
clicking the Forms checkbox) across multiple keywords, but doesn't
handle Boolean NOT, complex Boolean combinations, or proximity
searching.
- The database builds its index by splitting text into words at space
and punctuation boundaries, converting them to lowercase, and
stripping common suffixes such as -s, -er, and
-ment. It also filters out common words such as web,
Internet, be, and, and or.
- The server weights hits by the quality of the match between keywords
and documents, then returns the highest-ranking documents in sorted
order. The user chooses the number of hits from fixed amounts (10,
25, 100, or 500).
- The engine indexes and searches across filenames, document titles,
and full textual content.
- WebCrawler provides both Forms and Non-forms interfaces to the search
engine; however, Forms support is required for most of the search features.
- The information catalogued by WebCrawler has no specific focus or
content restrictions.
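The indexing and Boolean matching steps listed above can be illustrated with a minimal sketch. It uses only the suffixes and common words named in the list; everything else is an assumption made for illustration, including the minimum stem length, the repeated suffix stripping, filtering common words before stripping, and the hypothetical names `strip_suffixes`, `index_terms`, and `matches`. The real engine was written in C and its exact rules are not given here.

```python
import re

# Common words and suffixes named in the description; the real lists
# are presumably longer.
STOP_WORDS = {"web", "internet", "be", "and", "or"}
SUFFIXES = ("ment", "er", "s")
MIN_STEM = 3  # assumption: never strip a word below three characters


def strip_suffixes(word):
    """Repeatedly remove a listed suffix while a long enough stem remains."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


def index_terms(text):
    """Split on space and punctuation, lowercase, drop common words, strip suffixes."""
    terms = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()
        if word in STOP_WORDS:
            continue
        terms.add(strip_suffixes(word))
    return terms


def matches(doc_terms, query, require_all=False):
    """Boolean OR by default; Boolean AND when require_all is True."""
    keys = index_terms(query)
    if not keys:
        return False
    if require_all:
        return keys <= doc_terms  # every keyword must appear
    return bool(keys & doc_terms)  # any keyword may appear


if __name__ == "__main__":
    terms = index_terms("Web Crawlers and document management")
    print(sorted(terms))
    print(matches(terms, "crawler robots"))
    print(matches(terms, "crawler robots", require_all=True))
```

Running the query through the same normalization as the documents is what lets "crawler" match a page that only contains "Crawlers".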
Administration
- Document information is gathered automatically by a custom Web searcher
and from user-suggested URLs.
- Average response time for basic access is about 5 seconds, and searches
return within 30 seconds. However, during peak usage hours (10 am to
4 pm weekdays) you may be refused service because of processor load.
- The server runs on a Pentium computer under NextStep, and the
document-gathering engine operates from a second, similar machine.
- The WebCrawler index currently contains information on over 100,000
documents, and new links are gathered approximately once a week.
- The layout and organization of the server are very simple, and the
information provided is quite helpful. The flexibility of the search
engine (smart truncation, etc.), the simplicity of the search page,
and the formatting of the search results make the server ideal for
both new and experienced users.
- Additional Services
- The help page demonstrates sample queries, with suggestions for
improving search quality, and a description of the indexing process.
- The server maintains a list of the Top 25 URLs linked from other
documents. This list reflects not the actual traffic to a particular
document but the number of hotlists and index pages that include a
pointer to it.
- The server allows users to suggest documents for inclusion in
the search database.
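A list like the Top 25 described above can be derived by counting, for each URL, how many other pages point to it. The following is a minimal sketch, not WebCrawler's actual method: the `outlinks` mapping, the `top_linked` name, and the choices to count each referring page once and to ignore self-links are all assumptions.

```python
from collections import Counter


def top_linked(outlinks, n=25):
    """Return the n most linked-to URLs as (url, inbound_count) pairs.

    `outlinks` maps each page URL to the list of URLs it links to.
    """
    inbound = Counter()
    for page, targets in outlinks.items():
        # Count each referring page once per target and skip self-links.
        for target in set(targets):
            if target != page:
                inbound[target] += 1
    # Highest count first; ties are broken alphabetically by URL.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    outlinks = {
        "a": ["x", "x", "y"],
        "b": ["x"],
        "x": ["x"],
    }
    print(top_linked(outlinks, n=2))
```

As the description notes, this measures how often a page is referenced, not how often it is visited.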
Evaluation
Example Usage
Demonstrate a sample search session explicitly listing:
- Topic
- Keyword(s), Boolean search controls
- Documents delivered
- Output format
fprefect@umich.edu - 6/12/95