Web Matrix: Terminology Reference
Common Terms for Evaluating Internet Indices
This document describes some of the descriptive vocabulary used in the
evaluation of Web indices. It is not intended as a basic glossary
for general Internet or World Wide Web concepts; for such information, you
may look at the Internet
Reference collection for relevant descriptions and tutorials.
- Boolean Search
- A technique for finding documents that include or exclude data based
on multiple criteria. By combining or restricting search keywords using
Boolean operators and, or, and not, you can specify
simple or complex formulas. Boolean searching provides a standard
command interface for extracting matching data from a database.
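
As a rough illustration of the entry above, the following Python sketch applies and / or / not criteria to a tiny in-memory collection. The function name, its parameters, and the sample documents are all hypothetical, and matching is done on whole lowercase words only; a real index would use its own query syntax.

```python
def boolean_search(docs, all_of=(), any_of=(), none_of=()):
    """Return names of documents that satisfy AND / OR / NOT keyword criteria."""
    hits = []
    for name, text in docs.items():
        words = set(text.lower().split())
        # AND: every required term must be present.
        if not all(term in words for term in all_of):
            continue
        # OR: if alternatives are given, at least one must be present.
        if any_of and not any(term in words for term in any_of):
            continue
        # NOT: no excluded term may be present.
        if any(term in words for term in none_of):
            continue
        hits.append(name)
    return sorted(hits)


# Hypothetical three-document collection, for illustration only.
DOCS = {
    "a.html": "web index of music archives",
    "b.html": "web index of movie reviews",
    "c.html": "music reviews and concert listings",
}
```

For example, boolean_search(DOCS, all_of=("web", "index"), none_of=("movie",)) corresponds to the formula "web and index and not movie".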
- Dead Links
- Because the Web consists of documents authored and located at numerous
network locations, the level of document support and maintenance varies.
Some collections are regularly used and have incorrect information
corrected promptly; others may not be updated for weeks or months. As
machine names change, user documents move, or network connections fail,
document links
become out of date. A dead link is a URL that leads to no existing
document, indicated by the message "404 Error".
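
One way to look for dead links is to request each URL and inspect the status code. The sketch below uses Python's standard urllib. The function names are hypothetical, and treating an unreachable server the same as a 404 is an assumption made for this example, not a rule from the entry above.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # Name lookup failed, connection refused, timed out, and so on.
        return None


def find_dead_links(urls, get_status=fetch_status):
    """Return the URLs that answer 404 or cannot be reached at all (assumption)."""
    return [url for url in urls if get_status(url) in (404, None)]
```

The get_status parameter lets you substitute a lookup table for the network call when trying the function out offline.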
- Engine
- See Search Engine
- Forms Search
- Web pages and current browsers now support an interactive mechanism
called HTML Forms, which allows users to enter complex sets of
information and request services from the Web server based on that
information. A Forms-based input page that calls a remote
Search Engine creates a
powerful feedback tool.
- Front Page
- The document that a company or organization uses to establish its
Internet presence, often leading to other sources of online information
related to that organization's purpose. Such a page is different from
an internal homepage, which is accessed by users or members of that
organization -- a hotlist of relevant, but not public, links or
information.
- Gathering
- The administrators of subject catalogs and searchable databases must
not only maintain their collection of URLs, but should continue to
find more Internet documents to expand their collection. The process
of gathering URLs can be done manually (by serendipity and scanning
the What's New lists) or by running
automated software (such as Web Spiders)
that returns a list of "discovered" resources.
- Hotlist
- As users explore the net, they build up a list of URLs and links that
they want to remember. Typical links in these hotlists include
entertainment pages, Internet reference documents, or the homepages
of their friends. Often a user will make a hotlist available to the
Web public, for reference or easy access. Organizations may also keep
hotlists, but such pages slowly grow to the size and complexity of a
Subject Index.
- ISINDEX Search
- See Non-Forms Search
- Keyword
- When searching for information inside a database collection, a user needs
to tell the computer how to identify the desired data. The user enters
a word or phrase relevant to the information being sought, and the
database software examines each record for a match. Such matches, called
"hits", are selected because they contain the entered word or phrase.
Keyword searching can be improved by combining with other techniques:
Boolean Searching,
Proximity Searching, or
Vocabulary Control.
- Load Balancing
- Popular Web services can become too busy to run from a single computer,
and administrators may choose to distribute the document collection
and processing across several networked computers. To reduce the
Server Load that numerous users
place on critical resources, the server may be configured to perform
automatic balancing between available computers. By passing off requests
to alternating machines, the server can improve response time (often
transparently) by significant amounts.
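
The "alternating machines" idea above is often called round-robin balancing. A minimal Python sketch follows; the class name and the machine names are hypothetical, and a real balancer would also have to account for machines that are down or unevenly loaded.

```python
import itertools


class RoundRobinBalancer:
    """Hand each incoming request to the next machine in a fixed rotation."""

    def __init__(self, machines):
        machines = list(machines)
        if not machines:
            raise ValueError("at least one machine is required")
        self._cycle = itertools.cycle(machines)

    def route(self, request):
        """Return (machine, request) for the machine chosen to handle it."""
        return next(self._cycle), request
```

With two machines, successive calls to route() alternate between them, so each one sees roughly half of the requests.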
- Mirrors
- When a popular server becomes too busy to support the
Server Load, other sites may volunteer
to run a copy of the same program and database on their own computers.
By duplicating, or mirroring, the original server's data, the new
site can serve local users much faster and reduce the load on the
first computer.
- Non-Forms Search
- A simple search interface that takes a single keyword and processes it
using server software to generate an output document (e.g., entering
a word and getting back a dictionary entry). This type of search has
been superseded by HTML Forms, which allow
complex criteria to be passed to the server and return richer
information.
- Page
- In the context of this collection, a page is any document that is
available for browsing. Some documents provide key access or information
about a service or organization on special pages, called "public pages"
or "homepages".
- Proximity Search
- Another technique for improving the quality of keyword searching, a
proximity search lets you identify documents with certain phrases or
word combinations. Such tools let you specify multiple words that must
occur close to one another, which gives a better chance of a relevant
match than two words that may be located anywhere in a single document.
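
A proximity test can be sketched by comparing word positions. In the Python example below, the function name and the max_distance parameter are hypothetical, distance is counted in words, and punctuation is not stripped; actual indices define "near" in their own ways.

```python
def within_proximity(text, first, second, max_distance):
    """True if `first` and `second` occur within max_distance words of each other."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


# Hypothetical sample sentence, for illustration only.
SAMPLE = "the search engine builds an index of every document on the server"
```

Here "search" and "engine" are adjacent, while "search" and "server" are ten words apart, so a small max_distance accepts the first pair and rejects the second.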
- Regular Expression
- Regular expressions offer a way to search documents using pattern
matching. Such tools let you mix substrings, wildcards, and repetitive
sequences to create a complex key to search against, resulting in a
powerful and specific set of matches. Regular expressions are not
designed for searching by content, but are useful for finding a specific
set of files or very specific data strings.
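
As a small example of finding "a specific set of files", the Python sketch below uses the standard re module. The pattern and the file names are made up for illustration, and regular expression syntax differs somewhat between tools.

```python
import re

# A fixed prefix, exactly four digits, then one of two extensions.
REPORT_PATTERN = re.compile(r"^report-\d{4}\.(txt|html)$")


def matching_files(names, pattern=REPORT_PATTERN):
    """Return the names that match the pattern, in their original order."""
    return [name for name in names if pattern.search(name)]


# Hypothetical file names.
NAMES = ["report-1995.txt", "report-95.txt", "report-1994.html", "summary-1995.txt"]
```

The pattern accepts "report-1995.txt" and "report-1994.html" but rejects "report-95.txt" (too few digits) and "summary-1995.txt" (wrong prefix).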
- Root/Suffix Management
- Some search engines are robust enough to recognize and reduce inflected
words such as "dogs" or "running" to the appropriate root words "dog"
and "run". This makes searching for such words much easier because it
is not necessary to consider every variant of that word when trying
to find it.
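
The two examples above can be handled by a very crude suffix stripper, sketched here in Python. The function name and the rules are hypothetical and deliberately simplistic: they mishandle many English words (for instance "falling" becomes "fal"), so this only hints at what a real engine might do.

```python
def naive_root(word):
    """Strip a trailing "ing" or plural "s" from a word (crude illustration only)."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word
```

Applying naive_root to both the indexed words and the user's keyword lets "dogs" match "dog" and "running" match "run".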
- Server Load
- The amount of work, such as networking or database searching, that a
Web server is performing at any given time. A server with a high load
will not respond to user requests quickly, or may not work reliably.
A site may choose to replace the server with a faster computer, or
purchase a second computer to share the processing load and improve
performance.
- Search Engine
- The software on a Web server that builds a match set by applying user
criteria to a database of documents. The speed of the engine is based
on the size of the collection and the complexity of the search, as well
as how the software is written. Custom software written in C is typically
much faster than software scripted in Perl or csh.
- Searchable Index
- A Web server that lets you find documents in its collection by
entering a keyword or other criteria,
and that returns a set of documents relating to the input in some way.
Many of the popular Internet services are searchable indices, and
many others support at least some form of searching. A document set
returned from a searchable index is characteristically filled with
accidental hits (or false drops), documents that match the user's
criteria but don't really contain the desired information.
- Subject Index
- A Subject Index or Subject Catalog is a service that organizes linked
documents by their content and subject matter. Organized into general
categories (or alphabetically) at the top level, documents are layered
hierarchically and collected with related pages. Although any subject
index may be smaller than a searchable one, it is generally a much more
reliable tool.
- Vocabulary Control
- Selecting a suitable keyword for a search is often difficult, especially
when there are several related terms that have similar meanings. Unless
the criteria reflect every relevant word in the language, matching
documents may be missed in a search. Vocabulary Control is the
practice of establishing a standard set of keywords and identifying
documents by these keywords, to improve the user's chances of finding
every relevant document.
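
One simple way to picture vocabulary control is a table that maps related words onto a single standard keyword before the lookup. In the Python sketch below, the term table, the index contents, and the function names are all invented for illustration.

```python
# Hypothetical mapping from everyday words to the standard (controlled) keyword.
CONTROLLED_TERMS = {
    "film": "movies",
    "films": "movies",
    "cinema": "movies",
    "movies": "movies",
    "auto": "automobiles",
    "car": "automobiles",
    "cars": "automobiles",
    "automobiles": "automobiles",
}

# Hypothetical index: documents are filed under controlled keywords only.
INDEX = {
    "movies": ["reviews.html", "showtimes.html"],
    "automobiles": ["dealers.html"],
}


def normalize(keyword):
    """Map a user's keyword to its controlled term, if one is defined."""
    key = keyword.strip().lower()
    return CONTROLLED_TERMS.get(key, key)


def search_by_subject(index, keyword):
    """Look up documents under the controlled term for the given keyword."""
    return sorted(index.get(normalize(keyword), []))
```

A user who enters "Cinema" and one who enters "film" are both directed to the documents filed under "movies".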
- Web Crawlers,
Spiders,
Worms
- Each of these terms is used to describe software that automatically
downloads and catalogs Internet documents. By reading each document it
discovers, the software builds a list of additional pages to visit.
This is a popular method for creating a database suitable for a
Searchable Index.
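
The read-a-document, collect-its-links loop described above can be sketched in Python as a breadth-first traversal. The names crawl, LinkExtractor, fetch, and max_pages are hypothetical; fetch is assumed to be any function that returns a page's HTML text, or None when the page cannot be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from start_url; return the URLs read, in order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to read
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing a dictionary's get method as fetch lets you try the traversal on a small in-memory set of pages before pointing it at a network.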
- What's New? Lists
- As more document collections and cool homepages come online, Web
servers and other organizations regularly compile them into lists.
By scanning a few What's New? lists, you can keep track of the
latest and greatest pages on the Web -- and beat your friends to
them!
fprefect@umich.edu - 6/12/95