Proposed Applications of IV Techniques to the Web
Abstract | Introduction | IV Techniques | Applications | Proposals | References
I'd like to restrict this discussion to the use of Information Visualization techniques to
display various kinds of information obtained from the WWW. There is a temptation to
get bogged down in Information Retrieval techniques. Information Retrieval is a related
and very relevant subject, but it is also a lot bigger than I'd like to get into. Instead, I
prefer to focus on the kinds of information we can already obtain from the WWW, and
consider ways we can visualize that already-existing data more efficiently.
I believe there are four questions we must ask about how we intend to visualize web data.
- What kind of data do we want to visualize?
- What kind of visualization techniques do we want to use?
- How do we want to implement the visualization front end? That is, what platform,
software, libraries, etc. do we want to use for drawing the displays?
- How do we want to implement the back end? That is, how do we want to collect the
information that we will then visualize?
What kind of data to visualize?
I'd like to think about sorting and filtering WWW data based on the following kinds of
characteristics:
- Source. We can obtain this information from the domain identifier, for example,
.com or .edu.
- Size. Some search engines, such as AltaVista, return this information as a part of
each search result.
- Site. Again, we can obtain this information from the domain name, for example,
cse.ucsc.edu or gvutech.edu.
- Type of page. There are many types of pages. I tend to divide the pages I have seen
into the following types:
- Personal pages.
- Index pages.
- Informational pages.
- Search engines.
Whether we can actually identify the type of page and use this kind of information
remains to be seen.
- Type of information. Presuming that a particular page is informational, there are still
many kinds of information that it can present. I have run across the following kinds of
informational pages:
- Bibliographies.
- Articles.
- Pictures.
- News articles and announcements.
Again, whether we can identify the type of information is an elusive question.
- Link structure. If we analyze each document as we retrieve it, we can determine
what links exist between each of the documents. This allows us to create some kind of
cluster diagram. When used in conjunction with the web site information, this can
help us determine not only the relative importance of specific web pages, but also the
relative importance of specific web sites.
- Date. We might want to look for web pages within a certain time range (for example,
when looking for a news article). We might also want to discard older information
because we assume it is outdated or inaccurate. I'm not sure how to determine the
dates for web pages, but I can think of several possibilities:
- See if the page itself has a date embedded in it somewhere. Many web pages have
signatures that read "this page created on such-and-such a date," or some other
identifier.
- Web search services, such as AltaVista, return the date they last scanned the page
into their archives.
- See if it is possible for the web server to return the actual date on the file. This
might depend on the server hardware platform, software platform, and security
settings.
- Language. We might want to filter out web pages written in languages that we do not
understand. It ought to be easy to distinguish Western European from Eastern
European from East Asian languages based on a quick syntactic analysis. (Eastern
European languages probably use more 8-bit characters for the purpose of including
diacritical marks. East Asian languages use 2-byte characters that ought to be easy to
spot.) Any further distinctions probably require a more extensive semantic analysis
that might not be worth the cost, especially when the following characteristic,
geographical location, might help determine the language.
- Geographical location. We might want to filter out web pages from particular
countries, perhaps in an attempt to filter out languages we do not understand. We can
sometimes do this based on the domain identifier, such as .uk (for the United
Kingdom), .de (for Germany), and .se (for Sweden).
- Outline. If we analyze the contents of a web page, we can easily extract the structure
of the page itself from its headings and links. I believe this can be a very powerful
feature.
- Keywords. If, in addition, we can extract some kind of information about keywords,
we can use this data to provide even more dimensions around which to visualize the
web information. HTML does provide for the insertion of keywords and other
identifying information as META elements embedded in web pages. We can also use some
kind of semantic analysis to determine keywords when the web page author has not
explicitly identified them. Marti Hearst does this for straight text documents with
TileBars. Her technique does not apply to web pages, though, and it is beyond the
scope of this paper, reaching into the vastness of the Information Retrieval
literature. Web page headings, however, help to structure web articles and lend
greater importance to particular words. This might make a quick semantic analysis
more feasible.
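Several of the characteristics above (source, site, geographical location, outline, keywords, and link structure) can be read straight off a URL or a page's markup. The following is only a rough sketch in Python of how that might look, not a finished tool. The names PageScanner, describe_url, and site_in_degree are hypothetical. I am assuming that a two-letter top-level domain is a country code, that author-supplied keywords arrive in a META element named "keywords", and that counting links arriving from other sites is an acceptable first stand-in for "relative importance"; none of these assumptions has been tested against real web data.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def describe_url(url):
    """Derives site, source, and country from the host name alone."""
    host = urlparse(url).hostname or ""
    tld = host.split(".")[-1] if host else ""
    return {
        "site": host,
        # Assumption: longer identifiers (com, edu) describe the kind of source,
        # two-letter identifiers (uk, de, se) describe a country.
        "source": tld if len(tld) > 2 else "",
        "country": tld if len(tld) == 2 else "",
    }


class PageScanner(HTMLParser):
    """Collects the outline, outgoing links, and META keywords of one page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outline = []    # (level, heading text) pairs in document order
        self.links = []      # absolute URLs of outgoing links
        self.keywords = []   # keywords supplied by the page author, if any
        self._level = None   # level of the heading currently open, else None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            text = " ".join("".join(self._text).split())
            self.outline.append((self._level, text))
            self._level = None


def site_in_degree(pages):
    """Counts links arriving at each site from pages on other sites.

    `pages` maps a page URL to a PageScanner that has already been fed
    that page's HTML.
    """
    counts = Counter()
    for url, scanner in pages.items():
        origin = urlparse(url).hostname
        for target in scanner.links:
            site = urlparse(target).hostname
            if site and site != origin:
                counts[site] += 1
    return counts
```

To use it, create a PageScanner with the page's URL, call feed() with the page's HTML, and then read outline, links, and keywords. Each of these values becomes one more dimension for sorting and filtering in the display. The type of page and type of information are deliberately left out, since I do not know of a simple rule for identifying them.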
What visualization techniques to use?
Well-suited for this application:
- Fisheye Views
- SeeSoft
- Perspective Wall
- Butterfly
- Hyperbolic Tree Browser
- Starfield displays
- TileBars
- Pad++ and Zooming Web Browser
- RangeSlider
- Query Spreadsheet / Information Crystal
- Dynamic Queries
- Iterative Query Refinement (Scatter/Gather)
- Animation
- Thumbnails as used in DeckView and Web Forager
- Narcissus-type automatic clustering
- Parallel searches as in MetaCrawler
Not well suited for this application:
- Cone trees (too expensive)
- Multiple Views (as implemented)
- Fractal Tree (poor substitute for hyperbolic)
- Variable Zoom
- Data Sphere (too expensive)
- Tree Maps
- Focus Table
- AlphaSlider (I don't believe it)
- Movable Filter (I don't believe it)
- Magic Lens (I don't believe it)
- Zoom Bar (I don't believe it)
How to implement the visualization front end?
That is, what platform, software, libraries, etc. do we want to use for drawing the
displays?
- Macintosh (my favorite, of course)
- SGI (because of its great graphics capabilities)
- Sun (more widely available than SGI, more portable, more powerful than Mac)
- GL or OpenGL (a standard)
- Mosaic is available in a free version or a version that can be licensed for research
purposes (I have some Mosaic source code for Macintosh at home)
How do we want to implement the back end?
That is, how do we want to collect the information that we will then visualize?
The back end must be powerful enough to sort and sift data quickly. It must also have a
fast connection to the internet so it can gather the information quickly.
The Suns here at school are powerful and well-connected, but not everyone has a Sun.
What if the researcher is working at home? What if they are working over a phone line? Then it
might be better to offload the data collection to a back-end machine, and perform only the
graphics rendering on the front-end machine.
MetaCrawler is implemented not as a web browser running on the client machine, but as
a CGI script running on the (back-end) server machine. This allows it to perform its
searches and collations very quickly, returning only the end results to the user. This makes
the most of the server's fast connection to the internet, the server's more
powerful sorting features, and the client browser's familiarity (you can use Netscape),
accessibility (you can use it at home), and graphics (you can use your own Mac or PC or
whatever you have at home).
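The division of labor described above, where the server queries several search services at once, collates the answers, and sends only the merged list to the client, might be sketched as follows. This is my own illustration in Python and not MetaCrawler's actual algorithm. The functions metasearch and collate are hypothetical names, each "engine" is assumed to be a function that takes a query and returns (url, title) pairs, and ranking by how many engines reported a URL is just one simple collation rule I picked for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def collate(result_lists):
    """Merges per-engine result lists into one list without duplicate URLs.

    Returns (url, title, votes) triples, where votes is the number of
    engines that reported the URL. URLs with more votes come first; ties
    keep the order in which the URLs were first seen.
    """
    votes, titles = {}, {}
    for results in result_lists:
        seen = set()  # count each URL at most once per engine
        for url, title in results:
            if url in seen:
                continue
            seen.add(url)
            votes[url] = votes.get(url, 0) + 1
            titles.setdefault(url, title)
    # sorted() is stable, so equal vote counts keep first-seen order.
    ranked = sorted(votes, key=lambda url: -votes[url])
    return [(url, titles[url], votes[url]) for url in ranked]


def metasearch(query, engines):
    """Runs every engine on the query concurrently and collates the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    return collate(result_lists)


# Stand-in engines for illustration only; real ones would query a search
# service over the network and parse its result page.
def engine_a(query):
    return [("http://example.org/a", "A"), ("http://example.org/b", "B")]


def engine_b(query):
    return [("http://example.org/b", "B again"), ("http://example.org/c", "C")]


merged = metasearch("information visualization", [engine_a, engine_b])
```

Because the slow part is waiting on the network, running the engines in parallel on the well-connected server means the user waits roughly as long as the slowest engine rather than the sum of all of them, and only the short merged list has to travel over the phone line.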
This page maintained by Mark Brautigam
Last updated 1 March 1997