Guest Editors: Ricardo Baeza-Yates, Peter Schäuble
Abstract: The President of CEPIS, publisher of UPGRADE, reviews the present and future of our digital magazine.
Information Retrieval and the Web
Presentation - Retrieving Information: A Discipline with a Tradition [HTML] [PDF: 1 page, 833 KB]
(Includes Useful References on Information Retrieval)
Ricardo Baeza-Yates, Peter Schäuble, Guest Editors
Abstract: The guest editors introduce the rationale and contents of this issue and provide a list of useful references for readers who want to know more about Information Retrieval (IR).
Errata: The short professional biography of Ricardo Baeza-Yates has been completed with the following information, which was unintentionally missing from its original version: He is co-author of the book Modern Information Retrieval, Addison-Wesley, 1999.
Information Retrieval for Enterprise Content [PDF: 4 pages, 845 KB]
Prabhakar Raghavan
Abstract: The last decade has witnessed the growth of Information Retrieval from a boutique discipline in Information and Library Science to an everyday experience for billions of people around the world. This revolution has been driven in large measure by the World Wide Web, with vendors focused on Web content management, search and taxonomy for the Web. Simultaneously, enterprises have invested in networking all of their information together to the point where it is increasingly possible for employees to have a single window into all of their enterprise data. We point out that while employees seek “web-like” experiences in the enterprise, the two domains differ fundamentally in the nature of the content, the behaviour of users, and the economic motivations. Our principal focus is on outlining the demands on information retrieval for enterprise content infrastructure.
Information Retrieval on the Web: A New Paradigm [PDF: 3 pages, 843 KB]
Jacques Savoy
Abstract: For the information retrieval (IR) community, the Web now presents a new paradigm, while also generating new challenges and attracting growing interest from around the world. An important example of these challenges is managing huge text collections and evaluating the usefulness of hyperlinks contained within them.
An Analysis of Query Languages for XML [PDF: 13 pages, 902 KB]
Adelaida Delgado, Ricardo Baeza-Yates
Abstract: A query language for XML should be flexible enough to cover the whole range of information sources that can be tagged by XML, including databases and web documents. In this article we present a comparative analysis of several query languages that have been created for XML. We study sets of desirable features both from the point of view of the semistructured data community (which puts special emphasis on very large databases, the integration of heterogeneous sources and the transformation of data into common interchange formats) and from the viewpoint of the information retrieval community (which puts special emphasis on full-text searches, manipulation of result sets, inclusion relationships, distances, and ranking of the result documents).
Methodologies to Develop Web Information Systems and Comparative Analysis [PDF: 12 pages, 887 KB]
María José Escalona, Manuel Mejías, and Jesús Torres
Abstract: New systems are currently being developed whose aims differ from those of systems developed some years ago. Current systems tend to be distributed, with huge storage needs and complex functional requirements. These systems are also delivered over the Internet, so interface, information retrieval, navigation and multimedia are all fundamental aspects. These new systems are known as Web Information Systems (WIS). Because they are complex and require constant maintenance as they continually change to adapt to users’ new requirements, they have to be developed using a standardized process. There is currently no standard methodology accepted by the whole software engineering community. This paper presents the results of a comparative study analysing the different methodologies currently applicable to the development of a Web Information System.
Distributed Information Retrieval from Web-Accessible Digital Libraries using Mobile Agents [PDF: 7 pages, 1,023 KB]
J. Alfredo Sanchez, Sandra Nava, Lourdes Fernández, and Griselda Chevalier
Abstract: MAIDL (Mobile Agents in Digital Libraries) is a framework designed to support distributed information retrieval from a federation of web-accessible, heterogeneous digital libraries. Federation is accomplished through two mechanisms: a metadata harvesting protocol and mobile agents that explore and filter information resources from multiple distributed collections. The main components of the architecture of MAIDL are autonomous nodes where digital libraries reside. The Open Archives Initiative (OAI) protocol is the basis for interoperability among participating nodes. At each node, information requests from users are received by agents that travel to the various nodes in the federation in search of appropriate resources. Each agent may use different information retrieval mechanisms available at each node to locate resources that are relevant to user requests. This paper describes the architecture of MAIDL and the encouraging initial results obtained with a specific federation.
Automatic Extraction of Semantically-Meaningful Information from the Web [PDF: 8 pages, 1,012 KB]
Rafael Corchuelo, José Luis Arjona, and Miguel Toro
Abstract: The Semantic Web will bring meaning to the Internet, making it possible for web agents to understand the information it contains. However, current trends seem to suggest that the Semantic Web is not likely to be adopted in the forthcoming years. In this sense, meaningful information extraction from the web becomes a handicap for web agents. In this article, we present a framework for automatic extraction of semantically meaningful information from the current web. Separating the extraction process from the business logic of an agent enhances modularity, adaptability, and maintainability. Our approach is novel in that it combines different technologies to extract information, surf the web and automatically adapt to web changes.
Ontologies for Database Federation [PDF: 10 pages, 919 KB]
Nieves Brisaboa, Miguel Penabad, Angeles Places, and Francisco Rodríguez
Abstract: This work presents an ontology-based architecture to federate databases on the Web. Using ontologies to federate databases is an interesting and useful way to integrate different database schemas. In this architecture, ontologies are used not only to represent the global schema but also to guide the execution of software modules in the system. Using ontologies gives our system some advantages: (1) It provides logical and physical independence between layers in the system. (2) It increases the scalability of the system. (3) It reduces the changes that have to be made when a new database is added to or dropped from the federated system. (4) It makes the user interface friendly and easy to use, not only for general users but also for expert users in any federated corpus domain. Users will find a flexible, powerful and highly specialised user interface.
System for Compressing and Retrieving Structured Documents [PDF: 8 pages, 873 KB]
Joaquín Adiego, Pablo de la Fuente, Jesús Vegas, and Miguel Villarroel
Abstract: IR systems are used for content-based retrieval of documents. The number of structured document collections is growing because of the use of mark-up standards such as XML, SGML and HTML. For these collections, both the content and the structure of documents must be indexed and retrieved to meet user needs. In order to reduce the storage space needed, we can apply compression techniques. Search and retrieval functions can be supported via an appropriate indexing strategy. This paper presents a technique that encodes and compresses the text considering Spanish language characteristics and other elements which are language independent. We have developed a tool based on an inverted index in combination with a compression strategy that considers both structural and plain content. The tool attains good compression rates and short search times, and allows the text to be retrieved as it appears in the original documents.
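As a rough illustration of the general idea of pairing an inverted index with compression, the sketch below builds a term-to-documents index and stores each posting list as variable-byte encoded gaps. This is a generic textbook technique chosen for illustration, not the encoding described in the paper; it ignores document structure and the Spanish-specific modelling the abstract mentions, and all function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers, 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits; more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit set marks the last byte of a number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store gaps between consecutive sorted ids; small gaps need fewer bytes."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the original document ids by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical three-document collection.
docs = ["la casa blanca", "la casa roja", "el coche rojo"]
index = build_index(docs)
packed = compress_postings(index["casa"])
print(index["casa"], len(packed), decompress_postings(packed))
```

Gap encoding pays off on long posting lists, where consecutive document ids are close together and most gaps fit in a single byte.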
TEXRET: An Interactive TEXture RETrieval System [PDF: 8 pages, 1,477 KB]
Javier Ruiz del Solar, Pablo Navarrete, and Patricio Parada
Abstract: Although texture perception plays an important role in human vision – and textures are becoming increasingly important in multimedia communication, design and marketing – people are unable to describe textures precisely. The TEXRET system was developed to address the problem of texture retrieval from image databases. TEXRET (TEXture RETrieval) uses soft-computing technologies to allow interactive communication with the user. TEXRET’s main features are: (i) direct access from the Internet, (ii) high interactivity, (iii) texture retrieval using human-like or fuzzy descriptions of the textures, (iv) content-based texture retrieval using user feedback, and (v) generation of the requested textures when these are not found in the database, which allows the database to grow. TEXRET is described in this article.
The CLEF Campaigns: Evaluation of Cross-Language Information Retrieval Systems [PDF: 4 pages, 866 KB]
Martin Braschler, Carol Peters
Abstract: Cross-Language Information Retrieval (CLIR) systems that enable users to find and retrieve relevant documents regardless of their language are becoming increasingly important in the global information society. Evaluation campaigns play a significant role in the development of such systems, allowing researchers to assess and compare approaches and performance. We report on the success of the Cross-Language Evaluation Forum (CLEF) in its first two years of activity.
The Web of Spain [PDF: 3 pages, 193 KB]
Ricardo Baeza-Yates
Abstract: The Web of Spain is a good example of a European Web in which the country’s own Internet domain (.es) is underutilised due to strict domain registration rules, which are similar in other countries. In this paper, partially written while visiting the Dept. of Technology, Universidad Pompeu Fabra (Barcelona, Catalonia, Spain), we include information on the domain types, structure and content of the Web of Spain, defined as all the web sites having IP numbers assigned to Spain.
Peter Schäuble is CEO of Eurospider Information Technology AG, the leading Swiss expert in Information Retrieval, which provides software for News Monitoring and Corporate Retrieval <http://www.eurospider.com>. Prior to this position, he was Assistant Professor of Computer Science at the Swiss Federal Institute of Technology (ETH) Zurich, where he headed the Information Retrieval research group. Peter Schäuble has an M.S. (Dipl. Math. ETH) in mathematics and a PhD (Dr. sc. techn.) in computer science, both from ETH. He has been a technical staff member of the European Space Agency (ESA) and a visiting scientist at Hewlett-Packard Laboratories in Palo Alto. He has published various research papers and books on Information Retrieval. <Peter.Schauble@eurospider.com>
Information Retrieval (IR) is often associated with search engines and the Internet; however, it evolved from an academic discipline whose roots go back to the fifties. During the first decades, research activities usually took place in a Computer Science department, and simple approaches based on occurrence statistics were shown to be surprisingly effective in retrieving relevant documents. Nevertheless, a small number of Information Retrieval research groups achieved important results in three respects:
1. Theory: Probabilistic retrieval models were developed that imply optimal retrieval effectiveness (see publications by Cooper, Robertson, and others). Later, retrieval was extended to other media, not only text.
2. Systems: Various algorithms and data structures were invented and integrated in practical text retrieval systems (e.g. SMART, Topic, and the Inquiry system) and, more recently, in multimedia retrieval systems.
3. Evaluation: Test collections were built consisting of documents, queries and, most importantly, relevance assessments that determine which documents are relevant to which queries. These test collections facilitate the comparison of different retrieval methods with respect to recall and precision (e.g. Cranfield, SMART, TREC collections).
When the Internet started growing, these Information Retrieval building blocks were ready to be used. The large amount of data as well as the federation of the Internet opened space for new and exciting concepts, such as link-based ranking, XML retrieval, heterogeneous data source integration, etc. Some of these concepts are covered by the authors of this special issue on “Information Retrieval and the Web”, who come from several countries.
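The recall and precision measures mentioned under Evaluation can be illustrated with a minimal sketch. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The query id, document ids and function name below are hypothetical and are not taken from any of the test collections named above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by a retrieval method
    relevant:  document ids judged relevant in the test collection
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance assessments for one query.
assessments = {"q1": {"d1", "d3", "d7", "d9"}}
# Hypothetical output of a retrieval method for the same query.
run = {"q1": ["d1", "d2", "d3", "d4", "d5"]}

p, r = precision_recall(run["q1"], assessments["q1"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.40 recall=0.50
```

Two retrieval methods can then be compared by running both against the same queries and assessments and comparing the resulting figures.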
Useful References on Information Retrieval
Collected by Ricardo Baeza-Yates
In addition to the references and sources mentioned in the articles of this issue, interested readers may look at the following books, journals and conference proceedings, as well as some of the many web sites available relevant to Web standards (<http://w3c.org>), search engines (<http://www.searchenginewatch.com>), etc.
Books
Journals
Conferences
Last updated on June 28th, 2002 by Rafael Fernández Calvo and François Louis Nicolet, Co-editors of Upgrade <rfcalvo@ati.es>