Upgrade, the European Online Magazine for the IT Professional

 
 Next issue (August 2002):
XML (eXtensible Markup Language):
The ASCII of the 21st Century?
Upgrade, Vol. III, issue no. 3: cover page by Antonio Crespo Foix, © ATI 2002
Vol. III, Issue no. 3,
June 2002

Information Retrieval and the Web

 Published on behalf of CEPIS
by Novática (ATI, Spain)

Guest Editors: Ricardo Baeza-Yates, Peter Schäuble


Contents
Other editions available
  • Italian, by Tecnoteca / ALSI (abstracts online) 
  • Spanish, by Novática (abstracts and some articles online; full edition printed)

 
CEPIS (Council of European Professional Informatics Societies) promotes Upgrade

EUCIP: European Certification of Informatics Professionals

ATI (Spain) publishes Novática, an Upgrade partner

ALSI (Italy) promotes the Italian edition of Upgrade

Tecnoteca (Italy) promotes the Italian edition of Upgrade

SI (Swiss Informaticians Society) cooperates with Upgrade

Website hosted by
Tiscali, leading European Internet and telecommunications service provider

 



Editorial: UPGRADE grows and matures  [PDF: 1 page, 480 KB]
Prof. Wolffried Stucky, President of CEPIS

Abstract: The President of CEPIS, publisher of UPGRADE, reviews the present and future of our digital magazine.
 

Information Retrieval and the Web
Guest Editors: Ricardo Baeza-Yates, Peter Schäuble

Presentation - Retrieving Information: A Discipline with a Tradition [HTML] [PDF: 1 page, 833 KB]
(Includes Useful References on Information Retrieval)
Ricardo Baeza-Yates, Peter Schäuble, Guest Editors

Abstract: The guest editors introduce rationale and contents for this issue, including also a list of useful references for those interested in knowing more about Information Retrieval (IR)

Errata: The short professional biography of Ricardo Baeza-Yates has been completed with the following information that was unintentionally missing in its original version: He is co-author of the book Modern Information Retrieval, Addison-Wesley, 1999. To see his short professional biography click here.

Information Retrieval for Enterprise Content [PDF: 4 pages, 845 KB]
Prabhakar Raghavan

Abstract: The last decade has witnessed the growth of Information Retrieval from a boutique discipline in Information and Library Science, to an everyday experience for billions of people around the world. This revolution has been driven in large measure by the World Wide Web, with vendors focused on Web content management, search and taxonomy for the Web. Simultaneously, enterprises have invested in networking all of their information together to the point where it is increasingly possible for employees to have a single window into all of their enterprise data. We point out that while employees seek “web-like” experiences in the enterprise, the two domains differ fundamentally in the nature of the content, the behaviour of users, and the economic motivations. Our principal focus is on outlining the demands on information retrieval for enterprise content infrastructure.

Information Retrieval on the Web: A New Paradigm [PDF: 3 pages, 843 KB]
Jacques Savoy

Abstract: For the information retrieval (IR) community, the Web now presents a new paradigm, while also generating new challenges and attracting growing interest from around the world. An important example of these challenges is managing huge text collections and evaluating the usefulness of hyperlinks contained within them.

An Analysis of Query Languages for XML [PDF: 13 pages, 902 KB]
Adelaida Delgado, Ricardo Baeza-Yates

Abstract: A query language for XML should be flexible enough to cover the whole range of information sources that can be tagged by XML, including databases and web documents. In this article we present a comparative analysis of several query languages that have been created for XML. We study sets of desirable features both from the point of view of the semistructured data community, which puts special emphasis on very large databases, the integration of heterogeneous sources and the transformation of data into common interchange formats, and from the viewpoint of the information retrieval community, which puts special emphasis on full-text searches, manipulation of result sets, inclusion relationships, distances, and ranking of the result documents.

Methodologies to develop Web Information Systems and Comparative Analysis [PDF: 12 pages, 887 KB]
María José Escalona, Manuel Mejías, and Jesús Torres

Abstract: New systems are currently being developed which have different aims from the systems developed some years ago. Current systems tend to be distributed, with huge storage needs and complex functional requirements. These systems are also distributed via the Internet, so interface, information retrieval, navigation and multimedia are all fundamental aspects. These new systems are known as Web Information Systems (WIS). Because these systems are complex and require constant maintenance as they continually change to adapt to users’ new requirements, they have to be developed using a standardized process. There is currently no standard methodology accepted by the whole software engineering community. This paper presents the results of a comparative study analysing the different methodologies which are currently applicable to the development of a Web Information System.

Distributed Information Retrieval from Web-Accessible Digital Libraries using Mobile Agents [PDF: 7 pages, 1,023 KB]
J. Alfredo Sanchez, Sandra Nava, Lourdes Fernández, and Griselda Chevalier

Abstract: MAIDL (Mobile Agents in Digital Libraries) is a framework designed to support distributed information retrieval from a federation of web-accessible, heterogeneous digital libraries. Federation is accomplished through two mechanisms: a metadata harvesting protocol and mobile agents that explore and filter information resources from multiple distributed collections. The main components of the architecture of MAIDL are autonomous nodes where digital libraries reside. The Open Archives Initiative (OAI) protocol is the basis for interoperability among participating nodes. At each node, information requests from users are received by agents that travel to the various nodes in the federation in search of appropriate resources. Each agent may use different information retrieval mechanisms available at each node to locate resources that are relevant to user requests. This paper describes the architecture of MAIDL and the encouraging initial results obtained with a specific federation.

Automatic Extraction of Semantically-Meaningful Information from the Web [PDF: 8 pages, 1,012 KB]
Rafael Corchuelo, José Luis Arjona, and Miguel Toro

Abstract: The Semantic Web will bring meaning to the Internet, making it possible for web agents to understand the information it contains. However, current trends seem to suggest that the Semantic Web is not likely to be adopted in the forthcoming years. As a result, extracting meaningful information from the web remains a handicap for web agents. In this article, we present a framework for automatic extraction of semantically meaningful information from the current web. Separating the extraction process from the business logic of an agent enhances modularity, adaptability, and maintainability. Our approach is novel in that it combines different technologies to extract information, surf the web and automatically adapt to web changes.

Ontologies for Database Federation [PDF: 10 pages, 919 KB]
Nieves Brisaboa, Miguel Penabad, Angeles Places, and Francisco Rodríguez

Abstract: This work presents an ontology-based architecture to federate databases on the Web. The use of ontologies to federate databases gives an interesting and useful way to integrate different database schemas. In this architecture, ontologies are used not only to represent the global schema but also to guide the execution of software modules in the system. Using ontologies gives our system some advantages: (1) It provides logical and physical independence between layers in the system. (2) It increases the scalability of the system. (3) It reduces the changes that have to be made when a new database is added to or dropped from the federated system. (4) It makes the user interface friendly and easy to use, not only for general users but also for expert users in any federated corpus domain. Users will find a flexible, powerful and highly specialised user interface.

System for Compressing and Retrieving Structured Documents [PDF: 8 pages, 873 KB]
Joaquín Adiego, Pablo de la Fuente, Jesús Vegas, and Miguel Villarroel

Abstract: IR systems are used for content-based retrieval of documents. The number of structured document collections is growing because of the use of mark-up standards such as XML, SGML and HTML. For these collections, both the content and the structure of documents must be indexed and retrieved to meet user needs. In order to reduce the storage space needed, we can apply compression techniques. Search and retrieval functions can be supported via an appropriate indexing strategy. This paper presents a technique that encodes and compresses the text considering Spanish language characteristics and other elements which are language independent. We have developed a tool based on an inverted index in combination with a compression strategy that considers both structural and plain content. The tool attains good compression rates and short searching times, and allows the text to be retrieved as it appears in the original documents.

TEXRET: An Interactive TEXture RETrieval System [PDF: 8 pages, 1,477 KB]
Javier Ruiz del Solar, Pablo Navarrete, and Patricio Parada

Abstract: Although texture perception plays an important role in human vision – and textures are becoming increasingly important in multimedia communication, design and marketing – people are unable to describe textures precisely. To address the problem of texture retrieval from image databases, the TEXRET system was developed. TEXRET (TEXture RETrieval) uses soft-computing technologies to allow interactive communication with the user. TEXRET's main features are: (i) direct access from the Internet, (ii) high interactivity, (iii) texture retrieval using human-like or fuzzy descriptions of the textures, (iv) content-based texture retrieval using user feedback, and (v) generation of the requested textures when these are not found in the database, which allows the database to grow. TEXRET is described in this article.

The CLEF Campaigns: Evaluation of Cross-Language Information Retrieval Systems [PDF: 4 pages, 866 KB]
Martin Braschler, Carol Peters

Abstract: Cross-Language Information Retrieval (CLIR) systems that enable users to find and retrieve relevant documents regardless of their language are becoming increasingly important in the global information society. Evaluation campaigns play a significant role in the development of such systems, allowing researchers to assess and compare approaches and performance. We report on the success of the Cross-Language Evaluation Forum (CLEF) in its first two years of activity.

The Web of Spain [PDF: 3 pages, 193 KB]
Ricardo Baeza-Yates

Abstract: The Web of Spain is a good example of a European Web in which its own Internet domain (.es) is underutilised due to strict domain registration rules, which are similar in other countries. In this paper, partially written while visiting the Dept. of Technology, Universidad Pompeu Fabra (Barcelona, Catalonia, Spain), we include information on domain types, structure and content of the Web of Spain, defined as all the web sites having IP numbers assigned to Spain.


The Guest Editors

Ricardo Baeza-Yates holds a Ph.D. in Computer Science (University of Waterloo, Canada) and a Magister in Electrical Engineering from the Universidad de Chile, and is a Computer Science and Electrical Engineer from the same university. He is currently Tenured Professor in the Computer Science Department of the Universidad de Chile, and Director of the Center for Research of the Web <http://www.ciw.cl>. His fields of research are information retrieval, Web mining, algorithms and information visualization. He is co-author of Modern Information Retrieval, Addison-Wesley, 1999; co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992. He has also contributed several papers to journals published by professional organizations such as ACM, ATI, IEEE, and SIAM. Currently he is president of CLEI (Centro Latinoamericano de Estudios en Informática), member of the IEEE Computer Society Board of Governors and international coordinator of an Iberoamerican project on models and techniques for searching the Web financed by the Spanish agency CYTED (Programa de Cooperación Iberoamericano). In 2000 he founded a startup Internet company to search the Chilean web <http://www.todocl.cl>. His personal page is at <http://www.baeza.cl> and he can be reached by e-mail at <rbaeza@dcc.uchile.cl>.

Peter Schäuble is CEO of Eurospider Information Technology AG, the leading Swiss expert in Information Retrieval and a provider of software for News Monitoring and Corporate Retrieval <http://www.eurospider.com>. Prior to this position, he was Assistant Professor of Computer Science at the Swiss Federal Institute of Technology (ETH) Zurich and headed the Information Retrieval research group. Peter Schäuble has an M.S. (Dipl. Math. ETH) in mathematics and a PhD (Dr. sc. techn.) in computer science, both from ETH. He has been a technical staff member of the European Space Agency (ESA) and a visiting scientist at Hewlett-Packard Laboratories in Palo Alto. He has published various research papers and books on Information Retrieval. <Peter.Schauble@eurospider.com>

The English Editors

Mike Andersson, Richard Butchart, David Cash, Arthur Cook, Tracey Darch, Laura Davies, Nick Dunn, Rodney Fennemore, Hilary M. Green, Roger Harris, Michael Hird, Jim Holder, Alasdair MacLeod, Pat Moody, Adam David Moss, Phil Parkin, Brian Robson.


Presentation - Retrieving Information: A Discipline with a Tradition [PDF: 1 page, 833 KB]
(Includes Useful References on Information Retrieval)
Ricardo Baeza-Yates, Peter Schäuble, Guest Editors

Information Retrieval (IR) is often associated with search engines and the Internet; however, it evolved from an academic discipline which has its roots back in the fifties. During the first decades, research activities usually took place in a Computer Science department, and simple approaches based on occurrence statistics were shown to be surprisingly effective in retrieving relevant documents. Nevertheless, a small number of Information Retrieval research groups achieved important results in three respects:

1. Theory: Probabilistic retrieval models were developed that imply optimal retrieval effectiveness (see publications by Cooper, Robertson, and others). Later, retrieval was extended to other media, not only text.

2. Systems: Various algorithms and data structures were invented and integrated in practical text retrieval systems (e.g. SMART, Topic, and Inquiry system) and, more recently, in multimedia retrieval systems.

3. Evaluation: Test collections were built consisting of documents, queries and – most importantly – relevance assessments that determine which documents are relevant to which queries. These test collections facilitate the comparison of different retrieval methods with respect to recall and precision (e.g. Cranfield, SMART, TREC collections).

When the Internet started growing, these Information Retrieval building blocks were ready to be used. The large amount of data as well as the federation of the Internet opened space for new and exciting concepts, such as link-based ranking, XML retrieval, heterogeneous data source integration, etc. Some of these concepts are covered by the authors of this special issue on “Information Retrieval and the Web”, who come from several countries.
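
As an illustration of the Evaluation point above, here is a minimal sketch of how recall and precision can be computed for one query from a set of relevance assessments. It is not taken from any of the test collections named in the text. The function name `precision_recall` and the document identifiers are hypothetical, chosen only for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system (hypothetical data)
    relevant:  document ids judged relevant in the assessments
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were retrieved
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical relevance assessments and system output for one query.
relevant_docs = {"d1", "d3", "d7", "d9"}
system_output = ["d1", "d2", "d3", "d4", "d5"]

p, r = precision_recall(system_output, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.40 recall=0.50"
```

Two of the five retrieved documents are relevant (precision 0.40), and two of the four relevant documents were found (recall 0.50). Comparing retrieval methods on a test collection typically means averaging such figures over many queries.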
 

Useful References on Information Retrieval
Collected by Ricardo Baeza-Yates

In addition to the references and sources mentioned in the articles of this issue, interested readers may look at the following books, journals and conference proceedings, as well as some of the many web sites available relevant to Web standards (<http://w3c.org>), search engines (<http://www.searchenginewatch.com>), etc.

Books


Journals


Conferences


Last updated on June 28th, 2002 by Rafael Fernández Calvo and François Louis Nicolet, Co-editors of Upgrade 
<rfcalvo@ati.es>

Copyright policy:

  1. Copyright © CEPIS 2002. All rights reserved.
  2. Abstracting of the articles included in UPGRADE is permitted with credit to the source. For copying, reprint, or republication permission, write to the editors.
  3. Unauthorized access to pages, or parts thereof,  in this website reserved exclusively for persons authorized by UPGRADE is expressly prohibited. Any unauthorized access may be prosecuted according to the law.