Automized production of bibliographical information of locally stored internet files : a project to establish archives of electronical press services of parties and trade unions / Walter Wimmer - [Electronic ed.]. - Gent : IALHI. - 18 KB, Text.
In: Acta / International Association of Labor History Institutions. - 32. 2001 (2002), S. 45 - 50
Electronic ed.: Bonn : FES Library, 2001

© Friedrich-Ebert-Stiftung


Automized production of bibliographical information of locally stored internet files - a project to establish archives of electronical press services of parties and trade unions

Walter Wimmer

Aim and scope of the project

The project collects press releases that political parties and trade unions offer on the Internet. On the national level, it covers those political parties that are represented in the German Bundestag (German parliament). Special emphasis lies on the German Social Democratic Party (SPD), whose press releases are collected on the national as well as the regional level. Press releases of the SPD parliamentary parties are relevant as well. Press releases of German trade unions are collected down to the regional level.

On the European level, the project archives press releases of the member parties of the Socialist International and of national trade union organisations. The collection also includes press releases of international trade secretariats and of other international organisations of the labour movement.

At present the project comprises approximately 80 press services, which are archived regularly. German organisations are still the most important ones, but European and international organisations are becoming increasingly important for the project.

Archiving of press releases - a sensible addition to conventional collecting activities

The decision to collect press releases of political parties and trade unions systematically rests on several considerations:

  1. It supports the collecting of grey literature of foreign political parties and trade unions, a project which is promoted by the Deutsche Forschungsgemeinschaft (central public funding organisation for academic research in Germany). Until now grey literature was mainly collected on journeys to other countries. The new project reflects the growing importance of the Internet as a source of relevant documents.
  2. So far the library of the Friedrich Ebert Foundation and the Archive of Social Democracy have collected printed press releases. These collections will be supplemented and continued by this project.
  3. Press releases are by their nature well suited to publication on the Internet. Therefore many organisations choose to publish theirs this way.
  4. Several archives of press releases of single organisations relevant for the project can be found on the Internet, but there are no cumulative archives that contain press releases from different organisations. A cumulative archive makes it possible to compare different opinions and statements directly, which is a considerable advantage for the user of this service.
  5. As press releases are explicitly intended to be published openly and widely, their redistribution on the Internet does not raise copyright problems. These press releases usually do not contain any pictures, which - unlike text - often cause copyright problems.

Conceptual considerations on the indexing of the archived materials

The most obvious way to index the archived documents is to use a fulltext retrieval programme. The project does offer this, but only in addition to the search with meta data or - speaking in library terms - catalogue records.

Fulltext retrieval is not always sufficient, especially when a search is to be restricted to documents from a certain period. Fulltext retrieval systems that stay within the library's financial limit do not offer this option.

The idea therefore was to produce meta data for the archived documents in order to take advantage of the library's efficient cataloguing system. Given the great number of locally saved documents and the shortage of staff, it was not possible to create catalogue entries manually.

A method was therefore developed in which locally saved press releases are catalogued with the help of programmes written in Perl. This method relies on the fact that nearly all press releases on the WWW are presented in the form of lists. In many cases these are dynamically generated HTML pages produced by database queries.

This kind of presentation results in HTML source code that is regularly structured as well. On this basis it is possible to write Perl programmes that extract from the source code the information needed to produce catalogue entries automatically. Once the information has been extracted it can be converted into a structured format and managed in a database; Perl programmes are used for this step too. The database software used in this project is Allegro, the library's cataloguing and database programme, but the data could also be adapted for administration in relational database management systems like Oracle. The automatically generated records are loaded into special databases by means of the databases' own update routines.
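
The extraction step can be illustrated with a minimal sketch. It is written in Python for illustration only; the project itself used Perl. The list markup, the field names and the function extract_records are assumptions made for this example and do not reproduce any particular press service.

```python
import re
from html import unescape

# Hypothetical list page: one <li> per press release with date, link and title.
# Real services use different markup, so each needs its own pattern.
SAMPLE_LIST = """
<ul>
<li><span class="date">12.03.2001</span> <a href="/presse/2001/pm045.html">Stellungnahme zur Rentenreform</a></li>
<li><span class="date">09.03.2001</span> <a href="/presse/2001/pm044.html">Erkl&auml;rung zum Haushalt</a></li>
</ul>
"""

# Pattern written for the assumed markup above.
ENTRY = re.compile(
    r'<li><span class="date">(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{4})</span>\s*'
    r'<a href="(?P<href>[^"]+)">(?P<title>.*?)</a></li>'
)


def extract_records(list_html, issuer):
    """Turn every list entry into a simple record (a dict)."""
    records = []
    for match in ENTRY.finditer(list_html):
        records.append({
            "issuer": issuer,
            "title": unescape(match.group("title")).strip(),
            # Store the date as YYYY-MM-DD so that records can be filtered by period.
            "date": "{}-{}-{}".format(
                match.group("year"), match.group("month"), match.group("day")
            ),
            "path": match.group("href"),
        })
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_LIST, "Example Party"):
        print(record)
```

Because the pattern is tied to one layout, a change in the markup of the list means the pattern has to be rewritten.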

From theory into practice - the realisation of the concept

The programming on which the implementation is based is done in the library. A great disadvantage is that a separate subroutine is necessary for each press service, because the subroutine has to be adjusted to the layout of that service's pages. If the layout changes, the subroutine has to be modified as well.

This preliminary work is the basis for the archiving which is done by a student assistant.

First the list for each press service is saved locally under a standardised file name. If the list is generated from a database, it is necessary to determine the parameters of the calling sequence (the request URL) that allow the most efficient work. Some services, for instance, present lists with only a small number of press releases to the user. Changing the parameters so that a list contains hundreds of press releases improves the whole procedure considerably.
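
As a sketch of such a parameter change, the following Python function rewrites one query parameter of a list URL. The URL, the parameter name "max" and the function widen_list_request are hypothetical; which parameter controls the list length, if any, has to be determined for each service.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def widen_list_request(url, size_param, size):
    """Return url with the assumed list-size parameter set to size."""
    parts = urlsplit(url)
    # Keep all existing parameters and overwrite (or add) only the size parameter.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[size_param] = str(size)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(widen_list_request(
        "http://www.example.org/press/list.cgi?year=2001&max=10", "max", 500
    ))
```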

Once the lists are saved locally, the Perl programme is called and produces the catalogue entries. The integration of these records into the database is also done by batch processing. Part of this process is the conversion of the character set, a problem which is not easy to solve: diacritical characters can be encoded in HTML source code in many ways, and the scripts have to recognise all of them.
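
The variety of encodings can be shown with a short Python sketch. It relies on the standard library function html.unescape, which resolves named and numeric character references; the function normalise_title and the sample strings are assumptions for illustration, and the conversion into the character set of the target database is not covered.

```python
from html import unescape


def normalise_title(raw):
    """Resolve HTML character references and collapse runs of whitespace."""
    return " ".join(unescape(raw).split())


# Four spellings of the same word as they may appear in HTML source code:
# named entities, decimal references, hexadecimal references, literal characters.
VARIANTS = [
    "Gesch&auml;ftsf&uuml;hrung",
    "Gesch&#228;ftsf&#252;hrung",
    "Gesch&#xE4;ftsf&#xFC;hrung",
    "Geschäftsführung",
]

if __name__ == "__main__":
    for variant in VARIANTS:
        print(normalise_title(variant))
```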

The local archiving of press releases

The local archiving is again done by a student assistant, using so-called offline readers. These products save HTML pages that are linked to each other automatically once the required starting addresses have been selected. In most cases Teleport Pro is used. A disadvantage of this programme is that it does not work properly when Javascript is used to call linked documents. In these cases Winhttrack - originally a Unix programme - is used, although it is not easy to handle.

The starting addresses must be exactly the same URLs that are used in the calling sequences for the locally saved lists. This ensures that a locally saved press release exists for each generated record.

The local directory structure is organised primarily by the URL of the Internet server that provides the press service. A chronological subdivision is also possible and may be necessary, depending on the size of the collection. Pictures - mostly graphical elements for the navigation within HTML pages - are saved in separate folders for each server.
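
One possible layout of this kind is sketched below in Python. The root directory "archive", the folder name "images", the list of picture suffixes and the function local_path are assumptions; the directory names actually used in the project are not documented here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed set of picture file suffixes.
IMAGE_SUFFIXES = {".gif", ".jpg", ".jpeg", ".png"}


def local_path(url, archive_root="archive", year=None):
    """Map a document URL to a path below a per-server directory."""
    parts = urlsplit(url)
    name = PurePosixPath(parts.path).name or "index.html"
    # The server name is the main element of the directory structure.
    server_dir = PurePosixPath(archive_root) / parts.hostname
    if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES:
        # Pictures go into a separate folder for each server.
        return server_dir / "images" / name
    if year is not None:
        # Optional chronological subdivision.
        server_dir = server_dir / str(year)
    return server_dir / name


if __name__ == "__main__":
    print(local_path("http://www.example.org/presse/pm045.html", year=2001))
    print(local_path("http://www.example.org/gfx/logo.gif"))
```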

Continuing the local work on the press releases

The locally saved press releases are modified by another Perl programme that analyses the automatically generated records. Only HTML files are modified; press releases in PDF or Word format remain unchanged. There are five kinds of modifications:

  1. Links to other pages are disabled. Press releases often contain links to the homepage of the organisation, which is not saved locally. The link to the webmaster is also deactivated, to avoid mail to the webmaster that concerns the application of the FES library.
  2. The references to pictures that belong to the document are adapted to the local file structure. These pictures are - as mentioned before - usually graphical elements for navigation in HTML pages, emblems etc.
  3. The document is given a standardised description in tabular form.
  4. The record is embedded as a Dublin Core element set in the header of the HTML file. This makes it possible to evaluate this information once improved fulltext retrieval programmes are available.
  5. Each locally saved document is linked to a style sheet file. This makes it possible, to a certain degree, to control the layout of all archived documents with one single file.
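
Three of these modifications (disabling links, embedding the record as Dublin Core meta elements, linking a style sheet) can be sketched in Python as follows. The function names, the style sheet path "../archive.css" and the sample page are assumptions; the "DC." prefix in the meta names follows the common convention for Dublin Core in HTML. Regular expressions are sufficient only for simple, regular markup of the kind assumed here.

```python
import re
from html import escape


def disable_links(html_text):
    """Remove the <a ...> wrappers but keep the link text."""
    return re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html_text, flags=re.I | re.S)


def dublin_core_meta(record):
    """Render a record (element name -> value) as Dublin Core meta elements."""
    return "\n".join(
        '<meta name="DC.{}" content="{}">'.format(element, escape(value, quote=True))
        for element, value in record.items()
    )


def postprocess(html_text, record, stylesheet="../archive.css"):
    """Disable links, then add meta elements and a style sheet link to the header."""
    head_extra = (
        dublin_core_meta(record)
        + '\n<link rel="stylesheet" type="text/css" href="{}">\n'.format(stylesheet)
    )
    html_text = disable_links(html_text)
    # Insert the additions immediately before the first closing head tag.
    return re.sub(
        r"</head>", lambda m: head_extra + m.group(0), html_text, count=1, flags=re.I
    )


if __name__ == "__main__":
    page = (
        "<html><head><title>PM 45</title></head><body>"
        '<p>Text. <a href="http://www.example.org/">Home</a></p></body></html>'
    )
    record = {
        "title": "Stellungnahme zur Rentenreform",
        "creator": "Example Party",
        "date": "2001-03-12",
    }
    print(postprocess(page, record))
```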

All saved press releases can be modified again later, so the collection can be adapted at any time to new developments such as XML.

Fulltext retrieval as a supplementary option for searching

The disadvantages of fulltext retrieval have already been mentioned. Nevertheless, it makes sense to offer this kind of search as well. Originally the plan was to extract all meaningful words from the archived press releases and to combine them with the generated records in our library database Allegro. Unfortunately the programming of the batch files necessary for this approach took too much time.

In the present version of the project, fulltext retrieval is implemented as an additional search option. The fulltexts are indexed with the programme Htdig, which is also used in the Occasio project run by the International Institute for Social History. Htdig is a Unix-based freeware programme. Within this project Htdig runs under Linux, as does the Allegro database offered on the Internet.

Unfortunately the present version of Htdig cannot evaluate the information embedded as Dublin Core in the header of the archived documents. It only allows a rough geographical classification of the archived material, based on the country codes found in the names of the directories in which the documents are saved locally. This is possible because the directories are named after the servers and the country codes are part of the servers' names.
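
The classification by country code can be sketched in a few lines of Python. The function country_code, the set of known codes and the server names are assumptions for illustration; servers under generic domains such as .org cannot be classified this way.

```python
# Assumed, deliberately incomplete set of country codes.
KNOWN_CODES = frozenset({"de", "at", "fr", "uk", "se", "nl"})


def country_code(server_dir, known_codes=KNOWN_CODES):
    """Return the country code at the end of a server directory name, or None."""
    last_label = server_dir.rstrip("./").rsplit(".", 1)[-1].lower()
    return last_label if last_label in known_codes else None


if __name__ == "__main__":
    for name in ["www.example.de", "www.example.fr", "www.example.org"]:
        print(name, country_code(name))
```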

Interim report after a year

As already mentioned, 80 press services are archived continually at present. By now approximately 53,000 press releases have been saved locally, indexed as described above and offered on the Internet [The URL of this service is "//library.fes.de/cgi-bin/populo/press_en.pl", including the link to the fulltext retrieval as additional search option.].

Our experiences with this project are satisfactory, apart from a few exceptions. The outlined procedure is usually reliable and combines traditional library working methods with innovative approaches. But once the concept had been put into practice some problems appeared, especially in the context of the anarchic conditions on the Internet:

  1. The programming effort for the project is much higher than originally assumed, because the layouts of the Internet pages change frequently and every change in layout requires additional programming.
  2. Internet pages are often overloaded with Javascript elements. In some cases this means that offline readers cannot save the documents linked from such pages, and there are simply too many documents to save them manually. For technical reasons such press releases cannot be included in the project, even if they might be of interest.
  3. A growing number of press releases is no longer offered on the Internet but sent by e-mail to interested persons. At present the Library of the Friedrich Ebert Foundation does not have an efficient concept to handle this. In particular, relevant meta data can hardly be extracted from e-mails, because the subject line often does not contain any relevant information.
  4. The efficiency of the procedure varies greatly. Basically, it grows with the number of press releases that can be reached from a single starting address. This number is different for every server and depends on many criteria. Apart from the general scope of the service, the organisation of the website and the flexibility of any database interfaces used are of special importance.

Some of these problems will be solved as soon as improved software is available, especially those concerning the use of offline readers. The Internet and its new developments might also require new conceptual ideas. Thus the systematic archiving of press releases that are published via e-mail might lead to exclusive fulltext indexing, which would bring the project closer to the Occasio project of the IISH [For further information concerning the Occasio project see "//www.iisg.nl/occasio/".].

Hopefully fulltext retrieval systems that are available as freeware on the Internet will support Dublin Core element sets in the near future, so that the basic idea of the project can be realised.

Benefits for the IALHI member organisations

Because of the great number of information sources offered on the Internet, it appears impossible for any single institution of the labour movement to collect everything that is published. With the increasing importance of the Internet, co-operation in collecting and indexing relevant documents becomes more and more important. Of course every institution is free to shape the scope and profile of its collecting activities according to its aims and abilities. Within the IALHI the Library of the Friedrich Ebert Foundation is one of the biggest and most capable institutions. Further developments in the described project will lead to a large international collection. IALHI members could link their own Internet pages to this project in order to avoid extra work.

© Friedrich Ebert Stiftung | net edition fes-library | October 2001