Automated production of bibliographical information for locally stored internet files : a project to establish archives of electronic press services of parties and trade unions / Walter Wimmer. - [Electronic ed.]. - Gent : IALHI. - 18 KB, Text
In: Acta / International Association of Labour History Institutions. - 32.2001 (2002), pp. 45-50
Electronic ed.: Bonn : FES Library, 2001
© Friedrich-Ebert-Stiftung
Automated production of bibliographical information for locally stored internet files - a project to establish archives of electronic press services of parties and trade unions

Aim and scope of the project

The intention of the project is to collect press releases that political parties and trade unions offer on the Internet. On the national level, press releases are collected from those political parties that are represented in the German Bundestag (the German parliament). Special emphasis lies on press releases of the German Social Democratic Party (SPD): here press releases are collected on the national as well as on the regional level, and press releases of the SPD parliamentary parties are also of relevance. Press releases of German trade unions are collected down to the regional level. On the European level, press releases of the member parties of the Socialist International as well as of national trade union organisations are archived. In addition, the collecting includes press releases of international trade secretariats and of other international organisations of the labour movement. At present the project comprises approximately 80 press services which are archived regularly. The most important ones are still German organisations, but European and international ones are becoming more and more important for the project.

Archiving of press releases - a sensible addition to conventional collecting activities

The decision to collect press releases of political parties and trade unions systematically is based on several considerations.
Conceptual considerations on the indexing of the archived materials

The most obvious approach to indexing the archived documents is the use of a fulltext retrieval programme. The project does offer this, but only in addition to a search based on meta data or - in librarian terms - catalogue records. Fulltext retrieval alone is not always sufficient, especially when a search is to be restricted to documents from a certain period; fulltext retrieval systems that stay within the library's financial limits do not offer this. Hence the idea to produce meta data for the archived documents in order to take advantage of the library's efficient cataloguing system.

Because of the great number of locally saved documents and a shortage of staff it was of course not possible to generate catalogue entries manually. A method was therefore developed in which the cataloguing of locally saved press releases is done with the help of programmes written in Perl. This method is based on the fact that nearly all press releases on the WWW are presented in the form of lists. In many cases these are dynamically built HTML pages generated by database queries, and this kind of presentation requires an HTML source code that is itself evenly structured. On this basis it is possible to write Perl programmes that extract from the source code the information used to produce catalogue entries automatically (a sketch of such a script is given below). Once the information has been extracted it can be transferred into a structured format and administered in a database; again Perl programmes are used for this step. The database software used in this project is the library's cataloguing and database programme Allegro, but it would also be possible to adapt the produced data for relational database management systems such as Oracle. The automatically generated data sets are transferred into dedicated databases by their specific update routines.

From theory into practice - the realisation of the concept

The programming that forms the basis of the realisation is done in the library. A great disadvantage is that a separate subroutine is necessary for each press service: the subroutine has to be adjusted to the layout of that service's list, and if the layout changes the subroutine has to be modified as well. This preliminary work is the basis for the archiving, which is done by a student assistant. First the list of each press service is saved locally under a standardised file name. If the list is generated from a database it is necessary to determine those parameters of the calling sequence that allow the most efficient work. For instance, some services present lists with only a small number of press releases to the user; changing the calling sequence parameters - say, raising a (hypothetical) result-count parameter from 10 to 500 - so that lists with hundreds of press releases are returned means a considerable improvement of the whole procedure. When the lists have been saved locally, the Perl programme is called which then produces the catalogue entries. The integration of these records into the database also takes place via batch processing. Part of this process is the conversion of the character set, a problem which is not easy to solve: there are many ways to write diacritical signs in HTML source code, and all of them have to be recognised by appropriate scripts.
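The extraction step described above can be made concrete with a minimal Perl sketch. The HTML layout matched here, the file handling and the simple tagged output format are all assumptions for illustration; the real scripts are tailored to each individual press service and feed their output to the Allegro database.

    #!/usr/bin/perl
    # Sketch of a list-parsing script. The layout matched below is
    # invented; each real press service needs its own pattern, and the
    # pattern must be adjusted whenever the layout of the service changes.
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 list.html\n";
    open my $fh, '<', $file or die "cannot open $file: $!\n";
    my $html = do { local $/; <$fh> };   # slurp the whole list page
    close $fh;

    # Assumed layout of one list entry:
    #   <li><a href="pr0815.htm">Title of the release</a> (12.10.2001)</li>
    while ($html =~ m{<li><a href="([^"]+)">([^<]+)</a>\s*\((\d\d\.\d\d\.\d{4})\)</li>}g) {
        my ($url, $title, $date) = ($1, $2, $3);
        # Emit one record in a simple tagged format; in the project these
        # fields would be mapped onto categories of the Allegro database.
        print "TIT: $title\nDAT: $date\nURL: $url\n\n";
    }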
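The character set conversion can be sketched in the same spirit. The mapping table below is only a small excerpt, and Latin-1 as the target character set is an assumption; the real scripts have to cover named entities, numeric entities and raw bytes alike.

    #!/usr/bin/perl
    # Sketch of an entity-normalising filter for diacritical signs.
    use strict;
    use warnings;

    my %named = (
        auml  => "\xE4", ouml   => "\xF6", uuml   => "\xFC",
        Auml  => "\xC4", Ouml   => "\xD6", Uuml   => "\xDC",
        szlig => "\xDF", eacute => "\xE9", egrave => "\xE8",
    );

    while (my $line = <>) {
        # named entities such as &uuml;
        $line =~ s/&([A-Za-z]+);/exists $named{$1} ? $named{$1} : "&$1;"/ge;
        # numeric entities such as &#252; (Latin-1 range only)
        $line =~ s/&#(\d+);/$1 < 256 ? chr($1) : "&#$1;"/ge;
        print $line;
    }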
The local archiving of press releases

The local archiving is again done by a student assistant. Here so-called offline readers are used. These products are able to save HTML pages that are linked to each other automatically, once the required starting addresses have been selected. In most cases Teleport Pro is used. A disadvantage of this programme is that it does not work properly when JavaScript is used to call linked documents. In these cases WinHTTrack - the Windows version of a programme that originated on Unix - is used, though it is not easy to handle. It is necessary to use as starting addresses exactly the same URLs that are also mentioned in the calling sequences for the locally saved lists. This ensures that a locally saved press release exists for each generated record. The main element of the local directory structure is the URL of the Internet server that provides the press service. A chronological subdivision is possible and may be necessary, depending on the size of the files (see the path sketch below). Pictures - mostly graphical elements for the navigation within HTML pages - are saved in separate folders for each server.

Continuing the local work on the press releases

The locally saved press releases are modified by another Perl programme that analyses the automatically generated records. Only HTML files are modified; press releases in PDF or Word format remain unchanged. The programme applies five kinds of modification to the files. All press releases that are saved can later be modified, which makes it possible to adjust the collection at any time to new developments such as XML.
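The directory scheme described above can be illustrated with a short Perl sketch; the archive root and the example URL are invented, only the scheme itself (server name, optional year) follows the structure described in the text.

    #!/usr/bin/perl
    # Sketch of the derivation of a local storage path from a server URL.
    use strict;
    use warnings;
    use File::Path qw(mkpath);

    my $root = '/archive/press';   # assumed archive root

    sub local_path {
        my ($url, $year) = @_;
        # the main element of the directory structure is the server name
        my ($server) = $url =~ m{^https?://([^/]+)} or die "no server in $url\n";
        my $dir = "$root/$server";
        $dir .= "/$year" if defined $year;   # chronological subdivision
        return $dir;
    }

    my $dir = local_path('http://www.spd.de/presse/pm0815.htm', 2001);
    mkpath($dir) unless -d $dir;
    print "saving under $dir\n";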
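The five modifications themselves are specific to the project. As one conceivable example - consistent with the Dublin Core headers mentioned in the next section - the following sketch writes the bibliographical data of a record into the header of a saved HTML file; the field selection and sample values are assumptions.

    #!/usr/bin/perl
    # Sketch: inserting Dublin Core meta tags into a saved HTML file.
    use strict;
    use warnings;

    sub add_dublin_core {
        my ($html, %dc) = @_;
        my $meta = join '',
            map { qq{<meta name="DC.$_" content="$dc{$_}">\n} } sort keys %dc;
        # place the meta tags directly after the opening <head> tag
        $html =~ s{(<head[^>]*>)}{$1\n$meta}i;
        return $html;
    }

    my $page = '<html><head><title>Press release</title></head><body>...</body></html>';
    print add_dublin_core(
        $page,
        Title   => 'Title of the release',
        Date    => '2001-10-12',
        Creator => 'SPD',
    );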
Fulltext retrieval as a supplementary option for searching

The disadvantages of fulltext retrieval have already been mentioned. Nevertheless, it makes sense to offer this kind of search as well. At first it was planned to extract all meaningful words from the archived press releases in order to combine them with the generated records in our library database Allegro; unfortunately, programming the batch files necessary for this approach took too much time. In the present version of the project, fulltext retrieval was therefore implemented as an additional search option. For the indexing of the fulltexts the programme Htdig is used, which is also employed in the Occasio project run by the International Institute of Social History. Htdig is a Unix-based freeware programme; within this project it runs - as does the Allegro database offered on the Internet - under Linux. Unfortunately the present version of Htdig does not offer the option to analyse the information stored as Dublin Core in the header of the archived documents. It only allows a rough geographical classification of the archived material, based on the country codes that are found in the directories where the documents are saved locally; this is possible because these country codes are part of the servers' names.
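A minimal sketch of this classification step, with an invented archive root and invented example paths:

    #!/usr/bin/perl
    # Sketch: the country code is taken from the top-level domain of the
    # server name under which a document is stored locally.
    use strict;
    use warnings;

    sub country_code {
        my ($path) = @_;
        # e.g. /archive/press/www.spd.de/2001/pm0815.htm -> "de"
        my ($server) = $path =~ m{^/archive/press/([^/]+)} or return '?';
        my ($tld)    = $server =~ /\.([a-z]+)$/;
        return $tld || '?';
    }

    for my $path ('/archive/press/www.spd.de/2001/pm0815.htm',
                  '/archive/press/www.psoe.es/pm4711.htm') {
        print country_code($path), "  ", $path, "\n";
    }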
Interim report after a year

As already mentioned, 80 press services are continually archived at present. In the meantime approximately 53,000 press releases have been saved locally, indexed as described above and offered on the Internet [the URL of this service is //library.fes.de/cgi-bin/populo/press_en.pl, including the link to the fulltext retrieval as an additional search option].
Some of the problems encountered will be solved as soon as improved software is available, especially those concerning the use of offline readers. New developments on the Internet may also require new conceptual ideas: the systematic archiving of press releases that are published via e-mail, for instance, might lead to an exclusively fulltext-based indexing, which would mean a closer approach to the Occasio project of the IISH
[For further information concerning the Occasio project see //www.iisg.nl/occasio/.]

Benefits for the IALHI member organisations

Because of the great number of information sources offered on the Internet it appears impossible for any single institution of the labour movement to collect everything that is published. With the increasing importance of the Internet, co-operation in collecting and indexing relevant documents becomes more and more important. Of course every institution is free to shape the scope and profile of its collecting activities according to its aims and abilities. Within the IALHI, the Library of the Friedrich Ebert Foundation is one of the biggest and most capable institutions. Further development of the described project will lead to a great international collection; for IALHI members this could offer the opportunity to link their own Internet pages with this project in order to avoid extra work.