THIS DOCUMENTATION IS OUT OF DATE, sorry
Updating the Economists Online database. Documentation
The Economists Online database consists of two collections:
- eo-repo: the collection with harvested content of the Nereus partners
- repec: the collection with the RePEc records
There are separate procedures for updating the two collections. A common procedure is the ConfigMaker.
ConfigMaker
The ConfigMaker is a Python script that generates the configuration files for the harvester. There are 5 different configuration file types:
- [collection].cf: this file contains general config info for the harvesting of a collection
- [database].target: this file specifies to which database (server) the harvester must send the records.
- [collection].domain: this file informs the harvester about the collection, i.e. about the repositoryGroups (subcollections) it contains
- [collection].[repositoryGroup].repositoryGroup: this file lists the repositories in a group. Repositories are grouped because several partners have more than one repository. If a repository has more than one set, the harvester treats each set as a separate repository.
- [collection].[repository].repository: the URL of the repository, the set, the metadataPrefix, action (refresh/clear), use (harvest the repository yes/no)
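The naming patterns above can be illustrated with a small sketch. It is not taken from the ConfigMaker source: the function name, its arguments and the example names ("eo-db", "tilburg", "tilburg-articles") are assumptions made only to show how the five patterns expand.

```python
def config_file_names(collection, database, repository_groups):
    """Return the config file names for one collection.

    repository_groups maps a repositoryGroup name to the names of its
    repositories (hypothetical structure, chosen for this sketch only).
    """
    names = [
        f"{collection}.cf",        # general harvesting config
        f"{database}.target",      # database (server) that receives the records
        f"{collection}.domain",    # the collection and its repositoryGroups
    ]
    for group, repositories in repository_groups.items():
        names.append(f"{collection}.{group}.repositoryGroup")
        for repository in repositories:
            names.append(f"{collection}.{repository}.repository")
    return names


if __name__ == "__main__":
    for name in config_file_names("eo-repo", "eo-db", {"tilburg": ["tilburg-articles"]}):
        print(name)
```

Note that the target file is named after the database, while the other four types are named after the collection.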
The ConfigMaker also has its own config file with general info such as:
- the collection
- the directory where the harvester expects the target, domain, repositoryGroup and repository files
- the name and the (local) port of the target database server
This general info is used during the initialisation of a ConfigMaker object. For each collection a separate object with its own general config file is instantiated.
During initialisation the target file is generated.
The collection-specific config files are generated by separate methods. The cf file is generated by a common method.
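The initialisation step described above might look roughly like the sketch below. It is a stand-in, not the real ConfigMaker: the class name, the setting keys and, in particular, the content written to the target file are assumptions, since the actual file format is not described in this document.

```python
import os
import tempfile


class ConfigMakerSketch:
    """Illustrative stand-in for ConfigMaker; not the real class."""

    def __init__(self, settings):
        # General info from the ConfigMaker's own config file.
        # The key names used here are assumed for this sketch.
        self.collection = settings["collection"]
        self.config_dir = settings["config_dir"]
        self.target_name = settings["target_name"]
        self.target_port = settings["target_port"]
        # The target file is generated during initialisation.
        self.target_path = self._write_target_file()

    def _write_target_file(self):
        path = os.path.join(self.config_dir, self.target_name + ".target")
        with open(path, "w") as handle:
            # Assumed content: the real target file format may differ.
            handle.write(f"name={self.target_name}\n")
            handle.write(f"port={self.target_port}\n")
        return path


def demo():
    """Create one object for one collection in a temporary directory."""
    with tempfile.TemporaryDirectory() as config_dir:
        maker = ConfigMakerSketch({
            "collection": "eo-repo",
            "config_dir": config_dir,
            "target_name": "eo-db",   # hypothetical database name
            "target_port": 8080,      # hypothetical local port
        })
        return os.path.basename(maker.target_path), os.path.exists(maker.target_path)


if __name__ == "__main__":
    print(demo())
```

A second collection would get its own object, created from its own general config file.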
eo-repo
The config files for the harvester are generated with the ConfigMaker.
Starting the update of the eo-repo collection
https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/eo-updater/bin/start-eo-repo-update.sh
This script should be run every 24 hours, e.g. by cron. All it does is start the following Python script with the correct config file.
https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/eo-updater/bin/start-eo-repo-update.py
This script contains the workflow for updating the eo-repo collection.
The script is started in the directory in which the harvester will also be started. In this directory the harvester expects to find its cf-file, which is generated dynamically by the makeCfFile method of ConfigMaker.
After the cf-file has been generated, the configuration files are generated for the target of the harvester, the domain (collection), the repositoryGroups and the repositories (see above). These configuration files are generated by the makeEoRepoFiles method of MakeConfig. This method returns a
repec
repec gateway
Between the OAI server of RePEc (http://oai.repec.org) and the harvester, we run a gateway (http://radix-21.uvt.nl:4080/repec/sitemap.xmap).
The function of the gateway is
Generating config files for the harvester with the ConfigMaker
The repec-specific config files are generated in a separate directory that is first cleared. The files in this directory are then compared with the files in use by the harvester. Files that are in use by the harvester but are no longer generated are candidates for removal; this applies not only to the files themselves but also to the related database content. Once the removal candidates have been determined, the newly generated config files are copied to the directory of the harvester.
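The comparison described above amounts to a set difference between two lists of file names. The following is a minimal sketch under that reading; the function names and the example file names are assumptions, not code from the updater.

```python
import os


def removal_candidates(generated_names, in_use_names):
    """Return names the harvester still uses but that were not regenerated."""
    return sorted(set(in_use_names) - set(generated_names))


def removal_candidates_in_dirs(generated_dir, harvester_dir):
    """Same comparison, reading the file names from two directories."""
    return removal_candidates(os.listdir(generated_dir), os.listdir(harvester_dir))


if __name__ == "__main__":
    generated = ["repec.a.repository", "repec.b.repository"]
    in_use = ["repec.b.repository", "repec.c.repository"]
    print(removal_candidates(generated, in_use))
```

The candidates have to be determined before the new files are copied, because afterwards the harvester directory no longer shows which files were in use.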
For the repec collection there is just one repositoryGroup 'repec-org'. This repositoryGroup contains more than 4000 repositories. These repositories correspond to the RePEc series. A list with the identifiers (handles) of the series is provided by the repec gateway (see above). The repository identifiers are added to the repositoryGroup file.
For each repository, a config file is generated. This also applies to the repositories that are candidates for removal: for these, the harvester is instructed to clear the repository from the database (the harvester asks the database server to delete the records).
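As a sketch of how the action could be chosen per repository: the field names follow the list of repository settings given earlier (URL, set, metadataPrefix, action, use), but the dictionary layout, the use of the series handle as the set, the default metadataPrefix "oai_dc" and the example handles are all assumptions.

```python
def repository_settings(handle, base_url, candidates_for_removal,
                        metadata_prefix="oai_dc"):
    """Build the settings for one repository config file (assumed layout)."""
    # Removal candidates are cleared; all other repositories are refreshed.
    action = "clear" if handle in candidates_for_removal else "refresh"
    return {
        "url": base_url,
        "set": handle,                      # assumption: one set per series
        "metadataPrefix": metadata_prefix,  # assumption: actual prefix unknown
        "action": action,
        "use": "yes",
    }


if __name__ == "__main__":
    gateway = "http://gateway.example/repec"  # placeholder URL
    removed = {"RePEc:ccc:wpaper"}            # placeholder handle
    for handle in ["RePEc:aaa:wpaper", "RePEc:ccc:wpaper"]:
        print(repository_settings(handle, gateway, removed))
```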
In the repec-update.cfg file it is possible to list the series that will not be included. The listing is done by the handles of the series. It is also possible to list the handle of an archive for skipping all the series of that archive.
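The exclusion rule can be sketched as follows, assuming that a series handle has the form RePEc:archive:series and an archive handle the form RePEc:archive. The function names and the example handles are made up for illustration, and how the list is actually read from repec-update.cfg is not shown.

```python
def is_excluded(series_handle, excluded_handles):
    """True if the series itself, or the archive it belongs to, is listed."""
    for excluded in excluded_handles:
        if series_handle == excluded:
            return True
        # An archive handle excludes every series below it.
        if series_handle.startswith(excluded + ":"):
            return True
    return False


def filter_series(series_handles, excluded_handles):
    """Keep only the series that are not excluded."""
    return [h for h in series_handles if not is_excluded(h, excluded_handles)]


if __name__ == "__main__":
    series = ["RePEc:aaa:wpaper", "RePEc:aaa:journl",
              "RePEc:bbb:wpaper", "RePEc:ccc:wpaper"]
    excluded = ["RePEc:aaa", "RePEc:bbb:wpaper"]
    print(filter_series(series, excluded))
```

Appending the colon before the prefix test prevents an archive handle such as RePEc:aa from accidentally matching series of the archive RePEc:aaa.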
Starting the update of the repec collection
https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/eo-updater/bin/start-repec-update.sh
https://svn.non-gnu.uvt.nl/uvt-dev/trunk/sources/eo-updater/bin/start-repec-update.py