DoDoCo - Overview
=================

Fred Vos, Tilburg University

2009-08-19

1. Introduction
+++++++++++++++

This document describes the DoDoCo system.
The name DoDoCo stands for Document Download Counter.
DoDoCo was built for the NEEO project, where download statistics of full text documents (PDF)
must be gathered by the partners and made available for harvesting by the central gateway.
Download statistics must be computed per publication, per scholar and per NEEO partner.
DoDoCo is a framework that sits between the harvester that collects the download metadata from
partners and the EO portal that presents the statistics as graphs and tables.

2. Components
+++++++++++++

The image below shows the components of DoDoCo.

.. image:: ../../images/overview.png
   :height: 8cm
   :width: 12cm
   :align: left

DoDoCo has three major components:

1. The Inbox.

   This is where data on new download events is delivered.
   Typically this is done via HTTP POST.

2. The Processor.

   This component handles the data on download events and makes it accessible for statistics requests.

3. The Service.

   This handles requests for statistics.

3. Components in detail
++++++++++++++++++++++++

The components are listed in alphabetical order.

Client
------
Sends requests for statistics to DoDoCo.
In the NEEO project, the EO portal is a client of DoDoCo.

InfoSource
----------
Download events pushed to DoDoCo may lack some information that is available elsewhere.
The InfoSource is the service that supplies this missing information.

Processor
---------
Takes metadata on download events from the queue, processes these objects and hands them
over to the Storage.
It uses a service called the InfoSource to retrieve information that is missing from
the event metadata but available elsewhere.

Queue
-----
For a new download event that is pushed to DoDoCo, DoDoCo must first analyze
it, then find the associated publication and store the data in the backend storage.
This may take some time and we cannot let the Source wait, so we store the new events
in a queue and return an okay signal to the Source as soon as possible.
The processing of the events from the queue can be done asynchronously in a separate
thread.
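
The decoupling described above can be sketched in a few lines of Python.
This is a minimal illustration, not DoDoCo's actual code: the names ``accept``, ``worker``
and ``processed`` are hypothetical, and a plain list stands in for the backend storage.

.. code-block:: python

   import queue
   import threading

   # Hypothetical stand-ins: these names are illustrative, not DoDoCo's API.
   event_queue = queue.Queue()
   processed = []  # stands in for the backend storage


   def accept(event):
       """Called when a Source pushes an event: enqueue it and return at once."""
       event_queue.put(event)
       return "OK"


   def worker():
       """Runs in a separate thread and takes events from the queue."""
       while True:
           event = event_queue.get()
           if event is None:  # sentinel value: stop the worker
               event_queue.task_done()
               break
           processed.append(event)  # slow analysis and storage would happen here
           event_queue.task_done()


   thread = threading.Thread(target=worker, daemon=True)
   thread.start()

   reply = accept({"identifier": "doc-1"})  # does not wait for processing
   event_queue.join()  # demonstration only: wait until the queue is empty
   accept(None)        # ask the worker to stop
   thread.join()

The Source only ever waits for ``accept``, which does nothing more than a ``put`` on the queue.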

Service
-------
Handles HTTP GET requests for statistics.

Source
------
A source of download events.
This is usually an OAI harvester.
A source pushes new download events to DoDoCo.
DoDoCo lets the Inbox handle these events.

Storage
-------
A database to store data in such a way that DoDoCo can respond to statistics
requests from the Client as fast as possible.

4. Flow
+++++++

DoDoCo is set up in a generic way.
The typical flow is explained here with the implementation of DoDoCo for NEEO.

The Meresco Harvester, the **Source**, fetches metadata on download events from the NEEO partners'
OAI repositories.
NEEO partners make metadata on these events available as Context Objects in their OAI repositories.
A Context Object is an XML document containing information on the download event.
There's a lot of information in there, but we concentrate on:

- Identifier of the document (a unique id).
- Timestamp of the download event (including timezone information).
- Country (where the download request came from).
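
As an illustration, these three fields could be held in a small record type.
The sketch below is an assumption throughout: the class and field names are hypothetical,
the timestamp is assumed to arrive as ISO 8601 text, and normalising to UTC is a design
choice of the example, not something the Context Object format prescribes.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import datetime, timezone


   @dataclass(frozen=True)
   class DownloadEvent:
       """Hypothetical record for the three fields of interest."""
       identifier: str      # unique id of the downloaded document
       timestamp: datetime  # timezone-aware moment of the download
       country: str         # where the download request came from


   def make_event(identifier, timestamp_text, country):
       """Build an event, rejecting timestamps without timezone information."""
       timestamp = datetime.fromisoformat(timestamp_text)
       if timestamp.tzinfo is None:
           raise ValueError("timestamp must include timezone information")
       # Normalise to UTC so that events from different partners are comparable.
       return DownloadEvent(identifier, timestamp.astimezone(timezone.utc), country)


   event = make_event("oai:example.org:123", "2009-08-19T14:30:00+02:00", "NL")

Keeping the timezone matters when downloads are later grouped per day or per month:
without it, events from partners in different timezones could land in the wrong bucket.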

The harvester is instructed to send new events to DoDoCo.
The DoDoCo server hands over each 'push' to the Inbox.

The **Inbox** is responsible for handling new data.
Since fully handling new data can take some time, it only extracts the Context Object data from each batch,
performs some quick quality checks on the data and then stores the Context Objects in the **Queue**.
Context Objects in the **Queue** can be processed later.
The **Inbox** then gives control back to the **Source** (harvester), so it doesn't have to wait too long
before it can continue sending new batches of data.

The **Processor** works in its own thread, completely separate from the **Inbox** or **Service**.
It takes Context Objects from the **Queue**.
For each Context Object it queries the Meresco search server, the **InfoSource**,
to find the record with that publication identifier.
This record was harvested and indexed before and it should be available in the indexes.
The record returned by the Meresco search server contains information on the scholars
who wrote the publication and on the NEEO partner that was the source of the record.

The response from the Meresco search server is transformed into an XML document containing the
essential information.
This document is parsed and offered to the **Storage**, together with the Context Object.
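
The enrichment step can be sketched as follows.
Everything here is hypothetical: a dictionary lookup stands in for the Meresco search server,
a list stands in for the Storage, and the record layout is invented for the example.
What should happen when no record is found is not specified above;
the sketch simply reports it to the caller.

.. code-block:: python

   def process(context_object, info_source, storage):
       """Look up the publication record and hand both objects to the storage."""
       record = info_source(context_object["identifier"])
       if record is None:
           return False  # record not found; the caller decides what to do
       # Keep only the essential information from the full record.
       essential = {
           "scholars": list(record["scholars"]),
           "partner": record["partner"],
       }
       storage.append((context_object, essential))
       return True


   # Hypothetical index standing in for the Meresco search server.
   index = {
       "doc-1": {
           "scholars": ["scholar-a", "scholar-b"],
           "partner": "partner-x",
           "title": "not needed for statistics",
       }
   }
   storage = []

   found = process({"identifier": "doc-1", "country": "NL"}, index.get, storage)
   missing = process({"identifier": "doc-9", "country": "BE"}, index.get, storage)

Passing the lookup in as a function keeps the Processor independent of the particular InfoSource behind it.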

The **Storage** is responsible for storing the Context Objects and record data,
in such a way that these can be used for computing statistics.
The **Storage** uses a database for storing data.
For testing purposes a simple version is also available that stores the data (small amounts)
in memory.
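
A memory-based storage of this kind might look like the sketch below.
It is an assumption-laden simplification: the class name and method signature are hypothetical,
and it keeps only running counts, whereas the real Storage also retains the Context Objects
and record data themselves.

.. code-block:: python

   from collections import Counter


   class MemoryStorage:
       """Keeps download counts in memory; only suitable for small test sets."""

       def __init__(self):
           self.by_publication = Counter()
           self.by_scholar = Counter()
           self.by_partner = Counter()

       def store(self, identifier, scholars, partner):
           """Count one download for the publication, its scholars and its partner."""
           self.by_publication[identifier] += 1
           for scholar in scholars:
               self.by_scholar[scholar] += 1
           self.by_partner[partner] += 1


   storage = MemoryStorage()
   storage.store("doc-1", ["scholar-a", "scholar-b"], "partner-x")
   storage.store("doc-1", ["scholar-a", "scholar-b"], "partner-x")
   storage.store("doc-2", ["scholar-a"], "partner-x")

Updating the three counters at write time means a statistics request becomes a single lookup,
which matches the goal of answering the Client as fast as possible.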

When the user of the EO portal requests statistics on a publication, on a scholar or on a partner,
this request is first translated into an HTTP GET request to DoDoCo.
DoDoCo hands over this request to the **Service**.
The **Service** then retrieves the necessary data from the **Storage**,
transforms it into a standard message and hands it over to the **Client**,
in this case the EO portal.
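
The request handling can be illustrated with a small function that maps a GET URL to a response.
The URL layout, the ``type`` and ``id`` parameter names and the JSON message format are all
assumptions made for this sketch; the format of the standard message is not specified here.
A dictionary of pre-computed counts stands in for the **Storage**.

.. code-block:: python

   import json
   from urllib.parse import urlparse, parse_qs

   # Hypothetical pre-computed counts standing in for the Storage.
   COUNTS = {
       "publication": {"doc-1": 2},
       "scholar": {"scholar-a": 3},
       "partner": {"partner-x": 3},
   }


   def handle_get(url):
       """Turn the URL of a GET request into a (status, body) pair."""
       query = parse_qs(urlparse(url).query)
       kind = query.get("type", [""])[0]
       key = query.get("id", [""])[0]
       if kind not in COUNTS:
           return 400, json.dumps({"error": "unknown type"})
       downloads = COUNTS[kind].get(key, 0)
       return 200, json.dumps({"type": kind, "id": key, "downloads": downloads})


   status, body = handle_get("/stats?type=scholar&id=scholar-a")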