Ingestion


Architecture

Besides the Inbox login system, the long-term database, and the archive storage, the ingestion pipeline employs a microservice architecture with the following components:

Figure: Ingestion Architecture and Connected Components

For the ingestion pipeline, the reference implementation uses a microservice architecture with an internal database, a staging area, and a local message broker.

  • db: A Postgres database with appropriate schemas, for saving the pipeline internals.

  • mq: A RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings. We use a federated queue to receive messages from Central EGA's broker and shovels to send the answers back.

  • ingest: Splits the Crypt4GH header, decrypts the file, checksums its content, and moves the data portion to the storage backend.

  • backup: Creates multiple backups of the data portion.

We assume the files are already uploaded to the Inbox login system. For a given Local EGA, Central EGA selects the associated vhost and drops one message per file to ingest into the upstream queue, with type=ingest.
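
For illustration, here is a minimal sketch of publishing such a trigger message with Python and pika; the broker URL, queue name, and message fields are assumptions for this example, not the exact Central EGA format.

import json

import pika

# Broker URL and credentials are placeholders.
connection = pika.BlockingConnection(
    pika.URLParameters("amqp://user:password@localhost:5672/%2F"))
channel = connection.channel()

# One message per file to ingest; the field names are illustrative.
message = {"type": "ingest", "user": "jane", "filepath": "inbox/sample.c4gh"}

channel.basic_publish(
    exchange="",            # default exchange (placeholder)
    routing_key="files",    # hypothetical upstream queue name
    body=json.dumps(message),
    properties=pika.BasicProperties(content_type="application/json"),
)
connection.close()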

On the Local EGA side, a worker retrieves this message, finds the associated file in the inbox, splits its Crypt4GH header, decrypts its data portion (a.k.a. its payload), checksums its content, and moves the payload to a staging area (with a temporary name). The files are read chunk by chunk in order to bound memory usage.
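
As a sketch of that chunked processing (SHA-256 and the decrypt_chunk placeholder are assumptions of this example, standing in for the actual Crypt4GH decryption):

import hashlib

CHUNK_SIZE = 64 * 1024  # bounded reads keep memory usage flat

def decrypt_chunk(chunk):
    # Placeholder: the real worker decrypts each Crypt4GH data block
    # with the service's secret key.
    return chunk

def ingest_payload(encrypted_payload, staging_file):
    # Decrypt chunk by chunk, checksum the decrypted content, and
    # write it to the staging area (under a temporary name).
    digest = hashlib.sha256()
    with open(encrypted_payload, "rb") as src, open(staging_file, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            data = decrypt_chunk(chunk)
            digest.update(data)
            dst.write(data)
    return digest.hexdigest()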

A message is then sent to Central EGA with the checksum of the decrypted content (i.e., the original file), requesting an Accession ID. A message with type=accession comes back via the upstream queue.
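
Schematically, the exchange might look as follows (all field names and values are illustrative, not the exact Central EGA schema):

# Outgoing: request an Accession ID for the ingested file.
accession_request = {
    "user": "jane",
    "filepath": "inbox/sample.c4gh",
    "decrypted_checksums": [{"type": "sha256", "value": "<sha256 hexdigest>"}],
}

# Incoming: the reply arriving on the upstream queue.
accession_reply = {
    "type": "accession",
    "accession_id": "EGAF-PLACEHOLDER",
    "filepath": "inbox/sample.c4gh",
}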

The reference implementation includes a backup step that stores two copies of the payload. This is for illustration purposes only, as all Local EGA instances will probably already have their own backup system (such as Ceph, for example).

The backend store can be either a regular file system on disk or an S3 object storage. The reference implementation can interface with a POSIX-compliant file system. In order to use an S3-backed storage, the Local EGA system administrator can use s3fs-fuse, or update the code (as was once done, and is now offloaded to our Swedish and Finnish partners).
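
A minimal sketch of the two options (paths and the bucket name are placeholders; the S3 variant assumes the boto3 library):

import shutil

import boto3  # assumption: S3 access via boto3 in this sketch

def archive_posix(staging_file, archive_path):
    # POSIX backend: a plain copy onto the archive file system.
    shutil.copyfile(staging_file, archive_path)

def archive_s3(staging_file, object_key):
    # S3 backend: upload the payload as a single object.
    s3 = boto3.client("s3")
    s3.upload_file(staging_file, "lega-archive", object_key)  # placeholder bucket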

If any of the above steps generates an error, we exit the workflow and log the error. If the error is caused by the user, such as submitting to the wrong Local EGA or tampering with the encrypted file, it is forwarded to Central EGA in order to be displayed to the user. Otherwise, the error is left for a Local EGA system administrator to handle.
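
A minimal sketch of that dispatch logic (UserError and forward_to_cega are hypothetical names for this example):

import logging

class UserError(Exception):
    # Hypothetical marker for errors caused by the submitter, e.g. a
    # tampered file or a submission to the wrong Local EGA.
    pass

def handle_error(exc, forward_to_cega):
    logging.error("Ingestion failed: %s", exc)
    if isinstance(exc, UserError):
        # User-related errors are forwarded to Central EGA for display.
        forward_to_cega(str(exc))
    # Any other error stays in the local logs for a Local EGA system
    # administrator to handle.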

Upon completion, the Accession ID, the header, and the archive paths are saved in a separate long-term database, specific to each Local EGA. The reference implementation provides one for illustration, and saves a few more useful bits of information, such as the payload size and checksum. This allows a system administrator to perform data curation regularly.
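
For example, the final record could be saved along these lines (the table and column names are assumptions, not the actual long-term database schema):

import psycopg2  # assumption: plain psycopg2 access in this sketch

def save_final_record(dsn, accession_id, header, archive_path, payload_size, checksum):
    # Table and column names are illustrative only.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO files "
            "(accession_id, header, archive_path, payload_size, checksum) "
            "VALUES (%s, %s, %s, %s, %s)",
            (accession_id, header, archive_path, payload_size, checksum),
        )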

  • db: A Postgres database. See the long-term database schema for an example.

  • storage: A POSIX file system or an S3-backed disk.

  • save2db: A microservice that saves the final message from the pipeline into the above external long-term database.

Installation & Bootstrap

A reference implementation can be found in the Local EGA Github repository. We containerized the code and use Docker to deploy it.

Since there are several components with multiple settings, we also created a bootstrap script to help deploy a LocalEGA instance on your local machine. The bootstrap generates random passwords, configuration files, the necessary public/secret keys, and certificates for secure communication, and it connects the different components together (via docker-compose files).
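
As a tiny illustration of the kind of material the bootstrap generates (Python's secrets module here stands in for the actual bootstrap scripts):

import secrets

# Random credentials of the kind the bootstrap writes into the
# generated configuration files (illustrative only; the real
# bootstrap also produces keys, certificates and docker-compose files).
db_password = secrets.token_urlsafe(32)
mq_password = secrets.token_urlsafe(32)
print(db_password, mq_password)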

Finally, the bootstrap creates a few test users and a fake Central EGA instance, to demonstrate the connection and to allow running the testsuite.

The reference implementation can be deployed locally, using docker-compose (suitable for testing or local development).

$ git clone https://github.com/EGA-archive/LocalEGA.git LocalEGA
$ cd LocalEGA/deploy
$ make -C bootstrap  # Generate the configuration settings
$ make -j 4 images   # optional, (pre/re)generate the images
$ make up            # Start a Local EGA instance, including a fake Central EGA
$ make ps            # See the status of this Local EGA instance
$ make logs          # See the (very verbose) logs of this Local EGA instance

Once the bootstrap files are generated, all interesting settings are found in the deploy/private sub-directory.

There is no need to pre/re-generate the Docker images: they are automatically built on Docker Hub and pulled in when booting the LocalEGA instance. This includes a reference implementation of the Inbox login system. That said, executing make -j 4 images will generate them locally.

You can clean up the local instance using make down.

Note

Production deployments: Our partners developed alternative bootstrap methods for Docker Swarm and Kubernetes. Those methods allow you to deploy a LocalEGA instance in a production environment, including scaling, monitoring and healthchecks.

Testsuite

We have implemented a testsuite, grouping tests into the following categories: integration tests, robustness tests, security tests, and stress tests.

All tests simulate realistic user scenarios, i.e. how users will interact with the system. All tests run on a GitHub Actions runner, on every push to master and every Pull Request creation (i.e., they are integrated into the CI).

  • Integration Tests: test the overall ingestion architecture and simulate how a user will use the system.

  • Robustness Tests: test the microservice architecture and how the components are interconnected. For example, they check that the overall functionality remains when the database or one of the microservices is restarted.

  • Security Tests: increase confidence around the security of the implementation. They give some deployment guarantees, such as that one user cannot see another user's inbox, or that the vault is not accessible from the inbox.

  • Stress Tests: “measure” the performance of the system.