A project to help support the periodic update and revision of data within Neotoma that comes from external sources.

NeotomaDB/periodic_updates

Neotoma Periodic Updates

Neotoma pulls data from a number of external sources, including ORCID, ROR, DataCite and possibly others. This repository stores the scripts that load and refresh that periodically updated data in Neotoma.

This repository uses uv as the primary package management tool, and a virtual environment for local development. Once you have a local clone of this repository, use uv sync to install the needed packages, and to create the virtual environment.

To add new Python packages to this project, use uv add ....

Workflow

Deployment is triggered by a push to the dev branch (for neotomatank) or the production branch (for neotoma) of this repository. In each case, the push starts a GitHub Action that builds a Docker image and pushes it to Amazon ECR. When the monthly EventBridge schedule fires, AWS Batch pulls the image and runs the update script in a container. This approach is similar to the one used in clean_backup.

GitHub Repository
        │
        ▼
GitHub Actions (CI/CD)
        │
        ▼
Amazon ECR (Container Image Storage)
        │
        ▼
Amazon EventBridge (Monthly Schedule)
        │
        ▼
AWS Batch (Runs the Container)
        │
        ▼
RDS PostgreSQL (Your Database)

An Example

The ndb.institutions table uses data from the Research Organization Registry (ROR) and will be linked to the contacts table, as well as the project and other tables. The ROR data is provided as a comprehensive dump in JSON and CSV formats that is updated periodically on Zenodo. We want a script that checks the Zenodo record monthly to determine whether a new version of the ROR dataset has been released and, if so, updates any records that have been added or modified.
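The monthly check could be sketched roughly as follows. This is a minimal illustration, not the code in rorUpdate.py: the function names are hypothetical, the caller supplies the Zenodo API URL for the record, and the sketch assumes the returned JSON carries a top-level `id` that changes with each published version. Where the last processed id is stored (a table, a file) is left open.

```python
import json
import urllib.request
from typing import Optional


def fetch_record(url: str, timeout: float = 30.0) -> dict:
    """Download and parse a JSON record description from the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def needs_update(record: dict, last_seen_id: Optional[str]) -> bool:
    """Return True when the record differs from the last one we processed.

    Assumes `record["id"]` identifies one specific version of the dataset.
    `last_seen_id` is None when no version has been processed yet.
    """
    if last_seen_id is None:
        return True
    return str(record["id"]) != str(last_seen_id)
```

Comparing ids rather than dates keeps the check independent of how the publication date is formatted; only when `needs_update` returns True would the script go on to download and load the dump.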

Writing the Data Ingest Scripts

We try to keep each data ingest script in its own file. In this case, src/periodic_updates/rorUpdate.py is the file that brings in the ROR data. If you look at this file you can see that it uses several helper functions from periodic_updates, but most of its functionality is within the file itself.

We use environment variables so that we don't accidentally expose private data when we upload these scripts to GitHub, and we perform the actual insertions with psycopg, using INSERT ... ON CONFLICT ... statements.
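A minimal sketch of that pattern is shown below, assuming psycopg 3. The column names (`rorid`, `name`, `country`), the conflict target, and the choice of the standard libpq variable names (`PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGPORT`) are illustrative assumptions, not the actual schema or configuration used by rorUpdate.py.

```python
import os

# Hypothetical columns; the real ndb.institutions schema may differ.
UPSERT_SQL = """
INSERT INTO ndb.institutions (rorid, name, country)
VALUES (%(rorid)s, %(name)s, %(country)s)
ON CONFLICT (rorid) DO UPDATE
   SET name = EXCLUDED.name,
       country = EXCLUDED.country;
"""

REQUIRED_VARS = ("PGHOST", "PGDATABASE", "PGUSER", "PGPASSWORD")


def connection_settings(env=os.environ) -> dict:
    """Build connection keyword arguments from environment variables."""
    missing = [key for key in REQUIRED_VARS if key not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["PGHOST"],
        "port": env.get("PGPORT", "5432"),
        "dbname": env["PGDATABASE"],
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
    }


def upsert_institutions(rows, env=os.environ) -> None:
    """Insert new rows and update existing ones in a single transaction."""
    import psycopg  # imported here so the helpers above work without it

    # In psycopg 3 the connection context manager commits on success
    # and rolls back if an exception is raised.
    with psycopg.connect(**connection_settings(env)) as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
```

Each row is a dictionary such as `{"rorid": "...", "name": "...", "country": "..."}`. Failing early on a missing variable gives a clearer error in the Batch logs than a connection failure would.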

Once we have scripted the workflow for this insertion, we import it into the main updater.py script, test it and, ultimately, deploy it.
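One way the main script could tie the individual ingest scripts together is sketched below. This is an assumption about structure, not a description of the actual updater.py: `run_updates` and the registry of named update functions are hypothetical.

```python
import logging
from typing import Callable, Dict, List


def run_updates(updates: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each update in turn and return the names of those that failed.

    A failure in one update is logged and does not stop the others.
    """
    failed = []
    for name, update in updates.items():
        try:
            update()
        except Exception:
            logging.exception("Update %s failed", name)
            failed.append(name)
    return failed
```

A script built this way might register each imported update under a name, call `run_updates`, and exit with a non-zero status if the returned list is non-empty, so that a failed run is visible in the job status.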

Docker Container
