diff --git a/.github/workflows/wayback.yml b/.github/workflows/wayback.yml
new file mode 100644
index 0000000..9127906
--- /dev/null
+++ b/.github/workflows/wayback.yml
@@ -0,0 +1,64 @@
+name: Archive web sources in the Wayback Machine
+
+# On release (and via workflow_dispatch), submit this project's web sources to the
+# Internet Archive Wayback Machine ("Save Page Now") for an immutable, timestamped
+# snapshot. Two kinds of source are captured:
+#   1. The deployed MyST / Jupyter Book site. The repo *source* is preserved in
+#      Software Heritage (swh-save.yml) and Zenodo (docker.yml), but the rendered
+#      *site* is not — this captures it.
+#   2. Any URLs listed in `wayback-urls.txt` at the repo root (one per line, '#'
+#      comments allowed). Use this for Mode-B / paperless claim sources — blogs,
+#      design notes, README pages — that Software Heritage cannot archive because
+#      they are prose, not code.
+#
+# Uses anonymous Save Page Now (no secrets); it is rate-limited. If you hit limits,
+# switch to the authenticated SPN2 API: add Internet Archive S3-style keys as the
+# secrets IA_ACCESS_KEY / IA_SECRET_KEY and an
+#     -H "Authorization: LOW ${IA_ACCESS_KEY}:${IA_SECRET_KEY}"
+# header to the curl call below.
+#
+# STATUS: written, NOT yet executed. Validate via Actions -> Run workflow and check
+# the run log + the resulting web.archive.org snapshot URLs.
+
+on:
+  release:
+    types: [published]
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  wayback:
+    runs-on: ubuntu-latest
+    continue-on-error: true  # best-effort archival must never fail the release
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build URL list
+        run: |
+          repo="${{ github.repository }}"  # owner/name
+          owner="${repo%%/*}"
+          name="${repo#*/}"
+          pages="https://${owner}.github.io/${name}/"
+          {
+            echo "${pages}"
+            if [ -f wayback-urls.txt ]; then
+              grep -vE '^[[:space:]]*(#|$)' wayback-urls.txt || true
+            fi
+          } > /tmp/wayback-urls.txt
+          echo "URLs to archive:"; cat /tmp/wayback-urls.txt
+
+      - name: Submit to Wayback Machine (Save Page Now)
+        run: |
+          while IFS= read -r url; do
+            [ -z "${url}" ] && continue
+            echo "::group::Archiving ${url}"
+            code=$(curl -sS -o /dev/null -w '%{http_code}' \
+              -A "forrt-replication-template wayback workflow" \
+              "https://web.archive.org/save/${url}") || code="000"
+            echo "HTTP ${code} for ${url}"
+            echo "Latest snapshot: https://web.archive.org/web/2/${url}"
+            echo "::endgroup::"
+            sleep 5  # be polite to the Internet Archive endpoint
+          done < /tmp/wayback-urls.txt
diff --git a/CLAUDE.md b/CLAUDE.md
index e3b7b1c..21c8683 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -122,8 +122,9 @@ Exit: `nanopubs/drafts/05_outcome.md` is written with the conclusion sentence, t
 - A GitHub release is cut with a Zenodo-facing description (no internal ops detail, no bot signatures — `docs/cicd-conventions.md` § Release notes are Zenodo descriptions).
 - Zenodo mints a concept DOI; the value is written into `CITATION.cff` and `codemeta.json`.
 - The Docker image is pushed to GHCR via `.github/workflows/docker.yml` and (optionally) archived on Zenodo.
+- On release, two further archival workflows fire automatically (best-effort, never block the release): `swh-save.yml` requests **Software Heritage** Save Code Now so the released revision gets a permanent, forge-agnostic **SWHID**, and `wayback.yml` snapshots the deployed Jupyter Book site plus any URLs in `wayback-urls.txt` (Mode-B / paperless claim sources) in the **Internet Archive Wayback Machine**. See `docs/cicd-conventions.md` § Preservation.
 
-Exit: the release page is live, the Zenodo record exists, and `nanopubs/PUBLISHED.md` lists the source + image DOIs.
+Exit: the release page is live, the Zenodo record exists, and `nanopubs/PUBLISHED.md` lists the source + image DOIs. (Software Heritage + Wayback archival are best-effort and may complete asynchronously.)
 
 ### Phase 5 — FORRT nanopublication chain
 
diff --git a/docs/chain-decision-tree.md b/docs/chain-decision-tree.md
index 43322ed..2da31a8 100644
--- a/docs/chain-decision-tree.md
+++ b/docs/chain-decision-tree.md
@@ -55,6 +55,21 @@ Don't treat PCC as a generic "research question" template that you can use whene
 
 The mirror mistake — using Quote-with-comment to anchor a question-rooted chain — happens when there's a *related* paper but no specific sentence we're testing. In that case, cite the related paper at the AIDA step's *Supported by other publications* group, but anchor the chain in PICO/PCC.
 
+## Paperless claims (Mode-B) — a claim stated in code / README / blog, not a paper
+
+Not every testable claim lives in a paper. A tool's README, a design note, or a blog post can state a falsifiable claim about how a system behaves. These are first-class — *not everyone who advances knowledge writes a paper, and they shouldn't have to write one to make a claim that's testable and citable.* But a paperless source has no DOI, which interacts with the chain start:
+
+- **The Quote-with-comment `Cited DOI` field is DOI-only** (it expects a bare `10.x/y`, not a URL).
So you cannot quote a raw GitHub / blog / SWHID URL there.
+
+Two clean ways to handle a paperless claim:
+
+1. **Deposit the source to get a DOI, then go paper-rooted.** Archive the code (Software Heritage → SWHID; and/or Zenodo → DOI) or the prose (Zenodo deposit → DOI; Wayback for fixity). Only the Zenodo deposit yields a DOI; a SWHID or Wayback snapshot alone does not fit the `Cited DOI` field. Once the source has a DOI, use the normal Quote-with-comment start.
+2. **Go question-rooted (PICO/PCC) and cite the source by URL at the CiTO step.** When there's no DOI to quote, frame the claim as an answered research question (PICO/PCC), then at the CiTO Citation step cite the artifact by URL — the CiTO *"DOI **or other URL**"* field accepts any resolvable URI. This is usually the right shape for a claim that isn't quoting a paper anyway.
+
+**Anchor the source on the most durable artifact identifier available**, in order: **SWHID** (code, forge-agnostic) > **Zenodo DOI** > repo URL > Wayback-snapshotted page URL.
+
+**Credit the original author by any resolvable URI** inside the nanopub (`prov:wasAttributedTo`): an ORCID if they have one, else an institutional profile or `https://github.com/` — never force an ORCID on a non-academic author (that would re-impose the gatekeeping Mode-B exists to bypass). Note the *signer* of the nanopub stays the Science Live user's ORCID; only the *referenced* source and its author may be a non-DOI / non-ORCID URI.
+
 ## What happens after Phase 5
 
 Once a single chain is published, you have three optional layers:
diff --git a/docs/cicd-conventions.md b/docs/cicd-conventions.md
index f5a97e7..90cdacb 100644
--- a/docs/cicd-conventions.md
+++ b/docs/cicd-conventions.md
@@ -168,6 +168,26 @@ If a bad description is already on Zenodo: edit in place via `zenodo.org/records
 
 ---
 
+## Preservation: Zenodo (release), Software Heritage (code), Wayback (web sources)
+
+Three release-time archival paths, each with a distinct job. They are complementary, not redundant — capture all three where applicable.
+
+| Workflow | Archives | Identifier | Coverage |
+|---|---|---|---|
+| `docker.yml` (Zenodo) | the release source tarball + (optionally) the Docker image | Zenodo concept DOI | GitHub-only auto-archival |
+| `swh-save.yml` (Software Heritage) | the source tree at the released revision | **SWHID** (ISO/IEC standard) | forge-agnostic — GitHub, GitLab.com, self-hosted GitLab, any git |
+| `wayback.yml` (Internet Archive) | the deployed Jupyter Book site + the URLs in `wayback-urls.txt` | timestamped `web.archive.org` snapshot | web pages (prose), not code |
+
+Conventions:
+
+- **Code → Software Heritage (SWHID).** SWH is the universal, forge-agnostic anchor: it covers GitLab / self-hosted forks that Zenodo's GitHub-only integration misses. `swh-save.yml` requests Save Code Now on each release. Zenodo gives the *citable release + metadata DOI*; SWH gives the *immutable code identity*. Capture both.
+- **Prose / web sources → Wayback.** Blogs, design notes, README pages that state a claim are not code, so Software Heritage cannot archive them. List them in `wayback-urls.txt`; `wayback.yml` snapshots them (plus the deployed book site) on release. Pair with a Zenodo deposit if a citable DOI is also wanted.
+- **Never anchor on a conda package.** Software Heritage's conda loader is not in production; built conda-forge / bioconda *artifacts* are not archived. The recipes (feedstock GitHub repos) and upstream source repos *are* archived (as git). So anchor reproducibility on **pinned `pixi.toml` / `pixi.lock` + the source repo's SWHID + the container image on Zenodo** — not the conda artifact.
+
+All three workflows trigger only on `release` (plus manual `workflow_dispatch`), so they never run on an uninitialised template or on routine pushes.
+
+---
+
 ## Long-running experiments — don't poll
 
 If an analysis takes more than ~5 minutes:
diff --git a/wayback-urls.txt b/wayback-urls.txt
new file mode 100644
index 0000000..170a7a0
--- /dev/null
+++ b/wayback-urls.txt
@@ -0,0 +1,16 @@
+# wayback-urls.txt — external web sources to snapshot in the Internet Archive
+# Wayback Machine on each release (see .github/workflows/wayback.yml).
+#
+# One URL per line. Lines starting with '#' and blank lines are ignored.
+#
+# Use this for Mode-B / paperless claim sources that Software Heritage cannot
+# archive because they are prose, not code: blog posts, design notes, README
+# pages, or documentation that states a claim your replication tests. A Wayback
+# snapshot gives the source immutable fixity at the moment the claim was made,
+# which the nanopub can cite as provenance.
+#
+# The deployed Jupyter Book site is archived automatically — you do NOT need to
+# list it here.
+#
+# Example:
+# https://example.org/blog/the-claim-we-are-testing
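
Since `wayback.yml` is marked as not yet executed, a local dry run of its URL handling may help before the first release. The following is a minimal sketch, assuming a POSIX shell: the helper names `filter_urls` and `save_endpoint` are hypothetical and do not appear in the workflow. It reuses the workflow's `grep` pattern and only prints the anonymous Save Page Now endpoints it would request, without contacting the Internet Archive.

```shell
# Sketch only: filter_urls and save_endpoint are hypothetical helper names,
# not part of wayback.yml. No network requests are made.

# Print the non-comment, non-blank lines of a wayback-urls.txt-style file.
filter_urls() {
  # Same pattern as the workflow; "|| true" keeps an all-comment file from failing.
  grep -vE '^[[:space:]]*(#|$)' "$1" || true
}

# Print the anonymous Save Page Now endpoint for one URL.
save_endpoint() {
  printf 'https://web.archive.org/save/%s\n' "$1"
}

# Dry run against a temporary list: one comment, one blank line, one URL.
list="$(mktemp)"
printf '%s\n' '# a comment' '' 'https://example.org/blog/the-claim-we-are-testing' > "${list}"
filter_urls "${list}" | while IFS= read -r url; do
  save_endpoint "${url}"
done
rm -f "${list}"
```

Pointing `filter_urls` at the real `wayback-urls.txt` in the repo root shows which entries the release job would submit, in addition to the Pages URL that the workflow prepends on its own.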