Merged
64 changes: 64 additions & 0 deletions .github/workflows/wayback.yml
@@ -0,0 +1,64 @@
name: Archive web sources in the Wayback Machine

# On release (and via workflow_dispatch), submit this project's web sources to the
# Internet Archive Wayback Machine ("Save Page Now") for an immutable, timestamped
# snapshot. Two kinds of source are captured:
# 1. The deployed MyST / Jupyter Book site. The repo *source* is preserved in
# Software Heritage (swh-save.yml) and Zenodo (docker.yml), but the rendered
# *site* is not — this captures it.
# 2. Any URLs listed in `wayback-urls.txt` at the repo root (one per line, '#'
# comments allowed). Use this for Mode-B / paperless claim sources — blogs,
# design notes, README pages — that Software Heritage cannot archive because
# they are prose, not code.
#
# Uses anonymous Save Page Now (no secrets); it is rate-limited. If you hit limits,
# switch to the authenticated SPN2 API: add Internet Archive S3-style keys as the
# secrets IA_ACCESS_KEY / IA_SECRET_KEY and an
# -H "Authorization: LOW ${IA_ACCESS_KEY}:${IA_SECRET_KEY}"
# header to the curl call below.
#
# STATUS: written, NOT yet executed. Validate via Actions -> Run workflow and check
# the run log + the resulting web.archive.org snapshot URLs.

on:
  release:
    types: [published]
  workflow_dispatch:

permissions:
  contents: read

jobs:
  wayback:
    runs-on: ubuntu-latest
    continue-on-error: true # best-effort archival must never fail the release
    steps:
      - uses: actions/checkout@v4

      - name: Build URL list
        run: |
          repo="${{ github.repository }}" # owner/name
          owner="${repo%%/*}"
          name="${repo#*/}"
          pages="https://${owner}.github.io/${name}/"
          {
            echo "${pages}"
            if [ -f wayback-urls.txt ]; then
              grep -vE '^[[:space:]]*(#|$)' wayback-urls.txt || true
            fi
          } > /tmp/wayback-urls.txt
          echo "URLs to archive:"; cat /tmp/wayback-urls.txt

      - name: Submit to Wayback Machine (Save Page Now)
        run: |
          while IFS= read -r url; do
            [ -z "${url}" ] && continue
            echo "::group::Archiving ${url}"
            code=$(curl -sS -o /dev/null -w '%{http_code}' \
              -A "forrt-replication-template wayback workflow" \
              "https://web.archive.org/save/${url}") || code="000"
            echo "HTTP ${code} for ${url}"
            echo "Latest snapshot: https://web.archive.org/web/2/${url}"
            echo "::endgroup::"
            sleep 5 # be polite to the Internet Archive endpoint
          done < /tmp/wayback-urls.txt
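
Since the workflow is marked as not yet executed, the `Build URL list` step can be checked locally before the first real run. The following is a minimal sketch under stated assumptions: the function name `build_url_list` and its arguments are hypothetical and not part of the workflow, while the `grep` filter and the Pages URL shape are copied from the step above.

```shell
# Hypothetical helper mirroring the "Build URL list" step; not part of the workflow.
# Usage: build_url_list <owner/name> [url-file]
build_url_list() {
  repo="$1"
  file="${2:-wayback-urls.txt}"
  owner="${repo%%/*}"   # text before the first '/'
  name="${repo#*/}"     # text after the first '/'
  # The deployed Pages site is always listed first.
  echo "https://${owner}.github.io/${name}/"
  if [ -f "${file}" ]; then
    # Drop comment lines and blank lines; '|| true' keeps a file that
    # contains only comments from producing a failing exit status.
    grep -vE '^[[:space:]]*(#|$)' "${file}" || true
  fi
}
```

Running `build_url_list owner/name wayback-urls.txt` from the repo root should print the same list the workflow writes to `/tmp/wayback-urls.txt`, without contacting the Internet Archive.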
3 changes: 2 additions & 1 deletion CLAUDE.md
@@ -122,8 +122,9 @@ Exit: `nanopubs/drafts/05_outcome.md` is written with the conclusion sentence, t
- A GitHub release is cut with a Zenodo-facing description (no internal ops detail, no bot signatures — `docs/cicd-conventions.md` § Release notes are Zenodo descriptions).
- Zenodo mints a concept DOI; the value is written into `CITATION.cff` and `codemeta.json`.
- The Docker image is pushed to GHCR via `.github/workflows/docker.yml` and (optionally) archived on Zenodo.
- On release, two further archival workflows fire automatically (best-effort, never block the release): `swh-save.yml` requests **Software Heritage** Save Code Now so the released revision gets a permanent, forge-agnostic **SWHID**, and `wayback.yml` snapshots the deployed Jupyter Book site plus any URLs in `wayback-urls.txt` (Mode-B / paperless claim sources) in the **Internet Archive Wayback Machine**. See `docs/cicd-conventions.md` § Preservation.

Exit: the release page is live, the Zenodo record exists, and `nanopubs/PUBLISHED.md` lists the source + image DOIs.
Exit: the release page is live, the Zenodo record exists, and `nanopubs/PUBLISHED.md` lists the source + image DOIs. (Software Heritage + Wayback archival are best-effort and may complete asynchronously.)

### Phase 5 — FORRT nanopublication chain

15 changes: 15 additions & 0 deletions docs/chain-decision-tree.md
@@ -55,6 +55,21 @@ Don't treat PCC as a generic "research question" template that you can use whene

The mirror mistake — using Quote-with-comment to anchor a question-rooted chain — happens when there's a *related* paper but no specific sentence we're testing. In that case, cite the related paper at the AIDA step's *Supported by other publications* group, but anchor the chain in PICO/PCC.

## Paperless claims (Mode-B) — a claim stated in code / README / blog, not a paper

Not every testable claim lives in a paper. A tool's README, a design note, or a blog post can state a falsifiable claim about how a system behaves. These are first-class: *not everyone who advances knowledge writes a paper, and they shouldn't have to write one to make a claim that's testable and citable.* But a paperless source has no DOI, which affects how the chain can start:

- **The Quote-with-comment `Cited DOI` field is DOI-only** (it expects a bare `10.x/y`, not a URL). So you cannot quote a raw GitHub / blog / SWHID URL there.
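
A quick pre-flight check can catch a URL before it is pasted into the `Cited DOI` field. This is a minimal sketch: the function name `is_bare_doi` is hypothetical, and the pattern is an assumption about what a bare `10.x/y` looks like, not the form's own validator.

```shell
# Hypothetical pre-flight check; the pattern is an assumption, not the form's validator.
is_bare_doi() {
  # Accept "10.<registrant>/<suffix>" with no scheme, resolver prefix, or whitespace.
  printf '%s\n' "$1" | grep -Eq '^10\.[0-9]{4,9}/[^[:space:]]+$'
}
```

For example, `is_bare_doi "10.1234/example.5678"` succeeds (the DOI is a placeholder), while `is_bare_doi "https://doi.org/10.1234/example.5678"` and any GitHub, blog, or SWHID URL fail.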

Two clean ways to handle a paperless claim:

1. **Deposit the source to get a DOI, then go paper-rooted.** Deposit the code or the prose on Zenodo (→ DOI). Software Heritage (→ SWHID) for code and Wayback for prose add fixity but do not yield a DOI, so they complement the Zenodo deposit rather than replace it. Once the source has a DOI, use the normal Quote-with-comment start.
2. **Go question-rooted (PICO/PCC) and cite the source by URL at the CiTO step.** When there's no DOI to quote, frame the claim as an answered research question (PICO/PCC), then at the CiTO Citation step cite the artifact by URL — the CiTO *"DOI **or other URL**"* field accepts any resolvable URI. This is usually the right shape for a claim that isn't quoting a paper anyway.

**Anchor the source on the most durable artifact identifier available**, in order: **SWHID** (code, forge-agnostic) > **Zenodo DOI** > repo URL > Wayback-snapshotted page URL.
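
The ordering above can be sketched as a small ranking function. This is an illustration under stated assumptions: the names `identifier_rank` and `most_durable` are hypothetical, the prefix checks are rough heuristics, any DOI is treated as if it were a Zenodo DOI, and any non-Wayback URL is treated as a repo URL.

```shell
# Hypothetical classifier; the prefixes checked here are assumptions for illustration.
identifier_rank() {
  case "$1" in
    swh:1:*)                        echo 1 ;;  # SWHID
    10.*/*|https://doi.org/10.*)    echo 2 ;;  # DOI, bare or in resolver form
    https://web.archive.org/web/*)  echo 4 ;;  # Wayback snapshot URL
    http://*|https://*)             echo 3 ;;  # any other URL, treated as a repo URL
    *)                              echo 9 ;;  # unrecognised
  esac
}

# Print the most durable of the identifiers given as arguments (lowest rank wins).
most_durable() {
  best=""
  best_rank=10
  for id in "$@"; do
    rank="$(identifier_rank "$id")"
    if [ "$rank" -lt "$best_rank" ]; then
      best="$id"
      best_rank="$rank"
    fi
  done
  printf '%s\n' "$best"
}
```

Given a repo URL, a DOI, and a SWHID for the same artifact, `most_durable` prints the SWHID; given only a repo URL and a Wayback snapshot URL, it prints the repo URL.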

**Credit the original author by any resolvable URI** inside the nanopub (`prov:wasAttributedTo`): an ORCID if they have one, else an institutional profile or `https://github.com/<user>` — never force an ORCID on a non-academic author (that would re-impose the gatekeeping Mode-B exists to bypass). Note the *signer* of the nanopub stays the Science Live user's ORCID; only the *referenced* source and its author may be a non-DOI / non-ORCID URI.

## What happens after Phase 5

Once a single chain is published, you have three optional layers:
20 changes: 20 additions & 0 deletions docs/cicd-conventions.md
@@ -168,6 +168,26 @@ If a bad description is already on Zenodo: edit in place via `zenodo.org/records

---

## Preservation: Zenodo (release), Software Heritage (code), Wayback (web sources)

Three release-time archival paths, each with a distinct job. They are complementary, not redundant — capture all three where applicable.

| Workflow | Archives | Identifier | Coverage |
|---|---|---|---|
| `docker.yml` (Zenodo) | the release source tarball + (optionally) the Docker image | Zenodo concept DOI | GitHub-only auto-archival |
| `swh-save.yml` (Software Heritage) | the source tree at the released revision | **SWHID** (ISO/IEC standard) | forge-agnostic — GitHub, GitLab.com, self-hosted GitLab, any git |
| `wayback.yml` (Internet Archive) | the deployed Jupyter Book site + the URLs in `wayback-urls.txt` | timestamped `web.archive.org` snapshot | web pages (prose), not code |

Conventions:

- **Code → Software Heritage (SWHID).** SWH is the universal, forge-agnostic anchor: it covers GitLab and self-hosted forges that Zenodo's GitHub-only integration misses. `swh-save.yml` requests Save Code Now on each release. Zenodo gives the *citable release + metadata DOI*; SWH gives the *immutable code identity*. Capture both.
- **Prose / web sources → Wayback.** Blogs, design notes, README pages that state a claim are not code, so Software Heritage cannot archive them. List them in `wayback-urls.txt`; `wayback.yml` snapshots them (plus the deployed book site) on release. Pair with a Zenodo deposit if a citable DOI is also wanted.
- **Never anchor on a conda package.** Software Heritage's conda loader is not in production; built conda-forge / bioconda *artifacts* are not archived. The recipes (feedstock GitHub repos) and upstream source repos *are* archived (as git). So anchor reproducibility on **pinned `pixi.toml` / `pixi.lock` + the source repo's SWHID + the container image on Zenodo** — not the conda artifact.

All three workflows trigger only on `release` (plus manual `workflow_dispatch`), so they never run on an uninitialised template or on routine pushes.

---

## Long-running experiments — don't poll

If an analysis takes more than ~5 minutes:
16 changes: 16 additions & 0 deletions wayback-urls.txt
@@ -0,0 +1,16 @@
# wayback-urls.txt — external web sources to snapshot in the Internet Archive
# Wayback Machine on each release (see .github/workflows/wayback.yml).
#
# One URL per line. Lines starting with '#' and blank lines are ignored.
#
# Use this for Mode-B / paperless claim sources that Software Heritage cannot
# archive because they are prose, not code: blog posts, design notes, README
# pages, or documentation that states a claim your replication tests. A Wayback
# snapshot gives the source immutable fixity at the moment the claim was made,
# which the nanopub can cite as provenance.
#
# The deployed Jupyter Book site is archived automatically — you do NOT need to
# list it here.
#
# Example:
# https://example.org/blog/the-claim-we-are-testing
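
Because the workflow passes every non-comment line straight to `curl`, a malformed entry only surfaces in the release run log. A local lint can catch it earlier. This is a minimal sketch: the function name `lint_wayback_urls` is hypothetical, and the rule that every entry must be an `http://` or `https://` URL with no surrounding whitespace is an assumption, not something the workflow enforces.

```shell
# Hypothetical lint for wayback-urls.txt; the http(s)-only rule is an assumption.
# Returns 0 if the file is clean, 1 if offending lines were printed, 2 on read errors.
lint_wayback_urls() {
  file="${1:-wayback-urls.txt}"
  status=0
  # Select lines that are neither blank, nor comments, nor a bare http(s) URL.
  bad="$(grep -nvE '^[[:space:]]*(#.*)?$|^https?://[^[:space:]]+$' "$file")" || status=$?
  case "$status" in
    0) printf '%s\n' "$bad"; return 1 ;;  # grep selected at least one offending line
    1) return 0 ;;                        # nothing selected: the file is clean
    *) return 2 ;;                        # grep failed, e.g. the file is missing
  esac
}
```

Offending lines are printed as `line-number:content`, so an entry such as `example.org/no-scheme` on line 4 is reported as `4:example.org/no-scheme`. Lines with trailing whitespace or Windows line endings are flagged as well, since the workflow would submit them unchanged.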