git walker/drawer - draw relationship between clones of git/git-annex repos #1

@yarikoptic

It is often quite tricky, especially for git newbies, to visualize the "space" of git clones and remotes, in particular whenever remotes have different names across the clones. Here is, e.g., a "hand-drawn" representation of the remotes for the spacetop dataset of @jungheejung, which we have done in mermaid (original comment):

flowchart TD
    ORIGIN[1076_spacetop]-->|heejung\nremote=rolando|DISCOVERY
    DISCOVERY-->|heejung\nremote=rolando-exchange| EXCHANGE
    subgraph rolando
    EXCHANGE[1076_spacetop.git]-->|yarik| ORIGIN
    end
    subgraph discovery
    DISCOVERY[.../dartmouth]
    DISCOVERY-FMRIPREP[.../fmriprep]
    end
    subgraph laptop
    ORIGIN-->|heejung\nremote=origin|laptop-clone-name
    laptop-clone-name-->|heejung\nremote=rolando-exchange| EXCHANGE
    laptop-clone-name-->|heejung\nadds events| laptop-clone-name
    end
    subgraph JHU
    DISCOVERY-->|patrick| jhu-clone
    jhu-clone-->|used-for| JHU-FMRIPREP[.../fmriprep]
    JHU-FMRIPREP-->|patrick| DISCOVERY-FMRIPREP
    end

which helped to visualize which clones are out there and what their relationships are (which computers they live on, names of remotes).

In addition we could annotate:

  • whether it carries git-annex (all did there)
  • whether it is a bare git repo (like the .git on rolando above)
  • in addition to git remotes, there are also git-annex special remotes worth visualizing, but we did not have them here
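
The first two annotations could be derived from plain git configuration. Below is a minimal sketch, assuming we already have the output of `git config --list` for a clone as text. `CloneAnnotations` and `annotate_from_config` are hypothetical names made up for illustration; only the config keys `core.bare`, `annex.uuid` and `remote.<name>.url` come from git and git-annex. Special remotes are not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CloneAnnotations:
    """Per-clone facts worth showing as badges (hypothetical model)."""
    is_bare: bool = False
    annex_uuid: Optional[str] = None
    remotes: Dict[str, str] = field(default_factory=dict)  # remote name -> URL

    @property
    def has_annex(self) -> bool:
        # An initialized git-annex repository records annex.uuid in its config.
        return self.annex_uuid is not None


def annotate_from_config(config_text: str) -> CloneAnnotations:
    """Derive annotations from `git config --list` style `key=value` lines."""
    info = CloneAnnotations()
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        value = value.strip()
        if key == "core.bare":
            info.is_bare = value.lower() == "true"
        elif key == "annex.uuid":
            info.annex_uuid = value
        elif key.startswith("remote.") and key.endswith(".url"):
            # Remote names may themselves contain dots, so slice, do not split.
            name = key[len("remote."):-len(".url")]
            info.remotes[name] = value
    return info
```

For example, `annotate_from_config("core.bare=true\nannex.uuid=<some-uuid>\n")` would report a bare repository that carries git-annex and has no remotes.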

In addition to the above, there is now already https://github.com/spatialtopology/ds005256/, and soon there will be one among OpenNeuroDatasets, along with an S3 special remote in export mode.

Presenting them all would make a super nice visualization to orient anyone in the "space" of available clones.

In the scope of datalad-registry with @candleindark we could then potentially collect/export such visualizations for clones we identify on https://registry.datalad.org/ .

So we need

  • walker: given a collection of git repositories, a utility to "walk" those and their remotes to collect information about them
    • would align with the effort @mih has on formalizing datalad datasets (https://concepts.datalad.org/) but without all the gory details
    • for hosting portals like github.com, gitlab.com etc. could have a plugin system to discover
      • forks/clones, with option(s) to enable their addition too: by default might just want to add "personal" forks of a specific user.
      • upstream repository (might not even be present locally as a remote)
    • might need logic to harmonize URLs, since the same repo could be reached via different protocols (https, ssh, git) and have an optional suffix (e.g. .git makes no difference on github but might matter for local ones, where it is a separate folder). Can use https://github.com/nephila/giturlparse for that
      • if git-annex is found on the system, record its version. For each repo collect git annex info --json output with all the gory details, and explicitly that clone's annex.uuid from config
      • for git-annex ones, we could (somewhat) rely on their annex.uuid, but I would not be surprised if someone has made a hard copy, thus violating a core principle of git-annex, and we better visualize that error
      • if rad is found on the system, get the repo's rad ID via rad . (if any is returned), and the node's ID via rad self
    • walker should be able to continue walking if a host was configured to forward ssh identity or has some other mechanism to auth into remotes from that target system. It should be smart enough not to revisit a host already reached from another node in the network.
  • renderer: a utility to render that information as e.g. a mermaid flowchart, including details specific to that rendering. Could optionally show
    • remote names
    • annex UUIDs
    • datalad dataset UUIDs (typically would be the same among clones).
    • highlight errors and warnings:
      • multiple instances with the same annex uuid (especially if anything is different among them, thus not just two different mounts of the same thing)
      • dead known remotes (maybe even for the same description/path)
      • ... ?
    • distinguish in rendering (maybe with badges etc):
      • tree vs bare git
      • with vs without git-annex
      • special remote:
        • encrypted or not git-annex remote
        • keystore vs exporttree
        • importtree
  • collectors: formalize (as a schema/model) how we collect pertinent metadata and allow for plugins to extend it. Additional properties might be "sensed", e.g. it might make total sense to annotate connections with
    • distance in ping hops or alike between two hosts
    • bandwidth in bytes/sec
    • {shared,unique}-content-size - amount of annexed data shared or unique between any two particular connected clones (valuable for balancing things out etc)
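
To make the URL harmonization point from the walker item more concrete, here is a rough stdlib-only sketch. It does not use giturlparse; `harmonize_git_url` and `PORTAL_HOSTS` are hypothetical names. It assumes that dropping the protocol and user, and stripping `.git` only on known hosting portals, is enough to get a comparable key. It also strips leading slashes, so it does not distinguish absolute from home-relative paths on ssh hosts, which a real implementation would likely need to handle.

```python
import re
from urllib.parse import urlsplit

# Hosts where a trailing ".git" makes no difference (assumption, extend as needed).
PORTAL_HOSTS = frozenset({"github.com", "gitlab.com"})

# scp-like syntax: [user@]host:path (no scheme).
_SCP_LIKE = re.compile(r"^(?:(?P<user>[^@/\s]+)@)?(?P<host>[^:/\s]+):(?P<path>.+)$")


def harmonize_git_url(url: str, portal_hosts=PORTAL_HOSTS) -> str:
    """Reduce a git URL to a protocol-independent 'host/path' key."""
    url = url.strip()
    if "://" in url:
        parts = urlsplit(url)
        host, path = (parts.hostname or "").lower(), parts.path
    else:
        match = _SCP_LIKE.match(url)
        if match is None:
            # Plain local path: keep it, since ".git" may be a separate folder.
            return url.rstrip("/") or "/"
        host, path = match.group("host").lower(), match.group("path")
    path = path.strip("/")
    if host in portal_hosts and path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host}/{path}"
```

With this, `git@github.com:spatialtopology/ds005256.git` and `https://github.com/spatialtopology/ds005256/` both map to `github.com/spatialtopology/ds005256`, while the `.git` suffix on a non-portal ssh host is kept.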

I think the metadata to collect and render from could already be the metalad_core (https://github.com/datalad/datalad-metalad/blob/master/datalad_metalad/extractors/core.py#L36) metadata records. The problem is that we need the tool to ssh into remote locations (remotes) to gather their information, and AFAIK we do not have a ready-to-use abstraction for remote instances.
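
One possible shape for such an abstraction is to keep the walking logic independent of how a location is reached, and pass in a collector callable. The sketch below is only an illustration under that assumption: `walk`, `Location` and `Collector` are hypothetical names, and a real collector would presumably run git (locally or over ssh) and return much richer records.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A location is (host, path).
Location = Tuple[str, str]
# Hypothetical collector: given a location, return the remotes found there
# as {remote_name: location}; raise OSError if the location is unreachable.
Collector = Callable[[Location], Dict[str, Location]]


def walk(starts: Iterable[Location], collect: Collector):
    """Breadth-first walk over clones, visiting each location at most once."""
    seen: Set[Location] = set()
    edges: List[Tuple[Location, str, Location]] = []
    queue = deque(starts)
    while queue:
        loc = queue.popleft()
        if loc in seen:
            continue  # already reached from another node in the network
        seen.add(loc)
        try:
            remotes = collect(loc)
        except OSError:
            # Unreachable host or missing auth: keep the node, stop descending.
            continue
        for name, target in remotes.items():
            edges.append((loc, name, target))
            if target not in seen:
                queue.append(target)
    return seen, edges
```

The returned `edges` carry the remote names, so a renderer could label arrows with them, and the `seen` set is what keeps the walker from revisiting a clone that is a remote of several others.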

Possible (pieces of) solutions:

here is the graphviz drawing
digraph map {
	subgraph "cluster_openneuro.org" {
		label="openneuro"
		style="filled"
		fillcolor="lightblue"
		"https://openneuro.org/git/0/ds005256" [ label="openneuro-git" ] [ style="filled" ] [ fillcolor="white" ]
	}

	subgraph "cluster_typhon.dartmouth.edu" {
		label="typhon"
		style="filled"
		fillcolor="lightblue"
		"97b6f5e4-4642-43a7-988a-c483caf553c5" [ label="yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"97b6f5e4-4642-43a7-988a-c483caf553c5" -> "https://github.com/spatialtopology/ds005256" [ label="gh-spatialtopology" ]
	"97b6f5e4-4642-43a7-988a-c483caf553c5" -> "https://openneuro.org/git/0/ds005256"
	"97b6f5e4-4642-43a7-988a-c483caf553c5" -> "590b4fd0-0142-4e9d-8964-d1158c242c6a" [ label="origin" ]
	"97b6f5e4-4642-43a7-988a-c483caf553c5" -> "40795e62-527c-4d26-ae8c-af42a6e2da5a" [ label="rolando-exchange" ]

	subgraph "cluster_typhon.dartmouth.edu" {
		label="typhon"
		style="filled"
		fillcolor="lightblue"
		"01ec5571-2578-417a-988d-4c7339930635" [ label="yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop.git" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"01ec5571-2578-417a-988d-4c7339930635" -> "ssh://typhon.dartmouth.edu/mnt/DATA/data/yoh/1076_spacetop"

	subgraph "cluster_smaug.datalad.org" {
		label="smaug"
		style="filled"
		fillcolor="lightblue"
		"fa9e758a-01a1-4a55-abee-d70128cb1144" [ label="yoh@smaug:/mnt/btrfs/datasets/incoming/yoh/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"fa9e758a-01a1-4a55-abee-d70128cb1144" -> "590b4fd0-0142-4e9d-8964-d1158c242c6a" [ label="origin" ]
	"fa9e758a-01a1-4a55-abee-d70128cb1144" -> "40795e62-527c-4d26-ae8c-af42a6e2da5a" [ label="rolando-exchange" ]

	subgraph "cluster_rolando.cns.dartmouth.edu" {
		label="rolando"
		style="filled"
		fillcolor="lightblue"
		"40795e62-527c-4d26-ae8c-af42a6e2da5a" [ label="bids@rolando:/inbox/BIDS/Wager/Wager/1076_spacetop.git" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"40795e62-527c-4d26-ae8c-af42a6e2da5a" -> "590b4fd0-0142-4e9d-8964-d1158c242c6a" [ label="origin" ]
	"40795e62-527c-4d26-ae8c-af42a6e2da5a" -> "97b6f5e4-4642-43a7-988a-c483caf553c5" [ label="typhon" ]

	subgraph "cluster_rolando.cns.dartmouth.edu" {
		label="rolando"
		style="filled"
		fillcolor="lightblue"
		"590b4fd0-0142-4e9d-8964-d1158c242c6a" [ label="bids@rolando:/inbox/BIDS/Wager/Wager/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"590b4fd0-0142-4e9d-8964-d1158c242c6a" -> "40795e62-527c-4d26-ae8c-af42a6e2da5a" [ label="spacetop-rolando-exchange" ]

	subgraph "cluster_github.com" {
		label="github"
		style="filled"
		fillcolor="lightblue"
		"https://github.com/spatialtopology/ds005256" [ label="gh" ] [ style="filled" ] [ fillcolor="white" ]
	}

	subgraph "cluster_github.com" {
		label="github"
		style="filled"
		fillcolor="lightblue"
		"https://github.com/yarikoptic/ds005256" [ label="gh-yarikoptic" ] [ style="filled" ] [ fillcolor="white" ]
	}

	subgraph "cluster_github.com" {
		label="github"
		style="filled"
		fillcolor="lightblue"
		"https://github.com/OpenNeuroDatasets/ds005256" [ label="gh-openneuro" ] [ style="filled" ] [ fillcolor="white" ]
	}

	subgraph "cluster_discovery.dartmouth.edu" {
		label="discovery"
		style="filled"
		fillcolor="lightblue"
		"ssh://d31548v@discovery.dartmouth.edu/dartfs-hpc/rc/lab/C/CANlab/labdata/data/spacetop/dartmouth" [ label="discovery" ] [ style="filled" ] [ fillcolor="white" ]
	}

	subgraph "cluster_localhost" {
		label="localhost"
		style="filled"
		fillcolor="lightblue"
		"b14a3911-d089-44da-8327-6d2cbbd05871" [ label="yoh@lena:~/datasets/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	}
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "ssh://d31548v@discovery.dartmouth.edu/dartfs-hpc/rc/lab/C/CANlab/labdata/data/spacetop/dartmouth"
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "https://github.com/OpenNeuroDatasets/ds005256"
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "https://github.com/yarikoptic/ds005256"
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "https://github.com/spatialtopology/ds005256"
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "590b4fd0-0142-4e9d-8964-d1158c242c6a" [ label="origin" ]
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "40795e62-527c-4d26-ae8c-af42a6e2da5a" [ label="rolando-exchange" ]
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "fa9e758a-01a1-4a55-abee-d70128cb1144" [ label="smaug" ]
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "01ec5571-2578-417a-988d-4c7339930635" [ label="typhon-exchange" ]
	"b14a3911-d089-44da-8327-6d2cbbd05871" -> "ssh://typhon.dartmouth.edu/mnt/DATA/data/yoh/1076_spacetop"

	"5977e022-46ee-4c0c-a6ee-5a0e2e2ea442" [ label="yoh@typhon:/tmp/ds005256" ] [ style="filled" ] [ fillcolor="white" ]
	"5ded375b-76eb-4a6c-899d-bef65f7b80b2" [ label="openneuro" ] [ style="filled" ] [ fillcolor="white" ]
	"620673e7-dbcc-450c-9622-5394ea652632" [ label="f0042x1@discovery7.hpcc.dartmouth.edu:/dartfs-hpc/rc/lab/C/CANlab/labdata/data/spacetop/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	"73e9e7ca-f5d1-4b30-9f04-d343d37a456b" [ label="yoh@lena:~/proj/dbic-datasets/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	"8028ca7a-270c-4be7-b029-cb26d1770d91" [ label="f0042x1@discovery7.hpcc.dartmouth.edu:/dartfs-hpc/rc/lab/C/CANlab/labdata/data/spacetop/dartmouth" ] [ style="filled" ] [ fillcolor="white" ]
	"930e5e3f-5d86-4e07-be36-fef0bc528b77" [ label="OpenNeuro" ] [ style="filled" ] [ fillcolor="white" ]
	"9441b7fd-3c95-4977-9b4a-7a29b4e598c5" [ label="h@h-MacBook-Pro.local:~/Documents/projects_local/1076_spacetop" ] [ style="filled" ] [ fillcolor="white" ]
	"e5f1e780-543c-421e-ad0b-7a270c1ad09b" [ label="s3-PUBLIC" ] [ style="filled" ] [ fillcolor="white" ]
}
rendered svg

Image

claude converted mermaid
graph TB
    subgraph openneuro["openneuro.org"]
        on_git["openneuro-git<br/>https://openneuro.org/git/0/ds005256"]
    end

    subgraph typhon["typhon.dartmouth.edu"]
        typhon_data["yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop<br/>97b6f5e4-4642-43a7-988a-c483caf553c5"]
        typhon_git["yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop.git<br/>01ec5571-2578-417a-988d-4c7339930635"]
    end

    subgraph smaug["smaug.datalad.org"]
        smaug_data["yoh@smaug:/mnt/btrfs/datasets/incoming/yoh/1076_spacetop<br/>fa9e758a-01a1-4a55-abee-d70128cb1144"]
    end

    subgraph rolando["rolando.cns.dartmouth.edu"]
        rolando_exchange["bids@rolando:/inbox/BIDS/Wager/Wager/1076_spacetop.git<br/>40795e62-527c-4d26-ae8c-af42a6e2da5a"]
        rolando_origin["bids@rolando:/inbox/BIDS/Wager/Wager/1076_spacetop<br/>590b4fd0-0142-4e9d-8964-d1158c242c6a"]
    end

    subgraph github["github.com"]
        gh_spatial["gh<br/>https://github.com/spatialtopology/ds005256"]
        gh_yarik["gh-yarikoptic<br/>https://github.com/yarikoptic/ds005256"]
        gh_openneuro["gh-openneuro<br/>https://github.com/OpenNeuroDatasets/ds005256"]
    end

    subgraph discovery["discovery.dartmouth.edu"]
        discovery_ssh["discovery<br/>ssh://d31548v@discovery.dartmouth.edu/dartfs-hpc/rc/lab/C/CANlab/labdata/data/spacetop/dartmouth"]
    end

    subgraph localhost["localhost"]
        lena["yoh@lena:~/datasets/1076_spacetop<br/>b14a3911-d089-44da-8327-6d2cbbd05871"]
    end

    %% Connections from typhon_data
    typhon_data -->|gh-spatialtopology| gh_spatial
    typhon_data --> on_git
    typhon_data -->|origin| rolando_origin
    typhon_data -->|rolando-exchange| rolando_exchange

    %% Connections from typhon_git
    typhon_git --> typhon_data

    %% Connections from smaug_data
    smaug_data -->|origin| rolando_origin
    smaug_data -->|rolando-exchange| rolando_exchange

    %% Connections from rolando_exchange
    rolando_exchange -->|origin| rolando_origin
    rolando_exchange -->|typhon| typhon_data

    %% Connections from rolando_origin
    rolando_origin -->|spacetop-rolando-exchange| rolando_exchange

    %% Connections from lena
    lena --> discovery_ssh
    lena --> gh_openneuro
    lena --> gh_yarik
    lena --> gh_spatial
    lena -->|origin| rolando_origin
    lena -->|rolando-exchange| rolando_exchange
    lena -->|smaug| smaug_data
    lena -->|typhon-exchange| typhon_git
    lena --> typhon_data

    style on_git fill:#fff
    style typhon_data fill:#fff
    style typhon_git fill:#fff
    style smaug_data fill:#fff
    style rolando_exchange fill:#fff
    style rolando_origin fill:#fff
    style gh_spatial fill:#fff
    style gh_yarik fill:#fff
    style gh_openneuro fill:#fff
    style discovery_ssh fill:#fff
    style lena fill:#fff
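
A mermaid diagram of this shape could also be produced directly by the renderer rather than converted from graphviz. Here is a minimal sketch, assuming nodes are given as key -> (host, label) and edges as (source key, remote name, target key). `render_mermaid` and the input layout are hypothetical, and labels are assumed not to contain double quotes.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def _ident(prefix: str, key: str) -> str:
    """Mermaid ids should be simple tokens, so squash everything else to '_'."""
    return prefix + re.sub(r"[^0-9A-Za-z]", "_", key)


def render_mermaid(nodes: Dict[str, Tuple[str, str]],
                   edges: Iterable[Tuple[str, str, str]]) -> str:
    """Render clones grouped by host, with remote names as edge labels."""
    by_host = defaultdict(list)
    for key, (host, label) in nodes.items():
        by_host[host].append((key, label))
    lines = ["graph TB"]
    for host, members in by_host.items():
        lines.append(f'    subgraph {_ident("h_", host)}["{host}"]')
        for key, label in members:
            lines.append(f'        {_ident("n_", key)}["{label}<br/>{key}"]')
        lines.append("    end")
    for src, name, dst in edges:
        # Unnamed connections get a plain arrow, named ones a labelled arrow.
        arrow = f"-->|{name}|" if name else "-->"
        lines.append(f'    {_ident("n_", src)} {arrow} {_ident("n_", dst)}')
    return "\n".join(lines)
```

Keying nodes by annex UUID (or by harmonized URL where there is no UUID) would let the same clone be recognized even when different clones know it under different remote names.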
