Merged
28 commits
f505bc2
benchmark scripts for incremental transformation
sh5i Mar 23, 2026
a898255
MVStore implementation for cache
sh5i Mar 23, 2026
5b2cb1e
Disable autocommit, use internal serializer
sh5i Mar 23, 2026
b431d05
switch the default cache backend
sh5i Mar 23, 2026
6e4d732
ensure to close even for exceptional cases
sh5i Mar 23, 2026
3c488ec
Add tests for CacheProvider
sh5i Mar 23, 2026
935ce4e
fix: make SingleEntry fully consistent with equals() to ensure the as…
sh5i Mar 23, 2026
6eab52d
Make RefEntry comparable
sh5i Mar 23, 2026
c868b3c
New provider: guava cache
sh5i Mar 23, 2026
1580b11
refactor: move class (extract package)
sh5i Mar 23, 2026
6530090
remove dependencies to SQLite
sh5i Mar 23, 2026
083ff51
Use git-notes for the source of commit mapping at the previous stage
sh5i Mar 24, 2026
f26ee3b
Introduce CheckStyle
sh5i Mar 24, 2026
6527099
Remove unused imports
sh5i Mar 24, 2026
5333233
Use block
sh5i Mar 24, 2026
6621ac0
Store both prev and orig notes
sh5i Mar 24, 2026
a040537
memory profile
sh5i Mar 24, 2026
6caad9b
use Guava cache for entry mapping. --mapping-mem option to specify it…
sh5i Mar 24, 2026
83c7f55
Remove fallback
sh5i Mar 24, 2026
b2ddefa
Remove commit/refentry mappings from cache
sh5i Mar 24, 2026
0d1a198
Unify tree/blob caches
sh5i Mar 24, 2026
24b4709
Remove --cache-backend option
sh5i Mar 24, 2026
2446125
Remove inTransaction block
sh5i Mar 24, 2026
350df2a
Simplify classes
sh5i Mar 24, 2026
696db5c
Use memory-budget constrained persistent cache
sh5i Mar 24, 2026
0ab2c72
refactor: rename
sh5i Mar 24, 2026
3f63adb
Update README
sh5i Mar 24, 2026
bd440a4
Report cache hit rate.
sh5i Mar 24, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@
/gradle.properties
/.settings/
/.idea/
__*

*~
.DS_Store
113 changes: 84 additions & 29 deletions README.md
@@ -34,11 +34,7 @@ $ git stein [options...] # When subcommand available

## Recipes

### Chaining commands

Multiple commands can be listed on the command line.
They are applied sequentially; intermediate repositories are created under `.git/.git-stein.N` in the target directory and cleaned up automatically.
As an optimization, consecutive blob translators are composed into a single pass.
### Splitting and converting to cregit

Split Java files into method-level modules, then convert each to cregit format:
```
@@ -72,18 +68,6 @@ $ git stein path/to/repo -o path/to/out \
@convert --endpoint=http://localhost:8080/convert --pattern='*.java'
```

### Tracking original commit IDs

When git-stein rewrites a repository, it records the original commit ID in Git notes (enabled by default).
`@note-commit` reads these notes and prepends the original commit ID to each commit message.

A typical workflow is to first transform, then apply `@note-commit`:
```
$ git stein path/to/repo -o path/to/out @historage-jdt @note-commit
```
After this, each commit message in `step2` starts with the original commit ID from `repo`.
This works even after multiple transformations — the notes trace back to the original.

### Writing a custom blob translator

Implement the `BlobTranslator` interface to define your own transformation.
@@ -119,12 +103,13 @@ public class MyTranslator implements BlobTranslator {
- `-j`, `--jobs=<nthreads>`: Rewrites trees in parallel using `<nthreads>` threads. If the number of threads is omitted (just `-j` is given), _total number of processors - 1_ is used.
- `-n`, `--dry-run`: Do not actually modify the target repository.
- `--stream-size-limit=<num>{,K,M,G}`: increase the stream size limit.
- `--no-notes`: Stop noting the source commit ID to the commits in the target repository.
- `--no-notes`: Stop noting the source commit ID to the commits in the target repository (see [Notes](#notes)).
- `--no-pack`: Stop packing objects after transformation finished.
- `--alternates`: Share source objects via Git alternates to skip writing unchanged objects, which speeds up transformations where many objects are unchanged. The target repository will depend on the source's object store until repacked.
- `--no-composite`: Stop composing multiple blob translators.
- `--no-composite`: Stop composing multiple blob translators (see [Chaining Commands](#chaining-commands)).
- `--extra-attributes`: Allow rewriting the encoding and signature fields in commits.
- `--cache=<level>,...`: Specify the object types for caching (`commit`, `blob`, `tree`. See [Incremental transformation](#incremental-transformation) for the details). Default: none. `commit` is recommended.
- `--cache`: Enable persistent entry caching (see [Caching](#caching)).
- `--mapping-mem=<num>{,K,M,G}`: Max memory for entry mapping cache. Default: 25% of max heap (see [Caching](#caching)).
- `--cmdpath=<path>:...`: Add packages to search for commands.
- `--log=<level>`: Specify log level (default: `INFO`).
- `-q`, `--quiet`: Quiet mode (same as `--log=ERROR`).
@@ -143,19 +128,10 @@ The git-stein supports three rewriting modes.
- _duplicate_ mode (`<source> -o <target> -d`): given a source repository and a path for the target repository, copying the source repository into the given path and applying overwrite mode to the target repository.


## Incremental Transformation
In case the source repository to be transformed has been evolving, git-stein can transform only newly added objects.
With the option `--cache=<level>`, an SQLite3 cache file "cache.db" will be stored in the `.git` directory of the destination repository.
This file records the correspondence between objects before and after transformation, according to the specified option.
Correspondences between commits (`--cache=commit`), between trees (`--cache=tree`), and between files (`--cache=blob`) are stored.
This cache can save the re-transformation of remaining objects during the second and subsequent transformation trials.


## Bundle Apps

### Blob Translators
_Blob translators_ provide blob-to-blob(s) translations.
Multiple blob translators can be composed and applied in a single pass.

#### @historage
Generates a [Historage](https://github.com/hideakihata/git2historage)-like repository using [Universal Ctags](https://ctags.io/).
@@ -285,6 +261,85 @@ A no-op rewriter that copies all objects without transformation.
Useful for verifying that the rewriting pipeline preserves repository content.


## Chaining Commands

Multiple commands can be listed on a single command line.
They are applied sequentially as separate transformation steps.
For example, with three commands `@A @B @C`:
```
source → target/.git/.git-stein.1 → target/.git/.git-stein.2 → target
(@A) (@B) (@C)
```
Intermediate repositories (`.git-stein.N`) are bare repositories created under the target's `.git` directory.

As an optimization, consecutive blob translators are composed into a single pass rather than creating intermediate repositories for each one.
This behavior can be disabled with `--no-composite`.
For example, the following runs `@historage-jdt` and `@cregit` as a single composed blob translator, then `@note-commit` as a separate commit translator step:
```
$ git stein path/to/repo -o path/to/out \
@historage-jdt --no-original --no-classes \
@cregit --pattern='*.cjava' --ignore-case \
@note-commit
```
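
The composition of consecutive blob translators can be pictured as ordinary function composition. The following is a minimal sketch, not git-stein's actual implementation: `ComposeSketch` and its nested `Translator` interface are hypothetical names, and a blob is reduced to its text content.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class ComposeSketch {
    // Hypothetical stand-in for a blob translator: one input blob (here just
    // its text) becomes zero or more output blobs. Not git-stein's real API.
    interface Translator extends Function<String, List<String>> {}

    // Compose translators so that each blob flows through all of them in one
    // pass, instead of writing an intermediate repository per translator.
    static Translator compose(List<Translator> translators) {
        return blob -> {
            List<String> current = List.of(blob);
            for (Translator t : translators) {
                current = current.stream()
                        .flatMap(b -> t.apply(b).stream())
                        .collect(Collectors.toList());
            }
            return current;
        };
    }

    public static void main(String[] args) {
        Translator splitLines = blob -> List.of(blob.split("\n"));
        Translator upperCase = blob -> List.of(blob.toUpperCase());
        Translator composed = compose(List.of(splitLines, upperCase));
        System.out.println(composed.apply("foo\nbar")); // prints [FOO, BAR]
    }
}
```

In this picture, `--no-composite` corresponds to running each translator as its own step with an intermediate repository in between.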


## Notes

git-stein records the original commit ID as a git note on each target commit (enabled by default).
Each note stores the source commit ID as a 40-character hex string.
This provides the standard way to trace a target commit back to its source, and is visible in `git log` without any extra options (via `refs/notes/commits`).
Notes are also used for [Incremental Transformation](#incremental-transformation) to skip already-processed commits on subsequent runs.

`@note-commit` reads the note on each commit and embeds the original commit ID into the commit message.
Place it at the end of the command list:
```
$ git stein path/to/repo -o path/to/out @historage-jdt @note-commit
```

git-stein uses three notes refs:
`refs/notes/git-stein-prev` stores the immediate source commit ID (i.e., the commit in the input repository of this transformation step),
`refs/notes/git-stein-orig` stores the original source commit ID (traces back through chained transformations to the very first source),
and `refs/notes/commits` points to the same object as `git-stein-orig` (visible in `git log` by default).
For a single transformation, all three refs point to the same notes object.
In a chained transformation (see [Chaining Commands](#chaining-commands)), `git-stein-prev` and `git-stein-orig` may differ.
For example, in `.git-stein.2`, `git-stein-prev` points to the commit in `.git-stein.1`, while `git-stein-orig` points to the commit in the original source.

If `--no-notes` is used, no notes are written, and incremental transformation will not be available on subsequent runs.
The target will be fully rewritten each time.
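
The relationship between the two refs in a chain can be sketched as follows. This is an illustration under assumptions, not the actual implementation: `NotesSketch` and `origFor` are hypothetical names, notes are modeled as plain maps from a commit to its noted source commit, and the short IDs are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

class NotesSketch {
    // Assumed rule: the "orig" note of a new commit is the input commit's own
    // orig note if it has one (the input was produced by an earlier step),
    // otherwise the input commit itself. The "prev" note is always the input commit.
    static String origFor(String inputCommit, Map<String, String> inputOrigNotes) {
        return inputOrigNotes.getOrDefault(inputCommit, inputCommit);
    }

    public static void main(String[] args) {
        // Step 1: source -> .git-stein.1 (the source carries no notes).
        Map<String, String> sourceNotes = new HashMap<>();
        Map<String, String> step1Orig = new HashMap<>();
        step1Orig.put("b1", origFor("a1", sourceNotes)); // b1 was produced from a1

        // Step 2: .git-stein.1 -> .git-stein.2
        String prev = "b1";
        String orig = origFor(prev, step1Orig);
        System.out.println("prev=" + prev + " orig=" + orig); // prints prev=b1 orig=a1
    }
}
```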


## Incremental Transformation

git-stein supports incremental transformation:
when the target repository already contains results from a previous run, only new commits are processed.

On subsequent runs, git-stein reads the notes from the target repository to reconstruct the commit mapping and skips already-processed commits.

New commits still need to be transformed.
To speed up the transformation of these new commits by reusing previously computed entry mappings, enable `--cache` (see [Persistent cache](#persistent-cache-cache)).
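
The skip logic can be sketched as follows. This is a simplified illustration, not the actual implementation: `IncrementalSketch` and its methods are hypothetical names, notes are modeled as a map from a target commit to its noted source commit, and the short IDs are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IncrementalSketch {
    // Invert the notes (target commit -> source commit) into the
    // source -> target commit mapping of the previous run.
    static Map<String, String> mappingFromNotes(Map<String, String> notes) {
        Map<String, String> mapping = new HashMap<>();
        notes.forEach((target, source) -> mapping.put(source, target));
        return mapping;
    }

    // Keep only the source commits that have no target commit yet.
    static List<String> pending(List<String> sourceCommits, Map<String, String> mapping) {
        List<String> result = new ArrayList<>();
        for (String commit : sourceCommits) {
            if (!mapping.containsKey(commit)) {
                result.add(commit);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> notes = new HashMap<>();
        notes.put("b1", "a1");
        notes.put("b2", "a2");
        Map<String, String> mapping = mappingFromNotes(notes);
        System.out.println(pending(List.of("a1", "a2", "a3"), mapping)); // prints [a3]
    }
}
```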


## Caching

git-stein uses two levels of caching to avoid redundant work:
an in-memory cache for the current run and an optional persistent cache for repeated runs.

### In-memory cache

During a single run, git-stein keeps an in-memory entry mapping (source entry → transformed entry) backed by a Guava Cache with LRU eviction.
This avoids re-transforming identical entries within the same execution.
The memory budget is controlled by `--mapping-mem` (default: 25% of max heap).
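
The idea of a bounded, least-recently-used mapping can be sketched with the standard library alone. This sketch deliberately swaps in `LinkedHashMap` for Guava Cache and bounds the map by entry count rather than by bytes, so it only approximates the behavior described above; `LruSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruSketch(int maxEntries) {
        // accessOrder=true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    // The default budget mentioned above: 25% of the JVM's max heap, in bytes.
    static long defaultBudgetBytes() {
        return Runtime.getRuntime().maxMemory() / 4;
    }

    public static void main(String[] args) {
        LruSketch<String, String> cache = new LruSketch<>(2);
        cache.put("a", "A");
        cache.put("b", "B");
        cache.get("a");      // touch "a", so "b" becomes the least recently used
        cache.put("c", "C"); // exceeds the bound and evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```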

### Persistent cache (`--cache`)

When `--cache` is enabled, the entry mapping is stored in an MVStore (H2) file (`cache.mv.db`) in the target repository's `.git` directory.
This persists entry mappings across runs, so entries that were already transformed in a previous run can be reused without re-computation.
The `--mapping-mem` option also controls the MVStore page cache and write buffer sizes.

`--cache` and the in-memory cache are mutually exclusive:
when `--cache` is enabled, MVStore replaces the in-memory Guava Cache entirely.


## Publications
The following article includes the details of the incremental transformation (and a brief introduction to git-stein).
Those who have used git-stein in their academic work are encouraged to cite the following:
23 changes: 20 additions & 3 deletions build.gradle
@@ -4,6 +4,12 @@ plugins {
id 'maven-publish'
id 'com.gradleup.shadow' version '9.4.0'
id 'com.github.ben-manes.versions' version '0.53.0'
id 'checkstyle'
}

checkstyle {
toolVersion = '10.21.4'
configFile = file("${rootDir}/config/checkstyle/checkstyle.xml")
}

repositories {
@@ -37,8 +43,7 @@ dependencies {
implementation 'org.jgrapht:jgrapht-core:1.5.2'
implementation 'org.jgrapht:jgrapht-io:1.5.2'

implementation 'org.xerial:sqlite-jdbc:3.51.3.0'
implementation 'com.j256.ormlite:ormlite-jdbc:5.7'
implementation 'com.h2database:h2-mvstore:2.3.232'

testImplementation 'org.junit.jupiter:junit-jupiter:5.14.3'
testRuntimeOnly 'org.junit.platform:junit-platform-launcher'
@@ -75,7 +80,6 @@ publishing {

shadowJar {
minimize {
exclude(dependency('org.xerial:sqlite-jdbc:.*'))
exclude(dependency('ch.qos.logback:logback-classic:.*'))
}
}
@@ -87,10 +91,23 @@ tasks.register('benchmark', JavaExec) {

def benchArgs = project.hasProperty('benchRepo') ? [project.property('benchRepo')] : ['.']
if (project.hasProperty('alternates')) benchArgs.add('--alternates')
if (project.hasProperty('cache')) benchArgs.add('--cache')
args = benchArgs
jvmArgs = ['-Xmx1g']
}

tasks.register('memoryProfile', JavaExec) {
dependsOn 'testClasses'
classpath = sourceSets.test.runtimeClasspath
mainClass = 'jp.ac.titech.c.se.stein.testing.MemoryProfile'

def profArgs = project.hasProperty('benchRepo') ? [project.property('benchRepo')] : ['.']
if (project.hasProperty('command')) profArgs.add(project.property('command'))
args = profArgs
def heap = project.hasProperty('heap') ? project.property('heap') : '4g'
jvmArgs = ["-Xmx${heap}", '-XX:+UseSerialGC', '-XX:+CrashOnOutOfMemoryError']
}

tasks.register('executableJar') {
dependsOn 'shadowJar'
// cf. https://ujun.hatenablog.com/entry/2017/09/22/010209
13 changes: 13 additions & 0 deletions config/checkstyle/checkstyle.xml
@@ -0,0 +1,13 @@
<!DOCTYPE module PUBLIC
"-//Checkstyle//DTD Checkstyle Configuration 1.3//EN"
"https://checkstyle.org/dtds/configuration_1_3.dtd">

<module name="Checker">
<module name="TreeWalker">
<!-- Unused imports -->
<module name="UnusedImports"/>

<!-- Require curly braces -->
<module name="NeedBraces"/>
</module>
</module>
107 changes: 107 additions & 0 deletions scripts/bench-incremental.sh
@@ -0,0 +1,107 @@
#!/bin/sh
# Run incremental transformation benchmarks.
# Usage: ./bench-incremental.sh <jar-path> <work-dir> <command> [cache-opts...]
#
# Example:
# ./bench-incremental.sh ./build/libs/git-stein-all.jar ./work @historage-jdt
# ./bench-incremental.sh ./build/libs/git-stein-all.jar ./work @historage-jdt --cache
#
# Runs two experiments:
# A) Incremental over splits (1 -> 2 -> ... -> N)
# B) Independent deltas from base (base+10, base+20, ...)
set -eu

JAR="${1:?Usage: bench-incremental.sh <jar-path> <work-dir> <command> [cache-opts...]}"
WORK_DIR="${2:?}"
COMMAND="${3:?}"
shift 3
CACHE_OPTS="$*"

RESULTS_DIR="$WORK_DIR/results"
mkdir -p "$RESULTS_DIR"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
LABEL=$(echo "$CACHE_OPTS" | tr ' ' '_')
[ -z "$LABEL" ] && LABEL="none"

TIME=/usr/bin/time

# Print the wall-clock seconds (the "real" line of "time -p") of one git-stein run.
# $1: source repository, $2: destination repository.
# $CACHE_OPTS and $COMMAND are left unquoted on purpose so that they split into
# separate arguments.
time_run_stein() {
$TIME -p java -Xmx1g -jar "$JAR" --bare --log=WARN $CACHE_OPTS -o "$2" "$1" $COMMAND 2>&1 | awk '/^real /{print $2}'
}

# ============================================================
# Experiment A: incremental over splits
# ============================================================
echo "=== Experiment A: Incremental splits (cache: ${CACHE_OPTS:-none}) ==="
RESULT_A="$RESULTS_DIR/${TIMESTAMP}_splits_${LABEL}.csv"
echo "step,commits,time_seconds" > "$RESULT_A"

SPLITS_DIR="$WORK_DIR/splits"
DEST_A="$WORK_DIR/dest_splits_${LABEL}"
rm -rf "$DEST_A"

SPLITS=$(ls -1d "$SPLITS_DIR"/[0-9]* 2>/dev/null | wc -l | tr -d ' ')

for i in $(seq 1 "$SPLITS"); do
SOURCE="$SPLITS_DIR/$i"
[ -d "$SOURCE" ] || continue
NCOMMITS=$(git -C "$SOURCE" rev-list --all 2>/dev/null | wc -l | tr -d ' ')
printf " Split %d/%d (%d commits) ... " "$i" "$SPLITS" "$NCOMMITS"

ELAPSED=$(time_run_stein "$SOURCE" "$DEST_A")

echo "${ELAPSED}s"
echo "$i,$NCOMMITS,$ELAPSED" >> "$RESULT_A"
done
echo "Results: $RESULT_A"
rm -rf "$DEST_A"

# ============================================================
# Experiment B: independent deltas from base
# ============================================================
echo ""
echo "=== Experiment B: Deltas from base (cache: ${CACHE_OPTS:-none}) ==="
RESULT_B="$RESULTS_DIR/${TIMESTAMP}_deltas_${LABEL}.csv"
echo "delta,commits,time_seconds" > "$RESULT_B"

DELTAS_DIR="$WORK_DIR/deltas"
BASE_SOURCE="$DELTAS_DIR/base"

# First, create the base destination
DEST_BASE="$WORK_DIR/dest_deltas_base_${LABEL}"
rm -rf "$DEST_BASE"
printf " Building base ... "
BASE_TIME=$(time_run_stein "$BASE_SOURCE" "$DEST_BASE")
BASE_COMMITS=$(git -C "$BASE_SOURCE" rev-list --all 2>/dev/null | wc -l | tr -d ' ')
echo "$BASE_COMMITS commits, ${BASE_TIME}s"
echo "0,$BASE_COMMITS,$BASE_TIME" >> "$RESULT_B"

# Run deltas independently (cp base, then incremental transform)
# Strip the directory prefix first so that "sort -n" compares the numeric names.
DELTAS=$(ls -1d "$DELTAS_DIR"/[0-9]* 2>/dev/null | while read -r d; do basename "$d"; done | sort -n)

for i in $DELTAS; do
DELTA_SOURCE="$DELTAS_DIR/$i"
[ -d "$DELTA_SOURCE" ] || continue
NCOMMITS=$(git -C "$DELTA_SOURCE" rev-list --all 2>/dev/null | wc -l | tr -d ' ')
DIFF=$(( NCOMMITS - BASE_COMMITS ))
printf " Delta %s (+%d commits, total %d) ... " "$i" "$DIFF" "$NCOMMITS"

DEST_DELTA="$WORK_DIR/dest_deltas_${LABEL}_${i}"
cp -r "$DEST_BASE" "$DEST_DELTA"

ELAPSED=$(time_run_stein "$DELTA_SOURCE" "$DEST_DELTA")

echo "${ELAPSED}s"
echo "$i,$NCOMMITS,$ELAPSED" >> "$RESULT_B"
rm -rf "$DEST_DELTA"
done
echo "Results: $RESULT_B"
rm -rf "$DEST_BASE"

echo ""
echo "Done."