nuclide-research/BARE

BARE

Offline semantic ranker: scanner findings to Metasploit modules in a single Rust binary.

Features | Installation | Usage | Adapters | Corpus | Scope


BARE reads findings produced by nuclei, nmap, or Shodan adapters and ranks Metasploit modules by semantic similarity. The full pipeline runs offline. A BERT encoder, tokenizer, and 3,904 pre-encoded module descriptions are compiled into a ~101 MB binary. No Python, no PyTorch, no network, no package manager. Built for air-gapped networks, SCIFs, and restricted endpoints where installing a 5 GB ML stack is not an option.

Features

  • Single Rust binary, ~101 MB, fully self-contained
  • 3,904 Metasploit modules baked in at compile time via include_bytes!: 2,647 exploits, 1,257 auxiliary
  • All-MiniLM-L6-v2 BERT encoder running natively in Rust, no Python runtime
  • Cosine-similarity ranking with configurable top-N and minimum-score thresholds
  • --no-match-threshold emits sentinel fields when no module description meaningfully resembles the finding
  • Three input adapters: nuclei JSONL, nmap XML, Shodan JSONL
  • Stable input and output JSON schemas (FORMAT.md, INPUT_FORMAT.md, OUTPUT_FORMAT.md)
  • Parity validation against Python sentence-transformers reference within f32 rounding error
  • Stdout carries only the JSON output, so piping is safe
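
The ranking step listed above can be sketched in a few lines of Python. This is an illustrative model, not BARE's Rust implementation: the toy three-dimensional vectors, the module paths under `demo/`, the `no_match_reason` wording, and the order in which the two thresholds are applied are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_modules(query, corpus, top=3, min_score=0.0, no_match_threshold=0.55):
    """Rank corpus entries (module path -> unit vector) against a query vector.

    Assumption: the no-match check runs on the best raw score before the
    min-score filter is applied.
    """
    q = l2_normalize(query)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, vec)), path) for path, vec in corpus.items()),
        reverse=True,
    )
    top_score = scored[0][0] if scored else 0.0
    if top_score < no_match_threshold:
        return {
            "matches": [],
            "no_high_confidence_match": True,
            "no_match_reason": "top score below threshold",  # wording is hypothetical
            "top_score_seen": round(top_score, 4),
        }
    kept = [(s, p) for s, p in scored[:top] if s >= min_score]
    return {
        "matches": [
            {"rank": i, "module": p, "score": round(s, 4),
             "category": p.split("/", 1)[0]}  # first path segment
            for i, (s, p) in enumerate(kept, start=1)
        ]
    }


# Hypothetical toy corpus; the real corpus holds 384-dim vectors.
corpus = {
    "exploits/demo/a": l2_normalize([1.0, 0.0, 0.0]),
    "auxiliary/demo/b": l2_normalize([0.0, 1.0, 0.0]),
    "exploits/demo/c": l2_normalize([1.0, 1.0, 0.0]),
}
result = rank_modules([1.0, 0.1, 0.0], corpus, top=2, min_score=0.5)
```

Because every vector is L2-normalized, the dot product is the cosine similarity, so no per-comparison division is needed.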

Installation

Pre-built binary from the releases page:

curl -LO https://github.com/nuclide-research/BARE/releases/latest/download/bare-linux-x86_64
curl -LO https://github.com/nuclide-research/BARE/releases/latest/download/bare-linux-x86_64.sha256
sha256sum -c bare-linux-x86_64.sha256
chmod +x bare-linux-x86_64

Build from source (Rust 1.70 or later):

git clone https://github.com/nuclide-research/BARE
cd BARE
curl -L -o assets/model.safetensors \
  https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/model.safetensors
cargo build --release

The model weights (assets/model.safetensors, ~87 MB) are gitignored and must be fetched once before the first build. After that the binary is self-contained.

Usage

bare [OPTIONS] [INPUT_PATH]

  --top <N>                   Top matches per finding (default: 3)
  --min-score <FLOAT>         Suppress matches below this cosine similarity
                              (default: 0.0). 0.5 high-confidence, 0.4 moderate, 0.3 loose
  --no-match-threshold <FLOAT>
                              When top corpus score falls below this value,
                              clear matches and emit sentinel fields (default: 0.55)
  --encode                    Read stdin, print L2-normalized 384-dim vector to stdout
  --version                   Print version and exit

  INPUT_PATH                  Path to findings.json, or - / omitted to read stdin

Status messages go to stderr. The JSON output document is the only thing on stdout.
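
Because stdout carries only the JSON document, the binary is easy to drive from a script. The following is a minimal sketch, assuming `bare` is on `PATH` and exits non-zero on failure (the exit-code behavior is an assumption); `build_argv` and `run_bare` are hypothetical helper names, and only the flags documented above are used.

```python
import json
import subprocess


def build_argv(bare_path="bare", top=3, min_score=0.0, no_match_threshold=0.55):
    """Build the command line; '-' tells bare to read findings from stdin."""
    return [
        bare_path,
        "--top", str(top),
        "--min-score", str(min_score),
        "--no-match-threshold", str(no_match_threshold),
        "-",
    ]


def run_bare(findings_doc, **options):
    """Feed a findings document to bare and return the parsed output document."""
    proc = subprocess.run(
        build_argv(**options),
        input=json.dumps(findings_doc),
        capture_output=True,
        text=True,
        check=True,  # assumption: bare signals failure with a non-zero exit code
    )
    # Status messages arrive on proc.stderr; stdout holds only the JSON document.
    return json.loads(proc.stdout)
```

Keeping argument construction separate from the subprocess call makes the command line easy to log or inspect before anything runs.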

Adapters

| Adapter | Input | Adapter command |
|---------|-------|-----------------|
| nuclei | nuclei JSONL (-j) | nuclei_to_bare.py |
| nmap | nmap XML (-oX) | nmap_to_bare.py |
| shodan | Shodan JSONL bulk export | shodan_to_bare.py |

Each adapter converts scanner output to the findings.json input schema (version 1). Run nmap with -sV so that service and version detection gives each finding more descriptive text to match against.

nuclei -u https://target.com -j | python adapters/nuclei/nuclei_to_bare.py | bare
nmap -sV -oX - target.com | python adapters/nmap/nmap_to_bare.py | bare --top 5
cat results.json | python adapters/shodan/shodan_to_bare.py | bare --min-score 0.4
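
To illustrate what an adapter does, here is a minimal sketch of a nuclei-style conversion. It is not the shipped nuclei_to_bare.py: the field mapping (`template-id` to `id`, `info.name` to `title`, `matched-at` or `host` to `target`) and the fallback to the title when a template has no description are assumptions for the example.

```python
import json


def nuclei_line_to_finding(line):
    """Map one nuclei JSONL record to a findings.json entry, or None if unusable."""
    record = json.loads(line)
    info = record.get("info", {})
    finding = {
        "id": record.get("template-id", ""),
        "title": info.get("name", ""),
        # Assumption: fall back to the title when no description is present.
        "description": info.get("description") or info.get("name", ""),
    }
    if not all(finding.values()):
        return None  # id, title, and description must be non-empty
    target = record.get("matched-at") or record.get("host")
    if target:
        finding["target"] = target
    if info.get("severity"):
        finding["severity"] = info["severity"]
    return finding


def convert(lines):
    """Wrap converted records in a version 1 findings document."""
    candidates = (nuclei_line_to_finding(ln) for ln in lines if ln.strip())
    return {
        "version": 1,
        "source": "nuclei",
        "findings": [f for f in candidates if f is not None],
    }
```

A real adapter would read lines from `sys.stdin` and write `json.dumps(convert(sys.stdin))` to stdout so it can sit in the pipelines shown above.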

Input and output schema

Input (findings.json):

{
  "version": 1,
  "source": "nuclei",
  "findings": [
    {
      "id": "CVE-2023-22527",
      "title": "Atlassian Confluence SSTI RCE",
      "description": "...",
      "target": "https://example.com",
      "severity": "critical",
      "metadata": {}
    }
  ]
}

target, severity, and metadata are optional. id, title, and description are required and must be non-empty.
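
These rules are simple enough to check before handing a document to the binary. The sketch below is a hypothetical pre-flight validator, not part of BARE; treating whitespace-only strings as empty is an assumption.

```python
def validate_findings(doc):
    """Return a list of problems; an empty list means the document looks valid."""
    problems = []
    if doc.get("version") != 1:
        problems.append("version must be 1")
    findings = doc.get("findings")
    if not isinstance(findings, list):
        return problems + ["findings must be an array"]
    for i, finding in enumerate(findings):
        if not isinstance(finding, dict):
            problems.append(f"findings[{i}] must be an object")
            continue
        for key in ("id", "title", "description"):
            value = finding.get(key)
            # Assumption: whitespace-only strings count as empty.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"findings[{i}].{key} must be a non-empty string")
    return problems
```

Returning every problem at once, instead of raising on the first, is convenient when an adapter emits many findings.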

Output per finding:

| Field | Type | Notes |
|-------|------|-------|
| id | string | echoed from input |
| title | string | echoed from input |
| target | string | echoed, omitted if absent |
| severity | string | echoed, omitted if absent |
| matches | array | ranked module matches |
| no_high_confidence_match | bool | set when top corpus score below threshold |
| no_match_reason | string | reason text when no match |
| top_score_seen | float | top raw score when no match |

Each entry in matches:

| Field | Notes |
|-------|-------|
| rank | 1-based |
| module | Metasploit module path |
| score | cosine similarity (0.0 to 1.0) |
| category | first path segment of the module name |

The top-level document also carries version, source, and a corpus object with size and sha256.

Example

bare --top 3 findings.json
{
  "version": 1,
  "source": "bare",
  "corpus": {
    "size": 3904,
    "sha256": "a3c1e..."
  },
  "findings": [
    {
      "id": "CVE-2023-22527",
      "title": "Atlassian Confluence SSTI RCE",
      "target": "https://example.com",
      "severity": "critical",
      "matches": [
        {
          "rank": 1,
          "module": "exploits/multi/http/atlassian_confluence_rce_cve_2023_22527",
          "score": 0.8322,
          "category": "exploits"
        },
        {
          "rank": 2,
          "module": "exploits/multi/http/atlassian_confluence_rce_cve_2024_21683",
          "score": 0.7472,
          "category": "exploits"
        }
      ]
    }
  ]
}

When --no-match-threshold fires, matches is empty and three sentinel fields appear: no_high_confidence_match: true, no_match_reason, and top_score_seen.
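
A consumer has to handle both shapes. The sketch below reads an output document and prints one line per finding; `summarize` is a hypothetical helper, and it only relies on the fields documented above.

```python
def summarize(output_doc):
    """Yield one human-readable line per finding in a bare output document."""
    for finding in output_doc["findings"]:
        if finding.get("no_high_confidence_match"):
            # Sentinel case: matches is empty and top_score_seen is present.
            yield (f"{finding['id']}: no high-confidence match "
                   f"(top score {finding['top_score_seen']:.4f})")
            continue
        matches = finding.get("matches", [])
        if not matches:
            # Can also happen when --min-score filters out every candidate.
            yield f"{finding['id']}: no matches above the minimum score"
            continue
        best = matches[0]  # rank 1 is the first entry
        yield f"{finding['id']}: {best['module']} (score {best['score']:.4f})"
```

Checking the sentinel flag first keeps the "nothing resembles this finding" case distinct from "candidates existed but were filtered out".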

Corpus

The embedded corpus contains 3,904 Metasploit module descriptions: 2,647 exploits and 1,257 auxiliary. The corpus is baked in at compile time via include_bytes!. Rebuilding from a fresh Metasploit snapshot needs Python with sentence-transformers:

python fetch_modules.py   # fetch module .rb files from GitHub
python serialize.py       # encode to 384-dim vectors, write corpus.bin
cargo build --release     # embed corpus.bin in the binary

Parity validation

The Rust encoder must match the Python sentence-transformers reference to within f32/f64 rounding error. bare --encode reads stdin and prints a space-separated, L2-normalized 384-dimensional vector. CI compares this output element-wise against tools/encode_baseline.py with a tolerance of 1e-5 (observed differences are typically ~1e-7) and fails the build on any mismatch.
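
The element-wise comparison can be sketched as follows. This is not the project's CI script: `parse_vector` and `check_parity` are hypothetical names, and the reference vector is assumed to be available as a list of floats however the baseline script produces it.

```python
def parse_vector(text):
    """Parse the space-separated floats printed by `bare --encode`."""
    return [float(tok) for tok in text.split()]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max(abs(x - y) for x, y in zip(a, b))


def check_parity(rust_text, reference, dim=384, tol=1e-5):
    """Return True when the Rust vector matches the reference within tol."""
    vec = parse_vector(rust_text)
    if len(vec) != dim:
        raise ValueError(f"expected {dim} values, got {len(vec)}")
    return max_abs_diff(vec, reference) <= tol
```

Using the maximum absolute difference means a single out-of-tolerance element fails the check, which matches the "fails on any mismatch" behavior described above.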

Scope

BARE ranks modules by semantic similarity. A high score means the module description resembles the finding description. It does not confirm exploitability, check version numbers, or replace a manual triage step. Scores near the corpus floor (below ~0.55) mean no meaningful module coverage for the finding class, not a false negative. The --no-match-threshold flag makes that explicit.

Our other projects

  • aimap — AI/ML infrastructure fingerprint scanner
  • scanner — fast banner stage for population sweeps
  • tiptoe — quiet, congestion-controlled scanner for sensitive targets
  • menlohunt — zero-knowledge GCP perimeter scanner
  • recongraph — typed provenance graph for multi-source recon

License

MIT or Apache 2.0, at your option (standard Rust dual license). The embedded model weights (sentence-transformers/all-MiniLM-L6-v2) are Apache 2.0. Metasploit module descriptions used to build the corpus are BSD 3-Clause (Rapid7).

Part of the NuClide toolchain. Contact: nuclide-research.com
