A safety-inspired, redundant embedded control platform designed to demonstrate reliability engineering, fault tolerance, deterministic control, crash forensics, and system-level thinking on real hardware.
This project is intentionally built like a “mini critical system” found in automotive, aerospace, and industrial equipment: multiple controllers, strict supervision rules, well-defined degraded modes, fault injection, and a blackbox-style logging pipeline.
- Why this exists
- What the system does
- Core principles
- System architecture
- Hardware
- Software
- Safety and degraded modes
- Fault injection
- Logging and crash forensics
- Build, flash, and run
- Testing strategy
- Acceptance criteria
- Roadmap
- License and disclaimer
Most portfolios show features. This project shows responsibility.
Sentinel is built to prove we can:
- Design a system that detects failures, isolates them, and enters safe states predictably.
- Build deterministic firmware (bounded execution, watchdog discipline, timer-driven loops).
- Structure a fault-tolerant architecture (redundancy, supervision, voting).
- Capture and explain failures with a blackbox logging pipeline (crash forensics).
- Run fault campaigns (repeatable, measurable tests, not “it seems stable”).
Sentinel is a dual-controller platform where two independent microcontrollers (“Worker A” and “Worker B”) run the same control logic and continuously cross-check each other.
A Raspberry Pi acts as:
- Supervisor (health monitoring)
- Logger (blackbox recorder)
- Integration hub (data export, dashboards, scripts, test harness)
The system drives a “safety output” (an Enable line, relay, or actuator simulation) only if strict rules are satisfied.
- Deterministic control loop
  - Fixed control period (example: 1 ms or 10 ms, i.e. 1 kHz or 100 Hz, depending on load)
  - No unbounded work inside the loop
  - Timers and bounded processing time
- Redundancy
  - Two independent workers with independent clocks and resets
  - Cross-monitoring via heartbeat and status frames
- Fail-safe default
  - Any uncertainty leads to safe state
  - Safe state means outputs disabled and latched until recovery policy allows re-arming
- Observable truth
  - Every relevant event becomes a timestamped record
  - Fault injection produces traceable evidence
flowchart LR
subgraph WKA["Worker A (STM32)"]
A1["Control FSM + I/O"]
A2["Heartbeat + Status TX"]
A3["Watchdog + Brownout"]
end
subgraph WKB["Worker B (STM32)"]
B1["Control FSM + I/O"]
B2["Heartbeat + Status TX"]
B3["Watchdog + Brownout"]
end
subgraph BUS["CAN Bus"]
C1["CAN frames: HB/Status/Vote/Fault"]
end
subgraph PI["Gateway (Raspberry Pi 3 B+)"]
P1["CAN Monitor + Supervisor"]
P2["Blackbox Logger"]
P3["Fault Injection Controller"]
P4["Exporter (JSON/PCAP/CSV)"]
end
subgraph OUT["Safety Output Stage"]
O1["Enable logic (2oo2)"]
O2["Relay / Actuator Simulation"]
end
subgraph INJ["Fault Injection"]
F1["5V relay: power cut / line cut / enable cut"]
end
A2 --> BUS
B2 --> BUS
BUS --> P1
A1 --> O1
B1 --> O1
O1 --> O2
PI --> F1
F1 -.injected fault.-> WKA
F1 -.injected fault.-> WKB
F1 -.injected fault.-> BUS
F1 -.injected fault.-> OUT
Worker A:
- Runs the control loop and the safety state machine.
- Publishes:
  - Heartbeat (HB)
  - Status snapshot (mode, error flags, loop timing, supply, counters)
  - Vote/decision (ARM, DISARM, FAULT, DEGRADED)
- Consumes:
  - Peer frames from Worker B
  - Optional supervisor commands from Raspberry Pi (depending on policy)
Worker B: same responsibilities as Worker A. It must be able to keep the system safe even if Worker A is compromised.
Raspberry Pi gateway:
- Passive monitoring by default (never required for safety).
- Collects all frames and builds a timeline.
- Runs scripts to:
  - trigger fault injection
  - generate reports
  - export traces for analysis
Default model: 2oo2 (two out of two) on the enable decision
- Output is enabled only if both workers independently vote OK to enable.
- Any mismatch or silence disables output.
Rationale:
- 2oo2 maximizes safety demonstration: one compromised worker cannot keep the system armed alone.
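The 2oo2 rule can be modeled on the host as a small pure function, which is handy for unit tests. This is a minimal sketch, not the firmware's implementation: the `Vote` type, the `MAX_VOTE_AGE_MS` threshold, and the millisecond timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_VOTE_AGE_MS = 300  # hypothetical staleness limit; tune against the timing budget


@dataclass(frozen=True)
class Vote:
    enable_request: bool  # True means "OK to enable"
    rx_time_ms: int       # time at which this vote was last received


def output_enabled(a: Optional[Vote], b: Optional[Vote], now_ms: int) -> bool:
    """2oo2: enable only if both workers voted OK and both votes are fresh."""
    for vote in (a, b):
        if vote is None:
            return False  # silence: this worker was never heard
        if now_ms - vote.rx_time_ms > MAX_VOTE_AGE_MS:
            return False  # silence: the last vote is too old
        if not vote.enable_request:
            return False  # mismatch or explicit refusal
    return True
```

Every path that is not "both fresh, both OK" returns `False`, so a missing, stale, or disagreeing vote all collapse to the same disabled output.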
Optional model (future): 1oo2 with strict supervision
- May be introduced later to demonstrate availability vs safety tradeoffs.
- Must include explicit hazard analysis and additional checks.
Minimum target configuration:
- 2x STM32 Nucleo boards (example used: NUCLEO-G474RE)
  - Worker A
  - Worker B
- 1x Raspberry Pi 3 B+ (gateway + logger)
- 2x CAN transceiver modules (one per worker if not integrated)
  - Example families: MCP2551, TJA1050, SN65HVD230 (exact part depends on supply voltage and logic levels: MCP2551 and TJA1050 are 5 V parts, SN65HVD230 is a 3.3 V part)
- CAN wiring
  - Twisted pair for CANH/CANL
  - Proper termination (typically 120 Ω at each end of the bus)
- Fault injection relay module (5V)
  - Used to cut power or signals in a controlled way
- Optional lab tools (recommended)
  - Logic analyzer (for timing and bus verification)
  - USB-CAN adapter (for PC capture and sanity checks)
Optional expansions (explicitly not required for MVP):
- Sensors (temperature, presence, etc.)
- External safety output stage (relay driver, MOSFET, isolated IO)
- Dedicated power monitoring IC
Recommended approach for clarity and reproducibility:
- A dedicated 5V supply rail for:
  - Raspberry Pi (stable 5V, adequate current)
  - Relay module
- Workers powered via:
  - Nucleo USB (simplest for MVP), or
  - a shared regulated 5V rail if you want to demonstrate power domain coupling and brownout behavior
Safety note:
- Do not use the Raspberry Pi 5V pin as a casual power source for everything unless you know current budgets and wiring quality.
Typical minimal wiring (conceptual):
- CAN bus:
  - Worker A CANH/CANL -> Bus
  - Worker B CANH/CANL -> Bus
  - Raspberry Pi via CAN interface (SPI CAN controller like MCP2515, or USB-CAN)
- Safety output:
  - Worker A Enable GPIO -> Enable logic input A
  - Worker B Enable GPIO -> Enable logic input B
  - Enable logic -> relay driver / actuator simulation
- Fault injection relay:
  - Relay controlled by Raspberry Pi GPIO
  - Relay contacts wired to cut one of:
    - Worker A power
    - Worker B power
    - CANH or CANL segment
    - Enable line

Exact pin numbers depend on your chosen boards and CAN interface. Keep them in docs/wiring.md.
This repo is structured to keep firmware, gateway, tooling, and documentation cleanly separated.
.
├─ docs/
├─ firmware/
│  ├─ worker-common/
│  ├─ worker-a/
│  ├─ worker-b/
│  └─ tools/
│     ├─ openocd/
│     └─ scripts/
├─ gateway/
│  ├─ sentinel-gw/
│  └─ tools/
│     ├─ capture/
│     ├─ replay/
│     └─ report/
├─ tools/
│  ├─ faultctl/
│  ├─ trace/
│  └─ ci/
├─ cad/
│  ├─ enclosure/
│  └─ wiring/
├─ .github/
│  └─ workflows/
├─ LICENSE
└─ README.md
Each worker implements the same core modules:
- Boot and self-test
  - Reset reason capture
  - Basic sanity checks (clock, critical peripherals)
  - Initialize event log ring buffer
  - Enter SAFE state by default
- Control loop
  - Timer-driven tick
  - Reads local inputs (optional)
  - Updates FSM
  - Publishes heartbeat and status on CAN at fixed rates
  - Drives a local vote output (Enable GPIO)
- Supervision
  - Watchdog is serviced only when:
    - loop timing is within bounds
    - internal state is valid
  - Peer monitoring:
    - expects the peer worker's HB at a defined interval
    - checks peer sequence counter monotonicity
    - checks peer mode consistency
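The peer-monitoring checks can be prototyped as a host-side model before they are written in firmware. The sketch below is illustrative only: `PeerMonitor`, the fault code strings, `HB_TIMEOUT_MS`, and the 8-bit wrapping counter (`SEQ_MODULO`) are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional

HB_TIMEOUT_MS = 300  # hypothetical: about three missed heartbeats at 10 Hz
SEQ_MODULO = 256     # hypothetical: 8-bit wrapping sequence counter


@dataclass
class PeerMonitor:
    last_seq: Optional[int] = None
    last_rx_ms: int = 0          # monitoring starts at t = 0 (boot)
    fault: Optional[str] = None  # first detected fault, latched

    def _latch(self, code: str) -> None:
        if self.fault is None:
            self.fault = code

    def on_heartbeat(self, seq: int, peer_mode: str, local_mode: str, now_ms: int) -> None:
        if self.last_seq is not None:
            step = (seq - self.last_seq) % SEQ_MODULO  # wrap-safe forward distance
            if step == 0:
                self._latch("SEQ_STALL")
            elif step > SEQ_MODULO // 2:
                self._latch("SEQ_REGRESSION")
            # Small forward steps (including a skipped frame) are accepted here.
        if peer_mode != local_mode:
            # Strict equality; a real policy may tolerate brief skew while arming.
            self._latch("MODE_MISMATCH")
        self.last_seq, self.last_rx_ms = seq, now_ms

    def on_tick(self, now_ms: int) -> None:
        """Call from the periodic loop to detect a silent peer."""
        if now_ms - self.last_rx_ms > HB_TIMEOUT_MS:
            self._latch("HB_TIMEOUT")
```

Computing the step modulo the counter width keeps the monotonicity check valid across wraparound (255 followed by 0 is a forward step of 1).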
The following rates are placeholders that you should tune and lock down:
- Control loop: 100 Hz or 1 kHz (pick based on what you want to demonstrate)
- Heartbeat: 10 Hz
- Status: 5 Hz
- Vote frame: 10 Hz or on change
- Fault frame: immediate on fault detection
Keep them defined in one place:
firmware/worker-common/include/sentinel_config.h
The gateway is designed as a set of services:
- sentinel-can-listener
  - Reads CAN frames
  - Timestamps frames at ingestion
  - Writes them to an append-only log
- sentinel-supervisor
  - Derives system health from observed frames
  - Detects mismatches and missing heartbeats
  - Optionally commands fault injection sequences
- sentinel-faultctl
  - Controls relay lines via GPIO
  - Enforces guardrails (rate limits, arming requirements)
- sentinel-export
  - Converts raw logs to:
    - JSON events
    - CSV summaries
    - PCAP-like representations (if needed)
All gateway processes must:
- never block CAN ingestion
- degrade gracefully if storage is full
- rotate logs safely
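One way to keep ingestion from blocking is a bounded hand-off buffer that drops and counts frames when a slow consumer (for example a full disk) falls behind. The following is a sketch under assumptions: `IngestBuffer` and its methods are hypothetical names, and frames are treated as opaque `bytes`.

```python
import queue
from typing import Optional


class IngestBuffer:
    """Bounded hand-off between the CAN reader and slower consumers."""

    def __init__(self, capacity: int) -> None:
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0  # frames discarded because the buffer was full

    def push(self, frame: bytes) -> bool:
        """Called from the ingestion path; returns immediately in all cases."""
        try:
            self._q.put_nowait(frame)
            return True
        except queue.Full:
            self.dropped += 1  # degrade gracefully: count the loss, keep reading
            return False

    def pop(self, timeout_s: float) -> Optional[bytes]:
        """Called by the writer; may wait up to timeout_s for a frame."""
        try:
            return self._q.get(timeout=timeout_s)
        except queue.Empty:
            return None
```

The drop counter turns a storage problem into an observable event rather than a stalled reader; whether to drop the newest or the oldest frame is a policy choice this sketch does not settle.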
CAN frames must be documented and stable.
A recommended frame set:
- HB (Heartbeat)
  - node_id
  - seq
  - mode
  - uptime_ms (optional)
- STATUS
  - error_flags bitfield
  - last_loop_us
  - min/max loop time (rolling window)
  - supply voltage (if available)
- VOTE
  - enable_request (0/1)
  - reason_code
- FAULT
  - fault_code
  - fault_context (optional)
  - latch flag
Define IDs and payloads in:
docs/architecture.md and firmware/worker-common/include/can_protocol.h
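Until the payloads are locked, a gateway-side codec helps keep the documentation and the parser in step. The layout below is purely an assumption for illustration (8-bit node_id, seq, and mode, one reserved byte, 32-bit little-endian uptime_ms, filling a classic 8-byte CAN payload); the authoritative definition belongs in can_protocol.h.

```python
import struct
from typing import NamedTuple

# Hypothetical HB layout: node_id u8, seq u8, mode u8, 1 pad byte, uptime_ms u32 (LE)
HB_STRUCT = struct.Struct("<BBBxI")


class Heartbeat(NamedTuple):
    node_id: int
    seq: int
    mode: int
    uptime_ms: int


def encode_hb(hb: Heartbeat) -> bytes:
    """Pack a heartbeat into an 8-byte payload."""
    return HB_STRUCT.pack(hb.node_id, hb.seq, hb.mode, hb.uptime_ms)


def decode_hb(payload: bytes) -> Heartbeat:
    """Unpack a payload, rejecting anything that is not exactly 8 bytes."""
    if len(payload) != HB_STRUCT.size:
        raise ValueError(f"HB payload must be {HB_STRUCT.size} bytes, got {len(payload)}")
    return Heartbeat(*HB_STRUCT.unpack(payload))
```

Rejecting wrong-length payloads up front means a truncated or foreign frame is reported as an error instead of being decoded into plausible-looking values.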
Workers:
- No dynamic allocation in the control loop
- No unbounded I/O in the control loop
- Any logging in loop uses ring buffer and deferred flushing if needed
Gateway:
- CAN ingestion is single responsibility
- Heavy processing happens offline or in separate processes
Every worker uses the same high-level FSM:
- BOOT
- SAFE (default)
- ARMING (checks, peer sync, timing stable)
- ARMED (output allowed, strict monitoring)
- DEGRADED (still safe output behavior, but logs availability loss)
- FAULT_LATCHED (requires reset or explicit recovery policy)
Document the exact transitions in docs/safety_model.md.
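A table-driven model makes the FSM easy to unit test on the host. The sketch below uses the state names from this README, but the event names and every transition in `TRANSITIONS` are assumptions for illustration; the real transitions are whatever docs/safety_model.md specifies.

```python
from enum import Enum, auto


class Mode(Enum):
    BOOT = auto()
    SAFE = auto()
    ARMING = auto()
    ARMED = auto()
    DEGRADED = auto()
    FAULT_LATCHED = auto()


class Event(Enum):  # hypothetical event names
    SELF_TEST_OK = auto()
    ARM_REQUEST = auto()
    CHECKS_PASSED = auto()
    CHECKS_FAILED = auto()
    DISARM = auto()
    PEER_LOST = auto()
    FAULT = auto()
    RESET = auto()


TRANSITIONS = {  # hypothetical transition table
    (Mode.BOOT, Event.SELF_TEST_OK): Mode.SAFE,
    (Mode.SAFE, Event.ARM_REQUEST): Mode.ARMING,
    (Mode.ARMING, Event.CHECKS_PASSED): Mode.ARMED,
    (Mode.ARMING, Event.CHECKS_FAILED): Mode.SAFE,
    (Mode.ARMED, Event.DISARM): Mode.SAFE,
    (Mode.ARMED, Event.PEER_LOST): Mode.DEGRADED,
    (Mode.DEGRADED, Event.DISARM): Mode.SAFE,
}


def next_mode(mode: Mode, event: Event) -> Mode:
    if mode is Mode.FAULT_LATCHED:
        # Latched: only an explicit reset leaves this state.
        return Mode.BOOT if event is Event.RESET else Mode.FAULT_LATCHED
    if event is Event.FAULT:
        return Mode.FAULT_LATCHED
    if (mode, event) in TRANSITIONS:
        return TRANSITIONS[(mode, event)]
    # Undefined pair: stay put if already non-enabling, otherwise fall back to SAFE.
    return mode if mode in (Mode.BOOT, Mode.SAFE) else Mode.SAFE


def output_allowed(mode: Mode) -> bool:
    return mode is Mode.ARMED
```

Routing every undefined (mode, event) pair to a non-enabling state is one way to express the fail-safe default in code.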
A non-exhaustive set:
- Missed peer heartbeats beyond threshold
- Peer sequence counter regression or stalls
- Mode disagreement (A says ARMED, B says SAFE)
- Local loop time overrun
- Watchdog near-miss counters
- Brownout or reset anomaly
- CAN bus-off (if applicable)
Default output policy:
- Output enable is asserted only when:
  - local FSM is ARMED
  - peer is ARMED
  - no faults latched
  - timing health is good
- Otherwise output is OFF
Fault injection is a first-class feature, not an afterthought.
Examples:
- Cut power to Worker A
- Cut power to Worker B
- Break CAN line segment
- Force enable line low
- Introduce delayed or dropped frames (future: via gateway replay or bus tool)
A “campaign” is:
- a scripted sequence of injected faults
- with expected outcomes
- producing an evidence bundle (logs + summary report)
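A campaign defined this way maps naturally onto data plus a checker. The sketch below is one possible shape, not the format used by docs/test_campaigns.md: `Step`, `Observation`, the fault names, and the report fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    fault: str            # e.g. "cut_power_worker_a" (hypothetical name)
    expected_mode: str    # mode the system must reach after injection
    max_reaction_ms: int  # allowed time from injection to expected mode


@dataclass(frozen=True)
class Observation:
    reached_mode: Optional[str]  # None if no mode change was observed
    reaction_ms: Optional[int]   # None if the expected mode was never reached


def evaluate(steps: List[Step], observations: List[Observation]) -> List[Dict]:
    """Compare each injected fault against what was observed."""
    if len(steps) != len(observations):
        raise ValueError("one observation is required per step")
    report = []
    for step, obs in zip(steps, observations):
        passed = (
            obs.reached_mode == step.expected_mode
            and obs.reaction_ms is not None
            and obs.reaction_ms <= step.max_reaction_ms
        )
        report.append({
            "fault": step.fault,
            "expected_mode": step.expected_mode,
            "observed_mode": obs.reached_mode,
            "reaction_ms": obs.reaction_ms,
            "passed": passed,
        })
    return report
```

Because expected outcomes are written down before the run, the summary report is a mechanical comparison rather than a judgment call.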
See:
docs/test_campaigns.md
We aim for logs that answer:
- What happened
- In what order
- With what timing
- What the system believed
- What it decided
Gateway logs:
- Every CAN frame with timestamp
- Derived events:
  - “Worker A missed heartbeat”
  - “Vote mismatch”
  - “System disarmed due to rule X”
Workers log locally (ring buffer):
- State transitions
- Fault latches and reasons
- Timing overruns
Export formats:
- Raw binary log (fast, compact)
- JSON event timeline (human-readable)
- CSV summaries (metrics)
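As an illustration of how a derived event can be computed from timestamped frames and exported as JSON, here is a minimal sketch. The function names, the event field names, and the one-JSON-object-per-line output are assumptions, not the project's defined schema.

```python
import json
from typing import Dict, List


def missed_heartbeat_events(
    node: str, hb_times_ms: List[int], period_ms: int, tolerance_ms: int
) -> List[Dict]:
    """Emit one event per gap between consecutive heartbeats that exceeds the budget."""
    events = []
    for prev, cur in zip(hb_times_ms, hb_times_ms[1:]):
        gap = cur - prev
        if gap > period_ms + tolerance_ms:
            events.append({
                "t_ms": prev + period_ms,  # when the next heartbeat was due
                "node": node,
                "event": "missed_heartbeat",
                "gap_ms": gap,
            })
    return events


def to_json_lines(events: List[Dict]) -> str:
    """Serialize events as one JSON object per line, with stable key order."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)
```

Sorting keys keeps the exported timeline byte-for-byte stable between runs, which helps when comparing evidence bundles from repeated campaigns.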
Firmware:
- ARM GNU Toolchain (arm-none-eabi-gcc)
- CMake + Ninja (if using CMake-based build)
- OpenOCD and ST-Link tools (or STM32CubeIDE)
- Python 3 (for helper scripts)
Gateway:
- Raspberry Pi OS Lite
- CAN interface drivers (MCP2515 over SPI or USB-CAN)
- systemd for services
- Python or C++ runtime depending on gateway implementation
- Wire the CAN bus and verify termination.
- Flash Worker A and Worker B firmware.
- Bring up Raspberry Pi gateway and verify it sees CAN traffic.
- Run baseline: no faults, stable ARMED.
- Trigger fault injection and verify the system returns to SAFE and logs evidence.
This section is intentionally generic. Lock it down once your toolchain is finalized.
Example flow:
- Build:
    cmake -S firmware/worker-a -B build/worker-a -G Ninja
    cmake --build build/worker-a
- Flash:
    openocd -f interface/stlink.cfg -f target/stm32g4x.cfg -c "program build/worker-a/worker-a.elf verify reset exit"

Repeat both steps for Worker B.
- Enable required interfaces (depends on CAN hardware)
  - SPI if using MCP2515
- Bring up CAN network interface
    ip link set can0 up type can bitrate 500000
- Start services
    systemctl enable --now sentinel-can-listener
    systemctl enable --now sentinel-supervisor
- Validate ingestion
  - check logs directory
  - confirm heartbeat frames appear at expected rate
We test at four levels:
- Unit tests
  - pure logic: FSM transitions, vote rules, fault rules
- On-target tests
  - timing constraints, watchdog behavior, bus stability
- Integration tests
  - dual worker agreement, mismatch handling, arming rules
- Fault campaigns
  - scripted power cuts, bus interruptions, enable cuts
  - must produce stable, repeatable outcomes
The MVP is accepted when:
- Both workers boot into SAFE by default.
- Both workers produce heartbeat and status frames at stable rates.
- System arms only when both are healthy and consistent.
- Any injected fault forces SAFE state within the defined bound.
- Gateway produces a report showing:
  - timestamps
  - detected fault
  - resulting disarm event
  - recovery behavior
- At least one fault campaign is fully reproducible with identical expected outcomes.
Near-term:
- Lock CAN protocol IDs and payloads
- Add formal timing budget and measure loop jitter
- Add structured fault codes and a fault taxonomy
- Add replay tool to reproduce traces deterministically
Mid-term:
- Add optional 1oo2 availability mode (with explicit hazard analysis)
- Add a small HMI dashboard (local or remote) to visualize states and faults
- Add hardware PCB for clean wiring and repeatable demo setup
Long-term:
- Add “postmortem pack” generator: one command to export logs, plots, and narrative
- Add additional fault models: stuck-at GPIO, corrupted payload, delayed frames
- Add formal verification or model checking on the FSM (select critical properties)
This project is for educational and demonstration purposes. It is not a certified safety product and must not be used to control real hazardous machinery without proper engineering process, certification, and safety validation.
See LICENSE.