user-guide: add gateway-failover documentation #259

Open
Fredi-raspall wants to merge 2 commits into master from pr/fredi/gw_failover

@Fredi-raspall
Contributor

Closes: #248
Unsure if this closes #249

@Fredi-raspall Fredi-raspall requested a review from a team as a code owner January 28, 2026 22:55
github-actions bot commented Jan 28, 2026

🚀 Deployed on https://preview-259--hedgehog-docs.netlify.app

@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/gw_failover branch 2 times, most recently from 6947ea5 to 2e2879c Compare January 29, 2026 09:24
@qmonnet qmonnet requested a review from Copilot January 29, 2026 09:43
Copilot AI left a comment

Pull request overview

This PR extends the user guide to document gateway redundancy/fail-over behavior and integrates the new material into the navigation and existing gateway docs. It also slightly refines existing gateway-related titles to better reflect their scope.

Changes:

  • Add a dedicated “Gateway fail-over and redundancy” user-guide page explaining gateway groups, traffic mapping, and fail-over behavior.
  • Link the new page from the overview and the .pages navigation under a new “Gateway” section.
  • Retitle the main gateway and gateway-add docs to “Gateway overview” and “Adding Gateways to the fabric” for clearer context.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docs/user-guide/overview.md | Adds a TOC entry pointing to the new gateway-failover documentation so users can discover redundancy guidance. |
| docs/user-guide/gateway.md | Renames the main heading to “Gateway overview” to clarify that this page introduces gateway concepts now complemented by a separate fail-over page. |
| docs/user-guide/gateway-failover.md | Introduces detailed documentation for gateway redundancy, gateway groups, traffic mapping, and fail-over behavior, including configuration snippets and design rationale. |
| docs/user-guide/gateway-add.md | Updates the title to “Adding Gateways to the fabric” to align with a more general multi-gateway deployment story. |
| docs/user-guide/.pages | Groups gateway-related docs under a “Gateway” nav section and includes the new fail-over page, improving navigation around gateway topics. |


Member

@qmonnet qmonnet left a comment

Great document!

I'm usually picky with the style in the docs (Logan knows something about it), so I've got tons of nitpicks, but nothing major.

I'd also wrap the text on 80-character lines as I find it easier to diff and work with smaller lines, although I'm not sure we have a consensus about that.

One comment would be to remain careful with the number of admonitions (`!!! note`) in the document. It's good to have a few to insert visual pauses in long sections, but having too many may break the flow. You have quite a number of notes, and I think some of them could be regular paragraphs, which would help with overall readability.

@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/gw_failover branch 3 times, most recently from f6dacef to 629ca3b Compare January 29, 2026 21:07
@Fredi-raspall Fredi-raspall requested a review from qmonnet January 29, 2026 21:08
@pau-hedgehog
Contributor

I come to this PR to say that you could put a diagram on it: ;)
image

I could work on representing the gateway group if you need it

@Fredi-raspall
Contributor Author

> I come to this PR to say that you could put a diagram on it: ;)
> I could work on representing the gateway group if you need it

Hey Pau. I'm fine adding a diagram. However I am not sure if it will help too much if:

  • we cannot represent VPCs
  • How would you represent groups?

We could add some representation, but it will mostly need to be manual?

@Fredi-raspall Fredi-raspall force-pushed the pr/fredi/gw_failover branch 3 times, most recently from 9bae58e to 5f9f499 Compare January 30, 2026 10:50
@pau-hedgehog
Contributor

pau-hedgehog commented Jan 30, 2026

> > I come to this PR to say that you could put a diagram on it: ;)
> > I could work on representing the gateway group if you need it
>
> Hey Pau. I'm fine adding a diagram. However I am not sure if it will help too much if:
>
>   • we cannot represent VPCs
>   • How would you represent groups?
>
> We could add some representation, but it will mostly need to be manual?

On fabricator master we can represent VPCs already by querying a running Fabric (`hhfab vlab diagram --live`):
image

>   • How would you represent groups?

I was thinking something similar to ESLAG groups to enclose gateways in a dashed rectangle. Shouldn't be too difficult. But I don't want to add noise to this review. We can address it some other time

@Fredi-raspall
Contributor Author

> > > I come to this PR to say that you could put a diagram on it: ;)
> > > I could work on representing the gateway group if you need it
> >
> > Hey Pau. I'm fine adding a diagram. However I am not sure if it will help too much if:
> >
> >   • we cannot represent VPCs
> >   • How would you represent groups?
> >
> > We could add some representation, but it will mostly need to be manual?
>
> On fabricator master we can represent VPCs already by querying a running Fabric (`hhfab vlab diagram --live`): image
>
> >   • How would you represent groups?
>
> I was thinking something similar to ESLAG groups to enclose gateways in a dashed rectangle. Shouldn't be too difficult. But I don't want to add noise to this review. We can address it some other time

Nice! I'm fine with adding it if it helps to understand the feature. Can be now or later.

qmonnet previously approved these changes Feb 2, 2026
Member

@qmonnet qmonnet left a comment

Looks OK from my side, thanks!

Frostman previously approved these changes Feb 2, 2026
Member

@Frostman Frostman left a comment

It looks okay other than the primary gateway selection: that has to be fixed, but the docs could be updated later.

@Fredi-raspall Fredi-raspall dismissed stale reviews from Frostman and qmonnet via 0d487dc February 2, 2026 21:36
Signed-off-by: Fredi Raspall <fredi@githedgehog.com>
Signed-off-by: Fredi Raspall <fredi@githedgehog.com>
Gateways implement services that are, in many cases, stateful. To correctly handle flows, the packets in the forward and reverse direction should be processed by the same gateway. The Hedgehog Fabric fail-over strategy is such that only one gateway handles a particular flow at any point in time. Gateway group priorities help to ensure that edge devices participating in a VPC peering select the same gateway. In future releases, it may be possible to balance the traffic of a single VPC peering over multiple gateways.

!!! note
Since group membership priorities are specified in the gateways themselves (instead of the `GatewayGroup`s), with many groups and gateways, two or more gateways may end up being assigned the same priority in a given group. The fabric will not reject such a configuration: despite having the same priorities, only one of the gateways will be the preferred; the first when ordering the gateways within the group alphabetically by name. This tie-breaking criteria is implemented by all gateways so that only one gateway per group is selected consistently across the fabric.
Member

I think we should clarify here that this applies when the priorities are the same, but OK for me

Contributor Author

> I think we should clarify here that this applies when the priorities are the same, but OK for me

Sorry @Frostman. I don't understand your point. Isn't it clear that we're talking about the case when two or more gateways have the same priority (and it is higher than the rest)?

> ... two or more gateways may end up being assigned the same priority in a given group...

> ... despite having the same priorities, only one of the gateways will be the preferred; the first when ordering the gateways within the group alphabetically by name
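
For readers following this thread, the tie-break in the quoted note can be sketched in a few lines of Python. This is only an illustration, not the gateway's implementation: the `Member` type and the `preferred_gateway` function are hypothetical names, the sketch assumes that a higher priority value is preferred (as the reply above implies), and it approximates "alphabetically by name" with Python's default string ordering.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical view of one gateway's membership in a group."""
    name: str
    priority: int


def preferred_gateway(members: Iterable[Member]) -> Optional[Member]:
    """Pick the preferred member of a group.

    Assumption: the highest priority wins; members with equal priority
    are ordered by name and the first one is chosen.
    """
    members = list(members)
    if not members:
        return None
    # Negating the priority makes min() favour the highest priority;
    # the name is the secondary key that breaks ties.
    return min(members, key=lambda m: (-m.priority, m.name))


group = [Member("gw-b", 100), Member("gw-a", 100), Member("gw-c", 50)]
choice = preferred_gateway(group)
```

Because the key is deterministic, every node that evaluates it over the same membership reaches the same choice, regardless of the order in which the members are listed.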

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.



However:

!!! warning
Currently, group sizes are limited to 10 members at the most. Such a limit may only affect in case you have more than 10 gateways deployed on the same fabric.
Copilot AI Feb 18, 2026

Minor grammar: “at the most” should be “at most”.

Suggested change
-Currently, group sizes are limited to 10 members at the most. Such a limit may only affect in case you have more than 10 gateways deployed on the same fabric.
+Currently, group sizes are limited to 10 members at most. Such a limit may only affect in case you have more than 10 gateways deployed on the same fabric.

Gateways implement services that are, in many cases, stateful. To correctly handle flows, the packets in the forward and reverse direction should be processed by the same gateway. The Hedgehog Fabric fail-over strategy is such that only one gateway handles a particular flow at any point in time. Gateway group priorities help to ensure that edge devices participating in a VPC peering select the same gateway. In future releases, it may be possible to balance the traffic of a single VPC peering over multiple gateways.

!!! note
Since group membership priorities are specified in the gateways themselves (instead of the `GatewayGroup`s), with many groups and gateways, two or more gateways may end up being assigned the same priority in a given group. The fabric will not reject such a configuration: despite having the same priorities, only one of the gateways will be the preferred; the first when ordering the gateways within the group alphabetically by name. This tie-breaking criteria is implemented by all gateways so that only one gateway per group is selected consistently across the fabric.
Copilot AI Feb 18, 2026

Minor grammar: “This tie-breaking criteria is …” is ungrammatical (either use singular “criterion” or plural “criteria are”).

Suggested change
-Since group membership priorities are specified in the gateways themselves (instead of the `GatewayGroup`s), with many groups and gateways, two or more gateways may end up being assigned the same priority in a given group. The fabric will not reject such a configuration: despite having the same priorities, only one of the gateways will be the preferred; the first when ordering the gateways within the group alphabetically by name. This tie-breaking criteria is implemented by all gateways so that only one gateway per group is selected consistently across the fabric.
+Since group membership priorities are specified in the gateways themselves (instead of the `GatewayGroup`s), with many groups and gateways, two or more gateways may end up being assigned the same priority in a given group. The fabric will not reject such a configuration: despite having the same priorities, only one of the gateways will be the preferred; the first when ordering the gateways within the group alphabetically by name. This tie-breaking criterion is implemented by all gateways so that only one gateway per group is selected consistently across the fabric.

Comment on lines +1 to 3
# Adding Gateways to the fabric

This section covers adding a gateway node to an existing Fabric. Gateway nodes provide advanced network services (NAT, PAT, firewalling) by
Copilot AI Feb 18, 2026

The new title is plural (“Adding Gateways…”), but the opening sentence still describes adding a single gateway node. Consider aligning the wording (either keep the title singular or update the intro to cover adding one or more gateways) to avoid confusing readers.

Comment on lines +19 to +25
items:
- apiVersion: gateway.githedgehog.com/v1alpha1
  kind: GatewayGroup
  metadata:
    name: group-1
    namespace: default
  spec: {}
Copilot AI Feb 18, 2026

The `GatewayGroup` YAML example is not a valid single-object manifest: it uses a top-level `items:` list without a corresponding `kind: List` (and likely needs either `kind: GatewayGroup` directly, or `apiVersion: v1` + `kind: List`, or `---` multi-document YAML). As written, users copy/pasting this will get a validation/apply error.

Suggested change
-items:
-- apiVersion: gateway.githedgehog.com/v1alpha1
-  kind: GatewayGroup
-  metadata:
-    name: group-1
-    namespace: default
-  spec: {}
+kind: GatewayGroup
+metadata:
+  name: group-1
+  namespace: default
+spec: {}

```

!!! note
The priority assigned to a gateway in a group has no significance in absolute terms. Configuring three gateways in the same group with priorities 300, 200 and 100 has the same effect as configuring them with priorities 51, 29 and 3.
Copilot AI Feb 18, 2026

Several admonition bodies are indented with a tab character. MkDocs/Material admonitions require consistent space indentation; tabs can cause the admonition content to render as a code block or not be associated with the admonition at all. Replace the leading tab with 4 spaces in these blocks.

Suggested change
-	The priority assigned to a gateway in a group has no significance in absolute terms. Configuring three gateways in the same group with priorities 300, 200 and 100 has the same effect as configuring them with priorities 51, 29 and 3.
+    The priority assigned to a gateway in a group has no significance in absolute terms. Configuring three gateways in the same group with priorities 300, 200 and 100 has the same effect as configuring them with priorities 51, 29 and 3.
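
As a side note on the content of that admonition, the claim that only the relative order of priorities matters can be checked with a small Python sketch. The `ranking` helper and the gateway names are hypothetical, and the sketch assumes that a higher value means more preferred, with names used only to break ties.

```python
from typing import Dict, List


def ranking(priorities: Dict[str, int]) -> List[str]:
    """Order gateway names from most to least preferred.

    Assumption: higher priority is preferred; equal priorities
    fall back to ordering by name.
    """
    return sorted(priorities, key=lambda name: (-priorities[name], name))


# The two priority assignments used as an example in the note.
spread_out = {"gw-1": 300, "gw-2": 200, "gw-3": 100}
compact = {"gw-1": 51, "gw-2": 29, "gw-3": 3}

same_order = ranking(spread_out) == ranking(compact)
```

Both assignments yield the same ordering, which is all the selection depends on under these assumptions.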

Comment on lines +88 to +91
One consequence of mapping a peering to a non-default `GatewayGroup` is that any gateway that is not a member of that group will not be used to serve the traffic for that peering, even if all gateways in that group become unavailable.

!!! tip
Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.
Copilot AI Feb 18, 2026

These admonition bodies are also tab-indented; use spaces to ensure the content is rendered as part of the admonition (and not as a code block).

Suggested change
-One consequence of mapping a peering to a non-default `GatewayGroup` is that any gateway that is not a member of that group will not be used to serve the traffic for that peering, even if all gateways in that group become unavailable.
-!!! tip
-	Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.
+One consequence of mapping a peering to a non-default `GatewayGroup` is that any gateway that is not a member of that group will not be used to serve the traffic for that peering, even if all gateways in that group become unavailable.
+!!! tip
+    Gateway groups and the peering mappings can be handy for other purposes. For instance, removing a gateway from a group allows pulling the traffic of all peerings mapped to that group out of that gateway. Or, by adjusting member priorities, traffic can be re-mapped without changing the peering mappings to groups.
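
The behavior described in the paragraph and tip above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the fabric's logic: `select_gateway` and the gateway names are hypothetical, availability is modeled as a plain set of names, and fail-over is assumed to pick the highest-priority member that is still available.

```python
from typing import Dict, Optional, Set


def select_gateway(group: Dict[str, int], available: Set[str]) -> Optional[str]:
    """Choose the gateway serving a peering mapped to `group`.

    `group` maps member names to priorities; `available` holds the names
    of gateways that are currently up (members or not).
    """
    # Only members of the mapped group are candidates.
    candidates = {n: p for n, p in group.items() if n in available}
    if not candidates:
        # No fallback to gateways outside the group.
        return None
    return min(candidates, key=lambda n: (-candidates[n], n))


group_1 = {"gw-1": 100, "gw-2": 50}            # gw-3 is not a member
all_up = {"gw-1", "gw-2", "gw-3"}

normal = select_gateway(group_1, all_up)                  # preferred member
after_failure = select_gateway(group_1, {"gw-2", "gw-3"})  # gw-1 is down
group_down = select_gateway(group_1, {"gw-3"})             # all members down
drained = select_gateway({"gw-2": 50}, all_up)             # gw-1 removed from group
```

The last call mirrors the tip: removing `gw-1` from the group moves the traffic of the peerings mapped to that group onto the remaining member without touching the peering mappings.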

Development

Successfully merging this pull request may close these issues.

  • Update all docs to suggest redundant gateway setup
  • Document how Gateway redundancy works

4 participants
