4 changes: 4 additions & 0 deletions docs/_static/custom.css
@@ -92,6 +92,10 @@ html[data-theme=light] .graph#doc-flowchart .node text {
fill: black;
}

html[data-theme=light] .graph#doc-flowchart .edge text {
fill: black;
}

.bd-content .sd-tab-set .sd-tab-content {
padding: 1.5rem;
}
53 changes: 53 additions & 0 deletions docs/getting_started/faq.md
@@ -144,6 +144,59 @@ re-enroll by visiting <https://duo.colorado.edu>. If that did not resolve your i

## General High Performance Computing

### What is Arbiter2?
::::{dropdown} Show
:icon: note

[Arbiter2](https://github.com/chpc-uofu/arbiter2) is a tool created by the University of Utah that allows us to monitor non-compute node resources for undesirable behavior. For an in-depth explanation of how Arbiter2 works, please see the official paper ["Arbiter: Dynamically Limiting Resource Consumption on Login Nodes"](https://dl.acm.org/doi/10.1145/3332186.3333043); for a general overview, see the remaining content in this FAQ. Currently, Arbiter2 is deployed on low-resource hosts, such as login nodes, to detect work that consumes substantial CPU or memory resources. Examples of such work include installing or compiling software, running software applications, and modifying large files. Once Arbiter2 detects processes that consume substantial resources and moves the user into a penalty state, the user's total available resources on that host are reduced and the user is sent a no-reply warning email. For a list of all hosts that Arbiter2 is deployed on, see the `Host` column in the table below.

```{important}
- Arbiter2 is currently set up to track work across all login nodes. For this reason, a user's state will be the same on all login nodes.
- Chosen configuration values (e.g. thresholds) and where Arbiter2 is deployed may change over time. This is due to adjustments we may need to make so that Arbiter2 fits the needs of the system and users. Please refer to this FAQ in the future for the most up-to-date information.
```

For those interested, we will now provide details on how Arbiter2 works, which includes how penalty states are applied. Before we get started, it is important to define some Arbiter2 terms. Please review the table of terms in the dropdown below before proceeding.

:::{dropdown} Show Arbiter2 terms
:icon: note
| Term | Description |
| :---------------------- | :--------------------------------------------- |
| Badness | A value between 0 and 100 that accrues when a user exceeds defined resource thresholds. |
| Normal state | The default user state, which has the maximum amount of CPU and memory resources. |
| Penalty state | A user state with CPU and memory constraints applied. |
| Penalty occurrences | A counter used to determine which penalty state the user is placed in (see the table below for the mapping between penalty occurrences and penalty states). |
| Penalty occurrences timer | A variable that defines how long the user must be in the normal state before their penalty occurrences are reduced by 1. We currently set this value to `3 hours`. |
| CPU threshold | The fraction of normal-state CPU capacity above which badness accumulates. We set this value to `0.75` (75%). Since there are `4` CPUs available in the normal state, badness begins accumulating when usage exceeds `3` CPUs (`4 × 0.75`). |
| Memory threshold | The fraction of normal-state memory (RAM) capacity above which badness accumulates. We set this value to `0.75` (75%). Since there is `4GB` of memory available in the normal state, badness begins accumulating when usage exceeds `3GB` (`4 × 0.75`). |
| Time to max badness | The amount of time spent over a threshold that will result in 100 badness and trigger an increase in penalty occurrences. Currently, this value is set to `10 minutes`. |
| Time to min badness | The amount of time spent under all thresholds to go from 100 to 0 badness. Currently, this value is set to `30 minutes`. |
:::
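
As a rough illustration of the arithmetic in the terms table, the Python sketch below recomputes the usage limits and, assuming badness changes at a constant (linear) rate, the per-minute gain and loss implied by the two time settings. The constant and function names are made up for this example and are not Arbiter2 configuration keys; Arbiter2's actual badness calculation may differ.

```python
# Values quoted in the terms table; the names are illustrative only.
NORMAL_CPU_CORES = 4
NORMAL_MEMORY_GB = 4
CPU_THRESHOLD = 0.75
MEMORY_THRESHOLD = 0.75
TIME_TO_MAX_BADNESS_MIN = 10
TIME_TO_MIN_BADNESS_MIN = 30


def usage_limit(capacity, threshold):
    """Usage above this value starts accumulating badness."""
    return capacity * threshold


def badness_per_minute(minutes_for_full_range):
    """Assumed-linear rate that covers the full 0-100 badness range."""
    return 100 / minutes_for_full_range


cpu_limit = usage_limit(NORMAL_CPU_CORES, CPU_THRESHOLD)     # 3.0 cores
mem_limit = usage_limit(NORMAL_MEMORY_GB, MEMORY_THRESHOLD)  # 3.0 GB
gain = badness_per_minute(TIME_TO_MAX_BADNESS_MIN)           # 10.0 per minute
loss = badness_per_minute(TIME_TO_MIN_BADNESS_MIN)           # about 3.33 per minute
```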
In general, a user accrues badness while they are over a CPU or memory threshold. When a user's badness reaches 100, their penalty occurrences are incremented by 1, they are moved into the penalty state corresponding to their penalty occurrences, and they receive a no-reply warning email. While in a penalty state, the resources available to them on the host are reduced (i.e. throttled) according to that penalty state. They stay in this penalty state for a set duration.

Once the penalty state duration ends, the user is placed back into the normal state (i.e. no throttling is applied). While a user is in the normal state, their penalty occurrences are reduced by 1 each time the penalty occurrences timer reaches zero. If the user still has penalty occurrences remaining, the timer restarts and repeats until the number of penalty occurrences reaches zero. For a list of threshold values and durations for each penalty state, see the table below. Additionally, the dropdown below provides a flowchart of the logic Arbiter2 uses to put a user in the normal state or a penalty state.

:::{dropdown} Show flowchart depiction of Arbiter2
:icon: note
```{eval-rst}
.. raw:: html
:file: ../../graphviz_flowcharts/generated_images/arbiter_flowchart.svg

```
:::
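
The badness loop in the flowchart can be approximated with a small simulation. This is a minimal sketch, not Arbiter2's implementation: it assumes one usage sample per minute, a linear gain of 10 badness per minute over a threshold (100 badness in 10 minutes), and a linear loss of 100/30 badness per minute under all thresholds. The function name `minutes_until_penalty` and its parameters are hypothetical.

```python
def minutes_until_penalty(usage_per_minute, cpu_limit=3.0, mem_limit=3.0,
                          gain=10.0, loss=100 / 30):
    """Return the 1-based minute at which badness reaches 100, or None.

    usage_per_minute: iterable of (cpu_cores, memory_gb) samples, one per minute.
    All rates are assumptions made for this sketch.
    """
    badness = 0.0
    for minute, (cpu, mem) in enumerate(usage_per_minute, start=1):
        if cpu > cpu_limit or mem > mem_limit:
            # Over a threshold: badness accumulates, capped at 100.
            badness = min(100.0, badness + gain)
        else:
            # Under all thresholds: badness reduces, floored at 0.
            badness = max(0.0, badness - loss)
        if badness >= 100.0:
            return minute
    return None


# Ten straight minutes on 4 cores crosses 100 badness at minute 10.
sustained = minutes_until_penalty([(4.0, 1.0)] * 10)

# Two 5-minute bursts separated by 15 quiet minutes never reach 100.
bursty = minutes_until_penalty(
    [(4.0, 1.0)] * 5 + [(1.0, 1.0)] * 15 + [(4.0, 1.0)] * 5
)
```

Under these assumptions, short bursts separated by enough quiet time do not trigger a penalty, while sustained usage over a threshold does.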

```{note}
When a user attempts to use more resources than their "Resource usage maximum", they will experience the following:
- If they try to use more CPU, their CPU usage is automatically throttled to stay below their CPU usage maximum.
- If they try to use more memory, their program is automatically killed once it goes over their memory usage maximum.
```

```{eval-rst}
.. raw:: html
:file: ./faq_html/arbiter_penalty_table.html
```
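
The penalty table above can also be read as a simple lookup. The sketch below encodes the table's values and the 3-hour penalty occurrences timer in Python; the names `PENALTY_STATES`, `penalty_state_for`, and `time_to_clear` are hypothetical, and `time_to_clear` assumes the user stays in the normal state the whole time without triggering a new penalty.

```python
from datetime import timedelta

# (resource maximum in CPU cores and GB of RAM, penalty duration), from the table.
PENALTY_STATES = {
    "penalty1": (3.2, timedelta(minutes=30)),
    "penalty2": (2.0, timedelta(hours=1)),
    "penalty3": (1.2, timedelta(hours=2)),
}

# Time in the normal state before penalty occurrences drop by 1.
OCCURRENCES_TIMER = timedelta(hours=3)


def penalty_state_for(occurrences):
    """Map a penalty occurrences count to a penalty state name."""
    if occurrences < 1:
        raise ValueError("a penalty state requires at least 1 occurrence")
    # 3 or more occurrences all map to penalty3.
    return f"penalty{min(occurrences, 3)}"


def time_to_clear(occurrences):
    """Normal-state time needed for occurrences to fall back to zero."""
    return occurrences * OCCURRENCES_TIMER


state = penalty_state_for(2)
limit, duration = PENALTY_STATES[state]
```

For example, a user with 2 penalty occurrences would be placed in `penalty2` (2 CPU cores and 2GB of RAM for 1 hour) and, under the assumption above, would need 6 hours in the normal state for their occurrences to return to zero.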

::::

### How can I add users to a Linux group?
::::{dropdown} Show
:icon: note
114 changes: 114 additions & 0 deletions docs/getting_started/faq_html/arbiter_penalty_table.html
@@ -0,0 +1,114 @@
<style>

/* Light theme variables */
html[data-theme="light"] {
--header-color: #91bff0;
--color-data-integrity: #dfdddd;
--color-data-limits: #f2f2f2;
--color-data-access: #dfdddd;
--color-data-transfer: #f2f2f2;
--color-data-compression: #dfdddd;
--border-color: black;
--text-color: black;
--header-border-weight: 2px;
--border-weight: 2px;
}

/* Dark theme variables */
html[data-theme="dark"] {
--header-color: var(--pst-color-surface);
--color-data-integrity: var(--pst-color-on-background);
--color-data-limits: var(--pst-color-surface);
--color-data-access: var(--pst-color-on-background);
--color-data-transfer: var(--pst-color-surface);
--color-data-compression: var(--pst-color-on-background);
--border-color: #CFB87C;
--text-color: white;
--header-border-weight: 0.1px;
--border-weight: 0.1px;
}


.arbiter-table {
border-collapse: collapse;
width: 90%;
font-family: sans-serif;
/* Reapply necessary styles explicitly: */
border: 1px solid #ccc;
}

.arbiter-table th {
background-color: var(--header-color); /* light blue for headers */
border: var(--header-border-weight) solid var(--border-color);
padding: 8px;
font-size: 22px;
color: var(--text-color);
}

.arbiter-table tr.category-login-nodes td {
background-color: var(--color-data-integrity);
border: var(--border-weight) solid var(--border-color);
padding: 8px;
text-align: left;
color: var(--text-color);
font-size: 18px;
}

.arbiter-table td.tier-cell {
text-align: center !important;
}

/* Status Badges */
.badge {
display: inline-block;
padding: 2px 8px;
border-radius: 12px;
font-size: 11px;
font-weight: bold;
}
.normal { background: #dafbe1; color: black; border: 1px solid #bc8; }
.p1 { background: #fff8c5; color: black; border: 1px solid #ee0; }
.p2 { background: #fbac77; color: black; border: 1px solid #ff6600; }
.p3 { background: #ffebe9; color: black; border: 1px solid #f99; }

</style>

<table class="arbiter-table">
<thead>
<tr style="text-align: center;">
<th>Host</th>
<th>Penalty State</th>
<th>Resource usage maximum</th>
<th>Penalty occurrences</th>
<th>Duration of Penalty</th>
</tr>
</thead>
<tbody>
<!-- Login nodes -->
<tr class="category-login-nodes">
<td rowspan="4">login nodes</td>
<td class="tier-cell"><span class="badge normal">normal</span></td>
<td class="tier-cell">4 CPU cores <br> 4GB of RAM</td>
<td class="tier-cell">N/A</td>
<td class="tier-cell">N/A</td>
</tr>
<tr class="category-login-nodes">
<td class="tier-cell"><span class="badge p1">penalty1</span></td>
<td class="tier-cell">3.2 CPU cores <br> 3.2GB of RAM</td>
<td class="tier-cell">1</td>
<td class="tier-cell">30 minutes</td>
</tr>
<tr class="category-login-nodes">
<td class="tier-cell"><span class="badge p2">penalty2</span></td>
<td class="tier-cell">2 CPU cores <br> 2GB of RAM</td>
<td class="tier-cell">2</td>
<td class="tier-cell">1 hour</td>
</tr>
<tr class="category-login-nodes">
<td class="tier-cell"><span class="badge p3">penalty3</span></td>
<td class="tier-cell">1.2 CPU cores <br> 1.2GB of RAM</td>
<td class="tier-cell">3 or more</td>
<td class="tier-cell">2 hours</td>
</tr>
</tbody>
</table>
24 changes: 0 additions & 24 deletions docs/petalibrary/images_and_html/pl_tier_table.html
@@ -9,30 +9,6 @@
}

/* Light theme variables */
/* html[data-theme="light"] {
--header-color: #91bff0;
--color-data-integrity: #f2f2f2;
--color-data-limits: #dfdddd;;
--color-data-access: #adabab;
--color-data-transfer: #908f8f;
--color-data-compression: #656464;
--border-color: black;
--text-color: black;
--header-border-weight: 2px;
--border-weight: 2px;
} */
/* html[data-theme="light"] {
--header-color: #91bff0;
--color-data-integrity: #f2f2f2;
--color-data-limits: #dfdddd;
--color-data-access: #f2f2f2;
--color-data-transfer: #dfdddd;
--color-data-compression: #f2f2f2;
--border-color: black;
--text-color: black;
--header-border-weight: 2px;
--border-weight: 2px;
} */
html[data-theme="light"] {
--header-color: #91bff0;
--color-data-integrity: #dfdddd;
56 changes: 56 additions & 0 deletions graphviz_flowcharts/dot_files/arbiter_flowchart.dot
@@ -0,0 +1,56 @@
digraph "" {
bgcolor="transparent";
graph [id="doc-flowchart", rankdir=TB, nodesep=0.7, ranksep=0.8, bgcolor="none", splines=ortho];

// Node Styling
node [fontname="Verdana", fontsize="16", color="#CFB87C", style="filled", fillcolor="#121212", penwidth="2", fontcolor="white", shape=ellipse];
edge [color="#CFB87C", fillcolor="#121212", penwidth="1.5", fontsize="14", fontcolor="#CFB87C"];

// --- ROW 0: Top row (occurrences timer, badness reduction, below-threshold check) ---
{ rank=source; DoSomething; BadnessRed; BehFix;}

// --- ROW 1: The Start ---
{ rank=same; NS; Exceed; Badness; }

// --- ROW 2: Branching Outcomes ---
{ rank=same; TooBad; }

// --- ROW 3: Post-Penalty Actions ---
{ rank=same; DurationEnded; SentEmail; DurationTimer; }

// Nodes
NS [label="normal state"]
Exceed [label="Usage thresholds\nexceeded?", style="filled,dashed"]
DoSomething [label="If penalty occurrences > 0,\nstart penalty occurrences timer;\n at zero, reduce penalty occurrences\nby 1, repeating until zero.", fontsize="13"]
Badness [label="Badness accumulates"]
BehFix [label="Below usage\nthresholds?", style="filled,dashed"]
TooBad [label="Has badness\nreached 100?", style="filled,dashed"]
SentEmail [label="Warning email sent, penalty\noccurrences increases by 1,\n penalty state assigned,\nand resources throttled"]
DurationTimer [label="Penalty duration\ntimer counts down"]
DurationEnded [label="Has the penalty\nduration ended?", style="filled,dashed"]
BadnessRed [label="Badness reduces"]

// Flow Logic
NS -> Exceed
NS -> DoSomething

// Use constraint=false to point UP to the top row without pulling the chart apart
Exceed -> NS [xlabel=" No", constraint=false]
Exceed -> Badness [label="Yes"]

Badness -> BehFix
Badness -> TooBad

BehFix -> Badness [xlabel=" No"]
BehFix -> BadnessRed [label="Yes", constraint=false]
TooBad -> Badness [xlabel="No "]
TooBad -> SentEmail [xlabel="Yes "]

SentEmail -> DurationTimer [constraint=false]
DurationTimer -> DurationEnded [constraint=false]

DurationEnded -> NS [xlabel="Yes "]
DurationEnded -> DurationTimer [xlabel=" No", constraint=false]

BadnessRed -> Exceed
}