build(deps): bump tensorflow from 2.4.3 to 2.10.0 in /examples/tests #82

Closed
dependabot[bot] wants to merge 1 commit into master from dependabot/pip/examples/tests/tensorflow-2.10.0

dependabot[bot] commented on behalf of github on Sep 7, 2022

Bumps tensorflow from 2.4.3 to 2.10.0.

Release notes

Sourced from tensorflow's releases.

TensorFlow 2.10.0

Release 2.10.0

Breaking Changes

  • Causal attention in keras.layers.Attention and keras.layers.AdditiveAttention is now specified in the call() method via the use_causal_mask argument (rather than in the constructor), for consistency with other layers.
  • Some files in tensorflow/python/training have been moved to tensorflow/python/tracking and tensorflow/python/checkpoint. Please update your imports accordingly; the old files will be removed in Release 2.11.
  • tf.keras.optimizers.experimental.Optimizer will graduate in Release 2.11, which means tf.keras.optimizers.Optimizer will be an alias of tf.keras.optimizers.experimental.Optimizer. The current tf.keras.optimizers.Optimizer will continue to be supported as tf.keras.optimizers.legacy.Optimizer, e.g., tf.keras.optimizers.legacy.Adam. Most users won't be affected by this change, but please check the API doc to see whether any API used in your workflow has changed or been deprecated, and adapt accordingly. If you decide to keep using the old optimizer, please explicitly change your optimizer to tf.keras.optimizers.legacy.Optimizer.
  • RNG behavior change for tf.keras.initializers. Keras initializers will now use stateless random ops to generate random numbers.
    • Both seeded and unseeded initializers will always generate the same values every time they are called (for a given variable shape). For unseeded initializers (seed=None), a random seed will be created and assigned at initializer creation (different initializer instances get different seeds).
    • An unseeded initializer will raise a warning if it is reused (called) multiple times. This is because it would produce the same values each time, which may not be intended.
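
The initializer behavior described above can be illustrated with a small stand-alone sketch. This is not TensorFlow code: the class name `StatelessInitializerSketch` and its internals are hypothetical, and Python's `random` module stands in for the stateless random ops. It only mirrors the stated semantics: a seed is fixed at creation, repeated calls with the same shape return the same values, and reusing an unseeded instance raises a warning.

```python
import random
import warnings


class StatelessInitializerSketch:
    """Toy stand-in for a Keras initializer (hypothetical, not the TF API)."""

    def __init__(self, seed=None):
        self._user_seeded = seed is not None
        # An unseeded instance picks its own seed once, at creation time,
        # so different instances get different seeds.
        self._seed = seed if seed is not None else random.SystemRandom().randrange(2**31)
        self._calls = 0

    def __call__(self, shape):
        self._calls += 1
        if not self._user_seeded and self._calls > 1:
            # Mirrors the warning described in the release notes.
            warnings.warn("unseeded initializer reused; it returns the same values again")
        # A fresh generator per call makes the result depend only on the
        # seed and the shape, never on how often the instance was called.
        rng = random.Random(self._seed)
        rows, cols = shape
        return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


init = StatelessInitializerSketch(seed=42)
print(init((2, 3)) == init((2, 3)))
```

If distinct values are wanted for two variables, the notes imply creating two initializer instances (or passing different seeds) rather than calling one instance twice.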

Deprecations

  • The C++ tensorflow::Code and tensorflow::Status will become aliases of absl::StatusCode and absl::Status, respectively, in some future release.
    • Use tensorflow::OkStatus() instead of tensorflow::Status::OK().
    • Stop constructing Status objects from tensorflow::error::Code.
    • One MUST NOT access tensorflow::errors::Code fields. Accessing tensorflow::error::Code fields is fine.
      • Use the constructors such as tensorflow::errors::InvalidArgument to create status using an error code without accessing it.
      • Use the free functions such as tensorflow::errors::IsInvalidArgument if needed.
      • As a last resort, use e.g. static_cast<tensorflow::errors::Code>(error::Code::INVALID_ARGUMENT) or static_cast<int>(code) for comparisons.
  • tensorflow::StatusOr will also become an alias of absl::StatusOr in the future, so use StatusOr::value instead of StatusOr::ConsumeValueOrDie.

Major Features and Improvements

  • tf.lite:

    • New operations supported:
      • tflite SelectV2 now supports 5D.
      • tf.einsum is supported with multiple unknown shapes.
      • tf.unsortedsegmentprod op is supported.
      • tf.unsortedsegmentmax op is supported.
      • tf.unsortedsegmentsum op is supported.
    • Updates to existing operations:
      • tfl.scatter_nd now supports I1 for update arg.
    • Upgraded Flatbuffers from v1.12.0 to v2.0.5.
  • tf.keras:

    • EinsumDense layer is moved from experimental to core. Its import path is moved from tf.keras.layers.experimental.EinsumDense to tf.keras.layers.EinsumDense.
    • Added tf.keras.utils.audio_dataset_from_directory utility to easily generate audio classification datasets from directories of .wav files.
    • Added subset="both" support in tf.keras.utils.image_dataset_from_directory, tf.keras.utils.text_dataset_from_directory, and audio_dataset_from_directory, to be used with the validation_split argument, for returning both dataset splits at once, as a tuple.
    • Added tf.keras.utils.split_dataset utility to split a Dataset object or a list/tuple of arrays into two Dataset objects (e.g. train/test).
    • Added step granularity to BackupAndRestore callback for handling distributed training failures & restarts. The training state can now be restored at the exact epoch and step at which it was previously saved before failing.
    • Added tf.keras.dtensor.experimental.optimizers.AdamW. This optimizer is similar to the existing keras.optimizers.experimental.AdamW, and works in the DTensor training use case.
    • Improved masking support for tf.keras.layers.MultiHeadAttention.
      • Implicit masks for query, key and value inputs will automatically be used to compute a correct attention mask for the layer. These padding masks will be combined with any attention_mask passed in directly when calling the layer. This can be used with tf.keras.layers.Embedding with mask_zero=True to automatically infer a correct padding mask.
      • Added a use_causal_mask call-time argument to the layer. Passing use_causal_mask=True will compute a causal attention mask, and optionally combine it with any attention_mask passed in directly when calling the layer.

... (truncated)
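
To make the masking notes above concrete, here is a minimal pure-Python sketch of what a causal mask is and how it could be combined with a padding mask. It is an illustration only, not Keras internals: the helper names `causal_mask` and `combine_masks` are hypothetical, and masking only the key positions is a simplifying assumption.

```python
def causal_mask(length):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[j <= i for j in range(length)] for i in range(length)]


def combine_masks(causal, padding):
    """AND the causal mask with a per-key padding mask (True = real token)."""
    return [[allowed and padding[j] for j, allowed in enumerate(row)] for row in causal]


mask = causal_mask(3)
# The third token is padding, so no query may attend to it.
combined = combine_masks(mask, [True, True, False])
for row in combined:
    print(row)
```

In TensorFlow 2.10 itself, the notes state that the equivalent behavior is requested by passing use_causal_mask=True when calling the layer.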

Changelog

Sourced from tensorflow's changelog.

Release 2.10.0

Major Features and Improvements

  • tf.keras:

    • Added ignore_class argument in the loss SparseCategoricalCrossentropy and metrics IoU and MeanIoU, to specify a class index to be ignored during loss/metric computation (e.g. a background/void class).
    • Added tf.keras.models.experimental.SharpnessAwareMinimization. This class implements the sharpness-aware minimization technique, which boosts model performance on various tasks, e.g., ResNet on image classification.

... (truncated)
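
The ignore_class idea mentioned in the changelog can be sketched in plain Python. This is not the Keras implementation: the function name `sparse_ce_with_ignore` is hypothetical, and averaging over the remaining (non-ignored) elements is an assumption made for illustration; the reduction Keras applies may differ.

```python
import math


def sparse_ce_with_ignore(labels, probs, ignore_class=None):
    """Mean of -log p[label], skipping entries labeled ignore_class."""
    losses = [
        -math.log(p[y])
        for y, p in zip(labels, probs)
        if ignore_class is None or y != ignore_class
    ]
    # Assumption: average over the kept elements; return 0.0 if none remain.
    return sum(losses) / len(losses) if losses else 0.0


# Label 255 marks a void/background element that must not contribute.
labels = [0, 255, 1]
probs = [[0.5, 0.5], [0.5, 0.5], [0.25, 0.75]]
print(sparse_ce_with_ignore(labels, probs, ignore_class=255))
```

Note that the ignored label (255 here) is never used as an index, which is why a void label outside the valid class range is harmless in this sketch.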

Commits
  • 359c3cd Merge pull request #57609 from tensorflow/vinila21-patch-6
  • 724308f Update estimator and keras version in TF 2.10 branch for 2.10.0.
  • 203b333 Merge pull request #57608 from tensorflow-jenkins/version-numbers-2.10.0-28960
  • cd950ff Update version numbers to 2.10.0
  • 9b13e9e Merge pull request #57510 from tensorflow/vinila21-patch-1
  • ba47bc7 Update release notes with security updates
  • f082fa9 Merge pull request #57464 from tensorflow/r2.10-b5f6fbfba76
  • 60ed7ce Re-enable testTensorListReserveWithNonScalarNumElements to work with mlir as ...
  • 23cb0d3 Merge pull request #57460 from tensorflow/revert-57075-r2.10-e9863e9a9cb
  • f419a41 Revert "r2.10 cherry-pick: e9863e9a9cb "Fix tf.raw_ops.EmptyTensorList vulner...
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.4.3 to 2.10.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorflow@v2.4.3...v2.10.0)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot[bot] added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update Python code) labels on Sep 7, 2022
nrajanee pushed a commit that referenced this pull request Dec 1, 2022
…etermined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
nrajanee pushed a commit that referenced this pull request Dec 1, 2022
… random failures (determined-ai#539)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are conflicts now correctly
return 409 not 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
they may reply to us with the meta field we
initially gave to them. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine if SAML is enabled or not and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it but, in CI we haven't run apt-get update so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
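
A minimal sketch of the approach described above, appending relayState to the redirect URI's query string. The function name `add_relay_state` and the example URL are hypothetical; the actual change lives in the master's Go code, which is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_relay_state(redirect_uri, relay_state):
    """Return redirect_uri with a relayState query parameter appended."""
    parts = urlsplit(redirect_uri)
    # Keep any parameters already present on the redirect URI.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("relayState", relay_state))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_relay_state("https://master.example/oidc/callback?x=1", "cli=true"))
```

Encoding the value with urlencode keeps characters such as "=" inside the relay state from being mistaken for query syntax.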

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
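
The claim-to-attribute matching described above could look roughly like the following sketch. All names here (`match_user`, the `claim_name` and `scim_attribute` parameters, and the sample records) are hypothetical and chosen for illustration; the defaults reproduce the earlier hard-coded behavior of matching the `email` claim to the username.

```python
def match_user(claims, scim_users, claim_name="email", scim_attribute="userName"):
    """Return the first SCIM user whose attribute equals the chosen claim."""
    value = claims.get(claim_name)
    if value is None:
        # The token does not carry the configured claim: no match.
        return None
    for user in scim_users:
        if user.get(scim_attribute) == value:
            return user
    return None


users = [
    {"userName": "alice@example.com", "externalId": "a-1"},
    {"userName": "bob@example.com", "externalId": "b-2"},
]
claims = {"email": "alice@example.com", "sub": "b-2"}
print(match_user(claims, users))
print(match_user(claims, users, claim_name="sub", scim_attribute="externalId"))
```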

* fix: update docs skip check remote address to EE

* chore: add launcher client (determined-ai#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>

* chore: Provide Slurm job submission failure test cases (FOUNDENG-86) (determined-ai#321)

Wrote test cases for when the CircleCI integration with SLURM is implemented. Each test case launches an experiment, waits for the error, and verifies the log of the error. It also creates a new test category called e2e_slurm.

* chore: created new branch to merge with master instead of dispatcher

* chore: added .yaml test files

* fix: simplified test .yaml files and moved file location

* fix: revert devcluster-casablanca.yaml

* fix: compensate for breaking change determined-ai#4460 (determined-ai#326)

* fix: compensate for breaking change determined-ai#4460

* chore: FOUNDENG-102 Determined shows killed shells as still running (determined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* chore: dispatcher RM supports slot type ROCM (determined-ai#329)

* chore: dispatcher RM supports slot type ROCM

* chore: allow launch using podman (determined-ai#334)

* fix: Cleanup CPU-only system error reporting (FOUNDENG-117) (determined-ai#335)

Ensure that the extended error messages are reported on submission failure by expanding the pattern.

Suppress environment cleanup on LevelDebug and greater, as LevelTrace is kind of unusable due to the amount of output logged.

* chore: take agent slot type from partition config (determined-ai#336)

* chore: take agent slot type from partition config

* test: add unit tests. (FOUNDENG-71) (determined-ai#339)

* FOUNDENG-71. Add unit tests.

* test:add unit tests. (FOUNDENG-71)

* test: add coverage for ROCM. (determined-ai#340)

* refactor: make sso a plugin [DET-7560] (determined-ai#341)

* test: add unit tests (FOUNDENG-70) (determined-ai#344)

* chore: Provide a working cache_dir for slurm devcluster (determined-ai#347)

The new cache_dir master.yaml attribute defaults to /var/cache/determined
which users do not normally have access to, so provide a different
default for the tools/slurmcluster.sh script so that it works without
hacking the system.

* chore: Enhance slurmcluster.sh to support authenticated launcher. (determined-ai#349)

* chore: Enhance slurmcluster.sh to support authenticated launcher.

Add new -a option which will attempt to pull the .launcher.token
from the cluster.   If a token file exists for the cluster, it
is used by the master.

* Update slurmcluster.sh

* fix: Exported functions (e.g. which) may break experiments (FOUNDENG-145) (determined-ai#351)

Bash-exported functions are set as environment variables and by default
are inherited into singularity containers.   On some systems the which
command is configured this way and injects arguments into the which
command.  When invoked inside of a determined environment image the
which command does not support these arguments and it breaks the check
for the python3 being on the path, thus breaking most experiments.

Clear all exported functions to avoid this potential collision.
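
As an illustration of the problem, current Bash versions place each exported function in the environment under a name of the form `BASH_FUNC_<name>%%`. The sketch below filters such entries out of an environment mapping. It assumes that naming scheme and uses a hypothetical helper `strip_exported_functions`; the commit itself does not show how the clearing is implemented.

```python
def strip_exported_functions(env):
    """Return a copy of env without Bash exported-function entries."""
    # Assumption: exported functions use the BASH_FUNC_ name prefix.
    return {key: value for key, value in env.items() if not key.startswith("BASH_FUNC_")}


host_env = {
    "PATH": "/usr/bin:/bin",
    "BASH_FUNC_which%%": "() {  ( alias; declare -f ) | /usr/bin/which \"$@\"\n}",
}
print(sorted(strip_exported_functions(host_env)))
```

With the function entry removed, a `which` call inside the container resolves to the container's own binary instead of the host's wrapper.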

* chore: Compile and document the OSS dependencies of the Launcher [FOUNDENG-105] (determined-ai#354)

Added 181 licenses for OSS dependencies for the slurm launcher. Also modified gen-attributions.py to include Slurm Launcher section in the documentation.

Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>

* chore: disable dependabot in EE repo (determined-ai#358)

Dependencies are generally updated in the OSS repo.  Disabling all
dependabot updates here until there's a mechanism to selectively do so.

* chore: Specify network mode for PodMan containers over Slurm (FOUNDENG-149) (determined-ai#359)

Unlike Singularity, PodMan behaves like docker.   Set --network=host
to enable dtrain processes on specific ports.

* feat: Slurm support with Singularity or PodMan (determined-ai#361)

Document PodMan support.

* chore: FOUNDENG-134 Handle both CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES. (determined-ai#360)

* ci: slurm ci (determined-ai#342)

This change adds basic CI for slurm and enables a few tests. After this, we'll work on enabling more tests and enabling GPU runners.

* chore: launch config for image cache & capabilities (determined-ai#363)

* Launch config for image cache & capabilities

* chore: configuration item resources.devices (determined-ai#369)

* chore: configuration item resources.devices

* feat: Enable det shell start with podman (FOUNDENG-152) (determined-ai#371)

* feat: Enable explicit port management with podman (FOUNDENG-150) (determined-ai#373)

* chore: FOUNDENG-120 convert SLURM ResourcePool request to goroutine to improve response time for determined master. (determined-ai#352)

* fix: e2e_test test_slurm.py test_node_not_available fails on CPU based cluster (Mosaic) due to different Error output (FOUNDENG-132)  (determined-ai#364)

[FOUNDENG-132]
Added extra error logs to account for differences with clusters with no GPUs. Also updated test_docker_login to include checks for error logs that are due to docker download rate limitations.

* chore: add flag to avoid overlapping resource pool requests (determined-ai#377)

* refactor: solidify rm interface (ee)

* chore: Update slurmcluster.sh to handle new casablanca-mgmt1 configuration (determined-ai#380)

The system casablanca has been updated with the name casablanca now pointing to casablanca-login instead of casablanca-mgmt1 as it used to.  Update the script to fix casablanca to fully work on casablanca-login.

* chore: temp fix for failing slurm tests (determined-ai#381)

* fix: skip DET special ports in config podman port mappings (FE-163) (determined-ai#379)

* chore: fix issues with locking and unlocking mutexes (determined-ai#382)

* test: Enable logging tests on slurm (determined-ai#367)

Enable some tests within test_logging.py for slurm.
Fix the cluster.utils agents url, and properly
reference the agents dict within the response.

* test: Add mosaic to slurmcluster to enable testing (determined-ai#384)

Add mosaic cluster as an option to slurmcluster.sh to enable
test validation there and debugging.

* test: Enable test case on slurm (FOUNDENG-171) (determined-ai#385)

Enable test_pytorch_const_warm_start.
Did not enable test_pytorch_load because it uses mnist_pytorch/const-pytorch11.yaml
which forces /tmp to be shared which causes permission problems across multiple
users.

* test: add test coverage for pkg/tasks/dispatcher_task.go. (determined-ai#386)

* test: Add slurmcluster.sh support for -d and cluster osprey (determined-ai#387)

Enable use of slurmcluster.sh with launchers started by the
new loadDevLaucher.sh script.

Add configuration for osprey & swan cluster.

* test: enable test_pytorch_lighting.py for e2e_slurm. (determined-ai#388)

* test: Add test cluster raptor config in slurmcluster.sh (determined-ai#390)

Add to the script to enable testing with raptor.

* chore: determine type of WLM on the cluster (determined-ai#392)

* chore: determine type of WLM on the cluster

* test: enable test_launch for e2e_slurm. (determined-ai#389)

* test: Add default /etc/hosts bind mount in slurmcluster.sh (determined-ai#395)

Podman V4.0 no longer maps /etc/hosts into the container, which means that none of the admin/login node names or compute node names can be resolved.  We are adding that to the default master.yaml configured on install with the launcher.  This adds the same /etc/hosts setup in slurmcluster.sh so that when using podman (-p) it maps /etc/hosts by default.

* test: update slurmcluster.sh: casablanca-login2 (determined-ai#396)

* feat: Add support for PBS (FOUNDENG-187) (determined-ai#397)

Add search list of both Slurm & PBS carriers to enable dynamic
selection of whichever is available on the target system.
Updated unit tests to expect two carriers.

* fix: provide better message when failed to fetch resource pool details. (determined-ai#398)

* chore: FOUNDENG-126 Enhance Determined container prep logic to work for PBS (determined-ai#399)

* test: Add support for Grenoble Slurm test system (determined-ai#401)

Add new option for test system o184i023.
Include option for non-default slot_type.
Correct the protocol when -d is specified to be http
even if the default for the system is https.

* chore: PBS awareness in resource pools. (determined-ai#402)

* chore: PBS awareness in resource pools.

* fix: Workaround rocm-smi python issue (FOUNDENG-127) (determined-ai#403)

rocm-smi does not work within a singularity container when the host is RHEL
and the container is Ubuntu.   This is a workaround to that incompatibility.

* chore: relax Slurm jobs do not require gres. (determined-ai#404)

* chore: set PBS resource pool properties (determined-ai#408)

* fix: Allow over-mounting of /tmp/work (FOUNDENG-205) (determined-ai#410)

With Singularity /tmp is removed and re-linked to a user directory
to avoid the default host-wide share /tmp and provide more space than
the limited Singularity tmpfs space (10mb).  Make the removal of /tmp
handle injected sub-directories by bind mounts, by detecting the
error and reporting an ERROR message instead of failing.

Also add a FATAL error message giving context if an error in the shell
script is terminated due to non-zero exit (set -e).

* chore: fix bugs wrt backoff package in logging scripts (determined-ai#405)

* fix: remove slurm-resources-info file on job cleanup (determined-ai#411)

* fix: remove slurm-resource-info file on job cleanup

* fix: remove slurm-resources-info file on job cleanup

* fix: Properly pass along PBS queue to launcher (FE-202) (determined-ai#413)

Small fix to pass along the resource_pool name as the PBS/Slurm
queue (partition is only supported by Slurm, but PBS/Slurm both
support queue).

* chore: config pbs resource manager type. (determined-ai#416)

* fix: Drop unused resource tracking data (FOUNDENG-215) (determined-ai#419)

DispatcherRM has been maintaining resource mapping data that has been unused since it migrated into the DB.   Drop the fields we do not need.

* chore: support custom experiment config for PBS args (determined-ai#423)

* chore: reload auth token on authorization error  (determined-ai#418)

* FOUNDENG-209. Reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error

* feat: add EE portion of RBAC (determined-ai#415)

Co-authored-by: Max Russell <max.russell@hpe.com>

* feat: permission summary API and permissions + precanned roles (determined-ai#426)

* chore: attempt calculation of RM SlotsPerAgent (determined-ai#425)

* chore: attempt calculation of RM SlotsPerAgent

* chore: Enhancement to slurmcluster.sh (determined-ai#430)

Usability tweaks to improve robustness.
- If tunnels are to be started, terminate any existing non-interactive sshd processes
  for the user, which should be from older hung tunnels.
- If -a is unable to retrieve a token (e.g. CTRL/C to abort it), leave the
  existing token intact, instead of destroying it with an empty value.
- Add a short sleep to enable tunnels to stabilize before starting devcluster.
  On occasion they are not ready and it causes a spurious failure.
- Remove conf for interns.

* chore: upgrade bad user agent messaging (determined-ai#431)

* chore: upgrade bad user agent messaging

* fix: Generalize launcher prefixes for PBS (determined-ai#432)

Some messages include Slurm/PBS and the carrier name.   Generalize the regex to allow either Slurm/PBS so that message processing will be handled similarly.
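
The generalized prefix match described above can be sketched as follows. This is a minimal illustration, not the launcher's actual pattern: the message shape (`<WLM> job <id> ...`) and the names `WLM_PREFIX` and `parse_wlm_message` are assumptions made for the example.

```python
import re

# Hypothetical message shape: "<WLM> job <id> ...", where WLM is Slurm or PBS.
# The alternation is the point: one pattern accepts either workload manager.
WLM_PREFIX = re.compile(r"^(?P<wlm>Slurm|PBS) job (?P<job_id>\S+)")


def parse_wlm_message(message):
    """Return (wlm, job_id) when the message starts with a Slurm/PBS prefix, else None."""
    match = WLM_PREFIX.match(message)
    if match is None:
        return None
    return match.group("wlm"), match.group("job_id")
```

With a single alternation, messages from either workload manager take the same processing path instead of needing one pattern per carrier.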

* fix: Avoid using multiple carriers on failure we retry (FOUNDENG-232) (determined-ai#433)

We have different carriers for Slurm/PBS, but if we list them both
and the user job fails, it tries the next.   Use the dispatcherRM
wlmType to specify the carrier in use to avoid this fallback.
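
A rough sketch of the selection logic, assuming the workload manager type is available as a string. The carrier identifiers and the fallback for an unknown type are hypothetical; the real launcher names are not given in this log.

```python
# Hypothetical carrier identifiers, used only for illustration.
CARRIERS = {"slurm": "slurm-carrier", "pbs": "pbs-carrier"}


def carriers_for(wlm_type):
    """Return exactly one carrier for a known WLM type.

    Listing a single carrier means a failed user job is not retried
    on the other one. For an unknown type this sketch falls back to
    the full search list (an assumption, not stated in the commit).
    """
    key = wlm_type.lower()
    if key in CARRIERS:
        return [CARRIERS[key]]
    return list(CARRIERS.values())
```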

* chore: Ensure CUDA_VISIBLE_DEVICES is represented as a comma-separated list of simple numbers (determined-ai#420)

* fix: nil ptr on protoing and workspace permission missing from viewer (determined-ai#438)

* fix: Initial prep_container should fail quickly (FOUNDENG-217) (determined-ai#435)

Fix the exception raised when the master is not reachable (MasterNotFoundException).
APIHttpError only happens when a call completes without a successful status
response.

The initial prep_container was intended to fail reasonably quickly
to enable diagnostics of misconfiguration.   Recent changes
for re-using the common session settings (DET-8003) have caused the initial
communication to retry for more than 30 mins, thus defeating the
original intent of prep_container.trial_prep and the error message
it provides.

This change lowers the session retry in prep_container (6 retries with 0.5
backoff -> 64 seconds) to enable the diagnostic message to be posted reasonably quickly.

* feat: Reconnect to Slurm jobs on startup (FOUNDENG-215) (determined-ai#429)

We previously terminated running jobs upon a master restart.
Now that Determined core supports re-attaching to jobs, do the
same for DispatchRM.

- Change configure to indicate that DispatcherRM supports reattach
- Handle allocation messages with Restore:true
- Fail any allocations on Restore:true if the Dispatch ID is missing.
- Handle the case where we no longer have the payload name
  which was lost in the restart.   Ask the launcher for it in the
  very rare case where the job was started but fails before the
  rendezvous and we need the payload name to retrieve the logs.
- When in debug mode, defer dispatch cleanup until the next restart.
  On restart terminate all dispatches.

* feat: rbac authz experiments api [DET-8207] (determined-ai#434)

* feat: add RBAC implementation of workspace authz (determined-ai#436)

* fix: Wait for termination before deleting dispatch (FOUNDENG-217) (determined-ai#442)

When killing a dispatch, we are not able to immediately delete the
dispatch because the files may still be in use by the running container.
Wait until we get to a terminal state before performing the delete.
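
The wait-then-delete ordering can be sketched as a small polling loop. The state names, the callables `get_state` and `delete`, and the polling limits are all assumptions for illustration; the actual DispatcherRM code is not shown here.

```python
import time

# Hypothetical terminal state names; the real set is not listed in this log.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def delete_when_terminal(get_state, delete, poll_seconds=1.0, max_polls=60,
                         sleep=time.sleep):
    """Poll get_state() and call delete() only once a terminal state is seen.

    Returns True if the dispatch was deleted, False if max_polls ran out
    while the job was still using its files.
    """
    for _ in range(max_polls):
        if get_state() in TERMINAL_STATES:
            delete()
            return True
        sleep(poll_seconds)
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real delays.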

* feat: rbac implementation for user authorization [DET-8205] (determined-ai#445)

* feat: rbac project authz implementation (determined-ai#446)

* feat: auto assign workspace admin to workspace creator [DET-8212] (determined-ai#440)

* fix: Starting state now shows Running [FOUNDENG-242] (determined-ai#447)

Once the job was queued with Slurm/PBS we triggered the Starting
state.  Prior to 19.4 this used to show as QUEUED in the UI, but
it has now changed to "RUNNING (PREPARING ENV)", which is not
accurate.   So map PENDING -> Assigned such that the UI
continues to show "QUEUED" until the job starts running.
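
As a minimal sketch, the mapping amounts to a lookup table. Only the PENDING -> Assigned entry comes from the description above; the other entry, the default, and all identifiers are assumptions for the example.

```python
# Only PENDING -> Assigned is taken from the commit description.
# The RUNNING entry and the "Unknown" default are illustrative assumptions.
WLM_TO_ALLOCATION_STATE = {
    "PENDING": "Assigned",  # still queued in Slurm/PBS, so the UI keeps showing QUEUED
    "RUNNING": "Running",
}


def allocation_state(wlm_state):
    """Translate a workload manager job state into an allocation state name."""
    return WLM_TO_ALLOCATION_STATE.get(wlm_state, "Unknown")
```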

* chore: Enhance slurmcluster.sh (determined-ai#448)

Enhance option parsing to relax arg order requirements.
Add -i arg to override the default logging level to info.

* fix: Improve PBS error reporting [FOUNDENG-248] (determined-ai#453)

Augment the error logs with the HTTP response value on failure.  The returned
error does not always have the underlying info (e.g. 404 Not Found).

Fix the patterns used for matching messages with PBS.   Add an entry for
Slurm (which was showing up previously because no messages matched).

Add filtering based upon the reporter to eliminate some noise that we never
want to see from the Dispatcher infrastructure.

* chore: update dispatcher-wrapper.sh to remove code that sets SLURM_PROCID from PBS_TASKNUM (determined-ai#454)

* feat: get workspace assigned users and groups [DET-8442] (determined-ai#444)

* chore: remove refs to pbs/slurm in environment (determined-ai#458)

* Revert "chore: remove refs to pbs/slurm in environment (determined-ai#458)" (determined-ai#461)

This reverts commit 81f64c80aea646f8c5edeb80164929d877783a80.

* chore: update err message suitable for slurm/pbs. (determined-ai#462)

* chore: generalize message for Slurm/PBS. (determined-ai#463)

* feat: ee support for agent user group settings per workspace. (determined-ai#460)

* chore: remove refs to pbs/slurm in environment (determined-ai#465)

* chore: consume experiment PBS & Slurm batch args (determined-ai#472)

* fix: Add export PATH for PBS Carrier [FOUNDENG-266] (determined-ai#474)

We minimally need the path to be inherited into the PBS
script job such that singularity run can successfully pull
an image.   It needs /usr/sbin/ on the path, but PBS
apparently doesn't inherit the system path or any such reasonable
path.  This change allows inheritance of all environment
variables to cover PATH, and anything else the launcher may
have added to their environ (PATH, LD_LIBRARY_PATH, etc).

* ci: auto-deploy `latest-ee-gke`. (determined-ai#467)

* feat: echo auth for ee (determined-ai#479)

* chore: assign cluster admin to 'admin' for new clusters (determined-ai#477)

* feat: Add CAN_EDIT_WEBHOOKS permission to pre-canned admin role [WEB-218] (determined-ai#471)

* feat: RBAC authz for user groups [DET-8477] (determined-ai#473)

* chore: Fix build break due to unused import (determined-ai#486)

Drop unused imports.

* fix: deal with some lint (determined-ai#491)

* fix: FOUNDENG-283 Determined UI Resource Pools page incorrectly shows CPU usage (determined-ai#490)

* fix: Correct quoting in error message (determined-ai#492)

We have a custom bash error handler that runs if any command returns
a non-zero status.  Fix the quoting and spelling so that it actually works.

* fix: Restore det shell on podman [FOUNDENG-280] (determined-ai#493)

When running rootless podman, inside the container we are
root/uid=0 and that maps to the user account outside the
container.   All is fine until we attempt to ssh into the
container which actually then uses the launching username/uid.
Under normal circumstances /run/determined/ssh has only 0600
permissions, so only the owning user can read it, but with podman
root maps to the user/uid, so the launching user is not seen
as having access to the files.

Until we find a better solution, dynamically relax the permissions
to be a+x on the /run/determined/ssh directory path such that
the user can read /run/determined/ssh/authorized_keys and enable
ssh into the container to work properly.

Additionally, drop use of the podman --hostuser arg, as it doesn't help
the situation and we already provide the launching user in a custom
passwd entry.
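
The permission relaxation can be sketched in Python with `os.chmod`. This only illustrates adding a+x to each directory on a path; the helper name and its `root`/`leaf` parameters are hypothetical, and the actual change may well live in a shell script instead.

```python
import os
import stat


def add_execute_along_path(root, leaf):
    """Add a+x to every directory from leaf up to root (inclusive).

    Execute permission on a directory allows traversal, which a user
    needs before any file inside it can be opened.
    """
    root = os.path.abspath(root)
    leaf = os.path.abspath(leaf)
    if os.path.commonpath([root, leaf]) != root:
        raise ValueError("leaf must be located under root")
    current = leaf
    while True:
        mode = stat.S_IMODE(os.stat(current).st_mode)
        os.chmod(current, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        if current == root:
            break
        current = os.path.dirname(current)
```

For the case above, the call would be something like `add_execute_along_path("/run/determined", "/run/determined/ssh")`; existing read and write bits are left untouched.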

* chore: ee lint fixes and implement added authz method (determined-ai#497)

* fix: remove =true from sso url querystring (determined-ai#494)

* chore: support slots per node (determined-ai#500)

* fix: add `ON DELETE CASCADE` for `role_assignments.group_id` column (determined-ai#501)

* feat: RBAC authz for RBAC [DET-8206] [DET-8368] (determined-ai#480)

* fix: searching roles results in 500 error (determined-ai#503)

* fix: PodMan map user to UID and GID to 0 in passwd [FOUNDENG-300] (determined-ai#504)

In rootless PodMan the user executes as uid/gid 0:0 inside the container
which maps to the actual launching user outside the container.  If
the entry point user is 'root' then map the agent user to 0:0 in
/run/determined/etc/passwd such that outside the container the access
is seen as the launching user.

/run/determined/etc/passwd contains a single line (written by Determined)
to represent the agent user.
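
For illustration, the single passwd line could be built as below. The seven colon-separated fields follow the standard passwd(5) layout; the function names, the default home directory, and the default shell are assumptions, not values taken from Determined.

```python
def passwd_line(username, uid, gid, home="/tmp", shell="/bin/bash", gecos=""):
    """Format one passwd(5) entry: name:password:UID:GID:GECOS:home:shell."""
    return f"{username}:x:{uid}:{gid}:{gecos}:{home}:{shell}"


def agent_passwd_entry(username, uid, gid, entrypoint_user_is_root):
    """Map the agent user to 0:0 when the entry point user is root.

    Under rootless podman, 0:0 inside the container corresponds to the
    launching user outside it, so access is attributed to that user.
    """
    if entrypoint_user_is_root:
        uid, gid = 0, 0
    return passwd_line(username, uid, gid)
```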

* feat: make list groups roles and list users roles return assignment info (determined-ai#498)

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#510)

The test is queueing instead of getting the expected error message on the mosaic
slurm cluster.  Need to resolve before re-enabling.

* chore: Add sawmill test system to slurmcluster.sh (determined-ai#511)

Add config for sawmil and detect systems that do not have installed launchers, and indicate that -d is required.

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#512)

Additionally rename the disabled test_node_not_available, to avoid
warnings about a test without an annotation.

* chore: experiment log show Slurm/PBS job ID. (determined-ai#502)

* fix: add sso login routes to list of echo routes that don't require auth (determined-ai#509)

Co-authored-by: Addison Snelling <asnell@hpe.com>

* chore: Add node atlas to slurmcluster.sh (determined-ai#513)

Enable testing with another data center system.

* feat: Fully support apptainer fork of singularity [FOUNDENG-292] (determined-ai#507)

Apptainer 1.0 is a fork of Singularity 3.8.  Reduce use of SINGULARITY_* variables.
hpe-hpc-launcher 3.1.4 supports capabilities and cached bypass.   --no-mount=tmp
has been the default for a bit, so not explicitly needed.

We retain the use of the SINGULARITY_DOCKER* and add APPTAINER_DOCKER*
for creds as there is no CLI option alternative.   Adding the APPTAINER_* version
eliminates warnings.

* fix: get group 500 error for rbac can't access case [DET-8588, DET-8589] (determined-ai#506)

* chore: log error on insufficient launcher version. (determined-ai#508)

* fix: redirect to cli relay on det auth login (determined-ai#519)

* refactor: rbac: move from `is_global` to scope type masks [DET-8569] (determined-ai#515)

* fix: 500 error for workspace membership without perms (determined-ai#525)

* test: update expected error messages. (determined-ai#526)

* chore: rbac refactor authorization code (determined-ai#527)

* chore: add checkpoint storage permission (determined-ai#518)

* fix: allow workspace viewers to view roles in webui. (determined-ai#530)

* chore: Fix test_node_not_available test [FOUNDENG-304] (determined-ai#517)

When scheduling CPUs (unlike GPUs), test_node_not_available
will submit a job that will sit pending forever due to lack of resources.
This is happening on mosaic (our Slurm runner system today)
so skip the test if no GPUs are available.

Also limit the tests' wait time for slurm failure test cases to 600s
(5min) to avoid the default wait of 30 mins, which avoids blocking up the gate
excessively on a test failure.

* test: disable restart on expected failure case. (determined-ai#528)

* chore: make authz_rbac workspaces return PermissionDeniedError (determined-ai#521)

* fix: FOUNDENG-303 Pausing, then resuming an experiment fails (determined-ai#533)

* ci: fix incorrect image name (determined-ai#535)

* ci: fix incorrect image names
* update a comment at the same time

* chore: Add additional configuration options in slurmcluster.sh (determined-ai#537)

Add the capability to set a default image, task_container_defaults, and
partition_overrides in the master.yaml.

Add configuration for sawmil to make grizzly nodes cuda
and provide a default image, and MPI settings.

Eliminate the need for the CLUSTERS list to be manually updated by
just checking for the cluster configuration directly.

Indicate the generated master.yaml file name to simplify debugging
when injecting multiline options.

* ci: autorebase PRs on master force push [INFENG-122] (determined-ai#532)

* fix: FOUNDENG-310 test_noop_pause_hpc needs timeout increase to avoid random failures

* Test still randomly fails. Passed first time, failed second time. Increased timeout to 420 seconds just to see what happens.

* Increased the overall timeout to 20 minutes

* fixed some merge issues

Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: Bradley Laney <bradlaney@determined.ai>
Co-authored-by: Neil Conway <neil@determined.ai>
Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: brian <brian@determined.ai>
Co-authored-by: Danny Zhu <dzhu@determined.ai>
Co-authored-by: Brian Friedenberg <12980763+brain-good@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@gmail.com>
Co-authored-by: Caleb Kang <caleb@determined.ai>
Co-authored-by: Ilia Glazkov <ilia@determined.ai>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: CharlesTran1 <69864849+CharlesTran1@users.noreply.github.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>
Co-authored-by: Danny Sauer <danny.sauer@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Addison Snelling <asnell@hpe.com>
nrajanee pushed a commit that referenced this pull request Dec 13, 2022
…etermined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
nrajanee pushed a commit that referenced this pull request Dec 13, 2022
… random failures (determined-ai#539)

* chore: update Go dependency files

This is a sort of artificial commit used to concentrate all of the
go.{mod,sum} changes for EE into one place to hopefully reduce
conflicts.

* feat: add support for SCIM provisioning

This commit brings the SCIM support for v0.11.2 back from the dead with
minimal changes. The only new fixes/features are that conflicts now correctly
return 409 instead of 500, password sync is supported, externalId is not
required, and the username/password that the IdP uses for basic auth
when talking to us is configurable.

* chore: remove Apache V2 license text

The enterprise edition is not licensed under Apache V2, so remove the
license text to avoid confusion.

* fix: ignore SCIM meta field

When we receive a PATCH or PUT from Okta,
it may send back the meta field we
initially gave it. This is ok; it is in
spec to receive and ignore it.

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add support for SAML

To meet customer needs, we need to provide support for users to
authenticate with SAML. The commit adds /saml/sso and /saml/initiate to
receive SAML responses and initiate SP-initiated flow with SAML
requests, respectively. It also adds SSO providers to /info for the
frontend to determine whether SAML is enabled, and the ability to
configure SAML via master.yaml.

* feat: add support for automatically storing creds

As a Determined user, I don't want to paste a token every time I want to
auth. This adds support for, when auth'ing from the CLI, redirecting to
localhost and automatically storing the auth token.

* ci: modify CI/CD steps for EE

We avoid running jobs that will publish our product to public
repositories. This currently means that we cannot test cloud deployments
in an automated manner. However, none of our EE features are related to
cloud deployments so this should not be a problem.

* ci: enable EE releases with CircleCI

* chore: enable SCIM for test-e2e (#15)

* chore: run apt-get update for deps of master.deb (#16)

When we install ee via a deb, apt-get tries to resolve the named deps of
it but, in CI we haven't run apt-get update so we can't find it.

* feat: support acting as an OAuth 2.0 authorization server (#10)

This adds support for the OAuth 2.0 authorization code flow to the
master. Apart from the core flow itself, HTTP endpoints are provided for
managing the set of OAuth client applications (accessible only by admin
users).

Since we're only addressing one particular scenario at the moment, we
allow only a single client application to be registered, since that's
all we need for this use case and that simplifies management a bit.

* feat: enable OAuth protection for SCIM endpoints (#10)

* ci: add -ee to tags (#29)

* ci: use make publish for releasing Docker images (#36)

(cherry picked from commit a5de11b5cc87e848d0e81601d15038dd4f44d7e1)

* fix: remove det-deploy local from ee ci (#53)

* chore: increase refresh token exp (#60) [DET-4754]

* chore: remove sessions for deprovisioned users (#61) [DET-4844]

* ci: skip a bunch of testing and deployment for EE

* fix: extract master URL correctly from resource pools (#64)

With the introduction of resource pools, the provisioner is no longer a top level field. Update code so that we pull the MasterURL from the resource pools' provider field.

* fix: add ssoProviders to default store info for ee

* chore: do not publish model_hub package in EE

* fix: update master GoReleaser config to account for new binary

* chore: rebrand to HPE MLDE

Basic renaming of Docker images, docs bucket, etc. Also clearly remove
any implication of publishing to NVCR.

* chore: rebrand docs content and theme for HPE before publishing

* ci: add SCIM configuration to devcluster test configurations

Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* feat: add OIDC support (#84)

* feat: track EE in analytics (#86)

* chore: det_cloud specific configuration (#93)

* fix: allow CLI to log in via OIDC (#92)

* Show OIDC in sso_providers, if configured

Previously the OIDC configuration was not displayed in the /info
endpoint as a member of sso_providers, even if enabled. This meant that
the CLI could not detect it and use it to log in.

* Pass relayState through OIDC auth flow

In the SAML flow, relayState is passed through the authentication flow
via redirect binding. This was naively copied into OIDC, but as it isn't
part of the OIDC standard it didn't work.  Here we simply add the
relayState to the redirect URI's query string.
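
Appending a value to a redirect URI's query string can be sketched with the standard library. The query key `relayState` and the example URLs are assumptions for illustration; the log does not state the exact parameter name used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_relay_state(redirect_uri, relay_state):
    """Return redirect_uri with relayState set in its query string.

    Existing query parameters are preserved, and any earlier relayState
    value is replaced rather than duplicated.
    """
    parts = urlsplit(redirect_uri)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "relayState"]
    query.append(("relayState", relay_state))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because the value travels in the redirect URI itself, it survives the round trip through the identity provider without relying on anything outside the OIDC standard.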

* feat: OIDC on helm (#105)

* feat: allow oidc secret to be set from env variable

* allow oidc to be configured from helm

* chore: build rpm/deb/archive packages for EE releases

Along the way, fixup release tooling to be more consistent about EE agent
builds. Previously, the release tooling built a binary called
`determined-ee-agent`; this was renamed to `determined-agent` for Docker images
in the Dockerfile. Instead, call the binary `determined-agent` so that native
packages and Docker images use the same name.

* chore: update CircleCI config to publish native packages for EE

Also revert a few diffs between the config for EE vs OSS that don't seem
intentional/warranted.

* feat: allow OIDC to authenticate users using arbitrary claims

Previously, we hard-coded authentication by matching the `email` field
in the OIDC claim to the username provided through SCIM; now we allow
matching any claim from the OIDC token to any attribute from SCIM to
perform authentication.
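
A minimal sketch of claim-to-attribute matching, assuming claims and SCIM users are available as plain dictionaries. The function name, the data shapes, and the choice to reject ambiguous matches are assumptions made for the example, not the master's actual behaviour.

```python
def find_user_by_claim(claims, users, claim_name, attribute_name):
    """Return the single user whose SCIM attribute equals the chosen OIDC claim.

    Returns None when the claim is absent, when no user matches, or when
    more than one user matches (an ambiguous match is treated as a failure).
    """
    value = claims.get(claim_name)
    if value is None:
        return None
    matches = [user for user in users if user.get(attribute_name) == value]
    return matches[0] if len(matches) == 1 else None
```

The previously hard-coded behaviour corresponds to matching the `email` claim against the SCIM username attribute; the claim and attribute names are now parameters.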

* fix: update docs skip check remote address to EE

* chore: add launcher client (determined-ai#209)

* feat: slurm support (#98)

This change adds a dispatcher resource manager, which gives Determined the ability to run on top of Slurm.

Co-authored-by: rcorujo <90728398+rcorujo@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>

* chore: Provide Slurm job submission failure test cases (FOUNDENG-86) (determined-ai#321)

Wrote test cases for when the CircleCI integration with SLURM is implemented. Each test case launches an experiment, waits for the error, and verifies the log of the error. It also creates a new test category called e2e_slurm.

* chore: created new branch to merge with master instead of dispatcher

* chore: added .yaml test files

* fix: simplified test .yaml files and moved file location

* fix: revert devcluster-casablanca.yaml

* fix: compensate for breaking change determined-ai#4460 (determined-ai#326)

* fix: compensate for breaking change determined-ai#4460

* chore: FOUNDENG-102 Determined shows killed shells as still running (determined-ai#327)

* fix: respect password sync in PUT requests

When the SCIM client attempts to sync a password in the PUT request, as
in the case where a user updates or has their password reset, we should
respect it (because okta sends it).

* feat: add SaaS access control enforcement for Determined EE clusters (#82)

* chore: FOUNDENG-102 Determined shows killed shells as still running

* Added locking to the monitoredJobs

Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>

* chore: dispatcher RM supports slot type ROCM (determined-ai#329)

* chore: dispatcher RM supports slot type ROCM

* chore: allow launch using podman (determined-ai#334)

* fix: Cleanup CPU-only system error reporting (FOUNDENG-117) (determined-ai#335)

Ensure that the extended error messages are reported on submission failure by expanding the pattern.

Suppress environment cleanup on LevelDebug and greater, as LevelTrace is kind of unusable due to the amount of output logged.

* chore: take agent slot type from partition config (determined-ai#336)

* chore: take agent slot type from partition config

* test: add unit tests. (FOUNDENG-71) (determined-ai#339)

* FOUNDENG-71. Add unit tests.

* test:add unit tests. (FOUNDENG-71)

* test: add coverage for ROCM. (determined-ai#340)

* refactor: make sso a plugin [DET-7560] (determined-ai#341)

* test: add unit tests (FOUNDENG-70) (determined-ai#344)

* chore: Provide a working cache_dir for slurm devcluster (determined-ai#347)

The new cache_dir master.yaml attribute defaults to /var/cache/determined
which users do not normally have access to, so provide a different
default for the tools/slurmcluster.sh script so that it works without
hacking the system.

* chore: Enhance slurmcluster.sh to support authenticated launcher. (determined-ai#349)

* chore: Enhance slurmcluster.sh to support authenticated launcher.

Add new -a option which will attempt to pull the .launcher.token
from the cluster.   If a token file exists for the cluster, it
is used by the master.

* Update slurmcluster.sh

* fix: Exported functions (e.g. which) may break experiments (FOUNDENG-145) (determined-ai#351)

Bash-exported functions are set as environment variables and by default
are inherited into singularity containers.   On some systems the which
command is configured this way and injects arguments into the which
command.  When invoked inside of a determined environment image the
which command does not support these arguments and it breaks the check
for the python3 being on the path, thus breaking most experiments.

Clear all exported functions to avoid this potential collision.
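
To illustrate the mechanism: recent bash versions export a function such as `which` as an environment variable named `BASH_FUNC_which%%`, while older versions used the plain function name with a value starting with `() {`. The sketch below filters both forms out of an environment mapping. It is an illustration in Python only; the helper name is hypothetical and the actual fix presumably lives in the shell wrapper.

```python
def without_exported_functions(env):
    """Return a copy of env without bash-exported function definitions.

    Covers the BASH_FUNC_ naming scheme as well as the legacy encoding,
    where the value itself begins with "() {".
    """
    return {
        name: value
        for name, value in env.items()
        if not name.startswith("BASH_FUNC_") and not value.startswith("() {")
    }
```

A cleaned mapping like this could be passed as the `env` argument of `subprocess.run` so that the container never sees the host's function definitions.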

* chore: Compile and document the OSS dependencies of the Launcher [FOUNDENG-105] (determined-ai#354)

Added 181 licenses for OSS dependencies for the slurm launcher. Also modified gen-attributions.py to include Slurm Launcher section in the documentation.

Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>

* chore: disable dependabot in EE repo (determined-ai#358)

Dependencies are generally updated in the OSS repo.  Disabling all
dependabot updates here until there's a mechanism to selectively do so.

* chore: Specify network mode for PodMan containers over Slurm (FOUNDENG-149) (determined-ai#359)

Unlike Singularity, PodMan behaves like docker.   Set --network=host
to enable dtrain processes on specific ports.

* feat: Slurm support with Singularity or PodMan (determined-ai#361)

Document PodMan support.

* chore: FOUNDENG-134 Handle both CUDA_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES. (determined-ai#360)

* ci: slurm ci (determined-ai#342)

This change adds basic CI for slurm and enables a few tests. After this, we'll work on enabling more tests and enabling GPU runners.

* chore: launch config for image cache & capabilities (determined-ai#363)

* Launch config for image cache & capabilities

* chore: configuration item resources.devices (determined-ai#369)

* chore: configuration item resources.devices

* feat: Enable det shell start with podman (FOUNDENG-152) (determined-ai#371)

* feat: Enable explicit port management with podman (FOUNDENG-150) (determined-ai#373)

* chore: FOUNDENG-120 convert SLURM ResourcePool request to goroutine to improve response time for determined master. (determined-ai#352)

* fix: e2e_test test_slurm.py test_node_not_available fails on CPU based cluster (Mosaic) due to different Error output (FOUNDENG-132)  (determined-ai#364)

[FOUNDENG-132]
Added extra error logs to account for differences with clusters with no GPUs. Also updated test_docker_login to include checks for error logs that are due to docker download rate limitations.

* chore: add flag to avoid overlapping resource pool requests (determined-ai#377)

* refactor: solidify rm interface (ee)

* chore: Update slurmcluster.sh to handle new casablanca-mgmt1 configuration (determined-ai#380)

The system casablanca has been updated with the name casablanca now pointing to casablanca-login instead of casablanca-mgmt1 as it used to.  Update the script to fix casablanca to fully work on casablanca-login.

* chore: temp fix for failing slurm tests (determined-ai#381)

* fix: skip DET special ports in config podman port mappings (FE-163) (determined-ai#379)

* chore: fix issues with locking and unlocking mutexes (determined-ai#382)

* test: Enable logging tests on slurm (determined-ai#367)

Enable some tests within test_logging.py for slurm.
Fix the cluster.utils agents url, and properly
reference the agents dict within the response.

* test: Add mosaic to slurmcluster to enable testing (determined-ai#384)

Add mosaic cluster as an option to slurmcluster.sh to enable
test validation there and debugging.

* test: Enable test case on slurm (FOUNDENG-171) (determined-ai#385)

Enable test_pytorch_const_warm_start.
Did not enable test_pytorch_load because it uses mnist_pytorch/const-pytorch11.yaml
which forces /tmp to be shared which causes permission problems across multiple
users.

* test: add test coverage for pkg/tasks/dispatcher_task.go. (determined-ai#386)

* test: Add slurmcluster.sh support for -d and cluster osprey (determined-ai#387)

Enable use of slurmcluster.sh with launchers started by the
new loadDevLaucher.sh script.

Add configuration for osprey & swan cluster.

* test: enable test_pytorch_lighting.py for e2e_slurm. (determined-ai#388)

* test: Add test cluster raptor config in slurmcluster.sh (determined-ai#390)

Add to the script to enable testing with raptor.

* chore: determine type of WLM on the cluster (determined-ai#392)

* chore: determine type of WLM on the cluster

* test: enable test_launch for e2e_slurm. (determined-ai#389)

* test: Add default /etc/host bind mount in slurmcluster.sh (determined-ai#395)

Podman V4.0 no longer maps /etc/hosts into the container, which means that none of the admin/login node names or compute node names can be resolved.  We are adding that to the default master.yaml configured on install with the launcher.  This adds the same /etc/hosts setup in slurmcluster.sh so that when using podman (-p) it maps /etc/hosts by default.

* test: update slurmcluster.sh: casablanca-login2 (determined-ai#396)

* feat: Add support for PBS (FOUNDENG-187) (determined-ai#397)

Add search list of both Slurm & PBS carriers to enable dynamic
selection of whichever is available on the target system.
Updated unit tests to expect two carriers.

* fix: provide better message when failed to fetch resource pool details. (determined-ai#398)

* chore: FOUNDENG-126 Enhance Determined container prep logic to work for PBS (determined-ai#399)

* test: Add support for Grenoble Slurm test system (determined-ai#401)

Add new option for test system o184i023.
Include option for non-default slot_type.
Correct the protocol when -d is specified to be http
even if the default for the system is https.

* chore: PBS awareness in resource pools. (determined-ai#402)

* chore: PBS awareness in resource pools.

* fix: Workaround rocm-smi python issue (FOUNDENG-127) (determined-ai#403)

rocm-smi does not work within a singularity container when the host is RHEL
and the container is Ubuntu.   This is a workaround to that incompatibility.

* chore: relax Slurm jobs do not require gres. (determined-ai#404)

* chore: set PBS resource pool properties (determined-ai#408)

* fix: Allow over-mounting of /tmp/work (FOUNDENG-205) (determined-ai#410)

With Singularity /tmp is removed and re-linked to a user directory
to avoid the default host-wide shared /tmp and provide more space than
the limited Singularity tmpfs space (10 MB).  Make the removal of /tmp
handle sub-directories injected by bind mounts, by detecting the
error and reporting an ERROR message instead of failing.

Also add a FATAL error message giving context when the shell
script is terminated due to a non-zero exit (set -e).

* chore: fix bugs wrt backoff package in logging scripts (determined-ai#405)

* fix: remove slurm-resources-info file on job cleanup (determined-ai#411)

* fix: remove slurm-resource-info file on job cleanup

* fix: remove slurm-resources-info file on job cleanup

* fix: Properly pass along PBS queue to launcher (FE-202) (determined-ai#413)

Small fix to pass along the resource_pool name as the PBS/Slurm
queue (partition is only supported by Slurm, but PBS/Slurm both
support queue).

* chore: config pbs resource manager type. (determined-ai#416)

* fix: Drop unused resource tracking data (FOUNDENG-215) (determined-ai#419)

DispatcherRM has been maintaining resource mapping data that has been unused since it was migrated into the DB.   Drop the fields we do not need.

* chore: support custom experiment config for PBS args (determined-ai#423)

* chore: reload auth token on authorization error  (determined-ai#418)

* FOUNDENG-209. Reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error.

* chore: reload auth token on authorization error

* feat: add EE portion of RBAC (determined-ai#415)

Co-authored-by: Max Russell <max.russell@hpe.com>

* feat: permission summary API and permissions + precanned roles (determined-ai#426)

* chore: attempt calculation of RM SlotsPerAgent (determined-ai#425)

* chore: attempt calculation of RM SlotsPerAgent

* chore: Enhancement to slurmcluster.sh (determined-ai#430)

Usability tweaks to improve robustness.
- If tunnels are to be started, terminate any existing non-interactive sshd processes
  for the user, which should be from older hung tunnels.
- If -a is unable to retrieve a token (e.g. CTRL/C to abort it), leave the
  existing token intact, instead of destroying it with an empty value.
- Add a short sleep to enable tunnels to stabilize before starting devcluster.
  On occasion they are not ready and it causes a spurious failure.
- Remove conf for interns.

* chore: upgrade bad user agent messaging (determined-ai#431)

* chore: upgrade bad user agent messaging

* fix: Generalize launcher prefixes for PBS (determined-ai#432)

Some messages include Slurm/PBS and the carrier name.   Generalize the regex to allow either Slurm or PBS so that message processing will be handled similarly.

* fix: Avoid using multiple carriers on failure we retry (FOUNDENG-232) (determined-ai#433)

We have different carriers for Slurm/PBS, but if we list them both
and the user job fails, it tries the next.   Use the dispatcherRM
wlmType to specify the carrier in use to avoid this fallback.

* chore: Ensure CUDA_VISIBLE_DEVICES is represented as a comma separated list of simple numbers (determined-ai#420)

* fix: nil ptr on protoing and workspace permission missing from viewer (determined-ai#438)

* fix: Initial prep_container should fail quickly (FOUNDENG-217) (determined-ai#435)

Fix the exception raised when the master is not reachable (MasterNotFoundException).
APIHttpError only happens when a call completes without a successful status
response.

The initial prep_container was intended to fail reasonably quickly
to enable diagnostics of misconfiguration.   Recent changes
for re-using the common session (DET-8003) settings have caused the initial
communication to retry for more than 30 mins, thus defeating the
original intent of prep_container.trial_prep and the error message
it provides.

This change lowers the session retry in prep_container (6 retries with 0.5
backoff -> 64 seconds) to enable the diagnostic message to be posted reasonably quickly.
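
As a rough check of the arithmetic above, the sketch below sums an exponential backoff schedule. The formula used (the n-th retry waits `factor * 2**n` seconds) and the helper name `backoff_delays` are assumptions for illustration only; the retry policy actually used by the session may compute its delays differently.

```python
from typing import List


def backoff_delays(retries: int, factor: float) -> List[float]:
    """Return the wait before each retry under an ASSUMED schedule.

    Assumption (not taken from the source): retry n waits factor * 2**n
    seconds, for n = 1..retries.
    """
    return [factor * 2 ** n for n in range(1, retries + 1)]


delays = backoff_delays(retries=6, factor=0.5)
total = sum(delays)
print(delays)  # per-retry waits in seconds
print(total)   # cumulative wait in seconds
```

Under this assumed schedule the cumulative wait is about a minute, which is in line with the roughly 64 seconds quoted above and far below the 30 minutes seen before the change.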

* feat: Reconnect to Slurm jobs on startup (FOUNDENG-215) (determined-ai#429)

We previously terminated running jobs upon a master restart.
Now that Determined core supports re-attaching to jobs, do the
same for DispatcherRM.

- Change configure to indicate that DispatcherRM supports reattach
- Handle allocation messages with Restore:true
- Fail any allocations on Restore:true if the Dispatch ID is missing.
- Handle the case where we no longer have the payload name
  which was lost in the restart.   Ask the launcher for it in the
  very rare case where the job was started but fails before the
  rendezvous and we need the payload name to retrieve the logs.
- When in debug mode, defer dispatch cleanup until the next restart.
  On restart terminate all dispatches.

* feat: rbac authz experiments api [DET-8207] (determined-ai#434)

* feat: add RBAC implementation of workspace authz (determined-ai#436)

* fix: Wait for termination before deleting dispatch (FOUNDENG-217) (determined-ai#442)

When killing a dispatch, we are not able to immediately delete the
dispatch because the files may still be in use by the running container.
Wait until we get to a terminal state before performing the delete.

* feat: rbac implementation for user authorization [DET-8205] (determined-ai#445)

* feat: rbac project authz implementation (determined-ai#446)

* feat: auto assign workspace admin to workspace creator [DET-8212] (determined-ai#440)

* fix: Starting state now shows Running [FOUNDENG-242] (determined-ai#447)

Once the job was queued with Slurm/PBS we triggered the Starting
state.  Prior to 19.4 this used to show as QUEUED in the UI, but
now has changed to "RUNNING (PREPARING ENV)" which is not
accurate.   So map PENDING -> Assigned such that the UI
continues to show "QUEUED" until the job starts running.

* chore: Enhance slurmcluster.sh (determined-ai#448)

Enhance option parsing to relax arg order requirements.
Add -i arg to override the default logging level to info.

* fix: Improve PBS error reporting [FOUNDENG-248] (determined-ai#453)

Augment the error logs with the HTTP response value on failure.  The returned
error does not always have the underlying info (e.g. 404 Not Found).

Fix the patterns used for matching messages with PBS.   Add an entry for
Slurm (which was showing up previously because no messages matched).

Add filtering based upon the reporter to eliminate some noise that we never
want to see from the Dispatcher infrastructure.

* chore: update dispatcher-wrapper.sh to remove code that sets SLURM_PROCID from PBS_TASKNUM (determined-ai#454)

* feat: get workspace assigned users and groups [DET-8442] (determined-ai#444)

* chore: remove refs to pbs/slurm in environment (determined-ai#458)

* Revert "chore: remove refs to pbs/slurm in environment (determined-ai#458)" (determined-ai#461)

This reverts commit 81f64c80aea646f8c5edeb80164929d877783a80.

* chore: update err message suitable for slurm/pbs. (determined-ai#462)

* chore: generalize message for Slurm/PBS. (determined-ai#463)

* feat: ee support for agent user group settings per workspace. (determined-ai#460)

* chore: remove refs to pbs/slurm in environment (determined-ai#465)

* chore: consume experiment PBS & Slurm batch args (determined-ai#472)

* fix: Add export PATH for PBS Carrier [FOUNDENG-266] (determined-ai#474)

We minimally need the path to be inherited into the PBS
script job such that singularity run can successfully pull
an image.   It needs /usr/sbin/ on the path, but PBS
apparently doesn't inherit the system path or any such reasonable
path.  This change allows inheritance of all environment
variables to cover PATH, and anything else the launcher may
have added to their environ (PATH, LD_LIBRARY_PATH, etc).

* ci: auto-deploy `latest-ee-gke`. (determined-ai#467)

* feat: echo auth for ee (determined-ai#479)

* chore: assign cluster admin to 'admin' for new clusters (determined-ai#477)

* feat: Add CAN_EDIT_WEBHOOKS permission to pre-canned admin role [WEB-218] (determined-ai#471)

* feat: RBAC authz for user groups [DET-8477] (determined-ai#473)

* chore: Fix build break due to unused import (determined-ai#486)

Drop unused imports.

* fix: deal with some lint (determined-ai#491)

* fix: FOUNDENG-283 Determined UI Resource Pools page incorrectly shows CPU usage (determined-ai#490)

* fix: Correct quoting in error message (determined-ai#492)

We have a custom bash error handler that runs if any command returns
a non-zero exit status.  Fix the quoting and spelling so that it actually works.

* fix: Restore det shell on podman [FOUNDENG-280] (determined-ai#493)

When running rootless podman, inside the container we are
root/uid=0 and that maps to the user account outside the
container.   All is fine until we attempt to ssh into the
container which actually then uses the launching username/uid.
Under normal circumstances /run/determined/ssh has only 0600
permissions for only the owning user to read, but with podman
root maps to the user/uid, so the launching user is not seen
as having access to the files.

Until we find a better solution, dynamically relax the permissions
to be a+x on the /run/determined/ssh directory path such that
the user can read /run/determined/ssh/authorized_keys and enable
ssh into the container to work properly.

Additionally, drop use of the podman --hostuser arg, as it doesn't help
the situation and we already provide the launching user in a custom
passwd entry.
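
The permission relaxation described above can be sketched as follows. This is an illustration only, not the change made in the commit: the helper name `relax_dir_path` and its arguments are hypothetical, and the sketch only adds the execute (traverse) bit for user, group, and other on each directory along the path.

```python
import os
import stat
from pathlib import Path

# Execute/traverse bit for user, group, and other (the "a+x" in chmod terms).
A_PLUS_X = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def relax_dir_path(leaf: str, top: str) -> None:
    """Add a+x to `leaf` and every parent directory up to and including `top`.

    Hypothetical helper: `top` must be `leaf` itself or one of its ancestors.
    """
    leaf_path = Path(leaf).resolve()
    top_path = Path(top).resolve()
    chain = [leaf_path, *leaf_path.parents]
    if top_path not in chain:
        raise ValueError(f"{top} is not an ancestor of {leaf}")
    for directory in chain:
        mode = stat.S_IMODE(directory.stat().st_mode)
        os.chmod(directory, mode | A_PLUS_X)  # keep existing bits, add a+x
        if directory == top_path:
            break
```

A call such as `relax_dir_path("/run/determined/ssh", "/run/determined")` would correspond to the directory path named above; existing permission bits are preserved and only the traverse bits are added.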

* chore: ee lint fixes and implement added authz method (determined-ai#497)

* fix: remove =true from sso url querystring (determined-ai#494)

* chore: support slots per node (determined-ai#500)

* fix: add `ON DELETE CASCADE` for `role_assignments.group_id` column (determined-ai#501)

* feat: RBAC authz for RBAC [DET-8206] [DET-8368] (determined-ai#480)

* fix: searching roles results in 500 error (determined-ai#503)

* fix: PodMan map user to UID and GID to 0 in passwd [FOUNDENG-300] (determined-ai#504)

In rootless PodMan the user executes as uid/gid 0:0 inside the container
which maps to the actual launching user outside the container.  If
the entry point user is 'root' then map the agent user to 0:0 in
/run/determined/etc/passwd such that outside the container the access
is seen as the launching user.

/run/determined/etc/passwd contains a single line (written by Determined)
to represent the agent user.
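
A minimal sketch of the mapping described above, assuming the standard seven-field passwd format (`name:password:UID:GID:GECOS:home:shell`). The function name and parameters are hypothetical and are not taken from the Determined code base.

```python
def agent_passwd_line(
    agent_user: str,
    uid: int,
    gid: int,
    home: str,
    shell: str,
    entrypoint_is_root: bool,
) -> str:
    """Build the single passwd line that represents the agent user.

    When the container entry point runs as root (rootless podman), the agent
    user is mapped to 0:0 so that access outside the container is seen as
    the launching user.
    """
    if entrypoint_is_root:
        uid, gid = 0, 0
    # "x" is the conventional placeholder for the password field.
    return f"{agent_user}:x:{uid}:{gid}::{home}:{shell}\n"


print(agent_passwd_line("alice", 1001, 1001, "/home/alice", "/bin/bash", True), end="")
```

With `entrypoint_is_root=False` the original uid and gid are written unchanged.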

* feat: make list groups roles and list users roles return assignment info (determined-ai#498)

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#510)

The test is queueing instead of getting the expected error message on the mosaic
slurm cluster.  Need to resolve before re-enabling.

* chore: Add sawmill test system to slurmcluster.sh (determined-ai#511)

Add config for sawmill and detect systems that do not have installed launchers, and indicate that -d is required.

* chore: Disable test_node_not_available [FOUNDENG-304] (determined-ai#512)

Additionally rename the disabled test_node_not_available, to avoid
warnings about a test without an annotation.

* chore: experiment log show Slurm/PBS job ID. (determined-ai#502)

* fix: add sso login routes to list of echo routes that don't require auth (determined-ai#509)

Co-authored-by: Addison Snelling <asnell@hpe.com>

* chore: Add node atlas to slurmcluster.sh (determined-ai#513)

Enable testing with another data center system.

* feat: Fully support apptainer fork of singularity [FOUNDENG-292] (determined-ai#507)

Apptainer 1.0 is a fork of Singularity 3.8.  Reduce use of SINGULARITY_* variables.
hpe-hpc-launcher 3.1.4 supports capabilities and cached bypass.   --no-mount=tmp
has been the default for a bit, so not explicitly needed.

We retain the use of the SINGULARITY_DOCKER* and add APPTAINER_DOCKER*
for creds as there is no CLI option alternative.   Adding the APPTAINER_* version
eliminates warnings.
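
The credential handling described above amounts to setting each value under both prefixes. The sketch below illustrates that idea; the helper name `registry_cred_env` is hypothetical, and the `*_DOCKER_USERNAME` / `*_DOCKER_PASSWORD` variable names are assumed to be the ones meant by `SINGULARITY_DOCKER*` and `APPTAINER_DOCKER*`.

```python
from typing import Dict


def registry_cred_env(username: str, password: str) -> Dict[str, str]:
    """Return registry credentials under both the SINGULARITY_ and APPTAINER_ prefixes.

    Setting both lets either runtime find the credentials; the APPTAINER_
    copies are what avoid the warnings mentioned above.
    """
    env: Dict[str, str] = {}
    for prefix in ("SINGULARITY", "APPTAINER"):
        env[f"{prefix}_DOCKER_USERNAME"] = username
        env[f"{prefix}_DOCKER_PASSWORD"] = password
    return env


# Placeholder values for illustration only.
env = registry_cred_env("someuser", "s3cret")
print(sorted(env))
```

The returned dictionary could then be merged into the environment of the process that launches the container.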

* fix: get group 500 error for rbac can't access case [DET-8588, DET-8589] (determined-ai#506)

* chore: log error on insufficient launcher version. (determined-ai#508)

* fix: redirect to cli relay on det auth login (determined-ai#519)

* refactor: rbac: move from `is_global` to scope type masks [DET-8569] (determined-ai#515)

* fix: 500 error for workspace membership without perms (determined-ai#525)

* test: update expected error messages. (determined-ai#526)

* chore: rbac refactor authorization code (determined-ai#527)

* chore: add checkpoint storage permission (determined-ai#518)

* fix: allow workspace viewers to view roles in webui. (determined-ai#530)

* chore: Fix test_node_not_available test [FOUNDENG-304] (determined-ai#517)

When scheduling CPUs (unlike GPUs), test_node_not_available
will submit a job that will sit pending forever due to lack of resources.
This is happening on mosaic (our Slurm runner system today)
so skip the test if no GPUs are available.

Also put a limit on the test's wait time for slurm failure test cases to 600s
(5min) to avoid the default wait of 30 mins, which avoids blocking up the gate
excessively on a test failure.

* test: disable restart on expected failure case. (determined-ai#528)

* chore: make authz_rbac workspaces return PermissionDeniedError (determined-ai#521)

* fix: FOUNDENG-303 Pausing, then resuming an experiment fails (determined-ai#533)

* ci: fix incorrect image name (determined-ai#535)

* ci: fix incorrect image names
* update a comment at the same time

* chore: Add additional configuration options in slurmcluster.sh (determined-ai#537)

Add the capability to set a default image, task_container_defaults, and
partition_overrides in the master.yaml.

Add configuration for sawmill to make grizzly nodes cuda
and provide a default image, and MPI settings.

Eliminate the need for the CLUSTERS list to be manually updated by
just checking for the cluster configuration directly.

Indicate the generated master.yaml file name to simplify debugging
when injecting multiline options.

* ci: autorebase PRs on master force push [INFENG-122] (determined-ai#532)

* fix: FOUNDENG-310 test_noop_pause_hpc needs timeout increase to avoid random failures

* Test still randomly fails. Passed first time, failed second time. Increased timeout to 420 seconds just to see what happens.

* Increased the overall timeout to 20 minutes

* fixed some merge issues

Co-authored-by: Danny Zhu <dzhu@hpe.com>
Co-authored-by: Bradley Laney <bradlaney@determined.ai>
Co-authored-by: Neil Conway <neil@determined.ai>
Co-authored-by: Bradley Laney <bradley.laney@gmail.com>
Co-authored-by: brian <brian@determined.ai>
Co-authored-by: Danny Zhu <dzhu@determined.ai>
Co-authored-by: Brian Friedenberg <12980763+brain-good@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@users.noreply.github.com>
Co-authored-by: Armand McQueen <armandmcqueen@gmail.com>
Co-authored-by: Caleb Kang <caleb@determined.ai>
Co-authored-by: Ilia Glazkov <ilia@determined.ai>
Co-authored-by: Philip Norman <philipnrmn@users.noreply.github.com>
Co-authored-by: Sean Mackrory <mackrory@determined.ai>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Eric <31023784+eecsliu@users.noreply.github.com>
Co-authored-by: Hamid Zare <12127420+hamidzr@users.noreply.github.com>
Co-authored-by: Phillip Gaisford <phillip.gaisford@hpe.com>
Co-authored-by: phillip-gaisford <98362331+phillip-gaisford@users.noreply.github.com>
Co-authored-by: Jerry J. Harrow <84593277+jerryharrow@users.noreply.github.com>
Co-authored-by: Jagadeesh Madagundi <jagadeesh545@gmail.com>
Co-authored-by: CharlesTran1 <69864849+CharlesTran1@users.noreply.github.com>
Co-authored-by: CanmingCobble <107056780+CanmingCobble@users.noreply.github.com>
Co-authored-by: NicholasBlaskey <nick.blaskey@hpe.com>
Co-authored-by: Cameron Quilici <cameron.quilici@hpe.com>
Co-authored-by: Danny Sauer <danny.sauer@hpe.com>
Co-authored-by: Bradley Laney <bradley.laney@hpe.com>
Co-authored-by: Max <max.russell@hpe.com>
Co-authored-by: Ilia Glazkov <ilia.glazkov@hpe.com>
Co-authored-by: julian-determined-ai <103522725+julian-determined-ai@users.noreply.github.com>
Co-authored-by: Trent Watson <trent.watson@hpe.com>
Co-authored-by: Addison Snelling <asnell@hpe.com>

dependabot Bot commented on behalf of github Feb 24, 2023

Superseded by #131.

@dependabot dependabot Bot closed this Feb 24, 2023
@dependabot dependabot Bot deleted the dependabot/pip/examples/tests/tensorflow-2.10.0 branch February 24, 2023 16:22
Labels

dependencies Pull requests that update a dependency file python Pull requests that update Python code
