diff --git a/README.md b/README.md
index 491e15ba80e..050ce8f5061 100644
--- a/README.md
+++ b/README.md
@@ -59,8 +59,6 @@ Nomulus has the following capabilities:
     implementation that works with BIND. If you are using Google Cloud DNS,
     you may need to understand its capabilities and provide your own
     multi-[AS](https://en.wikipedia.org/wiki/Autonomous_system_\(Internet\))
     solution.
-* **[WHOIS](https://en.wikipedia.org/wiki/WHOIS)**: A text-based protocol that
-    returns ownership and contact information on registered domain names.
 * **[Registration Data Access Protocol
     (RDAP)](https://en.wikipedia.org/wiki/Registration_Data_Access_Protocol)**:
     A JSON API that returns structured, machine-readable information about
diff --git a/docs/architecture.md b/docs/architecture.md
index cb654b6b3c1..a09da323a90 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -1,153 +1,97 @@
 # Architecture

 This document contains information on the overall architecture of Nomulus on
-[Google Cloud Platform](https://cloud.google.com/). It covers the App Engine
-architecture as well as other Cloud Platform services used by Nomulus.
-
-## App Engine
-
-[Google App Engine](https://cloud.google.com/appengine/) is a cloud computing
-platform that runs web applications in the form of servlets. Nomulus consists of
-Java servlets that process web requests. These servlets use other features
-provided by App Engine, including task queues and cron jobs, as explained
-below.
-
-### Services
-
-Nomulus contains three [App Engine
-services](https://cloud.google.com/appengine/docs/python/an-overview-of-app-engine),
-which were previously called modules in earlier versions of App Engine. The
-services are: default (also called front-end), backend, and tools. Each service
-runs independently in a lot of ways, including that they can be upgraded
-individually, their log outputs are separate, and their servers and configured
-scaling are separate as well.
-
-Once you have your app deployed and running, the default service can be accessed
-at `https://project-id.appspot.com`, substituting whatever your App Engine app
-is named for "project-id". Note that that is the URL for the production instance
-of your app; other environments will have the environment name appended with a
-hyphen in the hostname, e.g. `https://project-id-sandbox.appspot.com`.
-
-The URL for the backend service is `https://backend-dot-project-id.appspot.com`
-and the URL for the tools service is `https://tools-dot-project-id.appspot.com`.
-The reason that the dot is escaped rather than forming subdomains is because the
-SSL certificate for `appspot.com` is only valid for `*.appspot.com` (no double
-wild-cards).
-
-#### Default service
-
-The default service is responsible for all registrar-facing
+[Google Cloud Platform](https://cloud.google.com/).
+
+Nomulus was originally built for App Engine, but the modern architecture now
+uses Google Kubernetes Engine (GKE) for better flexibility and control over
+networking, running as a series of Java-based microservices within GKE pods.
+
+In addition, because GKE (and standard HTTP load balancers) typically handle
+HTTP(S) traffic, Nomulus uses a custom proxy to handle the raw TCP traffic
+required for EPP (port 700). This proxy can run as a GKE sidecar or a
+standalone cluster. For more information on the proxy, see
+[the proxy setup guide](proxy-setup.md).
+
+### Workloads
+
+Nomulus contains four Kubernetes
+[workloads](https://kubernetes.io/docs/concepts/workloads/). Each workload is
+largely independent of the others, including in how it is scaled.
+
+The four workloads are referred to as `frontend`, `backend`, `console`, and
+`pubapi`.
+
+Each workload's URL is created by prefixing the name of the workload to the base
+domain, e.g. `https://pubapi.mydomain.example`.
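To make the URL scheme concrete, here is a minimal sketch (not actual Nomulus code; the helper name is invented) of how a per-workload base URL is derived from the registry's base domain:

```java
import java.util.List;

public class WorkloadUrls {

  private static final List<String> WORKLOADS =
      List.of("frontend", "backend", "console", "pubapi");

  /** Returns the HTTPS base URL for a workload, e.g. https://pubapi.mydomain.example. */
  static String urlFor(String workload, String baseDomain) {
    if (!WORKLOADS.contains(workload)) {
      throw new IllegalArgumentException("Unknown workload: " + workload);
    }
    return "https://" + workload + "." + baseDomain;
  }

  public static void main(String[] args) {
    // Prints the four base URLs for a registry whose base domain is mydomain.example.
    for (String workload : WORKLOADS) {
      System.out.println(urlFor(workload, "mydomain.example"));
    }
  }
}
```

Substituting another environment's base domain yields that environment's URLs in the same way.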
+Requests to each workload are all handled by the
+[RegistryServlet](https://github.com/google/nomulus/blob/master/core/src/main/java/google/registry/module/RegistryServlet.java).
+
+#### Frontend workload
+
+The frontend workload is responsible for all registrar-facing
 [EPP](https://en.wikipedia.org/wiki/Extensible_Provisioning_Protocol) command
-traffic, all user-facing WHOIS and RDAP traffic, and the admin and registrar web
-consoles, and is thus the most important service. If the service has any
-problems and goes down or stops servicing requests in a timely manner, it will
-begin to impact users immediately. Requests to the default service are handled
-by the `FrontendServlet`, which provides all of the endpoints exposed in
-`FrontendRequestComponent`.
-
-#### Backend service
-
-The backend service is responsible for executing all regularly scheduled
-background tasks (using cron) as well as all asynchronous tasks. Requests to the
-backend service are handled by the `BackendServlet`, which provides all of the
-endpoints exposed in `BackendRequestComponent`. These include tasks for
-generating/exporting RDE, syncing the trademark list from TMDB, exporting
-backups, writing out DNS updates, handling asynchronous contact and host
-deletions, writing out commit logs, exporting metrics to BigQuery, and many
-more. Issues in the backend service will not immediately be apparent to end
-users, but the longer it is down, the more obvious it will become that
-user-visible tasks such as DNS and deletion are not being handled in a timely
-manner.
-
-The backend service is also where scheduled and automatically invoked MapReduces
-run, which includes some of the aforementioned tasks such as RDE and
-asynchronous resource deletion. Consequently, the backend service should be
-sized to support not just the normal ongoing DNS load but also the load incurred
-by MapReduces, both scheduled (such as RDE) and on-demand (asynchronous
-contact/host deletion).
-
-#### BSA service
-
-The bsa service is responsible for business logic behind Nomulus and BSA
-functionality. Requests to the backend service are handled by the `BsaServlet`,
-which provides all of the endpoints exposed in `BsaRequestComponent`. These
-include tasks for downloading, processing and uploading BSA data.
-
-
-#### Tools service
-
-The tools service is responsible for servicing requests from the `nomulus`
-command line tool, which provides administrative-level functionality for
-developers and tech support employees of the registry. It is thus the least
-critical of the three services. Requests to the tools service are handled by the
-`ToolsServlet`, which provides all of the endpoints exposed in
-`ToolsRequestComponent`. Some example functionality that this service provides
-includes the server-side code to update premium lists, run EPP commands from the
-tool, and manually modify contacts/hosts/domains/and other resources. Problems
-with the tools service are not visible to users.
-
-The tools service also runs ad-hoc MapReduces, like those invoked via `nomulus`
-tool subcommands like `generate_zone_files` and by manually hitting URLs under
-https://tools-dot-project-id.appspot.com, like
-`/_dr/task/refreshDnsForAllDomains`.
-
-### Task queues
-
-App Engine [task
-queues](https://cloud.google.com/appengine/docs/java/taskqueue/) provide an
+traffic. If the workload has any problems or goes down, it will begin to impact
+users immediately.
+
+#### PubApi workload
+
+The PubApi (Public API) workload is responsible for all public traffic to the
+registry. In practice, this primarily consists of RDAP traffic. This is split
+into a separate workload so that public users (without authentication) will have
+a harder time impacting intra-registry or registrar-registry actions.
+
+#### Backend workload
+
+The backend workload is responsible for executing all regularly scheduled
+background tasks (using cron) as well as all asynchronous tasks.
+These include tasks for generating/exporting RDE, syncing the trademark list
+from TMDB, exporting backups, writing out DNS updates, syncing BSA data,
+generating/exporting ICANN activity data, and many more. Issues in the backend
+workload will not immediately be apparent to end users, but the longer it is
+down, the more obvious it will become that user-visible tasks such as DNS and
+deletion are not being handled in a timely manner.
+
+The backend workload is also where scheduled and automatically invoked BEAM
+pipelines run, which include some of the aforementioned tasks such as RDE.
+Consequently, the backend workload should be sized to support not just the
+normal ongoing DNS load but also the load incurred by BEAM pipelines, both
+scheduled (such as RDE) and on-demand (started by registry employees).
+
+The backend workload also supports handling of manually performed actions using
+the `nomulus` command-line tool, which provides administrative-level
+functionality for developers and tech support employees of the registry.
+
+### Cloud Tasks queues
+
+GCP's [Cloud Tasks](https://cloud.google.com/tasks/docs) provides an
 asynchronous way to enqueue tasks and then execute them on some kind of
-schedule. There are two types of queues, push queues and pull queues. Tasks in
-push queues are always executing up to some throttlable limit. Tasks in pull
-queues remain there until the queue is polled by code that is running for some
-other reason. Essentially, push queues run their own tasks while pull queues
-just enqueue data that is used by something else. Many other parts of App Engine
-are implemented using task queues. For example, [App Engine
-cron](https://cloud.google.com/appengine/docs/java/config/cron) adds tasks to
-push queues at regularly scheduled intervals, and the [MapReduce
-framework](https://cloud.google.com/appengine/docs/java/dataprocessing/) adds
-tasks for each phase of the MapReduce algorithm.
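The enqueue-now, execute-later model behind Cloud Tasks can be illustrated with a self-contained toy (plain `java.util.concurrent`, not the real Cloud Tasks API): the "request handler" schedules work and returns immediately, and the work completes in the background.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskQueueSketch {

  /**
   * Enqueues two "tasks" from a simulated request handler, which returns
   * right away; the tasks then execute after a short delay. Returns the
   * number of tasks that eventually completed.
   */
  static int runDemo() throws InterruptedException {
    ScheduledExecutorService queue = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger completed = new AtomicInteger();
    // The "handler" only schedules work; it does not wait for the work to run.
    queue.schedule(() -> { completed.incrementAndGet(); }, 50, TimeUnit.MILLISECONDS);
    queue.schedule(() -> { completed.incrementAndGet(); }, 50, TimeUnit.MILLISECONDS);
    queue.shutdown();
    queue.awaitTermination(5, TimeUnit.SECONDS);
    return completed.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("tasks completed: " + runDemo());
  }
}
```

The real system replaces the in-process executor with a durable, externally managed queue, but the decoupling of enqueue from execution is the same.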
-
-Nomulus uses a particular pattern of paired push/pull queues that is worth
-explaining in detail. Push queues are essential because App Engine's
-architecture does not support long-running background processes, and so push
-queues are thus the fundamental building block that allows asynchronous and
-background execution of code that is not in response to incoming web requests.
-However, they also have limitations in that they do not allow batch processing
-or grouping. That's where the pull queue comes in. Regularly scheduled tasks in
-the push queue will, upon execution, poll the corresponding pull queue for a
-specified number of tasks and execute them in a batch. This allows the code to
-execute in the background while taking advantage of batch processing.
-
-The task queues used by Nomulus are configured in the `cloud-tasks-queue.xml`
-file. Note that many push queues have a direct one-to-one correspondence with
-entries in `cloud-scheduler-tasks.xml` because they need to be fanned-out on a
-per-TLD or other basis (see the Cron section below for more explanation).
-The exact queue that a given cron task will use is passed as the query string
-parameter "queue" in the url specification for the cron task.
-
-Here are the task queues in use by the system. All are push queues unless
-explicitly marked as otherwise.
+schedule. Task queues are essential because the request-serving architecture is
+not suited to long-running background processes, and queues are thus the
+fundamental building block that allows asynchronous and background execution of
+code that is not in response to incoming web requests.
+
+The task queues used by Nomulus are configured in the `cloud-tasks-queue.xml`
+file. Note that many push queues have a direct one-to-one correspondence with
+entries in `cloud-scheduler-tasks-{ENVIRONMENT}.xml` because they need to be
+fanned out on a per-TLD or other basis (see the Cron section below for more
+explanation).
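The per-TLD fan-out correspondence described above can be sketched as follows; the endpoint and queue names here are illustrative only, not the actual Nomulus configuration:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TldFanoutSketch {

  /** Expands one scheduled entry into one task URL per TLD, tagged with its queue. */
  static List<String> fanOut(String endpoint, String queue, List<String> tlds) {
    return tlds.stream()
        .map(tld -> endpoint + "?queue=" + queue + "&tld=" + tld)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // One scheduler entry, two TLDs -> two enqueued tasks on the same queue.
    fanOut("/_dr/task/exampleTask", "example-queue", List.of("example", "test"))
        .forEach(System.out::println);
  }
}
```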
+The exact queue that a given cron task will use is passed as the query string
+parameter "queue" in the URL specification for the cron task.
+
+Here are the task queues in use by the system:

 * `brda` -- Queue for tasks to upload weekly Bulk Registration Data Access
-    (BRDA) files to a location where they are available to ICANN. The
-    `RdeStagingReducer` (part of the RDE MapReduce) creates these tasks at the
-    end of generating an RDE dump.
-* `dns-pull` -- A pull queue to enqueue DNS modifications. Cron regularly runs
-    `ReadDnsQueueAction`, which drains the queue, batches modifications by TLD,
-    and writes the batches to `dns-publish` to be published to the configured
-    `DnsWriter` for the TLD.
+    (BRDA) files to a location where they are available to ICANN. The RDE
+    pipeline creates these tasks at the end of generating an RDE dump.
 * `dns-publish` -- Queue for batches of DNS updates to be pushed to DNS
     writers.
-* `lordn-claims` and `lordn-sunrise` -- Pull queues for handling LORDN
-    exports. Tasks are enqueued synchronously during EPP commands depending on
-    whether the domain name in question has a claims notice ID.
+* `dns-refresh` -- Queue for reading and fanning out DNS refresh requests,
+    using the `DnsRefreshRequest` SQL table as the source of data.
 * `marksdb` -- Queue for tasks to verify that an upload to NORDN was
     successfully received and verified. These tasks are enqueued by
     `NordnUploadAction` following an upload and are executed by
     `NordnVerifyAction`.
 * `nordn` -- Cron queue used for NORDN exporting. Tasks are executed by
-    `NordnUploadAction`, which pulls LORDN data from the `lordn-claims` and
-    `lordn-sunrise` pull queues (above).
+    `NordnUploadAction`.
 * `rde-report` -- Queue for tasks to upload RDE reports to ICANN following
     successful upload of full RDE files to the escrow provider. Tasks are
     enqueued by `RdeUploadAction` and executed by `RdeReportAction`.
@@ -157,28 +101,25 @@ explicitly marked as otherwise.
 * `retryable-cron-tasks` -- Catch-all cron queue for various cron tasks that
     run infrequently, such as exporting reserved terms.
 * `sheet` -- Queue for tasks to sync registrar updates to a Google Sheets
-    spreadsheet. Tasks are enqueued by `RegistrarServlet` when changes are made
-    to registrar fields and are executed by `SyncRegistrarsSheetAction`.
+    spreadsheet, done by `SyncRegistrarsSheetAction`.

-### Cron jobs
+### Scheduled cron jobs

-Nomulus uses App Engine [cron
-jobs](https://cloud.google.com/appengine/docs/java/config/cron) to run periodic
-scheduled actions. These actions run as frequently as once per minute (in the
-case of syncing DNS updates) or as infrequently as once per month (in the case
-of RDE exports). Cron tasks are specified in `cron.xml` files, with one per
-environment. There are more tasks that run in Production than in other
-environments because tasks like uploading RDE dumps are only done for the live
-system. Cron tasks execute on the `backend` service.
+Nomulus uses [Cloud Scheduler](https://cloud.google.com/scheduler/docs) to
+run periodic scheduled actions. These actions run as frequently as once per
+minute (in the case of syncing DNS updates) or as infrequently as once per month
+(in the case of RDE exports). Cron tasks are specified in
+`cloud-scheduler-tasks-{ENVIRONMENT}.xml` files, with one per environment. There
+are more tasks that run in Production than in other environments because tasks
+like uploading RDE dumps are only done for the live system.

 Most cron tasks use the `TldFanoutAction` which is accessed via the
-`/_dr/cron/fanout` URL path. This action, which is run by the BackendServlet on
-the backend service, fans out a given cron task for each TLD that exists in the
-registry system, using the queue that is specified in the `cron.xml` entry.
-Because some tasks may be computationally intensive and could risk spiking
-system latency if all start executing immediately at the same time, there is a
-`jitterSeconds` parameter that spreads out tasks over the given number of
-seconds. This is used with DNS updates and commit log deletion.
+`/_dr/cron/fanout` URL path. This action fans out a given cron task for each TLD
+that exists in the registry system, using the queue that is specified in the XML
+entry. Because some tasks may be computationally intensive and could risk
+spiking system latency if all start executing immediately at the same time,
+there is a `jitterSeconds` parameter that spreads out tasks over the given
+number of seconds. This is used with DNS updates and commit log deletion.

 The reason the `TldFanoutAction` exists is that a lot of tasks need to be done
 separately for each TLD, such as RDE exports and NORDN uploads. It's simpler to
@@ -192,8 +133,7 @@ tasks retry in the face of transient errors.

 The full list of URL parameters to `TldFanoutAction` that can be specified in
 cron.xml is:

-* `endpoint` -- The path of the action that should be executed (see
-    `web.xml`).
+* `endpoint` -- The path of the action that should be executed.
 * `queue` -- The cron queue to enqueue tasks in.
 * `forEachRealTld` -- Specifies that the task should be run in each TLD of
     type `REAL`. This can be combined with `forEachTestTld`.
@@ -218,14 +158,14 @@ Each environment is thus completely independent.

 The different environments are specified in `RegistryEnvironment`. Most
 correspond to a separate App Engine app except for `UNITTEST` and `LOCAL`,
 which by their nature do not use real environments running in the cloud. The
-recommended naming scheme for the App Engine apps that has the best possible
-compatibility with the codebase and thus requires the least configuration is to
-pick a name for the production app and then suffix it for the other
-environments.
-E.g., if the production app is to be named 'registry-platform',
-then the sandbox app would be named 'registry-platform-sandbox'.
+recommended project naming scheme that has the best possible compatibility with
+the codebase and thus requires the least configuration is to pick a name for the
+production app and then suffix it for the other environments. E.g., if the
+production app is to be named 'registry-platform', then the sandbox app would be
+named 'registry-platform-sandbox'.

 The full list of environments supported out-of-the-box, in descending order from
-real to not, is:
+real to not-real, is:

 * `PRODUCTION` -- The real production environment that is actually running
     live TLDs. Since Nomulus is a shared registry platform, there need only ever
@@ -270,28 +210,28 @@ of experience running a production registry using this codebase.

 ## Cloud SQL

-To be filled.
+Nomulus uses [GCP Cloud SQL](https://cloud.google.com/sql) (Postgres) to store
+registry data. For more information, see the
+[DB project README file](../db/README.md).

 ## Cloud Storage buckets

 Nomulus uses [Cloud Storage](https://cloud.google.com/storage/) for bulk storage
-of large flat files that aren't suitable for Cloud SQL. These files include
-backups, RDE exports, and reports. Each bucket name must be unique across all of
-Google Cloud Storage, so we use the common recommended pattern of prefixing all
-buckets with the name of the App Engine app (which is itself globally unique).
-Most of the bucket names are configurable, but the defaults are as follows, with
-PROJECT standing in as a placeholder for the App Engine app name:
+of large flat files that aren't suitable for SQL. These files include backups,
+RDE exports, and reports. Each bucket name must be unique across all of Google
+Cloud Storage, so we use the common recommended pattern of prefixing all buckets
+with the name of the project (which is itself globally unique).
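The two naming conventions just described (environment-suffixed project ids and project-prefixed bucket names) amount to simple string concatenation; a hypothetical helper, for illustration only:

```java
public class NamingSketch {

  /** Derives a per-environment project id, e.g. registry-platform-sandbox. */
  static String projectId(String productionName, String environment) {
    return environment.equals("production")
        ? productionName
        : productionName + "-" + environment;
  }

  /** Derives a bucket name by prefixing the globally unique project id. */
  static String bucketFor(String projectId, String suffix) {
    return projectId + "-" + suffix;
  }

  public static void main(String[] args) {
    System.out.println(projectId("registry-platform", "sandbox"));
    System.out.println(bucketFor(projectId("registry-platform", "sandbox"), "billing"));
  }
}
```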
+Most of the bucket names are configurable, but the most important defaults are:

 * `PROJECT-billing` -- Monthly invoice files for each registrar.
-* `PROJECT-commits` -- Daily exports of commit logs that are needed for
-    potentially performing a restore.
+* `PROJECT-bsa` -- BSA data and output.
 * `PROJECT-domain-lists` -- Daily exports of all registered domain names per
     TLD.
 * `PROJECT-gcs-logs` -- This bucket is used at Google to store the GCS access
     logs and storage data. This bucket is not required by the Registry system,
     but can provide useful logging information. For instructions on setup, see
-    the [Cloud Storage
-    documentation](https://cloud.google.com/storage/docs/access-logs).
+    the
+    [Cloud Storage documentation](https://cloud.google.com/storage/docs/access-logs).
 * `PROJECT-icann-brda` -- This bucket contains the weekly ICANN BRDA files.
     There is no lifecycle expiration; we keep a history of all the files. This
     bucket must exist for the BRDA process to function.
@@ -301,9 +241,3 @@ PROJECT standing in as a placeholder for the App Engine app name:
     regularly uploaded to the escrow provider. Lifecycle is set to 90 days. The
     bucket must exist.
 * `PROJECT-reporting` -- Contains monthly ICANN reporting files.
-* `PROJECT.appspot.com` -- Temporary MapReduce files are stored here. By
-    default, the App Engine MapReduce library places its temporary files in a
-    bucket named {project}.appspot.com. This bucket must exist. To keep
-    temporary files from building up, a 90-day or 180-day lifecycle should be
-    applied to the bucket, depending on how long you want to be able to go back
-    and debug MapReduce problems.
diff --git a/docs/code-structure.md b/docs/code-structure.md
index 004f3bfff93..7fc9dd1dbc8 100644
--- a/docs/code-structure.md
+++ b/docs/code-structure.md
@@ -3,54 +3,46 @@
 This document contains information on the overall structure of the code, and
 how particularly important pieces of the system are implemented.
-
-## Bazel build system
-
-[Bazel](https://www.bazel.io/) is used to build and test the Nomulus codebase.
-
-Bazel builds are described using [BUILD
-files](https://www.bazel.io/versions/master/docs/build-ref.html). A directory
-containing a BUILD file defines a package consisting of all files and
-directories underneath it, except those directories which themselves also
-contain BUILD files. A package contains targets. Most targets in the codebase
-are of the type `java_library`, which generates `JAR` files, or `java_test`,
-which runs tests.
-
-The key to Bazel's ability to create reproducible builds is the requirement that
-each build target must declare its direct dependencies. Each of those
-dependencies is a target, which, in turn, must also declare its dependencies.
-This recursive description of a target's dependencies forms an acyclic graph
-that fully describes the targets which must be built in order to build any
-target in the graph.
-
-A wrinkle in this system is managing external dependencies. Bazel was designed
-first and foremost to manage builds where all code lives in a single source
-repository and is compiled from `HEAD`. In order to mesh with other build and
-packaging schemes, such as libraries distributed as compiled `JAR`s, Bazel
-supports [external target
-declarations](https://www.bazel.io/versions/master/docs/external.html#transitive-dependencies).
-The Nomulus codebase uses external targets pulled in from Maven Central, these
-are declared in `java/google/registry/repositories.bzl`. The dependencies of
-these external targets are not managed by Bazel; you must manually add all of
-the dependencies or use the
-[generate_workspace](https://docs.bazel.build/versions/master/generate-workspace.html)
-tool to do it.
-
-### Generating EAR/WAR archives for deployment
-
-There are special build target types for generating `WAR` and `EAR` files for
-deploying Nomulus to GAE.
-These targets, `zip_file` and `registry_ear_file` respectively, are used in
-`java/google/registry/BUILD`. To generate archives suitable for deployment on
-GAE:
-
-```shell
-$ bazel build java/google/registry:registry_ear
-  ...
-  bazel-genfiles/java/google/registry/registry.ear
-INFO: Elapsed time: 0.216s, Critical Path: 0.00s
-# This will also generate the per-module WAR files:
-$ ls bazel-genfiles/java/google/registry/*.war
-bazel-genfiles/java/google/registry/registry_backend.war
-bazel-genfiles/java/google/registry/registry_default.war
-bazel-genfiles/java/google/registry/registry_tools.war
-```
+## Gradle build system
+
+[Gradle](https://gradle.org/) is used to build and test the Nomulus codebase.
+
+Nomulus, for the most part, uses standard Gradle task naming for building and
+running tests, with the tasks defined in the various `build.gradle` files.
+
+Dependencies and their version restrictions are defined in the
+`dependencies.gradle` file. Within each subproject's `build.gradle` file, the
+actual dependencies used by that subproject are listed along with the type of
+dependency (e.g. implementation, testImplementation). Versions of each
+dependency are locked to avoid frequent dependency churn, with the locked
+versions stored in the various `gradle.lockfile` files. To update these
+versions, run any Gradle command (e.g. `./gradlew build`) with the
+`--write-locks` argument.
+
+### Generating WAR archives for deployment
+
+The `jetty` project is the main entry point for building the Nomulus WAR files,
+and one can use the `war` Gradle task to build the base WAR file. The
+deployment/release tooling uses Docker to deploy this WAR, in a system that is
+too Google-specific to replicate directly here.
+
+## Subprojects
+
+Within the Nomulus repository there are a few notable subprojects:
+
+* `util` contains tools that don't depend on any of our other code, e.g.
+    libraries or raw utilities.
+* `db` contains database-related code, managing the schema and
+    deployment/testing of the database.
+* `integration` contains tests to make sure that schema rollouts won't break
+    Nomulus, and that code versions and schema versions are cross-compatible.
+* `console-webapp` contains the TypeScript/HTML/CSS/Angular code for the
+    registrar console frontend.
+* `proxy` contains code for the EPP proxy, which relays port 700 requests to
+    the core EPP services.
+* `core` contains the bulk of the core Nomulus code, including request
+    handling and serving, backend tasks, actions, etc.

 ## Cursors

@@ -72,8 +64,8 @@ The following cursor types are defined:

 * **`RDE_UPLOAD`** - RDE (thick) escrow deposit upload
 * **`RDE_UPLOAD_SFTP`** - Cursor that tracks the last time we talked to the
     escrow provider's SFTP server for a given TLD.
-* **`RECURRING_BILLING`** - Expansion of `BillingRecurrence` (renew) billing events
-    into one-time `BillingEvent`s.
+* **`RECURRING_BILLING`** - Expansion of `BillingRecurrence` (renew) billing
+    events into one-time `BillingEvent`s.
 * **`SYNC_REGISTRAR_SHEET`** - Tracks the last time the registrar spreadsheet
     was successfully synced.

@@ -82,16 +74,9 @@ next timestamp at which an operation should resume processing
 and a `CursorType` that identifies which operation the cursor is associated
 with. In many cases, there are multiple cursors per operation; for instance, the
 cursors related to RDE reporting, staging, and upload are per-TLD cursors. To
 accomplish this, each
-`Cursor` also has a scope, a `Key` to which the particular
-cursor applies (this can be e.g. a `Registry` or any other `ImmutableObject` in
-the database, depending on the operation). If the `Cursor` applies to the entire
-registry environment, it is considered a global cursor and has a scope of
-`EntityGroupRoot.getCrossTldKey()`.
-
-Cursors are singleton entities by type and scope.
-The id for a `Cursor` is a
-deterministic string that consists of the websafe string of the Key of the scope
-object concatenated with the name of the name of the cursor type, separated by
-an underscore.
+`Cursor` also has a scope, a string to which the particular cursor applies (this
+can be anything, but in practice is either a TLD or `GLOBAL` for cross-TLD
+cursors). Cursors are singleton entities by type and scope.

 ## Guava

@@ -101,8 +86,7 @@ idiomatic, well-tested, and performant add-ons to the JDK.

 There are several libraries in particular that you should familiarize yourself
 with, as they are used extensively throughout the codebase:

-* [Immutable
-    Collections](https://github.com/google/guava/wiki/ImmutableCollectionsExplained):
+* [Immutable Collections](https://github.com/google/guava/wiki/ImmutableCollectionsExplained):
     Immutable collections are a useful defensive programming technique. When an
     Immutable collection type is used as a parameter type, it immediately
     indicates that the given collection will not be modified in the method.
@@ -144,11 +128,10 @@ as follows:

 * `Domain` ([RFC 5731](https://tools.ietf.org/html/rfc5731))
 * `Host` ([RFC 5732](https://tools.ietf.org/html/rfc5732))
-* `Contact` ([RFC 5733](https://tools.ietf.org/html/rfc5733))

 All `EppResource` entities use a Repository Object Identifier (ROID) as its
-unique id, in the format specified by [RFC
-5730](https://tools.ietf.org/html/rfc5730#section-2.8) and defined in
+unique id, in the format specified by
+[RFC 5730](https://tools.ietf.org/html/rfc5730#section-2.8) and defined in
 `EppResourceUtils.createRoid()`.

 Each entity also tracks a number of timestamps related to its lifecycle (in
@@ -164,12 +147,9 @@ the status of a resource at a given point in time.
 ## Foreign key indexes

-Foreign key indexes provide a means of loading active instances of `EppResource`
-objects by their unique IDs:
-
-* `Domain`: fully-qualified domain name
-* `Contact`: contact id
-* `Host`: fully-qualified host name
+`Domain` and `Host` are each foreign-keyed, meaning we often wish to query them
+by their foreign keys (fully-qualified domain name and fully-qualified host
+name, respectively).

 Since all `EppResource` entities are indexed on ROID (which is also unique, but
 not as useful as the resource's name), the `ForeignKeyUtils` provides a way to
@@ -203,10 +183,9 @@ events that are recorded as history entries, including:

 The full list is captured in the `HistoryEntry.Type` enum.

-Each `HistoryEntry` has a parent `Key`, the EPP resource that was
-mutated by the event. A `HistoryEntry` will also contain the complete EPP XML
-command that initiated the mutation, stored as a byte array to be agnostic of
-encoding.
+Each `HistoryEntry` has a reference to the single EPP resource that was mutated
+by the event. A `HistoryEntry` will also contain the complete EPP XML command
+that initiated the mutation, stored as a byte array to be agnostic of encoding.

 A `HistoryEntry` also captures other event metadata, such as the `DateTime` of
 the change, whether the change was created by a superuser, and the ID of the
@@ -215,9 +194,9 @@ registrar that sent the command.

 ## Poll messages

 Poll messages are the mechanism by which EPP handles asynchronous communication
-between the registry and registrars. Refer to [RFC 5730 Section
-2.9.2.3](https://tools.ietf.org/html/rfc5730#section-2.9.2.3) for their protocol
-specification.
+between the registry and registrars. Refer to
+[RFC 5730 Section 2.9.2.3](https://tools.ietf.org/html/rfc5730#section-2.9.2.3)
+for their protocol specification.

 Poll messages are stored by the system as entities in the database.
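As a toy model of how stored poll messages might be queried against a registrar and an event time (the record fields here are assumptions for illustration, not the real Nomulus schema):

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class PollMessageSketch {

  record PollMessage(String registrarId, Instant eventTime, String message) {}

  /** Returns the messages visible to a registrar at the given request time. */
  static List<PollMessage> visibleMessages(
      List<PollMessage> all, String registrarId, Instant now) {
    return all.stream()
        .filter(m -> m.registrarId().equals(registrarId))
        // A message is only visible once its event time has arrived.
        .filter(m -> !m.eventTime().isAfter(now))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<PollMessage> stored =
        List.of(
            new PollMessage(
                "TheRegistrar", Instant.parse("2024-01-01T00:00:00Z"), "transfer approved"),
            new PollMessage(
                "TheRegistrar", Instant.parse("2030-01-01T00:00:00Z"), "future autorenew"));
    // Only the first message is active at this request time.
    System.out.println(
        visibleMessages(stored, "TheRegistrar", Instant.parse("2024-06-01T00:00:00Z")).size());
  }
}
```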
 All poll messages have an event time at which they become active; any poll
 request before
@@ -245,8 +224,9 @@ poll messages are ACKed (and thus deleted) in `PollAckFlow`.

 ## Billing events

 Billing events capture all events in a domain's lifecycle for which a registrar
-will be charged. A `BillingEvent` will be created for the following reasons (the
-full list of which is represented by `BillingEvent.Reason`):
+will be charged. A one-time `BillingEvent` may be created for the following
+reasons (the full list of which is represented by `BillingBase.Reason`):

 * Domain creates
 * Domain renewals
@@ -254,19 +234,19 @@ full list of which is represented by `BillingEvent.Reason`):
 * Server status changes
 * Domain transfers

-A `BillingBase` can also contain one or more `BillingBase.Flag` flags that
-provide additional metadata about the billing event (e.g. the application phase
-during which the domain was applied for).
-
-All `BillingBase` entities contain a parent `VKey` to identify the
-mutation that spawned the `BillingBase`.
-
 There are 4 types of billing events, all of which extend the abstract
 `BillingBase` base class:

 * **`BillingEvent`**, a one-time billing event.
-* **`BillingRecurrence`**, a recurring billing event (used for events such as domain
-    renewals).
-* **`BillingCancellation`**, which represents the cancellation of either a `OneTime`
-    or `BillingRecurrence` billing event. This is implemented as a distinct event to
-    preserve the immutability of billing events.
+* **`BillingRecurrence`**, a recurring billing event (used for events such as
+    domain renewals).
+* **`BillingCancellation`**, which represents the cancellation of either a
+    `BillingEvent` or `BillingRecurrence` billing event. This is implemented as
+    a distinct event to preserve the immutability of billing events.
+
+A `BillingBase` can also contain one or more `BillingBase.Flag` flags that
+provide additional metadata about the billing event (e.g.
the application phase
+during which the domain was applied for).
+
+All `BillingBase` entities contain a reference to a ROID (an `EppResource`
+reference) to identify the mutation that spawned the `BillingBase`.
diff --git a/docs/configuration.md b/docs/configuration.md
index 5bc4221c187..9c49fe00727 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -2,10 +2,11 @@
There are multiple different kinds of configuration that go into getting a
working registry system up and running. Broadly speaking, configuration works in
-two ways -- globally, for the entire sytem, and per-TLD. Global configuration is
-managed by editing code and deploying a new version, whereas per-TLD
-configuration is data that lives in the database in `Tld` entities, and is
-updated by running `nomulus` commands without having to deploy a new version.
+two ways -- globally, for the entire system, and per-TLD. Global configuration
+is managed by editing code and deploying a new version, whereas per-TLD
+configuration is data that lives in the database in `Tld` entities, and
+[is updated](operational-procedures/modifying-tlds.md) without having to deploy
+a new version.

## Initial configuration

@@ -23,40 +24,14 @@ Before getting into the details of configuration, it's important to note that a
lot of configuration is environment-dependent. It is common to see `switch`
statements that operate on the current `RegistryEnvironment`, and return
different values for different environments. This is especially pronounced in
-the `UNITTEST` and `LOCAL` environments, which don't run on App Engine at all.
-As an example, some timeouts may be long in production and short in unit tests.
+the `UNITTEST` and `LOCAL` environments, which don't run on GCP at all. As an
+example, some timeouts may be long in production and short in unit tests.

See the [Architecture documentation](./architecture.md) for more details on
environments as used by Nomulus.
-## App Engine configuration
-
-App Engine configuration isn't covered in depth in this document as it is
-thoroughly documented in the [App Engine configuration docs][app-engine-config].
-The main files of note that come pre-configured in Nomulus are:
-
-* `cron.xml` -- Configuration of cronjobs
-* `web.xml` -- Configuration of URL paths on the webserver
-* `appengine-web.xml` -- Overall App Engine settings including number and type
-  of instances
-* `cloud-scheduler-tasks.xml` -- Configuration of Cloud Scheduler Tasks
-* * `cloud-tasks-queue.xml` -- Configuration of Cloud Tasks Queue
-* `application.xml` -- Configuration of the application name and its services
-
-Cron, web, and queue are covered in more detail in the "App Engine architecture"
-doc, and the rest are covered in the general App Engine documentation.
-
-If you are not writing new code to implement custom features, is unlikely that
-you will need to make any modifications beyond simple changes to
-`application.xml` and `appengine-web.xml`. If you are writing new features, it's
-likely you'll need to add cronjobs, URL paths, and task queues, and thus edit
-those associated XML files.
-
-The existing codebase is configured for running a full-scale registry with
-multiple TLDs. In order to deploy to App Engine, you will either need to
-[increase your quota](https://cloud.google.com/compute/quotas#requesting_additional_quota)
-to allow for at least 100 running instances or reduce `max-instances` in the
-backend `appengine-web.xml` files to 25 or less.
+TODO: document how to set up GKE and which configuration points need to be
+modified there
Instead, edit the environment configuration file named
-`google/registry/config/files/nomulus-config-ENVIRONMENT.yaml`, overriding only
-the options you wish to change. Nomulus ships with blank placeholders for all
-standard environments.
+`core/src/main/java/google/registry/config/files/nomulus-config-ENVIRONMENT.yaml`,
+overriding only the options you wish to change. Nomulus ships with blank
+placeholders for all standard environments.

You will not need to change most of the default settings. Here is the subset of
settings that you will need to change for all deployed environments, including
@@ -75,52 +50,65 @@ development environments. See [`default-config.yaml`][default-config] for a full
description of each option:

```yaml
-appEngine:
-  projectId: # Your App Engine project ID
-  toolsServiceUrl: https://tools-dot-PROJECT-ID.appspot.com # Insert your project ID
-  isLocal: false # Causes saved credentials to be used.
+gcpProject:
+  projectId: # Your GCP project ID
+  projectIdNumber: # The corresponding ID number, found on the home page
+  locationId: # e.g. us-central1
+  isLocal: false # Causes saved credentials to be used
+  baseDomain: # the base domain from which the registry will be served, e.g. registry.google

gSuite:
-  domainName: # Your G Suite domain name
-  adminAccountEmailAddress: # An admin login for your G Suite account
+  domainName: # Your GSuite domain name, likely the same as baseDomain above
+  adminAccountEmailAddress: # An admin login for your GSuite account
+
+auth:
+  allowedServiceAccountEmails:
+    - # a list of service account emails given access to Nomulus
+  oauthClientId: # the client ID of the Identity-Aware Proxy
+
+cloudSql:
+  jdbcUrl: # path to the Postgres server
+
```

For fully-featured production environments that need the full range of features
(e.g. RDE, correct contact information on the registrar console, etc.) you will
-need to specify more settings.
+need to specify *many* more settings.
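For orientation, here is a hypothetical filled-in override for an alpha environment. Every value below is an invented placeholder (the project, domains, emails, and JDBC URL shape all depend on your own setup); it only illustrates the structure of the file:

```yaml
# Hypothetical nomulus-config-alpha.yaml -- all values are illustrative only.
gcpProject:
  projectId: acme-registry-alpha
  projectIdNumber: 123456789012
  locationId: us-central1
  isLocal: false
  baseDomain: registry-alpha.acme.example

gSuite:
  domainName: acme.example
  adminAccountEmailAddress: admin@acme.example

auth:
  allowedServiceAccountEmails:
    - proxy@acme-registry-alpha.iam.gserviceaccount.com
  oauthClientId: 123456789012-abc123.apps.googleusercontent.com

cloudSql:
  jdbcUrl: jdbc:postgresql://localhost:5432/postgres
```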
From a code perspective, all configuration settings ultimately come through the
[`RegistryConfig`][registry-config] class. This includes a Dagger module called
`ConfigModule` that provides injectable configuration options. While most
configuration options can be changed from within the yaml config file, certain
-derived options may still need to be overriden by changing the code in this
+derived options may still need to be overridden by changing the code in this
module.

-## OAuth 2 client id configuration
+## OAuth 2 client ID configuration

-The open source Nomulus release uses OAuth 2 to authenticate and authorize
-users. This includes the `nomulus` tool when it connects to the system to
-execute commands. OAuth must be configured before you can use the `nomulus` tool
-to set up the system.
+Nomulus uses OAuth 2 to authenticate and authorize users. This includes the
+`nomulus` [command-line tool](admin-tool.md) when it connects to the system to
+execute commands as well as the
+[Identity-Aware Proxy](https://console.cloud.google.com/security/iap) used to
+authenticate standard requests. OAuth must be configured before you can use
+either system.

-OAuth defines the concept of a *client id*, which identifies the application
+OAuth defines the concept of a *client ID*, which identifies the application
which the user wants to authorize. This is so that, when a user clicks in an
OAuth permission dialog and grants access to data, they are not granting access
to every application on their computer (including potentially malicious ones),
but only to the application which they agree needs access. Each environment of
-the Nomulus system should have its own client id. Multiple installations of the
-`nomulus` tool application can share the same client id for the same
-environment.
+the Nomulus system should have its own pair of client IDs. Multiple
+installations of the `nomulus` tool application can share the same client ID for
+the same environment.
-There are three steps to configuration.
+For the Nomulus tool OAuth configuration, do the following steps:

-* **Create the client id in App Engine:** Go to your project's
+* **Create the registry tool client ID in GCP:** Go to your project's
    ["Credentials" page](https://console.developers.google.com/apis/credentials)
    in the Developer's Console. Click "Create credentials" and select "OAuth
    client ID" from the dropdown. In the create credentials window, select an
-    application type of "Desktop app". After creating the client id, copy the
-    client id and client secret which are displayed in the popup window. You may
-    also obtain this information by downloading the json file for the client id.
+    application type of "Desktop app". After creating the client ID, copy the
+    client ID and client secret which are displayed in the popup window. You may
+    also obtain this information by downloading the JSON file for the client ID.

* **Copy the client secret information to the config file:** The *client
    secret file* contains both the client ID and the client secret. Copy the
@@ -129,18 +117,21 @@ There are three steps to configuration.
    `registryTool` section. This will make the `nomulus` tool use this
    credential to authenticate itself to the system.
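For illustration, the relevant override in the environment config file might look roughly like the following; the exact key names under `registryTool` should be checked against [`default-config.yaml`][default-config], and both values below are invented placeholders:

```yaml
# Hypothetical sketch -- verify key names against default-config.yaml.
registryTool:
  clientId: 123456789012-abc123.apps.googleusercontent.com
  clientSecret: placeholder-secret-value
```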
+For IAP configuration, do the following steps:
+
+* **Create the IAP client ID:** Follow similar steps from above to create an
+    additional OAuth client ID, but using an application type of "Web
+    application". Note the client ID and secret.
+* **Enable IAP for your HTTPS load balancer:** On the
+    [IAP page](https://console.cloud.google.com/security/iap), enable IAP for
+    all of the backend services that use the same HTTPS load balancer.
+* **Use a custom OAuth configuration:** For the backend services, under the
+    "Settings" section (in the three-dot menu), enable custom OAuth and insert
+    the client ID and secret that we just created.
+* **Save the client ID:** In the configuration file, save the client ID as
+    `oauthClientId` in the `auth` section.
+
+Once these steps are taken, the `nomulus` tool and IAP will both use client IDs
+which the server is configured to accept, and authentication should succeed.
+Note that many Nomulus commands also require that the user have GCP admin
+privileges on the project in question.

## Sensitive global configuration

@@ -151,8 +142,8 @@ control mishap. We use a secret store to persist these values in a secure
manner, which is backed by the GCP Secret Manager.

The `Keyring` interface contains methods for all sensitive configuration values,
-which are primarily credentials used to access various ICANN and ICANN-
-affiliated services (such as RDE). These values are only needed for real
+which are primarily credentials used to access various ICANN and
+ICANN-affiliated services (such as RDE). These values are only needed for real
production registries and PDT environments. If you are just playing around with
the platform at first, it is OK to put off defining these values until
necessary. This allows the codebase to start and run, but of course any actions
@@ -169,16 +160,16 @@ ${KEY_NAME}`.

## Per-TLD configuration

-`Tld` entities, which are persisted to the database, are used for per-TLD
-configuration.
They contain any kind of configuration that is specific to a TLD, -such as the create/renew price of a domain name, the pricing engine -implementation, the DNS writer implementation, whether escrow exports are -enabled, the default currency, the reserved label lists, and more. The `nomulus -update_tld` command is used to set all of these options. See the -[admin tool documentation](./admin-tool.md) for more information, as well as the -command-line help for the `update_tld` command. Unlike global configuration -above, per-TLD configuration options are stored as data in the running system, -and thus do not require code pushes to update. +`Tld` entities, which are persisted to the database and stored in YAML files, +are used for per-TLD configuration. They contain any kind of configuration that +is specific to a TLD, such as the create/renew price of a domain name, the +pricing engine implementation, the DNS writer implementation, whether escrow +exports are enabled, the default currency, the reserved label lists, and more. + +To create or update TLDs, we use +[YAML files](operational-procedures/modifying-tlds.md) and the `nomulus +configure_tld` command. Because the TLDs are stored as data in the running +system, they do not require code pushes to update. [app-engine-config]: https://cloud.google.com/appengine/docs/java/configuration-files [default-config]: https://github.com/google/nomulus/blob/master/java/google/registry/config/files/default-config.yaml @@ -242,7 +233,7 @@ connectionName: your-project:us-central1:nomulus Use the `update_keyring_secret` command to update the `SQL_PRIMARY_CONN_NAME` key with the connection name. If you have created a read-replica, update the -`SQL_REPLICA_CONN_NAME` key with the replica's connection time. +`SQL_REPLICA_CONN_NAME` key with the replica's connection name. 
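As a sketch of the shape of these values (the project, region, and instance names below are hypothetical placeholders, and the exact `update_keyring_secret` invocation may differ in your build, so check `nomulus help` first):

```shell
# Cloud SQL connection names have the form PROJECT:REGION:INSTANCE.
# All names below are hypothetical placeholders.
PROJECT="acme-registry-alpha"
REGION="us-central1"
REPLICA_INSTANCE="nomulus-replica"

REPLICA_CONN_NAME="${PROJECT}:${REGION}:${REPLICA_INSTANCE}"
echo "${REPLICA_CONN_NAME}"

# Then store it in the keyring, along the lines of (flags may vary by version):
#   nomulus -e alpha update_keyring_secret --keyname SQL_REPLICA_CONN_NAME ...
```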
### Installing the Schema
@@ -334,6 +325,17 @@ $ gcloud sql connect nomulus --user=nomulus
From this, you should have a postgres prompt and be able to enter the "GRANT"
command specified above.

+### Replication and Backups
+
+We highly recommend creating a read-only replica of the database and setting the
+previously-mentioned `SQL_REPLICA_CONN_NAME` keyring value to the name of that
+replica. By doing so, you can remove some load from the primary database.
+
+We also recommend enabling
+[point-in-time recovery](https://docs.cloud.google.com/sql/docs/postgres/backup-recovery/pitr)
+for the instance, just in case something bad happens and you need to restore
+from a backup.
+
### Cloud SecretManager

You'll need to enable the SecretManager API in your project.
diff --git a/docs/install.md b/docs/install.md
index d3cfcce0c2b..dad31b6f9e1 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -6,48 +6,41 @@ This document covers the steps necessary to download, build, and deploy Nomulus.

You will need the following programs installed on your local machine:

-* A recent version of the [Java 11 JDK][java-jdk11].
-* [Google App Engine SDK for Java][app-engine-sdk], and configure aliases to the `gcloud` and `appcfg.sh` utilities (
-  you'll use them a lot).
-* [Git](https://git-scm.com/) version control system.
-* Docker (confirm with `docker info` no permission issues, use `sudo groupadd docker` for sudoless docker).
-* Python version 3.7 or newer.
-* gnupg2 (e.g. in run `sudo apt install gnupg2` in Debian-like Linuxes)
-
-**Note:** App Engine does not yet support Java 9. Also, the instructions in this
-document have only been tested on Linux. They might work with some alterations
-on other operating systems.
+* A recent version of the [Java 21 JDK][java-jdk21].
+* The [Google Cloud CLI](https://docs.cloud.google.com/sdk/docs/install-sdk)
+  (configure an alias to the `gcloud` utility, because you'll use it a lot)
+* [Git](https://git-scm.com/) version control system.
+* Docker (confirm with `docker info` that there are no permission issues; use
+  `sudo groupadd docker` for sudoless Docker).
+* Python version 3.7 or newer.
+* gnupg2 (e.g. run `sudo apt install gnupg2` on Debian-like Linuxes)
+
+**Note:** The instructions in this document have only been tested on Linux. They
+might work with some alterations on other operating systems.

## Download the codebase

-Start off by using git to download the latest version from the [Nomulus GitHub
-page](https://github.com/google/nomulus). You may checkout any of the daily
-tagged versions (e.g. `nomulus-20200629-RC00`), but in general it is also
-safe to simply checkout from HEAD:
+Start off by using git to download the latest version from the
+[Nomulus GitHub page](https://github.com/google/nomulus). You may check out any
+of the daily tagged versions (e.g. `nomulus-20260101-RC00`), but in general it
+is also safe to simply check out from HEAD:

```shell
$ git clone git@github.com:google/nomulus.git
Cloning into 'nomulus'...
[ .. snip .. ]
-$ cd nomulus
-$ ls
-apiserving CONTRIBUTORS java LICENSE scripts
-AUTHORS docs javascript python third_party
-CONTRIBUTING.md google javatests README.md WORKSPACE
```

-Most of the directory tree is organized into gradle sub-projects (see
-`settings.gradle` for details). The following other top-level directories are
+Most of the directory tree is organized into Gradle subprojects (see
+`settings.gradle` for details). The following other top-level directories are
also defined:

-* `buildSrc` -- Gradle extensions specific to our local build and release
-  methodology.
* `config` -- Tools for build and code hygiene.
* `docs` -- The documentation (including this install guide)
+* `gradle` -- Configuration and code managed by the Gradle build system.
+* `integration` -- Testing scripts for SQL changes.
* `java-format` -- The Google java formatter and wrapper scripts to use it
  incrementally.
-* `python` -- Some Python reporting scripts
* `release` -- Configuration for our continuous integration process.

## Build the codebase

@@ -56,34 +49,29 @@ The first step is to build the project, and verify that this completes
successfully. This will also download and install dependencies.

```shell
-$ ./nom_build build
+$ ./gradlew build
Starting a Gradle Daemon (subsequent builds will be faster)
Plugins: Using default repo...
> Configure project :buildSrc
Java dependencies: Using Maven central...
[ .. snip .. ]
```

-The `nom_build` script is just a wrapper around `gradlew`. Its main
-additional value is that it formalizes the various properties used in the
-build as command-line flags.
+The "build" command builds all the code and runs all the tests. This will take a
+while.

-The "build" command builds all of the code and runs all of the tests. This
-will take a while.
+## Create and configure a GCP project

-## Create an App Engine project
-
-First, [create an
-application](https://cloud.google.com/appengine/docs/java/quickstart) on Google
-Cloud Platform. Make sure to choose a good Project ID, as it will be used
-repeatedly in a large number of places. If your company is named Acme, then a
-good Project ID for your production environment would be "acme-registry". Keep
+First,
+[create a project](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
+on Google Cloud Platform. Make sure to choose a good Project ID, as it will be
+used repeatedly in a large number of places. If your company is named Acme, then
+a good Project ID for your production environment would be "acme-registry". Keep
in mind that project IDs for non-production environments should be suffixed with
-the name of the environment (see the [Architecture
-documentation](./architecture.md) for more details). For the purposes of this
-example we'll deploy to the "alpha" environment, which is used for developer
-testing. The Project ID will thus be `acme-registry-alpha`.
+the name of the environment (see the +[Architecture documentation](./architecture.md) for more details). For the +purposes of this example we'll deploy to the "alpha" environment, which is used +for developer testing. The Project ID will thus be `acme-registry-alpha`. Now log in using the command-line Google Cloud Platform SDK and set the default project to be this one that was newly created: @@ -96,6 +84,17 @@ You are now logged in as [user@email.tld]. $ gcloud config set project acme-registry-alpha ``` +And make sure the required APIs are enabled in the project: + +```shell +$ gcloud services enable \ + container.googleapis.com \ + artifactregistry.googleapis.com \ + sqladmin.googleapis.com \ + secretmanager.googleapis.com \ + compute.googleapis.com +``` + Now modify `projects.gradle` with the name of your new project:
@@ -106,42 +105,51 @@ rootProject.ext.projects = ['production': 'your-production-project',
                             'crash'     : 'your-crash-project']
 
-Next follow the steps in [configuration](./configuration.md) to configure the
-complete system or, alternately, read on for an initial deploy in which case
-you'll need to deploy again after configuration.
+### Create GKE Clusters

-## Deploy the code to App Engine
+We recommend Standard clusters with Workload Identity enabled to allow pods to
+securely access Cloud SQL and Secret Manager. Feel free to adjust the numbers
+and sizing as desired.

-AppEngine deployment with gradle is straightforward:
+```shell
+$ gcloud container clusters create nomulus-cluster \
+    --region=$REGION \
+    --workload-pool=$PROJECT_ID.svc.id.goog \
+    --num-nodes=3 \
+    --enable-ip-alias
+$ gcloud container clusters create proxy-cluster \
+    --region=$REGION \
+    --workload-pool=$PROJECT_ID.svc.id.goog \
+    --num-nodes=3 \
+    --enable-ip-alias
+```

-    $ ./nom_build appengineDeploy --environment=alpha
+Then create an artifact repository:

-To verify successful deployment, visit
-https://acme-registry-alpha.appspot.com/registrar in your browser (adjusting
-appropriately for the project ID that you actually used). If the project
-deployed successfully, you'll see a "You need permission" page indicating that
-you need to configure the system and grant access to your Google account. It's
-time to go to the next step, configuration.
+```shell
+$ gcloud artifacts repositories create nomulus-repo \
+    --repository-format=docker \
+    --location=$REGION \
+    --description="Nomulus Docker images"
+```
+
+See the files and documentation in the `release/` folder for more information on
+the release process. You will likely need to customize the internal build
+process for your own setup, including internal repository management, builds,
+and where Nomulus is deployed.

Configuration is handled by editing code, rebuilding the project, and deploying
-Once you have completed basic configuration (including most critically the
-project ID, client id and secret in your copy of the `nomulus-config-*.yaml`
-files), you can rebuild and start using the `nomulus` tool to create test
-entities in your newly deployed system. See the [first steps tutorial](./first-steps-tutorial.md)
+again. See the [configuration guide](./configuration.md) for more details. Once
+you have completed basic configuration (including most critically the project
+ID, client ID, and secret in your copy of the `nomulus-config-*.yaml` files),
+you can rebuild and start using the `nomulus` tool to create test entities in
+your newly deployed system. See the [first steps tutorial](./first-steps-tutorial.md)
for more information.

-[app-engine-sdk]: https://cloud.google.com/appengine/docs/java/download
-[java-jdk11]: https://www.oracle.com/java/technologies/javase-downloads.html
+[java-jdk21]: https://www.oracle.com/java/technologies/javase-downloads.html

-## Deploy the BEAM Pipelines
+## Deploy the Beam Pipelines

-Nomulus is in the middle of migrating all pipelines to use flex-template. For
-pipelines already based on flex-template, deployment in the testing environments
+Deployment of the Beam pipelines to Cloud Dataflow in the testing environments
(alpha and crash) can be done using the following command:

```shell
-./nom_build :core:stageBeamPipelines --environment=alpha
+./gradlew :core:stageBeamPipelines -Penvironment=alpha
```

Pipeline deployment in other environments is done through CloudBuild. Please refer
diff --git a/docs/proxy-setup.md b/docs/proxy-setup.md
index 3c2ecaaebf2..97f96616b09 100644
--- a/docs/proxy-setup.md
+++ b/docs/proxy-setup.md
@@ -2,22 +2,22 @@
This doc covers procedures to configure, build and deploy the
[Netty](https://netty.io)-based proxy onto [Kubernetes](https://kubernetes.io)
-clusters. [Google Kubernetes
-Engine](https://cloud.google.com/kubernetes-engine/) is used as deployment
-target.
Any kubernetes cluster should in theory work, but the user needs to
-change some dependencies on other GCP features such as Cloud KMS for key
-management and Stackdriver for monitoring.
+clusters.
+[Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/) is used
+as the deployment target. Any Kubernetes cluster should in theory work, but the
+user needs to change some dependencies on other GCP features such as Cloud KMS
+for key management and Stackdriver for monitoring.

## Overview

-Nomulus runs on Google App Engine, which only supports HTTP(S) traffic. In order
-to work with [EPP](https://tools.ietf.org/html/rfc5730.html) (TCP port 700) and
-[WHOIS](https://tools.ietf.org/html/rfc3912) (TCP port 43), a proxy is needed to
-relay traffic between clients and Nomulus and do protocol translation.
+Nomulus runs on GKE and natively supports only HTTP(S) traffic. In order to
+work with [EPP](https://tools.ietf.org/html/rfc5730.html) (TCP port 700), a
+proxy is needed to relay traffic between clients and Nomulus and do protocol
+translation.

We provide a Netty-based proxy that runs as a standalone service (separate from
-Nomulus) either on a VM or Kubernetes clusters. Deploying to kubernetes is
-recommended as it provides automatic scaling and management for docker
+Nomulus) either on a VM or Kubernetes clusters. Deploying to Kubernetes is
+recommended as it provides automatic scaling and management for Docker
containers that alleviates much of the pain of running a production service.

The procedure described here can be used to set up a production environment, as
@@ -26,13 +26,13 @@ However, proper release management (cutting a release, rolling updates, canary
analysis, reliable rollback, etc) is not covered. The user is advised to use a
service like [Spinnaker](https://www.spinnaker.io/) for release management.
-## Detailed Instruction
+## Detailed Instructions

We use [`gcloud`](https://cloud.google.com/sdk/gcloud/) and
-[`terraform`](https://terraform.io) to configure the proxy project on GCP and to create a GCS
-bucket for storing the terraform state file. We use
-[`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/) to deploy
-the proxy to the project. These instructions assume that all three tools are
+[`terraform`](https://terraform.io) to configure the proxy project on GCP and to
+create a GCS bucket for storing the terraform state file. We use
+[`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/) to deploy
+the proxy to the project. These instructions assume that all three tools are
installed.

### Setup GCP project

@@ -41,9 +41,9 @@ There are three projects involved:

- Nomulus project: the project that hosts Nomulus.
- Proxy project: the project that hosts this proxy.
-- GCR ([Google Container
-  Registry](https://cloud.google.com/container-registry/)) project: the
-  project from which the proxy pulls its Docker image.
+- GCR
+  ([Google Container Registry](https://cloud.google.com/container-registry/))
+  project: the project from which the proxy pulls its Docker image.

We recommend using the same project for Nomulus and the proxy, so that logs for
both are collected in the same place and easily accessible. If there are
@@ -64,16 +64,16 @@ $ gcloud storage buckets create gs:/// --project

### Obtain a domain and SSL certificate

-The proxy exposes two endpoints, whois.\ and
-epp.\. The base domain \ needs to be obtained
-from a registrar ([Google Domains](https://domains.google) for example). Nomulus
-operators can also self-allocate a domain in the TLDs under management.
+The proxy exposes one endpoint: `epp.<base-domain>`. The base domain
+`<base-domain>` needs to be obtained from a registrar
+([Google Domains](https://domains.google/), now defunct, was one). Nomulus
+operators can also self-allocate a domain in the TLDs under management.
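As a tiny illustration of the naming scheme (the base domain here is a made-up placeholder):

```shell
# Hypothetical base domain; substitute the one you actually registered.
BASE_DOMAIN="registry.example"
EPP_HOST="epp.${BASE_DOMAIN}"
echo "${EPP_HOST}"

# One way (among many CAs) to get a certificate for this host, assuming
# certbot is installed and the host is reachable on port 80:
#   certbot certonly --standalone -d "${EPP_HOST}"
```

The SSL certificate discussed next must cover exactly this hostname.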
[EPP protocol over TCP](https://tools.ietf.org/html/rfc5734) requires a
client-authenticated SSL connection. The operator of the proxy needs to obtain
-an SSL certificate for domain epp.\. [Let's
-Encrypt](https://letsencrypt.org) offers SSL certificate free of charge, but any
-other CA can fill the role.
+an SSL certificate for the domain `epp.<base-domain>`.
+[Let's Encrypt](https://letsencrypt.org) offers SSL certificates free of charge,
+but any other CA can fill the role.

Concatenate the certificate and its private key into one file:

@@ -82,7 +82,7 @@ $ cat >
```

The order between the certificate and the private key inside the combined file
-does not matter. However, if the certificate file is chained, i. e. it contains
+does not matter. However, if the certificate file is chained, i.e. it contains
not only the certificate for your domain, but also certificates from
intermediate CAs, these certificates must appear in order. The previous
certificate's issuer must be the next certificate's subject.
@@ -92,8 +92,9 @@ bucket will be created automatically by terraform.

### Setup proxy project

-First setup the [Application Default
-Credential](https://cloud.google.com/docs/authentication/production) locally:
+First, set up
+[Application Default Credentials](https://cloud.google.com/docs/authentication/production)
+locally:

```bash
$ gcloud auth application-default login
@@ -102,10 +103,9 @@ $ gcloud auth application-default login

Log in with the account that has the "Project Owner" role on all three projects
mentioned above.

-Navigate to `proxy/terraform`, create a folder called
-`envs`, and inside it, create a folder for the environment that proxy is
-deployed to ("alpha" for example). Copy `example_config.tf` and `outputs.tf`
-to the environment folder.
+Navigate to `proxy/terraform`, create a folder called `envs`, and inside it,
+create a folder for the environment that the proxy is deployed to ("alpha" for
+example). Copy `example_config.tf` and `outputs.tf` to the environment folder.
```bash
$ cd proxy/terraform
@@ -132,12 +132,12 @@ takes a couple of minutes.

### Setup Nomulus

-After terraform completes, it outputs some information, among which is the
-email address of the service account created for the proxy. This needs to be
-added to the Nomulus configuration file so that Nomulus accepts traffic from the
-proxy. Edit the following section in
-`java/google/registry/config/files/nomulus-config-<env>.yaml` and redeploy
-Nomulus:
+After terraform completes, it outputs some information, among which is the email
+address of the service account created for the proxy. This needs to be added to
+the Nomulus configuration file so that Nomulus accepts traffic from the proxy.
+Edit the following section in
+`core/src/main/java/google/registry/config/files/nomulus-config-<env>.yaml` and
+redeploy Nomulus:

```yaml
auth:
@@ -148,7 +148,7 @@ auth:

### Setup nameservers

The terraform output (run `terraform output` in the environment folder to show
-it again) also shows the nameservers of the proxy domain (\).
+it again) also shows the nameservers of the proxy domain (`<base-domain>`).
Delegate this domain to these nameservers (through your registrar). If the
domain is self-allocated by Nomulus, run:

@@ -160,8 +160,8 @@ $ nomulus -e production update_domain \

### Setup named ports

Unfortunately, terraform currently cannot add named ports on the instance groups
-of the GKE clusters it manages. [Named
-ports](https://cloud.google.com/compute/docs/load-balancing/http/backend-service#named_ports)
+of the GKE clusters it manages.
+[Named ports](https://cloud.google.com/compute/docs/load-balancing/http/backend-service#named_ports)
are needed for the load balancer it sets up to route traffic to the proxy. To
set named ports, in the environment folder, do:

@@ -189,8 +189,9 @@ $ gcloud storage cp gs://

### Edit proxy config file

-Proxy configuration files are at `java/google/registry/proxy/config/`.
There is
-a default config that provides most values needed to run the proxy, and several
+Proxy configuration files are at
+`proxy/src/main/java/google/registry/proxy/config/`. There is a default config
+that provides most values needed to run the proxy, and several
environment-specific configs for proxy instances that communicate to different
Nomulus environments. The values specified in the environment-specific file
override those in the default file.

@@ -202,16 +203,33 @@ detailed descriptions on each field.

### Upload proxy docker image to GCR

-Edit the `proxy_push` rule in `java/google/registry/proxy/BUILD` to add the GCR
-project name and the image name to save to. Note that as currently set up, all
-images pushed to GCR will be tagged `bazel` and the GKE deployment object loads
-the image tagged as `bazel`. This is fine for testing, but for production one
-should give images unique tags (also configured in the `proxy_push` rule).
+The GKE deployment manifest is set up to pull the proxy docker image from
+[Google Container Registry](https://cloud.google.com/container-registry/) (GCR).
+Instead of using `docker` and `gcloud` to build and push images, respectively,
+we provide Gradle tasks for the same purpose. To push an image, first use
+[`docker-credential-gcr`](https://github.com/GoogleCloudPlatform/docker-credential-gcr)
+to obtain necessary credentials. It is used by Gradle to push the image.
+
+After credentials are configured, verify that Gradle will use the proper
+`gcpProject` for deployment in the main `build.gradle` file. We recommend using
+the same project and image for proxies intended for different Nomulus
+environments; this way one can deploy the same proxy image first to sandbox for
+testing, and then to production.
To push to GCR, run:

```bash
-$ bazel run java/google/registry/proxy:proxy_push
+$ ./gradlew proxy:pushProxyImage
+```
+
+If the GCP project to host images (gcr project) is different from the project
+that the proxy runs in (proxy project), grant the service account the "Storage
+Object Viewer" role on the gcr project.
+
+```bash
+$ gcloud projects add-iam-policy-binding \
+--member serviceAccount: \
+--role roles/storage.objectViewer
```

### Deploy proxy

@@ -243,9 +261,9 @@ Repeat this for all three clusters.

### Afterwork

-Remember to turn on [Stackdriver
-Monitoring](https://cloud.google.com/monitoring/docs/) for the proxy project as
-we use it to collect metrics from the proxy.
+Remember to turn on
+[Stackdriver Monitoring](https://cloud.google.com/monitoring/docs/) for the
+proxy project as we use it to collect metrics from the proxy.

You are done! The proxy should be running now. You should store the private key
safely, or delete it as you now have the encrypted file shipped with the proxy.

@@ -278,14 +296,14 @@ in multiple zones to provide geographical redundancy.

### Create service account

-The proxy will run with the credential of a [service
-account](https://cloud.google.com/compute/docs/access/service-accounts). In
-theory it can take advantage of [Application Default
-Credentials](https://cloud.google.com/docs/authentication/production) and use
-the service account that the GCE instance underpinning the GKE cluster uses, but
-we recommend creating a separate service account. With a dedicated service
-account, one can grant permissions only necessary to the proxy. To create a
-service account:
+The proxy will run with the credential of a
+[service account](https://cloud.google.com/compute/docs/access/service-accounts).
+In theory, it can take advantage of
+[Application Default Credentials](https://cloud.google.com/docs/authentication/production)
+and use the service account that the GCE instance underpinning the GKE cluster
+uses, but we recommend creating a separate service account. With a dedicated
+service account, one can grant only the permissions necessary to the proxy. To
+create a service account:

```bash
$ gcloud iam service-accounts create proxy-service-account \
@@ -303,10 +321,10 @@ $ gcloud iam service-accounts keys create proxy-key.json --iam-account \

A `proxy-key.json` file will be created inside the current working directory.

-The service account email address needs to be added to the Nomulus
-configuration file so that Nomulus accepts the OAuth tokens generated for this
-service account. Add its value to
-`java/google/registry/config/files/nomulus-config-.yaml`:
+The service account email address needs to be added to the Nomulus configuration
+file so that Nomulus accepts the OAuth tokens generated for this service
+account. Add its value to
+`core/src/main/java/google/registry/config/files/nomulus-config-.yaml`:

```yaml
auth:
@@ -325,27 +343,13 @@ $ gcloud projects add-iam-policy-binding \
--role roles/logging.logWriter
```

-### Obtain a domain and SSL certificate
-
-A domain is needed (if you do not want to rely on IP addresses) for clients to
-communicate to the proxy. Domains can be purchased from a domain registrar
-([Google Domains](https://domains.google) for example). A Nomulus operator could
-also consider self-allocating a domain under an owned TLD insteadl.
-
-An SSL certificate is needed as [EPP over
-TCP](https://tools.ietf.org/html/rfc5734) requires SSL. You can apply for an SSL
-certificate for the domain name you intended to serve as EPP endpoint
-(epp.nic.tld for example) for free from [Let's
-Encrypt](https://letsencrypt.org). For now, you will need to manually renew your
-certificate before it expires.
-
### Create keyring and encrypt the certificate/private key

The proxy needs access to both the private key and the certificate. Do *not*
-package them directly with the proxy. Instead, use [Cloud
-KMS](https://cloud.google.com/kms/) to encrypt them, ship the encrypted file
-with the proxy, and call Cloud KMS to decrypt them on the fly. (If you want to
-use another keyring solution, you will have to modify the proxy and implement
+package them directly with the proxy. Instead, use
+[Cloud KMS](https://cloud.google.com/kms/) to encrypt them, ship the encrypted
+file with the proxy, and call Cloud KMS to decrypt them on the fly. (If you want
+to use another keyring solution, you will have to modify the proxy and implement
your own.)

Concatenate the private key file with the certificate. It does not matter which

@@ -378,7 +382,7 @@ A file named `ssl-cert-key.pem.enc` will be created. Upload it to a GCS bucket

in the proxy project. To create a bucket and upload the file:

```bash
-$ gcloud storage buckets create gs:// --project
+$ gcloud storage buckets create gs:// --project
$ gcloud storage cp ssl-cert-key.pem.enc gs://
```

@@ -402,8 +406,9 @@ $ gcloud storage buckets add-iam-policy-binding gs:// \

### Proxy configuration

-Proxy configuration files are at `java/google/registry/proxy/config/`. There is
-a default config that provides most values needed to run the proxy, and several
+Proxy configuration files are at
+`proxy/src/main/java/google/registry/proxy/config/`. There is a default config
+that provides most values needed to run the proxy, and several
environment-specific configs for proxy instances that communicate to different
Nomulus environments. The values specified in the environment-specific file
override those in the default file.

@@ -416,12 +421,12 @@ for detailed descriptions on each field.

### Setup Stackdriver for the project

The proxy streams metrics to
-[Stackdriver](https://cloud.google.com/stackdriver/).
Refer to [Stackdriver -Monitoring](https://cloud.google.com/monitoring/docs/) documentation on how to -enable monitoring on the GCP project. +[Stackdriver](https://cloud.google.com/stackdriver/). Refer to +[Stackdriver Monitoring](https://cloud.google.com/monitoring/docs/) +documentation on how to enable monitoring on the GCP project. -The proxy service account needs to have ["Monitoring Metric -Writer"](https://cloud.google.com/monitoring/access-control#predefined_roles) +The proxy service account needs to have +["Monitoring Metric Writer"](https://cloud.google.com/monitoring/access-control#predefined_roles) role in order to stream metrics to Stackdriver: ```bash @@ -464,44 +469,6 @@ tag for all clusters. Repeat this for all the zones you want to create clusters in. -### Upload proxy docker image to GCR - -The GKE deployment manifest is set up to pull the proxy docker image from -[Google Container Registry](https://cloud.google.com/container-registry/) (GCR). -Instead of using `docker` and `gcloud` to build and push images, respectively, -we provide `bazel` rules for the same tasks. To push an image, first use -[`docker-credential-gcr`](https://github.com/GoogleCloudPlatform/docker-credential-gcr) -to obtain necessary credentials. It is used by the [bazel container_push -rules](https://github.com/bazelbuild/rules_docker#authentication) to push the -image. - -After credentials are configured, edit the `proxy_push` rule in -`java/google/registry/proxy/BUILD` to add the GCP project name and the image -name to save to. We recommend using the same project and image for proxies -intended for different Nomulus environments, this way one can deploy the same -proxy image first to sandbox for testing, and then to production. - -Also note that as currently set up, all images pushed to GCR will be tagged -`bazel` and the GKE deployment object loads the image tagged as `bazel`. 
This is -fine for testing, but for production one should give images unique tags (also -configured in the `proxy_push` rule). - -To push to GCR, run: - -```bash -$ bazel run java/google/registry/proxy:proxy_push -``` - -If the GCP project to host images (gcr project) is different from the project -that the proxy runs in (proxy project), give the service account "Storage Object -Viewer" role of the gcr project. - -```bash -$ gcloud projects add-iam-policy-binding \ ---member serviceAccount: \ ---role roles/storage.objectViewer -``` - ### Upload proxy service account key to GKE cluster The kubernetes pods (containers) are configured to read the proxy service @@ -555,22 +522,22 @@ Repeat the same step for all clusters you want to deploy to. The proxies running on GKE clusters need to be exposed to the outside. Do not use Kubernetes [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer). -It will create a GCP [Network Load -Balancer](https://cloud.google.com/compute/docs/load-balancing/network/), which -has several problems: +It will create a GCP +[Network Load Balancer](https://cloud.google.com/compute/docs/load-balancing/network/), +which has several problems: - This load balancer does not terminate TCP connections. It simply acts as an edge router that forwards IP packets to a "healthy" node in the cluster. As such, it does not support IPv6, because GCE instances themselves are currently IPv4 only. -- IP packets that arrived on the node may be routed to another node for +- IP packets that arrived at the node may be routed to another node for reasons of capacity and availability. In doing so it will [SNAT](https://en.wikipedia.org/wiki/Network_address_translation#SNAT) the packet, therefore losing the source IP information that the proxy needs. The - proxy uses WHOIS source IP address to cap QPS and passes EPP source IP to - Nomulus for validation. 
Note that a TCP terminating load balancer also has
- this problem as the source IP becomes that of the load balancer, but it can
- be addressed in other ways (explained later). See
+ proxy uses the source IP address to cap QPS and passes EPP source IP to
+ Nomulus for validation. Note that a TCP terminating load balancer also has
+ this problem as the source IP becomes that of the load balancer, but it can
+ be addressed in other ways (explained later). See
[here](https://kubernetes.io/docs/tutorials/services/source-ip/) for more
details on how Kubernetes routes traffic and translates source IPs inside the
cluster.

@@ -581,8 +548,8 @@ has several problems:

Instead, we split the task of exposing the proxy to the Internet into two tasks,
first to expose it within the cluster, then to expose the cluster to the outside
-through a [TCP Proxy Load
-Balancer](https://cloud.google.com/compute/docs/load-balancing/tcp-ssl/tcp-proxy).
+through a
+[TCP Proxy Load Balancer](https://cloud.google.com/compute/docs/load-balancing/tcp-ssl/tcp-proxy).
This load balancer terminates TCP connections and allows for the use of a
single anycast IP address (IPv4 and IPv6) to reach any clusters connected to its
backend (it chooses a particular cluster based on geographical proximity). From

@@ -611,8 +578,8 @@ $ kubectl create -f \
proxy/kubernetes/proxy-service.yaml
```

-This service object will open up port 30000 (health check), 30001 (WHOIS) and
-30002 (EPP) on the nodes, routing to the same ports inside a pod.
+This service object will open up port 30000 (health check) and 30002 (EPP) on
+the nodes, routing to the same ports inside a pod.

Repeat this for all clusters.

@@ -641,7 +608,7 @@ Then set the named ports:

```bash
$ gcloud compute instance-groups set-named-ports \
---named-ports whois:30001,epp:30002 --zone
+--named-ports epp:30002 --zone
```

Repeat this for each instance group (cluster).

@@ -689,7 +656,7 @@ routed to the corresponding port on a proxy pod.
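As a rough sketch of what the `proxy-service.yaml` manifest referenced above declares, the service is a `NodePort` service that maps each node port to the same port on a proxy pod. The `name` and `selector` values below are assumptions for illustration, not taken from the actual manifest; only the port numbers come from the text above:

```yaml
# Illustrative sketch only -- consult proxy/kubernetes/proxy-service.yaml for
# the real manifest. A NodePort service opens each listed nodePort on every
# node and routes it to the matching port on a proxy pod.
apiVersion: v1
kind: Service
metadata:
  name: proxy-service        # assumed name
spec:
  type: NodePort
  selector:
    app: proxy               # assumed pod label
  ports:
    - name: health-check
      port: 30000
      nodePort: 30000
    - name: epp
      port: 30002
      nodePort: 30002
```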
The backend service codifies which ports on the clusters' nodes should receive
traffic from the load balancer.

-Create one backend service for EPP and one for WHOIS:
+Create a backend service for EPP:

```bash
# EPP backend
$ gcloud compute backend-services create proxy-epp-loadbalancer \
--global --protocol TCP --health-checks proxy-health --timeout 1h \
--port-name epp
-
-# WHOIS backend
-$ gcloud compute backend-services create proxy-whois-loadbalancer \
---global --protocol TCP --health-checks proxy-health --timeout 1h \
---port-name whois
```

-These two backend services route packets to the epp named port and whois named
-port on any instance group attached to them, respectively.
+This backend service routes packets to the EPP named port on any instance group
+attached to it.

-Then add (attach) instance groups that the proxies run on to each backend
-service:
+Then add (attach) instance groups that the proxies run on to the backend
+service:

```bash
# EPP backend
$ gcloud compute backend-services add-backend proxy-epp-loadbalancer \
--global --instance-group --instance-group-zone \
--balancing-mode UTILIZATION --max-utilization 0.8
-
-# WHOIS backend
-$ gcloud compute backend-services add-backend proxy-whois-loadbalancer \
---global --instance-group --instance-group-zone \
---balancing-mode UTILIZATION --max-utilization 0.8
```

Repeat this for each instance group.

@@ -747,10 +704,10 @@ $ gcloud compute addresses describe proxy-ipv4 --global
$ gcloud compute addresses describe proxy-ipv6 --global
```

-Set these IP addresses as the A/AAAA records for both epp. and
-whois. where is the domain that was obtained earlier. (If you
-use [Cloud DNS](https://cloud.google.com/dns/) as your DNS provider, this step
-can also be performed by `gcloud`)
+Set these IP addresses as the A/AAAA records for epp. where is
+the domain that was obtained earlier.
(If you use
+[Cloud DNS](https://cloud.google.com/dns/) as your DNS provider, this step can
+also be performed by `gcloud`.)

#### Create load balancer frontend

@@ -761,21 +718,16 @@ First create a TCP proxy (yes, it is confusing, this GCP resource is called

"proxy" as well) which is a TCP termination point. Outside connections terminate
on a TCP proxy, which establishes its own connection to the backend services
defined above. As such, the source IP address from the outside is lost. But the
-TCP proxy can add the [PROXY protocol
-header](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) at the
-beginning of the connection to the backend. The proxy running on the backend can
-parse the header and obtain the original source IP address of a request.
-
-Make one for each protocol (EPP and WHOIS).
+TCP proxy can add the
+[PROXY protocol header](https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt)
+at the beginning of the connection to the backend. The proxy running on the
+backend can parse the header and obtain the original source IP address of a
+request.
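To make the header format concrete, here is a minimal sketch (with made-up addresses) of what a PROXY protocol v1 line looks like and how the original client IP can be recovered from it:

```shell
# A PROXY protocol v1 header is a single text line prepended to the TCP
# stream by the load balancer. The values below are illustrative only.
header='PROXY TCP4 198.51.100.22 203.0.113.7 56324 700'
# Fields: marker, address family, client IP, proxy IP, client port, proxy port.
# The third field is the original client IP that would otherwise be lost.
echo "$header" | awk '{print $3}'
```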
```bash
# EPP
$ gcloud compute target-tcp-proxies create proxy-epp-proxy \
--backend-service proxy-epp-loadbalancer --proxy-header PROXY_V1
-
-# WHOIS
-$ gcloud compute target-tcp-proxies create proxy-whois-proxy \
---backend-service proxy-whois-loadbalancer --proxy-header PROXY_V1
```

Note the use of the `--proxy-header` flag, which turns on the PROXY protocol

@@ -785,47 +737,36 @@ Next, create the forwarding rule that route outside traffic to a given IP to

the TCP proxy just created:

```bash
-$ gcloud compute forwarding-rules create proxy-whois-ipv4 \
---global --target-tcp-proxy proxy-whois-proxy \
---address proxy-ipv4 --ports 43
+$ gcloud compute forwarding-rules create proxy-epp-ipv4 \
+--global --target-tcp-proxy proxy-epp-proxy \
+--address proxy-ipv4 --ports 700
```

The above command sets up a forwarding rule that routes traffic destined to the
-static IPv4 address reserved earlier, on port 43 (actual port for WHOIS), to the
-TCP proxy that connects to the whois backend service.
+static IPv4 address reserved earlier, on port 700 (actual port for EPP), to the
+TCP proxy that connects to the EPP backend service.

-Repeat the above command another three times, set up IPv6 forwarding for WHOIS,
-and IPv4/IPv6 forwarding for EPP.
+Repeat the above command to set up IPv6 forwarding for EPP.

## Additional steps

-### Check if it all works
-
-At this point the proxy should be working and reachable from the Internet. Try
-if a whois request to it is successful:
-
-```bash
-whois -h whois. something
-```
-
-One can also try to contact the EPP endpoint with an EPP client.
-
### Check logs and metrics

The proxy saves logs to [Stackdriver
-Logging](https://cloud.google.com/logging/), which is the same place that
-Nomulus saves it logs to. On GCP console, navigate to Logging - Logs - GKE
-Container - - default. Do not choose "All namespace_id" as it
-includes logs from the Kubernetes system itself and can be quite overwhelming.
-
-Metrics are stored in [Stackdriver
-Monitoring](https://cloud.google.com/monitoring/docs/). To view the metrics, go
-to Stackdriver [console](https://app.google.stackdriver.com) (also accessible
-from GCE console under Monitoring), navigate to Resources - Metrics Explorer.
-Choose resource type "GKE Container" and search for metrics with name "/proxy/"
-in it. Currently available metrics include total connection counts, active
-connection count, request/response count, request/response size, round-trip
-latency and quota rejection count.
+The proxy saves logs to
+[Stackdriver Logging](https://cloud.google.com/logging/), which is the same
+place that Nomulus saves its logs to. On GCP console, navigate to Logging -
+Logs - GKE Container - - default. Do not choose "All
+namespace_id" as it includes logs from the Kubernetes system itself and can be
+quite overwhelming.
+
+Metrics are stored in
+[Stackdriver Monitoring](https://cloud.google.com/monitoring/docs/). To view the
+metrics, go to Stackdriver [console](https://app.google.stackdriver.com) (also
+accessible from GCE console under Monitoring), navigate to Resources - Metrics
+Explorer. Choose resource type "GKE Container" and search for metrics with name
+"/proxy/" in it. Currently available metrics include total connection counts,
+active connection count, request/response count, request/response size,
+round-trip latency and quota rejection count.

### Cleanup sensitive files
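Earlier steps left plaintext copies of the certificate/key bundle and the service account key on the local machine; as noted above, once the encrypted copy is in GCS you should store the private key safely or delete it. A minimal sketch of the deletion (stand-in files are created here so the commands are self-contained; in practice the targets are the real `ssl-cert-key.pem` and `proxy-key.json`):

```shell
# Stand-in files so this sketch runs on its own; in practice these are the
# real ssl-cert-key.pem and proxy-key.json produced in the earlier steps.
touch ssl-cert-key.pem proxy-key.json
# shred overwrites file contents before unlinking them (-u removes the files),
# which is safer than plain rm for key material on local disks.
shred -u ssl-cert-key.pem proxy-key.json
ls ssl-cert-key.pem proxy-key.json 2>/dev/null || echo "sensitive files removed"
```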