
Add kernels ResNetCell and TransformerCell #276

Open
andidr wants to merge 19 commits into master from andi/benchmarks

Conversation

Collaborator

@andidr andidr commented Aug 2, 2019

This series of commits adds two new ML kernels, ResNetCell and TransformerCell. Additionally, for each of these kernels, two further kernels are added that result from splitting the original kernel into two stages (ResNetCellTopHalf, ResNetCellBottomHalf, TransformerCellTopHalf, and TransformerCellBottomHalf).

@andidr andidr requested review from Elarnon, nicoTolly and ulysseB August 2, 2019 13:54
}

#[derive(Clone, Deserialize, Serialize)]
pub struct ResNetCellBottomHalfP {
Collaborator
A more explicit name (in NN terminology) could be FusedDenseBias -- this is a fused operation performing activation(a.b) + c, which corresponds to a fused dense layer (activation(a.b)) and bias layer (which is a tensor addition).

Collaborator Author

I like the more meaningful name. However, I'd like to keep a naming scheme that expresses the split of ResNetCell. Alternatively, we could use more meaningful names for the split kernels and just add aliases in cuda_search of telamon-cli. I'm open to suggestions.

Collaborator

To me FusedDenseBias is much more explicit than ResNetBottomHalf, and IMO would be as well for anyone with NN experience (the equivalent operation in tensorflow is called fused_bias_conv2d or something similar IIRC). The point is that this fused layer is used/useful in more than only resnet cells.

Collaborator Author

OK, then let's completely eliminate the relationship of the smaller kernels with TransformerCell and ResNetCell and choose meaningful names for the *{Top,Bottom}Half kernels.

  • Remove ResNetCellTopHalf entirely, since FusedMM is already sufficient
  • Rename ResNetBottomHalf to FusedDenseBias
  • Rename TransformerCellBottomHalf to ScaledMM

Any suggestions for TransformerCellTopHalf?

andidr added 19 commits August 2, 2019 17:01
When composing an expression using virtual tensors, it might be
necessary to reference an intermediate result more than once. For
example, when normalizing a vector, its values are referenced first by
a reduction and a second time when each element of the vector is
divided by the result of the reduction.

When a virtual tensor is simply reused in multiple subexpressions, the exact same dimensions of the virtual tensor may be used by multiple instructions whose ordering constraints ensure the correct order of the calculation. This, in turn, can lead to cyclic dependencies in the ordering of the virtual tensor's dimensions.

For example, in the above-mentioned normalization, the dimensions of
the virtual tensor representing the vector are used both by the
reduction and the division. For correct results, the division
instruction must be executed only once the reduction is completed,
which requires that the reduction dimensions are placed before the
division. However, the division also iterates over the dimensions of
the virtual tensor. The dimensions of the virtual tensor would thus be
required to be placed before themselves and Telamon would fail with
unsatisfiable constraints.

This patch introduces a new function `duplicate()` for virtual
tensors, creating a new virtual tensor with identical values, but with
a new set of dimensions. This function is currently implemented only
partially:

  - For virtual tensors originating from a tensor stored in global
    memory, `duplicate()` simply reloads the tensor a second time from
    global memory.

  - For virtual tensors which originate from arbitrary instructions,
    duplication would require a more complex procedure, potentially
    duplicating multiple instructions or storing intermediate results
    in memory. Neither duplication of instructions, nor temporary
    buffers are currently supported by the Telamon
    infrastructure. Therefore, `duplicate()` panics in these cases.

A complete implementation covering all cases is left for a redesign of
the code for kernel composition.
…tion

This patch adds two kernel composition functions `tensor_activate` and
`array_activate_inplace` that apply an optional activation function to
a VirtualTensor and an `ndarray::Array`, respectively.

By using an `Option<ActivationFunction>` rather than an
`ActivationFunction`, the case distinction between a function being
present and no function being specified can be eliminated from the
kernel compositor, which allows for more compact kernel
specifications.

E.g., instead of:

  if let Some(activation_fun) = &activation_fun_opt {
      let res = activation_fun.apply::<S>(..., &tmp);
      res.store(...);
  } else {
      tmp.store(...);
  }

the compositor can simply do:

  let res = tensor_activate::<S>(..., tmp, &activation_fun_opt);
  res.store(...);
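The pattern can be sketched with plain scalar functions in place of Telamon's virtual-tensor API; the types below are illustrative stand-ins, not the actual telamon definitions:

```rust
// Illustrative stand-ins for the real telamon types.
#[derive(Clone, Copy)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

impl ActivationFunction {
    fn apply(&self, x: f64) -> f64 {
        match self {
            ActivationFunction::ReLU => x.max(0.0),
            ActivationFunction::Sigmoid => 1.0 / (1.0 + (-x).exp()),
        }
    }
}

// With an `Option`, the `None` case degenerates to the identity, so the
// caller no longer needs an `if let Some(..)` branch.
fn activate(x: f64, fun: &Option<ActivationFunction>) -> f64 {
    match fun {
        Some(f) => f.apply(x),
        None => x,
    }
}
```

The branch on the presence of an activation function is thus pushed into one helper instead of being repeated at every store site.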
This adds a new kernel `ResNetCell` implementing one cell of ResNet
[1]. The kernel computes:

  O = activation(activation(A.B).C) + A

where `A`, `B` and `C` are matrices and `activation` an activation
function (ReLU, Sigmoid or identity).

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: "Deep Residual
    Learning for Image Recognition". Available online at:
    https://arxiv.org/abs/1512.03385, accessed 07/2019.
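As a sanity reference, the formula can be checked against a naive scalar implementation using plain nested `Vec`s instead of Telamon tensors; ReLU stands in here for the activation parameter:

```rust
// Naive helper; the real kernel expresses the matrix products with
// Telamon virtual tensors.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// O = activation(activation(A.B).C) + A
fn resnet_cell(a: &[Vec<f64>], b: &[Vec<f64>], c: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut ab = matmul(a, b);
    ab.iter_mut()
        .for_each(|row| row.iter_mut().for_each(|x| *x = relu(*x)));
    let mut o = matmul(&ab, c);
    for (row_o, row_a) in o.iter_mut().zip(a) {
        for (x, &y) in row_o.iter_mut().zip(row_a) {
            *x = relu(*x) + y;
        }
    }
    o
}
```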
This patch implements `Display` for `ActivationFunction`, as well as
two convenience functions `ActivationFunction::opt_to_display` (the
equivalent of `Display` for `Option<ActivationFunction>`) and
`ActivationFunction::opt_from_string`, which converts a string into an
`Option<ActivationFunction>`.
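A sketch of what these conversions could look like; mapping the string `identity` to `None` is an assumption here, consistent with the kernels taking an `Option<ActivationFunction>` and the CLI accepting `identity`, `relu`, and `sigmoid`:

```rust
use std::fmt;

// Illustrative stand-in for the real telamon type.
#[derive(Debug, PartialEq)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

impl fmt::Display for ActivationFunction {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ActivationFunction::ReLU => write!(f, "relu"),
            ActivationFunction::Sigmoid => write!(f, "sigmoid"),
        }
    }
}

// Sketch of `opt_from_string`: `identity` maps to `None` (assumed).
fn opt_from_string(s: &str) -> Result<Option<ActivationFunction>, String> {
    match s {
        "identity" => Ok(None),
        "relu" => Ok(Some(ActivationFunction::ReLU)),
        "sigmoid" => Ok(Some(ActivationFunction::Sigmoid)),
        other => Err(format!("unknown activation function: {}", other)),
    }
}
```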
This patch adds the `ResNetCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcell_M_N_K_A` for `--kernel`, where `M`, `N`, and
`K` are positive integers defining the size of the matrices processed
by the kernel and `A` is an activation function, either `identity`,
`relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel resnetcell_1024_1024_1024_identity
… place

This patch adds the composition function `array_softmax_inplace()`,
that updates each element of an n-dimensional array with its value
according to the softmax operation.
This patch adds the kernel composition function
`tensor_elementwise_div()`, dividing each element of a tensor by a
scalar value and returning a new virtual tensor with the result.
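The softmax update can be sketched on a flat slice instead of an `ndarray::Array` (the function name here is illustrative, not the actual composition function's signature):

```rust
// In-place softmax over all elements:
// softmax(x_i) = exp(x_i) / sum_j exp(x_j)
fn softmax_inplace(xs: &mut [f64]) {
    for x in xs.iter_mut() {
        *x = x.exp();
    }
    let sum: f64 = xs.iter().sum();
    for x in xs.iter_mut() {
        *x /= sum;
    }
}
```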
This adds a new kernel `TransformerCell` implementing a single scaled
dot product attention operation of [1]. The kernel computes:

  O = softmax(scale(Q.K)).V

where `Q`, `K` and `V` are matrices.

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
    Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention
    Is All You Need". Available online at:
    https://arxiv.org/abs/1706.03762, accessed 07/2019.
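A naive scalar reference for the kernel formula. Two assumptions are made here: the scaling factor is 1/sqrt(N) as in the attention paper, and softmax normalizes over the whole matrix, consistent with the scalar sum used by the split kernels later in this series:

```rust
// Naive helper; the real kernel uses Telamon virtual tensors.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

// O = softmax(scale(Q.K)).V
fn transformer_cell(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let scale = 1.0 / (k.len() as f64).sqrt(); // assumed scaling factor
    let mut qk = matmul(q, k);
    for row in qk.iter_mut() {
        for x in row.iter_mut() {
            *x = (*x * scale).exp();
        }
    }
    let s: f64 = qk.iter().flat_map(|row| row.iter()).sum();
    for row in qk.iter_mut() {
        for x in row.iter_mut() {
            *x /= s;
        }
    }
    matmul(&qk, v)
}
```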
…n-cli

This patch adds the `TransformerCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `transformercell_M_N_P_R` for `--kernel`, where `M`, `N`,
`P` and `R` are positive integers defining the sizes of the matrices
processed by the kernel, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel transformercell_1024_1024_1024_1024
The function `helper::Tensor::read_to_host()` assumes that arrays have
at least one dimension and thus fails for scalar values represented by
0-dimensional arrays. This patch relaxes this requirement and allows
0-dimensional arrays to be read to the host.
…softmax

This adds a new kernel `TransformerCellTopHalf`, implementing the
operations of `TransformerCell` up to the calculation of the scalar
sum of the softmax operation, but excluding the element-wise division
of the softmax operation and excluding the final multiplication with
the matrix of values. That is, the kernel calculates:

  O = elementwise_exp(scale(Q.K))
  s = scalar_sum(O)

The same results as for `TransformerCell` can thus be obtained by
dividing the elements of `O` by `s` and by multiplying the result with
the value matrix.
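In the same scalar style as above, the top half can be sketched as follows (again assuming a 1/sqrt(N) scaling factor, which the commit message leaves unspecified):

```rust
// Naive helper, repeated for self-containment.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

// O = elementwise_exp(scale(Q.K)), s = scalar_sum(O)
fn transformer_cell_top_half(q: &[Vec<f64>], k: &[Vec<f64>]) -> (Vec<Vec<f64>>, f64) {
    let scale = 1.0 / (k.len() as f64).sqrt(); // assumed scaling factor
    let mut o = matmul(q, k);
    for row in o.iter_mut() {
        for x in row.iter_mut() {
            *x = (*x * scale).exp();
        }
    }
    let s: f64 = o.iter().flat_map(|row| row.iter()).sum();
    (o, s)
}
```

Dividing each element of `O` by `s` and multiplying the result with the value matrix then yields the full kernel's result.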
… telamon-cli

This patch adds the `TransformerCellTopHalf` kernel to `cuda_search`
of `telamon-cli`. To launch the new kernel with cuda_search, use a
string of the form `transformercelltophalf_M_N_P` for `--kernel`,
where `M`, `N`, and `P` are positive integers defining the sizes of
the matrices processed by the kernel, e.g.,

      $ cd telamon-cli
      $ cargo +nightly run --bin cuda_search --release -- \
        --kernel transformercelltophalf_1024_1024_1024
… softmax

This adds a new kernel `TransformerCellBottomHalf`, implementing the
operations of `TransformerCell` starting from the division of the
elements of the temporary matrix by their sum when applying the
softmax operation. That is, the kernel calculates:

  O = (1 / s_exp) * QK_SCEXP . V

The same results as for `TransformerCell` can thus be obtained by
first applying `TransformerCellTopHalf` to matrices `Q`, `K` and `V`
and by passing the results to `TransformerCellBottomHalf`.
… of telamon-cli

This patch adds the `TransformerCellBottomHalf` kernel to
`cuda_search` of `telamon-cli`. To launch the new kernel with
cuda_search, use a string of the form
`transformercellbottomhalf_M_N_R` for `--kernel`, where `M`, `N`, and
`R` are positive integers defining the sizes of the matrices processed
by the kernel, e.g.,

      $ cd telamon-cli
      $ cargo +nightly run --bin cuda_search --release -- \
        --kernel transformercellbottomhalf_1024_1024_1024
…plication

This adds a new kernel `ResNetCellTopHalf`, implementing the
operations of `ResNetCell` before the second matrix
multiplication. That is, the kernel calculates:

  O = activation(A.B)

The same results as for `ResNetCell` can thus be obtained by
multiplying the result with the third input matrix of `ResNetCell`,
applying the activation function to the result and by adding the first
input matrix `A`.
…mon-cli

This patch adds the `ResNetCellTopHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcelltophalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel resnetcelltophalf_1024_1024_1024_relu
…lication

This adds a new kernel `ResNetCellBottomHalf`, implementing the
operations of `ResNetCell` starting with the second matrix
multiplication. That is, the kernel calculates:

  O = activation(ACTAB.C)+A

The same results as for `ResNetCell` can thus be obtained by applying
`ResNetCellTopHalf` to the matrices `A` and `B` and an activation
function, followed by an invocation of `ResNetCellBottomHalf` with the
result from `ResNetCellTopHalf`, a matrix `C` and the original matrix
`A`.
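The recomposition described above can be checked with a scalar sketch of the two halves (plain `Vec` matrices instead of Telamon tensors; ReLU stands in for the activation parameter):

```rust
// Naive helper, repeated for self-containment.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// Top half: ACTAB = activation(A.B)
fn resnet_top_half(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut ab = matmul(a, b);
    for row in ab.iter_mut() {
        for x in row.iter_mut() {
            *x = relu(*x);
        }
    }
    ab
}

// Bottom half: O = activation(ACTAB.C) + A
fn resnet_bottom_half(actab: &[Vec<f64>], c: &[Vec<f64>], a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut o = matmul(actab, c);
    for (row_o, row_a) in o.iter_mut().zip(a) {
        for (x, &y) in row_o.iter_mut().zip(row_a) {
            *x = relu(*x) + y;
        }
    }
    o
}
```

Chaining the two halves reproduces the full `ResNetCell` formula `activation(activation(A.B).C) + A`.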
…elamon-cli

This patch adds the `ResNetCellBottomHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcellbottomhalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
--kernel resnetcellbottomhalf_1024_1024_1024_relu