
Add kernels ResNetCell and TransformerCell #276

Open
andidr wants to merge 19 commits into master from andi/benchmarks

Conversation

Collaborator

@andidr andidr commented Aug 2, 2019

This series of commits adds two new ML kernels, ResNetCell and TransformerCell. Additionally, for each of these kernels, two further kernels are added that result from splitting the original kernel into two stages (ResNetCellTopHalf, ResNetCellBottomHalf, TransformerCellTopHalf, and TransformerCellBottomHalf).

@andidr andidr requested review from Elarnon, nicoTolly and ulysseB August 2, 2019 13:54
}

#[derive(Clone, Deserialize, Serialize)]
pub struct ResNetCellBottomHalfP {
Collaborator
A more explicit name (in NN terminology) could be FusedDenseBias -- this is a fused operation performing activation(a.b) + c, which corresponds to a fused dense layer (activation(a.b)) and bias layer (which is a tensor addition).

Collaborator Author

I like the more meaningful name. However, I'd like to keep a naming scheme that expresses the split of ResNetCell. Alternatively, we could use more meaningful names for the split kernels and just add aliases in cuda_search of telamon-cli. I'm open to suggestions.

Collaborator

To me FusedDenseBias is much more explicit than ResNetBottomHalf, and IMO would be as well for anyone with NN experience (the equivalent operation in tensorflow is called fused_bias_conv2d or something similar IIRC). The point is that this fused layer is used/useful in more than only resnet cells.

Collaborator Author

OK, then let's completely eliminate the relationship of the smaller kernels with TransformerCell and ResNetCell and choose meaningful names for the *{Top,Bottom}Half kernels.

  • Remove ResNetCellTopHalf entirely, since FusedMM is already sufficient
  • Rename ResNetBottomHalf to FusedDenseBias
  • Rename TransformerCellBottomHalf to ScaledMM

Any suggestions for TransformerCellTopHalf?

andidr added 19 commits August 2, 2019 17:01
When composing an expression using virtual tensors, it might be
necessary to reference an intermediate result more than once. For
example, when normalizing a vector, its values are referenced first by
a reduction and a second time when each element of the vector is
divided by the result of the reduction.

When a virtual tensor is simply reused in multiple subexpressions, the exact same dimensions of the virtual tensor may be used by multiple instructions whose ordering constraints ensure the correct order of the calculation. This, in turn, can lead to cyclic dependencies in the ordering of the virtual tensor's dimensions.

For example, in the above-mentioned normalization, the dimensions of
the virtual tensor representing the vector are used both by the
reduction and the division. For correct results, the division
instruction must be executed only once the reduction is completed,
which requires that the reduction dimensions are placed before the
division. However, the division also iterates over the dimensions of
the virtual tensor. The dimensions of the virtual tensor would thus be
required to be placed before themselves and Telamon would fail with
unsatisfiable constraints.

This patch introduces a new function `duplicate()` for virtual
tensors, creating a new virtual tensor with identical values, but with
a new set of dimensions. This function is currently implemented only
partially:

  - For virtual tensors originating from a tensor stored in global
    memory, `duplicate()` simply reloads the tensor a second time from
    global memory.

  - For virtual tensors which originate from arbitrary instructions,
    duplication would require a more complex procedure, potentially
    duplicating multiple instructions or storing intermediate results
    in memory. Neither duplication of instructions, nor temporary
    buffers are currently supported by the Telamon
    infrastructure. Therefore, `duplicate()` panics in these cases.

A complete implementation covering all cases is left for a redesign of
the code for kernel composition.
…tion

This patch adds two kernel composition functions `tensor_activate` and
`array_activate_inplace` that apply an optional activation function to
a VirtualTensor and an `ndarray::Array`, respectively.

By using an `Option<ActivationFunction>` rather than an
`ActivationFunction`, the case distinction between a function being
present and no function being specified can be eliminated from the
kernel compositor, which allows for more compact kernel
specifications.

E.g., instead of:

  if let Some(activation_fun) = &activation_fun_opt {
      let res = activation_fun.apply::<S>(..., &tmp);
      res.store(...);
  } else {
      tmp.store(...);
  }

the compositor can simply do:

  let res = tensor_activate::<S>(..., tmp, &activation_fun_opt);
  res.store(...);
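The pattern can be sketched with plain scalar functions in place of Telamon's virtual-tensor API; the types below are illustrative stand-ins, not the actual telamon definitions:

```rust
// Illustrative stand-ins for the real telamon types.
#[derive(Clone, Copy)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

impl ActivationFunction {
    fn apply(&self, x: f64) -> f64 {
        match self {
            ActivationFunction::ReLU => x.max(0.0),
            ActivationFunction::Sigmoid => 1.0 / (1.0 + (-x).exp()),
        }
    }
}

// With an `Option`, the `None` case degenerates to the identity, so the
// caller no longer needs an `if let Some(..)` branch.
fn activate(x: f64, fun: &Option<ActivationFunction>) -> f64 {
    match fun {
        Some(f) => f.apply(x),
        None => x,
    }
}
```

The branch on the presence of an activation function is thus pushed into one helper instead of being repeated at every store site.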
This adds a new kernel `ResNetCell` implementing one cell of ResNet
[1]. The kernel computes:

  O = activation(activation(A.B).C) + A

where `A`, `B` and `C` are matrices and `activation` an activation
function (ReLU, Sigmoid or identity).

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: "Deep Residual
    Learning for Image Recognition". Available online at:
    https://arxiv.org/abs/1512.03385, accessed 07/2019.
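As a sanity reference, the formula can be checked against a naive scalar implementation using plain nested `Vec`s instead of Telamon tensors; ReLU stands in here for the activation parameter:

```rust
// Naive helper; the real kernel expresses the matrix products with
// Telamon virtual tensors.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// O = activation(activation(A.B).C) + A
fn resnet_cell(a: &[Vec<f64>], b: &[Vec<f64>], c: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut ab = matmul(a, b);
    ab.iter_mut()
        .for_each(|row| row.iter_mut().for_each(|x| *x = relu(*x)));
    let mut o = matmul(&ab, c);
    for (row_o, row_a) in o.iter_mut().zip(a) {
        for (x, &y) in row_o.iter_mut().zip(row_a) {
            *x = relu(*x) + y;
        }
    }
    o
}
```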
This patch implements `Display` for `ActivationFunction`, as well as
two convenience functions `ActivationFunction::opt_to_display` (the
equivalent of `Display` for `Option<ActivationFunction>`) and
`ActivationFunction::opt_from_string`, which converts a string into an
`Option<ActivationFunction>`.
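A sketch of what these conversions could look like; mapping the string `identity` to `None` is an assumption here, consistent with the kernels taking an `Option<ActivationFunction>` and the CLI accepting `identity`, `relu`, and `sigmoid`:

```rust
use std::fmt;

// Illustrative stand-in for the real telamon type.
#[derive(Debug, PartialEq)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

impl fmt::Display for ActivationFunction {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ActivationFunction::ReLU => write!(f, "relu"),
            ActivationFunction::Sigmoid => write!(f, "sigmoid"),
        }
    }
}

// Sketch of `opt_from_string`: `identity` maps to `None` (assumed).
fn opt_from_string(s: &str) -> Result<Option<ActivationFunction>, String> {
    match s {
        "identity" => Ok(None),
        "relu" => Ok(Some(ActivationFunction::ReLU)),
        "sigmoid" => Ok(Some(ActivationFunction::Sigmoid)),
        other => Err(format!("unknown activation function: {}", other)),
    }
}
```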
This patch adds the `ResNetCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcell_M_N_K_A` for `--kernel`, where `M`, `N`, and
`K` are positive integers defining the size of the matrices processed
by the kernel and `A` is an activation function, either `identity`,
`relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel resnetcell_1024_1024_1024_identity
… place

This patch adds the composition function `array_softmax_inplace()`,
that updates each element of an n-dimensional array with its value
according to the softmax operation.
This patch adds the kernel composition function
`tensor_elementwise_div()`, dividing each element of a tensor by a
scalar value and returning a new virtual tensor with the result.
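The softmax update can be sketched on a flat slice instead of an `ndarray::Array` (the function name here is illustrative, not the actual composition function's signature):

```rust
// In-place softmax over all elements:
// softmax(x_i) = exp(x_i) / sum_j exp(x_j)
fn softmax_inplace(xs: &mut [f64]) {
    for x in xs.iter_mut() {
        *x = x.exp();
    }
    let sum: f64 = xs.iter().sum();
    for x in xs.iter_mut() {
        *x /= sum;
    }
}
```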
This adds a new kernel `TransformerCell` implementing a single scaled
dot product attention operation of [1]. The kernel computes:

  O = softmax(scale(Q.K)).V

where `Q`, `K` and `V` are matrices.

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
    Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention
    Is All You Need". Available online at:
    https://arxiv.org/abs/1706.03762, accessed 07/2019.
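A naive scalar reference for the kernel formula. Two assumptions are made here: the scaling factor is 1/sqrt(N) as in the attention paper, and softmax normalizes over the whole matrix, consistent with the scalar sum used by the split kernels later in this series:

```rust
// Naive helper; the real kernel uses Telamon virtual tensors.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

// O = softmax(scale(Q.K)).V
fn transformer_cell(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let scale = 1.0 / (k.len() as f64).sqrt(); // assumed scaling factor
    let mut qk = matmul(q, k);
    for row in qk.iter_mut() {
        for x in row.iter_mut() {
            *x = (*x * scale).exp();
        }
    }
    let s: f64 = qk.iter().flat_map(|row| row.iter()).sum();
    for row in qk.iter_mut() {
        for x in row.iter_mut() {
            *x /= s;
        }
    }
    matmul(&qk, v)
}
```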
…n-cli

This patch adds the `TransformerCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `transformercell_M_N_P_R` for `--kernel`, where `M`, `N`,
`P` and `R` are positive integers defining the sizes of the matrices
processed by the kernel, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel transformercell_1024_1024_1024_1024
The function `helper::Tensor::read_to_host()` assumes that arrays have
at least one dimension and thus fails for scalar values represented by
0-dimensional arrays. This patch relaxes this requirement and allows
0-dimensional arrays to be read to the host.
…softmax

This adds a new kernel `TransformerCellTopHalf`, implementing the
operations of `TransformerCell` up to the calculation of the scalar
sum of the softmax operation, but excluding the element-wise division
of the softmax operation and excluding the final multiplication with
the matrix of values. That is, the kernel calculates:

  O = elementwise_exp(scale(Q.K))
  s = scalar_sum(O)

The same results as for `TransformerCell` can thus be obtained by
dividing the elements of `O` by `s` and by multiplying the result with
the value matrix.
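In the same scalar style as above, the top half can be sketched as follows (again assuming a 1/sqrt(N) scaling factor, which the commit message leaves unspecified):

```rust
// Naive helper, repeated for self-containment.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

// O = elementwise_exp(scale(Q.K)), s = scalar_sum(O)
fn transformer_cell_top_half(q: &[Vec<f64>], k: &[Vec<f64>]) -> (Vec<Vec<f64>>, f64) {
    let scale = 1.0 / (k.len() as f64).sqrt(); // assumed scaling factor
    let mut o = matmul(q, k);
    for row in o.iter_mut() {
        for x in row.iter_mut() {
            *x = (*x * scale).exp();
        }
    }
    let s: f64 = o.iter().flat_map(|row| row.iter()).sum();
    (o, s)
}
```

Dividing each element of `O` by `s` and multiplying the result with the value matrix then yields the full kernel's result.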
… telamon-cli

This patch adds the `TransformerCellTopHalf` kernel to `cuda_search`
of `telamon-cli`. To launch the new kernel with cuda_search, use a
string of the form `transformercelltophalf_M_N_P` for `--kernel`,
where `M`, `N`, and `P` are positive integers defining the sizes of
the matrices processed by the kernel, e.g.,

      $ cd telamon-cli
      $ cargo +nightly run --bin cuda_search --release -- \
        --kernel transformercelltophalf_1024_1024_1024
… softmax

This adds a new kernel `TransformerCellBottomHalf`, implementing the
operations of `TransformerCell` starting from the division of the
elements of the temporary matrix by their sum when applying the
softmax operation. That is, the kernel calculates:

  O = (1 / s_exp) * QK_SCEXP . V

The same results as for `TransformerCell` can thus be obtained by
first applying `TransformerCellTopHalf` to matrices `Q`, `K` and `V`
and by passing the results to `TransformerCellBottomHalf`.
… of telamon-cli

This patch adds the `TransformerCellBottomHalf` kernel to
`cuda_search` of `telamon-cli`. To launch the new kernel with
cuda_search, use a string of the form
`transformercellbottomhalf_M_N_R` for `--kernel`, where `M`, `N`, and
`R` are positive integers defining the sizes of the matrices processed
by the kernel, e.g.,

      $ cd telamon-cli
      $ cargo +nightly run --bin cuda_search --release -- \
        --kernel transformercellbottomhalf_1024_1024_1024
…plication

This adds a new kernel `ResNetCellTopHalf`, implementing the
operations of `ResNetCell` before the second matrix
multiplication. That is, the kernel calculates:

  O = activation(A.B)

The same results as for `ResNetCell` can thus be obtained by
multiplying the result with the third input matrix of `ResNetCell`,
applying the activation function to the result and by adding the first
input matrix `A`.
…mon-cli

This patch adds the `ResNetCellTopHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcelltophalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
    --kernel resnetcelltophalf_1024_1024_1024_relu
…lication

This adds a new kernel `ResNetCellBottomHalf`, implementing the
operations of `ResNetCell` starting with the second matrix
multiplication. That is, the kernel calculates:

  O = activation(ACTAB.C)+A

The same results as for `ResNetCell` can thus be obtained by applying
`ResNetCellTopHalf` to the matrices `A` and `B` and an activation
function, followed by an invocation of `ResNetCellBottomHalf` with the
result from `ResNetCellTopHalf`, a matrix `C` and the original matrix
`A`.
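The recomposition described above can be checked with a scalar sketch of the two halves (plain `Vec` matrices instead of Telamon tensors; ReLU stands in for the activation parameter):

```rust
// Naive helper, repeated for self-containment.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out = vec![vec![0.0; b[0].len()]; a.len()];
    for i in 0..a.len() {
        for k in 0..b.len() {
            for j in 0..b[0].len() {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// Top half: ACTAB = activation(A.B)
fn resnet_top_half(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut ab = matmul(a, b);
    for row in ab.iter_mut() {
        for x in row.iter_mut() {
            *x = relu(*x);
        }
    }
    ab
}

// Bottom half: O = activation(ACTAB.C) + A
fn resnet_bottom_half(actab: &[Vec<f64>], c: &[Vec<f64>], a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut o = matmul(actab, c);
    for (row_o, row_a) in o.iter_mut().zip(a) {
        for (x, &y) in row_o.iter_mut().zip(row_a) {
            *x = relu(*x) + y;
        }
    }
    o
}
```

Chaining the two halves reproduces the full `ResNetCell` formula `activation(activation(A.B).C) + A`.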
…elamon-cli

This patch adds the `ResNetCellBottomHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcellbottomhalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,

  $ cd telamon-cli
  $ cargo +nightly run --bin cuda_search --release -- \
--kernel resnetcellbottomhalf_1024_1024_1024_relu