Conversation
#[derive(Clone, Deserialize, Serialize)]
pub struct ResNetCellBottomHalfP {
A more explicit name (in NN terminology) could be FusedDenseBias -- this is a fused operation performing activation(a.b) + c, which corresponds to a fused dense layer (activation(a.b)) and bias layer (which is a tensor addition).
I like the more meaningful name. However I'd like to keep a naming scheme that expresses the split of ResNetCell. Alternatively, we could use more meaningful names for the split kernels and just add aliases in cuda_search of telamon-cli. I'm open for suggestions.
To me FusedDenseBias is much more explicit than ResNetBottomHalf, and IMO would be as well for anyone with NN experience (the equivalent operation in tensorflow is called fused_bias_conv2d or something similar IIRC). The point is that this fused layer is used/useful in more than only resnet cells.
OK, then let's completely eliminate the relationship of the smaller kernels with TransformerCell and ResNetCell and choose meaningful names for the *{Top,Bottom}Half kernels:
- Remove `ResNetCellTopHalf` entirely, since `FusedMM` is already sufficient
- Rename `ResNetBottomHalf` to `FusedDenseBias`
- Rename `TransformerCellBottomHalf` to `ScaledMM`

Any suggestions for `TransformerCellTopHalf`?
When composing an expression using virtual tensors, it might be
necessary to reference an intermediate result more than once. For
example, when normalizing a vector, its values are referenced first by
a reduction and a second time when each element of the vector is
divided by the result of the reduction.
When a virtual tensor is simply reused in multiple subexpressions, the
exact same dimensions of the virtual tensor may be used by multiple
instructions, with specific ordering constraints ensuring the correct
order of the calculation. This, in turn, can lead to cyclic
dependencies in the ordering of the virtual tensor dimensions.
For example, in the above-mentioned normalization, the dimensions of
the virtual tensor representing the vector are used both by the
reduction and the division. For correct results, the division
instruction must be executed only once the reduction is completed,
which requires that the reduction dimensions are placed before the
division. However, the division also iterates over the dimensions of
the virtual tensor. The dimensions of the virtual tensor would thus be
required to be placed before themselves and Telamon would fail with
unsatisfiable constraints.
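The normalization example above can be written as plain scalar code; the two uses of the input vector are exactly the ones that conflict when they share virtual-tensor dimensions. A minimal sketch in ordinary Rust (no Telamon API involved):

```rust
// Scalar reference for the normalization example: the input vector `v`
// is read twice -- once by the reduction (the sum) and once by the
// element-wise division. In Telamon, both reads would reuse the same
// virtual-tensor dimensions, producing the cyclic ordering constraint
// described above; `duplicate()` gives the second read its own dimensions.
fn normalize(v: &[f64]) -> Vec<f64> {
    // First use: reduction over the vector's dimension.
    let sum: f64 = v.iter().sum();
    // Second use: each element is divided by the reduction result, so
    // the division must be ordered after the complete reduction.
    v.iter().map(|x| x / sum).collect()
}

fn main() {
    println!("{:?}", normalize(&[1.0, 3.0]));
}
```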
This patch introduces a new function `duplicate()` for virtual
tensors, creating a new virtual tensor with identical values, but with
a new set of dimensions. This function is currently implemented only
partially:
- For virtual tensors originating from a tensor stored in global
memory, `duplicate()` simply reloads the tensor a second time from
global memory.
- For virtual tensors which originate from arbitrary instructions,
duplication would require a more complex procedure, potentially
duplicating multiple instructions or storing intermediate results
in memory. Neither duplication of instructions, nor temporary
buffers are currently supported by the Telamon
infrastructure. Therefore, `duplicate()` panics in these cases.
A complete implementation covering all cases is left for a redesign of
the code for kernel composition.
…tion
This patch adds two kernel composition functions `tensor_activate` and
`array_activate_inplace` that apply an optional activation function to
a VirtualTensor and an `ndarray::Array`, respectively.
By using an `Option<ActivationFunction>` rather than an
`ActivationFunction`, the distinction of cases function present / no
function specified can be eliminated from the kernel compositor, which
allows for more compact kernel specifications.
E.g., instead of:
    if let Some(activation_fun) = &activation_fun_opt {
        let res = activation_fun.apply::<S>(..., &tmp);
        res.store(...);
    } else {
        tmp.store(...);
    }
the compositor can simply do:
    let res = tensor_activate::<S>(..., tmp, &activation_fun_opt);
    res.store(...);
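The `Option`-based pattern can be illustrated with plain scalar code (the enum and function names below are a hypothetical sketch, not the actual Telamon API): folding the "no function specified" case into the helper removes the branch from every caller.

```rust
// Hypothetical sketch of the Option-based activation pattern: `None`
// stands for "no activation function specified" and acts as identity.
#[derive(Clone, Copy)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

fn activate(x: f64, fun: &Option<ActivationFunction>) -> f64 {
    match fun {
        Some(ActivationFunction::ReLU) => x.max(0.0),
        Some(ActivationFunction::Sigmoid) => 1.0 / (1.0 + (-x).exp()),
        None => x, // no function specified: identity
    }
}

fn main() {
    println!("{}", activate(-2.0, &Some(ActivationFunction::ReLU)));
    println!("{}", activate(-2.0, &None));
}
```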
This adds a new kernel `ResNetCell` implementing one cell of ResNet
[1]. The kernel computes:
O = activation(activation(A.B).C) + A
where `A`, `B` and `C` are matrices and `activation` an activation
function (ReLU, Sigmoid or identity).
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: "Deep Residual
Learning for Image Recognition". Available online at:
https://arxiv.org/abs/1512.03385, accessed 07/2019.
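The formula can be checked against a scalar reference implementation. The sketch below is not the Telamon kernel; it uses ReLU as the activation and flat row-major slices for the n x n matrices:

```rust
// Scalar reference for the ResNetCell formula:
//   O = activation(activation(A.B).C) + A
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut o = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                o[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    o
}

fn relu_inplace(m: &mut [f64]) {
    for x in m.iter_mut() {
        *x = x.max(0.0);
    }
}

fn resnet_cell(a: &[f64], b: &[f64], c: &[f64], n: usize) -> Vec<f64> {
    let mut ab = matmul(a, b, n);
    relu_inplace(&mut ab); // activation(A.B)
    let mut abc = matmul(&ab, c, n);
    relu_inplace(&mut abc); // activation(activation(A.B).C)
    abc.iter().zip(a).map(|(x, y)| x + y).collect() // residual: ... + A
}

fn main() {
    println!("{:?}", resnet_cell(&[2.0], &[3.0], &[0.5], 1));
}
```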
This patch implements `Display` for `ActivationFunction`, as well as two convenience functions: `ActivationFunction::opt_to_display` (the equivalent of `Display` for `Option<ActivationFunction>`) and `ActivationFunction::opt_from_string`, which converts a string into an `Option<ActivationFunction>`.
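The helpers might look like the following sketch; the exact signatures and error handling in the patch may differ, and `None` plays the role of the identity function:

```rust
use std::fmt;

// Hypothetical sketch of `Display` plus the two convenience helpers.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ActivationFunction {
    ReLU,
    Sigmoid,
}

impl fmt::Display for ActivationFunction {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ActivationFunction::ReLU => write!(f, "relu"),
            ActivationFunction::Sigmoid => write!(f, "sigmoid"),
        }
    }
}

impl ActivationFunction {
    // Equivalent of `Display` for `Option<ActivationFunction>`.
    fn opt_to_display(opt: &Option<ActivationFunction>) -> String {
        match opt {
            Some(fun) => fun.to_string(),
            None => "identity".to_string(),
        }
    }

    // Parses a string into an `Option<ActivationFunction>`; unknown
    // names panic in this sketch.
    fn opt_from_string(s: &str) -> Option<ActivationFunction> {
        match s {
            "identity" => None,
            "relu" => Some(ActivationFunction::ReLU),
            "sigmoid" => Some(ActivationFunction::Sigmoid),
            _ => panic!("unknown activation function: {}", s),
        }
    }
}

fn main() {
    println!("{}", ActivationFunction::opt_to_display(&Some(ActivationFunction::ReLU)));
}
```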
This patch adds the `ResNetCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcell_M_N_K_A` for `--kernel`, where `M`, `N`, and
`K` are positive integers defining the size of the matrices processed
by the kernel and `A` is an activation function, either `identity`,
`relu` or `sigmoid`, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel resnetcell_1024_1024_1024_identity
… place This patch adds the composition function `array_softmax_inplace()`, which updates each element of an n-dimensional array with its value under the softmax operation.
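The in-place update can be sketched on a flat slice (the actual function operates on an `ndarray::Array` of arbitrary dimension; this version only illustrates the arithmetic):

```rust
// In-place softmax sketch: x_i <- exp(x_i) / sum_j exp(x_j).
fn softmax_inplace(xs: &mut [f64]) {
    for x in xs.iter_mut() {
        *x = x.exp();
    }
    let sum: f64 = xs.iter().sum();
    for x in xs.iter_mut() {
        *x /= sum;
    }
}

fn main() {
    let mut v = vec![1.0, 2.0, 3.0];
    softmax_inplace(&mut v);
    println!("{:?}", v);
}
```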
This patch adds the kernel composition function `tensor_elementwise_div()`, dividing each element of a tensor by a scalar value and returning a new virtual tensor with the result.
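Reduced to scalar arithmetic, the operation is a plain element-wise division (the actual function takes and returns a virtual tensor):

```rust
// Sketch of the element-wise division of a tensor by a scalar.
fn elementwise_div(xs: &[f64], d: f64) -> Vec<f64> {
    xs.iter().map(|x| x / d).collect()
}

fn main() {
    println!("{:?}", elementwise_div(&[2.0, 4.0], 2.0));
}
```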
This adds a new kernel `TransformerCell` implementing a single scaled
dot product attention operation of [1]. The kernel computes:
O = softmax(scale(Q.K)).V
where `Q`, `K` and `V` are matrices.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin: "Attention
Is All You Need". Available online at:
https://arxiv.org/abs/1706.03762, accessed 07/2019.
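A scalar reference for the formula, as a sketch rather than the kernel itself: it assumes a scale factor of 1/sqrt(n) and a softmax normalized by the scalar sum over all elements, with n x n row-major matrices:

```rust
// Scalar reference for the TransformerCell formula:
//   O = softmax(scale(Q.K)).V
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut o = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                o[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    o
}

fn transformer_cell(q: &[f64], k: &[f64], v: &[f64], n: usize) -> Vec<f64> {
    let mut qk = matmul(q, k, n);
    let scale = 1.0 / (n as f64).sqrt(); // assumed scale factor
    for x in qk.iter_mut() {
        *x = (*x * scale).exp();
    }
    let s: f64 = qk.iter().sum();
    for x in qk.iter_mut() {
        *x /= s; // softmax normalized by the scalar sum
    }
    matmul(&qk, v, n)
}

fn main() {
    println!("{:?}", transformer_cell(&[2.0], &[3.0], &[4.0], 1));
}
```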
…n-cli
This patch adds the `TransformerCell` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `transformercell_M_N_P_R` for `--kernel`, where `M`, `N`,
`P` and `R` are positive integers defining the sizes of the matrices
processed by the kernel, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel transformercell_1024_1024_1024_1024
The function `helper::Tensor::read_to_host()` assumes that arrays have at least one dimension and thus fails for scalar values represented by 0-dimensional arrays. This patch relaxes this requirement and allows 0-dimensional arrays to be read to the host.
…softmax This adds a new kernel `TransformerCellTopHalf`, implementing the operations of `TransformerCell` up to the calculation of the scalar sum of the softmax operation, but excluding the element-wise division of the softmax operation and excluding the final multiplication with the matrix of values. That is, the kernel calculates:
O = elementwise_exp(scale(Q.K))
s = scalar_sum(O)
The same results as for `TransformerCell` can thus be obtained by dividing the elements of `O` by `s` and by multiplying the result with the value matrix.
… telamon-cli
This patch adds the `TransformerCellTopHalf` kernel to `cuda_search`
of `telamon-cli`. To launch the new kernel with cuda_search, use a
string of the form `transformercelltophalf_M_N_P` for `--kernel`,
where `M`, `N`, and `P` are positive integers defining the sizes of
the matrices processed by the kernel, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel transformercelltophalf_1024_1024_1024
… softmax This adds a new kernel `TransformerCellBottomHalf`, implementing the operations of `TransformerCell` starting from the division of the elements of the temporary matrix by their sum when applying the softmax operation. That is, the kernel calculates:
O = (1 / s_exp) * QK_SCEXP . V
The same results as for `TransformerCell` can thus be obtained by first applying `TransformerCellTopHalf` to the matrices `Q`, `K` and `V` and by passing the results to `TransformerCellBottomHalf`.
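The split can be sketched in scalar code (flat row-major n x n matrices, an assumed 1/sqrt(n) scale factor; not the actual Telamon kernels). Composing the two stages reproduces the full TransformerCell result:

```rust
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut o = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                o[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    o
}

// TopHalf: QK_SCEXP = elementwise_exp(scale(Q.K)) plus its scalar sum.
fn transformer_top_half(q: &[f64], k: &[f64], n: usize) -> (Vec<f64>, f64) {
    let scale = 1.0 / (n as f64).sqrt(); // assumed scale factor
    let qk_scexp: Vec<f64> = matmul(q, k, n)
        .into_iter()
        .map(|x| (x * scale).exp())
        .collect();
    let s_exp: f64 = qk_scexp.iter().sum();
    (qk_scexp, s_exp)
}

// BottomHalf: O = (1 / s_exp) * QK_SCEXP . V
fn transformer_bottom_half(qk_scexp: &[f64], s_exp: f64, v: &[f64], n: usize) -> Vec<f64> {
    let normalized: Vec<f64> = qk_scexp.iter().map(|x| x / s_exp).collect();
    matmul(&normalized, v, n)
}

fn main() {
    let (qk_scexp, s_exp) = transformer_top_half(&[2.0], &[3.0], 1);
    println!("{:?}", transformer_bottom_half(&qk_scexp, s_exp, &[4.0], 1));
}
```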
… of telamon-cli
This patch adds the `TransformerCellBottomHalf` kernel to
`cuda_search` of `telamon-cli`. To launch the new kernel with
cuda_search, use a string of the form
`transformercellbottomhalf_M_N_R` for `--kernel`, where `M`, `N`, and
`R` are positive integers defining the sizes of the matrices processed
by the kernel, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel transformercellbottomhalf_1024_1024_1024
…plication This adds a new kernel `ResNetCellTopHalf`, implementing the operations of `ResNetCell` before the second matrix multiplication. That is, the kernel calculates:
O = activation(A.B)
The same results as for `ResNetCell` can thus be obtained by multiplying the result with the third input matrix of `ResNetCell`, applying the activation function to the result and by adding the first input matrix `A`.
…mon-cli
This patch adds the `ResNetCellTopHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcelltophalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel resnetcelltophalf_1024_1024_1024_relu
…lication This adds a new kernel `ResNetCellBottomHalf`, implementing the operations of `ResNetCell` starting with the second matrix multiplication. That is, the kernel calculates:
O = activation(ACTAB.C) + A
The same results as for `ResNetCell` can thus be obtained by applying `ResNetCellTopHalf` to the matrices `A` and `B` and an activation function, followed by an invocation of `ResNetCellBottomHalf` with the result from `ResNetCellTopHalf`, a matrix `C` and the original matrix `A`.
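The ResNetCell split can likewise be sketched in scalar code (flat row-major n x n matrices, ReLU as the activation; not the actual Telamon kernels). Composing the two stages reproduces the full ResNetCell result:

```rust
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut o = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                o[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
    o
}

fn relu_inplace(m: &mut [f64]) {
    for x in m.iter_mut() {
        *x = x.max(0.0);
    }
}

// TopHalf: ACTAB = activation(A.B)
fn resnet_top_half(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut ab = matmul(a, b, n);
    relu_inplace(&mut ab);
    ab
}

// BottomHalf: O = activation(ACTAB.C) + A
fn resnet_bottom_half(actab: &[f64], c: &[f64], a: &[f64], n: usize) -> Vec<f64> {
    let mut oc = matmul(actab, c, n);
    relu_inplace(&mut oc);
    oc.iter().zip(a).map(|(x, y)| x + y).collect()
}

fn main() {
    let actab = resnet_top_half(&[2.0], &[3.0], 1);
    println!("{:?}", resnet_bottom_half(&actab, &[0.5], &[2.0], 1));
}
```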
…elamon-cli
This patch adds the `ResNetCellBottomHalf` kernel to `cuda_search` of
`telamon-cli`. To launch the new kernel with cuda_search, use a string
of the form `resnetcellbottomhalf_M_N_K_A` for `--kernel`, where
`M`, `N`, and `K` are positive integers defining the sizes of the
matrices processed by the kernel and `A` is an activation function,
either `identity`, `relu` or `sigmoid`, e.g.,
$ cd telamon-cli
$ cargo +nightly run --bin cuda_search --release -- \
--kernel resnetcellbottomhalf_1024_1024_1024_relu
This series of commits adds two new ML kernels, `ResNetCell` and
`TransformerCell`. Additionally, for each of these kernels, two further kernels are added that result from splitting the original kernel into two stages (`ResNetCellTopHalf`, `ResNetCellBottomHalf`, `TransformerCellTopHalf`, and `TransformerCellBottomHalf`).