Combine DMA loads#146
Conversation
bb55411 to
1d4d720
Compare
doe300
left a comment
There was a problem hiding this comment.
Just some high-level comments for now
| { | ||
| namespace optimizations | ||
| { | ||
| class ValueExpr |
There was a problem hiding this comment.
Is there a specific reason you did not use Expression here?
There was a problem hiding this comment.
I didn't know this class. I'll try to use it.
| auto kernels = module.getKernels(); | ||
| for(Method* kernelFunc : kernels) | ||
| { | ||
| optimizations::combineDMALoads(module, *kernelFunc, config); |
There was a problem hiding this comment.
Does this code have to be run before the normalization steps (e.g. before the memory accesses are rewritten)?
There was a problem hiding this comment.
Yes. It is easy to combine vloadn, but a function inlineMethods replaces vloadn. So it should be run before inlineMethods. (I don't think it's not appropriate that an optimization process is in normalization steps.)
There was a problem hiding this comment.
Can you please add this to a comment
|
To fix the build error, you will need to rebase on the latest master, I kind of screwed up there... |
d21bd2b to
c1a1378
Compare
|
@doe300 I have a question. I want to create a variable typed auto in = assign(inIt, DataType(TYPE_INT8.getPointerType()), "%in") = UNIFORM_REGISTER; |
|
Well, This is necessary, since "complex types" require additional data, which is stored not in the type object itself, but in the module (think like a symbol table for these data types) while all information for "simple types" is stored in the type object itself. |
eefdda8 to
9704c72
Compare
56521ff to
c2e2e31
Compare
583cd2a to
998e468
Compare
|
Evaluation of peformance
Speed-up of about 16%. #define I(x, y) ((y) * width / 4 + (x))
__kernel void stencil(int width, int height, __global uchar *in, __global uchar *out)
{
for (int x = 0; x < width / 4; x++) {
for (int y = 1; y < height - 1; y++) {
uchar4 left4 = (x > 0) ? vload4(I(x, y) * 4 - 1, in) : (uchar4)(0);
uchar4 right4 = (x < (width / 4 - 1)) ? vload4(I(x, y) * 4 + 4, in) : (uchar4)(0);
size_t idx = I(x, y);
size_t yy = 800 / 4;
uchar16 up = vload16(idx - yy, in);
uchar16 center = vload16(idx, in);
uchar16 down = vload16(idx + yy, in);
uchar16 left = (uchar16)(left4, center.s01234567, center.s89AB);
uchar16 right = (uchar16)(center.s456789AB, center.sCDEF, right4);
uchar16 r = (
up / (uchar16)(5) +
left / (uchar16)(5) +
center / (uchar16)(5) +
right / (uchar16)(5) +
down / (uchar16)(5));
vstore16(r, I(x, y), out);
}
}
} |
|
@doe300 The work is finished. Please review changes. |
| // A fake operation to indicate an unsigned multiplication | ||
| static constexpr OpCode FAKEOP_UMUL{"umul", 132, 132, 2, false, false, FlagBehavior::NONE}; | ||
|
|
||
| static constexpr OpCode FAKEOP_MUL{"mul", 132, 132, 2, false, false, FlagBehavior::NONE}; |
There was a problem hiding this comment.
Are these signed or unsigned? Can you please name them correctly and add a comment?
| auto kernels = module.getKernels(); | ||
| for(Method* kernelFunc : kernels) | ||
| { | ||
| optimizations::combineDMALoads(module, *kernelFunc, config); |
There was a problem hiding this comment.
Can you please add this to a comment
| } | ||
| } | ||
|
|
||
| SubExpression iiToExpr(const Value& value, const LocalUser* inst) |
There was a problem hiding this comment.
What is an ii? Please make the function name more expressive
| } | ||
| }; | ||
|
|
||
| void expandExpression(const SubExpression& subExpr, ExpandedExprs& expanded) |
There was a problem hiding this comment.
Is there a reason why you don't return the result, but instead have an output-parameter?
| } | ||
| else | ||
| { | ||
| expanded.push_back( |
There was a problem hiding this comment.
Use emplace_back instead of push_back, same below
| { | ||
| // TODO: gather these instructions in one mutex lock | ||
| it = method.vpm->insertLockMutex(it, true); | ||
| assign(it, output) = VPM_IO_REGISTER; |
There was a problem hiding this comment.
If we use the "scratch" area, then this does not work, since the "scratch" area might be overridden by another QPU. For this to work, we would need to use an own VPM area per QPU.
| } | ||
|
|
||
| VPRGenericSetup VPMArea::toReadSetup(DataType elementType, uint8_t numRows) const | ||
| VPRGenericSetup VPMArea::toReadSetup(DataType elementType/*, uint8_t numRows*/) const |
| canBePackedIntoRow() ? 1 : static_cast<uint8_t>(TYPE_INT32.getScalarBitCount() / type.getScalarBitCount()); | ||
| VPRGenericSetup setup(getVPMSize(type), stride, numRows, calculateQPUSideAddress(type, rowOffset, 0)); | ||
|
|
||
| if (numRows_ >= 16) numRows_ = 1; |
There was a problem hiding this comment.
I think this is an error which we can't simply rewrite, but need to fail on. Or am I mistaken?
| periphery::VPRDMASetup expectedDMASetup(dmaSetupMode, vectorType.getVectorWidth() % 16, numOfLoads, vpitch, 0); | ||
| periphery::VPRGenericSetup expectedVPRSetup(vprSize, vprStride, numOfLoads, 0); | ||
|
|
||
| inputMethod.dumpInstructions(); |
|
|
||
| testCombineDMALoadsSub(module, inputMethod, config, Float16); | ||
| } | ||
|
|
There was a problem hiding this comment.
Please also add some negative tests, e.g. vloadn instructions accessing different sources or with different offsets
Implemented a combiner of DNA loads (see #144)
TODO
ValueExprand related functions.vloadnfunctions thanvload16.Support variable offsets at first argument ofvloadn.MPITCHB). However there are no method the compiler to know the real offset value.Expressioninstead ofValueExpr.Example
Three loads (
vload16) are combined to one load bycombineDMALoads.(A code without
combineDMALoads)