diff --git a/md.4 b/md.4
index 2574c37e..fbf102ac 100644
--- a/md.4
+++ b/md.4
@@ -352,6 +352,165 @@
 transient. The list of faulty sectors can be flushed, and the active list of
 failure modes can be cleared.
 
+.SS HOW MD READS/WRITES DEPENDING ON THE LEVEL AND CHUNK SIZE
+
+The following explains how MD reads/writes data depending on the MD\ level;
+\fIespecially how many bytes are consecutively read/written fully at once
+from/to the underlying device(s)\fP.
+.br
+Further block layers below MD may influence and change this of course.
+
+Generally, the number of bytes read/written is \fIindependent of the chunk
+size\fP.
+
+.TP
+.B LINEAR
+Reads/writes as many bytes as requested by the block layer (for example MD,
+dm-crypt, LVM or a filesystem) above MD.
+
+As data is neither striped nor mirrored in chunks over the devices, no IO
+distribution takes place on reads/writes.
+
+There is no resynchronisation nor can the MD be degraded.
+.PP
+
+.TP
+.B RAID0
+Reads/writes as many bytes as requested by the block layer (for example MD,
+dm-crypt, LVM or a filesystem) above MD \fIup to the chunk size\fP (obviously,
+if any of the block layers above is not aligned with MD, even less will at most
+be read/written).
+
+As data is striped in chunks over the devices, IO distribution takes place on
+reads/writes.
+
+There is no resynchronisation nor can the MD be degraded.
+.PP
+
+.TP
+.B RAID1
+Reads/writes as many bytes as requested by the block layer (for example MD,
+dm-crypt, LVM or a filesystem) above MD.
+
+As data is mirrored over the devices, IO distribution takes place on reads, with
+MD trying to heuristically select the optimal device (for example that with the
+minimum seek time).
+.br
+On writes, data must be written to all the devices, though.
+
+On resynchronisation, data will be read from the “first” usable device (that is
+the device with the lowest role number that has not failed) and written to all
+those that need to be synchronised (there is no IO distribution).
+
+When degraded, failed devices won’t be used for reads/writes.
+.PP
+
+.TP
+.B RAID10
+Reads/writes as many bytes as requested by the block layer (for example MD,
+dm-crypt, LVM or a filesystem) above MD \fIup to the chunk size\fP (obviously,
+if any of the block layers above is not aligned with MD, even less will at most
+be read/written).
+
+As data is mirrored over some of the devices and also striped in chunks over some
+of the devices, IO distribution takes place on reads, with MD trying to
+heuristically select the optimal device (for example that with the minimum seek
+time).
+.br
+On writes, data must be written to all of the respectively mirrored devices,
+though.
+
+On resynchronisation, data will be read from the “first” usable device (that is
+the device with the lowest role number that holds the data and that has not
+failed) and written to all those that need to be synchronised (there is no IO
+distribution).
+
+When degraded, failed devices won’t be used for reads/writes.
+.PP
+
+.TP
+.B RAID4, RAID5, and RAID6
+\fIWhen not degraded on reads\fP:
+.br
+Reads as many bytes as requested by the block
+layer (for example MD, dm-crypt, LVM or a filesystem) above MD \fIup to the
+chunk size\fP (obviously, if any of the block layers above is not aligned with
+MD, even less will at most be read).
+.br
+\fIWhen degraded on reads\fP \fBor\fP \fIalways on writes\fP:
+.br
+Reads/writes \fIgenerally\fP in blocks of \fBPAGE_SIZE\fP (hoping that block
+layers below MD will optimise this).
+
+\fIWhen not degraded\fP:
+As data is striped in chunks over the devices, IO distribution takes place on
+reads (using the different data chunks but not the parity chunk(s)).
+.br
+On writes, data and parity must be written to the respective devices (that is
+1\ device with the respective data chunk and 1\ (in case of RAID4 or RAID5) or
+2\ (in case of RAID6) device(s) with the respective parity chunk(s)). These
+writes but also any necessary reads are done in blocks of \fBPAGE_SIZE\fP.
+.br
+\fIWhen degraded or on resynchronisation\fP:
+Failed devices won’t be used for reads/writes.
+.br
+In order to read from within a failed data chunk, the respective blocks of
+\fBPAGE_SIZE\fP are read from all the other corresponding data and parity chunks
+and the failed data is calculated from these.
+.br
+Resynchronising works analogously with the addition of writing the missing data
+or parity, which happens again in blocks of \fBPAGE_SIZE\fP.
+.PP
+
+
+.TP
+.B Chunk Size
+The chunk size has no effect on the non-striped levels LINEAR and RAID1.
+.br
+Further, MD’s reads/writes are in general \fInot\fP in blocks of the chunk size
+(see above).
+
+For the levels RAID0, RAID10, RAID4, RAID5 and RAID6 it controls the number of
+consecutive data bytes placed on one device before the following data bytes
+continue at a “next” device.
+.br
+Obviously it also controls the size of any parity chunks, but \fIthe actual
+parity data itself is split into blocks of\fP
+.BI PAGE_SIZE
+(within a parity chunk).
+
+With striped levels, IO distribution on reads/writes takes place over the
+devices where possible.
+.br
+The main effect of the chunk size is basically how much data is consecutively
+read/written from/to a single device (typically) before it has to seek to an
+arbitrary other (on random reads/writes) or the “next” (on sequential
+reads/writes) chunk (on the same device). Due to the striping, “next chunk”
+doesn’t necessarily mean directly consecutive data (as this may be on the “next”
+device), but rather the “next” of consecutive data found \fIon the respective
+device\fP.
+
+The ideal chunk size depends greatly on the IO scenario; some general guidelines
+are:
+.RS
+.IP \(bu 2
+On sequential reads/writes, having to read/write from/to fewer chunks is faster
+(for example since fewer seeks may be necessary) and thus a larger chunk size may
+be better.
+.br
+This applies analogously to “pseudo-random” reads/writes, that is, reads/writes
+that are not strictly sequential but take place within a small contiguous area.
+.IP \(bu 2
+For very large sequential reads/writes, this may apply less, since larger
+chunk sizes tend to result in larger IO requests to the underlying devices.
+.IP \(bu 2
+For reads/writes, the stripe size (that is, the chunk size multiplied by the number of data devices) should ideally match the typical size for reads/writes in the
+respective scenario.
+.RE
+.PP
+
+
 
 .SS UNCLEAN SHUTDOWN
 When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array