This issue is to track and possibly try to add support for in-place hard delete
As far as I am aware to hard delete entries one needs to call defragment.
How can we delete at the point where the delete function is called.
The following is a convo I had with dbreee dev that might help:
Imagine data structure in a DBreeze file (number means a data block, bold-red first char of the block means - not deleted):
(1)111(2)2222(3)3333444(5)555555
Currently, if we want to delete block 2 it will look like this:
(1)11122222(3)3333444(5)555555
Do you want to have it like
1111-----333334445555555 ?
I would expect to see (1)111(3)3333444(5)555555
If you have several parallel threads that are reading table data (these threads have already read some bytes and are going to read other bytes from disk starting
from address xxxx. At this moment you run de-fragmentaion procedure ((3)3333(4)44(5)555555 must be moved to address that (2)2222 occupies).
All pointers in parallel reading threads will be corrupted and threads can go out with mistakes or read incorrect data.
To avoid that, long time ago, was developed Exclusive/Shared access (read docu "Full tables locking inside of transaction.")
It is also expensive and weird to run de-fragmentation procedure (purge) after each "delete command".
There is a sense to run it not so often, once per day, month, when DBreeze starts etc...
Purge itself is a copying data from one table t1 into another table t2 and then restoring t1 from t2
(t1 delete, t2 rename to t1 - all happens in one transaction with tran.SynchronizeTable(t1)).
It is also possible to use tran.RestoreTableFromTheOtherFile
(also will not save from "parallel-reading threads losing pointers" problem, so the same Exclusive/Shared technique may be used
or purge on DBreeze Init)
So, all techniques are already available.
So if we have a database the size of 600gb and 50gb is deletable, then we need to copy 550gb to a new location?
I think you understand that partial de-fragmentation of 600GB could take the same time as de-fragmentation of the file system data.
Copying and renaming will be definitely faster, it is more sequential operation for DBreeze, you just need to have enough space on the drive.
But, depending upon the nature of your data, probably, you can utilize reusable data-blocks - buckets.
Imagine that you have created data data-block with the size 1MB or 10MB. This is a reserved space inside of DBreeze table.
There you can create other data structures, like serialized dictionary. The most important is to stay within data-block size while
it is updated. Many keys can lead to one data-block and each key must also lead to an element inside that stored serialized Dictionary.
There you can remove unused values, insert new and update data-block completely and in transnational manner in the end.
What would I do? Don't know anything about your workflow. We work in Telematik area, there I would do either nothing
or would do de-fragmentation once per several years, while putting all data on the new disk type or, probably, nothing.
If you defragment on delete you don't need to do it on 600gb because you do it at the point of removal, so the data is constantly defragmented.
Imagine, you have inserted 100GB, after that you have inserted 1 more MB, after that passed one month and then you want to delete that inserted 1MB.
Within the month you have inserted 200GB
(1)111-----(3)3333(4)44(5)555555
1 - 100GB
--- your 1 MB
3,4,5 - 200 GB
you have to move from 3-5 at the index of ----
it could take... hmmm... 5 minutes or 10 minutes or you need to move 500GB
My advice is make compaction once per year on startup.
This issue is to track and possibly try to add support for in-place hard delete
As far as I am aware to hard delete entries one needs to call defragment.
How can we delete at the point where the delete function is called.
The following is a convo I had with dbreee dev that might help: