WIP: Constant space sort #61
Open
bgamari wants to merge 7 commits into haskell:master
Previously we used the Binary instance for Text to serialise the event name. This is wrong. We now first encode to UTF-8 and use this in the eventlog encoding.
73ef6f3 to 3cc5c14
3cc5c14 to 3f7eca6
Is it necessary to sort all the events as if they were completely unordered? When I wrote my patch, I assumed that the events of each capability arrive in order, so all that was necessary was to merge the capability streams back together. Perhaps there is some inconvenience in splitting the capability streams that I'm missing?
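To make the suggestion above concrete, here is a minimal sketch of merging per-capability streams that are each already sorted by timestamp. The `Ev` type and the names `merge2` and `mergeAll` are hypothetical and only for illustration; they are not the ghc-events API, and the sketch assumes every input stream really is in timestamp order.

```haskell
import Data.Word (Word64)

-- Hypothetical, simplified event record (NOT the ghc-events 'Event' type).
data Ev = Ev
  { evTime :: Word64  -- timestamp
  , evCap  :: Int     -- capability that emitted the event
  , evTag  :: String  -- stand-in for the event payload
  } deriving (Eq, Show)

-- Lazily merge two streams that are each sorted by timestamp.
-- On ties the element from the left stream comes first.
merge2 :: [Ev] -> [Ev] -> [Ev]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 xs@(x : xt) ys@(y : yt)
  | evTime x <= evTime y = x : merge2 xt ys
  | otherwise            = y : merge2 xs yt

-- Merge any number of sorted streams by repeated pairwise merging.
mergeAll :: [[Ev]] -> [Ev]
mergeAll []  = []
mergeAll [s] = s
mergeAll ss  = mergeAll (pairs ss)
  where
    pairs (a : b : rest) = merge2 a b : pairs rest
    pairs rest           = rest
```

Because the merge is lazy, only the heads of the streams need to be inspected to produce the next output event. Whether the overall memory use stays small still depends on how the per-capability streams are produced.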
maoe (Member) reviewed on May 18, 2020 and left a comment:
I don't have much time today, so I'll take another look tomorrow.
I think facundominguez's point is valid. Also, it seems that #14 gets in the way if we're going to replace the existing sortEvents with this constant-space sort.
This introduces an interface providing constant-space sorting of eventlogs via on-disk merge-sort.
I have confirmed that the implementation indeed maintains constant-space behavior: it takes about 79 seconds to sort a 200 MB eventlog while not exceeding 40 MB of residency (using a chunk size of 1e5 events).
This is in contrast to the old in-memory codepath, which requires 35 seconds and nearly 5 GB of residency to sort the same eventlog.
If I increase the chunk size by an order of magnitude (to 1e6 events), the constant-space sort has essentially the same runtime as the in-memory codepath but runs in only 300 MB.
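For readers unfamiliar with the technique, the following is a rough sketch of a chunked on-disk merge sort, not the code in this PR. It assumes a hypothetical `Ev` type of `(timestamp, payload)` pairs serialised as text with `show`/`read`, uses lazy IO for brevity, and relies on the `directory` package that ships with GHC. All names (`spillChunk`, `externalSort`, and so on) are invented for illustration.

```haskell
import Data.List (sortOn)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO (hClose, hPutStr, openTempFile)

-- Hypothetical event: (timestamp, payload). Not the ghc-events 'Event' type.
type Ev = (Int, String)

-- Split a list into chunks of at most n elements (n must be positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b

-- Sort one chunk in memory and spill it to a temporary file, one event per line.
spillChunk :: FilePath -> [Ev] -> IO FilePath
spillChunk dir evs = do
  (path, h) <- openTempFile dir "chunk.txt"
  hPutStr h (unlines (map show (sortOn fst evs)))
  hClose h
  pure path

-- Lazily merge two runs that are each sorted by timestamp.
merge2 :: [Ev] -> [Ev] -> [Ev]
merge2 [] ys = ys
merge2 xs [] = xs
merge2 xs@(x : xt) ys@(y : yt)
  | fst x <= fst y = x : merge2 xt ys
  | otherwise      = y : merge2 xs yt

-- Merge any number of sorted runs by repeated pairwise merging.
mergeAll :: [[Ev]] -> [Ev]
mergeAll []  = []
mergeAll [s] = s
mergeAll ss  = mergeAll (pairs ss)
  where
    pairs (a : b : rest) = merge2 a b : pairs rest
    pairs rest           = rest

-- Sort by spilling sorted chunks to disk and merging them back lazily.
-- Only one chunk is sorted in memory at a time, provided the caller does not
-- retain the input list. The consumer must fully consume the merged stream
-- before returning, because the chunk files are deleted afterwards.
externalSort :: Int -> [Ev] -> ([Ev] -> IO r) -> IO r
externalSort chunkSize evs consume = do
  dir   <- getTemporaryDirectory
  paths <- mapM (spillChunk dir) (chunksOf chunkSize evs)
  runs  <- mapM (fmap (map read . lines) . readFile) paths  -- lazy reads
  r     <- consume (mergeAll runs)
  mapM_ removeFile paths
  pure r
```

The chunk size is the knob discussed above: larger chunks mean fewer runs to merge and less time spent on disk, at the cost of higher peak residency while each chunk is sorted.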
Note: In testing this I realized that several of the encoding paths treated strings incorrectly. Consequently, this depends upon (and contains commits from) #62.
To-do
sortEvents?