
Conversation


@GeorgCantor GeorgCantor commented Dec 23, 2025

Changes Made:

Single-pass processing - Replaced multiple collection operations with a single iteration over the input files
Set-based deduplication - Used linkedSetOf() instead of distinct(), so uniqueness is checked in O(1) as elements are inserted and no intermediate lists are allocated

Before:

Created an intermediate collection, gavs
3 separate map operations plus 3 distinct() calls
Total cost: roughly six passes over the data (still O(n), but with a high constant factor and several short-lived lists); a hypothetical sketch follows below
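
A minimal sketch of the original shape, assuming the inputs are files that get parsed into group/artifact/version coordinates. The `Gav` type, the `parseGav` helper, and the `collectBefore` name are all illustrative, not the actual identifiers in this PR:

```kotlin
import java.io.File

// Illustrative coordinate type; the real code may use plain strings or another class.
data class Gav(val group: String, val artifact: String, val version: String)

// Assumed helper standing in for however the real code reads a coordinate from a file.
fun parseGav(file: File): Gav {
    val (group, artifact, version) = file.readText().trim().split(":")
    return Gav(group, artifact, version)
}

fun collectBefore(inputFiles: Collection<File>): Triple<List<String>, List<String>, List<String>> {
    // Intermediate collection holding every parsed coordinate.
    val gavs = inputFiles.map { parseGav(it) }

    // Three more map passes plus three distinct() calls: roughly six traversals
    // in total, each allocating a throwaway list.
    val groups = gavs.map { it.group }.distinct()
    val artifacts = gavs.map { it.artifact }.distinct()
    val versions = gavs.map { it.version }.distinct()

    return Triple(groups, artifacts, versions)
}
```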

After:

Processes each file exactly once during iteration
Deduplicates immediately on Set insertion
Total cost: a single O(n) pass with better constant factors
Lower memory usage (no intermediate collection); see the sketch below
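
A sketch of the single-pass version, reusing the hypothetical `Gav` and `parseGav` helpers from the sketch above:

```kotlin
fun collectAfter(inputFiles: Collection<File>): Triple<Set<String>, Set<String>, Set<String>> {
    // linkedSetOf() deduplicates on insert (O(1) per membership check) and keeps
    // insertion order, so the result matches what map { }.distinct() would yield.
    val groups = linkedSetOf<String>()
    val artifacts = linkedSetOf<String>()
    val versions = linkedSetOf<String>()

    // One pass over the inputs; no intermediate `gavs` collection is built.
    for (file in inputFiles) {
        val gav = parseGav(file)
        groups += gav.group
        artifacts += gav.artifact
        versions += gav.version
    }

    return Triple(groups, artifacts, versions)
}
```

The trade-off is slightly more imperative code in exchange for a single traversal and no intermediate collections.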

Result: the same output produced with fewer passes and fewer allocations, so both time and memory improve, most noticeably for large input sets.

@martinbonnin
Member

I'm sorry, but given the size of the datasets and the fact that this runs inside a much more complex process (Gradle), the gain is probably negligible. From a cursory look the code appears correct, but I just don't think the change is worth it.

I can change my mind if you can share real-life numbers.

