Problem Description
Our system currently generates files with identical base names (e.g., normcounts, exacttest, coldata). Without a clear versioning mechanism, multiple versions of the same file may exist concurrently. This ambiguity makes it difficult for automated processes (such as database queries or Spark jobs) to reliably select the most recent version.
Proposed Solution
- File Naming Convention Update:
  - New Format: [prefix]_[timestamp]delta[random_id]
    - prefix: Indicates the file type or source (e.g., exacttest_genome, normcounts_genome, coldata).
    - timestamp: A UTC timestamp formatted as YYYYMMDDTHHMMSSZ (e.g., 20250324T044412Z).
    - delta: A unique identifier (e.g., a random number) to guarantee uniqueness.
  - Because the timestamp is fixed-width and ordered from year down to second, timestamp strings sort chronologically even as plain text, so the latest file group can be selected automatically.
- Data Frame Construction & Timestamp Extraction:
  - Create a function (e.g., get_latest_file_group_df()) that:
    - Parses filenames to extract the timestamp string.
    - Converts the timestamp string to a POSIXct object.
    - Constructs a data frame containing:
      - The original filename
      - The extracted timestamp string (or NA if missing)
      - The converted time value.
  - All files, including those without a valid timestamp, should be retained in the data frame for debugging purposes.
- Latest Version Selection:
  - Identify the latest timestamp (ignoring NA values) and mark the corresponding files as the latest version.
  - Use the sorted data frame to reliably extract the latest set of files (one each for normcounts, exacttest, and coldata) for further processing.
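
The steps above could be sketched roughly as follows. This is an illustration in Python, not the project's implementation: the issue refers to POSIXct, so a real version would more likely be written in R (for example with `as.POSIXct`). The names `make_filename` and `parse_timestamp` are hypothetical, `get_latest_file_group` stands in for the proposed `get_latest_file_group_df()`, and a list of dictionaries stands in for the data frame. The sketch assumes a literal reading of the format line (the word `delta` followed directly by a zero-padded random ID) and assumes that all files in one group share the same timestamp.

```python
import re
import secrets
from datetime import datetime, timezone

# Matches a YYYYMMDDTHHMMSSZ timestamp anywhere in a filename,
# so the parser does not depend on the exact delimiters around it.
TIMESTAMP_RE = re.compile(r"(?<!\d)(\d{8}T\d{6}Z)")
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def make_filename(prefix, now=None, random_id=None):
    """Build a name as [prefix]_[timestamp]delta[random_id] (assumed literal reading)."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    if random_id is None:
        random_id = secrets.randbelow(10**6)  # assumption: 6-digit random ID
    return f"{prefix}_{now.strftime(TIMESTAMP_FORMAT)}delta{random_id:06d}"


def parse_timestamp(filename):
    """Return (timestamp_string, datetime); either part is None when missing or invalid."""
    match = TIMESTAMP_RE.search(filename)
    if match is None:
        return None, None
    text = match.group(1)
    try:
        parsed = datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        # Looks like a timestamp but is not a real date/time (e.g. month 13).
        return text, None
    return text, parsed


def get_latest_file_group(filenames):
    """Return one row per file, newest first; files without a valid time are kept last."""
    rows = []
    for name in filenames:
        text, parsed = parse_timestamp(name)
        rows.append(
            {"filename": name, "timestamp": text, "time": parsed, "is_latest": False}
        )

    valid_times = [row["time"] for row in rows if row["time"] is not None]
    if valid_times:
        latest = max(valid_times)
        for row in rows:
            row["is_latest"] = row["time"] == latest

    def sort_key(row):
        if row["time"] is None:
            return (True, 0.0)
        return (False, -row["time"].timestamp())

    rows.sort(key=sort_key)
    return rows


# Hypothetical example input: one current group, one older file, one legacy name.
example_files = [
    "normcounts_genome_20250101T000000Zdelta000001",
    "normcounts_genome_20250324T044412Zdelta000123",
    "exacttest_genome_20250324T044412Zdelta000456",
    "coldata_20250324T044412Zdelta000789",
    "coldata",
]

for row in get_latest_file_group(example_files):
    print(row["filename"], row["timestamp"], row["is_latest"])
```

Searching for the timestamp pattern rather than splitting on underscores keeps the parser working for prefixes that themselves contain underscores, such as exacttest_genome. Files with no timestamp stay in the result with empty timestamp fields, which matches the requirement to retain them for debugging.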
Expected Benefits
Automated Latest Version Identification:
By sorting filenames based on the embedded timestamp, the system can quickly and automatically determine the most recent file group.
Clear Versioning:
Unique filenames with an embedded timestamp and delta identifier prevent version conflicts and reduce the risk of querying outdated data.
Enhanced Debugging:
Retaining files without a valid timestamp in the output allows developers to easily identify and address any issues with file naming.
Improved System Stability:
A consistent naming convention reduces ambiguity, ensuring that downstream processes always select the correct, latest version of each file.