Conversation
Pull request overview
This PR implements asynchronous backup functionality for the index management system, changing from a synchronous blocking backup process to a background job-based approach. The changes also remove compression from backup archives (tar.gz → tar) for improved performance and simplicity.
Changes:
- Converted synchronous `createBackup` to asynchronous `createBackupAsync` with job tracking and persistence
- Removed gzip compression from backup archives, changing from `.tar.gz` to `.tar` format
- Added new `/api/v1/backups/jobs` endpoint to query backup job status
- Implemented write operation blocking during active backups to ensure consistency (see the sketch below)
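
A rough sketch of the write-blocking flow: `checkBackupInProgress`, `active_backup_jobs_`, and `jobs_mutex_` all appear in the diff, but the exact signature and call shape here are assumptions, not the PR's actual code.

```cpp
// Hypothetical shape of the write-path guard; signature assumed.
std::pair<bool, std::string> checkBackupInProgress(const std::string& index_id) {
    std::shared_lock<std::shared_mutex> lock(jobs_mutex_);
    if (active_backup_jobs_.count(index_id)) {
        return {false, "Backup in progress for index: " + index_id};
    }
    return {true, ""};
}

// Each mutating operation (insert, update, delete, ...) would start with:
auto check = checkBackupInProgress(index_id);
if (!check.first) {
    return check;  // reject the write until the backup completes
}
```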
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/utils/archive_utils.hpp | Renamed createTarGz/extractTarGz to createTar/extractTar, removed gzip compression |
| src/main.cpp | Updated backup API to return 202 with job_id, changed file extensions to .tar, added jobs endpoint, improved error handling |
| src/core/ndd.hpp | Added BackupJob struct and status tracking, implemented async backup execution, added checkBackupInProgress to all write operations, job persistence with recovery on restart |
```cpp
    return json_error(500, "Failed to read backup file");
}

```

Remove unnecessary blank line with trailing whitespace. This appears to be an unintentional formatting artifact that should be cleaned up.
```cpp
auto completed_time_t = std::chrono::system_clock::to_time_t(completed_at);
j["completed_at"] = std::to_string(completed_time_t);
```

The completed_at field is always serialized to JSON, even for IN_PROGRESS jobs where it hasn't been set yet. This could result in serializing an uninitialized time_point (typically epoch time 0). Consider only including completed_at in the JSON when the job status is COMPLETED or FAILED, similar to how it's handled in the API endpoint at main.cpp:625-628.

Suggested change:

```cpp
if (status == BackupJobStatus::COMPLETED ||
    status == BackupJobStatus::FAILED) {
    auto completed_time_t = std::chrono::system_clock::to_time_t(completed_at);
    j["completed_at"] = std::to_string(completed_time_t);
}
```
```cpp
{
    std::unique_lock<std::shared_mutex> lock(jobs_mutex_);
    backup_jobs_[job_id] = job;
    // Track active backup for this index
    active_backup_jobs_[index_id] = job_id;
    persistJobs();
}
```

Missing check to prevent multiple concurrent backup jobs for the same index. While active_backup_jobs_ tracks ongoing backups per index, createBackupAsync doesn't check if a backup is already in progress for the given index_id. This could allow multiple backup jobs to be initiated for the same index, potentially causing resource contention and inconsistent results. Consider checking active_backup_jobs_.count(index_id) before creating the new job.
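
A minimal sketch of that guard, reusing the lock and maps from the snippet above (the early-return wording is illustrative, not the PR's code):

```cpp
{
    std::unique_lock<std::shared_mutex> lock(jobs_mutex_);
    // Bail out before registering the job if this index already has one.
    auto it = active_backup_jobs_.find(index_id);
    if (it != active_backup_jobs_.end()) {
        return {false, "Backup already in progress for index " + index_id +
                       " (job_id: " + it->second + ")"};
    }
    backup_jobs_[job_id] = job;
    active_backup_jobs_[index_id] = job_id;
    persistJobs();
}
```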
```cpp
std::filesystem::rename(backup_tar_temp, backup_tar_final);
```

The filesystem::rename operation at line 570 can throw an exception if the rename fails (e.g., cross-filesystem move on some systems, insufficient permissions). If this happens after the backup tar has been successfully created and the operation_mutex released, the backup job will be marked as FAILED and the temp tar will be removed, but writes have already been re-enabled. This is unlikely to cause data corruption but could be confusing. Consider wrapping the rename in a try-catch to provide a more specific error message, or using filesystem::copy followed by filesystem::remove as a fallback.

Suggested change:

```cpp
try {
    std::filesystem::rename(backup_tar_temp, backup_tar_final);
} catch (const std::filesystem::filesystem_error& fe) {
    LOG_WARN("Failed to rename backup tar from temp to final location: " << fe.what()
             << " (from: " << backup_tar_temp << " to: " << backup_tar_final
             << "). Attempting copy/remove fallback.");
    try {
        std::filesystem::copy_file(
            backup_tar_temp,
            backup_tar_final,
            std::filesystem::copy_options::overwrite_existing);
        std::filesystem::remove(backup_tar_temp);
        LOG_INFO("Backup tar moved to final location via copy/remove fallback: " << backup_tar_final);
    } catch (const std::filesystem::filesystem_error& fe_fallback) {
        LOG_ERROR("Fallback copy/remove for backup tar failed: " << fe_fallback.what()
                  << " (from: " << backup_tar_temp << " to: " << backup_tar_final << ")");
        throw std::runtime_error(
            std::string("Failed to move backup tar to final location. rename error: ") +
            fe.what() + ", fallback copy/remove error: " + fe_fallback.what());
    }
}
```
```cpp
std::ifstream ifs(jobs_file);
nlohmann::json j_array;
ifs >> j_array;
```

The JSON deserialization at line 668 could throw an exception if the file contains invalid JSON, but the outer try-catch will handle it and log the error. However, if the file is partially written or corrupted, this could silently skip loading valid jobs. Consider checking ifs.good() after reading and providing a more specific error message to distinguish between read errors and parse errors.
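
One way to separate the two failure modes, sketched under the assumption that nlohmann::json is in use as shown above and that the LOG_* macros from elsewhere in the diff are available:

```cpp
std::ifstream ifs(jobs_file);
if (!ifs) {
    LOG_ERROR("Failed to open backup jobs file: " << jobs_file);
    return;
}
// Read the whole file first so I/O failures and parse failures
// surface as distinct errors.
std::string contents((std::istreambuf_iterator<char>(ifs)),
                     std::istreambuf_iterator<char>());
if (ifs.bad()) {
    LOG_ERROR("Read error while loading backup jobs file: " << jobs_file);
    return;
}
nlohmann::json j_array;
try {
    j_array = nlohmann::json::parse(contents);
} catch (const nlohmann::json::parse_error& e) {
    LOG_ERROR("Backup jobs file is corrupted (invalid JSON): " << jobs_file
              << " - " << e.what());
    return;
}
```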
```cpp
// 2. Check if backup file already exists
std::string backup_tar = data_dir_ + "/backups/" + backup_name + ".tar";
if(std::filesystem::exists(backup_tar)) {
    return {false, "Backup already exists: " + backup_name};
}
```

There's no protection against concurrent backup jobs creating backups with the same backup_name. Two concurrent requests with the same backup_name could both pass the existence check at line 1065 and create duplicate jobs. While executeBackupJob rechecks at line 470, both jobs would initially be created and tracked. Consider checking if a job with the same backup_name is already in progress by iterating through backup_jobs_ while holding the jobs_mutex_.
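
A sketch of that scan, assuming the BackupJob fields shown elsewhere in the diff; it would sit inside the existing jobs_mutex_ critical section, before the new job is registered:

```cpp
// Reject duplicate names while a job with that name is still running.
for (const auto& [existing_id, existing_job] : backup_jobs_) {
    if (existing_job.backup_name == backup_name &&
        existing_job.status == BackupJobStatus::IN_PROGRESS) {
        return {false, "A backup named '" + backup_name +
                       "' is already in progress (job_id: " + existing_id + ")"};
    }
}
```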
```cpp
throw std::runtime_error("Insufficient disk space: need " +
                         std::to_string(index_size * 2 / (1024 * 1024)) + " MB");
```

The error message for insufficient disk space is potentially misleading. It states "need X MB" based on index_size * 2, but it doesn't show how much is actually available. Consider including both required and available space in the error message for better debugging, e.g., "Insufficient disk space: need X MB, available Y MB".

Suggested change:

```cpp
size_t required_bytes = index_size * 2;
size_t required_mb = required_bytes / (1024 * 1024);
size_t available_mb = space_info.available / (1024 * 1024);
throw std::runtime_error("Insufficient disk space: need " +
                         std::to_string(required_mb) + " MB, available " +
                         std::to_string(available_mb) + " MB");
```
```cpp
std::pair<bool, std::string> createBackupAsync(const std::string& index_id,
                                               const std::string& backup_name) {
    // 1. Validate backup name
    std::pair<bool, std::string> result = validateBackupName(backup_name);
    if(!result.first) {
        return result;
    }

    // 2. Check if backup file already exists
    std::string backup_tar = data_dir_ + "/backups/" + backup_name + ".tar";
    if(std::filesystem::exists(backup_tar)) {
        return {false, "Backup already exists: " + backup_name};
    }

    // 3. Generate unique job_id (timestamp + random 6 chars)
    auto now = std::chrono::system_clock::now();
    auto timestamp = std::chrono::duration_cast<std::chrono::milliseconds>(
        now.time_since_epoch()).count();
    std::string job_id = std::to_string(timestamp) + "_" +
                         random_generator::rand_alphanum(6);

    // 4. Create BackupJob record with IN_PROGRESS status
    BackupJob job;
    job.job_id = job_id;
    job.index_id = index_id;
    job.backup_name = backup_name;
    job.status = BackupJobStatus::IN_PROGRESS;
    job.started_at = now;
    job.completed_at = std::chrono::system_clock::time_point();

    {
        std::unique_lock<std::shared_mutex> lock(jobs_mutex_);
        backup_jobs_[job_id] = job;
        // Track active backup for this index
        active_backup_jobs_[index_id] = job_id;
        persistJobs();
    }
```

The createBackupAsync method doesn't validate that the index_id exists before creating and persisting the backup job. The validation only happens later in executeBackupJob when getIndexEntry is called. This means invalid backup jobs could be created, persisted, and only fail during execution. Consider checking if the index exists (e.g., by calling metadata_manager_->getMetadata(index_id)) before creating the job, so the API can return a synchronous error for non-existent indexes.
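
The early check could be as simple as the following; metadata_manager_->getMetadata is named above, but its return type is an assumption here:

```cpp
// Near the top of createBackupAsync, before any job state is created.
// Assumes getMetadata returns something falsy (e.g., nullptr or an empty
// optional) for unknown indexes.
if (!metadata_manager_->getMetadata(index_id)) {
    return {false, "Index not found: " + index_id};
}
```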
```cpp
enum class BackupJobStatus : uint8_t {
    IN_PROGRESS = 1,
    COMPLETED = 2,
    FAILED = 3
};
```

The BackupJobStatus enum starts at 1 instead of the more conventional 0. While this works correctly, if the status value defaults to 0 in the fromJson method (line 179), it would create an invalid enum value that doesn't match any of IN_PROGRESS, COMPLETED, or FAILED. Consider either starting the enum at 0 or using a more defensive default value in fromJson that ensures the status is always valid.
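
A defensive decode in fromJson might look like this; the j.value call and the "status" field name are assumptions based on the nlohmann::json usage elsewhere in the diff:

```cpp
// Fall back to FAILED (a valid, conservative state) when the stored
// value doesn't map to a known status.
uint8_t raw = j.value("status", static_cast<uint8_t>(0));
switch (raw) {
    case 1: status = BackupJobStatus::IN_PROGRESS; break;
    case 2: status = BackupJobStatus::COMPLETED;   break;
    case 3: status = BackupJobStatus::FAILED;      break;
    default:
        LOG_WARN("Unknown backup job status value " << static_cast<int>(raw)
                 << "; defaulting to FAILED");
        status = BackupJobStatus::FAILED;
}
```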
```cpp
} else if(job.status == BackupJobStatus::COMPLETED) {
    job_obj["status"] = "completed";
} else if(job.status == BackupJobStatus::FAILED) {
    job_obj["status"] = "failed";
```

The status enum to string conversion lacks a default case or error handling. If job.status contains an invalid value (e.g., due to data corruption or programming error), the status field will simply be omitted from the response without any indication of the problem. Consider adding an else clause to handle unexpected status values, either by logging a warning and setting a default value like "unknown", or by throwing an exception.

Suggested change:

```cpp
    job_obj["status"] = "failed";
} else {
    CROW_LOG_WARNING << "Unexpected BackupJobStatus value for job_id="
                     << job.job_id;
    job_obj["status"] = "unknown";
```