Stress testing resumable transfers #83

@tonyhutter

I've been heavily stress testing axl_cp resumable transfers and (as @adammoody rightly predicted) there are bugs...

Test setup
Create 100 random (0-90MB) files per CPU. For each CPU, spawn off an axl_cp -X pthreads <100 files> . Then run killall -9 axl_cp, resume the transfers, and wait for them to finish.

Here are some of the failures I've seen:

  1. Corrupt state_file?
KVTree 1.0.0 ABORT: butte21: Failed to persist hash wrote 9121 bytes != expected 9127 @ /g/g0/hutter2/KVTree/src/kvtree.c:1132

(I've only seen this error once)

  2. Can't read state_file
AXL_Create() failed (error -1)
AXL 0.3.0 ERROR: butte21: Couldn't read state file correctly @ axl_alloc_id /g/g0/hutter2/AXL/src/axl.c:89

The state_file definitely existed, since I checked it with access() first. Later on I added a mutex in axl_write_state_file(), thinking that would help, but I still got the error. This error is semi-rare (it happens roughly once every 4-5 runs of my test).
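
For reference, one common way to keep a reader (or a resumed process) from ever seeing a half-written state_file is to write a temporary file, fsync() it, and rename() it over the real path. The sketch below is only an illustration of that pattern: write_state_atomic() is a hypothetical helper, not an AXL or KVTree function, and I haven't checked how KVTree persists its hash today.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of AXL or KVTree): replace the contents of
 * `path` with `len` bytes from `buf` so that `path` always holds either the
 * complete old contents or the complete new contents, even if the process is
 * killed partway through. Returns 0 on success, -1 on failure. */
static int write_state_atomic(const char *path, const void *buf, size_t len)
{
    char tmp[4096];

    /* The temp name is per-process. Threads in the same process would share
     * it, so concurrent writers still need a lock around this function. */
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp.%ld", path, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until everything is out. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }

    /* Flush the data before the rename makes it visible under `path`. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() atomically swaps the new file into place on POSIX systems. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A kill -9 in the middle of this leaves a stale temp file behind, but the state_file itself is never truncated or partially written.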

  3. Files are missing

I often notice that files are missing on the destination side after the resume completes. When I look at the state_file after the killall -9 axl_cp, but before the resume, I see that it may not list all 100 of the files it should be transferring (it may only have around 60). This makes me think axl_cp is getting killed in the middle of AXL_Add'ing all its files. The solution here would be to always AXL_Add() all your files before doing a resume. That way it would transfer the missing ones along with resuming the existing ones.

It seems to me that for resumes to truly work, we need either:

  • An ACID-compliant KVTree implementation

or

  • Don't use state_file at all for resumes. Derive the state information we need by looking at the existing ._AXL destination file. The presence of an ._AXL extension means the file didn't finish transferring or is transferring and can be resumed. And the size of the ._AXL destination file tells you how much of the file has been transferred (and thus the resume offset).
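
A minimal sketch of the second option, to show how little state is needed. It assumes the in-progress file is named <dest>._AXL and that bytes are written sequentially from offset 0, so the partial file's size equals the resume offset. resume_offset() and PARTIAL_SUFFIX are hypothetical names for illustration, not existing AXL code.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Assumption: a transfer to <dest> writes to <dest>._AXL until it finishes. */
#define PARTIAL_SUFFIX "._AXL"

/* Hypothetical helper: work out where to resume a transfer to `dest`.
 * Returns  1 if a partial file exists; *offset is its size in bytes.
 * Returns  0 if there is no partial file (finished or never started);
 *            *offset is 0.
 * Returns -1 on error. */
static int resume_offset(const char *dest, off_t *offset)
{
    char partial[4096];
    struct stat st;

    *offset = 0;

    int n = snprintf(partial, sizeof(partial), "%s%s", dest, PARTIAL_SUFFIX);
    if (n < 0 || (size_t)n >= sizeof(partial))
        return -1;

    if (stat(partial, &st) != 0)
        return (errno == ENOENT) ? 0 : -1;

    /* Only valid if the writer appends sequentially from offset 0. */
    *offset = st.st_size;
    return 1;
}
```

The caller would still need to tell "finished" from "never started" when the return value is 0, for example by checking whether the final destination file exists.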
