Conversation
oops a commit snuck in from my other branch #392
Mark-Simulacrum
left a comment
Thinking about this a bit more, we'll probably want to do `block_on` during the main loop once we get to, say, 1000 futures in the queue. Based on previous testing, there's not really any serious performance gain from having more than that in the queue anyway. As is, this'll probably cause pretty serious memory growth, since we'll be storing ~200,000 fairly large objects in memory for some of the larger crates.
Can you point me to that?
Force-pushed from c34a1f0 to 783e2f3
Ah, the directory reading loop -- in that same function. Right after pushing onto the vector of futures, we can check whether we've exceeded some constant number of futures and dispatch based on that.
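That dispatch check can be sketched roughly like this. This is a simulation only: `MAX_IN_FLIGHT`, `flush`, and `upload_all` are hypothetical stand-ins, and the real code would `rt.block_on(join_all(futures))` on rusoto futures rather than clearing a `Vec`:

```rust
// Sketch: flush the pending-uploads queue whenever it reaches a threshold,
// so we never hold more than MAX_IN_FLIGHT large request bodies in memory.
const MAX_IN_FLIGHT: usize = 1000;

// Stand-in for blocking on join_all(futures); returns how many were flushed.
fn flush(pending: &mut Vec<String>) -> usize {
    let n = pending.len();
    // real code: rt.block_on(::futures::future::join_all(futures))
    pending.clear();
    n
}

fn upload_all(paths: Vec<String>) -> usize {
    let mut pending = Vec::new();
    let mut flushed = 0;
    for p in paths {
        pending.push(p);
        // Right after pushing, check the threshold and dispatch the batch.
        if pending.len() >= MAX_IN_FLIGHT {
            flushed += flush(&mut pending);
        }
    }
    // Dispatch the final, partial batch.
    flushed + flush(&mut pending)
}

fn main() {
    let paths: Vec<String> = (0..2500).map(|i| format!("file-{}", i)).collect();
    assert_eq!(upload_all(paths), 2500);
}
```

This keeps the memory bound independent of how many files the crate has.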
Mark-Simulacrum
left a comment
With this last change I think this is pretty much good to go. It'd be good to drop the Vagrant commit (since it's not related to this PR) and I'll hand this off to r? @QuietMisdreavus for another round of review before we can merge.
Force-pushed from d5524fe to 4ca3f3e
Force-pushed from 4ca3f3e to d1dadfd
QuietMisdreavus
left a comment
Looks fine, just some small nits.
Force-pushed from d1dadfd to ba9a3c5
@QuietMisdreavus This has been updated and looks ready to go from my side; I can take care of deploying and monitoring after the deploy if you sign off on the code.
Force-pushed from ba9a3c5 to afd4a0e
QuietMisdreavus
left a comment
This is mostly okay, just a couple more comments.
If we're adding error information to the panic message, we should add it here as well.
Force-pushed from c4f67a1 to f297719
Force-pushed from f297719 to f23373c
This looks great! One thing I noticed is that this doesn't include the "refresh the client" logic. Do we think that this will get around that issue? Should we create a new client when we execute a batch of uploads?
Oh, yeah, we probably want a new client for each batch -- that should be low enough overhead-wise. I imagine that's not too difficult to add?
The way it was working before, I believe the client was only refreshed if an error occurred. It seems like now we call ... I honestly don't remember why I changed that.
Oh, you're right! In this case, I think this will be fine. I don't think that creating a new client each time is that heavyweight, but I haven't looked into it that closely. @Mark-Simulacrum, will this work, or do you want to bring back the old logic of "save the client outside the loop and overwrite it after a batch is sent"?
Oh, yeah -- I think a new client for each future is probably not what we want (it seems likely to cause problems of some kind, or to be generally slower), but I'm not sure what a good way to structure this would be. Maybe we can use an `RwLock` that we share between all the futures, and replace the client on error? If that proves too complicated, we can likely get away with just creating a new client for every batch and re-running the batch if it was unsuccessful (so, kind of similar to before, but less fine-grained, I guess).
I feel like I'm getting pretty close...

```diff
diff --git src/db/file.rs src/db/file.rs
index faf66d2..2465141 100644
--- src/db/file.rs
+++ src/db/file.rs
@@ -10,6 +10,7 @@ use postgres::Connection;
 use rustc_serialize::json::{Json, ToJson};
 use std::fs;
 use std::io::Read;
+use std::sync::RwLock;
 use error::Result;
 use failure::err_msg;
 use rusoto_s3::{S3, PutObjectRequest, GetObjectRequest, S3Client};
@@ -148,6 +149,7 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
     try!(cookie.load::<&str>(&[]));
     let trans = try!(conn.transaction());
+    let client = s3_client().map(|c| RwLock::new(c));
     let mut file_list_with_mimes: Vec<(String, PathBuf)> = Vec::new();
     let mut rt = ::tokio::runtime::Runtime::new().unwrap();
@@ -188,11 +190,13 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
             }
         };
-        if let Some(client) = s3_client() {
+        if let Some(client) = client {
             let bucket_path = bucket_path.clone();
             let content = content.clone();
             let mime = mime.clone();
+            let mut client = client.write().unwrap();
+
             futures.push(client.put_object(PutObjectRequest {
                 bucket: "rust-docs-rs".into(),
                 key: bucket_path.clone(),
@@ -201,6 +205,11 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
                 ..Default::default()
             }).map_err(move |e| {
                 log::error!("failed to upload to {}: {:?}", bucket_path, e);
+                // Get a new client, in case the old one's connection is stale.
+                // AWS will kill our connection if it's alive for too long; this avoids
+                // that preventing us from building the crate entirely.
+                *client = s3_client().unwrap();
+
                 client.put_object(PutObjectRequest {
                     bucket: "rust-docs-rs".into(),
                     key: bucket_path,
```

...but acquiring a write lock for each file upload seems like a bad idea, and because of the closure with the "retry" in it, we're moving `client`:

```diff
diff --git src/db/file.rs src/db/file.rs
index faf66d2..5f8c252 100644
--- src/db/file.rs
+++ src/db/file.rs
@@ -10,6 +10,7 @@ use postgres::Connection;
 use rustc_serialize::json::{Json, ToJson};
 use std::fs;
 use std::io::Read;
+use std::sync::RwLock;
 use error::Result;
 use failure::err_msg;
 use rusoto_s3::{S3, PutObjectRequest, GetObjectRequest, S3Client};
@@ -148,6 +149,7 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
     try!(cookie.load::<&str>(&[]));
     let trans = try!(conn.transaction());
+    let mut client = s3_client().map(|c| RwLock::new(c));
     let mut file_list_with_mimes: Vec<(String, PathBuf)> = Vec::new();
     let mut rt = ::tokio::runtime::Runtime::new().unwrap();
@@ -188,12 +190,12 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
             }
         };
-        if let Some(client) = s3_client() {
+        if let Some(client) = client {
             let bucket_path = bucket_path.clone();
             let content = content.clone();
             let mime = mime.clone();
-            futures.push(client.put_object(PutObjectRequest {
+            futures.push(client.read().unwrap().put_object(PutObjectRequest {
                 bucket: "rust-docs-rs".into(),
                 key: bucket_path.clone(),
                 body: Some(content.clone().into()),
@@ -201,7 +203,15 @@ pub fn add_path_into_database<P: AsRef<Path>>(conn: &Connection,
                 ..Default::default()
             }).map_err(move |e| {
                 log::error!("failed to upload to {}: {:?}", bucket_path, e);
-                client.put_object(PutObjectRequest {
+                // Get a new client, in case the old one's connection is stale.
+                // AWS will kill our connection if it's alive for too long; this avoids
+                // that preventing us from building the crate entirely.
+                {
+                    let mut c = client.write().unwrap();
+                    *c = s3_client().unwrap();
+                }
+
+                client.read().unwrap().put_object(PutObjectRequest {
                     bucket: "rust-docs-rs".into(),
                     key: bucket_path,
                     body: Some(content.into()),
```

As expected, the move prevents this from working. I'll keep poking at this, but any help or insight is greatly appreciated! 😀
I think this may result in lots of new clients being written to the RwLock in rapid succession in case several fail at once (which is probably what's going to happen since we're going to be running up to 1000 at a time). I feel like we may need some kind of connection pool or manager or something that can make sure that we only generate one new client when things start to fail, and the individual uploads only request "a new client" instead of making it themselves. I can try to sketch something out, but if anyone else knows something better, that would be ideal.
We should only acquire the lock for writing outside the "inner loop" of 1000(?) futures; if any of those fail we can then re-create the client.
If that happens, how do we know which files to re-upload? Aren't the futures consumed in the call to `join_all`?
We can reupload all the files; we might duplicate work but it shouldn't really matter (i.e., the files are the same).
There are two places where we might want to re-create the client (two different `rt.block_on(::futures::future::join_all(futures))` calls):

- `if futures.len() > MAX_CONCURRENT_UPLOADS` -- this is actually inside the `for file_path in ...` loop. After blocking on the initial futures, it clears the futures vec and continues with the next file (are you calling this the "completion of the inner loop"?)
- immediately following the `for file_path in ...` loop

Were you talking about re-creating the client in both cases?
When MAX_CONCURRENT_UPLOADS is reached, there's a partially-completed iterator that represents the files we tried to upload (of which one or more failed). It's not immediately obvious to me how we would retry those uploads...
Also, we currently have a retry for each file. It's unclear how we'd perform a reupload of all the files. It seems like we basically need to start over iterating through the file paths. Am I missing something?
This looks like it doesn't match the code precisely -- but this is what I was envisioning in an "abstract" sense:
- we have a list of all the file paths we need to upload (`to_upload`)
- we remove MAX_CONCURRENT_UPLOADS elements off of this list (`currently_uploading`)
- create a fresh client here (every time)
- loop through these, creating a future for each upload
- block on the completion of all these futures
- if any fail, re-create the futures from the `currently_uploading` list (this may need an extra copy or so, but that's fine) -- from the "create a fresh client" step, basically
- repeat the last several steps until we're done (i.e., `to_upload` is empty)
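The steps above can be sketched as follows. This is a simulation under stated assumptions: `upload_batch` stands in for "create a fresh client, build the `put_object` futures, and block on `join_all`", and the injected one-time failure is purely hypothetical:

```rust
// Sketch of the batching/retry scheme described above (all names hypothetical).
const MAX_CONCURRENT_UPLOADS: usize = 1000;

// Stand-in for "create a fresh client, build futures, block on join_all";
// returns Err if any upload in the batch failed.
fn upload_batch(_batch: &[String], fail_once: &mut bool) -> Result<(), ()> {
    if *fail_once {
        *fail_once = false; // simulate one stale-connection failure
        return Err(());
    }
    Ok(())
}

fn upload_all(mut to_upload: Vec<String>) -> usize {
    let mut uploaded = 0;
    let mut fail_once = true;
    while !to_upload.is_empty() {
        // Take the next batch off the list, keeping it around for retries.
        let n = to_upload.len().min(MAX_CONCURRENT_UPLOADS);
        let currently_uploading: Vec<String> = to_upload.drain(..n).collect();
        // Re-run the whole batch (with a fresh client) until it succeeds.
        while upload_batch(&currently_uploading, &mut fail_once).is_err() {}
        uploaded += currently_uploading.len();
    }
    uploaded
}

fn main() {
    let files: Vec<String> = (0..2345).map(|i| format!("f{}", i)).collect();
    assert_eq!(upload_all(files), 2345);
}
```

Retrying the whole batch may re-upload files that already succeeded, but since the contents are identical, the duplicate work is harmless.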
That makes sense, I gave it a shot. It was a bit of a rewrite so hopefully it's correct 😄
Looking at the diff is pretty horrible (indentation changed and things moved around), but based on a high-level read-through of https://github.com/rust-lang/docs.rs/blob/d9beb9e6a5e85348811cee5f5884dc977b02b4c5/src/db/file.rs this looks good.
This looks like it will retry forever if files fail to upload for reasons other than a stale connection, i.e. now there's no ...
Oops, good catch! Yeah, I think we could set a counter to ...
Seems quite reasonable. We might want to also log the error somewhere (ideally in a way metrics can pick up), but that can be left for future work, I think.
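A minimal sketch of such a bounded retry counter, with the error logged on each failed attempt. `MAX_ATTEMPTS`, `try_upload`, and `upload_with_retries` are hypothetical names, and the simulated failure pattern is for illustration only:

```rust
// Sketch: give up (and surface the last error) after a fixed number of
// attempts instead of retrying a failing batch forever.
const MAX_ATTEMPTS: u32 = 3;

// Simulated upload: fails on the first two attempts, succeeds on the third.
fn try_upload(attempt: u32) -> Result<(), String> {
    if attempt < 2 {
        Err(format!("attempt {} failed", attempt))
    } else {
        Ok(())
    }
}

fn upload_with_retries() -> Result<(), String> {
    let mut last_err = String::new();
    for attempt in 0..MAX_ATTEMPTS {
        match try_upload(attempt) {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Log the error so metrics could pick it up, then retry.
                eprintln!("upload failed: {}", e);
                last_err = e;
            }
        }
    }
    Err(last_err)
}

fn main() {
    assert!(upload_with_retries().is_ok());
}
```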
I think if this happens during a retry, it will duplicate items in the `file_list_with_mimes` vec. This might need to be done only when the `put_object` is successful? Something like:

```rust
client.put_object(PutObjectRequest {
    // ...
}).and_then(|_| {
    file_list_with_mimes.push((mime.clone(), file_path.clone()));
    Ok(())
});
```

except I think now `file_list_with_mimes` needs to be in an `Arc`.
Might be simpler to deduplicate after uploading by file path.
Or prevent duplicates during insert?
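One way the "deduplicate after uploading by file path" option could look. This is a sketch: `dedup_by_path` is a hypothetical helper, and the list is assumed to hold `(mime, path)` pairs as in the diff:

```rust
use std::collections::HashSet;
use std::path::PathBuf;

// Sketch: drop repeated (mime, path) entries by path after uploading, so
// retried batches don't produce duplicate database rows.
fn dedup_by_path(list: Vec<(String, PathBuf)>) -> Vec<(String, PathBuf)> {
    let mut seen = HashSet::new();
    list.into_iter()
        // HashSet::insert returns false for paths we've already seen.
        .filter(|(_, path)| seen.insert(path.clone()))
        .collect()
}

fn main() {
    let list = vec![
        ("text/html".to_string(), PathBuf::from("index.html")),
        ("text/html".to_string(), PathBuf::from("index.html")), // retried upload
        ("text/css".to_string(), PathBuf::from("style.css")),
    ];
    assert_eq!(dedup_by_path(list).len(), 2);
}
```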
This call to `drain` will panic on the last batch of a crate upload, since it's unlikely that a crate will have an exact multiple of MAX_CONCURRENT_UPLOADS files, and `drain` panics if the range is outside the `Vec`'s bounds. The end of the range should be `min(to_upload.len(), MAX_CONCURRENT_UPLOADS)`.
Nice catch, I should have read the `drain` docs more closely 😞
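A small demonstration of the fix: clamping the end of the `drain` range to the `Vec`'s length so the final, partial batch doesn't panic (`take_batch` is a hypothetical helper):

```rust
const MAX_CONCURRENT_UPLOADS: usize = 1000;

// Drain at most MAX_CONCURRENT_UPLOADS elements; an unclamped
// drain(..MAX_CONCURRENT_UPLOADS) would panic when the Vec is shorter.
fn take_batch(to_upload: &mut Vec<u32>) -> Vec<u32> {
    let end = to_upload.len().min(MAX_CONCURRENT_UPLOADS);
    to_upload.drain(..end).collect()
}

fn main() {
    let mut to_upload: Vec<u32> = (0..1500).collect();
    assert_eq!(take_batch(&mut to_upload).len(), 1000);
    // Last, partial batch: only 500 elements left, under the limit.
    assert_eq!(take_batch(&mut to_upload).len(), 500);
    assert!(to_upload.is_empty());
}
```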
QuietMisdreavus
left a comment
I think this is in a good position now. If @Mark-Simulacrum is good with this, I think we're ready to merge.
Yes, I'm happy with this overall (a few nits, but nothing we need to fix here; I might try to fix them up when I work on compression this weekend, presuming this lands). Let's merge and deploy (cc @pietroalbini). I'd like this deployed quickly after merging and monitored for a few uploads, which is why I'm not doing it myself. (Ideally, we'd collect metrics on upload time, so we'd know whether this is actually an improvement.)
Squash first?
Oh, yes, it'd be good to squash this into one commit.
Force-pushed from 76d81bb to 888a391
Should we add metrics for a day or two and then merge this PR? Adding a new metric for uploaded files shouldn't be hard.
I don't know; we can, I suppose -- I don't know how hard that would be. I'm unlikely to get a chance to do so soon, unfortunately.
Is that something I can help out with?
Force-pushed from 888a391 to c95b8bf
@pietroalbini does this look right to you? (I just rebased after your merge of pull request #457)

```rust
crate::web::metrics::UPLOADED_FILES_TOTAL.inc_by(batch_size as i64);
```
@miller-time yeah, it's fine. It'd be great to increment by 1 each time a single future completes, but that's not critical, and if you want it can be implemented in a later PR. I think merging the speedups first is more important than slightly more accurate metrics. Huge thanks for the work you put into this! 🎉
No description provided.