Large pushes fail because they take too long #11

@myieye

We recently had a push of almost 2 GB (surprisingly, it only took about 3 min for the client to upload it).
Here's an overview of when things broke:
image

It failed on the last chunk, because that's when the server actually starts doing the heavy lifting: trying to apply the commit.

And here's what I think/know happened:

  • Chorus sends the last chunk
    • The resumable server detects that it's the last chunk
    • It calls unbundle
    • Which calls hg incoming, which takes ~3.5 min (it gets logged to /var/cache/hgresume/<transaction-ID>.bundle.incoming.async_run)
  • Because the request takes so long, the PHP script presumably gets torn down while it's waiting for the hg incoming command to finish (the command itself does finish, because it runs in its own process)
  • Because the PHP script gets torn down, no work actually happens: it never gets to running the command hg unbundle and creating a lockfile for that command (a lockfile is created for hg incoming, but that's a separate file)
  • Because the lockfile doesn't get created, when Chorus retries the push-bundle, the server throws an Exception

Do we want to allow big pushes like this? I think so! So how?

  • It's fine if Chorus times out, as long as the job actually happens and we return a more meaningful response. The code tries to return a 200, but fails, because the lockfile it's expecting doesn't exist. The exception is good, because a missing lockfile means nothing is happening.

So we either need to:

  1. Move more stuff into an external command that doesn't get torn down 🙁
  2. Prevent the PHP script from getting torn down (e.g. move large pushes to a Lexbox Job and make sure we turn off everything that might kill a long PHP script)
  3. Make the retries smarter and have them pick up where the last one died

I think 3 sounds like the best bet. Something like:

  • Replace the exception-throwing isComplete check with something that anticipates this scenario:
    • If there's no lockfile, retry the unbundle
    • In the unbundle, detect if hg incoming already ran and if so:
      • Do the necessary validation
      • Then start hg unbundle
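
To make the decision in option 3 concrete, here is a rough sketch of it as a pure function. It is written in Python purely for illustration (the real server is PHP), and every name in it (Action, next_action, the three boolean inputs) is hypothetical, not taken from the hgresume code. How the server would detect each condition, for example from the files under /var/cache/hgresume, is deliberately left out. The "wait" branch for an hg incoming that is still running is my assumption; the list above doesn't cover that case.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for a retried last-chunk request."""
    RUN_INCOMING = "run hg incoming"
    WAIT_FOR_INCOMING = "hg incoming still running, ask the client to retry later"
    VALIDATE_AND_UNBUNDLE = "validate the hg incoming result, then start hg unbundle"
    REPORT_IN_PROGRESS = "hg unbundle already started, report progress"


def next_action(incoming_started: bool,
                incoming_finished: bool,
                unbundle_lock_exists: bool) -> Action:
    """Decide what a retry should do instead of throwing on a missing lockfile.

    All three inputs are assumptions about state the server could observe;
    none of them maps to a known function in the real code base.
    """
    if unbundle_lock_exists:
        # The unbundle lockfile exists, so the work is already underway.
        return Action.REPORT_IN_PROGRESS
    if incoming_finished:
        # The previous request died after hg incoming completed:
        # reuse its result rather than running it again.
        return Action.VALIDATE_AND_UNBUNDLE
    if incoming_started:
        # hg incoming runs in its own process and may still be going.
        return Action.WAIT_FOR_INCOMING
    # Nothing has happened yet: behave like a first attempt.
    return Action.RUN_INCOMING


if __name__ == "__main__":
    # The failure described in this issue: hg incoming finished,
    # but the unbundle lockfile was never created.
    print(next_action(True, True, False).value)
```

The point of keeping it a pure function is that the four cases can be tested without touching Mercurial or the filesystem; the file checks would live in whatever calls it.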
