Saturday, April 16, 2011

Google App Engine, scheduled tasks, and persisting changes into the datastore: the risk of a race condition

This post is about a race condition I accidentally discovered and, hopefully, fixed. It occurred in App Engine and was caused by tasks I created for immediate execution...

Context

When I started developing in Java for Google App Engine, I decided to give JDO a try, mainly because it is datastore agnostic (*). Operations managing my entities are organized in DAOs, with sets of methods like the following.

public Demand update(Demand demand) {
    PersistenceManager pm = getPersistenceManager();
    try {
      return update(pm, demand);
    }
    finally {
        pm.close();
    }
}

public Demand update(PersistenceManager pm, Demand demand) {
    // Check if this instance comes from memcache
    ObjectState state = JDOHelper.getObjectState(demand);
    if (ObjectState.TRANSIENT.equals(state)) {
        // Get a fresh copy from the data store
        ...
        // Merge old copy attributes into the fresh one
        ...
    }
    // Persist the changes
    return pm.makePersistent(demand);
}

I knew that changes are persisted only when the PersistenceManager is closed, so closing it right after an update is the safe thing to do. I nevertheless decided, for clarity, to separate the management of the PersistenceManager instance from the business logic updating the entity.

This decision offers the additional benefit of being able to share one PersistenceManager instance among many operations. The following code snippet illustrates my point: a single PersistenceManager instance is used for two entity loads and one save.

public void processDemandUpdateCommand(Long demandKey, JsonObject command, Long ownerKey) throws ... {
    PersistenceManager pm = getPersistenceManager();
    try {
        // Get the identified demand (can come from the memcache)
        Demand demand = getDemandOperations().getDemand(pm, demandKey, ownerKey);

        // Check if the demand's location is changed
        if (command.contains(Location.POSTAL_CODE) || command.contains(Location.COUNTRY_CODE)) {
            Location location = getLocationOperations().getLocation(pm, command);
            if (!location.getKey().equals(demand.getLocationKey())) {
                command.put(Demand.LOCATION_KEY, location.getKey());
            }
        }

        // Merge the changes
        demand.fromJson(command);

        // Validate the demand attributes
        ...

        // Persist them
        demand = getDemandOperations().updateDemand(pm, demand);

        // Report the demand state to the owner
        ...
    }
    finally {
        pm.close();
    }
}

For my service AnotherSocialEconomy, which connects Consumers to Retailers, the life cycle of a Demand is made of many states:
  • State open: raw data just submitted by a Consumer;
  • State invalid: one verification step failed, requires an update from the Consumer;
  • State published: verification is OK, and Demand broadcasted to Retailers;
  • State confirmed: Consumer confirmed one Proposal; Retailer reserves the product for pick-up, or delivers it;
  • State closed: Consumer notified the system that the transaction is closed successfully;
  • State cancelled: ...
  • State expired: ...

In my system, some operations take time:
  • because of congestion in the environment, which sometimes occurs when sending e-mails;
  • because some operations have to process a large data set, like when a Demand has to be broadcast to the selected Retailers.

Because of this time constraint and the 30-second request limit, I decided to use tasks extensively (tasks can run for up to 10 minutes). As a side effect, my code is now very modular, and easier to maintain and test.

So I updated my code to trigger a validation task once the Demand has been updated with the raw data submitted by the Consumer. The following code snippet shows the task scheduling in the context of the processDemandUpdateCommand() method illustrated above.

public void processDemandUpdateCommand(Long demandKey, JsonObject command, Long ownerKey) throws ... {
    PersistenceManager pm = getPersistenceManager();
    try {
        ...

        // Update the state so the entity is ready for the validation process
        demand.setState(State.OPEN);

        // Persist them
        demand = getDemandOperations().updateDemand(pm, demand);

        // Create a task for that demand validation
        getQueue().add(
            withUrl("/_tasks/validateOpenDemand").
                param(Demand.KEY, demandKey.toString()).
                method(Method.GET)
        );
    }
    finally {
        pm.close();
    }
}

Issue

Until I activated the Always On feature, no issue had been reported for that piece of code: my unit tests worked as expected, my smoke tests were fine, the live site behaved correctly, etc.

Then an issue started to appear randomly: sometimes, updated Demand instances were no longer processed by the validation task! Triggering the task manually from a browser or with curl, however, produced the expected result...

To keep the task idempotent, the state of the Demand instance to be validated is checked first: if it is open, the Demand attributes are verified and the state ends up set to invalid or published; otherwise nothing happens. With that approach, Demands that have already been validated are not processed a second time...
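Sketched in plain Java, the idempotency guard looks like this (a toy model with an illustrative `State` enum and `validate()` method, not the actual App Engine code):

```java
// Toy model of the idempotent validation task (illustrative only):
// the Demand is processed only when its state is OPEN; any other
// state means a previous run already handled it, so we exit quietly.
public class ValidationGuard {
    enum State { OPEN, INVALID, PUBLISHED }

    static State validate(State current, boolean attributesValid) {
        if (current != State.OPEN) {
            return current; // already validated: do nothing, report nothing
        }
        // The attribute checks decide the next state
        return attributesValid ? State.PUBLISHED : State.INVALID;
    }

    public static void main(String[] args) {
        System.out.println(validate(State.OPEN, true));       // PUBLISHED
        System.out.println(validate(State.OPEN, false));      // INVALID
        System.out.println(validate(State.PUBLISHED, true));  // PUBLISHED (unchanged)
    }
}
```

The silent exit on non-open states is exactly what made the race hard to spot: no error is ever reported.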

What occurred?
  • Without the Always On feature, because of the low traffic in my application, the infrastructure delayed the processing of the validation task a bit, and it was executed once the request processing had finished.
  • Thanks to that soft serialization of the processes, the datastore update triggered by the pm.close() instruction had every chance to complete before the start of the validation task!
  • With the Always On feature activated, the infrastructure was much more likely to hand the validation task to one of the two other application instances... which could happen before the datastore update completed...
  • Because it started before the datastore update, the validation task found the Demand in the state set by the previous run of the task for that instance: invalid or published. It therefore exited without reporting any error.
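The interleaving can be replayed deterministically in plain Java (again a toy model, not App Engine code): the only difference between the two scenarios below is whether the "persist" happens before or after the task runs.

```java
import java.util.concurrent.atomic.AtomicReference;

// Deterministic replay of the race (illustrative model, not App Engine):
// the "datastore" is a single cell; the validation task only acts on OPEN.
public class RaceReplay {
    enum State { OPEN, PUBLISHED }

    static final AtomicReference<State> datastore =
        new AtomicReference<>(State.PUBLISHED); // state left by a previous run

    static String runValidationTask() {
        if (datastore.get() == State.OPEN) {
            datastore.set(State.PUBLISHED);
            return "validated";
        }
        return "skipped"; // looks already validated: exits silently
    }

    public static void main(String[] args) {
        // Race: the task fires before pm.close() has persisted OPEN...
        String racy = runValidationTask();
        datastore.set(State.OPEN); // ...so the persist lands too late
        System.out.println("task before persist: " + racy);   // skipped

        // Fixed ordering: the state is persisted, then the task runs
        String fixed = runValidationTask();
        System.out.println("task after persist: " + fixed);   // validated
    }
}
```

With Always On, the "task before persist" ordering simply became much more likely, which is why the defect surfaced only then.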

Solutions

The ugly one:
Add a delay before executing the task with the countdownMillis() method.

        // Create a task for that demand validation
        getQueue().add(
            withUrl("/_tasks/validateOpenDemand").
                param(Demand.KEY, demandKey.toString()).
                method(Method.GET).
                countdownMillis(2000)
        );
    }
    finally {
        pm.close();
    }
}

A tricky one:
Use memcache to store a copy of the Demand, which the validation task will use instead of reading it from the datastore. Because there is no guarantee that your entity won't be evicted before the run of the validation task, this is not a solution I can recommend.

The simplest one:
Move the task-scheduling code outside the try...finally... block. The task will then be scheduled only after the updates of the Demand instance have been persisted.

public void processDemandUpdateCommand(Long demandKey, JsonObject command, Long ownerKey) throws ... {
    PersistenceManager pm = getPersistenceManager();
    try {
        ...

        // Update the state so the entity is ready for the validation process
        demand.setState(State.OPEN);

        // Persist them
        demand = getDemandOperations().updateDemand(pm, demand);
    }
    finally {
        pm.close();
    }

    // Create a task for that demand validation
    getQueue().add(
        withUrl("/_tasks/validateOpenDemand").
            param(Demand.KEY, demandKey.toString()).
            method(Method.GET)
    );
}

The most robust one:
Wrap everything within a transaction. When a task is scheduled within a transaction, it is actually enqueued only when the transaction is committed.

Be aware that adopting this solution may require a major refactoring.
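The guarantee itself is easy to picture with a toy model in plain Java (illustrative semantics only, not the App Engine API): tasks added while a transaction is active are buffered, and reach the real queue only if the commit succeeds.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of transactional task enqueueing (illustrative, not the
// App Engine API): tasks added during a transaction are buffered and
// only reach the real queue when the transaction commits.
public class TxQueueDemo {
    static final List<String> queue = new ArrayList<>();

    static class Tx {
        private final List<String> pending = new ArrayList<>();

        void add(String task) {
            pending.add(task);      // buffered, not visible to workers yet
        }

        void commit() {
            queue.addAll(pending);  // flushed only on successful commit
            pending.clear();
        }

        void rollback() {
            pending.clear();        // nothing is ever enqueued
        }
    }

    public static void main(String[] args) {
        Tx tx = new Tx();
        tx.add("/_tasks/validateOpenDemand");
        // If the datastore write failed here, rollback() would drop the task too.
        tx.commit();
        System.out.println(queue); // the task exists only after the commit
    }
}
```

With these semantics, the validation task can never observe a state older than the one committed alongside it, which is why this solution is the most robust.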

Conclusion

Now that I understand the issue, I'm a bit ashamed of it. In my defense, I should say the defect was introduced as part of an iteration that came with a series of unit tests. Before the activation of the Always On feature, it stayed undetected, and even afterwards it occurred only rarely.

Anyway, verifying that no task is scheduled before the changes it depends on have been persisted is now a point on my review checklist.

I hope this helps,
A+, Dom

--
Notes:
* These days, I would start my application with Objectify. This blog post summarizes many arguments in favor of Objectify that I agree with.
