March 14, 2017

Apply Filters
Episode 77: How to Build a Batch Processing System with Drew Jaynes

Today’s episode is sponsored by Pagely, who are the original gangsters of managed WordPress hosting. They come from humble beginnings, but now they host huge brands like Disney, Visa, eBay and more. They know their customers well and they understand and provide solutions for the complex challenges of scaling. Check them out at Pagely.com for more information.

Our guest today is Drew Jaynes. He works for Sandhills Development with Pippin, and today we’re going to talk about batch processing. Drew will talk about his recent projects, and he’ll share tips on batch processing, data sets, and more.

Some of the highlights of the show include:

  • What batch processing is and why it’s important, including examples of why you might need to use it with your WP site.
  • Challenges Drew ran into when building the new batch processing API.
  • The differences between batch processing and step processing.
  • Detailed information on how Drew’s new batch processor works and how it improves on the previous one. He also goes into Ajax callbacks.
  • Examples of things Drew has done that depend on PHP 5.3 or later.
  • Possible applications that the batch processor might be used for in the future.
  • How a batch-processing failure is handled.
  • Advice that Drew would give to someone who is building their first batch processor with their plugin.

Links and Resources:

Pagely

Episode 49

AffiliateWP batch processing API

If you’re enjoying the show, we sure would appreciate a review in iTunes. Thanks!

Transcript
INTRO: Welcome to Apply Filters, the podcast all about WordPress development. Now here’s your hosts, Pippin Williamson and Brad Touesnard.

PIPPIN: Welcome back to Episode 77 of Apply Filters. Today we’re going to talk again to Mr. Drew Jaynes. But before we do that–

BRAD: This episode is sponsored by Pagely, undoubtedly the original gangsters of managed WordPress hosting. They’ve been around for a long time, but man have they evolved over the last few years from humble beginnings to now hosting huge brands like Disney, Visa, Comcast, eBay. I could go on.

They’ve also figured out who their customers are. They realized that people come to them with complex scaling, deployment, and security challenges, challenges deemed unsolvable by other providers. And so they made that their niche. That’s what they do. They provide solutions to these challenges.

Over the years their technology has seen big improvements as well. One of the biggest changes they ever made was moving all of their hosting onto Amazon Web Services, and it has paid off big time for them. In fact, there was an S3 outage just recently and they were able to stay operational. There was even a tweet I remember from Scott Bollinger. I looked it up and it reads, “Impressed that while AWS is down, my @Pagely site with AWS infrastructure is still up. Redundancy ftw” (for the win).

If you have a big scaling problem, a big hosting scaling problem, you should definitely get in touch with the great folks over at Pagely. They’d be more than happy to chat with you and find a solution to your scaling problem. Also, check out Pagely.com for more details about their services and contact details.

PIPPIN: I’ll vouch for them personally. They host all of our websites and have been nothing but awesome for the last several years. We’ve hosted Affiliate WP, Easy Digital Downloads, Restrict Content Pro, and actually a whole bunch of other sites that a lot of people don’t know about, all on Pagely, and they have worked absolutely swell for us.

All right, so today we’re going to talk with Drew Jaynes. Drew Jaynes actually joined us previously back on Episode 49. Back then Drew worked as a platform engineer at 10up and now, full disclosure, Drew actually works for Sandhills Development, which is my company, on our Affiliate WP product.

We wanted to bring Drew in today to talk about batch processing. Brad and I have covered batch processing in various ways over the last–I don’t know–quite a few episodes. We keep touching on it here and there with some of the work that Brad and his team have been doing and some of the work that we’ve been doing on our side. Anyway, we thought it would be good to bring Drew in and get his take on batch processing and ask him some hopefully harder questions about the process of building a batch processing API.

One of Drew’s projects for the last six months or so was to completely rebuild the batch processing API in Affiliate WP so that we could do things with big data sets. Anyway, he went to town on it and he made it happen. Say hi, Drew.

DREW: Hello.

PIPPIN: Nice to have you on.

BRAD: Okay, Drew. Let’s get into this thing. What’s batch processing anyway? Can you just give a basic overview of what that means?

DREW: I don’t know. I guess the elevator pitch of a batch processor would be cycling through a big data set in batches, essentially. Performing some action on batches of data so that you can work through a big data set in tiny little pieces.

PIPPIN: Are there any quick examples maybe why we use batch processors that listeners would be familiar with, within the WordPress ecosystem?

DREW: Actually, I can give an example from just a couple, like the last major release of Affiliate WP. We added a database upgrade routine that uses a batch processor. I think it cycled through every affiliate in the database for the current site and recalculated unpaid earnings or something. But the point is that we have no way of knowing if a customer site has 10 affiliates or 10,000 affiliates, right? You don’t want to try to run one process against 10,000 affiliates. We’d much rather run 10,000 processes, obviously pretty quickly, but one at a time instead of all at once.
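
[Editor’s note: here’s a minimal sketch of the idea Drew describes, with hypothetical helper names rather than AffiliateWP’s actual functions: act on a big data set in small slices instead of in one giant request.]

```php
<?php
// A sketch of the idea (hypothetical names): process fifty records
// at a time instead of the whole data set in one request.
$per_batch = 50;
$offset    = 0;

do {
    $batch = get_affiliate_ids( $per_batch, $offset ); // hypothetical query helper

    foreach ( $batch as $affiliate_id ) {
        recalculate_unpaid_earnings( $affiliate_id ); // hypothetical per-item action
    }

    $offset += $per_batch;
} while ( ! empty( $batch ) );
```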

PIPPIN: I guess one of the main things that batch processing really achieves is the ability to handle big data sets. Are there any examples from, say, the wider WordPress ecosystem, more WordPress Core where a batch processor is used to do something?

DREW: Yeah, I think that probably the best example of where you could already interact with a batch processor of some stripe would be the network upgrade tool, like when you upgrade WordPress and you need to upgrade something across your multisite network. There’s like a poor man’s batch processor built into Core that cycles through every site in your network and upgrades them all one at a time.

BRAD: There’s also one for metadata as well or the taxonomy.

DREW: Right. Yeah.

BRAD: Didn’t they get rid of one of the tables or something? I can’t remember. It’s been a while since I looked at that.

DREW: Yeah. They consolidated stuff for the taxonomy roadmap. But, yeah. But see, the difference is that Core uses — I called it a poor man’s batch processor — and that’s because it uses page refreshes, right? So, like, on every single step the page has to refresh.

BRAD: Oh, man. Yeah, that’s ugly. Like the whole page, it just feels super clunky, right? Now that you say that, I remember now what it looks like when you upgrade a network install, and it’s just like clunk–

DREW: Exactly.

BRAD: –clunk–

DREW: Yeah.

BRAD: –clunk. For like every single site it does that. Oh, man.

DREW: I mean it gets the job done, but it’s definitely not very modern in the way that it’s architected.

PIPPIN: I know we’ve used some old batch processors like that in various plugins. EDD, for example, used a page refresh for a long time, and it was a constant challenge trying to get big sites upgraded. The idea of that batch process was to handle big data sets. Yet it would still fail because we were still relying on the page refresh. Even if you architected it in a way that was designed to avoid hitting the maximum number of redirects, it could still happen.

Going into this, Drew, one of your goals was to write a new batch processing API. Now, Affiliate WP actually already had one. We had built a batch processing API that was originally used for a database migration tool to import data from previous affiliate programs and other plugins, and then also to, say, import a user database and create affiliate accounts for them.

What was wrong with that batch processor? What were some of the limitations of it, and what were some of your kind of goals of building the new API?

DREW: If I remember correctly, because I didn’t build it, I think it followed the page refresh format, first of all, right? I don’t remember exactly, but I think that’s right. We also used it as a user migration tool. I think there was a batch process for that too.

Again, it was sort of clunky because it relied on page refreshes, which meant that there wasn’t a really great way to track progress. You could sort of see which step you were on, but all the context was lost, right? You would know which step you were on, but you didn’t really know what was going on, and so you would have to just wait until it finished to see if it did what it said it did. I mean it was a good effort. I think it was just in some ways inflexible.

A big one, and this is kind of a big piece, was that I think we had one batch processing base class or something like that that lived outside of the APIs it was extending. So you had this problem where, if you wanted to create a batch processor, rather than hooking into, say, the export API, you had to write a new export batch process base class or something like that. It was just a little clunky in terms of implementing new ones. But it got the job done for what we had, I think.

BRAD: Can we back up just a sec about batch processing? We mentioned that one of the reasons for batch processing is that we have this giant data set that we need to operate on, but I think another dimension of this is the action. The action that needs to be performed on each item of that data set, that action can be a very quick thing. It could be a very small process to execute, or it could be a very large one, right?

There are two things at play here. You could end up with a giant data set where every single thing that you need to operate on is actually quite a bit of processing, either power or time as well. That’s just something else to consider, I guess, when you’re designing a batch processing system, right?

DREW: Well, and I think it might be worth, Pippin, differentiating between the batch processor we had before and the one we have now. I think before maybe a better characterization would be a step processor, right? Whereas now we have a batch processor, which we have a lot more flexibility to process one or many items. Whereas before it was sort of like one data set and then you were stuck with stepping through them, but there was a lot less control over one versus many. But yeah, I think maybe calling the older one a step processor would be better, a better differentiation between the two.

PIPPIN: I think that’s a really accurate description of the old one and what we had previously built, not just in Affiliate WP, but also in EDD and other places. It really did work as a step processor where we said, okay, based on the amount of data we have, we have 500 steps and we’re going to do 30 items per step. Most of the time that literally translated to the posts-per-page parameter of WP_Query.

DREW: Well, and another way, another thing, too, I think a big difference with the other one was that — so, Brad, just to give you some perspective here, like, when we’re talking about a batch processor, we’re talking about polling for a set of data and then performing batch processes against it, but generally working toward a goal of whatever that is, like a set goal, like a finite point that we already know about. Whereas with like a step processor, we would like do the query on every step, but we would just set the offset, for instance, for whatever the per step value was.

Let’s say we’re doing something where we have to query affiliates. We would query 30 affiliates and then, in the next step, we’d query the next 30. But the problem that we would have is that, over time, it was hard to track your progress because unless you were proactively storing the total number of items finished as you go, at the end we had no way of determining the total number of items that had been affected.

BRAD: You weren’t actually — when you’d finish operating on a set of data or an item, you wouldn’t store a piece of metadata that said this one has been processed? That kind of thing?

DREW: Right. We didn’t really have a standardized temporary storage, data store going on.

PIPPIN: It was honestly an assumption of, hey, we’re on step 55, and so that means that we have supposedly processed 30 times 55 records, and that was it.
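
[Editor’s note: a sketch of the old step-processor pattern being described, with hypothetical names. Note that the “progress” it reports is only ever an assumption, which is the flaw under discussion.]

```php
<?php
// Sketch of the old "step processor" (hypothetical names): each step
// re-queries with an offset; nothing records what was actually touched.
function process_step( $step, $per_step = 30 ) {
    $affiliates = get_affiliates( array(    // hypothetical query helper
        'number' => $per_step,
        'offset' => ( $step - 1 ) * $per_step,
    ) );

    foreach ( $affiliates as $affiliate ) {
        recalculate_earnings( $affiliate ); // hypothetical per-item action
    }

    // The flaw: this is a guess, not a count of processed records.
    return $step * $per_step;
}
```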

DREW: Right.

BRAD: Okay.

DREW: Yeah.

BRAD: Right.

PIPPIN: Yeah.

BRAD: You could end up, in that scenario, processing the same thing twice? Is that right?

PIPPIN: Definitely.

DREW: Yeah, and you could also end up in a scenario where you would give them bad information about what you’d actually processed on the other end because you don’t actually know what you’ve processed.

BRAD: Right. Right. Okay. All right, so that sets the stage pretty well, I think, for the improvement. What does the new batch processor do? Let’s start with that. How does it keep track of what it’s processed?

DREW: Well, actually, we can give an example for something that just got merged in for the next release. We had this tool for recounting affiliate stats, like we can recount paid earnings based off of paid referrals or unpaid earnings or the total number of referrals associated with an affiliate or the total number of visits, let’s say, right?

In the old way we would have basically just pulled down all affiliates. In the tool, you choose one of those four things to recount, for one affiliate or for all affiliates. In the past, we would have pulled down affiliate objects for all affiliates and then looped through them one at a time, but maybe only actually processed like 10% of those. We would still have to pull down all of them.

With this new system (and we had a discussion about this because there was some confusion), the new batch processor only pulls the unpaid referrals and then retrieves the affiliate IDs from those records. Those are the only affiliates that we actually account for. We just discount everyone else automatically. It’s just more efficient.

PIPPIN: At the end then you can give an actual — you can give information, feedback to the site owner that says we processed this for five accounts.

DREW: Right. Then of course the problem with that was the confusion it creates by telling them how many affiliates have been processed. Then you want to know, well, why didn’t you process all of my affiliates? We ended up having to change the messaging in the end to be vague, to say all matching affiliates were processed, because, in the end, each of the four different options produces different numbers and that’s confusing.

PIPPIN: Can you, Drew, give listeners an overview of how the new batch processor works, from the different kinds of batch processes that we have to the registry system that was built? Just really a good overview of how the whole system works now.

DREW: We actually forked some code that was in Easy Digital Downloads to start our new batch processing API. But as I recall, the way it was built for EDD was for one very specific process, like import or export or something. In our case we needed something that could be generalized, something that could be used for almost anything, you know, as defined in a registered batch process, let’s say.

The first thing that I did was build essentially a registry class, which registers batch processes and defines the class name and the file for each. The reason that we did it this way is that it keeps the batch processor incredibly lightweight, meaning that we’re not loading all of the batch processing files on every page load, like WordPress does for a lot of things. We’re simply loading the registry. Then we can call them on demand and load the files on demand when we need them. This gives us incredible flexibility while also keeping the footprint low.
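
[Editor’s note: a minimal sketch of the registry idea, with hypothetical names. Only IDs, class names, and file paths are stored up front, so none of the heavy batch classes load until a process actually runs.]

```php
<?php
// A sketch of a lightweight registry (hypothetical names): registering
// is cheap because the class file is never included here.
class Batch_Process_Registry {
    private $processes = array();

    public function register( $batch_id, $class, $file ) {
        $this->processes[ $batch_id ] = array(
            'class' => $class,
            'file'  => $file,
        );
    }

    public function get( $batch_id ) {
        return isset( $this->processes[ $batch_id ] ) ? $this->processes[ $batch_id ] : false;
    }
}

$registry = new Batch_Process_Registry();
$registry->register(
    'recount-earnings',          // hypothetical batch ID
    'Recount_Earnings_Process',  // hypothetical class name
    __DIR__ . '/batch/class-recount-earnings.php'
);
```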

Essentially, the way our batch processor works right now is all batch processes are run through the same AJAX callback. We just assign some values in the form, like data attributes and classes, with some JavaScript hooked to the selector for whatever class that is, and then literally every batch process is run through the same callback. The callback handles pulling up the registry, finding the batch by the ID that you’ve given in the form, and then it loads the class and runs the batch process.

It’s really nice the way it’s architected. I’m so modest. It’s really nice that it’s architected in this way because that means we only have to write one AJAX callback and then we can handle all of the other logic within the server side. I think that it really ended up being kind of lean because all you have to do is register your processes and then the AJAX callback will handle the rest.
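
[Editor’s note: a sketch of the single shared AJAX callback pattern, building on the hypothetical registry sketched above. The wp_ajax_* hook, wp_send_json_success(), and wp_send_json_error() are real WordPress APIs; everything else is illustrative.]

```php
<?php
// One callback for every batch process: look the batch up by ID,
// load its file on demand, run one step, and report back.
add_action( 'wp_ajax_process_batch', function () {
    $batch_id = isset( $_POST['batch_id'] ) ? sanitize_key( $_POST['batch_id'] ) : '';
    $step     = isset( $_POST['step'] ) ? absint( $_POST['step'] ) : 1;

    $process = batch_process_registry()->get( $batch_id ); // hypothetical registry accessor

    if ( ! $process ) {
        wp_send_json_error( 'Unknown batch process.' );
    }

    require_once $process['file'];    // loaded on demand, not on every page load

    $class  = $process['class'];
    $batch  = new $class( $step );    // re-instantiated on every step
    $result = $batch->process_step(); // next step number, or 'done'

    wp_send_json_success( array( 'step' => $result ) );
} );
```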

PIPPIN: How does it then know? When you send that AJAX callback and say this is the batch process that I’m running, how does it know where it’s at in the batch? You know you have this data set to process, but how does it know how to go through each of the steps or each of the batches within that data set? How does that part work?

DREW: It’s actually an interesting interplay between front end and server side –client side and server side, I should say. It essentially uses AJAX. Each step is an AJAX call, right?

There was some debate about this when we were originally building it, like, should we instantiate the class once and then just iterate on a static method on every call? What we ultimately decided to do was to actually instantiate the batch process class on every call, on every step call. If there are 10,000 steps, it’s being re-instantiated 10,000 times, but the point is that we’re passing the step data around, like through the constructor of the batch process. Then it processes a step based on whatever the current step is.

In some ways it’s not that far off of the step processor, but the point is that we don’t lose context, so we can do things like pre-fetch. Here’s an example. The AJAX callback tests to see if this is the first step. If it is, then sometimes we run a pre-fetch, which is like an expensive query whose data we may need for every step. We’ll run pre-fetch and use a new, temporary data storage API to store that data for the duration of the batch process. Then we have an initialization method that pulls data from the form and makes that data available to every step instance of the batch process. We have pre-fetch and initialization, which allow us to sort of introduce context to each step. Then it just blindly steps through.

The process step method, or whatever, just passes a value back. If it detects that there is more data to process, it passes back the next step number. If it’s done, then it passes back done. Then the API just knows what to do with that; it finishes the process and shows the message, whatever that is.
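
[Editor’s note: a sketch of the step lifecycle Drew walks through, with hypothetical names: pre-fetch runs once on the first step and stashes its results in the temporary data store, and every step returns either the next step number or the string ‘done’.]

```php
<?php
// Sketch of one batch process class (hypothetical names).
class Recount_Earnings_Process {
    private $step;

    public function __construct( $step ) {
        $this->step = $step; // step data arrives via the constructor each call
    }

    public function process_step() {
        if ( 1 === $this->step ) {
            // Expensive query, run once; later steps read the stored result.
            store_batch_data( 'unpaid_referrals', fetch_unpaid_referrals() ); // hypothetical
        }

        $items = $this->get_items_for_step();

        if ( empty( $items ) ) {
            return 'done'; // tells the API to finish up and show the message
        }

        foreach ( $items as $item ) {
            $this->process_item( $item );
        }

        return $this->step + 1; // the next step the client should request
    }

    private function get_items_for_step() { return array(); } // stubbed for the sketch
    private function process_item( $item ) {}                 // stubbed for the sketch
}
```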

BRAD: Let’s say you load the user interface and you hit a button to kick off the batch processing. I guess what you’re going to see is some kind of progress bar, I guess. Is that right?

DREW: Yeah, and actually it’s funny you say progress bar because, like I said, we borrowed, or we forked, code from Easy Digital Downloads. It actually doesn’t use a progress bar like you would imagine as a developer. You would think to use the jQuery UI progress bar or whatever. It doesn’t use a real progress bar. It actually uses a div. It splits up the div into the number of pieces that it needs, and then it just adds background from left to right as you go. In some ways it’s kind of ingenious because we don’t have to load the progress bar script.

PIPPIN: It was a total hack that just happened to work really, really well.

DREW: Yeah, it’s amazing. But yes, I mean sorry to get off point. Yes, it’s a “progress bar.”

BRAD: You had these steps. You’re saying “steps.” AJAX is just passing one step, like step one, step two, step three to the backend. Then the backend kind of just knows what to do next as the next step? Is that right?

DREW: Sort of. Essentially what happens is the form passes the batch ID, which we need so we can pull the process from the registry and load the class. We essentially send the serialized form data, which includes the batch ID and a couple of other things. Those get passed to the AJAX callback.

BRAD: Does the batch ID — the batch ID refers to the data that you’ve pre-fetched or not?

DREW: The batch ID refers to the batch process that you registered, pre-registered.

BRAD: Okay.

PIPPIN: Then we use that to pull the data from the registry, which is in memory. But, yeah, to answer your question, essentially what happens is what we talked about with the data pre-fetch and the temporary data store. The way the API works is, let’s say it pulls some massive query in pre-fetch and stores it in a temporary data store. Then as we go, as each step is processed, we actually store the current count of where we are out of the total count.

So like in the pre-fetch we have, let’s say, 1,000 items to process. Then when a step goes through, it determines how many items were processed for that step. Then it updates the current count against the total count. That’s how we display the progress bar. That’s how we know how far we are because, in pre-fetch, we’ve determined how many items we have to process. Then as we process, we update the count. That determines the percentage for the progress bar.
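
[Editor’s note: the progress math in miniature, with hypothetical storage helpers. The total is recorded during pre-fetch, each step adds what it actually processed, and the ratio drives the progress bar on the client.]

```php
<?php
// Sketch of per-step progress tracking (hypothetical helpers).
function update_progress( $batch_id, $processed_this_step ) {
    $total   = (int) get_batch_data( $batch_id, 'total_count' );   // set in pre-fetch
    $current = (int) get_batch_data( $batch_id, 'current_count' ) + $processed_this_step;

    set_batch_data( $batch_id, 'current_count', $current );        // hypothetical temp store

    // Percentage sent back to the client for the progress bar.
    return ( $total > 0 ) ? min( 100, round( 100 * $current / $total ) ) : 100;
}
```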

BRAD: Right. Does that data come back as the AJAX request finishes? Is that right?

DREW: Yeah. I mean essentially via JavaScript we update the progress bar with the current percentage.

BRAD: Right. When you say temporary data store, is that something you guys cooked up on the backend, or are we talking about a front–?

DREW: Yeah, it is very literally like storing an option.

BRAD: Okay.

DREW: It’s just that we bypass a lot of the core APIs. In the end we’re using the core options table, but we’re not setting transients or doing a get_option or update_option or delete_option because, in some ways, we want it to be so lightweight that it falls completely outside of the core APIs.

BRAD: Right, because you don’t want it to be caching that stuff.

DREW: Well, and plus we don’t want it to be firing off all of the core actions and stuff that happen whenever you use the APIs. We just want it to store data for us temporarily. Then once the batch process is finished, we fire off a finish method, which trash collects all of this data, the temporary stuff we’ve been storing.
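
[Editor’s note: a sketch of what a temporary data store like the one described might look like: writing to the options table with $wpdb directly so no core option APIs, actions, or caches are involved. All of the names here are hypothetical.]

```php
<?php
// Hypothetical temporary data store that sidesteps get_option() and
// friends: direct options-table reads/writes, never autoloaded.
function batch_data_set( $key, $value ) {
    global $wpdb;
    $wpdb->replace( $wpdb->options, array(
        'option_name'  => 'batch_tmp_' . $key,
        'option_value' => maybe_serialize( $value ),
        'autoload'     => 'no',
    ) );
}

function batch_data_get( $key ) {
    global $wpdb;
    $value = $wpdb->get_var( $wpdb->prepare(
        "SELECT option_value FROM {$wpdb->options} WHERE option_name = %s",
        'batch_tmp_' . $key
    ) );
    return maybe_unserialize( $value );
}

// The finish step garbage-collects everything the batch stored.
function batch_data_flush() {
    global $wpdb;
    $wpdb->query( "DELETE FROM {$wpdb->options} WHERE option_name LIKE 'batch_tmp_%'" );
}
```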

BRAD: Is that part of the API then, the temporary data and all that stuff?

DREW: It’s sort of an adjacent API, but it was built at the same time.

PIPPIN: You ended up building a data storage class that’s completely separate. It just happens to be usable by the batch processor.

DREW: Right. We built it adjacent to the batch processor because we needed it, but it’s not really part of the batch processing API. It just uses it.

PIPPIN: There were a bunch of different things that you did in the batch processing API, such as using interfaces and implementations. What are some of these? Give me some examples of things that you did that depend on PHP 5.3 or later. Now, I ask this for a couple of reasons.

DREW: Almost everything.

PIPPIN: First of all, in our last episode, which was primarily focused on listener questions, there were some questions about PHP 5.3 and the requirements, getting people up to 5.3, still using 5.2, et cetera. One of the questions that I like to ask people is: what can you do on 5.3 that you couldn’t do on 5.2? Because you built an API that does depend on 5.3, and Affiliate WP requires 5.3 or later, I think it’s technically like 5.3.4, did utilizing those 5.3+ APIs provide a significant advantage, or did you just use them because they were available?

DREW: Aren’t interfaces available on 5.2? I can’t remember.

PIPPIN: I honestly don’t know the answer to that.

DREW: I don’t remember exactly when it was–

BRAD: I think they are. I think interfaces are. Namespaces are not, for sure.

DREW: I guess it’s worth mentioning that we used, in some ways, some very specific 5.3+ things generally that some of these new APIs rely on. I don’t know. I can’t remember when interfaces were originally introduced, but a big thing here, since Pippin was mentioning interfaces: I think before 5.3.9 or something, you couldn’t implement more than one interface. I think you could only implement one at a time.

Okay, so one pattern that you’ll see commonly, even in Affiliate WP today, is somebody will write a base class or an abstract class that is then extended by something else. The base class maybe handles the base logic, whatever that is. Our integrations API does that, right? We have a base class, and each integration extends the base class to do things specific to that integration.

But a limitation of that is that if we were to define a base batch process, for instance, we would be severely limiting our flexibility. Let me give you an example of that. Prior to the release where we did the batch processor, we already had an export and import API that used step processes or was just like a one-time shot or whatever. But the point is that we already had export APIs. The problem is that if you create a new batch process base class, you would essentially have to recreate the export API, because if each batch process is forced to extend the base batch process, it sort of lives in its own bubble outside of all other APIs.

Really, what you have to think about is: what exactly does a base class, or an abstract base class, do? It essentially creates a contract, right? When you create an abstract base class and you extend it, let’s say there are abstract methods in there. That dictates that any extending classes have to implement those methods because they’re abstract, right?

We’ve essentially managed to enforce the contract using interfaces instead of an abstract class, so that we can create batch processes that extend existing APIs but implement standardized interfaces, right? We have a batch process interface, and we have a batch-process-with-pre-fetch interface that extends the base interface. The whole point of this discussion is the contract, essentially. Each batch process implements the contract of these interfaces and can live separately in each API.

Like we were talking about the export API a minute ago. All we did was create a sort of middleman batch process class that goes in between the base exporter class and the batch processes that extend the exporting API, because we had certain things we needed to implement that were batch-process specific. But taking this approach of implementing multiple interfaces instead of having to extend a new base class means that we can be incredibly flexible, and we can build batch processes for almost anything.

Obviously that got pretty technical, but the main point here is that we’re still enforcing a contract of saying you need these methods to implement a batch process. We’re just doing it with interfaces instead of an abstract or a base class.
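
[Editor’s note: a minimal sketch of the contract-via-interfaces approach, with hypothetical names. The interfaces define what every batch process must provide, while each class stays free to extend its own API’s existing base class.]

```php
<?php
// The contract lives in interfaces, not in a base class.
interface Batch_Process {
    public function process_step();
}

interface Batch_Process_With_Pre_Fetch extends Batch_Process {
    public function pre_fetch();
}

// Stand-in for an existing API's base class (simplified for the sketch).
class Base_Exporter {
    public function export_row( $row ) { /* shared export logic */ }
}

// The exporter stays in the export API's hierarchy AND honors the
// batch contract; no separate batch base class bubble required.
class Batch_Affiliate_Export extends Base_Exporter implements Batch_Process_With_Pre_Fetch {
    public function pre_fetch() { /* expensive one-time query */ }
    public function process_step() { return 'done'; /* export one batch per step */ }
}
```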

BRAD: It’s interesting to me, this discussion about interfaces, because I don’t think I’ve ever used an interface personally. That’s almost a shameful thing for a developer to say. My gut is telling me that there are probably a lot of developers out there in the same boat, right? Maybe they just haven’t had a case where it made sense to use them, or maybe they just don’t know enough about them to use them effectively.

PIPPIN: I had used them one time before this, and I didn’t really understand what I was using them for.

BRAD: Right. Yeah.

PIPPIN: For anybody that wants, I’m going to include some links to files inside of Affiliate WP that show what Drew was just talking about, so you can go and take a look at some of those classes that extend our base classes and implement the interfaces, if it’s a little easier for you to follow along. Take a look at the show notes and we’ll have links there.

DREW: I think another example I can give is that we already had exporters for things, right, and the exporters followed a certain pattern because they extended the export base class that was already there. By essentially creating new interfaces that can lay over the top of that, in some ways you end up writing a little bit more initial code. But in the end, you write the same amount you would be writing if you were just extending a base class.

The preparation is a little bit more because you have to think about, okay, I need to enforce the contract in this interface over here, and then maybe I need to do some basic functionality in this middleman class over here. But when you’re actually writing your batch process and extending that API, you’re writing the same amount of code in the end. It’s just that by enforcing the contract with interfaces instead of a base class or abstract class, again, you get the full flexibility of being able to extend almost any API and create a batch process on top of it.

PIPPIN: Since you can extend it to work with just about any API, and we mentioned earlier that this batch processor was built to be flexible and usable for almost anything, what are some examples of possible applications that it might be used for in the future?

DREW: Do you want me to give you an example of what we built them for?

PIPPIN: It could be either what we built or just things that you could imagine it being used for, whether we are using them or not.

DREW: Well, obviously I’m more familiar with our code base. Let me just give you an example really quick about things that we’ve built, just to give you kind of a broad idea of how flexible this can be.

We built a batch exporter, which covers anything from exporting affiliates, referrals, visits, and payouts, which actually generates a payout. Separately, we have a payout generation tool, which is a batch processor that falls outside of the export API. Then we have importers. We can import and export settings, which use batch processors. And we can migrate users from other plugins into Affiliate WP, like when somebody is migrating from another affiliate plugin to ours.

As I mentioned, we did a database upgrade that basically just extended our notices API. There are tons of migration tools and things, but we also have the recount tool that I was telling you about. We can use that to recount affiliate stats. That doesn’t extend any existing API. It’s just on its own; it’s standalone. Basically, anything you can think of where you would have to process a lot of data, and it would be intensive to do it, you could use a batch processor for.

I think we’ve had discussions about other ideas. I can’t think of any off the top of my head, but other ideas in Affiliate WP for things that we could use a batch processor for.

PIPPIN: Here’s one, and this is something that we haven’t actually done yet, but it’s something that we are looking at doing in EDD as well as Affiliate WP and possibly other places: using a batch processing API to do on-the-fly report generation. The downside of e-commerce and related data is that you have tons of data to work with. For example, if you’re running a store that has even a few dozen transactions a day, you’re looking at thousands and thousands of transactions by year’s end.

At some point you want to be able to show store owners what that data looks like all together. What were your total earnings? Maybe what was your average refund rate or your average transaction value, your customer values, a lot of different things, some of which you can do through static values that are set whenever a transaction is processed. Some of them you have to do with on the fly generation or calculations. Eventually, as your data sets get bigger and bigger and bigger, on the fly generation is just not even feasible.

DREW: So like preloading data, basically.

PIPPIN: Yeah, preloading data and basically being able to run, say, a calculation on a million transactions that then gives you various numbers for your store reports. Whether it’s your average transaction value or your refund rates or your average daily transactions or whatever your numbers are, you can start working with huge data sets if you can do it in batches and don’t have to try to do it all in a single request. We’ve looked at extending our batch processing APIs in Affiliate WP and EDD and other places to be able to provide our own customers with much, much better reporting values.

DREW: Right. Brad, I think one other area that you could probably visualize: I could see a next logical step being that we build a jobs queue, right? You could initiate a batch process in a jobs queue and then walk away or navigate away, and it would just be doing its thing while you’re doing other things.

BRAD: Right, so basically background processing.

DREW: Right.

BRAD: Doing batch processing in the background. Yeah. Yeah, absolutely. I was going to ask you about that. Is that something you guys are already doing or something you’re planning on doing in Affiliate WP?

DREW: I can see value in it, but it’s not on our roadmap as far as I know.

BRAD: When a batch is processing and it fails, how is that handled?

DREW: If it errors out in the middle or something? It would show a notice that said something went wrong, try again. Or, in some cases, be very specific about what went wrong. But typically it’s pretty smooth. I don’t think I’ve really seen cases where something would error out unless they were working with corrupted data or something where something expected wasn’t there. I think typically we code defensively enough that even if something was wrong, it wouldn’t necessarily error out right away.

BRAD: Right. What about timeouts, though? That’s a pretty common thing? Maybe each individual task is taking longer than you guys kind of anticipated and then maybe a batch ends up timing out, like NGINX 502s or something. What happens then? Does it come back or what happens?

DREW: Currently what happens is it spins and spins and spins. I mean it’s a terrible thing to say, but it literally just breaks. I mean when it times out and there’s nothing coming back from the client side, then it kind of just sits there and spins….

PIPPIN: But, you know, that’s really part of the defensive coding. Yes, maybe we don’t have something that handles that kind of failure smoothly, but you anticipate those kinds of failures and actively prevent them. In our case, for example, we do everything in small enough data sets to try to ensure that we never have timeouts like that, because we’re never doing large data sets per batch.

If the data sets that we’re processing fail due to a timeout, it’s probably because there’s something bigger wrong on the site, because the data sets are so small per batch. Honestly, with a good batch processing API, you don’t care how small your batches are, because it’s not going to add much time to the total it takes to run the entire process. You care more about it being reliable, and so you make your batches smaller and smaller to help prevent ever having that problem to begin with.

DREW: Right. I feel like, along those lines, there are a couple things that I could see coming up, like, well, what if I wanted to change the number of items processed per batch or whatever? I think right now we have a pretty good idea of what a good median number is for whatever data set we’re working on. Sometimes it’s one affiliate at a time, and sometimes it’s 100 at a time or whatever. It depends on how intensive the process is. But I think–

BRAD: Do you allow customization of those things through filters, like tweaking the number?

DREW: I don’t think so. I think currently it doesn’t have anything. It’s pretty much hard coded in the batch processes.

BRAD: That’s something we’ve ended up doing: allowing people to filter the size of things so that if they are running on some server from 2003 or something, they can dial those things down and just make it work.

DREW: Actually, I think that’s not true. We do have a filter for every batch process where you can modify the number of items to run per step, and it’s just like a smart filter where it checks to see if there’s a callback hooked to the filter. If there is, then it applies the filter. Otherwise it doesn’t.
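
[Editor’s note: a sketch of the “smart filter” idea with a hypothetical filter name. has_filter() and apply_filters() are real WordPress functions.]

```php
<?php
// The hard-coded default only changes when somebody has actually
// hooked a callback to the filter.
$batch_id = 'recount-earnings'; // hypothetical batch ID
$per_step = 50;                 // hard-coded default for this batch process

if ( has_filter( 'batch_per_step' ) ) { // hypothetical filter name
    $per_step = (int) apply_filters( 'batch_per_step', $per_step, $batch_id );
}
```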

PIPPIN: You can also — I mean, in a lot of the batch processors, at least within our own system, you could filter some of the standard queries. We typically use our own internal APIs for pulling data from the database, and so any of those filters that are present, you could use. Now, in some cases it might cause issues with our calculations on steps, but theoretically it could still work.

DREW: Well, one thing I wanted to mention, too: we were talking about how we’re using this adjacent, temporary data storage API. Similarly, we also have our now-global logging API that we use. We have debug log functionality in Affiliate WP, and we could actually log batch processes, which I think, Brad, you asked about before.

I don’t know that we have any batch processes that log anything to the debug log, but it would be trivial to implement if we wanted to, to log the number of things per step or whatever, or errors, if any errors popped up. There are a few places where we’ll bail from a batch process if there’s an error. That would be a good place to log an error in the debug log and then silently bail, which would just complete the process. It doesn’t necessarily show an error. I don’t know that there’s anywhere we’re currently doing it, but it would be super trivial to implement if we wanted to.

BRAD: Cool. Yeah. I think for the 502 thing you guys could probably mitigate against that, right? On the client side, if the AJAX request status comes back as a 502, you could just catch that and then show an error that says, “Oh, it timed out.”

DREW: Right.

BRAD: “Something is going on.”

DREW: Right, because we’re looking for certain cues within what comes back from the AJAX call. If they aren’t there, then obviously we can detect other things. Part of that too, and this may be a little bit old school for you, but when we first started working on the batch processor, one of the first things I did was write WP-CLI commands for generating core objects. We have basically five core objects: affiliates, creatives, payouts, referrals, and visits. I created WP-CLI commands for generating those. Where I’m going with this, the old-schoolness of it, is that I’m building these batch processes and I have like 250,000 affiliates. I’m deliberately inflating my numbers by a lot. Then I’m just throwing these massive data sets at these batch processes to make sure they’re not going to break.
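
[Editor’s note: a sketch of the WP-CLI seeding approach Drew mentions, with a hypothetical command name and factory helper; WP_CLI::add_command() and WP_CLI::success() are the real WP-CLI APIs.]

```php
<?php
// Generate enormous numbers of test objects so batch processes can be
// exercised against worst-case data sets.
if ( defined( 'WP_CLI' ) && WP_CLI ) {
    WP_CLI::add_command( 'affwp generate affiliates', function ( $args, $assoc_args ) {
        $count = isset( $assoc_args['count'] ) ? absint( $assoc_args['count'] ) : 100;

        for ( $i = 0; $i < $count; $i++ ) {
            create_test_affiliate(); // hypothetical factory helper
        }

        WP_CLI::success( sprintf( 'Generated %d affiliates.', $count ) );
    } );
}
// Usage: wp affwp generate affiliates --count=250000
```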

BRAD: Yeah. No, man. You have to.

DREW: You almost have to.

BRAD: You have to do that because, I mean, what’s the point of building this batch processing system to account for those kinds of data sets if you’re not even testing against them?

DREW: Right, and the number of customers actually running sites with those huge numbers of objects may not be that nutty, but I mean we have to make sure that it can handle that. That’s the whole point.

BRAD: Yeah. Well, those customers tend to be the big ones, and do you really want it to not work?

PIPPIN: You want it to work.

BRAD: Probably. Probably not. Yeah.

PIPPIN: Okay, Drew.

DREW: Yeah.

PIPPIN: I think we’ve got just one more question for you on batch processing before we wrap up here. Having now built a pretty kick-ass batch processor for Affiliate WP, seeing what we had built before that, maybe what we did wrong and where we didn’t go far enough, seeing what we’ve built in our other projects like EDD, seeing what WordPress Core has done, et cetera, and being all-around pretty familiar with batch processors, what’s maybe a little bit of advice that you might give to somebody who is building their first batch processor into their own plugin?

DREW: I think there’s value in having tried and failed with another system first, right? A batch processor seems like a great idea, but it can feel like hammering in a nail with an anvil if you’re not careful. Let’s say you have something that you think would be a good fit for a batch processor. I would try it first as, like, a static processor or a step processor, just so you can figure out where the kinks are going to be, because once you get into building a batch processor, you want to create something generic enough that you can use it for more than one thing. That’s not to say you couldn’t do what EDD did and build a batch processor specifically for importing or specifically for exporting.

Probably the best advice I could give is to try and fail first with something other than a batch processor. That sounds like terrible advice, but it’s important to have perspective, I think, on how you expect it to work. If you try to dive right into building a batch processor without having a demonstrated need, then it may be a little more than you need.

BRAD: That’s a great bit of advice: when you’re starting out building something new like that, focus on a very specific problem instead of trying to build this abstract, generalized thing, because I find the complexity is just so much greater when you try to account for every possible scenario versus just catering to fixing this one problem. Then, after you get that fixed and nailed down, make it general and make it apply to other–

DREW: Right, and there’s absolutely nothing wrong with building a specialized API and then making it abstract later because, if you build it special now, then you can write tests for the special case now. If you find that you want to reuse that code later, then great. But if you start abstract, you skip a lot of steps, a lot of learning steps, about how you need it to work, and it ends up, in some cases, I think, becoming kind of a big, jumbled mess.

PIPPIN: Drew, I think a lot of the reason why you were able to build the abstract API is because you had a lot of that backbone to work from already, where there were those unique or very specific problems that had been solved and then readapted, which showed all of the different use cases. I couldn’t agree more.

DREW: Brad, a good corollary to that: have you ever seen a brand new product website that has an FAQ on it, and you just think to yourself, “Who is asking these questions?” It’s like that. I mean, that’s kind of how I equate it: if it’s a brand new product, how can it have frequently asked questions?

BRAD: Yeah.

DREW: It’s a similar problem: build the specific thing you need now, and if you need more of them later, then maybe abstract it out, and your learning process will dictate what you need.

BRAD: I couldn’t agree more.

PIPPIN: That’s great.

BRAD: Cool.

PIPPIN: Thanks for joining us, Drew. It’s been a pleasure.

DREW: Absolutely. Thanks for having me.

PIPPIN: And thanks again to Pagely for sponsoring this episode. You can go check them out at Pagely.com. If you are interested in the business of hosting, in the business of WordPress or anything like that, or just an interested individual in general, definitely go check out their 2016 year in review post. It’s an excellent insight into some of the decisions that are made at Pagely and it is highly educational.

BRAD: Thanks, everybody.

PIPPIN: Cheers. Catch you next time.
