#139 – Andy Fragen on Automatic Update Rollbacks in WordPress

Transcription

[00:00:00] Nathan Wrigley: Welcome to the Jukebox podcast from WP Tavern. My name is Nathan Wrigley.

Jukebox is a podcast which is dedicated to all things WordPress. The people, the events, the plugins, the blocks, the themes, and in this case, automatic update rollbacks in WordPress, what they are, and why this was a difficult feature to build.

If you’d like to subscribe to the podcast, you can do that by searching for WP Tavern in your podcast player of choice. Or by going to wptavern.com/feed/podcast. And you can copy that URL into most podcast players.

If you have a topic that you’d like us to feature on the podcast, I’m keen to hear from you and hopefully get you, or your idea, featured on the show. Head to wptavern.com/contact/jukebox, and use the form there.

So on the podcast today, we have Andy Fragen.

Andy is a dedicated member of the WordPress community, as well as a trauma surgeon. Somehow he manages to balance the demands of his profession with his passion for the community and, as you will hear, to do important work inside of WordPress Core. Even while in the operating room waiting for patients to be prepped, Andy has been known to find moments to answer forum questions and provide support to others. It’s truly remarkable.

Andy talks about the important topic of automatic rollbacks in WordPress, a feature aimed at reverting to a previous version if an automatic plugin or theme update fails, ensuring the website remains functional for users. I’ve managed to encapsulate the idea into the previous sentence, but as you will hear, the execution of that idea was anything other than straightforward.

Andy discusses the origins of the rollback feature. The team working on this problem identified complexities and potential fatal errors during plugin updates, and came up with a simple, yet effective solution, which worked. But as with so much in code, some edge cases meant that the road to a fully workable solution for all WordPress users was not quite in sight. Many times the drawing board had to be dusted off, and the problem looked at once again.

While developing this feature, numerous challenges were encountered. From finding consistent test conditions to managing technical limitations. Andy shares insights into the critical role of testing and collaboration with hosting companies, meticulous attention to detail and problem solving skills, developed to combat issues like file write delays, and loop back test redirects.

Andy explains how the team managed to avoid fatal errors in active plugins with extensive testing and incremental improvements. They introduce functionalities like WP error checks, simulation features for testing error handling, and a new move directory function to enhance reliability.

Andy also discusses the broader impact of their rollback efforts. Many users might not notice this new feature, but in a sense that’s how it should be. The more unnoticeable to end-users update failures are, the better. It means that sites that would previously have been broken are now working, and that’s a win for everyone.

If you’re interested in the behind the scenes development of a WordPress feature, that quietly keeps your website running smoothly, and in hearing how a dedicated contributor balances his passion for WordPress with a demanding medical career, this episode is for you.

If you’d like to find out more, you can find all of the links in the show notes by heading to wptavern.com/podcast, where you’ll find all the other episodes as well.

And so without further delay, I bring you Andy Fragen.

I am joined on the podcast by Andy Fragen. Hello, Andy.

[00:04:21] Andy Fragen: Hello Nathan. How are you?

[00:04:22] Nathan Wrigley: Yeah, good. Nice to speak to you. We are in WordCamp US, we’re in Oregon. We’re in the convention center. I’ve actually no conception of what the convention center’s called.

[00:04:30] Andy Fragen: I think that’s it. I think it’s called the Oregon Convention Center.

[00:04:32] Nathan Wrigley: Normally when I come to these events, I’m talking to people who are doing a presentation of some kind, but we had a conversation, introduced, I think, via Courtney Robertson. She suggested that I might like to talk to you about this particular subject, and I bit her hand off frankly, because this is really interesting.

We’re going to talk about automatic rollbacks in WordPress Core. And if that doesn’t mean anything to you, that’s fine, Andy will introduce the subject a bit later. First of all though, Andy, it’s a bit of a boring, dull question, but can you give us your little bio, and then I think, unusually, I’m going to dwell on your bio for quite a long time, if that’s all right? Tell us what you do in all of your life.

[00:05:08] Andy Fragen: I am a full-time trauma acute care surgeon, and I work in Southern California. And I dabble in WordPress as a hobby.

[00:05:18] Nathan Wrigley: You are also a Core committer.

[00:05:19] Andy Fragen: Yes, sir. I got my first props for a Core commit about nine years ago.

[00:05:23] Nathan Wrigley: Okay. So, to me, Core commit, if you have privileges to commit to Core.

[00:05:29] Andy Fragen: Well, I’m not a committer, I’m a Core contributor.

[00:05:31] Nathan Wrigley: A core contributor.

[00:05:33] Andy Fragen: If they ever offered me commit access, I would decline.

[00:05:36] Nathan Wrigley: Yes, okay. Nevertheless, you are working at a very high level within WordPress, and this is the kind of thing that I think, this is what people do as a career. You know, they become incredibly skilled at coding, and in this case with WordPress, and this is the level maybe that they reach.

But let’s just rewind a little bit, because at the beginning of your bio you said you are also a trauma surgeon, which to me seems much more than a regular job. It seems like that would be an incredibly difficult thing to achieve in life, but also a thing which would consume many, many hours. And so I’m kind of wondering, how do you keep up in life with being a Core contributor as well as a trauma surgeon?

[00:06:13] Andy Fragen: As I said previously, time is the only fungible commodity we have. You spend it how you want to spend it. Obviously, depending on how you work and what the structure is, you have certain time availability, and then certain time not availability.

Some of what I have as a trauma surgeon is downtime. There are certain parts of my day that are busier than others, and there are certain parts of my day that aren’t as busy. I have been known to answer forum questions, and do support while in the operating room, waiting for a patient to be prepped.

[00:06:46] Nathan Wrigley: So, let me just parse that a little bit. Whilst in the hospital, when there are not urgent things to be done, you are whipping out the laptop and contributing to WordPress.

[00:06:56] Andy Fragen: No, I usually do that on the phone, believe it or not.

[00:06:58] Nathan Wrigley: I think that’s remarkable. Honestly, I am really, really amazed that you can manage that. I think you must have an incredible grasp of your own time management.

[00:07:07] Andy Fragen: Honestly, a lot of the support stuff is just having been in my own plugins that I’ve written for a long time. I have a better understanding of what the issue is almost without even getting too much, you know, stack traces and things like that. Not that I don’t need them sometimes, but if you’ve ever done support and gotten issues from people, sometimes it takes a while to tease out the actual information you need to give a response.

[00:07:31] Nathan Wrigley: Do you regard WordPress as a hobby then, or is it more like a second career almost?

[00:07:36] Andy Fragen: Career would imply that you earn something from it. And, yes, I do sell a plugin, and part of the reason my wife allowed me to fly up to Portland for a WordCamp is I could say that the plugin sales paid for an airline ticket and a hotel. The usual deal I had with my wife living in Southern California was I could go to any WordCamp I could drive to. Since Covid they’ve really just stopped. Everybody’s sort of burned out, and that’s a whole other issue unfortunately.

[00:08:01] Nathan Wrigley: Yeah. How much time do you think, let’s say a month, would you be putting into WordPress?

[00:08:08] Andy Fragen: Some of that depends on where you consider your time. If I’m kind of paying attention to the Core dev Slack when they’re doing meetings, do you count that towards time? I mean, some people do, some people don’t. So I have said that I have usually put anywhere between 10 to 15 hours a week into it, depending on where it was. I mean, for a while I was running the Core upgrade install meeting. Sometimes we get, there’s not much for us to talk about, we’ve foregone the meeting at times.

Yes, somehow I’ve found myself as a component maintainer for the upgrade install component, along with several other people.

Over the years I’ve done, or found my way into, I should say, several different things. And one of the things I did when we met Courtney last WordCamp US in San Diego, she was having trouble getting set up in a dev environment on her computer. And I’d already sort of figured it out, and sat down and helped her with it and, yes, by the end of the time period, we had Courtney up and running in the docker dev environment for WordPress Core.

[00:09:08] Nathan Wrigley: So 10 to 15 hours a week, 40 to 60 hours a month, that’s a lot.

[00:09:14] Andy Fragen: Yes, and you have no idea what my normal schedule is like. Even if I didn’t do that, as a trauma acute care surgeon, we work shift work, and our shifts are 12 hours long. And an FTE, full-time equivalent, for being considered full-time is 16 shifts a month. Whether that’s two shifts in a row in 24 hours, or not, or individually, it just depends. I would’ve considered at some point in time, in the last several years, only doing 16 shifts quite nice. I’ve done as many as 30 in a month, and probably in the last several years, averaged somewhere around 20, give or take.

[00:09:54] Nathan Wrigley: I think what I’m taking from this, Andy, is that you work quite hard in all things that you do.

[00:09:59] Andy Fragen: I have two adult children, 29 and 23, and I did not encourage them in the least to go into medicine. I think they saw how hard I worked, or how much I worked, and I’m very pleased to say that neither one of them went into healthcare. I think it’s just a changing field.

[00:10:16] Nathan Wrigley: Yeah. Where did your interest in WordPress, or coding come from? Is this something that you’ve had as a child, or did you pick this up later in life?

[00:10:23] Andy Fragen: I took a coding class in college, not my major. But I graduated from college in 1985 as a senior, the first time they had an intro to microcomputers, and there was this new computer out called a Macintosh. And it looked fascinating. And the class had a lab, you could either choose to be in the Macintosh lab or the PC lab. And honestly it was a freshman level class, I took the class just so I could do the lab and learned how to use the computer. And on graduation I bought one, and I’ve been using Macintosh computers since 1985.

[00:10:57] Nathan Wrigley: So it is not really from childhood this, I guess maybe, I don’t know exactly how old you are, but certainly at my age, computers were a brand new thing. You couldn’t really do a great deal with computers back in the day. But I picked up the interest when I was probably a similar age to you, and it just kind of blossomed in me, and I just got really into it. And although it’s become what I do for a living, I can well imagine that if I hadn’t have done it for a living, it would’ve been a big part of my life anyway, a little bit like you by the sounds of it.

[00:11:23] Andy Fragen: I look at it as an interesting hobby. It’s a lot about problem solving, and it’s a lot about making a repetitive process simple, defined and consistent.

[00:11:34] Nathan Wrigley: Yeah. We’re going to talk now about something which is new in WordPress, and it’s automatic rollbacks. I don’t know if there’s a more grandiose title than that.

[00:11:44] Andy Fragen: We call it, I mean, internally we call it rollback auto updates, that’s just part of it though. Initially we just called it rollback.

It’s taken several years, and many cycles to kind of get into Core. And part of the issue was, when we first had it, a lot of Core committers don’t want to touch something, or don’t want to commit something that touches such a large piece of WordPress, such as plugin updates, or theme updates. And as you can imagine, everybody had a lot of trepidation. And it was a big enough project that somebody who wasn’t directly involved in it, if they left and weren’t paying attention for several months, they’d come back and whole things were new again.

[00:12:24] Nathan Wrigley: Right. Let’s describe what it is. So this is my understanding of it. A couple of years ago it feels like now, we got this option in the plugin screen to automatically update plugins. And it may be that almost everything in your WP admin, plugin wise, you can automatically update. So you don’t check a box, but in effect you check a box.

And from that moment on, that plugin, or all the plugins, if you decide to go that route, will automatically update in the background, so you don’t have to think. I guess it’s a part of a broader initiative to make WordPress as automatic as possible, so that you don’t have to log in if you’re an inexperienced user, or you’re just not really bothered about WordPress, you just want a website. And it’ll just all tick over in the background and keep itself updated.

Now that’s great, except when there’s a problem with the plugin update, and the updating automatically breaks something. You know, you come back and either you’ve got an email to say that things have gone wrong, or you just go to your website and discover, why does my website now not function anymore?

The initiative here was to make it so that some automatic detection mechanism would update the plugin, throw an error, say, there’s a problem, wait, roll it back, so go back to the previous safe version, and go from there. And so everything in theory should be good. Now that all sounds so straightforward to say, but just before we hit record, and let’s get into this really granularly, you indicated to me that it was anything other than straightforward. Now, why was that?

[00:13:54] Andy Fragen: So Core has had the ability for a failed update to revert, try again, I’m not quite sure of the process, since about 3.5 or 3.7. And the ability to roll back a failed update for a plugin, or a theme really just wasn’t around. Colin Stewart and I, and Paul Biron had kind of gotten together on Trac tickets about rolling back, and I think it was Paul’s initial Trac ticket about it and, no, I don’t remember the number offhand, that really got us started. And it was intriguing to all of us on how to do it, and where to go.

And so we get together in private DMs, and discuss it, and kind of try and figure it out. We would overcome problems and basically create the solution, and I’d put it in a plugin so that we could test. And what we found was, well, once we got to parts where we had the whole solution, and we had a whole solution for quite a while, including the auto updates before we even had any of it committed.

But again, as I say, it was so complex and, you know, when you have millions of sites, and 45% of the web, and all of them are doing updates to plugins and themes at any given time, it’s like touching the third rail. I mean, sometimes you can kind of get away with it, but more often than not, you could potentially kill yourself.

So we never had a Core committer deeply involved in it. We had some that came and would give us advice on how to do things, or not do things, and we would seek to solve those issues and those problems. But without having someone with commit access, it made it very difficult to actually get things in. Colin now has commit access, and he got it just before the rollback was complete. But again, he’s like, I don’t feel comfortable, as my very first commit, committing this big part of Core.

After we’d had it done for a while, there was a discussion on the trac ticket about what it would take to test it. And one of the Core committers suggested that we test it on thousands of sites, of various different server setups, to ensure that it didn’t cause issues or problems.

Now, conceptually I do understand that, because WordPress can run on a vast variety of installations, and hardware, and virtual environments, and all sorts of things. You don’t want to screw it up. You don’t want to brick somebody’s site consistently.

The problem is, I don’t think anybody’s ever been asked to do something so extensive before. And so we basically found that it was almost impossible. I was at the first WordCamp in San Diego, the first one after Covid, and I literally went around to all the hosts and said, I need you to test this, I need you to test this, I need you to test this. And everyone was very agreeable, but they would test it on their environment, right? Make sure that it worked in their system, but probably only on one site that was a test site, and not in production on anywhere else. And I’m not quite sure that that’s what they were looking for.

We only kind of overcame that process when we broke it down in smaller pieces. So rollback started out with getting, let me go back. Currently, the way Core works, Core updates work, and the way plugin and theme updates work is, Core would download the package, the update, it would store it on your system locally. It would run some checks, it would re-expand the zip into the upgrade folder, and then it would do a recursive file copy into the location of the plugin directory, or the theme directory. Although it would also delete the existing files first, before it did that recursive copy.

Consequently, if something failed, your plugin was gone. You had to go find it again and reinstall it, if somewhere along that process failed. The first thing we did is try and, well, one of the first things I did was made it a zip file. So I back up the plugin or theme as a zip file, store that, do the update. If it failed, I’d reinstall the zip file. But we heard from several hosting companies, that may be taxing on resources creating, and zipping, and unzipping files at that scale.

And so I’m like, okay, let’s come up with something else. And what we ended up with is changing that recursive file copy to use PHP’s rename function, with the recursive file copy as a backup, as a fallback for it. Now, you would think that, since this runs on PHP, and everybody has PHP installed, that it would work out just fine, except VirtualBox and VVV. And so these are custom local dev environments.

We did put out a call to the hosting companies, does anyone offer a virtualised solution using VVV or VirtualBox? And the answer was, no. We kept getting failures using the PHP rename and VirtualBox. Partial file transfers, partial writes, and it took a lot of investigation to figure it out. Peter Wilson, one of the Core committers from Australia, uses VVV as his dev environment. And when you get a Core committer, and some basic part of this isn’t working, it’s not going in.

So we got the fallback. We did a lot of research, and it appears, from some Composer bug reports, that VirtualBox has sort of a delayed file write. Go figure. And they found that by adding a 200 millisecond delay to things, it solved the problem. We added a 200 millisecond delay to things, it sort of solved the problem. We later had some help from, I want to say it was Andrew Ozz, who’s clearly been around WordPress for a long time. And he suggested we try flushing the memory for, after we did the writes and stuff.

And we eventually got it so that it did work by doing that. We had a PR for essentially a new function, a move directory function, move_dir, that replaced the recursive file copy, copy_dir, in the update process. copy_dir is still there as a fallback, but we now had PHP’s rename function working in every environment that we had, and consistently, even with VirtualBox and its limitations, or its peculiarities.
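The rename-first, copy-as-fallback idea Andy describes can be sketched roughly as follows. This is an illustration in Python, not the PHP code that went into WordPress Core; the function name `move_directory`, its return values, and the injectable `rename` parameter are assumptions made for the example.

```python
import os
import shutil
import tempfile
from typing import Callable


def move_directory(src: str, dst: str,
                   rename: Callable[[str, str], None] = os.rename) -> str:
    """Move src to dst; return which strategy succeeded ("rename" or "copy")."""
    if os.path.exists(dst):
        raise FileExistsError(f"destination already exists: {dst}")
    try:
        # Fast path: a single rename moves the whole directory tree at once.
        rename(src, dst)
        return "rename"
    except OSError:
        # Fallback: copy every file recursively, then remove the source.
        shutil.copytree(src, dst)
        shutil.rmtree(src)
        return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "upgrade", "my-plugin")
        dst = os.path.join(root, "plugins", "my-plugin")
        os.makedirs(src)
        os.makedirs(os.path.dirname(dst))
        with open(os.path.join(src, "plugin.php"), "w") as fh:
            fh.write("<?php // version 2")
        print(move_directory(src, dst))
```

The sketch leaves out the delay and cache-flushing workarounds mentioned for VirtualBox; it only shows why a rename is cheap (one operation for the whole tree) and why the recursive copy is kept around for environments where the rename fails.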

And that was the first big step we had towards getting committed. And I want to say that was committed in 6.2. So if you think back, all the stuff we had, aside from that little piece with the VirtualBox, was all done before 6.2. And the final piece just got committed in 6.6. So when I say this was years in the making, it really was years in the making.

The second part we did was just on manual updates. So we had, when you click a manual update, several things can happen during the course of the update process. Somewhere along the way the file copy, or the download can fail, the file copy can fail, certain pieces can be missing. So we created enough checks during that process and returned errors, such that, if any of those pieces failed, how we started with is storing the file to be updated in a temporary update folder.

And so if any of those processes failed, we would go and restore that folder using our moving function. The process was actually very quick in testing, on the order of, you know, less than a second because PHP rename just sticks the whole chunk and moves it. So you have to remember the first part of the process on the update is, you delete the plugin in the folder that it’s in, and so as you’re downloading and installing the other one, you’ve got nothing left.

The normal update process, you’ve deactivated the plugin, you reinstall it, and then it reactivates. That’s in a manual process. Remember when we get to the automatic updates how that’s different, and where it can be problematic. So we had restore functions, we had our delete functions, because after we had a couple of cron tasks that we created to clean up after ourselves on a weekly basis, so that those copies wouldn’t be around anymore. Even on a shutdown function, we would delete the folders, and then on a weekly basis we’d make sure they were all gone.

So once we got that working and committed, we had basically a safe method for manual updates. So if someone would go to the dashboard, they’d click update, if somewhere along the way their connection lagged, or their connection to wordpress.org was delayed, and things took too long, and the server timed out, it would reinstall the previous version. And you would still show that the plugin or theme required an update, because you’ve now reinstalled the previous version that was installed.

We didn’t have to go out to .org and re-download anything. Part of that is really what limited the resources that we were requiring from servers. Because all we were requiring them to do was copy a directory back and forth to different locations.

[00:22:59] Nathan Wrigley: What’s the flag for success or failure in this case? What’s the thing which determines that the update hasn’t succeeded?

[00:23:06] Andy Fragen: It depends on different parts of the process. So if there’s a failure in downloading the package, there are several parts that we’re already checking, where it would succeed or fail. We either created new WP errors for those, and returned those values, and so when we looked for those returns, if they happened, we would send it to the functions we wrote to restore from the backup or not.

So we’d start the process by creating the backup. And then if the process continued on without error, the backup just got deleted at the end. If there was an error in the process, the backup was restored. You’d basically be right back to the place you started from, requiring an update to either the plugin or a theme.
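The backup, update, then restore-or-delete flow described here can be sketched as below. This is a Python illustration under stated assumptions, not WordPress Core's implementation: `UpdateError`, `update_with_rollback`, and the `install` callable are hypothetical names standing in for the download and install steps and the errors they return.

```python
import shutil
from pathlib import Path
from typing import Callable


class UpdateError(Exception):
    """Raised by the (hypothetical) install step when any part of an update fails."""


def update_with_rollback(plugin_dir: Path, backup_root: Path,
                         install: Callable[[Path], None]) -> bool:
    """Return True if the update succeeded, False if the old version was restored."""
    backup_root.mkdir(parents=True, exist_ok=True)
    backup = backup_root / plugin_dir.name
    if backup.exists():
        shutil.rmtree(backup)  # clear a stale backup from an earlier run

    # Move the installed version aside instead of deleting it outright.
    shutil.move(str(plugin_dir), str(backup))
    try:
        install(plugin_dir)
    except UpdateError:
        # Discard any partial install, then put the previous version back.
        shutil.rmtree(plugin_dir, ignore_errors=True)
        shutil.move(str(backup), str(plugin_dir))
        return False

    shutil.rmtree(backup)  # success: the backup is no longer needed
    return True
```

After a failed run the directory holds exactly what it held before, which is why the site would still report that the plugin needs an update.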

[00:23:44] Nathan Wrigley: Yeah. Does the user in this scenario of the website, do they get some sort of notification that, actually, this didn’t work out? We’ve rolled back, your site is now working, but the thing that you intended to do didn’t happen. And if they do, does it list out what the problem was? Like it was, I don’t know, a download failure, or the plugin seems to be malfunctioning.

[00:24:03] Andy Fragen: You would see an error message, either in, if you were using the shiny updates on the plugins page, or the themes page, you’d receive the error message, and kind of a brief description of where the error was. Either the directory wasn’t writeable, or there was a download problem, or whatever the error happened to be. And then you’d be back at the place, if you refreshed your screen, it would show that you still needed the update.

If you did the update from the Core update page, you know how the update shows, here’s the zip I’m downloading, it’s doing this, you would see an update error there. And part of what we put in the test plugin, or the feature plugin, was a way to simulate an error. And so we simulated a download error. And so you’d see that, and it would say it’s simulated error.

[00:24:47] Nathan Wrigley: If I was to receive this error, let’s say that I’d put a plugin onto automatic updates, and I had received a failure, does your setup then say, okay, permanently stop automatically updating this, or is it more a process of wait for a day, or is there a setting that I can engage to say, try again arbitrarily later, just keep trying the update, or is it a case of, okay, we’ve identified there’s a problem here, now it’s up to you to go and check and fix?

[00:25:15] Andy Fragen: Well, at this point it’s just a manual update, right? So if the manual update fails, you refresh the page, it’ll say, update now again. And you just try again. Our assumption at that point is, if you continue to get an error because your upgrade path isn’t writeable, well, it’ll tell you that, and you probably have to check your permissions on the folder, and make sure that they are writeable. If there’s some other issue involved in downloading the file, the usual assumption is that will go away. And so you try updating again, and maybe the next time you update it works. You’re done.

[00:25:47] Nathan Wrigley: I guess one of the goals of automatically updating plugins was that you could have a site that you essentially could have an autopilot. And it could be, let’s say a brochure site, where the intention really is just to have a site and never look at the backend ever again.

[00:26:01] Andy Fragen: We haven’t gotten to that part. That was part one and two. The third part was the auto updates. Now, this does not apply to themes, and I say that doesn’t apply to themes because themes are kind of a different beast. And I guess we probably could figure that out, and maybe that’s the next, on the part four, which we haven’t really defined yet. But there’s certainly a lot more plugin updates than theme updates.

The difference between an auto update and a manual update is the auto update runs in a cron task. The auto update does not deactivate the plugin, update the plugin, and then reactivate the plugin. The plugin is active the entire time, which means if the plugin has an error in it, and the update completes successfully, you’ll have a white screen. And this has happened.

What we needed to do was figure out how to catch the error, essentially. And I want to say, the first iteration, there are basically three types of error handlers in PHP. I was checking every single one of them to make sure that we would catch the error. Now the problem with doing that is the shutdown error handler catches everything. So sometimes you catch the error twice, and that’s okay, but it was trying to figure out how to make sense out of some of those errors that was difficult.

And so we had a list of certain types of errors sometimes, and we would cause a rollback if we spotted any of those things. And it worked. For all the errors that we figured out, it worked. That’s not what we went with though, because what they don’t tell you is, when you write these things and you have feature plugins, is that, the way they integrate into Core, when you’re trying to commit them, may be totally different. So integrating it into Core is like, okay, now I need to update this file in Core, and that file in Core. Some of those things we could not duplicate in a plugin.

One of the things that we ended up with was a change to the load.php file. There’s no way to mock it, there wasn’t. I couldn’t duplicate it, it loads too early, I couldn’t replace it. So we were just, okay, this could be the problem. So we tried to work around it a little bit, which is fine too. Colin did a little bit more digging and found out that, if you edit a plugin in the plugin editor on screen, it does a loop back check as it saves, to determine whether or not there’s a fatal in that plugin, and it won’t save it for you. And so we reutilise that loop back, as a test for whether the reactivation, or the installation of the plugin causes an issue or not.
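The loop back idea, where the site requests one of its own pages after an update and rolls back if the response looks like a fatal error, can be sketched as follows. This is a Python illustration only: the function names, the injected `fetch` and `rollback` callables, and the specific failure signals (a 5xx status or the text "Fatal error" in the body) are assumptions for the example, not the checks Core actually performs.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, response body)


def loopback_ok(fetch: Callable[[str], Response], site_url: str) -> bool:
    """Request the site's own URL and report whether it still appears healthy."""
    try:
        status, body = fetch(site_url)
    except OSError:
        # The request itself failed, so the site cannot be confirmed healthy.
        return False
    # Assumed failure signals: a server error status or a fatal error message.
    return status < 500 and "Fatal error" not in body


def finish_auto_update(fetch: Callable[[str], Response], site_url: str,
                       rollback: Callable[[], None]) -> str:
    """Keep the new plugin version if the loop back passes, otherwise roll back."""
    if loopback_ok(fetch, site_url):
        return "kept"
    rollback()
    return "rolled back"
```

Passing `fetch` in as a parameter keeps the sketch testable without a network; in practice it would be a real HTTP request from the server to itself.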

We found it worked well. It greatly simplified what we were doing. I don’t even want to get into the fact that I was using reflection, and reflected objects, and reflected methods, and all sorts of things to fix it in the plugin in the first place. It worked and, yes, there are places in Core that use the reflection classes, and reflection objects, and things like that, but not many. At least we knew we weren’t breaking new ground with that. But in adding that to Core, it made it a lot easier to do the loop back request.

Part of the issue in testing was we didn’t really have a file that would accurately fail every single time it was updated. I make an updater, it was easy to create a file that would fail on an update. So that’s all we had for a moment, until Aaron Jorbin piped up in one of the Slack meetings and said, here’s a plugin that’s been in the repo for a while, the title is, DO NOT USE THIS PLUGIN! And it is exactly for that. It’s set to fail on an update, it’s set to fatal on an update. And so now we had a plugin that other people could use more successfully, they didn’t have to go and install my updater or to find it.

Having both of them installed though did help things out because there’s all the scenarios you have to test for. What happens when you have a plugin that runs, and then a fatal, and then another plugin update that needs to happen, and maybe another fatal after that, or maybe two fatals in a row? All these things you don’t really think about, but you should test for, or you have to test for.

And it’s all manual testing. I had no idea how we could ever write end-to-end testing for it otherwise, without just doing it manually. I want to say, the first time we tested it for the update rollback, we picked 13 of the largest plugins. And when I say largest, some of the more complex plugins, as far as folders, and files, and size to run, and to see whether we get timeouts, whether they would complete successfully, and things like that.

We found what worked well, and that’s one of the places we found that VirtualBox error, because it would just time out. It wouldn’t complete it. We were making it fall back to the file copy originally, and nobody is going to have this list of files to recursively update at any one time, except us in testing, but it just wouldn’t work consistently. And so that was before we got the move directory function in and working for it, and then it worked.

[00:30:59] Nathan Wrigley: What I’m getting from this is you must have incredible patience. Years and years and years of trying different things, problems emerging that you couldn’t foresee, solutions that you tried to implement, and then discover, okay, that didn’t work, let’s try something new. I don’t know how good you are at not throwing things at the wall, but it feels like there was an opportunity here to throw things at the wall.

[00:31:19] Andy Fragen: You’re making an assumption Nathan, who says I don’t throw things at the wall.

[00:31:23] Nathan Wrigley: Was it a fairly, how to describe this? Did it surface things about the open source project, in terms of the way it’s done, that you wish were different?

[00:31:33] Andy Fragen: Certainly. You know, one of the things that certainly helped move us along was having a lead developer, in this case Andrew Ozz, take an interest, and help answer questions and move along the way. He helped us tremendously with the plugin dependencies feature as well.

Since none of us were committers, and it’s a big feature, it would be nice if the leadership assigned experienced Core committers to feature projects, assuming that most of the people involved in the development of those projects aren’t Core committers. Another reason I should never have commit privileges.

[00:32:11] Nathan Wrigley: Yeah, it sounds like from everything that you’ve said, that the process could have been expedited in a variety of different ways.

[00:32:17] Andy Fragen: It certainly could have. And some of that made for a lot of frustration because we would have experienced Core committers who would look at the project every couple of months maybe. It was a huge undertaking in total. And it was really only by splitting up into smaller pieces that we were able to accomplish it at all.

[00:32:36] Nathan Wrigley: Yeah, and also the dramatically impactful nature of what you were doing. And I’m sure the irony’s not lost on anybody, that you could have achieved fatal errors in the attempt to remove fatal errors. Just the idea that such an important thing, from your perspective, maybe didn’t get the, how to describe it, didn’t get the.

[00:32:55] Andy Fragen: Attention.

[00:32:55] Nathan Wrigley: Yeah, the attention, the sort of level of importance that it might have done given the impactful nature of it.

That now is in the past though, this is now a feature of WordPress. How much of a hand on heart moment, and I don’t know if that’s the right phrase, how difficult was it for you on the day that the version of WordPress that it shipped with came out? Were you fairly confident at that point that all the things were going to be fine?

[00:33:18] Andy Fragen: I’ve been running the plugin, which has had a version of the rollback code in it on sites for years, with plugins that would fatal on update. Now, the auto update will only check for the fatal update if the plugin is active, right? Because if it’s not active and it updates, all that’ll happen is, when you go to activate it, it won’t let you activate it because it’ll say it fataled. And so you’ll have to go and reinstall another version or something, but it won’t take your site down. We specifically don’t test for plugins that aren’t active. Honestly, we’ve been running the code for so long on our own stuff, I wasn’t worried. I mean, have you heard of anyone having a problem?

[00:33:57] Nathan Wrigley: No, and that’s pretty remarkable. Have you?

[00:34:01] Andy Fragen: No. Well, yes, yesterday.

[00:34:04] Nathan Wrigley: Deliberately?

[00:34:04] Andy Fragen: Well, it was something we found that we hadn’t honestly considered.

[00:34:08] Nathan Wrigley: But just one individual, so far, that you know of.

[00:34:10] Andy Fragen: So far. And I think I actually know a way to mitigate it. Apparently, if you have a redirect to your homepage, the loopback doesn’t work. So what happens is you don’t see a failure, even if there is a failure. And so because you don’t see the failure, because you’re now no longer looking at your actual homepage, you’re looking at a redirect, it might work just fine. And so you don’t revert, and your site might, you know, when you go somewhere else in the site, it might fatal then.
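
The redirect problem described here can be illustrated with a toy model. This is not WordPress code: the `SITE` table, its paths, and the `loopback_is_healthy` helper are all hypothetical, and the sketch assumes the health check simply requests the homepage, follows any redirect, and trusts the final status code.

```python
# Toy model of a loopback health check; every name and path is hypothetical.
# Each path maps to (status_code, redirect_target_or_None).
SITE = {
    "/": (302, "/landing"),   # homepage redirects elsewhere
    "/landing": (200, None),  # the redirect target renders fine
    "/shop": (500, None),     # the updated plugin fatals only here
}


def loopback_is_healthy(site, path="/", max_redirects=5):
    """Request `path`, follow redirects, and report whether the final status is 200."""
    for _ in range(max_redirects + 1):
        status, target = site[path]
        if target is None:
            return status == 200
        path = target
    return False  # too many redirects: treat as unhealthy


# The check only ever sees the redirect target, so it reports success...
homepage_ok = loopback_is_healthy(SITE)
# ...even though another page on the same site is returning a fatal error.
shop_ok = loopback_is_healthy(SITE, "/shop")
```

In this model the check passes on the homepage, so no revert would be triggered, while `/shop` is still broken. That is the false negative described above.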

[00:34:40] Nathan Wrigley: Given that WordPress occupies 43% of the web, and that this endeavor of yours, and colleagues working on it, is probably now inside of millions of websites, the fact that you’ve found one character who has been able to show that it didn’t work in a, it sounds like you can mitigate more or less immediately. That’s pretty remarkable. And you’ve just, over the last 40 minutes or so, you’ve done this, it sounds like a detective story almost. Here’s a problem, we tried to fix it, this went wrong, we tried to fix that, this went wrong, we didn’t have the boots on the ground or whatever. You’ve managed to achieve it. And it also feels as if this is the kind of feature update that nobody will ever thank you for, because it’s in the background, if you know what I mean?

[00:35:24] Andy Fragen: Oh, no.

[00:35:25] Nathan Wrigley: Oh, good.

[00:35:25] Andy Fragen: The people that are going to thank us for this are all hosting companies that aren’t going to see these issues anymore.

[00:35:30] Nathan Wrigley: That’s what I meant. It’s more of an invisible.

[00:35:33] Andy Fragen: It is exactly invisible. The user has nothing to do. All they have to do is have auto updates enabled, and their plugins will auto update. If there’s a problem or an issue, it’ll revert back, and all it’ll do is show another update again, in 12 hours it’ll try again.
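
The flow described in the interview (update, check only active plugins, revert on failure, offer the update again in 12 hours) can be sketched roughly as follows. This is a simplified illustration, not the Core implementation: `Plugin`, `auto_update`, and the `healthy_unless_2_0` check are hypothetical names, and the real feature works with files on disk rather than a version string.

```python
from dataclasses import dataclass

RETRY_AFTER_HOURS = 12  # retry interval mentioned in the interview


@dataclass
class Plugin:
    name: str
    version: str
    active: bool


def auto_update(plugin, new_version, is_healthy):
    """Install new_version; revert if an active plugin breaks the site.

    Returns (outcome, hours_until_retry).
    """
    previous = plugin.version
    plugin.version = new_version
    # Inactive plugins are not checked: a fatal there cannot take the site down.
    if not plugin.active or is_healthy(plugin):
        return ("updated", None)
    plugin.version = previous  # roll back to the version that worked
    return ("rolled_back", RETRY_AFTER_HOURS)


def healthy_unless_2_0(plugin):
    """Hypothetical health check: pretend version 2.0 causes a fatal error."""
    return plugin.version != "2.0"


active = Plugin("example-plugin", "1.0", active=True)
inactive = Plugin("other-plugin", "1.0", active=False)
active_result = auto_update(active, "2.0", healthy_unless_2_0)
inactive_result = auto_update(inactive, "2.0", healthy_unless_2_0)
```

Here the active plugin ends up back on 1.0 with a retry scheduled, while the inactive plugin keeps the broken 2.0 and would only surface the error on activation, matching the behaviour described earlier in the conversation.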

[00:35:50] Nathan Wrigley: As we said at the top of this interview, it sounds so simple, just roll back when there’s a problem, but now we know.

[00:35:56] Andy Fragen: It’s not that the problem isn’t simple to define, it’s finding all the little pieces in creating the solution that isn’t always simple.

[00:36:04] Nathan Wrigley: Yeah, I remember Kennedy saying, we choose to go to the moon.

[00:36:07] Andy Fragen: Not because we have to, because we want to, or something like that.

[00:36:09] Nathan Wrigley: Yeah, and the problem was hard, but they got over it. And we have echoes of that here. All I can say is thank you so much for making this an important thing. And hopefully, for me at least, and probably everybody else, the more invisible it is in the future of WordPress, in a sense, the better that is.

The less time that I have to worry about plugins updating, the more time that I can concentrate on building the website, and not thinking about those things. And for the millions, millions of people who have no interest in WordPress, but just want a website, this stuff will be remarkable, but probably they’ll never know, which is nice.

[00:36:48] Andy Fragen: You know, honestly our goal is if nobody ever knows about it, perfect. That means it works. No one sees a problem. It hopefully pushes people into clicking that little button link that says, enable auto updates, so that they keep their sites up to date. Because one of the biggest security issues in WordPress is out of date websites, or websites with out of date plugins.

[00:37:14] Nathan Wrigley: Well, dear listener, if you are listening to this podcast, I think there’s a high chance that you obsess about WordPress. So hopefully what Andy has told us today gives you some understanding of the complexities of what’s been going on in the background, but also will make you aware that it happened. If you haven’t been reading the change log, and you’ve just updated to the latest version of WordPress, this fairly consequential, but fairly hidden feature is now available to you free of charge, on the back of Andy’s, and many other people’s labor. So just a very quick thank you from me. Thank you for taking the time to do that, and thank you for chatting to me today.

[00:37:46] Andy Fragen: It is certainly not me alone. Colin Stewart has been invaluable, and is a brilliant developer.

[00:37:52] Nathan Wrigley: Well, a profound change to WordPress. Yeah, thank you so much for chatting with me today, I really appreciate it.

[00:37:58] Andy Fragen: You’re very welcome, sir.

Useful links

Andy’s WordPress.org profile

core-rollback on GitHub
