Post Action Report: Bad Firewall Rule Released to WPEngine Customers Wednesday
On Wednesday afternoon a small percentage of WPEngine websites using a paid version of Wordfence experienced a 500 Internal Server Error or a white screen due to an erroneous firewall rule that we released. If you have experienced this issue, please check your email, which contains instructions to fix it. You can also find guidance on our Twitter account and on our forums, where we have posted a solution. We have also been hard at work in our ticketing system answering support requests from affected customers. You can open a ticket by signing into this site, visiting your Licenses page, and clicking Get Help on the applicable license. Instructions for the fix also appear later in this longer, more detailed post.
Please keep in mind only a small percentage of WPEngine users using Wordfence were affected, and these were limited to paid users only due to the way we release firewall rules.
The rest of this post contains an after-action report of the root cause and what we’re doing about the issue.
On Wednesday at 2pm we released a new firewall rule to our Premium, Care and Response customers that was low priority and would propagate throughout all our paid customer sites during the following 24 hours.
Most free and paid Wordfence sites (approximately 95% or more) use a file to store the firewall rules. On some hosts we were not able to implement that so we added a backup method for compatibility a few years ago which stores the firewall rules in MySQL. WPEngine is one of the rare hosts where this is used. We have only heard of an isolated report on Pantheon where this storage method is also used, but nowhere else.
On sites that use this storage method for firewall rules, as they received the new rule on Wednesday, they started whitescreening or producing a 500 Internal Server Error. That is a catastrophic failure and leaves a site in a non-functioning state. It’s the worst case scenario for us and the customer.
By 5:15pm EST on Wednesday evening there were enough reports that one of our CS team members was able to determine that a major issue was underway. They posted a list of the reports we had received in Slack, and it immediately received attention from a wide range of senior team members including our head of security, operations staff, head of products, executive team, dev team and QA team.
From 5:15pm until 5:45pm the team worked together to:
- Confirm there is a common issue among these sites.
- Investigate if it is a WPEngine operations issue, which it wasn’t.
- Analyze the error logs we had received to isolate the issue to a new firewall rule.
- Confirm it is related to a new firewall rule and which specific rule.
- Propose pulling the rule and analyze the risk/benefit of doing that and post action steps for our customers.
At 5:45pm Chloe, our head of product for Wordfence Intelligence, pulled the offending rule from production.
Then we split into separate teams which each handled customer communication, root cause analysis, developing an automated fix, and developing an immediate fix for affected customers. Some team members were cross-functional.
After trying and testing various approaches, we quickly determined that we could not develop an automated fix. We confirmed that the fix for affected customers was to delete the firewall rules in their MySQL database, which would remove the offending rule, bring the site back up and cause Wordfence to fetch fresh rules. The SQL for this is:
DELETE FROM wp_wfconfig WHERE name='wafRules';
We recommend a database backup before you run this. You may have a different table prefix to ‘wp_’ and you may also have an upper-case C in the word ‘config’ above if you’ve been using the plugin for a long time.
If you need help running a query on your WPEngine site, you can find instructions on this page on WPEngine: https://wpengine.com/support/run-query-phpmyadmin/
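Because the table name can vary, the statement has to be adjusted per site. As a minimal sketch, assuming only the two variations described above (a custom table prefix and the older upper-case C spelling), the helper below builds the matching statement. The function name build_fix_sql and its parameters are hypothetical and not part of Wordfence; take a database backup before running whatever it prints.

```python
import re

# Hypothetical helper, not part of Wordfence: builds the DELETE statement
# from this post for a site's actual table name.
_PREFIX_CHARS = re.compile(r"[A-Za-z0-9_]*")


def build_fix_sql(prefix: str = "wp_", legacy_casing: bool = False) -> str:
    """Return the fix statement for the given table prefix.

    legacy_casing=True selects the older 'wfConfig' spelling (upper-case C)
    that long-time installs may still have.
    """
    # Refuse anything that is not a plain identifier fragment, so the
    # prefix cannot smuggle extra SQL into the statement.
    if not _PREFIX_CHARS.fullmatch(prefix):
        raise ValueError(f"unexpected characters in table prefix: {prefix!r}")
    table = prefix + ("wfConfig" if legacy_casing else "wfconfig")
    return f"DELETE FROM {table} WHERE name='wafRules';"


if __name__ == "__main__":
    # Default WordPress prefix, current table spelling.
    print(build_fix_sql())
    # Example of a custom prefix on a long-running install.
    print(build_fix_sql("site7_", legacy_casing=True))
```

The prefix "site7_" is only an example; your own prefix is the value of $table_prefix in wp-config.php.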
Our communications team shared the fix on Twitter, our forums, and via tickets, and immediately started getting confirmation that this fix worked.
Today is Friday April 14th. This morning we held an after-action meeting to discuss the issue, root causes, what fixes we will be implementing longer term and what controls we will put in place to prevent a reoccurrence.
What Caused It?
As with most failures, this was a chain of events. Our firewall rules go through a rigorous QA process. That process tested firewall rules on the ubiquitous file-based storage system, and it also had a test in place for the MySQL-based storage system. A while back, the test for MySQL-stored rules was decommissioned and we inadvertently did not replace it. That was root cause one.
The second item in the chain was that the exception handling around firewall rule execution in the plugin was not robust enough to handle the exception it encountered with this rule.
A third item in the chain is that about a year and a half ago we added new functionality to the firewall syntax on Wordfence to make it more powerful. But we hadn’t used that functionality because our internal threat intelligence platform had not yet been updated to support it. Recently we added that support. So for the first time we rolled out a rule using this new functionality, which resulted in a higher likelihood of an exception being generated.
How Are We Preventing Similar Issues?
To prevent a future reoccurrence we are taking several steps.
Firstly we’re immediately putting a process in place to verify that all rules run on MySQL based rule storage systems.
Next, we’re implementing a long term solution by revamping our testing process to add an additional testing layer on external systems using our production infrastructure. Now before a rule is deployed it will go into ‘alpha’ mode and only be deployed to servers we own across a wide range of configurations and hosts. This will allow us to test all rules we deploy across real infrastructure running on real hosts, via our production infrastructure (as opposed to staging) as a final step before we deploy to our free or paid customers.
Either of the above two controls would have caught the issue we experienced on Wednesday.
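The idea behind both controls can be sketched as a release gate that runs a candidate rule against every storage method before it ships. This is an illustration only, not our actual QA tooling: the names failing_backends and may_release are hypothetical, and each "backend" is assumed to be a callable that stores the rule, loads it back and executes it, raising an exception on failure.

```python
from typing import Callable, Dict, List

# Assumption: a backend check stores the rule, reloads it and runs it,
# raising an exception if anything goes wrong.
BackendCheck = Callable[[str], None]


def failing_backends(rule: str, backends: Dict[str, BackendCheck]) -> List[str]:
    """Return the names of the storage backends on which the rule fails."""
    failures = []
    for name, check in backends.items():
        try:
            check(rule)
        except Exception:
            failures.append(name)
    return failures


def may_release(rule: str, backends: Dict[str, BackendCheck]) -> bool:
    """A rule is releasable only if it runs cleanly on every backend."""
    return not failing_backends(rule, backends)


if __name__ == "__main__":
    # Stand-in checks for illustration; real checks would exercise real hosts.
    def file_storage(rule: str) -> None:
        pass  # pretend the rule loads and runs from the rules file

    def mysql_storage(rule: str) -> None:
        raise RuntimeError("rule failed when loaded from the database")

    checks = {"file": file_storage, "mysql": mysql_storage}
    print(failing_backends("candidate-rule", checks))
    print(may_release("candidate-rule", checks))
```

The point of the design is that a missing backend check shows up as a gap in one explicit list, rather than as a test that quietly stopped running.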
In addition, we’re adding more robust exception handling to the plugin as it executes firewall rules. If a rule throws an exception, it will be caught, and gracefully handled. The rule will then be disabled and we will be notified. This will avoid errors on customer sites, and reduce our time to respond from 3 hours to minutes.
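As a rough sketch of that behavior (not the plugin's actual code, which is written in PHP), the loop below catches an exception from a rule, disables that rule, reports it, and keeps evaluating the remaining rules. The Rule class, the evaluate function and the notify callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """Hypothetical stand-in for a firewall rule."""
    rule_id: int
    matches: Callable[[dict], bool]  # True means "block this request"
    enabled: bool = True


def evaluate(rules: List[Rule], request: dict,
             notify: Callable[[int, Exception], None]) -> bool:
    """Return True if any enabled rule matches the request.

    A rule that raises is disabled and reported instead of taking the
    whole site down with an unhandled error.
    """
    for rule in rules:
        if not rule.enabled:
            continue
        try:
            if rule.matches(request):
                return True
        except Exception as exc:
            rule.enabled = False        # stop running the faulty rule
            notify(rule.rule_id, exc)   # tell the vendor which rule failed
    return False


if __name__ == "__main__":
    # Rule 1 is deliberately broken: it reads a key the request lacks.
    broken = Rule(1, lambda req: req["missing"] == "x")
    good = Rule(2, lambda req: "evil" in req.get("path", ""))
    blocked = evaluate([broken, good], {"path": "/evil"},
                       lambda rid, exc: print(f"rule {rid} disabled: {exc!r}"))
    print(blocked, broken.enabled)
```

Disabling a faulty rule trades a small gap in protection for keeping the site online, which is the trade-off described above.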
I’d like to sincerely apologize to the customers that were affected. Our records show that it was fewer than 200 in total, based on reports received through our support channels, email, forums and social media. Our team works hard to avoid issues like these. Deploying software to over 4 million sites every few weeks, along with firewall rules in real-time, presents a unique operational challenge, and we are generally very good at keeping sites secure and producing rock solid software. This time we failed you and I’m sorry.
I created this business because a hacker took my own WordPress site offline in 2011. Keeping you online is at the very foundation of what we do and what we are built on. We have worked this week to do better and we will continue to do so.
Mark Maunder – Wordfence & Defiant Founder and CEO.
Good job, challenging conditions! Crisis Management 101 Award!
Long time Wordfence customer here. None of my sites were impacted. (Don't know why)
I know you are busy every moment beating yourself up over the events, but here's what I saw:
- you owned the problem right up front and didn't try to deflect
- you communicated clearly and honestly
- you set expectations
- you showed you were working fast
- you apologized and promised a post-mortem
In my view, you nailed Crisis Management 101.
So despite the rotten days, know that some of us out here saw exemplary stuff. Feel free to share this with your team. Check my site. I spend time with big global organizations that tend to fail at mastering what you've just done.
Futurist Jim Carroll
FYI, sites on Azure app services were also affected.
Thanks Bjorn I’ll alert the team.
Our main site is hosted at WPEngine and was affected by the new firewall rule. The guy who assisted me at WPEngine was amazing: 10 minutes after our site went down we were able to pinpoint Wordfence as the cause of the internal error, so I was able to go in via CLI and disable the plug-in. That gave me immediate access to the WordPress admin panel so I could totally remove the plug-in.
Total downtime was 30 min, enough to generate 60 or 70 support calls.
Thanks for the explanation and all the hard work. Also, thanks to the guys at WPEngine for their amazing tech support.
Sounds like their support team rocks! Glad you got it sorted.
I run an AWS-based infrastructure (not WP Engine) that was taken out by this. I lost time with my family, and money to my clients to this mistake. I'm glad that you've put processes in place to ensure this doesn't happen again and appreciate this post. I just hope there are no other blind spots.
Sorry to hear that. QA is something we put a lot of care and thought into and we have an excellent team. This has been a catalyst for a closer overall examination of risk, exposure, processes and so on. We’re using it as an opportunity. I don’t anticipate anything else will emerge at this point but this will make us a more robust company and team going forward who can better serve our customers.
First, I'll echo what Jim Carroll said!!!
We had one site that was affected as well. That site was in our private cloud. We use the MySQL storage system when we have multiple servers behind a load balancer.
When I received the email Wed evening I checked all our sites that use the MySQL storage engine and all seemed fine. However, on Thurs morning I rechecked them all again and one of them was affected. The fix you provided was simple and the site was back up immediately.
So, thanx again for all you do to keep the WP community (and the web at large) safer!
Glad to hear you got it fixed without a hitch Paul. Thanks for the feedback.
So admire the transparency. You rock. Great job by all.
We had 3 sites down for several hours as it happened after midnight (EU). It was an easy fix though. Communication from WF was fast and detailed enough. Creating a post like this is a good way to summarise the issues and the learnings from it. Hopefully, this was the first and last issue causing sites to go down. The WF ride has been a smooth one so far.... (long-time customer).
Our site was impacted. I only found out 1hr later as we were contacted by a client. I understand that problems happen, but it would have been good to be proactively notified sooner by Wordfence so I could implement the fix. Official communication took far too long and arrived at almost 9PM.
Just as an FYI, 200 people complaining (who took the time to complain) doesn't mean only 200 people were affected. Others, like myself, were busy doing root cause analysis and had the team at WPEngine help get our sites back up and running without ever contacting you directly. I think the impact was much greater than you stated, so don't assume that because people didn't complain, they didn't have the issue.
I did not report this issue, but one of the websites I manage on WPEngine experienced this issue. Thanks to your input on how to quickly resolve the issue, I was able to get my client back online in minutes. Appreciate the communication on what you're doing to prevent this from reoccurring.