Insight on digital media from the team at LexBlog, Inc.

A Funny Thing Happened on The Way to Debugging Slow Page Loads

[Header image: a lantern with a lit candle in the forest. Photo: Megan Lee, Unsplash.]
June 26, 2024

Our engineering team gets alerts when network conditions are poor. For example, if pages seem to be loading too slowly, we’ll get emails like this:

[Image: New Relic email alert]

The excellent New Relic product for APM (Application Performance Monitoring) generates email notifications.

This could be for a number of reasons:

  • Maybe our Success team is importing a large batch of content.
  • Maybe a blog post has gone viral.
  • Maybe a customer is performing an SEO audit on many blogs.
  • Most commonly, maybe a malicious or overzealous third-party crawler is ignoring our robots.txt rules and is effectively DDoS-ing our network.

This happened just the other day and I was on call to respond to it.

The Situation

The first step was to confirm an actionable issue with resource usage. In this case, there certainly was one:

[Image: CPU usage graph]

The blue lines at or near 100% indicated that PHP was maxed out and customer-facing downtime was possible.

The next step was to review server logs to determine what sort of traffic was causing the spike:

[Image: Apache access logs]

An Apache log demonstrated an absurd amount of traffic from one particular Firefox UA.

In this case, I could see clearly that one particular version of Firefox was outpacing all other network activity by about 50x. This was an obvious attack signature.
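
For the curious, the tally itself is nothing fancy. Here is a rough sketch of how one might count requests per User-Agent in a combined-format Apache access log; the log path is an assumption, so adjust for your own setup:

```python
# Count requests per User-Agent in an Apache "combined" format access log.
# The log path is a placeholder; point it at your own access log.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"

# Combined format ends with "referer" "user-agent"; grab the last quoted field.
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

# An outlier UA shows up immediately at the top of this list.
for user_agent, hits in counts.most_common(10):
    print(f"{hits:8d}  {user_agent}")
```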

It was puzzling though, because our enterprise-level firewall, courtesy of Cloudflare, should catch stuff like this. So why wasn’t it?

Well, there’s always the possibility that the traffic was semi-legit but had simply gotten out of control. A good example is the scenario I mentioned above, where a customer has some sort of SEO or security audit underway. A good way to diagnose that is to check whether the IP associated with the suspicious User-Agent is on any blacklists. I prefer the AbuseIPDB service because I find it resistant to false positives.

[Image: AbuseIPDB report showing the IP as non-abusive]

AbuseIPDB gave the IP a thumbs-up, which suggested that simply blocking the IP would have been disruptive to benevolent traffic.
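
The same lookup can also be scripted against AbuseIPDB’s v2 API, which is handy when there is more than one IP to check. A rough sketch in Python; the API key and IP below are placeholders, and the endpoint and field names follow AbuseIPDB’s public documentation as I understand it:

```python
# Query AbuseIPDB's v2 "check" endpoint for an IP's reputation.
# API key and IP address are placeholders.
import requests

API_KEY = "your-abuseipdb-api-key"
SUSPECT_IP = "203.0.113.7"

response = requests.get(
    "https://api.abuseipdb.com/api/v2/check",
    headers={"Key": API_KEY, "Accept": "application/json"},
    params={"ipAddress": SUSPECT_IP, "maxAgeInDays": 90},
    timeout=10,
)
response.raise_for_status()
report = response.json()["data"]

# A low confidence score is the "thumbs-up" described above.
print(f"Abuse confidence score: {report['abuseConfidenceScore']}%")
print(f"Total reports:          {report['totalReports']}")
```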

Harrumph. If this was a malicious IP, we were the first ones to know about it, which was very unlikely.

The reason I prefer to err on the side of avoiding false positives is that the caching layers on our network generally minimize the impact of incidents like this, by serving content through Varnish. And the chance of accidentally disrupting customer activity (maybe they paid a lot of money for a security audit) is something I try to avoid …

… and yet.

The number of requests from this IP was just too vast to tolerate, so I dug a bit deeper and reviewed a different service, MXToolbox. MXToolbox is great; it’s just that I find it errs more on the side of false positives, so it’s not my first choice. But in this case, I was willing to hear it out, and it spoke good sense:

[Image: MXToolbox blacklist check flagging the IP]

Well, well, well… Maybe not so innocent after all.

MXToolbox did in fact show that this IP was on several blacklists. That, combined with the absurd volume of requests, was enough for me to take action. I immediately added it to our curated roster of blocked IP addresses via Cloudflare:

[Image: Cloudflare IP Access Rules]

We use Cloudflare to block malicious IP addresses before they even hit our servers.
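
For anyone who wants to automate this step, Cloudflare exposes the same IP Access Rules through its API. A rough sketch in Python; the token, zone ID, and IP are placeholders, and the endpoint and payload shape follow Cloudflare’s public documentation as I understand them:

```python
# Create a zone-level IP Access Rule in Cloudflare that blocks one IP.
# Token, zone ID, and IP address are placeholders.
import requests

API_TOKEN = "your-cloudflare-api-token"
ZONE_ID = "your-zone-id"
ABUSIVE_IP = "203.0.113.7"

response = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/access_rules/rules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "mode": "block",
        "configuration": {"target": "ip", "value": ABUSIVE_IP},
        "notes": "Abusive crawler ignoring robots.txt",
    },
    timeout=10,
)
response.raise_for_status()
print(response.json()["success"])
```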

The Red Herring

At this point, I was feeling pretty good about the situation. Things were starting to make sense:

  • We had a network slow-down.
  • We traced that to a traffic issue.
  • We traced that to a Firefox UA.
  • We traced that to an IP address.
  • We reached a verdict against this IP address. Perhaps not beyond a reasonable doubt, but certainly with a preponderance of evidence. Call it a civil suit.
  • We blocked it via Cloudflare so it would stop hitting our web servers at all.

At that point, there wasn’t much else to do but take a short break, make some coffee, and monitor the server to watch with great satisfaction as the malicious IP disappeared, resource usage plummeted, and network conditions improved.

Only none of that stuff happened.

What the heck? I had just blocked this IP in Cloudflare, yet somehow it was still making it through to our network! This is nearly inconceivable; honestly, human error on my part is the most likely explanation in cases like this. But I retraced my steps and found no obvious mistake. So what next?

I had to step back and return to first principles. Forget for a moment about who this bad actor is, where they are, what their IP address is. How about this: What are they actually doing?

This is a good question and we do usually interrogate this at some point, but not until the firefight of poor network conditions is over. Tracing network activity is more time-consuming than noting the UA and IP. But in this case, it was the only logical next step.
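
When we do go log-diving, the gist is to filter down to the suspicious User-Agent and count what it is actually requesting. A rough sketch along the same lines as the earlier tally; the log path, UA substring, and log format are assumptions:

```python
# Tally the URLs requested by one suspicious User-Agent in an Apache
# access log. Log path and UA substring are placeholders.
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"
SUSPECT_UA = "Firefox/115.0"  # placeholder for the offending UA string

# Pull the request path out of lines like: "GET /some/path HTTP/1.1"
REQUEST_PATTERN = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*"')

paths = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if SUSPECT_UA not in line:
            continue
        match = REQUEST_PATTERN.search(line)
        if match:
            paths[match.group(1)] += 1

# The hammered blog (or endpoint) stands out at the top of this list.
for path, hits in paths.most_common(20):
    print(f"{hits:8d}  {path}")
```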

Our servers tend to host more than one WordPress environment, so the first step was to trace the issue down to the relevant WordPress environment. I was able to do this by, again, observing New Relic. Each WordPress environment has its own New Relic entry, and it was obvious that all of this traffic was targeted toward one particular “Premier” environment.

At LexBlog, “Premier” means that all of the blogs on a given WordPress environment belong to the same customer, as opposed to a shared environment. We do have a couple of Premier customers who do not use our Cloudflare layer, because their blogs have domain names like this:

example-blog.example-firm.com

Why is that an issue? Notice that in this example, the blog is at a subdomain of the law firm domain name. In order to use our standard Cloudflare configuration, customers need to point their nameservers to our Cloudflare account, and having a Premier-level customer point their entire domain to our Cloudflare account is something I regard as being beyond our scope. They probably have lots of subdomains for things that have nothing to do with their LexBlog account (maybe even a client portal, say), and frankly, most customers want full control of that. For their blogs, they are content to just point a CNAME record to our web host and rely on whatever network security they have on their firm domain name.

Unfortunately, the environment under attack was not among the small handful of environments that tend to skip our Cloudflare layer. Just when things were on the verge of making sense, they continued to defy logic.

Indeed, I reviewed a handful of their blogs on Cloudflare and saw nothing remotely resembling an attack signature:

[Image: Cloudflare traffic graph showing normal activity]

This is not what a spike looks like. This is normal traffic fluctuation.

So what was going on exactly? An attack that did not register at all in Cloudflare, and therefore could not be blocked by Cloudflare, yet on an environment where all of the blogs, to my knowledge, use Cloudflare. How was this possible?

The Actual Herring

And then I saw it.

Lo and behold, among the many blogs this customer has with us, they do have one that uses a subdomain of their firm domain name! Nailed it. A brand new blog on their firm domain name getting hammered by some spammer. Well, that’s an easy fix:

[Image: web server IP blocking rules]

“The perfect is the enemy of the good.” – Voltaire

I blocked the IP at the web server level through our hosting platform. This is less ideal than blocking it at the Cloudflare level, because nginx still has to handle the request, but that consumes a minuscule amount of resources compared to a full Apache request; there’s no PHP or MySQL involved, for example.
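
For reference, a web-server-level block like this boils down to a one-line deny directive. A minimal sketch of the idea in nginx config; the IP and server name are placeholders, and our actual rule lives in the hosting platform’s managed configuration rather than a hand-edited file:

```nginx
# Sketch: reject the abusive IP at the web server, before any PHP or MySQL work.
server {
    server_name example-blog.example-firm.com;

    deny 203.0.113.7;   # placeholder for the offending IP
    allow all;

    # ... rest of the site configuration ...
}
```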

The Victory Parade

This was a very satisfying conclusion to a very mysterious morning, but upon reflection, I realized that the victory was even sweeter than I first supposed. This was a brand-new blog. What are the odds that it alone — in isolation from all the other blogs in this environment — had been picked up by spammers? Basically zero. So how did they spam it? Because surely they were spamming the entire customer domain name.

Think about that for a second:

  • Our network slows down.
  • I get an alert.
  • After some quagmire time, I find that the issue is limited to one subdomain of their firm website domain.
  • I fix the issue as it pertains to our platform.

What does this mean? It means that every property under this law firm domain name was likely under attack, and LexBlog was the first one to realize it and take action.

Many of our customers are among the largest law firms in the world. They have revenue that outpaces entire nations. They broker things like the sale of NFL teams.

And good old LexBlog has better network security than them.

There’s always a question, around incidents like this, of how and when to reach out to the customer. For garden-variety traffic issues, we tend not to reach out, because there’s nothing noteworthy enough to justify the risk of confusion or unwarranted blame. For cases more in the gray area, where we are blocking an IP or UA that has some possibility of disrupting the customer, yes, reaching out to them is the only responsible thing to do. But in a case like this — a case I’d never considered before — where LexBlog was the canary in the coal mine for an attack on the firm website domain, I nearly pulled a hamstring rushing to our Success team to alert the customer.

And to brag a little.
