If you're getting 403 Forbidden errors browsing the forum
Posted: Mon Dec 01, 2025 11:20 am
In the ongoing fight against AI crawler bots, I made a couple of changes to the forum which might cause some people to encounter a bunch of 403 Forbidden errors temporarily. If you're running into that, make sure there isn't a "sid=" in the URL, and you'll want to reload any older forum page so that you get new links that don't have that present.
Technical explanation
All modern websites use cookies (little pieces of data sent alongside the webpage) to keep track of who's logged in. Back when phpBB was written, a lot of people were hesitant to allow cookies, so phpBB would also put the session ID (the same data the cookie carries) into the URL as a "sid" value. This is vestigial functionality that has been completely unnecessary for over two decades.
Part of how phpBB implemented the session system was to also assign session IDs to folks who aren't logged in, for some reason. It kept a list of known bots that it would not assign them to, so that search engines wouldn't see random data in the URL and end up crawling every single page on a site a billion times. Most unknown crawlers were also smart enough to remove the sid value from the URL anyway.
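To make "remove the sid value from the URL" concrete, here is a minimal Python sketch of what a well-behaved crawler could do before deciding whether it has already seen a page. This is only an illustration, not code from phpBB or from any real crawler; the function name strip_sid and the example.com URLs are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url: str) -> str:
    """Return the URL with any "sid" query parameter removed.

    All other query parameters are kept in their original order.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "sid"
    ]
    # Rebuild the URL with the filtered query string.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two links to the same topic that differ only in their session ID
# collapse to one canonical URL.
a = strip_sid("https://forum.example.com/viewtopic.php?t=42&sid=abc123")
b = strip_sid("https://forum.example.com/viewtopic.php?t=42&sid=def456")
print(a)       # https://forum.example.com/viewtopic.php?t=42
print(a == b)  # True
```

A crawler that normalizes URLs like this sees each topic once, no matter how many different session IDs it has been handed along the way.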
Unfortunately, with the advent of aggressive AI crawlers, there's been an explosion of bots that are poorly written and which also go out of their way to disguise themselves as regular web browsers. Since they don't match the known-bot list and don't strip the sid themselves, each of these crawlers ends up seeing an infinite number of unique URLs, and in their quest to extract Every Last Byte of Information, they are super aggressive about trying to see every single one. For the past several months, our little forum has been besieged by millions of requests from hundreds of thousands of IP addresses, and it's a wonder things have stayed up as well as they have.
Whenever there's been a major outage I've gone into the logs and found groups of aggressive crawlers to block by IP address. But the AI companies have seen their bots getting blocked, and instead of doing the smart thing and fixing their fucking crawlers to not be so disastrous to the Internet, they've decided to spread the load out by launching massive botnets that come from every IP address they can get their hands on. Huge data centers around the world (especially in developing nations) are part of this, and I suspect that this is also the purpose of apps like HoneyGain which reward people for running a "network speed testing" app on as many devices as possible, basically turning the entire Internet into a giant AI botnet.
Anyway, phpBB is finally removing SIDs from URLs in the upcoming 4.0 release, but for various reasons they have opted not to make this change to 3.x (which is what everyone is currently running and which many sites will continue to run for quite some time). Even once URL SIDs are removed, there will be a giant backlog of bots exploring the old, known URLs, which will continue to be valid indefinitely. So the only real way to stop the torrent of suck is by making those URLs invalid.
Unfortunately there is no way to distinguish between a real user looking at an old URL and a shitty AI bot, so for now there will be some choppiness for some people. Edit: never mind, I modified the access rule to check whether you've got a login cookie, in which case you should never be forbidden.
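For anyone curious what a rule like that looks like in practice, here is a rough Python sketch of the logic: refuse a request when its URL carries a sid and the browser didn't send a login cookie. This is not the forum's actual configuration, which isn't shown in this post. The function names and the cookie name forum_login are placeholders invented for the example.

```python
from urllib.parse import parse_qs

# Hypothetical cookie name, for illustration only. A real site would use
# whatever cookie its forum software sets for logged-in users.
LOGIN_COOKIE = "forum_login"


def cookie_names(cookie_header: str) -> set[str]:
    """Extract the cookie names from a raw Cookie request header."""
    return {
        pair.split("=", 1)[0].strip()
        for pair in cookie_header.split(";")
        if pair.strip()
    }


def should_forbid(query_string: str, cookie_header: str = "") -> bool:
    """Return True if the request should get a 403 Forbidden.

    A request is refused only when the URL has a "sid" parameter AND
    no login cookie was sent with it.
    """
    has_sid = "sid" in parse_qs(query_string, keep_blank_values=True)
    has_login = LOGIN_COOKIE in cookie_names(cookie_header)
    return has_sid and not has_login


print(should_forbid("t=42&sid=abc123"))                   # True
print(should_forbid("t=42&sid=abc123", "forum_login=1"))  # False
print(should_forbid("t=42"))                              # False
```

The point of checking both conditions is that logged-in members can keep following old bookmarks, while anonymous requests for sid URLs (which is what the crawlers are hammering) get turned away.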