[ WARNING: Extremely geeky technical content ahead. Commbadges should be firmly affixed and tricorders should be at the ready. ]
One of the Gibbity features I’m most excited about is the ability to explore the games listing by tags. I got the idea from browsing Emusic’s very nice directories last week and decided to implement something similar in Gibbity.
In a nutshell: users choose one of the top 20 tags with which to begin their search. So assuming you chose "pc," you’d then be presented with a page containing all of the games tagged "pc." On the sidebar you’d also find a list of related tags, which you can use to further refine your search, e.g., "strategy," "rpg," "fps," etc. Clicking "fps" gets you a page with games matching "pc" and "fps," with its own sidebar of related tags. Theoretically you could refine your search an infinite number of times (although in practice you’d probably run out of matching games after the third or fourth tag choice).
I was pretty proud of this infinitely nesting search when I put it together a few days back, but I had some major doubts about whether I could write it in such a way as to be resource-friendly (i.e., so it wouldn’t bring the server to its knees every time a few dozen people tried to use it).
True enough, the search was a friggin’ hog. I ran a very simple script timer to see how long it took the PHP engine to process the instructions of my sample pc+fps search, and got an average of about 0.3 seconds on each run.
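For the curious, the timer itself was nothing fancy; a minimal sketch of the idea looks something like this (run_tag_search() is just a made-up stand-in for the real matching code):

<?php
// Stand-in for the real tag search; here it just burns a little time
// so there is something to measure.
function run_tag_search(array $tags)
{
    usleep(300000); // pretend the search takes ~0.3 seconds
    return array(); // the real version would return matching game IDs
}

// Note the time before and after the expensive call, then print the difference.
$start = microtime(true);
$results = run_tag_search(array('pc', 'fps'));
$elapsed = microtime(true) - $start;

printf("Search took %.4f seconds\n", $elapsed);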
Now, 0.3 seconds may not seem like much, but I can tell you that that is a lifetime for a multi-user web-based application. Most scripts shouldn’t take longer than a hundredth of a second to run, because you are assuming an environment where the server is getting hammered by hundreds of requests per second. (By comparison, Gibbity’s homepage only takes 0.02 seconds to render from top to bottom.)
After trying various things to optimize the matching code, I simply could not get the execution time any lower than a quarter of a second. Finally, I decided that the only way I was going to shave time off this operation was to cheat, and cache the results.
Result caching is basically what it sounds like: instead of actually searching for games tagged "pc" and "fps," the engine cheats and serves up stored results from previous searches. The idea is that you become very efficient at serving popular searches like the aforementioned pc+fps, although you take a small hit when running obscure searches like, say, pc+fps+balloons.
There are a bunch of different ways to implement caching. The (very basic) method I chose was to save the actual PHP array of search results into a text file, i.e., all games matching "pc" go in a file called tags_pc.cache, and all games matching "fps" go in a file called tags_fps.cache. So when anybody asks for pc+fps, all the engine has to do is look for the two *.cache files and intersect their values. The beauty of this method is that even the obscure searches (pc+fps+balloons) benefit from the caching, because only "balloons" will actually have to be run.
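In rough outline, the cache code looks something like this (the file names match what I described above, but query_games_for_tag() and get_tag_cache() are just stand-ins for how Gibbity actually talks to the database):

<?php
// Stand-in for the real database lookup: returns the IDs of all games
// carrying a given tag.
function query_games_for_tag($tag)
{
    // ... the real version runs a query here ...
    return array(1, 2, 3);
}

// Return the cached array for a tag, building the tags_*.cache file
// the first time it's asked for.
function get_tag_cache($tag)
{
    $file = "tags_{$tag}.cache";

    if (file_exists($file)) {
        return unserialize(file_get_contents($file));
    }

    $games = query_games_for_tag($tag);
    file_put_contents($file, serialize($games));
    return $games;
}

// A pc+fps search is just the intersection of the two cached arrays.
$matches = array_intersect(get_tag_cache('pc'), get_tag_cache('fps'));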
Now that I had gotten the caching part working, my next problem was figuring out how to keep the cache fresh. Since the cache is totally disconnected from the present state of the site, new games could be added at any time without ever being reflected in the cache.
My first idea was to destroy and rebuild specific caches every time a new game and/or tag was introduced by a user. So for example, if someone added that new game "Hell Gate: London" and tagged it as "pc," "fps," and "mmorpg," I would destroy and rewrite those three tags_*.cache files. That would mean 100% accurate caches at all times, which is great. The problem, though, is that popular tags like "pc" have pretty big lists of games ("pc" has 668 games already, and I haven’t even debuted the site yet), so it’d be an equally large resource hog to write and rewrite those caches every time a user makes an entry.
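Had I gone that route, the write side would have been roughly this simple (same caveat: these function names are stand-ins, not the actual Gibbity code):

<?php
// Stand-in for the real database lookup.
function query_games_for_tag($tag)
{
    return array(1, 2, 3);
}

// Rebuild the cache file for every tag attached to a newly added game,
// so the caches never go stale.
function rebuild_tag_caches(array $tags)
{
    foreach ($tags as $tag) {
        $games = query_games_for_tag($tag);
        file_put_contents("tags_{$tag}.cache", serialize($games));
    }
}

// e.g., a user adds a game tagged "pc," "fps," and "mmorpg":
rebuild_tag_caches(array('pc', 'fps', 'mmorpg'));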
My second idea, and the one I eventually went with, was to have a simple expiration on each cache. So if we run our search and find that one of the caches is older than an hour, we destroy it, rebuild it, and display its contents. That means that one person, every hour or so, will get a slow search (in the region of 0.3-0.4 seconds), but everyone else who runs a similar search within that window gets a free ride. By "free ride," I mean 0.015-second execution times, which are the results I’ve been getting from my testing (savings of over 90% = good).
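The expiration check is nothing more than a filemtime() comparison bolted onto the earlier sketch (again, the function names are illustrative stand-ins, and the one-hour cutoff is just my assumed default):

<?php
define('CACHE_TTL', 3600); // one hour, in seconds

// Stand-in for the real database lookup.
function query_games_for_tag($tag)
{
    return array(1, 2, 3);
}

// Same as before, except a cache file is only trusted while it's less
// than an hour old; anything older gets rebuilt on the spot.
function get_tag_cache($tag)
{
    $file = "tags_{$tag}.cache";

    if (file_exists($file) && (time() - filemtime($file)) < CACHE_TTL) {
        return unserialize(file_get_contents($file)); // fresh cache: the free ride
    }

    $games = query_games_for_tag($tag);          // the slow path
    file_put_contents($file, serialize($games)); // resets the clock for the next hour
    return $games;
}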
The performance increase was so dramatic that I almost forgot that there were still many, many parts of the site that needed optimization, but at least I’m off to a good start :)