404 and SEO
When you try to get a site ranked better by search engines, you quickly dive into the underground world of SEO (Search Engine Optimization).
That world holds a lot of common sense and a few tricks worth knowing, along with some concepts bordering on the sacred, like Google’s well-known PageRank.
Overall, you have to understand how the robots used by search engines see your site, i.e. exclusively as text. You can get a glimpse of what that looks like in a text browser like lynx.
Anything that helps those robots do their work is an advantage for you. Using W3C-validated HTML and valid stylesheets won’t change your PageRank, but it will show robots the structure of your site more clearly than a layout built from tables and divs everywhere.
HTML semantics let robots distinguish content types and use them in a relevant way.
For example, the <h1> tag should contain first-level titles, the <h2> tag second-level titles, <p> paragraphs of text, and so on.
The most important is obviously the <title> tag of your pages, which should contain your most important keywords whenever possible.
The differences between robots and users don’t stop there.
Along with the page content, the server sends an HTTP status code. Those codes are rarely visible to the user, except the famous 404 (page not found), but they are important.
As a reference, here is a list of error codes a web server might return.
Let’s imagine you have dynamic content… Some pages can disappear from your site, although some external links still point to them.
Typically, we want to send the user to a page close to what they asked for, rather than confront them with an error page that may stop them cold. But we also want to tell search engines that the link is obsolete and has to be removed. Keep in mind that when a robot runs into a 404 too often, it doesn’t like it, and that can hurt your ranking!
How do we solve that, you ask? Simply by emitting an appropriate error code while still displaying content for the user!
The 404 code is the best known, but it only states that the server found nothing at that address, without saying why or whether the situation is permanent. For a page deactivated on purpose, we will prefer the 410 code (Gone), which states that the page was removed deliberately and is no longer available, rather than missing because of a technical problem.
The choice is yours as to which error code fits the page’s status most accurately…
To generate an error code, you need to change the HTTP header sent with the page. In PHP, this means using the following command:
header('HTTP/1.0 410 Gone');
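To make the idea a bit more concrete, here is a minimal sketch of choosing the status line from the state of the page, then still showing something useful to the visitor. The function name statusLineFor and the state labels 'removed' and 'missing' are invented for this illustration; in a real site the state would come from your own content lookup.

```php
<?php
// Hypothetical helper: maps a page state (our own labels) to a status line.
function statusLineFor(string $state): string
{
    switch ($state) {
        case 'removed':
            return 'HTTP/1.0 410 Gone';       // taken offline on purpose
        case 'missing':
            return 'HTTP/1.0 404 Not Found';  // nothing found at this address
        default:
            return 'HTTP/1.0 200 OK';
    }
}

// Assumption: $state is hard-coded here; normally it comes from your CMS.
$state = 'removed';

// The robot reads the status line...
header(statusLineFor($state));

// ...while the visitor still gets readable content.
echo '<p>This page has been removed. You may want to browse the rest of the site.</p>';
```

The point of keeping the mapping in one small function is that the rest of the page never has to know which code was picked.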
Note that headers are the very first thing emitted when the page is sent. The header() call therefore has to run before any output at all, even before the doctype declaration (DTD).
That is not very handy when the content management system separates the static elements of the page (like the logo or menu) from the actual content. Indeed, if the script sends the top of the page before running the content query and noticing that the content doesn’t exist, it’s too late to change the header!
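If you are not sure whether your script has already started sending the page, PHP can tell you. The sketch below wraps header() in a small guard built on the standard headers_sent() function; the wrapper name sendStatus is an assumption made for this example, not something provided by PHP.

```php
<?php
// Hypothetical wrapper: sends a status line only if it is still possible.
// Returns true when the header was accepted, false when output had already begun.
function sendStatus(string $statusLine): bool
{
    // headers_sent() fills $file and $line with the place where output started.
    if (headers_sent($file, $line)) {
        error_log("Too late for '$statusLine': output started at $file:$line");
        return false;
    }
    header($statusLine);
    return true;
}

if (!sendStatus('HTTP/1.0 410 Gone')) {
    // Fallback is up to you; here we simply carry on with the default status.
}
```

The log entry points at the file and line that produced the first byte of output, which is usually the template you need to move or buffer.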
A simple way around this problem is to raise the size of the PHP output buffer.
In php.ini, change the output_buffering line this way:
output_buffering = 65535
This value is a limit, in bytes, but it is usually enough (unless your pages weigh an indecent amount!).
You can now return any error code whenever you want, without disturbing your user’s visit.
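If you cannot edit php.ini, a similar effect can probably be had from inside the script with PHP’s ob_start() and ob_get_clean() functions. The sketch below is only an illustration under assumptions: findArticle, renderPage and the tiny in-memory article list are made up for the example and stand in for your real content query and templates.

```php
<?php
// Hypothetical content lookup: returns null when the article no longer exists.
function findArticle(int $id): ?string
{
    $articles = [1 => 'Hello world'];   // stand-in for a database
    return $articles[$id] ?? null;
}

// Builds the whole page in a buffer and decides the status line on the way.
function renderPage(int $id): array
{
    ob_start();                                   // hold all output back
    echo "<html><body><div id=\"menu\">Menu</div>\n";  // static top of the page

    $article = findArticle($id);
    if ($article === null) {
        $statusLine = 'HTTP/1.0 410 Gone';
        echo '<p>This article was removed.</p>';
    } else {
        $statusLine = 'HTTP/1.0 200 OK';
        echo '<p>' . htmlspecialchars($article) . '</p>';
    }

    echo '</body></html>';
    $html = ob_get_clean();                       // fetch the buffer, send nothing yet
    return [$statusLine, $html];
}

// Nothing has been sent so far, so the header can still be set.
[$statusLine, $html] = renderPage(42);
header($statusLine);
echo $html;
```

Here the top of the page is "printed" before the lookup, exactly as in the problematic layout, yet the status line can still be chosen afterwards because the output only leaves the buffer at the end.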

Tuesday July 7th, 2009 at 02:35 AM
Hello, can you please post some more information on this topic? I would like to read more.
Monday February 8th, 2010 at 03:17 PM
Thanks for this interesting article…