DOMDocument and UTF-8, a PHP charset problem
Today we'll see how to manipulate DOM elements with PHP.
As an example, we'll take a very useful function that is oddly hard to find: one that adds a specific attribute to some HTML elements.
This can be useful, for example, to add a rel="nofollow" attribute to some links, telling search engines they don't need to follow them while leaving those links available to your users.
From an SEO point of view, it can be quite useful as it prevents your PageRank from leaking to all the links on your pages.
To achieve that goal, we'll run into some vicious problems. Let's start with the finished function, to save time for the quickest among you:
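function addAttribute($context, $tag, $attribute, $value)
{
    // Detect the input charset so the result can be handed back in the same one.
    // The candidate list is an assumption: adapt it to the charsets you expect.
    $initialEncoding = mb_detect_encoding($context, array('UTF-8', 'ISO-8859-1'), true);
    if ($initialEncoding != 'UTF-8') {
        // utf8_encode() only understands ISO-8859-1 and is deprecated as of PHP 8.2.
        $context = mb_convert_encoding($context, 'UTF-8', $initialEncoding);
    }

    // The first constructor argument is the XML version, not the HTML version.
    $doc = new DOMDocument('1.0', 'UTF-8');

    // loadHTML() does not use the constructor's charset: without the meta tag
    // below, the input is read as ISO-8859-1 by default.
    $contentPrefix = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>required meta for utf-8 handling!</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>';
    $contentSuffix = '</body></html>';
    $doc->loadHTML($contentPrefix . $context . $contentSuffix);

    $elements = $doc->getElementsByTagName($tag);

    // Accept either a single value or an array of values.
    if (!is_array($value)) {
        $value = array($value);
    }

    foreach ($elements as $element) {
        foreach ($value as $currentValue) {
            if ($element->hasAttribute($attribute)) {
                // Keep the existing values and append the new one only if it is missing.
                $attributeCurrentValues = explode(' ', $element->getAttribute($attribute));
                if (!in_array($currentValue, $attributeCurrentValues)) {
                    $element->setAttribute($attribute, implode(' ', $attributeCurrentValues) . ' ' . $currentValue);
                }
            } else {
                $element->setAttribute($attribute, $currentValue);
            }
        }
    }

    // Strip the wrapper again. 236 and -16 match the prefix and suffix as
    // saveHTML() serializes them (line breaks included), so they must be
    // adjusted whenever the wrapper changes.
    $output = mb_substr($doc->saveHTML(), 236, -16);

    if ($initialEncoding != 'UTF-8') {
        // mb_convert_encoding() returns the converted string; it does not modify its argument.
        $output = mb_convert_encoding($output, $initialEncoding, 'UTF-8');
    }
    return $output;
}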
Explanations
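$initialEncoding = mb_detect_encoding($context, array('UTF-8', 'ISO-8859-1'), true);
if ($initialEncoding != 'UTF-8') {
    $context = mb_convert_encoding($context, 'UTF-8', $initialEncoding);
}
We start by detecting the encoding currently used by the given context.
We store it so that we can return the result in the same encoding, and we convert the input to UTF-8 if needed. Note that utf8_encode() only handles ISO-8859-1 input and is deprecated as of PHP 8.2, which is why mb_convert_encoding() is the safer call here.
Why UTF-8? This format has the (huge) advantage of handling all characters, including accented or special ones from various languages.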
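To make the round trip concrete, here is a minimal sketch of the detect, convert and restore steps on their own. The sample string and the candidate list are assumptions chosen for the illustration, not part of the function above.

```php
<?php
// Sample input: "café" encoded as ISO-8859-1 (0xE9 is "é" in that charset).
$latin1 = "caf\xE9";

// Assumed candidate list. In strict mode, an encoding is only returned
// if the bytes are actually valid in it.
$candidates = array('UTF-8', 'ISO-8859-1');
$detected = mb_detect_encoding($latin1, $candidates, true);

// Convert to UTF-8 for processing...
$utf8 = mb_convert_encoding($latin1, 'UTF-8', $detected);

// ...and back to the original charset once we are done.
$restored = mb_convert_encoding($utf8, $detected, 'UTF-8');

var_dump($detected, bin2hex($utf8), $restored === $latin1);
```

Listing the candidates explicitly keeps the detection predictable; the default detection order may not include the charset you are actually dealing with.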
$doc = new DOMDocument('1.0', 'UTF-8');
Then we create a new DOM object. Its constructor takes two parameters: the version number used in the XML declaration (normally "1.0"; this is not the HTML version) and the charset of the document.
$contentPrefix = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>required meta for utf-8 handling!</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>';
$contentSuffix = '</body></html>';
$doc->loadHTML($contentPrefix . $context . $contentSuffix);
Next, we wrap the given context in a minimal HTML document. Even though we passed a charset to the constructor, loadHTML() does not use it: unless the document itself declares its charset in a meta tag, the content is read as ISO-8859-1 by default. And I can assure you that when you don't know this, it's a hair-pulling scenario! It is undoubtedly the trickiest part of the manipulation.
We can then load the content into that DOM object.
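The effect of the meta tag is easy to check in isolation. The sketch below parses the same UTF-8 fragment twice, with and without the charset declaration; firstParagraphText() is a hypothetical helper written for this demo only.

```php
<?php
// Hypothetical helper: wrap a fragment in a bare document, optionally
// declaring the charset, and return the text of its first <p>.
function firstParagraphText($fragment, $declareCharset)
{
    $meta = $declareCharset
        ? '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        : '';
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->loadHTML('<html><head>' . $meta . '</head><body>' . $fragment . '</body></html>');
    return $doc->getElementsByTagName('p')->item(0)->textContent;
}

$fragment = "<p>caf\xC3\xA9</p>"; // "café" encoded as UTF-8

$withMeta = firstParagraphText($fragment, true);
$withoutMeta = firstParagraphText($fragment, false);

// Compare both results against the original UTF-8 text.
var_dump($withMeta === "caf\xC3\xA9");
var_dump($withoutMeta === "caf\xC3\xA9");
```

With the declaration, the text comes back intact. Without it, you should typically see the accented character turned into two garbage characters, although the exact result may depend on your libxml version.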
if (!is_array($value)) {
    $value = array($value);
}
This function allows us to add several values to a given attribute. Therefore, if the argument passed to the function is a string, we convert it to an array.
I won't spend much time on the loop itself, which is fairly self-explanatory.
Just note that we preserve the previous values by storing them in an array, and that we check whether the value is already present before adding it.
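The merging rule can also be read on its own, away from the DOM. Here is a small sketch of it as a pure string function; mergeAttributeValues() is a hypothetical name, and unlike the loop above it splits on any run of whitespace rather than on single spaces.

```php
<?php
// Hypothetical helper: merge new tokens into a space-separated attribute
// value, keeping the existing tokens and skipping duplicates.
function mergeAttributeValues($current, array $newValues)
{
    // Split on any whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', trim($current), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($newValues as $newValue) {
        if (!in_array($newValue, $tokens, true)) {
            $tokens[] = $newValue;
        }
    }
    return implode(' ', $tokens);
}

echo mergeAttributeValues('external nofollow', array('nofollow', 'noopener')), "\n";
// prints "external nofollow noopener"
```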
$output = mb_substr($doc->saveHTML(), 236, -16);
We then serialize the result and remove the prefix and suffix we added. The offsets 236 and -16 match the wrapper as saveHTML() writes it out, so they have to be adjusted if the wrapper ever changes. We use mb_substr() because it counts characters where substr() counts bytes; in UTF-8 the two differ as soon as a character takes more than one byte, so the multibyte function is the safe choice.
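If the hard-coded offsets feel too fragile, one alternative is to serialize only the children of the body element. This is a sketch of that idea, not the approach used in the function above; innerBodyHtml() is a hypothetical helper, the sample markup is made up, and passing a node to saveHTML() requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: serialize only what ended up inside <body>.
function innerBodyHtml(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes a single node together with its subtree.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
    . "<p>Caf\xC3\xA9 <a href=\"http://example.com/\">link</a></p>"
    . '</body></html>'
);

// Same kind of edit as in the article: mark every link as nofollow.
foreach ($doc->getElementsByTagName('a') as $link) {
    $link->setAttribute('rel', 'nofollow');
}

$fragment = innerBodyHtml($doc);
echo $fragment, "\n";
```

This way the wrapper can change freely without any offset to recompute, at the cost of a few more lines.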
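if ($initialEncoding != 'UTF-8') {
    $output = mb_convert_encoding($output, $initialEncoding, 'UTF-8');
}
return $output;
It's time to convert the result back to its original charset and return it. Note that mb_convert_encoding() returns the converted string instead of modifying its argument, so the result has to be assigned back to $output.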
And… Voila!
