Readability – article extraction and HTML parsing with PHP

Readability, a nifty tool to make web articles more readable, was launched in 2009 by Arc90. It contained code to detect article content on practically any web page, isolate it and present it in a reader-friendly view.

Readability setup page from 2009

Today, Apple and Mozilla use Readability in Safari and Firefox to power their reader views. We use it here at FiveFilters.org as part of our article extraction process.

The original JavaScript code is archived here. Arc90 is no longer around, but Readability lives on at Mozilla, where it’s regularly updated and comes with an extensive test suite.

We ported Readability to PHP back in June 2009 – as far as we know, it was the first PHP port. Initially we updated it periodically to reflect the Arc90 code, but today the version we run is somewhat outdated compared to the Mozilla version.

That said, the original Arc90 code still works remarkably well, and used in combination with our own extraction rules, our tools still produce very good results.

The FiveFilters.org approach to article extraction

We use a variety of methods to extract article content from web pages. This includes checking our repository of community-maintained, site-specific extraction rules, looking for markup in the source HTML that identifies article content, e.g. microformats, and if those don’t produce results, calling Readability, which has heuristics for detecting article content. Check out our Full-Text RSS application if you’re interested in this approach.
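
As a rough illustration of that fallback chain, here is a minimal sketch. It is not code from Full-Text RSS: every name in it ($siteRules, $microformats, $heuristics, extractArticle) is hypothetical, the order of the first two stages is an assumption, and the first and last stages are stubs. The middle stage looks for the hAtom entry-content class as one example of markup that identifies article content.

```php
<?php
// Stage 1 (stub): a real implementation would look up site-specific rules for the URL.
$siteRules = function (string $html, string $url): ?string {
    return null; // pretend no rule exists for this site
};

// Stage 2: look for markup that identifies the article, here hAtom's "entry-content" class.
$microformats = function (string $html, string $url): ?string {
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]';
    $node = $xpath->query($query)->item(0);

    return $node === null ? null : $dom->saveHTML($node);
};

// Stage 3 (stub): placeholder for a call into Readability's heuristics.
$heuristics = function (string $html, string $url): ?string {
    return null;
};

// Try each stage in turn; the first one to return non-empty content wins.
function extractArticle(string $html, string $url, array $extractors): ?string
{
    foreach ($extractors as $extractor) {
        $content = $extractor($html, $url);
        if ($content !== null && trim($content) !== '') {
            return $content;
        }
    }
    return null;
}

$html = '<html><body><div class="entry-content"><p>Hello</p></div></body></html>';
$content = extractArticle($html, 'https://example.org/post', [$siteRules, $microformats, $heuristics]);
```

Keeping each stage behind the same signature means a new extraction method can be added to the list without touching the others.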

We’d planned to eventually port the Mozilla version over to PHP to replace our previous port. Luckily, Andres Rey has already done the hard work, including moving over Mozilla’s test suite. You’ll find his PHP port here: Readability.php. Naturally, we’d like to be able to use this newer version in our applications, and in the next release of Full-Text RSS, you will be able to choose between the two.

Improving Readability.php

We are currently in the process of testing and updating the Readability code Andres ported over to PHP so that we can use it in our own applications. We’re applying small fixes and improvements as we go, at least for our purposes. And we intend to backport more of the Readability.js changes made to Mozilla’s version. The rest of this post will document some of the changes we’ve made so far.

Changes in libxml whitespace handling

The last commit on Readability.php at the time of writing was in April 2020, more than a year ago, and the last release was in July 2019, titled: “The one where I realized that libxml didn’t die on version v2.9.4”. In the release notes, Andres writes:

Thanks to issue #86 I realized that there are modern versions of libxml2. I always wondered why the bundled version of libxml was so old (2.9.4 was released in 2016). Turns out I was checking the wrong website. What seems to be the official website has a really old version as the latest version, meanwhile in gitlab the last version was released months ago!

So I realized there are newer versions and from 2.9.5 the normal behavior changed, breaking up all our tests. Luckily the change is “cosmetic” (whitespace differences with 2.9.4) so the tests are still “valid” but PHPUnit will complain anyway. If you know a way to compare HTMLs ignoring whitespace, let me know.

The Readability.php repository contains a Docker file allowing the code to be tested with different versions of libxml. Sure enough, libxml’s behaviour did change from 2.9.5, and the difference in whitespace handling breaks a lot of the tests. The difference appears to be related to this entry in the libxml changelog for version 2.9.5, under “Improvements”:

Initialize keepBlanks in HTML parser (Nick Wellnhofer)

Libxml now preserves blank nodes (those containing what it deems insignificant whitespace). Previously it removed them by default. To reproduce the old behaviour with versions of libxml from 2.9.5 on, LIBXML_NOBLANKS should be passed to DOMDocument’s loadHTML method:

$dom = new DOMDocument();
// LIBXML_NOBLANKS asks the parser to drop blank (whitespace-only) text nodes
$dom->loadHTML($html, LIBXML_NOBLANKS);
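
LIBXML_NOBLANKS only affects what the parser keeps. To compare two pieces of HTML regardless of which libxml version parsed them, one option is to strip whitespace-only text nodes from both before comparing. The sketch below is a rough illustration, not code from Readability.php: normaliseHtml is a hypothetical helper, and dropping every whitespace-only text node is an assumption that can hide real differences (for example inside pre elements, or a space between two inline elements).

```php
<?php
// Hypothetical helper: parse HTML, drop whitespace-only text nodes, re-serialise.
function normaliseHtml(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html, LIBXML_NOBLANKS);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Text nodes made up only of spaces, tabs and line breaks.
    $blankNodes = $xpath->query('//text()[normalize-space(.) = ""]');

    // Copy to an array first so nodes are not removed from a list being iterated.
    foreach (iterator_to_array($blankNodes) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $dom->saveHTML();
}

// Two documents that differ only in whitespace between elements.
$withSpace = normaliseHtml('<div><br> <p>A paragraph of text...</p></div>');
$withoutSpace = normaliseHtml('<div><br><p>A paragraph of text...</p></div>');
$sameAfterNormalising = ($withSpace === $withoutSpace);
```

In a PHPUnit test, the two normalised strings could then be passed to assertSame in place of the raw HTML.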

While most of Readability.php’s test failures using libxml 2.9.5 and up were due to cosmetic changes related to the inclusion of whitespace, we did find one part of the code which failed to clean HTML as expected with newer versions of libxml because of the change in whitespace handling. Let’s look at that now.

Removing <br> node next to a <p> node

The intention with this piece of code is to transform HTML of the form:

<article>
    <br> <p>A paragraph of text...</p>
</article>

Into:

<article>
    <p>A paragraph of text...</p>
</article>

Here’s how the original Arc90 Readability handled the above using regular expressions:

articleContent.innerHTML = articleContent.innerHTML.replace(/<br[^>]*>\s*<p/gi, '<p'); 

And our old PHP port:

$articleContent->innerHTML = preg_replace('/<br[^>]*>\s*<p/i', '<p', $articleContent->innerHTML);

Readability.php uses DOM traversal to achieve this:

foreach (iterator_to_array($article->getElementsByTagName('br')) as $br) {
    $next = $br->nextSibling;
    if ($next && $next->nodeName === 'p') {
        $this->logger->debug('[PrepArticle] Removing br node next to a p node.');
        $br->parentNode->removeChild($br);
    }
}

The first line inside the loop picks the <br> element’s next sibling, and the line after checks whether it’s a <p> element. It doesn’t account for whitespace, however. The next sibling after <br> in our HTML snippet above is actually a text node containing whitespace, not a <p> element.

In earlier versions of libxml (and in newer versions when using LIBXML_NOBLANKS), text nodes containing insignificant whitespace were removed. As a result, the next sibling given HTML like the above would always be a DOM element and not a text node containing whitespace.

To account for whitespace in our code, a more robust solution is to skip text nodes with whitespace when looking for the next sibling. Readability.php already has code to do that, so we simply apply it here:

foreach (iterator_to_array($article->getElementsByTagName('br')) as $br) {
    $next = NodeUtility::nextElement($br->nextSibling);
    if ($next && $next->nodeName === 'p') {
        $this->logger->debug('[PrepArticle] Removing br node next to a p node.');
        $br->parentNode->removeChild($br);
    }
}
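
For readers who want to try the idea without the library, here is a self-contained sketch. nextElementSketch is a hypothetical stand-in written for this post; it is not the implementation of NodeUtility::nextElement, only an approximation of the behaviour described above (skipping text nodes that contain nothing but whitespace).

```php
<?php
// Hypothetical stand-in: return the first node, starting at $node,
// that is not a whitespace-only text node.
function nextElementSketch(?DOMNode $node): ?DOMNode
{
    while ($node !== null
        && $node->nodeType === XML_TEXT_NODE
        && trim($node->textContent) === '') {
        $node = $node->nextSibling;
    }
    return $node;
}

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML('<div><br> <p>A paragraph of text...</p></div>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Copy the live node list to an array before removing nodes from the document.
foreach (iterator_to_array($dom->getElementsByTagName('br')) as $br) {
    $next = nextElementSketch($br->nextSibling);
    if ($next !== null && $next->nodeName === 'p') {
        $br->parentNode->removeChild($br);
    }
}

$remainingBr = $dom->getElementsByTagName('br')->length;
```

Because the whitespace is skipped in code, the <br> is removed whether or not the parser kept the blank text node.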

Using an HTML5 parser

Another issue for us is the reliance on libxml to parse HTML. Libxml is not an HTML5 parser. It’s very forgiving and handles most things well, but you will find HTML that it fails to parse properly, especially if that HTML contains lots of inline JavaScript.

Andres provides a good example:

<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

Parsing the above with libxml will not produce what you expect. The closing </div> tag, which appears inside the value of the JavaScript variable test, trips up the parser, and parts of the script element spill out into the rest of your document.

To get around this, Readability.php has a workaround involving regular expressions: summonCthulhu

This will remove all script tags via regex, which is not ideal because you may end up summoning the lord of darkness.

But you can avoid that altogether by using an HTML5 parser designed to handle modern HTML. There’s a popular one for PHP called HTML5-PHP. We use it in Full-Text RSS and most of our other tools where we need to parse HTML. It does mean introducing a new dependency to the codebase, but we think it’s well worth it. We’ve gone ahead and added it to Readability.php and made it the default for all HTML parsing.

Using HTML5-PHP also deals with another issue Andres lists with libxml parsing:

&nbsp entities are converted to spaces automatically by libxml and there’s no way to disable it.

That’s not a problem using HTML5-PHP. But for anyone who prefers the previous libxml parsing (it is faster), we’ve added a ‘parser’ configuration option which you can set to ‘libxml’ to get the old behaviour:

$readability = new Readability(new Configuration(['parser'=>'libxml']));

Parsing HTML, serialising as XML

Another issue we encountered is that the code currently outputs the final extracted article using libxml’s XML serialisation. By outputting HTML as XML, you’ll see peculiarities such as <br> elements serialised as <br></br>, and <img> elements with separate closing </img> tags, which you rarely see in HTML.
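
The difference can be reproduced with PHP’s DOM extension alone. The snippet below is a small illustration and not the code Readability.php uses: it serialises the same node as XML, passing the LIBXML_NOEMPTYTAG flag (an assumption chosen here because it forces separate closing tags), and then as HTML.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Line one<br>Line two</p>');

$p = $dom->getElementsByTagName('p')->item(0);

// XML serialisation: with LIBXML_NOEMPTYTAG, empty elements get an explicit closing tag.
$asXml = $dom->saveXML($p, LIBXML_NOEMPTYTAG);

// HTML serialisation: void elements such as <br> are written without a closing tag.
$asHtml = $dom->saveHTML($p);

echo $asXml, "\n", $asHtml, "\n";
```

The XML form should show the <br></br> pair described above, while the HTML form keeps a plain <br>.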

As we’re now including HTML5-PHP as a dependency and using it as the default parser, we’ve gone ahead and used its serialiser for all HTML output too.

In addition to the above, we’ve also made a few other changes:

  • Updated the Docker file to support versions of PHP from 7.3 to 8.0 (previously it was 7.0 to 7.3)
  • Updated the Docker file to allow you to run PHP with libxml 2.9.4, 2.9.5, 2.9.10, and 2.9.12
  • Replaced the expected HTML files in the tests folder to reflect HTML5-PHP’s serialisation
  • Backported some changes made to Readability.js since the last Readability.php update (but not all changes yet).

View all our recent Readability.php changes on Github.

Testing

If you’d like to run tests against the changes we’ve made, you can try the following commands (requires git, docker and composer):

git clone https://github.com/fivefilters/readability.php.git

cd readability.php

composer update

docker-compose up -d php-8-libxml-2.9.12

docker-compose exec php-8-libxml-2.9.12 php /app/vendor/phpunit/phpunit/phpunit --configuration /app/phpunit.xml

You will probably have to wait a while for the docker-compose up step as it will rebuild PHP with the version of libxml specified. But afterwards, you can make any changes to the code and execute the last line again to have the tests run again against your new changes.

If you have any feedback, please let us know in our forum.