Full-Text RSS 3.5

The new version of our Full-Text RSS application is now available. Full-Text RSS is our article extraction and partial-to-full-text feed conversion tool. You can try it out now, or read on to find out what’s new. This is mostly a maintenance release, with a few new features.

Returning Open Graph elements

Many sites now implement Facebook’s Open Graph protocol. This is what helps Facebook and Twitter decide what to display as the main image, title, and description of a shared article. On the source page, it’s essentially a set of meta tags that helps parsers pick out essential information about the page.
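To make the idea concrete, here is a minimal Python sketch of how a parser might pick Open Graph meta tags out of a page. It is an illustration only, not the code Full-Text RSS uses (which is written in PHP). The names OpenGraphParser, extract_open_graph and sample_page are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser

# The five Open Graph properties discussed in this post.
OG_PROPERTIES = ("og:title", "og:type", "og:url", "og:image", "og:description")


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property")
        content = attr.get("content")
        # Keep only the first occurrence of each property we care about.
        if prop in OG_PROPERTIES and content is not None and prop not in self.found:
            self.found[prop] = content


def extract_open_graph(html):
    """Return a dict of the Open Graph properties present in the HTML."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.found


# Made-up sample page: three Open Graph tags and one ordinary meta tag.
sample_page = """
<html><head>
<meta property="og:title" content="Example article title">
<meta property="og:type" content="article">
<meta property="og:url" content="https://example.org/article">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>
"""

print(extract_open_graph(sample_page))
```

Properties missing from the page are simply absent from the result, so a page with no Open Graph tags yields an empty dict.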

We now return five of these elements in our RSS and JSON output, provided they appear in the page being processed: og:title, og:type, og:url, og:image, and og:description. Here’s what this looks like in our RSS output:

<item>
...
<og:title>Iraq Body Count: undercounting death with pro-war cash</og:title>
<og:url>https://medium.com/insurge-intelligence/iraq-body-count-undercounting-death-with-pro-war-cash-b8ec232551a8</og:url>
<og:image>https://d262ilb51hltx0.cloudfront.net/max/800/1*5VzSkUN9COksk33IzbKkNw.jpeg</og:image>
<og:description>The Pentagon used IBC data for pro-occupation propaganda</og:description>
<og:type>article</og:type>
...
</item>

And in our simple JSON output:

{
  "title": ...,
  "excerpt": ...,
  "date": ...,
  "author": ...,
  "language": ...,
  "url": ...,
  "effective_url": ...,
  "og_url": "https://medium.com/insurge-intelligence/iraq-body-count-undercounting-death-with-pro-war-cash-b8ec232551a8",
  "og_title": "Iraq Body Count: undercounting death with pro-war cash",
  "og_description": "The Pentagon used IBC data for pro-occupation propaganda",
  "og_image": "https://d262ilb51hltx0.cloudfront.net/max/800/1*5VzSkUN9COksk33IzbKkNw.jpeg",
  "og_type": "article",
  "content": ...
}
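As a rough sketch of how a client might consume these fields, the Python below reads the og_* keys from a JSON item shaped like the one above. The sample values are made up, the helper names open_graph_fields and display_title are hypothetical, and preferring title over og_title is just one possible policy.

```python
import json

# Made-up response body, shaped like the simple JSON output above.
sample_response = """
{
  "title": "Example article title",
  "url": "https://example.org/article",
  "og_title": "Example article title (Open Graph)",
  "og_type": "article",
  "content": "<p>Article body</p>"
}
"""

# The five Open Graph keys listed in this post.
OG_KEYS = ("og_url", "og_title", "og_description", "og_image", "og_type")


def open_graph_fields(item):
    """Return only the og_* fields that are present in the item."""
    return {key: item[key] for key in OG_KEYS if key in item}


def display_title(item):
    """Prefer the extracted title, then og_title (an assumed policy)."""
    return item.get("title") or item.get("og_title") or ""


item = json.loads(sample_response)
print(open_graph_fields(item))
print(display_title(item))
```

Because the Open Graph fields are only returned when the page provides them, the sketch checks for each key rather than assuming it exists.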

HTML5 parsing

We’ve updated the HTML5 parsing library, HTML5-PHP, to the most recent release. We’ve also fixed an issue which prevented many XPath expressions in our site config files from being applied correctly when HTML5 parsing was requested.

Compatibility with newer versions of PHP and HHVM

Humble HTTP Agent, our HTTP component, has been updated to support version 2 of PHP’s HTTP extension, which is the version used with more recent PHP releases. This shouldn’t cause problems for those running older versions of PHP or those without the HTTP extension installed: we check whether version 2 of the HTTP extension is available and, if it isn’t, we use more widespread alternatives (cURL, file_get_contents()).
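The fallback described here can be pictured as a simple preference list. The Python sketch below shows the general pattern only; it is not Humble HTTP Agent’s actual code. The backend labels and the choose_backend function are hypothetical, and the relative order of cURL and file_get_contents() is an assumption.

```python
# Assumed preference order: HTTP extension v2 first, then the more
# widespread alternatives. The labels are illustrative, not real identifiers.
PREFERRED_BACKENDS = ("http_extension_v2", "curl", "file_get_contents")


def choose_backend(available):
    """Return the first preferred backend found in the available set."""
    for name in PREFERRED_BACKENDS:
        if name in available:
            return name
    raise RuntimeError("no supported HTTP backend available")


# A server without the HTTP extension falls back to the next option.
print(choose_backend({"curl", "file_get_contents"}))  # prints "curl"
```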

In our tests, Full-Text RSS 3.5 also worked with HHVM version 3.7.1 (the latest version at the time of writing), with two caveats: Tidy support is not yet available (HHVM does not currently offer the Tidy extension), and automatic site config updates do not yet work. Nevertheless, this might be very useful for users who want faster HTML5 parsing, an area where Tidy support is not really needed and where HHVM should show a real speed advantage over regular PHP.

We’ve also changed the minimum PHP requirement from 5.2 to 5.3. If you’re still using PHP 5.2, please continue using Full-Text RSS 3.4.

HTTP headers in site config files

It’s now possible to specify user-agent, referer and cookie HTTP headers in site config files. Full-Text RSS already uses what we think are good defaults, but in cases where specific values are needed, the defaults can be overridden in site config files. The format for specifying HTTP headers in a site config file is as follows:

http_header(user-agent): PHP/5.6
http_header(cookie): Blablabla
http_header(referer): http://google.com
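For readers who want to process such lines in their own tooling, here is a minimal Python sketch that reads http_header directives into a dictionary. It assumes only the format shown above. The parse_http_headers function and the regular expression are illustrative and are not how Full-Text RSS itself parses site config files.

```python
import re

# Matches lines such as "http_header(user-agent): PHP/5.6".
HEADER_LINE = re.compile(r"^http_header\(([^)]+)\):\s*(.*)$")

# The three headers this post says can be set in site config files.
ALLOWED = {"user-agent", "referer", "cookie"}


def parse_http_headers(config_text):
    """Return a dict of header name -> value for http_header directives."""
    headers = {}
    for line in config_text.splitlines():
        match = HEADER_LINE.match(line.strip())
        if not match:
            continue  # not an http_header directive
        name = match.group(1).strip().lower()
        value = match.group(2).strip()
        if name in ALLOWED:
            headers[name] = value
    return headers


sample_config = """\
http_header(user-agent): PHP/5.6
http_header(cookie): Blablabla
http_header(referer): http://google.com
"""

print(parse_http_headers(sample_config))
```

Lines that are not http_header directives are skipped, so the function can be given a whole site config file.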

Previously, custom user-agent strings were specified in the main config file (the $options->user_agents option). We’ve removed that option and moved these strings to the appropriate site config files.

VPS hosting

We’ve updated our hosting page and our Puppet script. The Puppet script can now be applied to a new instance of Ubuntu 15.04 and will install all the server software and PHP extensions needed to run Full-Text RSS. This is the quickest way to get up and running with Full-Text RSS and to get results that match ours.

Pricing and support

The cost of Full-Text RSS remains the same for personal/student use (20 Euro), but support will now be available only via our public forum, and we no longer offer a free site config file request with each purchase.

The cost of Full-Text RSS for business use (or really for anyone who can afford to pay more and might want more support from us) has increased from 40 Euro to 50 Euro. With this option you can email us for support and request one site config file for a site where the software struggles to extract content.

The reason for these changes is to give us more time to work on the software. Development has slowed down somewhat due to the increased time we are now spending on support. Offering support on the public forum means others who may be experiencing similar issues can benefit too and possibly save us from dealing with the same support requests.

Many problem reports we receive are related to hosting environments. For example, if Tidy is not available as a PHP extension on your server, Full-Text RSS will still work, but you may not get the same results as you do on our website. So one area we’ll be focusing on is making it easier for users to deploy the software in an environment where all necessary components are available. Currently the best way to do that is using a new VPS with our Puppet script applied (see above).

Survey response

Shortly after the release of Full-Text RSS 3.4, we sent out a survey to our customers to get feedback on the software. Many thanks to everyone who completed the survey—we had about 60 respondents. We meant to publish a summary of the responses earlier, but never got round to it. Here it is:

How do you use Full-Text RSS? (multiple choice)

  • 39% With a news reading application (e.g. Feedly)
  • 39% As a content extractor in a custom software application
  • 35% Linked to a blog/CMS
  • 11% Other

Did you refer to our hosting help page to choose your host?

  • 93% No
  • 5% Yes, and I used the suggested steps for VPS setup
  • 2% Yes, but only to choose the host

Does it work as expected?

  • 87% Yes
  • 13% Other
  • 0% No

Note: Most of the ‘other’ respondents said that they encountered some sites with content which Full-Text RSS couldn’t extract. If you’re considering purchasing Full-Text RSS, please try your favourite sites using our free hosted service first. If it works on our site but not yours, we’ll be happy to help.

What was the hardest part in getting it up and running? (Self-hosted version)

  • 75% It all worked as expected
  • 14% Getting the right extraction results
  • 9% Other
  • 2% Installation

Did you enable automatic site config updates in the admin area? (Self-hosted version)

  • 67% No
  • 33% Yes

Note: While updates to the Full-Text RSS software may not be so frequent, we do maintain a database of site-specific extraction rules (site config files) which we, along with help from our users, update more frequently. Full-Text RSS can be configured to automatically check for these updates and apply them without any user intervention.

Testimonials

The survey form also allowed our users to leave a testimonial. Here are a few:

It’s great, and I’m extremely impressed with the support, which is consistently helpful and fast.


I used the hosted version for a couple of years, then switched to the installed version. In both cases, worked very well. The full-text extraction is not quite 100%, but very close. When I’ve had questions or problems, they were usually dealt with promptly. Overall, a very good product!


Extremely useful and easy to use. Many thanks!


It just works =)


Full-Text RSS is a great tool that enabled us to improve our service a lot. If only there were other such easy and well functioning tools out there.
socialmind.gr


Full Text RSS is a really great piece of software for my development needs and I developed SST Announcer on the iOS app store for my school which utilizes Full Text RSS when loading an individual article.


Love the support on Full-Text RSS by FiveFilters.org. We all know sites change often, and requests to tweak the extraction of content from some news sites were done quickly! An invaluable tool!


Full-Text RSS allows me to read websites objectively, without being influenced by flashy design and aesthetics.


Full-Text RSS is a well developed and easy to use solution for everybody who likes to read their curated information stream in one central place. Don’t settle for summaries and intro texts, read the whole article in your reader!
janpeter.wiersma.me


Excellent service and does exactly as it says. Well worth the purchase. We use it on www.gaaresults.ie to pull snippets from other sites. If you view our site all the feeds apart from the twitter feeds are done using Full Text RSS.


Full-Text RSS is an amazing little utility for working with content. It only took me a few minutes of playing with the app to purchase it and start putting it to use in my applications. Also, the support is very responsive. I recommend this product, hands down!
shoutcloudstudios.com


Full Text RSS is an excellent piece of software that lets you retrieve the full content of RSS feeds.


We have tried some similar systems before but Full-Text RSS is the most complete, functional and robust one by far. It simply works flawlessly!
SEO Natural


Full-Text provides me with the news I care about in the format I prefer. Including the added bonus of being private (self-hosted version).


Full-Text RSS was great – it helped me parse a great number of websites and extract content from them. Regular updates, great support. Thanks!


Easy to set up and yet very powerful. For an accessible price you will get a professional tool.
The Gamer’s Reader creator


Full-Text RSS is a great fit for training our natural language processing and detection applications.
– Crystal Construct Limited


Great product, 100% worth the license costs


I use Full-Text RSS in my personal project www.newpsel.com. I like how it works.


Faster reading, less clicking: Full-Text RSS makes my work more efficient every day.
– Journalist, Germany


This is the one and only one best scripts you can get out there for low price. The software is amazing, and if anything didn’t work as expected, staffs respond in less than a day to help you out with your problem for FREE 😀


Where other similar software fails, Full-Text RSS never ceases to amaze me. If you need a web scraper, you can’t go wrong with Full-Text RSS.

Thank you all! Great to see so many happy users. 🙂

Full changelog

  • Open Graph properties og:title, og:type, og:url, og:image, and og:description now returned if found in the page being processed
  • Bug fix: certain XPath expressions weren’t being evaluated correctly when HTML5 parsing was enabled
  • Cookie handling now only on redirects – fixes issue with certain sites (thanks to Dave Vasilevsky)
  • Compatibility test will no longer show HHVM as incompatible – Full-Text RSS worked with HHVM 3.7.1 in our tests (but without Tidy support and no automatic site config updates)
  • Humble HTTP Agent updated to support version 2 of PHP’s HTTP extension
  • HTML5-PHP library updated
  • Site config files can now include HTTP headers (user-agent, cookie, referer), e.g. http_header(user-agent): PHP/5.6
  • Config option removed: $options->user_agents – use site config files
  • Site config files which use single_page_link can now follow it with if_page_contains: XPath to make it conditional.
  • Minimum supported PHP version is now 5.3. If you must use PHP 5.2, please download Full-Text RSS 3.4
  • Site config files updated for better extraction
  • Other minor fixes/improvements

Available to buy

Full-Text RSS 3.5 is now available to buy. If you’re an existing customer, please wait for an email from us with an upgrade link.