HTML to XML (Screen scraper)

An XSLT extension for Umbraco that wraps the functionality of the Light HTML to XML converter by Alain COUTHURES: http://sourceforge.net/projects/light-html2xml/

The extension helps convert malformed HTML into XML so that external content can be pulled into a site, i.e. screen scraping.

There are two exposed methods:
1) Htm2XmlNodeset
public static XPathNodeIterator Htm2XmlNodeset(string url)
Given a URL, fetches the page and attempts to create a valid XPathNodeIterator from it.

2) Htm2XmlQueryNodeset
public static XPathNodeIterator Htm2XmlQueryNodeset(string url, string query)
Given a URL and an XPath query, fetches the page, attempts to match the query against it, and creates a valid XPathNodeIterator from the result. Useful for getting a single div, table etc.
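
To make the idea behind the second method concrete, here is a rough sketch in Python of "tidy messy HTML into a tree, then run a path query over it". It is not the package's code and not the Light HTML to XML converter: the names LenientTreeBuilder, html_to_xml and query are made up for illustration, the page is passed in as a string rather than fetched from a URL, and the standard library only supports a small subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Hypothetical helper: builds a well-formed tree from loose HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name (and anything left
        # open inside it); stray end tags are ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element belongs in that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    """Return the root element of a well-formed tree for the given HTML."""
    builder = LenientTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def query(html, path):
    """Return the elements matching an ElementTree path expression."""
    return html_to_xml(html).findall(path)


# Unclosed <p>, unclosed <br> and a missing </html> are all tolerated.
messy = ("<html><body><div class='wx'><h1>Paris<br>France</h1>"
         "<p>unclosed</div><p>after</body>")
for heading in query(messy, ".//div[@class='wx']//h1"):
    print("".join(heading.itertext()))
```

This sketch does not apply HTML's implicit-closing rules (for example, a second <p> is nested inside an unclosed first <p> rather than closing it), which is the kind of clean-up a real converter has to handle.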

Macros

There are two macros: html2xml.xslt and html2xml_withxpquery.xslt

Examples

<umbraco:Macro URL="http://www.bbc.co.uk" Alias="Html2xml" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@class='yw-ulmwrap']//h1" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//dd[preceding-sibling::*[1][name()='dt' and contains(.,'Feels Like:')]]" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@id='yui-main']/div[@class='yui-b']/div[@id='yw-forecast']//em" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.noaa.gov/cgi-bin/mgetmetar.pl?cccc=EGPD&Submit=SUBMIT" Alias="Html2xml" runat="server"></umbraco:Macro>


It's my first package and it didn't hurt too much. I would like to create more. I am sure lots of us have a bunch of useful things we never share for one reason or another, such as code being too specific to a client project or not having the time to clean it up and publish it. Well, in my quest for karma, and also just to make a start, I have done it.
Yes, documentation is lacking, but at this stage it is only for those who know what the title means. When it is stable and I have some time, I can create some worked examples of how to use doc types etc.
Finally, screen scraping is not polite and could be illegal, so you will be careful, won't you?

Package owner

Alec Griffiths

Package Compatibility

This package is compatible with the following versions as reported by community members who have downloaded this package:
Untested or doesn't work on Umbraco Cloud
Version 8.18.x (untested)
