HTML to XML (Screen scraper)

An XSLT extension for use in Umbraco that wraps the functionality of the Light HTML to XML converter by Alain COUTHURES: http://sourceforge.net/projects/light-html2xml/

The extension helps to reformat badly formed HTML into XML so that external content can be pulled into a site, i.e. screen scraping.

There are two exposed methods:
1) Htm2XmlNodeset
public static XPathNodeIterator Htm2XmlNodeset(string url)
Given a URL, fetches the page and attempts to create a valid XPathNodeIterator from it.

2) Htm2XmlQueryNodeset
public static XPathNodeIterator Htm2XmlQueryNodeset(string url, string query)
Given a URL and an XPath query, fetches the page, applies the query to the converted XML and attempts to create a valid XPathNodeIterator from the matching nodes. Useful for getting a single div, table etc.
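
To illustrate what the query step amounts to, here is a minimal sketch using only the standard System.Xml.XPath classes. It is not the package's code: the class QuerySketch and the helper SelectFromXml are hypothetical names, and the sketch assumes the page has already been converted to well-formed XML, which is the part the Light HTML to XML converter handles in the real extension.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class QuerySketch
{
    // Hypothetical helper, not part of the package: applies an XPath query
    // to markup that is ALREADY well-formed XML and returns the matches.
    public static XPathNodeIterator SelectFromXml(string xml, string query)
    {
        var document = new XPathDocument(new StringReader(xml));
        XPathNavigator navigator = document.CreateNavigator();
        return navigator.Select(query);
    }

    public static void Main()
    {
        // Stand-in for a fetched page after conversion to XML (made-up content).
        const string xml =
            "<html><body>" +
            "<div class='forecast'><h1>Paris</h1></div>" +
            "<div class='other'><h1>Ignored</h1></div>" +
            "</body></html>";

        // Only the h1 inside the div with class 'forecast' matches.
        XPathNodeIterator matches = SelectFromXml(xml, "//div[@class='forecast']//h1");
        while (matches.MoveNext())
        {
            Console.WriteLine(matches.Current.Value); // prints "Paris"
        }
    }
}
```

An XPathNodeIterator is the type that .NET XSLT extension methods return to hand a node-set back to the stylesheet, which is why both exposed methods use it as their return type.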

Macros

There are two macros: html2xml.xslt and html2xml_withxpquery.xslt

Examples

<umbraco:Macro URL="http://www.bbc.co.uk" Alias="Html2xml" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@class='yw-ulmwrap']//h1" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//dd[preceding-sibling::*[1][name()='dt' and contains(.,'Feels Like:')]]" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@id='yui-main']/div[@class='yui-b']/div[@id='yw-forecast']//em" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.noaa.gov/cgi-bin/mgetmetar.pl?cccc=EGPD&Submit=SUBMIT" Alias="Html2xml" runat="server"></umbraco:Macro>

It's my first package and it didn't hurt too much. I would like to create more. I am sure lots of us have a bunch of useful things we never share for one reason or another, such as the code being too specific to a client project or not having the time to clean it up and publish it. Well, in my quest for karma, and also just to make a start, I have done it.
Yes, documentation is lacking, but at this stage the package is only for those who know what the title means. When it is stable and I have some time, I can create some worked examples of how to use doc types etc.
Finally, screen scraping is not polite and could be illegal, so you will be careful, won't you?

Package files

html2html2_0.1.zip

uploaded 04/10/2009 by Alec Griffiths

Package owner

Alec Griffiths

Alec has 151 karma points

Package Compatibility

This package is compatible with the following versions as reported by community members who have downloaded this package:

Untested or doesn't work on Umbraco Cloud

Version 8.18.x (untested)

Package Information

Package owner: Alec Griffiths
Created: 04/10/2009
Current version 0.1
License GNU General Public License version 2 (GPLv2)
Downloads on Our: 1K