An XSLT extension for Umbraco that wraps the Light HTML to XML converter by Alain COUTHURES: http://sourceforge.net/projects/light-html2xml/
The extension helps turn badly formed HTML into XML so that external content can be pulled into a site, i.e. screen scraping.
It exposes two methods:
1) Htm2XmlNodeset
public static XPathNodeIterator Htm2XmlNodeset(string url)
Given a URL, fetches the page and attempts to create a valid XPathNodeIterator from it.
2) Htm2XmlQueryNodeset
public static XPathNodeIterator Htm2XmlQueryNodeset(string url,string query)
Given a URL and an XPath query, fetches the page, applies the query, and attempts to create a valid XPathNodeIterator from the matching nodes. Useful for getting a single div, table, etc.
Macros
There are two macros: html2xml.xslt and html2xml_withxpquery.xslt
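For anyone wondering how an XSLT macro might call the extension, here is a rough sketch. It is not the file shipped in the package. The Html2Xml prefix and the urn:Html2Xml namespace are assumptions: the real namespace depends on the alias the extension is registered under in config/xsltExtensions.config. The URL and xpathQuery parameter names are taken from the macro examples on this page, and the sketch assumes they are read the usual Umbraco way via /macro/.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only. "Html2Xml" / "urn:Html2Xml" is a hypothetical alias;
     use whatever alias the extension has in xsltExtensions.config. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:Html2Xml="urn:Html2Xml"
    exclude-result-prefixes="Html2Xml">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Umbraco passes the current page to every XSLT macro -->
  <xsl:param name="currentPage"/>

  <!-- Macro parameters, named as in the examples on this page -->
  <xsl:variable name="url" select="string(/macro/URL)"/>
  <xsl:variable name="query" select="string(/macro/xpathQuery)"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- No URL supplied: output nothing -->
      <xsl:when test="$url = ''"/>
      <!-- URL only: convert the whole page -->
      <xsl:when test="$query = ''">
        <xsl:copy-of select="Html2Xml:Htm2XmlNodeset($url)"/>
      </xsl:when>
      <!-- URL plus XPath query: return only the matching nodes -->
      <xsl:otherwise>
        <xsl:copy-of select="Html2Xml:Htm2XmlQueryNodeset($url, $query)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Copying the result straight to the output is the simplest option. In practice you would more likely store the node set in a variable and loop over it with xsl:for-each to pick out the values you want.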
Examples
<umbraco:Macro URL="http://www.bbc.co.uk" Alias="Html2xml" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@class='yw-ulmwrap']//h1" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//dd[preceding-sibling::*[1][name()='dt' and contains(.,'Feels Like:')]]" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.yahoo.com/forecast/FRXX0016.html" xpathQuery="//div[@id='yui-main']/div[@class='yui-b']/div[@id='yw-forecast']//em" Alias="Html2xml_withxpquery" runat="server"></umbraco:Macro>
<umbraco:Macro URL="http://weather.noaa.gov/cgi-bin/mgetmetar.pl?cccc=EGPD&Submit=SUBMIT" Alias="Html2xml" runat="server"></umbraco:Macro>
It's my first package and it didn’t hurt too much. I would like to create more. I am sure lots of us have a bunch of useful things we never share for one reason or another, such as code being too specific to a client project, or not having the time to clean it up and publish it. Well, in my quest for karma, and also just to make a start, I have done it.
Yes, documentation is lacking, but at this stage it is only for those who know what the title means. When it is stable and I have some time I can create some worked examples of how to use doc types etc.
Finally, screen scraping is not polite and could be illegal, so you will be careful, won't you?
Feedback here