<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Update: Screenscraping HTML with TagSoup and XPath</title>
	<atom:link href="http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/</link>
	<description></description>
	<lastBuildDate>Fri, 19 Feb 2010 13:50:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: Tunc</title>
		<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/comment-page-1/#comment-152</link>
		<dc:creator>Tunc</dc:creator>
		<pubDate>Wed, 19 May 2004 23:59:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.hackdiary.com/?p=44#comment-152</guid>
		<description>I have a question.

When I run your example program I get
the following exception:

&quot;Prefix must resolve to a namespace: html&quot;

Any ideas what might be wrong?

Thanks a lot,
-tunc


</description>
		<content:encoded><![CDATA[<p>I have a question.</p>
<p>When I run your example program I get<br />
the following exception:</p>
<p>&#8220;Prefix must resolve to a namespace: html&#8221;</p>
<p>Any ideas what might be wrong?</p>
<p>Thanks a lot,<br />
-tunc</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: hannes wallnoefer</title>
		<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/comment-page-1/#comment-151</link>
		<dc:creator>hannes wallnoefer</dc:creator>
		<pubDate>Mon, 19 Jan 2004 15:32:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.hackdiary.com/?p=44#comment-151</guid>
		<description>TagSoup 0.8 does indeed have a problem with HTML entities. The fix is a one-liner: just add

theSize = 0;

at line 336 in file HTMLScanner.java after calling h.pcdata() and before falling through to A_SAVE_PUSH. I also sent an email to John Cowan about this, so hopefully, there&#039;ll be a fixed version out soon.
</description>
		<content:encoded><![CDATA[<p>TagSoup 0.8 does indeed have a problem with HTML entities. The fix is a one-liner: just add</p>
<p>theSize = 0;</p>
<p>at line 336 in file HTMLScanner.java after calling h.pcdata() and before falling through to A_SAVE_PUSH. I also sent an email to John Cowan about this, so hopefully, there&#8217;ll be a fixed version out soon.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Danny</title>
		<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/comment-page-1/#comment-150</link>
		<dc:creator>Danny</dc:creator>
		<pubDate>Thu, 01 Jan 2004 16:56:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.hackdiary.com/?p=44#comment-150</guid>
		<description>Nice one Matt! I saw the earlier article but held off trying that approach because of the need for JDOM. I&#039;m already using Xalan in my app, and I want to eat tag soup ;-)
</description>
		<content:encoded><![CDATA[<p>Nice one Matt! I saw the earlier article but held off trying that approach because of the need for JDOM. I&#8217;m already using Xalan in my app, and I want to eat tag soup ;-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Biddulph</title>
		<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/comment-page-1/#comment-149</link>
		<dc:creator>Matt Biddulph</dc:creator>
		<pubDate>Mon, 29 Dec 2003 13:12:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.hackdiary.com/?p=44#comment-149</guid>
		<description>That&#039;s because I was lazy in the code example and just used the toString() method on the XObject returned from the XPath evaluation. If you change the last line to:

org.w3c.dom.NodeList nodes = title.nodelist();
for (int i = 0; i &lt; nodes.getLength(); i++) {
    org.w3c.dom.Node node = nodes.item(i);
    System.out.println(&quot;Result &quot; + i + &quot;: &#039;&quot; + node.toString() + &quot;&#039;&quot;);
}

then it properly iterates through each separate node returned from the XPath expression, printing each one out. You get lines such as:

Result 5: &#039;&lt;div class=&quot;teaserText&quot;&gt;    &lt;span class=&quot;dateStamp&quot;&gt;28.12.03 22:30&lt;/span&gt; &#124;     Australske telemyndigheder sl &#124;     Australske telemyndigheder sl&amp;aringr fast, at de nye 3G-sendemaster har lavere str &#124;     Australske telemyndigheder sl&amp;aringr fast, at de nye 3G-sendemaster har lavere str&amp;aringling end almindelige mobilmaster og f.eks. taxiradioer.&lt;br clear=&quot;all&quot; /&gt;&lt;/div&gt;&#039;

which contain all the tags and entities found. You could also use the methods on the Node object to traverse the tree of divs, spans and text nodes to examine the individual parts instead of using toString().
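
As a rough sketch of that last suggestion (not the code from the original example: the class name NodeWalk and the hand-built sample tree are made up here, standing in for one node returned by the XPath query), walking the children with the standard org.w3c.dom methods might look like this:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeWalk {
    // Append one line per element or non-blank text node, indented by tree depth.
    public static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i != depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("element ").append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append("text '").append(text).append("'\n");
            }
        }
        // Visit the children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    // Hypothetical stand-in for a scraped teaser div: a span, some text and a br.
    public static Element sampleTeaser() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element div = doc.createElement("div");
        div.setAttribute("class", "teaserText");
        Element span = doc.createElement("span");
        span.setAttribute("class", "dateStamp");
        span.appendChild(doc.createTextNode("28.12.03 22:30"));
        div.appendChild(span);
        div.appendChild(doc.createTextNode(" | Example teaser text"));
        div.appendChild(doc.createElement("br"));
        return div;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder out = new StringBuilder();
        walk(sampleTeaser(), 0, out);
        System.out.print(out);
    }
}
```

In the scraping example you would pass each nodes.item(i) to walk() instead of the sample tree, which gives you the date stamp and the teaser text as separate pieces rather than one serialized string.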
</description>
		<content:encoded><![CDATA[<p>That&#8217;s because I was lazy in the code example and just used the toString() method on the XObject returned from the XPath evaluation. If you change the last line to:</p>
<p>org.w3c.dom.NodeList nodes = title.nodelist();<br />
for (int i = 0; i &lt; nodes.getLength(); i++) {<br />
org.w3c.dom.Node node = nodes.item(i);<br />
System.out.println(&quot;Result &quot; + i + &quot;: &#039;&quot; + node.toString() + &quot;&#039;&quot;);<br />
}</p>
<p>then it properly iterates through each separate node returned from the XPath expression, printing each one out. You get lines such as:</p>
<p>Result 5: &#039;&lt;div class=&quot;teaserText&quot;&gt;    &lt;span class=&quot;dateStamp&quot;&gt;28.12.03 22:30&lt;/span&gt; |     Australske telemyndigheder sl |     Australske telemyndigheder sl&amp;aringr fast, at de nye 3G-sendemaster har lavere str |     Australske telemyndigheder sl&amp;aringr fast, at de nye 3G-sendemaster har lavere str&amp;aringling end almindelige mobilmaster og f.eks. taxiradioer.&lt;br clear=&quot;all&quot; /&gt;&lt;/div&gt;&#039;</p>
<p>which contain all the tags and entities found. You could also use the methods on the Node object to traverse the tree of divs, spans and text nodes to examine the individual parts instead of using toString().</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Guan Yang</title>
		<link>http://www.hackdiary.com/2003/12/28/update-screenscraping-html-with-tagsoup-and-xpath/comment-page-1/#comment-148</link>
		<dc:creator>Guan Yang</dc:creator>
		<pubDate>Mon, 29 Dec 2003 12:48:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.hackdiary.com/?p=44#comment-148</guid>
		<description>Do entities work correctly with TagSoup? I have fetched the content of &lt;a href=&quot;http://www.berlingske.dk/&quot; rel=&quot;nofollow&quot;&gt;http://www.berlingske.dk/&lt;/a&gt; (a Danish newspaper) and run it through the code example above. When I select the path //html:div[@html:class=&#039;headline&#039;]/following-sibling::html:div[@html:class=&#039;teaserText&#039;] I get the following:

29.12.03 10:02 &#124;     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk &#124;     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&amp;aeliglvsramte iranske by Bam. Det meddelte den iranske R &#124;     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&amp;aeliglvsramte iranske by Bam. Det meddelte den iranske R&amp;oslashde Halvm &#124;     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&amp;aeliglvsramte iranske by Bam. Det meddelte den iranske R&amp;oslashde Halvm&amp;aringne mandag.

It&#039;s supposed to be:

&#124; Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk</description>
		<content:encoded><![CDATA[<p>Do entities work correctly with TagSoup? I have fetched the content of <a href="http://www.berlingske.dk/" rel="nofollow">http://www.berlingske.dk/</a> (a Danish newspaper) and run it through the code example above. When I select the path //html:div[@html:class='headline']/following-sibling::html:div[@html:class='teaserText'] I get the following:</p>
<p>29.12.03 10:02 |     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk |     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&#038;aeliglvsramte iranske by Bam. Det meddelte den iranske R |     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&#038;aeliglvsramte iranske by Bam. Det meddelte den iranske R&#038;oslashde Halvm |     Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&#038;aeliglvsramte iranske by Bam. Det meddelte den iranske R&#038;oslashde Halvm&#038;aringne mandag.</p>
<p>It&#8217;s supposed to be:</p>
<p>| Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk</p>
]]></content:encoded>
	</item>
</channel>
</rss>
