Update: Screenscraping HTML with TagSoup and XPath

December 28th, 2003 | Published in java | 5 Comments

UPDATE: Oliver Roup has published updated code that uses the built-in XPath processor in JDK 1.5.

Some emails and comments on Screenscraping HTML with TagSoup and XPath alerted me to the fact that the example I gave on that page has gone out of sync with the current release of JDOM and no longer works. I’ve reworked the example using Xalan 2.5.

The problem seems to be that JDOM is asking the TagSoup parser for full namespace support, which it’s not able to give. This new example uses Xalan’s SAX2DOM class to make a DOM tree out of the TagSoup SAX stream, then uses the simple XPathAPI wrapper to make the XPath call.

import java.net.URL;
import org.apache.xalan.xsltc.trax.SAX2DOM;
import org.apache.xpath.XPathAPI;
import org.apache.xpath.objects.XObject;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
 
public class example {
 public final static void main(String[] args) throws Exception {
  URL url = new URL("https://example.com");
  Parser p = new Parser();
  // report xmlns attributes so that the html: prefix is defined (off by default)
  p.setFeature("http://xml.org/sax/features/namespace-prefixes",true);
  SAX2DOM sax2dom = new SAX2DOM();
  p.setContentHandler(sax2dom);
  p.parse(new InputSource(url.openStream()));
  Node doc = sax2dom.getDOM();
  String titlePath = "/html:html/html:head/html:title";
  XObject title = XPathAPI.eval(doc,titlePath);
  System.out.println("Title is '"+title+"'");
 }
}

This code example can be compiled and run with just the TagSoup classes and the Xalan 2.5 main jar on the classpath.
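
The SAX-to-DOM step described above is not specific to Xalan's SAX2DOM class. As a minimal sketch, assuming only the JAXP classes bundled with the JDK, an identity TransformerHandler writing into a DOMResult does the same job for any XMLReader. The class name SaxToDomSketch and the method buildDom are made up for illustration, and the JDK's own SAX parser stands in for TagSoup so the sketch runs without extra jars.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical class name, used only for this sketch.
public class SaxToDomSketch {

    /** Feeds the SAX events produced by the reader into a DOM tree and returns its root node. */
    public static Node buildDom(XMLReader reader, InputSource in) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        // A handler created without a stylesheet performs an identity transform.
        TransformerHandler handler = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);
        reader.setContentHandler(handler);
        reader.parse(in);
        return result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the TagSoup Parser: the JDK's namespace-aware SAX parser.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        String page = "<html><head><title>Hello</title></head><body/></html>";
        Node doc = buildDom(reader, new InputSource(new StringReader(page)));
        System.out.println("Root element is '" + doc.getFirstChild().getNodeName() + "'");
    }
}
```

Since the TagSoup Parser is itself an XMLReader, it could in principle be passed to buildDom in place of the JDK reader, though that combination is untested here.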

Responses

Guan Yang says:

December 29th, 2003 at 12:48 pm (#)

Do entities work correctly with TagSoup? I have fetched the content of http://www.berlingske.dk/ (a Danish newspaper) and run it through the code example above. When I select the path //html:div[@html:class='headline']/following-sibling::html:div[@html:class='teaserText'] I get the following:

29.12.03 10:02 | Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk | Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&aeliglvsramte iranske by Bam. Det meddelte den iranske R | Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&aeliglvsramte iranske by Bam. Det meddelte den iranske R&oslashde Halvm | Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk&aeliglvsramte iranske by Bam. Det meddelte den iranske R&oslashde Halvm&aringne mandag.

It’s supposed to be:

| Over 2000 mennesker er blevet reddet ud i live fra ruiner i den jordsk

Matt Biddulph says:

December 29th, 2003 at 1:12 pm (#)

That’s because I was lazy in the code example and just used the toString() method on the XObject returned from xpath evaluation. If you change the last line to:

org.w3c.dom.NodeList nodes = title.nodelist();
for (int i = 0; i < nodes.getLength(); i++) {
  Node node = nodes.item(i);
  System.out.println("Result " + i + ": '" + node.toString() + "'");
}

then it properly iterates through each separate node returned from the xpath, printing it out. You get lines such as:

Result 5: '<div class="teaserText"> <span class="dateStamp">28.12.03 22:30</span> | Australske telemyndigheder sl | Australske telemyndigheder sl&aringr fast, at de nye 3G-sendemaster har lavere str | Australske telemyndigheder sl&aringr fast, at de nye 3G-sendemaster har lavere str&aringling end almindelige mobilmaster og f.eks. taxiradioer.<br clear="all" /></div>'

which contain all the tags and entities found. You could also use the methods on the Node object to traverse the tree of divs, spans and text nodes to examine the individual parts instead of using toString().
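
Traversing the tree with the Node methods could look roughly like the following. This is a minimal sketch using only the org.w3c.dom interfaces and the JDK's DOM parser; the class name NodeWalkSketch, the helper names and the sample markup are made up for illustration and are not taken from the page discussed above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
public class NodeWalkSketch {

    /** Appends the value of every text node at or below n, in document order. */
    public static void collectText(Node n, StringBuilder out) {
        if (n.getNodeType() == Node.TEXT_NODE) {
            out.append(n.getNodeValue());
        }
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    /** Convenience wrapper returning the collected text as a String. */
    public static String textOf(Node n) {
        StringBuilder out = new StringBuilder();
        collectText(n, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample markup with the same shape as the teaser divs.
        String xml = "<div class=\"teaserText\">"
                   + "<span class=\"dateStamp\">28.12.03 22:30</span>"
                   + " | Example teaser<br/></div>";
        Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        System.out.println(textOf(doc)); // prints "28.12.03 22:30 | Example teaser"
    }
}
```

Applied to each node returned by the XPath call, a walk like this yields the text without the surrounding tags.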

Danny says:

January 1st, 2004 at 4:56 pm (#)

Nice one Matt! I saw the earlier article but held off trying that approach because of the need for JDOM. I’m already using Xalan in my app, and I want to eat tag soup ;-)

hannes wallnoefer says:

January 19th, 2004 at 3:32 pm (#)

TagSoup 0.8 does indeed have a problem with HTML entities. The fix is a one-liner: just add

theSize = 0;

at line 336 in file HTMLScanner.java after calling h.pcdata() and before falling through to A_SAVE_PUSH. I also sent an email to John Cowan about this, so hopefully, there’ll be a fixed version out soon.

Tunc says:

May 19th, 2004 at 11:59 pm (#)

I have a question.

When I run your example program I get the following exception:

“Prefix must resolve to a namespace: html”

Any ideas what might be wrong?

Thanks a lot,
-tunc
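
One possible reading of that exception, offered as an assumption rather than a diagnosis, is that the XPath engine finds no binding for the html prefix in the parsed tree. A minimal sketch of binding the prefix explicitly, using the javax.xml.xpath API that ships with JDK 1.5 and later instead of XPathAPI, might look like this. The class name PrefixBindingSketch and its helper methods are made up for illustration, and the sample page is parsed with the JDK's DOM parser rather than TagSoup.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
public class PrefixBindingSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Maps the html prefix to the XHTML namespace, independent of the document. */
    static final NamespaceContext HTML_PREFIX = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String p = getPrefix(uri);
            List<String> found = (p == null)
                ? Collections.<String>emptyList()
                : Collections.singletonList(p);
            return found.iterator();
        }
    };

    /** Parses a well-formed page into a namespace-aware DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates the title path from the article with the html prefix bound explicitly. */
    public static String title(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(HTML_PREFIX);
        return xpath.evaluate("/html:html/html:head/html:title", doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html xmlns=\"" + XHTML_NS + "\">"
                    + "<head><title>Hello</title></head><body/></html>";
        System.out.println("Title is '" + title(parse(page)) + "'"); // prints "Title is 'Hello'"
    }
}
```

Because the prefix is supplied by the NamespaceContext, the query does not depend on the document declaring xmlns:html itself.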
