This post covers an algorithm I recently implemented, it was originally developed by a group of researchers at Stanford. The original paper can be located here: http://ilpubs.stanford.edu:8090/832/1/2008-15.pdf
Extracting content from news articles is not necessarily a straightforward tasks. First, the HTML itself often gets in the way. Scripts, CSS, and strange layouts can be cumbersome to navigate in an automated fashion. Second, most news articles these days have discussion threads attached at the bottom, which presents a challenge if you are looking for the largest blocks of text. An obvious approach is to use site specific wrappers and use machine learning techniques trained on representative samples PER SITE. This is obviously going to be impossible to maintain, unless you have the power of a Google or Microsoft. Even then, you probably don’t want to.
The approach taken essentially walks an HTML’s Document Object Model (DOM) and calculates a simple ratio of text to links. Ignoring specific tags, such as script, head, etc., most of the “noise” can be filtered out. The basic algorithm works as follows.
For every node, count the number of tokens (words) and links. These are recursive definitions – if the node contained children, the count of tokens would include the number of tokens of the children as well. Then, the node with the highest ratio of text to link is counted as the body. A further improvement can be noted that a given node has to be above some threshold to be included in the parent structure. This can happen when news organizations break up their stories, such as to insert an advertisement. This way, you are tracking a “set” of nodes. That’s basically it! Let’s talk implementation.
I used TagSoup (www.ccil.org/~cowan/XML/tagsoup/) a Java library for parsing HTML, and DOM4j to represent the model. Here is a sample function:
public Document getDocument(InputStream stream) {
Document document = null;
Parser p = new Parser();
p.setContentHandler(new DefaultHandler());
try {
InputSource source = new InputSource(stream);
SAXContentHandler handler = new SAXContentHandler();
p.setContentHandler(handler);
p.parse(source);
document = handler.getDocument();
} catch (IOException e) {
logger.info("IOException for InputStream in getDocument.", e);
} catch (SAXException e) {
logger.info("Exception parsing XML document", e);
}
return document;
}
At this point, the HTML page is represented as an XML Document. From here, we can walk the document and track the score of each element. This can be done recursively fairly easily. For instance, if we start off with an Element, we can iterate its children, score them, and determine whether they pass the threshold or not. If they do, add them to the Element’s score. Lastly, score the content of the element itself.
The full recursive definition is below.
* Scores a specific element.
*
* @param element element
* @return a meta data object
*/
protected ElementMetaData score(Element element) {
ElementMetaData data = new ElementMetaData(element);
if (NewsArticleBodyUtils.isIgnoredNode(element)) {
return data;
}
if (NewsArticleBodyUtils.isTerminalNode(element)) {
if (NewsArticleBodyUtils.isLinkNode(element)) {
data.setTextCnt(1);
data.setLinkCnt(1);
} else if (NewsArticleBodyUtils.isTextNode(element)) {
String text = element.getTextTrim();
String[] tokens = text.split(" ");
double l = (tokens.length > 0) ? tokens.length : 1;
data.setTextCnt(l);
data.setLinkCnt(0);
}
} else {
for (int i = 0; i < element.elements().size(); i++) {
Element e = (Element) element.elements().get(i);
ElementMetaData childData = score(e);
double childRatio = (childData.getTextCnt() - childData.getLinkCnt()) / childData.getTextCnt();
// add any children that have a significant amount of text.
if (childRatio > threshold) {
data.add(childData);
}
}
}
double scoreNode = scoreNode(data);
// TODO We overwrite matching scores, which could be a problem...
scoreNodes.put(scoreNode, element);
metaMap.put(scoreNode, data);
return data;
}
There are helper functions indicating which nodes should be ignored, flagged as link nodes, flagged as terminal nodes, and so forth. A meta data object is available to store the link and text counts for different elements.
