Skip navigation

This post covers an algorithm I recently implemented, it was originally developed by a group of researchers at Stanford. The original paper can be located here: http://ilpubs.stanford.edu:8090/832/1/2008-15.pdf

Extracting content from news articles is not necessarily a straightforward tasks. First, the HTML itself often gets in the way. Scripts, CSS, and strange layouts can be cumbersome to navigate in an automated fashion. Second, most news articles these days have discussion threads attached at the bottom, which presents a challenge if you are looking for the largest blocks of text. An obvious approach is to use site specific wrappers and use machine learning techniques trained on representative samples PER SITE. This is obviously going to be impossible to maintain, unless you have the power of a Google or Microsoft. Even then, you probably don’t want to.

The approach taken essentially walks an HTML’s Document Object Model (DOM) and calculates a simple ratio of text to links. Ignoring specific tags, such as script, head, etc., most of the “noise” can be filtered out. The basic algorithm works as follows.

For every node, count the number of tokens (words) and links. These are recursive definitions – if the node contained children, the count of tokens would include the number of tokens of the children as well. Then, the node with the highest ratio of text to link is counted as the body. A further improvement can be noted that a given node has to be above some threshold to be included in the parent structure. This can happen when news organizations break up their stories, such as to insert an advertisement. This way, you are tracking a “set” of nodes. That’s basically it! Let’s talk implementation.

I used TagSoup (www.ccil.org/~cowan/XML/tagsoup/) a Java library for parsing HTML, and DOM4j to represent the model. Here is a sample function:

public Document getDocument(InputStream stream) {
        Document document = null;
        Parser p = new Parser();
        p.setContentHandler(new DefaultHandler());
        try {
            InputSource source = new InputSource(stream);
            SAXContentHandler handler = new SAXContentHandler();
            p.setContentHandler(handler);
            p.parse(source);

            document = handler.getDocument();
        } catch (IOException e) {
            logger.info("IOException for InputStream in getDocument.", e);
        } catch (SAXException e) {
            logger.info("Exception parsing XML document", e);
        }
        return document;
    }

At this point, the HTML page is represented as an XML Document. From here, we can walk the document and track the score of each element. This can be done recursively fairly easily. For instance, if we start off with an Element, we can iterate its children, score them, and determine whether they pass the threshold or not. If they do, add them to the Element’s score. Lastly, score the content of the element itself.

The full recursive definition is below.

     * Scores a specific element.
     *
     * @param element element
     * @return a meta data object
     */
    protected ElementMetaData score(Element element) {
        ElementMetaData data = new ElementMetaData(element);
        if (NewsArticleBodyUtils.isIgnoredNode(element)) {
            return data;
        }

        if (NewsArticleBodyUtils.isTerminalNode(element)) {
            if (NewsArticleBodyUtils.isLinkNode(element)) {
                data.setTextCnt(1);
                data.setLinkCnt(1);
            } else if (NewsArticleBodyUtils.isTextNode(element)) {
                String text = element.getTextTrim();
                String[] tokens = text.split(" ");
                double l = (tokens.length > 0) ? tokens.length : 1;
                data.setTextCnt(l);
                data.setLinkCnt(0);
            }
        } else {
            for (int i = 0; i < element.elements().size(); i++) {
                Element e = (Element) element.elements().get(i);
                ElementMetaData childData = score(e);
                double childRatio = (childData.getTextCnt() - childData.getLinkCnt()) / childData.getTextCnt();
                // add any children that have a significant amount of text.
                if (childRatio > threshold) {
                    data.add(childData);
                }
            }
        }
        double scoreNode = scoreNode(data);

        // TODO We overwrite matching scores, which could be a problem...
        scoreNodes.put(scoreNode, element);
        metaMap.put(scoreNode, data);
        return data;
    }

There are helper functions indicating which nodes should be ignored, flagged as link nodes, flagged as terminal nodes, and so forth. A meta data object is available to store the link and text counts for different elements.

On a recent project I had to send a collection of objects from JavaScript to an embedded Flash application. Each object was an associative array containing properties options for different elements in the Flash app. It turned out fairly cumbersome to actually do so! First, let’s review the ways that an associative array can be created in JavaScript:

1) An object literal

var arr = {name: "edge-color", group: "edges"};

2) An object

var arr = new Object();
arr.name = "edge-color"
arr.group = "edges";

3) An array literal

var arr = ["name": "edge-color", "group" : "edges"];

4) An array object

var arr = new Array();
arr.name = "edge-color"
arr.group = "edges";

All of these are acceptable means of creating an associative array. Next, we need to send this information to the Flash object using the ActionScript ExternalInterface class. This is where our trouble starts. If you use either of the options 3 and 4, this will NOT WORK! Array is a special class that has strange underpinnings. Only Object will be translated, and the most reliable way to do that is by converting to JSON. First, if you are using a modern browser the function JSON.stringify(Object) should be available to you without any special include on JavaScript.

On the ActionScript side, decoding JSON is equally straightforward:

var object:Object = JSON.decode(s);

I’ll leave it up to the reader to figure everything else out…

I started the actual process of writing my thesis two weeks ago. Here is the running log of what I have accomplished and the page counts. I need to have a rough draft out by the 25th.

June 10th – 10 pages. Introduction, Problem Description.

June 16th – 20 pages. Background Research, Some Implementation details.

June 17th – 28 pages. Screenshots and some results filling in.

June 19th – 39 pages. Added discussion, visualization sections. More results section, FoxNews integration.

June 20th – 52 pages. Rough draft complete.

[This is an entry about my thesis, it might be nonsensical.]

The clustering database is used to keep a ‘running log’ of the clusters in the system. After each crawl, single-link agglomerative clustering is performed on the resulting index and compared to the clustering database. Since each crawl is an independent snapshot of the articles, a method for comparing the results of different clusters was required. The clustered results are first partitioned, resulting in sub-clusters that can be compared to the database entries using an inter cluster comparison method.

Cluster comparison is done using a simple method where the individual url’s of each cluster are compared. The current cluster is compared to the existing database, and if the current cluster contains all of the items as an existing cluster, it is assigned that cluster’s id. Next, the cluster’s are written to the database. Any updated clusters are appended with the new url’s, and new clusters are added to the database.

A recent project I’m working on has a need for storing a Multimap, where the key is a String and the values is a Set. I’m trying to keep this as abstract as possible, but basically, these sets of values can grow over time and are compared to each other in a large scale fashion (in line of millions of records).

While Google’s Multimap collection is a great API to use, it turns out that their implementations are slow for this particular use case. In general, they are very concerned with maintaining data integrity when they expose their underlying objects, so if you change the exposed data it changes the internal multimap data. This adds a lot of overhead on iterating, which I have to do a lot of. In fact, profiling the application indicates that more time is spent iterating than on the expensive comparison operation we do.

These are the three implementations tested: HashMultimap, LinkedHashMultimap, and an internal StringMultiMap that isn’t concerned with exposing the underlying objects (and doesn’t use their WrappedIterator code). The StringMultiMap is extremely fast compared to the others. Obviously LinkedHashMultiMap is slower then HashMultimap. The chart loses some value since I can’t really describe the entire task, but basically, iterating and putting values is a major component of the task, and the task is held fixed while only the data structure changed.

Clustering is used in my system as a “first-pass” approach, reducing the potential similarity candidates to a manageable size. In large information spaces, clustering is very advantageous. Unfortunately, in large information spaces, hierarchical agglomerative clustering is still very expensive and memory intensive. The basic HAC algorithm works as follows:

  1. Assign every item a cluster
  2. For each cluster, calculate the difference between all others
  3. Merge the closest clusters, forming a new cluster
  4. Remove merged clusters, update distances
  5. Continue until one cluster is left

For text processing, the difference metric is usually a vector similarity calculation. It is at this point that I was faced with a decision: optimize for speed or optimize for memory requirements. The problem is that for speed, all the vectors are stored in memory as double[]. On our test data sets, which are CNN crawls at depth 5 (resulting in around 5000 documents per crawl), there are 140k unique terms. We reduce these term based on word frequency and uniqueness to around 60k terms. That still leaves us with 5000 double[60k] which is somewhere on the order of 2.4GB.

Therefore, we shifted our strategy slightly. Distance calculations are now done on a sliding window with only 1000 vector stored in memory at any one time. This results in some deduplication of the creation of the vector (which is decently expensive on the dataset of this size), but greatly stabilizes the memory performance.

With this phase complete, we are moving on to comparisons of items within clusters at a specified cutoff point.

I gave this presentation to the KEWI Group at the University of Nebraska at Omaha. It’s a tutorial on OpenNLP, and discusses using the library for document or text classification.

I’ve used OpenNLP in both work and academic settings and it works phenomenally well.

Check it out!

This is more for reference than anything. But the interaction with simple HBase use cases isn’t always well documented.

So, let’s say we are designing a table to track news articles. We can assume we have the parsed text, the url, the date retrieved, at a minimum. Logically, HBase breaks things down by column families identified by a primary key. In this case, the key is our url. We have column families ‘text’ and ‘meta’ information.

Using the HBase shell script, we can create, add, and query data. Creating the table is simple:

create 'articles', {NAME => 'url'}, {NAME => 'text'}, {NAME => 'meta'}

This tells HBase to create an articles table with url, text, and meta column families. It’s not necessary to specify the internal designations, such as text:parsed, or text:tokens.

To put data, you can specify individual columns to add data to.

put 'articles', 'http://www.cnn.com', 'text:parsed', 'parsed text'

This says, in the articles table, at the row identified by ‘http://www.cnn.com’, in the column ‘text’ with key ‘parsed’, put ‘parsed text’. I am probably garbling terminology here.

To query, we can do:

get 'articles', 'http://www.cnn.com'

With the response:

COLUMN                       CELL
 text:parsed                 timestamp=1251063919650, value=parsed text

That’s about it.

In an earlier post, I had attempted to add the full content to the Nutch index. I had to settle for just entering the body, as the original stream was in another plugin, the parse-html plugin. Today I sat down and modified everything I will need for the experiment.

The source of this is that I need to store 30 days worth of a news website, but many of the links end up being temporary or change (don’t ask me why, some news websites do this for some reason). So in order to do this in a manner that can be recreated without having to recrawl or manually mark the data somehow, I need to store the original content. I don’t care about the images or the style, just the content itself at this point.

To do this I had to modify two plugins in Nutch: index-basic and parse-html. I added two fields to index-basic, one for the extracted body text, and one for the original content including HTML markup. To do this second piece, I had to modify the parse-html plugin to preserve the original input stream. This turned out to be pretty simple as the parse-html uses an interface that is a map-backed object.

Next step will be running Nutch in an automated fashion for a week, and seeing if the indexing works properly and permits me to load and scan each day. If that works, then we can move out to a month.

This happened using Nutch 1.0. When I went to import it into Eclipse, it gave me an error because it has an incomplete JRE set somewhere. This is a recurring theme, but I’m not entirely sure what is causing it, nor do I care enough to investigate it. So, logging it and moving on: I just had to set the java compiler to the default. Done.

Next – compiling. The base 1.0 didn’t want to compile for me. Was complaining about something on line 62. This sequence in particular:

      
      <touch datetime="01/25/1971 2:00 pm">
      <fileset dir="${conf.dir}" includes="**/*.template"/>
    </touch>
    

I removed that, it built. Success.

I need to find a way to add the full HTML content to the index, unfortunately since Nutch rolls their own parser and doesn’t use Tika this is going to be difficult. I’ll have to modify every parser to add the full content to the document. For now, the solution I have *may* work. I will try running it and seeing what I get for content. If it’s parsable/usable by everything I’ll leave it be.

Follow

Get every new post delivered to your Inbox.