Building a Web Scraper with Dream

In this tutorial, we create a screen-scraping application that works without an API. It sends a search to a website, pulls the results from the returned page, and outputs a list of all the returned articles as a new web page. Building and using it is straightforward. The same concepts apply to applications or websites that create mashups of various web services.


This tutorial shows you how to:
  • Create and build URIs using the XUri class
  • Send HTTP requests using the Plug class
  • Build an XHTML document using the XDoc class

Getting Started

As with all Dream applications, we must add mindtouch.dream.dll to our project references and add a using statement to our source file in order to use the XUri, Plug, and XDoc classes.

using MindTouch.Dream;

Once that is done, we create a form that prompts the user with a search field. It should look something like this, but feel free to customize it any way you want:

[Figure: the search form]

In the Search... button's properties, under Events, activate the Click event. The handler generated for this event is where we invoke the function that handles the user's input.

 


private void SearchClicked(object sender, EventArgs e){
    HandleSearch(searchBox.Text);
}

With the preliminaries set up, we create the function HandleSearch(string user_input) that handles the user's request. In our example, we use the New York Times' article search engine, but feel free to use any site you like.

(Note: The site you are screen scraping is not aware that you are doing so, which means its markup can change without warning and break your application. In addition, we recommend you use a site that produces structurally correct DOM trees. If it does not, the XDoc class may have trouble traversing the page and may not return the wanted results.)

Creating XUris and Plugs

Our first step in HandleSearch is to ask the site for its search results. To do this, we build a URI and give it to the Plug class so that it can make the request. We start with the base URI:

XUri nyt_xuri = new XUri("http://query.nytimes.com/");

We then append the path segments we want with At() and the URI query parameters with With(). Note that the order of the query parameters is not relevant.

XUri nyt_full_uri = nyt_xuri.At("search", "query").With("query", user_input).With("srchst", "nyt").With("n", "100");


The above call is equivalent to entering the following URI, where user_input stands for the text the user searched for:

 http://query.nytimes.com/search/query?query=user_input&srchst=nyt&n=100
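
To make the shape of that URI concrete, here is a minimal sketch that assembles the same string with plain .NET instead of Dream. The class and method names (SearchUriSketch, Build) are hypothetical, and the sketch makes no claim about how XUri encodes values internally. It only shows why the user's text should be percent-encoded before it lands in a query string.

```csharp
using System;

public static class SearchUriSketch
{
    // Plain .NET stand-in for the At(...).With(...) chain above; this is not Dream code.
    public static string Build(string userInput)
    {
        // Percent-encode the search text so spaces or '&' cannot break the query string.
        string query = Uri.EscapeDataString(userInput);

        return "http://query.nytimes.com/search/query"
            + "?query=" + query
            + "&srchst=nyt"
            + "&n=100";
    }
}
```

Calling SearchUriSketch.Build("new york") encodes the space as %20, so the search text stays a single query parameter value.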

 

Next, we create a Plug with the XUri we just built and request a response from the webpage using a Get() call.

(Note: The compiler issues a warning for the Get() method because it blocks the calling thread. Just ignore it for now.)

Plug plug = Plug.New(nyt_full_uri);
DreamMessage message = plug.Get();

 

Creating and Extracting with XDoc

Once we have the DreamMessage, we request the response to be returned as an XDoc.

XDoc doc = message.AsDocument();

 

At the same time, we create another XDoc with the intent to build a web page. We create the necessary <html>, <head>, and <body> elements, etc., using Start(element_name), which opens a new element (e.g. <element_name>). The Elem(element_name, text) call creates a complete element with an internal text node (e.g. <element_name>text</element_name>). Finally, the End() call closes the current element (e.g. </element_name>).

XDoc output = new XDoc("html");                     // html wrapper
output.Start("head").Elem("title", user_input).End();
output.Start("body");
output.Elem("h3", "Search Result for: " + user_input);
output.Start("table").Attr("border", "1");          // table for results

// walk the DOM tree of the response down to the list of results
foreach(XDoc entry in doc["body/div/div/div/ol/li"]) {

    // one table row per returned article
    output.Start("tr").Start("td");
    output.Add(entry["h3"]);    // article title and link
    output.Add(entry["p"]);     // article snippet
    output.Add(entry["div"]);   // article info (author, date, number of words)
    output.End().End();         // close off td and tr
}
output.End().End();             // close off table and body

The HTML equivalent of the output document built above is:

<html>
     <head>
          <title>[user_input]</title>
     </head>
     <body>
          <h3>Search Result for: [user_input]</h3>
          <table border="1">
               <tr><td>
                    <h3><a>article title and link</a></h3>
                    <p><a>article snippet</a></p>
                    <div><a>article info</a></div>
               </td></tr>
               <tr><td>
                    ...
               </td></tr>
          </table>
     </body>
</html>
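
Note that the path given to doc[...] in the loop starts at body, not at html. As one of the comments on this page points out, XDoc paths should not begin with the name of the root element. The sketch below illustrates the same effect with System.Xml rather than Dream, on the assumption that the XDoc indexer evaluates its path relative to the root element in a similar way. The class and method names (RelativePathSketch, CountFromRoot) are hypothetical.

```csharp
using System.Xml;

public static class RelativePathSketch
{
    // Counts the nodes an XPath matches when it is evaluated from the root element.
    // System.Xml stand-in for the XDoc indexer; this is not Dream code.
    public static int CountFromRoot(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // The context node is the root element (<html>), so paths are relative to it.
        return doc.DocumentElement.SelectNodes(xpath).Count;
    }
}
```

For a page such as <html><body><p>a</p><p>b</p></body></html>, the path "body/p" matches both paragraphs, while "html/body/p" matches nothing, because it looks for an html element nested inside the root html element.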

 

Running the Application

After we're finished creating the XDoc, we write the XHTML document to a randomly named temp file. Path and File come from System.IO, so that namespace needs a using statement as well.

string filename = Path.GetTempFileName()+".html";
File.WriteAllText(filename, output.ToXHtml());
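
One detail worth knowing: Path.GetTempFileName() creates an empty .tmp file on disk and returns its path, so appending ".html" writes to a second file and leaves the empty one behind. A possible alternative, sketched here with a hypothetical helper (TempHtmlSketch.Write), builds the name without creating a file first:

```csharp
using System.IO;

public static class TempHtmlSketch
{
    // Writes the given markup to a randomly named .html file in the temp directory
    // and returns the full path of that file.
    public static string Write(string xhtml)
    {
        // GetRandomFileName only returns a name; it does not create a file on disk.
        string name = Path.ChangeExtension(Path.GetRandomFileName(), ".html");
        string filename = Path.Combine(Path.GetTempPath(), name);

        File.WriteAllText(filename, xhtml);
        return filename;
    }
}
```

In the tutorial's code this would be called as TempHtmlSketch.Write(output.ToXHtml()).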

 

Finally, we ask the operating system to open a browser and display the file containing the screen-scraped article list we just created.

Async.ExecuteProcess("explorer.exe", filename, Stream.Null, new Result<Tuple<int, Stream, Stream>>());
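
The call above launches explorer.exe, which ties the example to Windows. If that is a concern, a hedged alternative is to let the operating system pick the program registered for .html files, using only System.Diagnostics. The helper names here (OpenInBrowserSketch, CreateStartInfo, Open) are hypothetical, and this is a sketch rather than a replacement endorsed by the Dream library.

```csharp
using System.Diagnostics;

public static class OpenInBrowserSketch
{
    // Describes how to open the file: UseShellExecute asks the OS shell to choose
    // the application associated with the file type instead of running it directly.
    public static ProcessStartInfo CreateStartInfo(string filename)
    {
        return new ProcessStartInfo(filename) { UseShellExecute = true };
    }

    // Opens the file with its associated application, typically the default browser.
    public static void Open(string filename)
    {
        Process.Start(CreateStartInfo(filename));
    }
}
```

Setting UseShellExecute explicitly matters because its default value differs between .NET Framework and newer .NET versions.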

 

You now have a simple web-based search application. Here is a sample of the output. We entered "weather" in the search form, and these are the results returned from the New York Times website:

[Figure: sample search results for "weather"]

 

Though the method used in this sample is a screen scrape, the mechanics are very similar to using the REST API of any web service. The reason is that REST web services work on the same principles as web pages.

Conclusion

Now that our application is complete, we have seen that the XDoc, Plug, and XUri classes provide an easy and fast way to pull a page from a website, traverse its HTML document, and write a new HTML document with the contents pulled from that site. In review, these are the methods in the Dream classes that we used to fulfill our objective:

 

  • XUri:
    • At() builds the Path of the URI
    • With() builds the query string of the URI
  • Plug:
    • New() creates a new Plug based on the passed in XUri or Uri
    • Get() makes a blocking (synchronous) HTTP GET call to the provided URI
  • XDoc:
    • Start() creates the start tag of any element (e.g. <foo>)
    • Elem() creates a set of tags with a text node in the middle (e.g.  <foo>BAR</foo>)
    • End() creates the close tag of any element it matches (e.g. </foo>)
    • Attr() adds an attribute to the most recently opened element (e.g. <foo style="color:blue;">)
    • Add() adds any XDoc node to the XML/HTML/XHTML document you are building
  • DreamMessage: (not the focus of the tutorial, but it allows you to specify the response format)
    • AsDocument() returns the response as an XDoc
    • AsBytes() returns the response as a byte array
    • AsStream() returns the response as a Stream
    • AsText() returns the response as text
    • AsTextReader() returns the response as a TextReader

Comments
I've been experimenting with this sample. I can't seem to make the XPath indexer work properly. I've tried numerous XPath expressions and reviewed your source code without seeing an obvious error. I guess the most obvious cause is that the web page I'm consuming is not XHTML compliant. Is there any way of using ToXHtml or something like that to parse crappy HTML pages with this approach?
Posted 11:55, 12 Nov 2008
Yes, you can use XPath on crappy HTML pages too, thanks to the SgmlReader component. One thing to check is whether your XPath starts with the root element name (a common mistake). For example, to obtain the first <p> element inside the <body> element of an HTML page, do this: doc["body/p"]. Do NOT do this: doc["html/body/p"]. Hope that helps!
Posted 15:02, 12 Nov 2008