Apr 18


Download Source Code: YahooMoviesScrapper.zip - 18.17KB

---
 

This article presents a generic base class for web scrapers, based on sequential parsing of the text content (NOT on regular expressions). It works on static HTML content only, so client-side JavaScript and dynamic HTML are neither supported nor executed. The online demo is a simple real-time scraper for some Yahoo!Movies pages.

 
---

Overview

With the huge amount of public data available on the Internet, web scraping has become a widely used method to extract full or partial content from HTML pages and apply simple transformations to get a different presentation format. The most popular use case for web scraping appeared with search engines, and companies like AltaVista, Yahoo! and lately Google are some of the most amazing success stories of the modern Internet era. Search engines based their business model on indexing web pages from across the World Wide Web and making it easier for web users to locate them, based on specific keywords and context.

Simple extraction and translation of some web table data from Yahoo!Movies

Another frequent use of web scraping is collecting structured or semi-structured content from web pages and exposing the data in a new presentation format. Data may be extracted from different sources with similar content, and aggregation techniques are then applied to merge the extracted information.

We won't get into the legal issues and debates around stealing and using copyrighted data here. Another article of ours already presents many cases where data is either public, owned by the company itself, or simply available for web scrapers to parse and add value with more advertising.

Different terms are used for automated programs that transparently visit web pages and collect data from them, without user interaction. One is web spider, which brings to mind the huge number of web pages parsed in a couple of minutes. An equivalent term is web crawler, which suggests how search engines gather their data: by visiting a large number of sites and following many or all hyperlinks. When the main goal is partial data extraction, the most appropriate term is probably web scraper, which we will use from now on.

There are many articles on the web about web scrapers and search engines. Surprisingly, few open-source projects, in .NET or on other platforms, show you how to implement web scrapers. A number of companies offer web scraping services for huge amounts of money. So are scraping techniques really that difficult to implement? Not really, and most scrapers use regular expressions to do it quickly. The problem is that the content of web pages is so specific to each site, and changes so often, that you cannot assume your scraper will still recognize those pages in just a couple of weeks. Maintenance costs are high, and this is one reason why you'll find people offering custom and proprietary solutions.

Two opposite extremes can also be seen among the creators of web scrapers. Some try to create standards, formalize web scraping structures and offer general products that are supposed to work on most sites. That's good: we would like to have such software products and quickly build our own scrapers with them. The problem is that such products tend to become too complex. For instance, some try to describe the structure of a page in XML files, in a custom format you have to learn. The goal is that, when the page changes, you adapt only your XML data file, without touching the code. But the truth is that in most cases you need more flexibility, to change and adapt your code as well.

The other tendency is to build every new scraper from scratch. It's hard to find a publicly available open-source library with some basic functionality for scrapers. And when one is available, in most cases it uses regular expressions and is entirely custom-made. If you are already familiar with the cryptic language of regular expressions, that's fine. But if you're not, you may spend hours understanding what each expression is looking for before you can adapt it to your particular crawler.

WebScrapperBase Class

We'll try to present here something in between: a basic class with primitive helper functions for web scraping, NOT using regular expressions. The code is free to use and adapt to your own needs. We'll keep it very flexible and easy to adapt, and we'll cover most of the areas where simple methods will greatly help you build a scraper in just a few minutes.

The functionality will be limited, at least for now, to scraping data from static HTTP content only. This means that no JavaScript is executed on the client side after the page is returned, and no dynamic HTML objects are processed. Whatever content the web server returns is treated and loaded as static text. Yes, this can be HTML, XML or another text format, but no dynamic objects or behavior will follow, as would be the case in your browser. The advantage of loading web page results as static content is that you never automatically transfer and load the associated images or other pages that are part of the content, unless you explicitly do so. Your pages load much faster. And, in most cases, this is all you need.
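
As a rough illustration of what loading a page as static text means, here is a minimal sketch in Python. It is not the .NET code from the download: the function name fetch_static and the User-Agent string are made up for this example. It issues one request with the standard library and decodes the body. Nothing in the returned markup is executed, and no images or linked pages are fetched.

```python
import urllib.request


def fetch_static(url, timeout=10.0):
    """Return the response body as plain text.

    Hypothetical helper for illustration only: scripts in the page are not
    run, and images or linked resources are not downloaded.
    """
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "static-scraper-sketch"},  # made-up agent name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_static with a page address returns the raw markup as a single string, which is the kind of buffer the rest of a static scraper would work on.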

There are several areas where we will offer helper methods to use in your own scraper. To begin with, we will cover only four main areas:

  1. Web Navigation functions cover, to begin with, the three main HTTP methods used to issue a web request and get back the response: GET, HEAD and POST.
  2. Static Web Page Content Crawling begins once the content is loaded into an internal string buffer. We offer some simple but very useful methods to locate text within it, extract substrings and automatically advance your pointers to the remaining text.
  3. Data Capture provides basic functionality to store extracted data in memory structures and in external files or databases.
  4. Any good crawler must have some Tracing functionality, to report messages about processing results and let you know about possible parsing errors. You also need performance counters and measurements of the time spent on navigation, data extraction and other forms of processing.
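
To make the crawling, capture and tracing areas above more concrete, here is a small Python sketch of sequential parsing without regular expressions. It is only an illustration under our own assumptions, not the WebScrapperBase code: the names TextCursor, skip_past, extract_between and scrape_rows are hypothetical, and SAMPLE is made-up markup, not real Yahoo!Movies content.

```python
import time


class TextCursor:
    """Hypothetical helper: walks a string buffer left to right, no regex."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the first character not yet consumed

    def skip_past(self, marker):
        """Advance just past the next occurrence of marker; False if absent."""
        index = self.text.find(marker, self.pos)
        if index < 0:
            return False
        self.pos = index + len(marker)
        return True

    def extract_between(self, start, end):
        """Return the text between the next start/end markers, or None."""
        saved = self.pos
        if not self.skip_past(start):
            return None
        begin = self.pos
        stop = self.text.find(end, begin)
        if stop < 0:
            self.pos = saved  # restore the cursor when the match fails
            return None
        self.pos = stop + len(end)
        return self.text[begin:stop].strip()


def scrape_rows(html, trace=print):
    """Capture (title, year) pairs from table rows and trace the outcome."""
    started = time.perf_counter()
    cursor = TextCursor(html)
    rows = []  # in-memory data capture
    while cursor.skip_past("<tr>"):
        title = cursor.extract_between("<td>", "</td>")
        year = cursor.extract_between("<td>", "</td>")
        if title is None or year is None:
            trace("parse error: incomplete row near offset %d" % cursor.pos)
            break
        rows.append({"title": title, "year": year})
    elapsed = time.perf_counter() - started
    trace("captured %d rows in %.6f s" % (len(rows), elapsed))
    return rows


# Made-up sample markup, used only to exercise the sketch.
SAMPLE = (
    "<table><tr><td>Movie A</td><td>2001</td></tr>"
    "<tr><td>Movie B</td><td>2005</td></tr></table>"
)
```

Calling scrape_rows(SAMPLE) returns one dictionary per table row and sends a timing message to the trace callback. When the page layout changes, only the marker strings passed to skip_past and extract_between need to be adapted, which is the kind of flexibility we are after.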
