The COntext INterchange (COIN) Project:

Data Extraction and Interpretation from Semi-Structured Web Sources

Stuart E. Madnick [smadnick@mit.edu] and Michael Siegel [msiegel@mit.edu]

MIT Sloan School of Management; Cambridge, Massachusetts 02139 USA

The popularity and growth of the Web have dramatically increased the number of available information sources and the opportunity for important new information-intensive applications (e.g., massive data warehouses, integrated supply chain management, global risk management, in-transit visibility). Realizing this opportunity, however, requires overcoming significant challenges in data extraction and data interpretation.

Data Extraction: One problem is the difficulty of easily and automatically extracting very specific data elements from Web sites for use by operational systems. As an example solution, we have developed a technology, Automatic Web Wrapping, which allows semi-structured Web sites (including dynamic Web sites) to be queried using SQL as if they were relational databases. This allows ODBC-compliant packaged software (such as Excel and Visual Basic), as well as traditional user application software, to query the Web directly. In a separate application, this technology has been combined with a post-processor to generate XML-tagged pages from "legacy" HTML Web sites.
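The idea of querying a semi-structured page "as if it were a relational database" can be illustrated with a minimal sketch. This is not the Automatic Web Wrapping technology itself; it only assumes that a wrapper specification can be reduced to a pattern whose named groups become table columns. The sample page, the pattern, the table name quotes, and the functions wrap_page and query_page are all hypothetical, and an in-memory SQLite database stands in for the SQL interface.

```python
import re
import sqlite3

# Hypothetical HTML fragment standing in for a semi-structured Web page.
SAMPLE_PAGE = """
<html><body>
<h1>Quotes</h1>
<ul>
  <li><b>IBM</b> last: 158.25 volume: 4200100</li>
  <li><b>MSFT</b> last: 112.50 volume: 9100400</li>
  <li><b>ORCL</b> last: 31.75 volume: 2750000</li>
</ul>
</body></html>
"""

# Hypothetical wrapper specification: each named group becomes a column.
ROW_PATTERN = re.compile(
    r"<li><b>(?P<ticker>\w+)</b>\s+last:\s+(?P<last>[\d.]+)"
    r"\s+volume:\s+(?P<volume>\d+)</li>"
)


def wrap_page(html: str) -> sqlite3.Connection:
    """Extract rows from the page and load them into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quotes (ticker TEXT, last REAL, volume INTEGER)")
    rows = [
        (m["ticker"], float(m["last"]), int(m["volume"]))
        for m in ROW_PATTERN.finditer(html)
    ]
    conn.executemany("INSERT INTO quotes VALUES (?, ?, ?)", rows)
    return conn


def query_page(html: str, sql: str) -> list:
    """Run an SQL query against the wrapped page and return all result rows."""
    conn = wrap_page(html)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


result = query_page(
    SAMPLE_PAGE,
    "SELECT ticker, last FROM quotes WHERE last > 100 ORDER BY ticker",
)
print(result)
```

With the sample page above, the query returns the IBM and MSFT rows only. A real wrapper would fetch the page over HTTP and would have to cope with layout changes and dynamic content, which this sketch does not attempt.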

Data Interpretation: Another serious problem is the existence of heterogeneous contexts, whereby each SOURCE of information and potential RECEIVER of that information may operate with a different context, leading to large-scale semantic heterogeneity. A context is the collection of implicit assumptions about the context definition (i.e., meaning) and context characteristics (i.e., quality) of the information. As a simple example, whereas most US universities grade on a 4.0 scale, MIT uses a 5.0 scale. Another typical example is the extraction of price information from the Web: is the price in dollars or yen (and, if dollars, US dollars or Hong Kong dollars)? Does it include taxes? Does it include shipping? And do those assumptions match the receiver's? We have investigated the existence of and reasons behind various forms of context challenges, and have developed a theory and technology for representing context knowledge and a Context Mediation engine for mitigating the problem.
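The price example can be made concrete with a small sketch of what reconciling a source context with a receiver context might involve. This is an illustration only, not the COIN Context Mediation engine or its context representation. The PriceContext class, the mediate function, the exchange rates, and the tax rate are all hypothetical and chosen purely for the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceContext:
    """Implicit assumptions that a source or receiver makes about a price."""
    currency: str       # e.g. "USD", "JPY", "HKD"
    includes_tax: bool  # whether the reported figure already contains tax
    scale: int          # reporting unit: 1 = units, 1000 = thousands


# Hypothetical conversion rates to US dollars and a hypothetical tax rate.
RATES_TO_USD = {"USD": 1.0, "JPY": 0.008, "HKD": 0.125}
TAX_RATE = 0.05


def mediate(value: float, source: PriceContext, receiver: PriceContext) -> float:
    """Convert a price from the source's context into the receiver's context."""
    # Undo the source's scale factor so the amount is in single currency units.
    amount = value * source.scale
    # Convert currency by way of US dollars.
    amount = amount * RATES_TO_USD[source.currency] / RATES_TO_USD[receiver.currency]
    # Add or remove tax only when the two contexts disagree about it.
    if source.includes_tax and not receiver.includes_tax:
        amount /= 1 + TAX_RATE
    elif receiver.includes_tax and not source.includes_tax:
        amount *= 1 + TAX_RATE
    # Express the result in the receiver's scale factor.
    return amount / receiver.scale


# Source reports thousands of yen, before tax; receiver expects US dollars with tax.
source_ctx = PriceContext(currency="JPY", includes_tax=False, scale=1000)
receiver_ctx = PriceContext(currency="USD", includes_tax=True, scale=1)

converted = mediate(250, source_ctx, receiver_ctx)
print(round(converted, 2))
```

The point of the sketch is that the raw value 250 is meaningless to the receiver until each assumption (currency, tax, scale) is made explicit and compared. The grading example follows the same pattern, with the scale maximum (4.0 versus 5.0) as the differing assumption.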

Recent sponsors of this research include: the DoD Advanced Research Projects Agency (ARPA), Merrill Lynch, PricewaterhouseCoopers, and PRIMARK.

Recent articles about this work can be found as follows:

Go to http://www.computerworld.com/newsrch.nsf/$$search?OpenForm

Then search for "Madnick" and look at the articles:

11/13/98