Semantic Data Extractor


By: Dominique Hazaël-Massieux

Published: 12 February 2009

Every so often, someone writes to me or to the public-qa-dev mailing list to report bugs in the semantic data extractor, or simply to say thanks for it.

I'm always pleasantly surprised to hear that what started as a 10-minute demonstrator of the semantics attached to HTML is actually used as a tool by a number of developers.

With a name such as "semantic data extractor", it was a bit of a shame that the tool didn't highlight the use of GRDDL or RDFa on pages that rely on either of these technologies; I have just added detection of both to the extractor.
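
As an illustration only (this is not the extractor's actual code), the kind of check involved can be sketched in Python. The sketch assumes a well-formed XHTML document without a default namespace, treats the GRDDL profile URI on `<head>` as the GRDDL marker, and treats a small set of RDFa-specific attributes as the RDFa marker; the function name `detect_semantic_tech` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Profile URI that XHTML documents use to announce GRDDL transformations.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Attributes introduced by RDFa (assumption: this subset is enough for a rough check).
RDFA_ATTRIBUTES = {"about", "property", "typeof", "resource", "datatype"}


def detect_semantic_tech(xhtml):
    """Return which of GRDDL / RDFa appear to be used in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)

    # GRDDL: the head's profile attribute is a space-separated list of URIs.
    head = root.find("head")
    profiles = (head.get("profile", "") if head is not None else "").split()
    uses_grddl = GRDDL_PROFILE in profiles

    # RDFa: any element carrying at least one RDFa-specific attribute.
    uses_rdfa = any(RDFA_ATTRIBUTES & set(el.attrib) for el in root.iter())

    return {"grddl": uses_grddl, "rdfa": uses_rdfa}
```

A real document served with an XHTML namespace declaration would need namespace-aware tag names, and ill-formed markup would have to be repaired before parsing.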

As a bonus, I have also added detection of non-semantic markup: at this time, it detects purely-wrapping <div>, empty <span>, and tables with a single row or a single column (which are likely to be layout tables); if you have suggestions for detecting other non-semantic markup, let me know!
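
To make these three heuristics concrete, here is a minimal Python sketch. It is not the extractor's implementation: the function name `find_non_semantic`, the report keys, and the exact definitions (a wrapping div has exactly one child element and no text of its own; an empty span has no children and no text; a table is suspicious if it has one row or at most one cell per row) are assumptions made for illustration, and the input is assumed to be well-formed XHTML without a default namespace.

```python
import xml.etree.ElementTree as ET


def _has_text(s):
    """True if the string contains something other than whitespace."""
    return bool(s and s.strip())


def find_non_semantic(xhtml):
    """Count suspiciously non-semantic elements in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    report = {"wrapping_div": 0, "empty_span": 0, "layout_table": 0}

    for el in root.iter():
        children = list(el)
        if el.tag == "div":
            # Purely wrapping: a single child element and no text around it.
            if (len(children) == 1
                    and not _has_text(el.text)
                    and not _has_text(children[0].tail)):
                report["wrapping_div"] += 1
        elif el.tag == "span":
            # Empty: no child elements and no text content.
            if not children and not _has_text(el.text):
                report["empty_span"] += 1
        elif el.tag == "table":
            # Rows directly in the table or in its thead/tbody/tfoot sections.
            rows = el.findall("tr")
            for section in ("thead", "tbody", "tfoot"):
                rows += el.findall(section + "/tr")
            if not rows:
                continue
            widest = max(len(r.findall("td")) + len(r.findall("th")) for r in rows)
            # A single row or a single column suggests a layout table.
            if len(rows) == 1 or widest <= 1:
                report["layout_table"] += 1

    return report
```

These counts are only hints: a one-row table can be legitimate data, which is why the results are best read as "suspicious" rather than as errors.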


Comments (7)

Carlo - 13 March 2009 at 11:11:28 UTC

The tool doesn't work!
This is the error message:
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.

Dom - 13 March 2009 at 13:59:20 UTC

Hi Carlo,
Please report bugs and errors on public-qa-dev@w3.org, with details on the URI you tried the tool on.
Thanks,
Dom

OP - 9 April 2009 at 16:23:51 UTC

Hi there.
I am a big fan of this tool. But something that is puzzling me is this message we are getting for our site(s) when running them through.
"<div> with no additional content to their unique child"
Gotta love it. Naturally I checked out a few things and tried what I thought might fix this from appearing, and to no avail.
The divs I have are empty in a sense... they have this. Since the include file is in
So I tried to add a few things into the div container... nbsp's, transparent gifs, other content, and still nothing was reducing the number of divs with no additional content.
The included file doesn't have empty divs inside it.
Other divs have content in them with headers and such, so I think that the div I mentioned may be throwing it off.
Any suggestions what to do to remedy this?

Joe - 8 July 2009 at 06:47:37 UTC

I second OP's comment.
I have no clue what this error means, having tried the same as OP:
Non-semantic markup
The following suspiciously non-semantic markup has been detected:

* 6 <div> with no additional content to their unique child

Any advice?
Someone?
Thanks!

Dom - 8 July 2009 at 07:29:24 UTC

I have removed the test for empty div, as it was both confusing and misleading.

James Sanders - 1 January 2011 at 00:24:26 UTC

Hello All,
Well, after 3 hours of searching through W3C to try to figure out what is wrong and why the semantics validator keeps throwing errors, I finally decided to post here in hopes that some answers might be forthcoming. Hopefully, with Q&A being closed, this still gets some attention. So without further delay, the error I get is as follows:
Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: The markup declarations contained or pointed to by the document type declaration must be well-formed.
org.xml.sax.SAXParseException: The markup declarations contained or pointed to by the document type declaration must be well-formed.
URI link to the file in question is as follows:
http://www.sanders-consultation-group-plus.com/redesign/grail-template.html
Any help in this matter would be greatly appreciated because I do so much love the idea of running pages through the validator. It really does give a designer an idea of what spiders might think a page is about based on the semantics, and lets me know, as a designer, if I have done my job to make sure they know.
Thanks in advance

Dom - 3 January 2011 at 07:58:18 UTC

The problem reported by James seems to come from an invalid/ill-formed XHTML document; see http://lists.w3.org/Archives/Public/public-qa-dev/2011Jan/0001.html

Comments for this post are closed.
