
Semantic Data Extractor

Every so often, someone writes to me or to the public-qa-dev mailing list to report bugs in the semantic data extractor, or simply to say thanks for it.

I'm always pleasantly surprised to hear that what started as a ten-minute demonstrator of the semantics attached to HTML is actually used as a tool by a number of developers.

With a name such as "semantic data extractor", it was a bit of a shame that the tool didn't highlight the usage of GRDDL or RDFa on pages that use either of these technologies; I have just added detection of both to the extractor.
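The post does not say how the extractor recognizes these technologies, so the following is only a rough sketch of one plausible approach, not the tool's actual code. It assumes well-formed XHTML input, treats the GRDDL profile URI on `<head profile="...">` as the GRDDL signal, and treats the presence of a few RDFa-specific attributes as the RDFa signal. The function names (`uses_grddl`, `uses_rdfa`, `detect`) and the exact attribute list are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Profile URI that XHTML documents use to declare GRDDL transformations.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Attributes introduced by RDFa. This list is an assumption of the sketch:
# rel, rev and content are left out because plain HTML uses them too.
RDFA_ATTRIBUTES = {"about", "property", "typeof", "resource", "datatype"}


def local(name):
    """Strip a namespace prefix such as {http://www.w3.org/1999/xhtml}."""
    return name.rsplit("}", 1)[-1]


def uses_grddl(root):
    """True if the first <head> element lists the GRDDL profile."""
    for el in root.iter():
        if local(el.tag) == "head":
            return GRDDL_PROFILE in el.get("profile", "").split()
    return False


def uses_rdfa(root):
    """True if any element carries one of the RDFa-specific attributes."""
    return any(
        attr in RDFA_ATTRIBUTES for el in root.iter() for attr in el.attrib
    )


def detect(xhtml):
    """Report which of the two technologies a well-formed XHTML page uses."""
    root = ET.fromstring(xhtml)
    return {"grddl": uses_grddl(root), "rdfa": uses_rdfa(root)}
```

Calling `detect(page_source)` returns a dictionary with one boolean per technology. A real detector would also need to cope with non-XML HTML and with GRDDL declared on generic XML documents, both of which this sketch ignores.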

As a bonus, I have also added detection of non-semantic markup: at this time, it will detect purely-wrapping <div>, empty <span>, and tables with a single row or a single column (which are likely to be layout tables); if you have suggestions for detecting other non-semantic markup, let me know!
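To make the three heuristics concrete, here is a minimal sketch of how they could be expressed. It is not the extractor's own code. It assumes well-formed XHTML input, reads "purely-wrapping <div>" as a <div> whose only content is a single child element, and counts a table as suspicious when it has exactly one row or when every row has exactly one cell. All function names and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Strip a namespace prefix such as {http://www.w3.org/1999/xhtml}."""
    return name.rsplit("}", 1)[-1]


def has_text(el):
    """True if non-whitespace text sits directly inside the element."""
    if (el.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in el)


def table_rows(table):
    """Rows that belong to this table, not to tables nested inside it."""
    rows = []
    for child in table:
        name = local(child.tag)
        if name == "tr":
            rows.append(child)
        elif name in ("thead", "tbody", "tfoot"):
            rows.extend(r for r in child if local(r.tag) == "tr")
    return rows


def cell_count(row):
    """Number of <td> and <th> cells directly inside a row."""
    return sum(1 for c in row if local(c.tag) in ("td", "th"))


def find_suspicious_markup(xhtml):
    """Count the three kinds of suspicious markup in an XHTML string."""
    counts = {"wrapping_div": 0, "empty_span": 0, "layout_table": 0}
    for el in ET.fromstring(xhtml).iter():
        name = local(el.tag)
        if name == "div" and len(el) == 1 and not has_text(el):
            counts["wrapping_div"] += 1   # a single child element, no text
        elif name == "span" and len(el) == 0 and not has_text(el):
            counts["empty_span"] += 1     # no children and no text
        elif name == "table":
            rows = table_rows(el)
            single_row = len(rows) == 1
            single_column = all(cell_count(r) == 1 for r in rows)
            if rows and (single_row or single_column):
                counts["layout_table"] += 1
    return counts
```

Checks like these are only hints: a <div> wrapping a single child can be perfectly legitimate, for instance as a styling hook, so a count above zero is a prompt to look at the markup rather than proof of a problem.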

Filed by Dominique Hazaël-Massieux on February 12, 2009 10:27 AM in Semantic Web, Tools

Comments

Carlo # 2009-03-13

The tool doesn't work! This is the error message:

Using org.apache.xerces.parsers.SAXParser Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog. org.xml.sax.SAXParseException: Content is not allowed in prolog.

Dom # 2009-03-13

Hi Carlo,

Please report bugs and errors on public-qa-dev@w3.org, with details on the URI you tried the tool on.

Thanks,

Dom

OP # 2009-04-09

Hi there.

I am a big fan of this tool. But something that is puzzling me is this message we get for our site(s) when running them through it.

"<div> with no additional content to their unique child"

Gotta love it. Naturally I checked out a few things and tried what I thought might stop this from appearing, to no avail.

The divs I have are empty in a sense... they have this. Since the include file is in

So I tried adding a few things to the div container... nbsp's, transparent gifs, other content, and still nothing reduced the number of divs with no additional content.

The included file doesn't contain any empty divs.

Other divs have content in them with headers and such, so I think that the div I mentioned may be throwing it off.

Any suggestions what to do to remedy this?

Joe # 2009-07-08

I second OP's comment. I have no clue what this error means, having tried the same as OP:

Non-semantic markup

The following suspiciously non-semantic markup has been detected:

* 6 <div> with no additional content to their unique child

Any advice? Someone? Thanks!

Dom # 2009-07-08

I have removed the test for empty div, as it was both confusing and misleading.
