[issue-34] meeting about languagetool from Felix Sasaki on 2012-08-10 (public-multilingualweb-lt@w3.org from August 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Fri, 10 Aug 2012 15:46:14 +0200
To: public-multilingualweb-lt@w3.org
Cc: Daniel Naber <naber@danielnaber.de>
Message-ID: <CAL58czq7TgjTtAqx869s+hdxfHirHhyFPJdr+evjgce-cagTUA@mail.gmail.com>

Hi all,

today Arle and I met with the lead developer of languagetool
http://www.languagetool.org/ , Daniel Naber (see CC). Background: we wanted
to discuss how the output of languagetool could be re-used as part of the
quality data category information.

In languagetool, XML files are one means (in addition to Java rules and
tool configuration) to specify rules for checking, see e.g.

http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/en/grammar.xml?content-type=text%2Fplain

The XML files don't have information for categorizing errors: names of
categories, rules and rule groups are not standardized and not mapped to
our top level quality types.

First, Daniel agreed to add an attribute to grammar.xml for specifying
locQualityType in languagetool. The value list for that attribute is not
yet fixed, but we will keep Daniel in the loop so that later, when we have
finalized the list, he can start implementing the attribute.

This does not mean that all rules in languagetool are mapped to
locQualityType, but it provides a hook to do the mapping. We can then work
on implementing the mapping in community or other, more specific efforts.

The attribute will be available in various parts of grammar.xml: at the
"category" level, for rule groups, and for rules. Settings on a lower level
override settings on the upper level(s). In that way, the writers of
grammar rules can have their own mappings for single rules, even if the
general mappings are different.
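The override behaviour described above can be sketched as follows. This is
only an illustration under assumptions: the attribute name locQualityType,
the values "typographical", "grammar" and "style", the rule ids, and the
simplified file structure are all placeholders, since neither the value
list nor the implementation is fixed yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified grammar.xml fragment. The attribute name
# and all values are placeholders; the real value list is not yet final.
SAMPLE = """
<rules lang="en">
  <category name="Capitalization" locQualityType="typographical">
    <rule id="HYPOTHETICAL_RULE_1"/>
    <rulegroup id="HYPOTHETICAL_GROUP" locQualityType="grammar">
      <rule id="HYPOTHETICAL_RULE_2"/>
      <rule id="HYPOTHETICAL_RULE_3" locQualityType="style"/>
    </rulegroup>
  </category>
  <category name="Misc">
    <rule id="HYPOTHETICAL_RULE_4"/>
  </category>
</rules>
"""


def resolve_quality_types(xml_text, attr="locQualityType"):
    """Return {rule id: effective type}; lower levels override upper ones."""
    root = ET.fromstring(xml_text)
    resolved = {}

    def walk(element, inherited):
        # A setting on this element wins over whatever was inherited.
        current = element.get(attr, inherited)
        if element.tag == "rule":
            resolved[element.get("id")] = current
        for child in element:
            walk(child, current)

    walk(root, None)
    return resolved


if __name__ == "__main__":
    for rule_id, quality_type in resolve_quality_types(SAMPLE).items():
        print(rule_id, quality_type)
```

Rules without any setting on their own level or above it resolve to None,
which matches the point that not every rule has to be mapped.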

A second topic we discussed was the output from languagetool.

The current output of languagetool, in its XML representation, looks like
this:

<error fromy="0" fromx="0" toy="0" tox="5"
       ruleId="UPPERCASE_SENTENCE_START"
       msg="This sentence does not start with an uppercase letter"
       replacements="This" context="this is a test."
       contextoffset="0"
       errorlength="4"/>

To get the mapping to locQualityType, one could take the ruleId, go back to
grammar.xml and look up the mapping there. The disadvantage is that you
need access to the rule files themselves, and you will miss rules that come
from the system, general Java rules etc.
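A rough sketch of that lookup, and of where it falls short, could look like
this. The rule ids, the locQualityType attribute, its value and the
<matches> wrapper element are assumptions made for the example; inheritance
from categories and rule groups is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar.xml fragment with one mapped XML rule.
GRAMMAR_XML = """
<rules lang="en">
  <category name="Capitalization">
    <rule id="HYPOTHETICAL_XML_RULE" locQualityType="typographical"/>
  </category>
</rules>
"""

# Hypothetical checker output: one match from an XML rule and one from a
# rule that is not defined in grammar.xml (e.g. a rule written in Java).
ERRORS_XML = """
<matches>
  <error ruleId="HYPOTHETICAL_XML_RULE" msg="..."/>
  <error ruleId="HYPOTHETICAL_JAVA_RULE" msg="..."/>
</matches>
"""


def lookup_by_rule_id(grammar_text, errors_text, attr="locQualityType"):
    """Map each reported ruleId to a type by going back to grammar.xml."""
    grammar = ET.fromstring(grammar_text)
    # Index only what the rule file itself declares.
    by_id = {rule.get("id"): rule.get(attr) for rule in grammar.iter("rule")}
    errors = ET.fromstring(errors_text)
    return [(e.get("ruleId"), by_id.get(e.get("ruleId")))
            for e in errors.iter("error")]


if __name__ == "__main__":
    for rule_id, quality_type in lookup_by_rule_id(GRAMMAR_XML, ERRORS_XML):
        print(rule_id, quality_type)
```

The second match comes back without a type because its rule is not in the
file, which is the gap mentioned above.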

Daniel proposed a better solution: he would add two more attributes to the
output. One will convey the languagetool version (this relates to
issue-42). The other attribute will output (if available) the
locQualityType, from grammar.xml or whatever source was used to create the
message.
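With such output, a consumer would no longer need the rule files. The
sketch below shows the idea only: the attribute names toolVersion and
locQualityType, the version string "x.y" and the value "typographical" are
placeholders, since the actual names and values have not been decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical future output. The last two attributes do not exist today;
# their names and values are placeholders for this example.
ERROR_XML = (
    '<error fromy="0" fromx="0" toy="0" tox="5" '
    'ruleId="UPPERCASE_SENTENCE_START" '
    'msg="This sentence does not start with an uppercase letter" '
    'replacements="This" context="this is a test." '
    'contextoffset="0" errorlength="4" '
    'toolVersion="x.y" locQualityType="typographical"/>'
)


def read_quality_info(error_text):
    """Read rule id, tool version and quality type straight from the output."""
    error = ET.fromstring(error_text)
    return {
        "rule_id": error.get("ruleId"),
        # Both values are None when the attribute is absent, matching the
        # "if available" wording above.
        "tool_version": error.get("toolVersion"),
        "quality_type": error.get("locQualityType"),
    }


if __name__ == "__main__":
    print(read_quality_info(ERROR_XML))
```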

We also discussed other topics, like the bi-text module in languagetool, see
http://languagetool.wikidot.com/checking-translations-bilingual-texts
or the structure of error messages. Here we didn't conclude anything that
is relevant for our topic.

All, please let me know what you think, and Daniel, Arle, correct me if I
missed something or got something wrong.

We also agreed to put Daniel into the acknowledgement section, for
contributing an implementation of the quality data category - without any
support. Thanks a lot for that, Daniel!

Best,

Felix

--
Felix Sasaki
DFKI / W3C Fellow

Received on Friday, 10 August 2012 13:46:44 UTC