Bug 29752: [XSLT30] two accumulator examples using count(tokenize(., '\s+')) respectively count(tokenize(., '\W+')) to count words give odd results

Created: 2016-07-25 10:26:49 +0000
Last changed: 2016-10-06 16:30:39 +0000
Classification: Unclassified
Product: XPath / XQuery / XSLT
Component: XSLT 3.0
Version: Candidate Recommendation
Platform: PC, Windows NT
Status: CLOSED FIXED
Priority: P2
Severity: normal
Reporter: martin.honnen
Others on the bug: mike, cmsmcq, public-qt-comments

Comment 0 (id 127049), martin.honnen, 2016-07-25 10:26:49 +0000

The section about accumulators gives two examples said to count words in sections and in the document, respectively. One is in https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after and defines

    <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
      <xsl:accumulator-rule match="text()"
                            select="$value + count(tokenize(., '\s+'))"/>
    </xsl:accumulator>

and

    <xsl:template match="section">
      <xsl:apply-templates/>
      (words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)
    </xsl:template>

The other is in the section https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples and defines

    <xsl:accumulator name="word-count" as="xs:integer" initial-value="0">
      <xsl:accumulator-rule match="text()"
                            select="$value + count(tokenize(string(.), '\W+'))"/>
    </xsl:accumulator>

and

    <xsl:template match="/">
      <xsl:apply-templates/>
      <p>Word count: <xsl:value-of select="accumulator-after('word-count')"/></p>
    </xsl:template>

I realize the examples are supposed to be short and to illustrate the use of accumulators rather than to provide a good word count implementation. But when I test them on documents containing any white space text nodes, both of the above approaches give rather odd results: the word count is too high compared to what a human reader would count.
For instance, for the input

    <?xml version="1.0" encoding="UTF-8"?>
    <doc>
        <section id="sec1">This is a quick test.</section>
        <section id="sec2">
            <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        </section>
    </doc>

the complete stylesheet using tokenize(., '\W+') looks like this

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

        <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

        <xsl:global-context-item use-accumulators="w" streamable="yes"/>

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
            <xsl:accumulator-rule match="text()"
                select="$value + count(tokenize(., '\W+'))"/>
        </xsl:accumulator>

        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates/>
                <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="section">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <xsl:apply-templates/>
                <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
            </xsl:copy>
        </xsl:template>

    </xsl:stylesheet>

and when run with Saxon 9.7 EE outputs

    <?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 6)</p></section>
        <section id="sec2">
            <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 16)</p></section>
    <p>Total count of words in document : 28</p></doc>

As far as the accumulator is concerned, both examples in the spec want to do the same thing, namely count the words in text nodes. So I wonder whether it would be possible to include a slightly longer but more precise accumulator definition in the form of

    <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
        <xsl:accumulator-rule match="text()">
            <xsl:variable name="words" as="xs:string*">
                <xsl:analyze-string select="." regex="\w+">
                    <xsl:matching-substring>
                        <xsl:sequence select="regex-group(0)"/>
                    </xsl:matching-substring>
                </xsl:analyze-string>
            </xsl:variable>
            <xsl:sequence select="$value + count($words)"/>
        </xsl:accumulator-rule>
    </xsl:accumulator>

once in the spec and have both spec examples reference/use that accumulator definition.

The word count results with a full stylesheet

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

        <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

        <xsl:global-context-item use-accumulators="w" streamable="yes"/>

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
            <xsl:accumulator-rule match="text()">
                <xsl:variable name="words" as="xs:string*">
                    <xsl:analyze-string select="." regex="\w+">
                        <xsl:matching-substring>
                            <xsl:sequence select="regex-group(0)"/>
                        </xsl:matching-substring>
                    </xsl:analyze-string>
                </xsl:variable>
                <xsl:sequence select="$value + count($words)"/>
            </xsl:accumulator-rule>
        </xsl:accumulator>

        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates/>
                <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="section">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <xsl:apply-templates/>
                <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
            </xsl:copy>
        </xsl:template>

    </xsl:stylesheet>

are then

    <?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 5)</p></section>
        <section id="sec2">
            <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 9)</p></section>
    <p>Total count of words in document : 14</p></doc>

which seems more natural and correct as a word count.
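To see where the 16 reported for section "sec2" comes from, it can help to apply the '\W+' tokenization to each of its text nodes separately. The following is only a minimal sketch and is not taken from the spec or from the stylesheets above. The variable name `text-nodes` and the output elements `counts`, `node` and `total` are made up for illustration. The whitespace-only text nodes are represented by a single space, on the assumption that the original newline-plus-indentation tokenizes the same way.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="3.0">

  <xsl:output indent="yes"/>

  <!-- Hypothetical stand-ins for the five text nodes of section "sec2",
       in document order; whitespace-only nodes are reduced to one space. -->
  <xsl:variable name="text-nodes" as="xs:string*"
      select="' ', 'The quick ', 'brown', ' fox jumped over the lazy dog.', ' '"/>

  <xsl:template name="xsl:initial-template">
    <counts>
      <!-- One entry per text node: a separator at the start or end of the
           string contributes a zero-length token to the count. -->
      <xsl:for-each select="$text-nodes">
        <node text="{.}" tokens="{count(tokenize(., '\W+'))}"/>
      </xsl:for-each>
      <!-- Sum of the per-node counts, i.e. what the accumulator adds up. -->
      <total>
        <xsl:value-of select="sum($text-nodes ! count(tokenize(., '\W+')))"/>
      </total>
    </counts>
  </xsl:template>

</xsl:stylesheet>
```

By the rules for fn:tokenize (a separator at the start or end of the input yields a zero-length token), this should report 2, 3, 1, 8 and 2 tokens, for a total of 16, which matches the Saxon output above. The extra counts come from the whitespace-only nodes and from leading or trailing separators in the other nodes.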
Comment 1 (id 127612), mike, 2016-09-30 10:00:38 +0000

Thanks for pointing this out and sorry for the delay in responding.

There seem to be two things wrong with count(tokenize(., '\W+')):

(a) it counts 1 for a whitespace text node

(b) for other text nodes, it gives a count that is 1 too high.

One solution would simply be to subtract 1 from the count. I'm inclined though to use the new XPath 3.1 tokenize#1

    <xsl:accumulator-rule match="text()"
                          select="$value + count(tokenize(.))"/>

which gives the correct answer 14 (whether or not we strip whitespace text nodes) without making the example a lot more complicated.

Added as test accumulator-058.

Comment 2 (id 127618), martin.honnen, 2016-09-30 11:47:31 +0000

(In reply to Michael Kay from comment #1)
> I'm inclined though to use the new XPath 3.1 tokenize#1
>
> <xsl:accumulator-rule match="text()"
>     select="$value + count(tokenize(.))"/>
>
> which gives the correct answer 14 (whether or not we strip whitespace text
> nodes) without making the example a lot more complicated.

That looks much better than the previous approaches using tokenize(., '\W+'). It is still easy to construct input samples like

    <p>He asked:"Does it work?"</p>

where the analyze-string count on \w+ sequences would work better. But I agree that the examples on accumulators need to be short, so that they demonstrate the use of accumulators rather than present more complicated attempts at word counting.

Comment 3 (id 127667), mike, 2016-10-06 08:26:48 +0000

I have now updated both examples to use count(tokenize(.)), and to mention that the word count produced is crude. Thanks for the feedback.

Because the change only affects code examples, it won't be change-highlighted or included in the change log.
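For anyone wanting to try the revised rule against the sample document from comment 0, here is a non-streamed sketch. It is not the text that went into the spec. It combines the tokenize#1 rule from comment 1 with the two templates from comment 0. It assumes a processor that supports the XPath 3.1 function library, and it assumes that declaring the accumulator via `use-accumulators` on `xsl:mode` is the right way to make it applicable to the source document. The stylesheet in comment 0 used draft-era attributes on `xsl:global-context-item` instead.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="3.0">

  <!-- Assumption: use-accumulators on xsl:mode makes "w" applicable
       to the principal source document. No streaming is requested. -->
  <xsl:mode on-no-match="shallow-copy" use-accumulators="w"/>

  <!-- tokenize#1 normalizes whitespace and then splits on single spaces,
       so a whitespace-only text node contributes nothing. -->
  <xsl:accumulator name="w" initial-value="0" as="xs:integer">
    <xsl:accumulator-rule match="text()"
                          select="$value + count(tokenize(.))"/>
  </xsl:accumulator>

  <!-- Document element: copy the content, then append the overall total. -->
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates/>
      <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
    </xsl:copy>
  </xsl:template>

  <!-- Section: the words inside it are the difference between the
       accumulator value after and before the section element. -->
  <xsl:template match="section">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
      <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

On the sample input this should give 5 and 9 words for the two sections and 14 in total, consistent with the total quoted in comment 1. The count remains crude, as comment 2 points out: for the text He asked:"Does it work?" whitespace tokenization yields 4 tokens (He, asked:"Does, it, work?"), whereas matching \w+ sequences would find 5 words.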