29752 – [XSLT30] two accumulator examples using count(tokenize(., '\s+')) respectively count(tokenize(., '\W+')) to count words give odd results

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XSLT 3.0
Version:	Candidate Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2016-07-25 10:26 UTC by Martin Honnen
Modified:	2016-10-06 16:30 UTC
CC List:	1 user

Description Martin Honnen 2016-07-25 10:26:49 UTC

The section about accumulators gives two examples that are said to count words, one per section and one for the whole document. The first is in https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after and defines

<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
   <xsl:accumulator-rule match="text()" 
                         select="$value + count(tokenize(., '\s+'))"/>
</xsl:accumulator>

and 

<xsl:template match="section">
   <xsl:apply-templates/>
   (words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)
</xsl:template>

The other is in the section https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples and defines


  <xsl:accumulator name="word-count" 
                   as="xs:integer" 
                   initial-value="0">
    <xsl:accumulator-rule match="text()" 
         select="$value + count(tokenize(string(.), '\W+'))"/>
  </xsl:accumulator>
  
and

   <xsl:template match="/">
     <xsl:apply-templates/>
     <p>Word count: <xsl:value-of select="accumulator-after('word-count')"/></p>
   </xsl:template>


I realize the examples are meant to be short and to illustrate the use of accumulators rather than to provide a good word count implementation. However, when I test them on documents containing any whitespace text nodes, both approaches give odd results: the word count is much higher than what a human reader would count.

For instance for the input

<?xml version="1.0" encoding="UTF-8"?>
<doc>
	<section id="sec1">This is a quick test.</section>
	<section id="sec2">
		<p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
	</section>
</doc>

a complete stylesheet using tokenize(., '\W+') looks like this

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
	version="3.0">
	
	<xsl:mode on-no-match="shallow-copy" streamable="yes"/>
	<xsl:global-context-item use-accumulators="w" streamable="yes"/>
	
	<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
		<xsl:accumulator-rule match="text()" select="$value + count(tokenize(., '\W+'))"/>
	</xsl:accumulator>

	<xsl:template match="/*">
		<xsl:copy>
			<xsl:apply-templates/>
			<p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
		</xsl:copy>
	</xsl:template>
	
	<xsl:template match="section">
		<xsl:copy>
			<xsl:apply-templates select="@*"/>
			<xsl:apply-templates/>
			<p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
		</xsl:copy>
	</xsl:template>
	
</xsl:stylesheet>


and, when run with Saxon 9.7 EE, outputs

<?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 6)</p></section>
        <section id="sec2">
                <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 16)</p></section>
<p>Total count of words in document : 28</p></doc>
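
The per-node contributions behind these totals can be checked in isolation. The following is a minimal sketch, not part of the original report: it feeds the text nodes of the sample input to tokenize(., '\W+') as string literals. The $samples variable and the counts/count element names are made up for illustration, and the stylesheet assumes a processor that supports the xsl:initial-template entry point.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	exclude-result-prefixes="xs"
	version="3.0">

	<!-- No source document needed: start at the named template xsl:initial-template. -->
	<xsl:template name="xsl:initial-template">
		<!-- Hypothetical sample strings: the four non-whitespace text nodes of the
		     input, plus one whitespace-only node (&#10; is a newline, &#9; a tab). -->
		<xsl:variable name="samples" as="xs:string*"
			select="'This is a quick test.', 'The quick ', 'brown', ' fox jumped over the lazy dog.', '&#10;&#9;'"/>
		<counts>
			<xsl:for-each select="$samples">
				<!-- A separator at the start or end of the string yields a zero-length token. -->
				<count tokens="{count(tokenize(., '\W+'))}"/>
			</xsl:for-each>
		</counts>
	</xsl:template>

</xsl:stylesheet>
```

The token counts come out as 6, 3, 1, 8 and 2. Every separator that sits at the start or end of a text node adds a zero-length token, and a whitespace-only node consists of nothing but one separator, so it contributes two empty tokens. With five whitespace-only text nodes in the sample input, this accounts for the 6, 16 and 28 shown above.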


Since the accumulators in both spec examples are meant to do the same thing, namely count the words in text nodes, I wonder whether the spec could include a slightly longer but more precise accumulator definition in the form of

	<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
		<xsl:accumulator-rule match="text()">
			<xsl:variable name="words" as="xs:string*">
				<xsl:analyze-string select="." regex="\w+">
					<xsl:matching-substring>
						<xsl:sequence select="regex-group(0)"/>
					</xsl:matching-substring>
				</xsl:analyze-string>
			</xsl:variable>
			<xsl:sequence select="$value + count($words)"/>
		</xsl:accumulator-rule>
	</xsl:accumulator>

once in the spec, and have both spec examples refer to that accumulator definition.

The word count results with a full stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:math="http://www.w3.org/2005/xpath-functions/math" exclude-result-prefixes="xs math"
	version="3.0">
	
	<xsl:mode on-no-match="shallow-copy" streamable="yes"/>
	<xsl:global-context-item use-accumulators="w" streamable="yes"/>
	
	<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
		<xsl:accumulator-rule match="text()">
			<xsl:variable name="words" as="xs:string*">
				<xsl:analyze-string select="." regex="\w+">
					<xsl:matching-substring>
						<xsl:sequence select="regex-group(0)"/>
					</xsl:matching-substring>
				</xsl:analyze-string>
			</xsl:variable>
			<xsl:sequence select="$value + count($words)"/>
		</xsl:accumulator-rule>
	</xsl:accumulator>
	
	<xsl:template match="/*">
		<xsl:copy>
			<xsl:apply-templates/>
			<p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
		</xsl:copy>
	</xsl:template>
	
	<xsl:template match="section">
		<xsl:copy>
			<xsl:apply-templates select="@*"/>
			<xsl:apply-templates/>
			<p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
		</xsl:copy>
	</xsl:template>
	
</xsl:stylesheet>


are then

<?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 5)</p></section>
        <section id="sec2">
                <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 9)</p></section>
<p>Total count of words in document : 14</p></doc>

which seems more natural and correct as a word count.

Comment 1 Michael Kay 2016-09-30 10:00:38 UTC

Thanks for pointing this out and sorry for the delay in responding.

There seem to be two things wrong with count(tokenize(., '\W+'))

(a) it counts 1 for a whitespace text node

(b) for other text nodes, it gives a count that is 1 too high.

One solution would simply be to subtract 1 from the count.

I'm inclined though to use the new XPath 3.1 tokenize#1

<xsl:accumulator-rule match="text()" 
         select="$value + count(tokenize(.))"/>

which gives the correct answer 14 (whether or not we strip whitespace text nodes) without making the example a lot more complicated.
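
As a rough check of that figure, the sketch below applies the one-argument tokenize to the text nodes of the sample input, written as string literals. It is an illustration under assumptions rather than the test case mentioned in this comment: the $samples variable and the element and attribute names are invented here, and the processor must support XPath 3.1 for tokenize#1. The second attribute uses tokenize(normalize-space(.), ' '), which is how the Functions and Operators 3.1 spec defines the one-argument form.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	exclude-result-prefixes="xs"
	version="3.0">

	<!-- No source document needed: start at the named template xsl:initial-template. -->
	<xsl:template name="xsl:initial-template">
		<!-- Hypothetical sample strings: the four non-whitespace text nodes of the
		     input, plus one whitespace-only node (&#10; is a newline, &#9; a tab). -->
		<xsl:variable name="samples" as="xs:string*"
			select="'This is a quick test.', 'The quick ', 'brown', ' fox jumped over the lazy dog.', '&#10;&#9;'"/>
		<counts>
			<xsl:for-each select="$samples">
				<!-- one-arg: XPath 3.1 tokenize#1; two-arg: its definition in terms of tokenize#2. -->
				<count one-arg="{count(tokenize(.))}"
					two-arg="{count(tokenize(normalize-space(.), ' '))}"/>
			</xsl:for-each>
		</counts>
	</xsl:template>

</xsl:stylesheet>
```

Both attributes give 5, 2, 1, 6 and 0. Whitespace-only nodes contribute nothing, so the sections add up to 5 and 9 and the document to 14.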

Added as test accumulator-058.

Comment 2 Martin Honnen 2016-09-30 11:47:31 UTC

(In reply to Michael Kay from comment #1)

> I'm inclined though to use the new XPath 3.1 tokenize#1
> 
> <xsl:accumulator-rule match="text()" 
>          select="$value + count(tokenize(.))"/>
> 
> which gives the correct answer 14 (whether or not we strip whitespace text
> nodes) without making the example a lot more complicated.


That looks much better than the previous approaches using tokenize(., '\W+'). 

It is still easily possible to construct input samples like

<p>He asked:"Does it work?"</p>

where counting \w+ matches with analyze-string would work better, but I agree that the accumulator examples need to be short: they should demonstrate the use of accumulators, not present more complicated attempts at word counting.
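
The difference on that sample can be sketched as follows. This is an illustration only: it uses the fn:analyze-string function rather than the xsl:analyze-string instruction from the description, purely to keep the snippet short, and the variable $s and the counts element are hypothetical names. It assumes a processor with XPath 3.1 support for tokenize#1.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:fn="http://www.w3.org/2005/xpath-functions"
	exclude-result-prefixes="xs fn"
	version="3.0">

	<!-- No source document needed: start at the named template xsl:initial-template. -->
	<xsl:template name="xsl:initial-template">
		<!-- The text content of the sample paragraph. -->
		<xsl:variable name="s" as="xs:string">He asked:"Does it work?"</xsl:variable>
		<!-- whitespace-tokens: tokens separated by whitespace (tokenize#1).
		     word-matches: number of fn:match elements, one per \w+ match. -->
		<counts whitespace-tokens="{count(tokenize($s))}"
			word-matches="{count(analyze-string($s, '\w+')/fn:match)}"/>
	</xsl:template>

</xsl:stylesheet>
```

Splitting on whitespace gives 4 tokens, because asked:"Does stays in one piece, while matching \w+ finds the 5 words a reader would count.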

Comment 3 Michael Kay 2016-10-06 08:26:48 UTC

I have now updated both examples to use count(tokenize(.)), and to mention that the word-count produced is crude.

Thanks for the feedback.

Because the change only affects code examples, it won't be change-highlighted or included in the change log.