<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>29752</bug_id>
          
          <creation_ts>2016-07-25 10:26:49 +0000</creation_ts>
          <short_desc>[XSLT30] Two accumulator examples using count(tokenize(., &apos;\s+&apos;)) and count(tokenize(., &apos;\W+&apos;)) to count words give odd results</short_desc>
          <delta_ts>2016-10-06 16:30:39 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>XSLT 3.0</component>
          <version>Candidate Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Martin Honnen">martin.honnen</reporter>
          <assigned_to name="Michael Kay">mike</assigned_to>
          <cc>cmsmcq</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>127049</commentid>
    <comment_count>0</comment_count>
    <who name="Martin Honnen">martin.honnen</who>
    <bug_when>2016-07-25 10:26:49 +0000</bug_when>
    <thetext>The section about accumulators gives two examples that are said to count words, one per section and one for the whole document. The first is in https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after and defines 

&lt;xsl:accumulator name=&quot;w&quot; initial-value=&quot;0&quot; streamable=&quot;true&quot; as=&quot;xs:integer&quot;&gt;
   &lt;xsl:accumulator-rule match=&quot;text()&quot; 
                         select=&quot;$value + count(tokenize(., &apos;\s+&apos;))&quot;/&gt;
&lt;/xsl:accumulator&gt;

and 

&lt;xsl:template match=&quot;section&quot;&gt;
   &lt;xsl:apply-templates/&gt;
   (words: &lt;xsl:value-of select=&quot;accumulator-after(&apos;w&apos;) - accumulator-before(&apos;w&apos;)&quot;/&gt;)
&lt;/xsl:template&gt;

The other is in the section https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples and defines 


  &lt;xsl:accumulator name=&quot;word-count&quot; 
                   as=&quot;xs:integer&quot; 
                   initial-value=&quot;0&quot;&gt;
    &lt;xsl:accumulator-rule match=&quot;text()&quot; 
         select=&quot;$value + count(tokenize(string(.), &apos;\W+&apos;))&quot;/&gt;
  &lt;/xsl:accumulator&gt;
  
and

   &lt;xsl:template match=&quot;/&quot;&gt;
     &lt;xsl:apply-templates/&gt;
     &lt;p&gt;Word count: &lt;xsl:value-of select=&quot;accumulator-after(&apos;word-count&apos;)&quot;/&gt;&lt;/p&gt;
   &lt;/xsl:template&gt;


I realize the examples are supposed to be short and to illustrate the use of accumulators rather than to provide a good word count implementation. However, when I test them on documents containing whitespace text nodes, both approaches give word counts that are clearly higher than what a human reader would count.

For instance, for the input

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;doc&gt;
	&lt;section id=&quot;sec1&quot;&gt;This is a quick test.&lt;/section&gt;
	&lt;section id=&quot;sec2&quot;&gt;
		&lt;p&gt;The quick &lt;b&gt;brown&lt;/b&gt; fox jumped over the lazy dog.&lt;/p&gt;
	&lt;/section&gt;
&lt;/doc&gt;

the complete stylesheet using tokenize(., &apos;\W+&apos;) looks like this

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;xsl:stylesheet xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;
	xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot;
	xmlns:math=&quot;http://www.w3.org/2005/xpath-functions/math&quot; exclude-result-prefixes=&quot;xs math&quot;
	version=&quot;3.0&quot;&gt;
	
	&lt;xsl:mode on-no-match=&quot;shallow-copy&quot; streamable=&quot;yes&quot;/&gt;
	&lt;xsl:global-context-item use-accumulators=&quot;w&quot; streamable=&quot;yes&quot;/&gt;
	
	&lt;xsl:accumulator name=&quot;w&quot; initial-value=&quot;0&quot; streamable=&quot;true&quot; as=&quot;xs:integer&quot;&gt;
		&lt;xsl:accumulator-rule match=&quot;text()&quot; select=&quot;$value + count(tokenize(., &apos;\W+&apos;))&quot;/&gt;
	&lt;/xsl:accumulator&gt;

	&lt;xsl:template match=&quot;/*&quot;&gt;
		&lt;xsl:copy&gt;
			&lt;xsl:apply-templates/&gt;
			&lt;p&gt;Total count of words in document : &lt;xsl:value-of select=&quot;accumulator-after(&apos;w&apos;)&quot;/&gt;&lt;/p&gt;
		&lt;/xsl:copy&gt;
	&lt;/xsl:template&gt;
	
	&lt;xsl:template match=&quot;section&quot;&gt;
		&lt;xsl:copy&gt;
			&lt;xsl:apply-templates select=&quot;@*&quot;/&gt;
			&lt;xsl:apply-templates/&gt;
			&lt;p&gt;(words: &lt;xsl:value-of select=&quot;accumulator-after(&apos;w&apos;) - accumulator-before(&apos;w&apos;)&quot;/&gt;)&lt;/p&gt;
		&lt;/xsl:copy&gt;
	&lt;/xsl:template&gt;
	
&lt;/xsl:stylesheet&gt;


and when run with Saxon 9.7 EE outputs

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;doc&gt;
        &lt;section id=&quot;sec1&quot;&gt;This is a quick test.&lt;p&gt;(words: 6)&lt;/p&gt;&lt;/section&gt;
        &lt;section id=&quot;sec2&quot;&gt;
                &lt;p&gt;The quick &lt;b&gt;brown&lt;/b&gt; fox jumped over the lazy dog.&lt;/p&gt;
        &lt;p&gt;(words: 16)&lt;/p&gt;&lt;/section&gt;
&lt;p&gt;Total count of words in document : 28&lt;/p&gt;&lt;/doc&gt;
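
Working these numbers out by hand for the individual text nodes of the sample input shows where the excess comes from. This is only a sketch, not separate processor output: the counts in the comments follow from the fn:tokenize rule that a separator at the start or end of the input produces a zero-length token, the string literals stand in for the text nodes, and codepoints-to-string((10, 9)) stands in for a whitespace-only node made of a newline and a tab.

```xpath
(
  count(tokenize(&apos;This is a quick test.&apos;, &apos;\W+&apos;)),           (: 6: five words, one empty trailing token :)
  count(tokenize(&apos;The quick &apos;, &apos;\W+&apos;)),                      (: 3: two words, one empty trailing token :)
  count(tokenize(&apos;brown&apos;, &apos;\W+&apos;)),                           (: 1: no separator at either end :)
  count(tokenize(&apos; fox jumped over the lazy dog.&apos;, &apos;\W+&apos;)),  (: 8: six words, one empty token at each end :)
  count(tokenize(codepoints-to-string((10, 9)), &apos;\W+&apos;))      (: 2: two empty tokens, no words :)
)
```

That accounts for the reported figures: section sec2 is 3 + 1 + 8 plus 2 for each of its two whitespace-only text nodes, giving 16, and the document total is 6 + 16 plus 2 for each of the three whitespace-only children of doc, giving 28.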


Since the accumulators in both spec examples are meant to do the same thing, namely count the words in text nodes, I wonder whether it would be possible to include a slightly longer but more precise accumulator definition in the form of

	&lt;xsl:accumulator name=&quot;w&quot; initial-value=&quot;0&quot; streamable=&quot;true&quot; as=&quot;xs:integer&quot;&gt;
		&lt;xsl:accumulator-rule match=&quot;text()&quot;&gt;
			&lt;xsl:variable name=&quot;words&quot; as=&quot;xs:string*&quot;&gt;
				&lt;xsl:analyze-string select=&quot;.&quot; regex=&quot;\w+&quot;&gt;
					&lt;xsl:matching-substring&gt;
						&lt;xsl:sequence select=&quot;regex-group(0)&quot;/&gt;
					&lt;/xsl:matching-substring&gt;
				&lt;/xsl:analyze-string&gt;
			&lt;/xsl:variable&gt;
			&lt;xsl:sequence select=&quot;$value + count($words)&quot;/&gt;
		&lt;/xsl:accumulator-rule&gt;
	&lt;/xsl:accumulator&gt;

once in the spec and have both spec examples reference/use that accumulator definition.

The word count results with a full stylesheet

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;xsl:stylesheet xmlns:xsl=&quot;http://www.w3.org/1999/XSL/Transform&quot;
	xmlns:xs=&quot;http://www.w3.org/2001/XMLSchema&quot;
	xmlns:math=&quot;http://www.w3.org/2005/xpath-functions/math&quot; exclude-result-prefixes=&quot;xs math&quot;
	version=&quot;3.0&quot;&gt;
	
	&lt;xsl:mode on-no-match=&quot;shallow-copy&quot; streamable=&quot;yes&quot;/&gt;
	&lt;xsl:global-context-item use-accumulators=&quot;w&quot; streamable=&quot;yes&quot;/&gt;
	
	&lt;xsl:accumulator name=&quot;w&quot; initial-value=&quot;0&quot; streamable=&quot;true&quot; as=&quot;xs:integer&quot;&gt;
		&lt;xsl:accumulator-rule match=&quot;text()&quot;&gt;
			&lt;xsl:variable name=&quot;words&quot; as=&quot;xs:string*&quot;&gt;
				&lt;xsl:analyze-string select=&quot;.&quot; regex=&quot;\w+&quot;&gt;
					&lt;xsl:matching-substring&gt;
						&lt;xsl:sequence select=&quot;regex-group(0)&quot;/&gt;
					&lt;/xsl:matching-substring&gt;
				&lt;/xsl:analyze-string&gt;
			&lt;/xsl:variable&gt;
			&lt;xsl:sequence select=&quot;$value + count($words)&quot;/&gt;
		&lt;/xsl:accumulator-rule&gt;
	&lt;/xsl:accumulator&gt;
	
	&lt;xsl:template match=&quot;/*&quot;&gt;
		&lt;xsl:copy&gt;
			&lt;xsl:apply-templates/&gt;
			&lt;p&gt;Total count of words in document : &lt;xsl:value-of select=&quot;accumulator-after(&apos;w&apos;)&quot;/&gt;&lt;/p&gt;
		&lt;/xsl:copy&gt;
	&lt;/xsl:template&gt;
	
	&lt;xsl:template match=&quot;section&quot;&gt;
		&lt;xsl:copy&gt;
			&lt;xsl:apply-templates select=&quot;@*&quot;/&gt;
			&lt;xsl:apply-templates/&gt;
			&lt;p&gt;(words: &lt;xsl:value-of select=&quot;accumulator-after(&apos;w&apos;) - accumulator-before(&apos;w&apos;)&quot;/&gt;)&lt;/p&gt;
		&lt;/xsl:copy&gt;
	&lt;/xsl:template&gt;
	
&lt;/xsl:stylesheet&gt;


are then

&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;doc&gt;
        &lt;section id=&quot;sec1&quot;&gt;This is a quick test.&lt;p&gt;(words: 5)&lt;/p&gt;&lt;/section&gt;
        &lt;section id=&quot;sec2&quot;&gt;
                &lt;p&gt;The quick &lt;b&gt;brown&lt;/b&gt; fox jumped over the lazy dog.&lt;/p&gt;
        &lt;p&gt;(words: 9)&lt;/p&gt;&lt;/section&gt;
&lt;p&gt;Total count of words in document : 14&lt;/p&gt;&lt;/doc&gt;

which seems more natural and correct as a word count.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>127612</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2016-09-30 10:00:38 +0000</bug_when>
    <thetext>Thanks for pointing this out and sorry for the delay in responding.

There seem to be two things wrong with count(tokenize(., &apos;\W+&apos;)):

(a) it counts 1 for a whitespace text node

(b) for other text nodes, it gives a count that is 1 too high.

One solution would simply be to subtract 1 from the count.

I&apos;m inclined though to use the new XPath 3.1 tokenize#1

&lt;xsl:accumulator-rule match=&quot;text()&quot; 
         select=&quot;$value + count(tokenize(.))&quot;/&gt;

which gives the correct answer 14 (whether or not we strip whitespace text nodes) without making the example a lot more complicated.
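
As a hand-worked sketch on the text nodes of the sample input (not processor output): tokenize#1 is defined as tokenize(normalize-space($input), &apos; &apos;), so leading and trailing whitespace never produce empty tokens and a whitespace-only node contributes nothing. The string literals below stand in for the text nodes, and codepoints-to-string((10, 9)) stands in for a whitespace-only node.

```xpath
(
  count(tokenize(&apos;This is a quick test.&apos;)),           (: 5 :)
  count(tokenize(&apos;The quick &apos;)),                      (: 2 :)
  count(tokenize(&apos;brown&apos;)),                           (: 1 :)
  count(tokenize(&apos; fox jumped over the lazy dog.&apos;)),  (: 6 :)
  count(tokenize(codepoints-to-string((10, 9))))      (: 0: normalize-space leaves an empty string :)
)
```

The counts sum to 5 + 2 + 1 + 6 + 0 = 14.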

Added as test accumulator-058.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>127618</commentid>
    <comment_count>2</comment_count>
    <who name="Martin Honnen">martin.honnen</who>
    <bug_when>2016-09-30 11:47:31 +0000</bug_when>
    <thetext>(In reply to Michael Kay from comment #1)

&gt; I&apos;m inclined though to use the new XPath 3.1 tokenize#1
&gt; 
&gt; &lt;xsl:accumulator-rule match=&quot;text()&quot; 
&gt;          select=&quot;$value + count(tokenize(.))&quot;/&gt;
&gt; 
&gt; which gives the correct answer 14 (whether or not we strip whitespace text
&gt; nodes) without making the example a lot more complicated.


That looks much better than the previous approaches using tokenize(., &apos;\W+&apos;). 

It is still easily possible to construct input samples like

&lt;p&gt;He asked:&quot;Does it work?&quot;&lt;/p&gt;

where the analyze-string count of \w+ sequences would work better (tokenize(.) finds 4 whitespace-separated tokens in that text, while \w+ matches 5 words). I agree, though, that the accumulator examples need to be short: they are there to demonstrate the use of accumulators, not to present more elaborate attempts at word counting.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>127667</commentid>
    <comment_count>3</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2016-10-06 08:26:48 +0000</bug_when>
    <thetext>I have now updated both examples to use count(tokenize(.)), and to mention that the word-count produced is crude.

Thanks for the feedback.

Because the change only affects code examples, it won&apos;t be change-highlighted or included in the change log.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>