This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3949 - [FO11] New regex function to match and return a list of captured strings
Summary: [FO11] New regex function to match and return a list of captured strings
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 3.0
Version: Working drafts
Hardware: PC Windows XP
Importance: P2 enhancement
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-03 23:12 UTC by Radu Serban
Modified: 2009-10-12 22:30 UTC

Description Radu Serban 2006-11-03 23:12:15 UTC
I am working on a project using XSLT 2.0, XQuery 1.0 and XPath 2.0. The project involves heavy use of regexes and I really miss one feature: I would like to be able to match a regex against a string and get back a list of captured strings.
For example, I should be able to call regex-match('2006 Location', '(\d{4})\s+(.*)') and get back a list of matches ('2006', 'Location'). The closest existing function is replace, but it causes problems when the regex doesn't match the string: instead of returning an empty string, it returns the original string, which forces me to do an extra check.
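Editorial sketch, not part of the original report: the extra check described above can be wrapped in a small XSLT 2.0 function. The name my:group-or-empty, its namespace URI, and the anchored form of the regex are assumptions made for illustration only. The sketch also assumes a processor that can start at a named template (here called main).

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper (not a standard function): returns captured group
       $n of $regex, or '' when $regex does not match $input at all.
       Assumes $regex is anchored with ^ and $ so that it matches the whole
       input and replace() leaves nothing but the requested group. -->
  <xsl:function name="my:group-or-empty" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:param name="n" as="xs:integer"/>
    <xsl:sequence select="if (matches($input, $regex))
                          then replace($input, $regex, concat('$', $n))
                          else ''"/>
  </xsl:function>

  <xsl:template name="main">
    <results>
      <!-- Plain replace(): a non-matching input comes back unchanged. -->
      <plain>
        <xsl:value-of select="replace('no year here', '^(\d{4})\s+(.*)$', '$1')"/>
      </plain>
      <!-- Guarded call: the same non-matching input gives an empty string. -->
      <guarded>
        <xsl:value-of select="my:group-or-empty('no year here', '^(\d{4})\s+(.*)$', 1)"/>
      </guarded>
      <!-- Matching input: group 2 is returned. -->
      <place>
        <xsl:value-of select="my:group-or-empty('2006 Location', '^(\d{4})\s+(.*)$', 2)"/>
      </place>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

Each group still needs its own call, which is the verbosity the request is about.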
Comment 1 Frans Englich 2006-11-05 16:19:19 UTC
In what way doesn't fn:tokenize() work for you?

http://www.w3.org/TR/xpath-functions/#func-tokenize
Comment 2 Radu Serban 2006-11-05 16:50:17 UTC
(In reply to comment #1)
> In what way doesn't fn:tokenize() work for you?
> 
> http://www.w3.org/TR/xpath-functions/#func-tokenize
> 

The function tokenize() returns the strings that are not matched by the regex. I need the groups that _are_ matched. Maybe I should have given an example in the first place: I am processing some data in XSLT and I need to extract birth dates and places for some people. The problem is that the data comes in different formats. Some examples are:
(1900)
(1900 place 1)
(1900 - 2000)
(1900 place 1 - 2000)
(1900 place 1 - 2000 place 2)
(1900 - 2000 place 2)
So the pieces of information are the years of birth and death and the places of birth and death. Right now I have to use four fn:replace() calls to obtain the needed info, which makes the code more verbose (either repeat the expression for each replace or put it in a variable and use it for all four replace calls). Also, if the expression does not match, nothing is replaced and you get the original string back. Sometimes I need an empty string instead, so I have to add a matches() call and do the replace only if the regex actually matches the string.

It would be much easier if a single regex could be used in a function called, for example, match-groups(., '^\((\d{4})(.*?)-?(\d{4})?(.*?)\)$'), and this function would return all the matched groups, empty or not. For the last example it should return ('1900', ' ', '2000', ' place 2'). That way you have all the information ready, in a simple way.

I hope I was clearer this time.
Best regards,
Radu
Comment 3 Michael Kay 2006-11-06 16:01:52 UTC
In XSLT 2.0, the xsl:analyze-string instruction meets your needs.

We limited the functionality available in XPath and XQuery to what could be achieved with a simple function call; the XQuery WG felt that this level of capability would be sufficient for release 1. The XSL WG however decided to add some custom syntax (and an extension to the dynamic context) to handle more complex use cases: these require either custom syntax, or new facilities such as higher-order functions or nested sequences (both of which are under consideration for a future version of the language).

I hope the XSLT 2.0 facility will meet your needs for the time being. We're not accepting any new requirements for the current round of specifications, but we've already started formulating a requirements list for the next round.

Michael Kay
(personal response)
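Editorial sketch, not part of the original comment: one way the xsl:analyze-string suggestion might be applied to the reporter's first example. The function name my:year-and-place and its namespace URI are hypothetical. The sketch assumes an XSLT 2.0 processor that can start at a named template (here called main).

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: returns the two captured groups of the reporter's
       example regex, or the empty sequence when the regex does not match. -->
  <xsl:function name="my:year-and-place" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <!-- The regex attribute is an attribute value template, so the literal
         braces of \d{4} have to be doubled. -->
    <xsl:analyze-string select="$input" regex="(\d{{4}})\s+(.*)">
      <xsl:matching-substring>
        <xsl:sequence select="regex-group(1), regex-group(2)"/>
      </xsl:matching-substring>
      <!-- No xsl:non-matching-substring: unmatched text is discarded. -->
    </xsl:analyze-string>
  </xsl:function>

  <xsl:template name="main">
    <result>
      <!-- For this input the sequence holds '2006' and 'Location'. -->
      <xsl:value-of select="my:year-and-place('2006 Location')" separator="|"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

Because a non-matching input produces the empty sequence instead of the original string, no separate matches() guard is needed.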
Comment 4 Radu Serban 2006-11-06 18:39:25 UTC
(In reply to comment #3)
> In XSLT 2.0, the xsl:analyze-string instruction meets your needs.

Indeed it successfully does what I need. I overlooked it since I was searching only in the XQuery specifications.
Thanks a lot for your reply,
Radu Serban
Comment 5 Jim Melton 2008-03-14 20:40:14 UTC
Recategorized as 1.1 feature request
Comment 6 Michael Kay 2009-02-16 17:29:44 UTC
A proposal for an analyze-string() function to meet this requirement in XQuery is at

http://lists.w3.org/Archives/Member/w3c-xsl-wg/2009Jan/0053.html (member-only)

See also subsequent email discussion.

There was also some discussion at a WG telcon but IIRC it was informal and unminuted.
Comment 7 Jim Melton 2009-07-03 20:40:37 UTC
In meeting #402 on 2009-06-09, the WGs adopted two proposals contained in member-only email messages http://lists.w3.org/Archives/Member/w3c-xsl-query/2009May/0048.html and http://lists.w3.org/Archives/Member/w3c-xsl-query/2009May/0050.html. Those proposals added non-capturing groups and a new F&O function analyze-string() to F&O 1.1. 

As a result of this action, I am marking this bug RESOLVED FIXED. 

If you agree that this action resolves your comment, please mark the bug CLOSED. 
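Editorial sketch, not part of the original comment: how the requested match-groups() might be written on top of fn:analyze-string as that function was eventually published in Functions and Operators 3.0. The member-only proposals are not quoted here, so the draft adopted in 2009 may have differed in detail. The name my:match-groups and its namespace URI are hypothetical, and the sketch assumes an XSLT 3.0 processor that can start at a named template (here called main).

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs fn my">

  <!-- Hypothetical helper: string values of the top-level captured groups
       of the first match, or the empty sequence when nothing matches. -->
  <xsl:function name="my:match-groups" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <!-- analyze-string() returns an fn:analyze-string-result element whose
         fn:match children contain fn:group elements carrying an nr attribute. -->
    <xsl:sequence
        select="analyze-string($input, $regex)/fn:match[1]/fn:group/string(.)"/>
  </xsl:function>

  <xsl:template name="main">
    <result>
      <xsl:value-of
          select="my:match-groups('2006 Location', '(\d{4})\s+(.*)')"
          separator="|"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

The helper selects only fn:group elements that are direct children of fn:match, so nested groups would need a different path. A single group can be addressed by number with a predicate such as fn:group[@nr = 1].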
Comment 8 Michael Kay 2009-10-12 22:30:52 UTC
Since functionality to meet this requirement has been defined and is specified in the current editor's draft, I am marking this as closed.