27001 – Terminology: identity

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.


Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	XQuery 3.1
Version:	Working drafts
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jonathan Robie
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs


Reported:	2014-10-08 19:30 UTC by Jonathan Robie
Modified:	2014-10-28 19:22 UTC
CC List:	5 users

See Also:	26958 27040


Description Jonathan Robie 2014-10-08 19:30:22 UTC

Michael Sperberg-McQueen is asking us to consider the terminology we use surrounding identity.  I am splitting this off from Bug 26958 so that we can have the terminological discussion separately.

(In reply to C. M. Sperberg-McQueen from comment #8)
> I am generally sympathetic to Michael Kay's request that we conduct our
> discussion in terms of data independence and not by splitting hairs over the
> meaning of terms.  I apologize, therefore, for commenting solely on the
> question of terminology and not on the questions of design.  My excuse is
> that I can't contribute to any design discussion if I cannot understand what
> people are saying, and the use of the unqualified term "identity" to mean
> solely "persistent identity across mutation or update" (instead of what I
> understand "identity" to mean) makes it very hard for me to follow some of
> the discussion here.  Also, since I seem to be responsible for making JR
> self-conscious about his usage of the term, I would like to try to show that
> a less misleading usage is possible.
> 
> JR asks, in the initial description of the issue:
> 
>     I do not believe that the value of $z should be changed, so 
>     I think that we should use copy semantics here.  Is there a 
>     good way to say this without referring to identity?
> 
> Yes, there are plenty of ways to say it without any use of the term
> "identity".  There are also plenty of ways to say it that use the term
> "identity" in its conventional English sense, without any notion that
> "identity" applies only to complex mutable objects and does not apply to
> (say) the integers.  
> 
> By "identity" I believe normal English usage means either (a) similarity
> among distinct objects (as in "identical twins") or (b) the property of
> being itself and being distinct from other things.  We really do not want
> sense (a) here or elsewhere.  In sense (b), every thing which we can
> identify necessarily has "identity"; saying that maps, arrays, and elements
> have identity, therefore, is true but not particularly helpful, since it
> doesn't help distinguish them from other constructs in our data model or our
> languages.  What is at issue here, I think, is that we envisage having
> operators whose results depend only on the identity of the maps, arrays, or
> elements to which the operators are applied, or (roughly the same thing in
> different words) we envisage having operators which expose the identity of
> maps and arrays in much the same way that 'is' and '<<' and '>>' expose the
> identity of nodes.
> 
> To test my claim that we can express what we need to express without using
> the term "identity" in the ways I continue to object to, let me suggest
> wordings for some sentences which, I believe, accurately convey the intended
> meaning.
> 
>   - For "Suppose we ultimately decide that maps and arrays have identity,"
> read "Suppose we ultimately decide to expose the identity of maps and
> arrays".
> 
>   - For "(in pseudocode, assuming maps and arrays do have identity)" read
> "(in pseudocode)".  
> 
>   - For "Elements do have identity" read "Elements have node identity".
> 
>   - For "creating a GUID to represent the identity of a map" read "creating
> a GUID to represent the identity of a map" (i.e. no change is needed).
> 
>   - For "to change the semantics of our languages in ways that lose
> identity", I do not know what to write, because I'm not sure what's being
> said.
> 
>   - For "This implies identity" read, perhaps, "This implies some sort of
> identity across updates".
> 
> None of the references to object identity, preserving identity, exposing
> identity, maintaining identity, or changing the identity of nodes needs
> revision, because all of them make perfect sense when "identity" is
> understood as the property of things which makes them identical to
> themselves and different from other things.

Comment 1 Jonathan Robie 2014-10-08 21:17:50 UTC

(In reply to Jonathan Robie from comment #0)
> JR asks, in the initial description of the issue:
> 
>     I do not believe that the value of $z should be changed, so 
>     I think that we should use copy semantics here.  Is there a 
>     good way to say this without referring to identity?

FWIW, I was not asking whether we could create new terminology, I was asking whether we needed to have something analogous to "node identity" for maps and arrays in order to distinguish copy semantics from reference semantics in a constructor.  We rely on node identity for this in the semantics of element constructors, attribute constructors, etc.

> None of the references to object identity, preserving identity, exposing
> identity, maintaining identity, or changing the identity of nodes needs
> revision, because all of them make perfect sense when "identity" is
> understood as the property of things which makes them identical to
> themselves and different from other things.

I don't think this is clear about what it means by "identical to themselves" - does that mean having the same values? Is an object still "identical to itself" if its value changes?  What about "different from other things" - is a node N "different from" a node O if they have exactly the same values?

In the status quo terminology, each node N has a unique node identity that allows it to be distinguished from another node O, even if (1) N has exactly the same values as O, or (2) the values of N change.

Applied to nodes, value comparisons and general comparisons are based on values, not node identity. Node comparisons are based on node identity or document order.  PULs identify changes to nodes by identity.

> Yes, there are plenty of ways to say it without any use of the term
> "identity".  There are also plenty of ways to say it that use the term
> "identity" in its conventional English sense, without any notion that
> "identity" applies only to complex mutable objects and does not apply to
> (say) the integers.  

Extending the term "identity" to integers leads to two different meanings of "identity", and I find that confusing. 

I also find your definition of identity as "the property of things which makes them identical to themselves and different from other things" confusing. I prefer your definition (b) here:

> By "identity" I believe normal English usage means either (a) similarity
> among distinct objects (as in "identical twins") or (b) the property of
> being itself and being distinct from other things.

Our specification uses "identity" much as Grady Booch does when he says:

<quote source="Object-Oriented Analysis and Design with Applications">
An object is an entity that has state, behavior, and identity.
</quote>

The XDM does not have behavior, but it does have state and identity.

<quote source="Object-Oriented Analysis and Design with Applications">
The state of an object encompasses all of the (usually static) properties of the object plus the current (usually dynamic) values of each of these properties.
</quote>

<quote source="Object-Oriented Analysis and Design with Applications">
Identity is that property of an object which distinguishes it from all other objects.
</quote>

Let's consider this with a few examples:

Example 1: Elements

$a := <i>1</i>
$b := <i>1</i>

Are $a and $b "identical"?  We don't license either question. We can ask if they are the same node, which is equivalent to asking if they have the same identity. They are two different nodes.  We can also ask if they have the same value, and they do.  These two questions must be distinguished.

Example 2: Integers

$a := 1
$b := 1

Again, are $a and $b "identical"? They have the same value.  We can't really ask if they have the same identity - that would be equivalent to asking if they are "the same 1", as opposed to "different 1s", which is a rather odd question to ask.

>   - For "Suppose we ultimately decide that maps and arrays have identity,"
> read "Suppose we ultimately decide to expose the identity of maps and
> arrays".

This does not seem clearer to me.  We are asking whether two instances of a map or array that have the same values are distinguishable. 

Example 3: Maps

$a := { "one" : 1 }
$b := { "one" : 1 }

Do $a and $b refer to "the same map" or "different maps"?

We could decide that our data model and our language do not license the question, as for integers - maps are just values, we can ask if they have the same value or not.  Or we could decide that they do license the question - maps can be distinguished from each other. If they are not distinguishable, there is no identity to expose.
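As a loose analogy outside XQuery, the three examples above can be mimicked in Python, where `==` compares values and `is` compares object identity. This is only a sketch under stated assumptions: the `Element` class is a hypothetical stand-in for an element node, and the fact that Python gives its dictionaries identity says nothing about what the XDM should decide for maps.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an element node: a name plus a text value."""
    name: str
    text: str


# Example 1: two separately constructed elements with the same content.
a = Element("i", "1")
b = Element("i", "1")
same_value = (a == b)   # value comparison: the dataclass __eq__ compares fields
same_node = (a is b)    # identity comparison: these are two distinct objects

# Example 2: integers are only compared by value here; asking whether
# x and y are "the same 1" is deliberately not attempted.
x = 1
y = 1
ints_equal = (x == y)

# Example 3: in Python, dictionaries can be distinguished independently of
# their values, i.e. this language chose to "license the question".
m1 = {"one": 1}
m2 = {"one": 1}
maps_equal = (m1 == m2)
maps_same = (m1 is m2)

print(same_value, same_node, ints_equal, maps_equal, maps_same)
```

A language that treated maps purely as values would simply not offer the last comparison.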

>   - For "(in pseudocode, assuming maps and arrays do have identity)" read
> "(in pseudocode)".  

That does not seem clearer to me. It loses the assumption behind the example.  Under a different assumption, the behavior would be different.

>   - For "Elements do have identity" read "Elements have node identity".

Both statements are true.

>   - For "creating a GUID to represent the identity of a map" read "creating
> a GUID to represent the identity of a map" (i.e. no change is needed).

Because a GUID is a value, this is not identity in the sense of node identity. If the value of the GUID were to change, or two different maps were assigned the same GUID, we would lose the ability to distinguish maps independently of their values.

That may be perfectly acceptable, but we should not blur these distinct uses of the term "identity".

>   - For "to change the semantics of our languages in ways that lose
> identity", I do not know what to write, because I'm not sure what's being
> said.

Read "lose the ability to distinguish maps or arrays independently of their values".

>   - For "This implies identity" read, perhaps, "This implies some sort of
> identity across updates".

Again, "identity" is a shorthand for the ability to distinguish maps or arrays independently of their values.

Comment 2 C. M. Sperberg-McQueen 2014-10-09 02:49:04 UTC

Thank you for the quotations from Grady Booch.  A couple questions occur to me in that connection.

  - You and others have argued in the past that we must avoid the term "object identity" because we have no objects and are not defining an OO language.  Given that premise, is it not an inconsistency on your part to suggest that we should follow what you suggest is an explicitly OO usage of the term "identity"?

  - The quotations from Booch seem to me perfectly normal instantiations of the definitions of "identity" found in standard dictionaries of English as "the condition of being ... itself, and not another" (this formulation from American College Dictionary, ed. Clarence Barnhart [New York:  Random House, 1947]).  None of them seem to me to license your conclusion that integers and other immutable things lack identity.  On the contrary, they also can be distinguished from all other things, and thus they seem to fit his characterization of identity.  Does he elsewhere say that integers have no identity, or that 1 is not identical to 1?  Or do you believe that the quotations you give license those conclusions?

Comment 3 Jonathan Robie 2014-10-09 10:47:46 UTC

(In reply to C. M. Sperberg-McQueen from comment #2)
> Thank you for the quotations from Grady Booch.  A couple questions occur to
> me in that connection.
> 
>   - You and others have argued in the past that we must avoid the term
> "object identity" because we have no objects and are not defining an OO
> language.  Given that premise, is it not an inconsistency on your part to
> suggest that we should follow what you suggest is an explicitly OO usage of
> the term "identity"?

Our nodes have state (properties and values) and identity.  They do not have behavior. 

Our specifications talk about properties, values, and identity with the same meaning as that of the OO model, and have done so for a very long time.  The phrase "Each node has a unique node identity" goes back to 2002, and has been used in two recommendations.

>   - The quotations from Booch seem to me perfectly normal instantiations of
> the definitions of "identity" found in standard dictionaries of English as
> "the condition of being ... itself, and not another" (this formulation from
> American College Dictionary, ed. Clarence Barnhart [New York:  Random House,
> 1947]).  None of them seem to me to license your conclusion that integers
> and other immutable things lack identity.  On the contrary, they also can be
> distinguished from all other things, and thus they seem to fit his
> characterization of identity.  Does he elsewhere say that integers have no
> identity, or that 1 is not identical to 1?  Or do you believe that the
> quotations you give license those conclusions?

To Booch, an object has identity, state and behavior. State has properties and values.  An integer is a value.  

His model does not use the term 'identity' with respect to integers.  Neither does ours. To do so, we would have to define what we mean by it. Our specifications have no need to define identity for integers, because the identity of an integer is indistinguishable from its value.  The integer 1 is no longer the integer 1 if you change its value to 2.  Introducing a concept like this would be confusing terminology for people who use 'identity' in the sense of 'object identity' or 'node identity', and I don't think it adds value.

The word "identical" is not a technical term that is defined in our model. I can't answer your question unless we define it.  If it means "has the same value for all properties", I would answer it one way.  If it means "has the same value for all properties and the same unique identity", I would answer it the other way.

Comment 4 C. M. Sperberg-McQueen 2014-10-09 22:09:21 UTC

A note on a point of detail, for the record.  In comment 3, Jonathan Robie writes

   The word "identical" is not a technical term that is defined in our model.

I'm assuming that "in our model" here means "in our specifications". (If not, this remark will be irrelevant.)

In context, I believe JR is referring to the term "identical" as applied to values.  But "identical" is defined for values in our specifications, in section 1.6.4 Properties of functions [1] of Functions and Operators.  And in section 2.3 Node Identity [2], the XDM spec answers explicitly the question JR declines to answer.

[1] https://www.w3.org/XML/Group/qtspecs/specifications/xpath-functions-31/html/Overview.html#properties-of-functions

[2] https://www.w3.org/XML/Group/qtspecs/specifications/xpath-datamodel-31/html/Overview.html#node-identity

Comment 5 Liam R E Quin 2014-10-09 23:30:23 UTC

A lot of the rhetoric around identity has of course come from the implementation of programming languages. In C and C++ style languages, identity is usually implemented as "machine address", but atomic values such as an integer, a character, or a floating-point number don't have a machine address. You can't write 6 to the location where 42 is defined, since there is no such location: indirection is not generally used to access atomic values in such languages.

The pragmatic use for identity becomes, "identity is the property that lets you distinguish two or more things efficiently, and gives you a handle to a thing that may change over time". That does not define identity, of course, except through its properties.

This is not the same usage as mathematical identity, but is very common in programming language design and specification. Not all languages use the term "identity" for this concept, however. Perhaps we should rather say haecceity, "this-one-ness".

Comment 6 Jonathan Robie 2014-10-10 13:50:27 UTC

(In reply to Liam R E Quin from comment #5)

> This is not the same usage as mathematical identity, but is very common in
> programming language design and specification. Not all languages use the
> term "identity" for this concept, however. Perhaps we should rather say
> haecceity, "this-one-ness".

Not all programming languages use the term 'haecceity' for this either ;-> 

But I agree that this discussion is parallel to the philosophical discussion of haecceity and quiddity - see http://plato.stanford.edu/entries/medieval-haecceity/ for a good introduction.

Comment 7 Jonathan Robie 2014-10-10 13:56:43 UTC

(In reply to C. M. Sperberg-McQueen from comment #4)

> In context, I believe JR is referring to the term "identical" as applied to
> values.  But "identical" is defined for values in our specifications, in
> section 1.6.4 Properties of functions [1] of Functions and Operators.  And
> in section 2.3 Node Identity [2], the XDM spec answers explicitly the
> question JR declines to answer.

The term "identical" in Functions and Operators is defined in the context of saying whether two function calls return the same result, but if we want it as a defined term, perhaps it should be defined with a wider scope.

The XDM is indeed quite explicit on this:

Each node has a unique identity. Every node in an instance of the data model is unique: identical to itself, and not identical to any other node. (Atomic values do not have identity; every instance of the value “5” as an integer is identical to every other instance of the value “5” as an integer.)

Indeed, that paragraph seems quite clear to me.

Comment 8 Ghislain Fourny 2014-10-13 09:55:10 UTC

The way I am looking at identity, in this case, is purely from a query/result perspective, that is, about what is visible or exposed to the user, as opposed to any kind of physical identity/address in memory. That is, I define identity not in absolute terms, but "axiomatically", like you can define what a line and a point are in geometry, only by expressing axioms using these words (two points uniquely determine a line, etc.). I remember from my childhood a book in which Euclidean axioms were reformulated with camels and asparagus rather than lines and points, to illustrate the separation between words and meaning :-)

In other words:

1. For XML nodes, I view identity in terms of the "is" operator's returning true or false (and document order operators).

2. If you include updates (be it copy-modify-return or scripting), identity is also exposed, in that two nodes are identical if, applying updates to the one, you see them on the other node as well.

3. If you include a persistent layer, you might notice identity exposure via a change of behavior w.r.t. the above definitions ("is" was returning true and now returns false, or applying updates had and no longer has an effect at several places in the structure, etc).

In the case of maps/arrays, there is no "is" or document order comparison operators, so only updates and persistence reveal the "identity". I think that in that case the definitions above apply as well.
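Point 2 can be sketched in Python as an analogy (not XQuery semantics): an update applied through a container built with reference semantics is visible on the original map, while a container built with copy semantics leaves the original untouched. The helper names `build_by_reference` and `build_by_copy` are hypothetical and exist only for this illustration.

```python
import copy


def build_by_reference(inner):
    """Constructor with reference semantics: the result holds `inner` itself."""
    return {"child": inner}


def build_by_copy(inner):
    """Constructor with copy semantics: the result holds a deep copy of `inner`."""
    return {"child": copy.deepcopy(inner)}


z = {"one": 1}
by_ref = build_by_reference(z)
by_copy = build_by_copy(z)

# Update through the reference-holding container: the change shows up on z,
# which exposes that by_ref["child"] and z are the same map.
by_ref["child"]["two"] = 2
z_after_ref_update = dict(z)

# Update through the copy-holding container: z is unaffected.
by_copy["child"]["three"] = 3
z_after_copy_update = dict(z)

print(z_after_ref_update)
print(z_after_copy_update)
print(by_copy["child"])
```

Under copy semantics no update can reveal whether two maps are "the same", which is why updates and persistence are the only places the question surfaces.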

I'm not 100% sure whether staying at the language semantics level is sufficient, but at least it has been for me so far. I am unsure whether invoking physical-level or OO-programming-level machinery helps or makes the matter more complicated here.

To put it simply, my feeling is that the specifications should still be implementable if we used the word "asparagus" instead of "identity" everywhere :-)

Just my naive perspective. I hope it helps.

Comment 9 Jonathan Robie 2014-10-13 10:36:55 UTC

I'm still not sure if there's a problem that needs to be solved in our specifications.

Is the current terminology problematic? If so, why? If not, I'm in favor of sticking with status quo terminology that has been used since 2002 (12 years now).

Do we need a better description of these terms in our specifications?

Comment 10 Ghislain Fourny 2014-10-13 12:20:14 UTC

I think I am fine with the existing terminology as well.

Although I do feel like the definition here:


"Each node has a unique identity. Every node in an instance of the data model is unique: identical to itself, and not identical to any other node."


is cyclic and informal ("itself", "identical" and "other" kind of recursively rely on the very term "identity" being defined here). But it feels like it's the "is" operator that is implicitly hidden in this sentence so I am fine with it. I wouldn't mind making it explicit though.

Comment 11 Michael Kay 2014-10-13 13:56:43 UTC

How about: "A node has a hidden property referred to as its identity. Many operations that construct new nodes are defined to return a node whose identity is distinct from that of any other node. The identity of nodes is exposed in a number of ways:

* When two expressions return nodes, it is possible to compare the identity of the nodes they return using the "is" operator. If f() is an operation defined to construct new nodes, then the result of the expression "f() is f()" is false.

* Some operators, for example the "union" operator and the path operator "/", are defined to eliminate duplicate nodes, meaning that there will never be two items at different positions in the result sequence that are nodes with the same identity. 

* The function fn:generate-id takes a node as argument: it is guaranteed that when two nodes have the same identity, fn:generate-id applied to those nodes will return equal strings (compared using the Unicode Codepoint Collation), and that when they have different identity, fn:generate-id will return unequal strings.

* In XQuery Update, it is possible to modify properties of a node (for example, adding or removing children, or changing the string value) without changing the node's identity. The semantics of update rely heavily on the concept of node identity: for example, adding two attributes to the same node is fundamentally different from adding them to different nodes."
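The four ways of exposing identity listed above can be mirrored in a small Python analogy. This is a sketch under assumptions, not a statement about any XQuery implementation: the `Node` class, `make_node` (playing the role of f()), and `dedupe_by_identity` are hypothetical names, and Python's built-in `id()` stands in for fn:generate-id only for objects that are alive at the same time.

```python
class Node:
    """Hypothetical node with mutable content; identity is the object itself."""

    def __init__(self, name, text=""):
        self.name = name
        self.text = text


def make_node():
    """Plays the role of f(): constructs a new node on every call."""
    return Node("i", "1")


def dedupe_by_identity(nodes):
    """Keep the first occurrence of each distinct object, ignoring values."""
    seen = set()
    result = []
    for n in nodes:
        if id(n) not in seen:
            seen.add(id(n))
            result.append(n)
    return result


# 1. Comparing identity: two freshly constructed nodes are never the same node.
fresh_is_fresh = make_node() is make_node()

n = make_node()
m = make_node()

# 2. Duplicate elimination works on identity, not on value:
#    n and m have equal content but both survive; the repeated n does not.
unique = dedupe_by_identity([n, m, n])

# 3. An identifier derived from identity: equal for the same node,
#    unequal for different nodes that exist at the same time.
ids_match = (id(n) == id(n))
ids_differ = (id(n) != id(m))

# 4. Updating a property does not change the node's identity.
before = id(n)
n.text = "2"
identity_preserved = (id(n) == before)

print(fresh_is_fresh, len(unique), ids_match, ids_differ, identity_preserved)
```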

Comment 12 Ghislain Fourny 2014-10-13 14:12:44 UTC

I like Mike's suggestion. Thank you for also adding other expressions exposing identity such as duplicate elimination or ID generation.

Comment 13 Jonathan Robie 2014-10-13 21:22:57 UTC

I like Mike's suggestion too.

Which specification should this text live in?

Comment 14 Jonathan Robie 2014-10-28 19:22:39 UTC

The Working Group agrees to close this in favor of the resolution of https://www.w3.org/Bugs/Public/show_bug.cgi?id=27040.