Architectural Proposals

Introduction

This page describes some of the primary architectural considerations driving the next generation of notation standards, and includes a set of proposals intended to address these considerations.

The proposals here do not yet attempt to reflect a consensus view of the way forward. The present purpose is to stimulate discussion and meaningful interchange.

Changing Times in Digital Notation

This section presents some brief thoughts on the changes in the software landscape that have occurred since the birth of MusicXML, as well as an increased awareness of how MusicXML can serve modern musical needs. These thoughts are also informed closely by the User Stories.

MusicXML began as an archival and exchange format in an era of "big applications" such as notation editors integrating proprietary data models, visual rendering and sound synthesis. These still exist, but the future of digital notation will include many "light applications" that work with the same data in specialized ways, but do not wish to take on the enormous burden of rendering and synthesis: they would like libraries or other components to do that work for them.

The music publishing process is increasingly geared to interactive consumption of notated music, with a recognition that these assets may be tailored in various ways by the consumer.

MusicXML began in an era of native operating systems and diverse programming languages. The Web has now become, in effect, a large and totally portable operating system on which network-aware applications run identically everywhere in JavaScript. This doesn't mean that our programming culture is dumping native platforms or discarding compiled programming languages, but we cannot afford to gear an architecture to these cases alone.

MusicXML was born when XML was itself still relatively new. Since then, the world has learned a great deal about how to make the most of XML documents, and how to design XML schemas to maximize this value. Many of the most thoughtful and useful solutions to these problems have come from the domain of the Web, to the point where such standards (HTML, CSS and SVG, to name a few) are the solutions of choice for native offline applications as well.

MusicXML has, laudably, left a great deal of room for adopters to decide which portions of it they will use. This has brought great flexibility to the standard and made it useful in many different contexts. At the same time, it has brought a level of confusion about what approaches and subsets are correct when encoding documents for the many purposes reflected in the User Stories.

We must also acknowledge a significant and younger standard, the Music Encoding Initiative (MEI). MEI already exhibits a number of the suggested approaches for MusicXML that follow in this document, and it offers many useful models and ideas. However, today, neither MEI nor MusicXML reflects some of the key suggestions made here.

Story-driven Requirements

These requirements are drawn from the technical requirements section of the Requirements Matrix page, and are explained further here in an architectural context.

Stable and long-lived standard

Implies that the standard employs components and external standards considered to be the best current solutions in their class.

Separation of semantic, visual and performance data

These three facets of data describing music notation are easily and cleanly separable. Processing that concerns only one layer can easily distinguish the data belonging to that facet.

Multiple rendering styles

Consider a layer of non-semantic data pertaining to some usage context to be a "style": it could be visual, or aural, or both. This requirement states that the semantic data for a document may be associated with multiple styles, both within a given document and over a span of time.

Identification of contexts for visual, performance data

Styles within a document may be pertinent only to a specific medium of consumption, e.g. a mobile phone, or printed output on A4 paper. These contexts need to be identifiable within the document, and visual or performance data may be restricted to a specific context.

Implementable using pure web technology

The standard does not employ concepts or components that are currently or potentially unavailable in a browser environment (for example, VST plugins or native C++ libraries).

Can be created and interpreted programmatically with minimum effort

Encodings do not require undue programming effort to construct or interpret. Simple musical data, with semantic facets only, should map onto simple document contents. Adding or removing visual or performance facets should not require radical increases of effort. As music complexity increases smoothly, the document contents should increase in complexity smoothly.

Note that minimum effort does not only refer to implementation. It also refers to the amount of learning required to understand an encoding, prior to implementation.

Can be incrementally modified programmatically

It is easy for applications to make incremental changes to a data structure representing encoded music. In conjunction with an appropriate rendering library, such changes will be immediately reflected in live changes to a rendition of the music.

Can be styled and highlighted programmatically

It is easy for applications to apply style changes to a data structure representing encoded music. In conjunction with an appropriate rendering library, such changes will be reflected as live highlighting or modification of the music.

Dispatches element-specific notifications of user interaction with score

It is easy for applications to receive event notifications reflecting live user interaction with a rendition of encoded music, e.g. clicking on a note or a measure. In conjunction with an appropriate library, applications can allow users to interact directly with visual and aural elements of rendered music.

Key Observations and Connections

Stable and long-lived standard

Implementable using pure web technology

Can be created and interpreted programmatically with minimum effort

The above suggest the desirability of taking advantage of existing, familiar best-in-class specifications to address clear problem subdomains. Such specifications stand the best chance of being stable, long-lived and implementable on the web, with low effort to both learn and deploy.

Separation of semantic, visual and performance data

Multiple rendering styles

Identification of contexts for visual, performance data

On the web, documents are structured using abstract classes (e.g. 'horizontalLine') and instantiated by imposing a Cascading Style Sheet (CSS) style that determines the precise parameters (width, thickness, length, colour, etc.). Different instances of the same abstract document can be created by combining it with different CSS stylesheet documents containing precise parameters for the various abstract classes in use. Currently, CSS is used for determining visual parameters, but a similar approach is also possible in the temporal domain (performance facet). CSS can be used both in HTML and to override the properties of objects in SVG documents. CSS media queries have addressed the problem of identifying multiple styles and selecting the proper style for a given display context.
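
The idea of one abstract document combined with different stylesheets can be sketched without any real CSS machinery. The following Python sketch is an illustration only: the class name 'horizontalLine' comes from the text above, while the context names, the property values and the resolve_style helper are assumptions made for this example.

```python
# Hypothetical stylesheets: each maps an abstract class name to concrete
# parameters. Names and values are illustrative, not part of any standard.
STYLESHEETS = {
    "print-a4": {"horizontalLine": {"thickness": "0.2mm", "colour": "black"}},
    "mobile": {"horizontalLine": {"thickness": "2px", "colour": "#333"}},
}


def resolve_style(class_name: str, medium: str) -> dict:
    """Return the parameters for an abstract class in a given context.

    This plays the role a CSS media query plays on the web: the document
    only names the class, and the context selects the concrete values.
    """
    return dict(STYLESHEETS[medium].get(class_name, {}))


# The same abstract class, instantiated for two different contexts.
print(resolve_style("horizontalLine", "print-a4"))
print(resolve_style("horizontalLine", "mobile"))
```

The document itself never changes here; only the stylesheet chosen for the context does.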

Can be incrementally modified programmatically

Can be styled and highlighted programmatically

Dispatches element-specific notifications of user interaction with score

The above requirements are roughly identical to the requirements that have driven the construction of the Document Object Model API, a platform- and language-neutral interface for building, changing and interacting with structured documents.

Primary Architectural Recommendations

Based on the above, this section suggests architectural recommendations for the next iteration of notation encoding. While some may be controversial, and almost all are not backwards compatible, they form one possible chain of argument beginning from user stories, deriving requirements, and connecting these to current software development practice in 2016.

Adopt Flexible Profiles for Compliance

As detailed in the Music Notation Use Cases page, the notion of profiles can be very helpful in setting out rules for compliance within subsets of the expressive range of an encoding. Such profiles make it possible to test whether documents are well-formed for some specific purpose. Without profiles, we will be left to argue fruitlessly about how well-formed every document should be, and how one can tell that it is "well-formed enough".

Use CSS for the Visual/Layout Facet of Notation

CSS builds on the strengths of XML, and MusicXML is a natural fit to it. A CSS stylesheet allows the definitions of “how a document looks” to be cleanly separated from the XML substance of “what it is” (its semantics). A CSS style can attach a set of visual properties to any of the following (and more):

The document as a whole
Elements of a specific type, at any level of the document
Elements bearing a specific style name ("class" in CSS parlance)
Elements having a specific combination of attributes
Descendants or children of elements meeting some of the above conditions
Any arbitrary element in the document

CSS is just a structure and an architecture; it does not legislate the set of properties to be used. There are of course definitions today for CSS properties associated with HTML, SVG and other markup languages. Some of these can be borrowed. Examples include:

color
height, width
size
vertical-align
visibility

However, most CSS properties for music will need to be invented. They will be based on the many properties that, today, are MusicXML attributes or elements solely concerned with visual formatting. Some examples may include:

default-x/default-y
note-size
staff-spacing
page size
stem length
slur geometry

With a thoughtful set of choices for CSS music layout properties, the non-stylesheet portion of a notation document will begin to approach pure semantic musical data.

Use CSS for the Performance Facet of Notation

This may be a more surprising recommendation, but it follows the same logic as the preceding one for visual layout information. CSS properties for performance might include:

muting
tempo
dynamics
instrument mappings
relative note onset and duration
musical form for performance

Instead of interspersing performance data into a semantic document, MusicXML can now attach performance interpretation to elements using the spectrum of CSS techniques, liberating semantic markup from this second burden in a musical document. This allows a performance stylesheet to do things such as:

specify a volume increment for all accented notes
indicate whether notes marked as "cue" are to be heard
specify that a specific note's performance is to be delayed by some amount
supply explicit MIDI data giving the actual performance of some passage
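
The first two items in this list can be sketched as rules kept apart from the semantic document. This is a minimal Python sketch, not a proposed syntax: the markup, the accent and class attributes, the rule format and the volume numbers are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup: the document carries no performance data.
SCORE = """
<measure>
  <note id="n1" pitch="C4"/>
  <note id="n2" pitch="E4" accent="true"/>
  <note id="n3" pitch="G4" class="cue"/>
</measure>
""".strip()

# Hypothetical performance "stylesheet": (selector, declarations) pairs.
RULES = [
    (lambda n: n.get("accent") == "true", {"volume-increment": 16}),
    (lambda n: n.get("class") == "cue", {"muted": True}),
]


def performance_for(note):
    """Compute performance properties for one note from the rules above."""
    props = {"volume": 64, "muted": False}  # assumed defaults
    for matches, declarations in RULES:
        if matches(note):
            props["volume"] += declarations.get("volume-increment", 0)
            props["muted"] = declarations.get("muted", props["muted"])
    return props


root = ET.fromstring(SCORE)
plan = {note.get("id"): performance_for(note) for note in root.iter("note")}
print(plan)
```

Swapping RULES for a different list changes the performance without touching the semantic markup, which is the separation this recommendation aims for.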

Interesting questions attach to how performance styling might be separated from visual styling -- or not.

Standardize Key Aspects of Layout and Performance

The preceding recommendations will have far more value if MusicXML moves towards well-defined layout and performance models. These will make clear and testable statements about the exact visual positioning and aural performance of notational elements, and will allow stylesheets to govern the engraving style applied to visual output in a more useful way than at present.

This does not mean that there is no room for invention or improvement of MusicXML rendering or performance. Such statements will, as with other specifications, be distinguished by qualifiers like MUST, SHOULD and MAY (which have well-defined meanings in these contexts). There is lots of room for the implementor's imagination.

Gear Encoding towards the Document Object Model (DOM) Core and Events APIs

The DOM API is the key to liberating music application developers from the complexities of specific music rendering and interaction libraries. By using the DOM API, many music developers in the future (other than those supplying such libraries or writing specialized notation editors) will simply manipulate MusicXML structures directly and leave the rendering and interaction details to a library appropriate to their platform.

This is an indirect recommendation, because the DOM API is compatible with literally any XML document. However, once one embraces this approach of working with XML, there are huge benefits in optimizing the markup language to make the DOM style of programming easier.

When manipulating a DOM that is largely semantic information, it is easiest to simply put in the things you want, and remove the things you don't want. "Bookkeeping" operations become prohibitively complex when edits are taking place directly to a document structure, and the entire structure itself is not always generated from scratch. It is worth noting that parsing and generation of MusicXML today is often implemented as a multi-pass process due to the need to discover or create ancillary state information that drives this "bookkeeping".

Some of the bookkeeping/state problems with MusicXML include:

The "cursor" construct requires forward/backward jumps within a measure. Thus, even localized changes in note content may require corresponding changes to distant cursor jumps.
The <chord> element is a tag on a note, rather than a container of notes. This complicates simple operations such as adding and removing note elements. The placement of the chord tag must constantly be moved around to belong to the first note in any given chord.
The use of separate "start" and "stop" elements correlated only by localized numbering like "1", "2" etc. to represent relationships, e.g. slur/note bindings, means that splicing information from one place in a document to elsewhere may cause clashes, and requires considerable care to maintain the correct endpoints as edits take place.
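
The chord bookkeeping problem can be made concrete with a short sketch. The Python example below uses the standard library's xml.dom.minidom, which implements a subset of the DOM Core interface, on an abbreviated fragment (durations and other required children are omitted). The remove_note and is_chord_member helpers are written for this illustration and are not part of any library.

```python
from xml.dom import minidom

# Three notes forming one chord in current MusicXML: every note after the
# first carries an empty <chord/> child. Abbreviated for illustration.
MEASURE = (
    "<measure>"
    "<note><pitch><step>C</step><octave>4</octave></pitch></note>"
    "<note><chord/><pitch><step>E</step><octave>4</octave></pitch></note>"
    "<note><chord/><pitch><step>G</step><octave>4</octave></pitch></note>"
    "</measure>"
)


def is_chord_member(note):
    """True if the note carries a <chord/> tag."""
    return bool(note.getElementsByTagName("chord"))


def remove_note(note):
    """Remove a note, repairing the <chord/> tag on its successor if needed."""
    following = note.nextSibling
    if (
        not is_chord_member(note)
        and following is not None
        and following.nodeType == minidom.Node.ELEMENT_NODE
        and following.tagName == "note"
        and is_chord_member(following)
    ):
        # The removed note was the chord's first note, so the next note
        # becomes the first and must lose its <chord/> tag.
        tag = following.getElementsByTagName("chord")[0]
        following.removeChild(tag)
    note.parentNode.removeChild(note)


doc = minidom.parseString(MEASURE)
remove_note(doc.getElementsByTagName("note")[0])
print(doc.documentElement.toxml())
```

Deleting one note therefore means inspecting and editing its neighbour as well. Cursor jumps and numbered start/stop pairs call for the same kind of repair.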

These can be addressed by changing and simplifying the topology of MusicXML. Some suggestions include:

Legislate the partwise organization of scores.
Place the sequential rests/notes/chords in each layer (or voice) inside a single container element and eliminate the <forward> and <backup> cursor elements in favor of allowing atomic jumps forward in metrical time within a layer.
Include notes in a single container element to form chords, or generalize this concept to note sets.
Include tuplets in a single container element to represent time modification groupings.
Use XML id and reference constructs to represent elements with distinct endpoints, rather than start/stop pairs.

These changes not only simplify incremental change of documents, but assist in event dispatch and user interaction. For example, all user interaction with one or more notes in a chord will be dispatched via a single chord element; all user interaction with a slur will be dispatched via a single slur element.
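
A minimal sketch of the container and id/reference suggestions follows, again using Python's xml.dom.minidom. The markup is hypothetical and not an agreed encoding: the layer, chord and slur elements, the pitch, start and end attributes, and the element_by_id helper are all assumptions made for this example.

```python
from xml.dom import minidom

# Hypothetical markup: a <chord> container holds its notes, and a <slur>
# points at its endpoints by id instead of using numbered start/stop pairs.
LAYER = (
    "<layer>"
    '<chord id="c1">'
    '<note id="n1" pitch="C4"/>'
    '<note id="n2" pitch="E4"/>'
    '<note id="n3" pitch="G4"/>'
    "</chord>"
    '<note id="n4" pitch="A4"/>'
    '<slur start="c1" end="n4"/>'
    "</layer>"
)


def element_by_id(document, wanted):
    """Find an element by its id attribute.

    minidom's getElementById only sees ids declared through a DTD or
    setIdAttribute, so this sketch searches the tree directly.
    """
    for node in document.getElementsByTagName("*"):
        if node.getAttribute("id") == wanted:
            return node
    return None


doc = minidom.parseString(LAYER)

# Removing the chord's first note is a single DOM operation: no tag moves.
first = element_by_id(doc, "n1")
first.parentNode.removeChild(first)

# The slur still resolves, because it refers to ids rather than positions.
slur = doc.getElementsByTagName("slur")[0]
start = element_by_id(doc, slur.getAttribute("start"))
end = element_by_id(doc, slur.getAttribute("end"))
print(start.tagName, len(start.getElementsByTagName("note")), end.getAttribute("pitch"))
```

Because the chord and the slur are each a single element, a rendering library would also have one obvious target on which to dispatch interaction events for them.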

Gear Encoding towards CSS Selectors

It is not entirely clear what changes this entails, but the next generation of MusicXML can profit by making CSS selectors (the bindings that determine what layout/performance facets apply to some semantic elements) easy to use.

The topological changes above will themselves help, as the styling of a chord or a voice may be determined by its applicability to a single container element, or to a single connecting element rather than its start or stop endpoint.

For example, there may be strong advantages to incorporating meaningful data into attributes rather than child elements since the CSS selector syntax for this is very straightforward. This then allows CSS properties to be attached to all elements with matching attribute values. For example, the type and dottedness of a note might be promoted into the <note> element.
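
The attribute-based approach can be sketched as follows. In this Python example, ElementTree's limited XPath support stands in for a CSS selector engine: the predicate [@type='quarter'] selects what the CSS selector note[type="quarter"] would select. The markup, including the type, dots and pitch attributes on the note element, is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in which note type and dot count are attributes.
MEASURE = ET.fromstring(
    "<measure>"
    '<note type="quarter" dots="1" pitch="C4"/>'
    '<note type="eighth" dots="0" pitch="D4"/>'
    '<note type="quarter" dots="0" pitch="E4"/>'
    "</measure>"
)

# Stand-in for the CSS selector note[type="quarter"].
quarters = MEASURE.findall(".//note[@type='quarter']")

# Stand-in for the CSS selector note[type="quarter"][dots="1"].
dotted_quarters = MEASURE.findall(".//note[@type='quarter'][@dots='1']")

print([n.get("pitch") for n in quarters])
print([n.get("pitch") for n in dotted_quarters])
```

By contrast, when the note type lives in a child element's text, as in today's <type>quarter</type>, a CSS selector has no direct way to match on it, since selectors do not test text content.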

TBD Content

Additional requirements need to be elaborated on the following topics, at the least:

Arbitrary visual/textual extensions
Arbitrary semantic extensions
Arbitrary hierarchy of musical and textual sections
Full bibliographic metadata and citations
Bundling of multiple languages in same document
Inclusion of parts and full score in same document