This is the main page of the XML Performance Community Group.
Reflection on performance
Convert the document to US ASCII ("US-ASCII") or Unicode ("UTF-8" or "UTF-16") before parsing. Documents written in ASCII are the fastest to parse because each character is guaranteed to be a single byte that maps directly to its equivalent Unicode value. For documents that contain Unicode characters beyond the ASCII range, multi-byte sequences must be read and converted for each character, and this conversion carries a performance penalty. The UTF-16 encoding alleviates some of this penalty because each character is specified using two bytes, assuming no surrogate characters. However, using UTF-16 can roughly double the size of a mostly ASCII document, and the larger document takes longer to read and parse.
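To make the size trade-off concrete, here is a minimal Python sketch that compares how many bytes the same text occupies in UTF-8 and in UTF-16. The function name `encoded_sizes` and the sample strings are illustrative assumptions, not measurements from this group.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # Pure ASCII markup: one byte per character in UTF-8, two in UTF-16.
    print(encoded_sizes("<a>abc</a>"))
    # One non-ASCII character: UTF-8 needs an extra byte for it.
    print(encoded_sizes("<a>\u00e9</a>"))
```

For mostly ASCII markup the UTF-16 form is about twice as large, which is the doubling mentioned above. For text dominated by non-ASCII characters the ratio can be much closer.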
Although canonicalization is not primarily meant for this purpose, we can use it to improve performance.
The XML canonicalization process results in some changes from the original document, among others:
- Encoding of the document in UTF-8.
- Normalization of line breaks to #xA.
- Normalization of the attribute values.
- Substitution of character and parsed entities.
- Removal of the XML declaration.
- Removal of the document type declaration.
- Addition of the default attributes.
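Some of the changes listed above can be observed with a short Python sketch. It assumes Python 3.8 or later, whose standard library provides `xml.etree.ElementTree.canonicalize`. That function implements Canonical XML 2.0, so its output may differ in detail from other canonicalization versions. The helper name `canonicalize_text` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET


def canonicalize_text(raw):
    """Return the canonical form of an XML document given as a string."""
    # With no `out` argument, canonicalize() returns the result as a string.
    return ET.canonicalize(raw)


if __name__ == "__main__":
    raw = '<?xml version="1.0"?>\n<doc b="2"   a="1">&#65;<empty/></doc>'
    # The XML declaration is dropped, the character reference is replaced
    # by the character itself, and the attributes are written in sorted order.
    print(canonicalize_text(raw))
```

A pipeline could canonicalize documents once, when they are stored, so that later parses read a predictable UTF-8 form. Whether that pays off depends on how often each document is parsed again.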
Parsing throughput is roughly 1 to 5 GB per minute, depending on disk speed and other factors.
There are, of course, different definitions of parsing:
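A throughput figure like this can be checked on a given machine with a small timing harness. The sketch below is one possible way to do it in Python, using the standard `xml.sax` module with a handler that ignores every event. The function name `sax_throughput_mb_per_s` and the generated sample are assumptions, and an in-memory measurement like this excludes disk time.

```python
import time
import xml.sax


def sax_throughput_mb_per_s(data):
    """Parse `data` (bytes) with SAX and return the throughput in MB/s."""
    handler = xml.sax.ContentHandler()  # default handler ignores all events
    start = time.perf_counter()
    xml.sax.parseString(data, handler)
    # Guard against a zero reading on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Synthetic document; real documents will behave differently.
    sample = b"<root>" + b"<item>x</item>" * 200000 + b"</root>"
    print(f"{sax_throughput_mb_per_s(sample):.1f} MB/s")
```

Multiplying the result by 60 and dividing by 1024 gives a figure in GB per minute.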
- Parsing the XML only but without any object persistence (StAX Stream)
- Parsing the XML with DOM generation (DOM)
- Parsing the XML with Event generation (StAX Event)
- Parsing the XML with Object generation (JAXB)
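The difference between streaming and tree-building parsers can be sketched in Python, whose standard library offers rough analogues of the Java APIs named above: `xml.etree.ElementTree.iterparse` for event-style streaming and `xml.dom.minidom` for DOM generation. These are stand-ins chosen for illustration, not the StAX or JAXB APIs themselves, and the function names are hypothetical.

```python
import io
import xml.dom.minidom
import xml.etree.ElementTree as ET


def count_streaming(data, tag):
    """Count elements named `tag` while streaming, discarding each one."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # release the element's content once it has been seen
    return count


def count_dom(data, tag):
    """Count elements named `tag` after building the whole DOM in memory."""
    document = xml.dom.minidom.parseString(data)
    return len(document.getElementsByTagName(tag))


if __name__ == "__main__":
    sample = b"<root><item>1</item><item>2</item><item>3</item></root>"
    print(count_streaming(sample, "item"), count_dom(sample, "item"))
```

Both functions return the same count, but the DOM version keeps every node alive until the document object is released, which is why the definition of parsing matters when comparing numbers.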
There are also interesting approaches such as asynchronous parsing, for example in Aalto: http://www.cowtowncoder.com/blog/archives/2011/03/entry_451.html
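The idea of feeding a parser whatever bytes are available, without blocking for the rest, can be sketched with Python's `xml.etree.ElementTree.XMLPullParser`. This is not Aalto's API. It is a standard-library analogue used here only to illustrate the non-blocking style, and the function name `parse_chunks` is an assumption.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Feed XML text chunk by chunk and return the closed tags in order."""
    parser = ET.XMLPullParser(events=("end",))
    tags = []
    for chunk in chunks:
        parser.feed(chunk)  # returns immediately; never waits for more input
        for _event, elem in parser.read_events():
            tags.append(elem.tag)
    parser.close()
    # Events still pending when the parser is closed remain readable.
    for _event, elem in parser.read_events():
        tags.append(elem.tag)
    return tags


if __name__ == "__main__":
    # Chunk boundaries fall in the middle of tags on purpose.
    print(parse_chunks(["<root><a>1</", "a><b>2</b></ro", "ot>"]))
```

In a real asynchronous setting the chunks would come from a socket or an event loop instead of a list.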
Validation is the extra time taken on top of parsing to check the document against a model or a set of rules. The model could be:
- XML Schema
- Relax NG
- Relax NG + Schematron
The complexity of the model or the rules to check can have an impact on the validation time.
The technology used definitely has an impact on the ability to validate on the fly (streaming).
Serialization is twofold: text-based and binary-based.
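As a rough illustration of validating on the fly, the Python sketch below checks a single hand-written rule while the document streams through the parser. The rule (every `item` element must carry an `id` attribute) and the function name are hypothetical. This is not XML Schema, Relax NG, or Schematron, which the Python standard library does not implement.

```python
import io
import xml.etree.ElementTree as ET


def find_items_missing_id(data):
    """Return 1-based positions of <item> elements without an id attribute."""
    missing = []
    position = 0
    events = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    for event, elem in events:
        if event == "start" and elem.tag == "item":
            position += 1
            # Attributes are already available on the start event.
            if "id" not in elem.attrib:
                missing.append(position)
        elif event == "end":
            elem.clear()  # keep memory use flat while streaming
    return missing


if __name__ == "__main__":
    sample = b'<root><item id="a"/><item/><item id="c"/></root>'
    print(find_items_missing_id(sample))
```

A rule that only looks at the current element fits streaming well. A rule that relates distant parts of the document needs state to be kept across events, which is one way the chosen technology limits streaming validation.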