This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 19241 - non-utf8 characters in SOAP1.2 output
Summary: non-utf8 characters in SOAP1.2 output
Status: RESOLVED WONTFIX
Alias: None
Product: Validator
Classification: Unclassified
Component: check
Version: HEAD
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: This bug has no owner yet - up for the taking
QA Contact: qa-dev tracking
Reported: 2012-10-03 08:23 UTC by Pavel Janda
Modified: 2015-08-23 07:37 UTC
CC List: 4 users

Attachments
Don't include the forbidden code point (743 bytes, patch)
2014-03-19 22:14 UTC, Michael Fairchild
Replace the invalid character instead of removing the entire line (687 bytes, patch)
2014-03-20 21:02 UTC, Michael Fairchild

Description Pavel Janda 2012-10-03 08:23:12 UTC
Hi all,

First, thank you for your well-done work. However, I found a bug in the SOAP output of the check script:

When an invalid page contains a non-UTF-8 character, that character is passed through into your SOAP output as well, which makes the XML invalid, so I am not able to work with it in PHP.

Sample page with bad character: http://www.itrebon.cz/ubytovani-v-treboni-a-okoli_78.html

I've worked around this by removing non-UTF-8 characters from your output before creating a SimpleXMLElement, so I am fine for now. I wanted to let you know anyway because, although the mistake does not originate in your code, the SOAP output should contain only valid XML.

Thank you very much!
Pavel Janda
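
A minimal sketch of the workaround described above, written in Python rather than PHP. It assumes that the offending input is either a byte sequence that is not valid UTF-8 or a code point outside the XML 1.0 `Char` production; the function names are hypothetical and are not part of the validator.

```python
import re
import xml.etree.ElementTree as ET

# Code points outside the XML 1.0 "Char" production. They are not allowed
# in an XML document even when they are valid UTF-8 (assumption: this is
# what makes the SOAP response unparseable).
_XML_FORBIDDEN = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_bytes(raw: bytes) -> str:
    """Drop invalid UTF-8 byte sequences, then drop XML-forbidden characters."""
    text = raw.decode("utf-8", errors="ignore")
    return _XML_FORBIDDEN.sub("", text)


def parse_validator_response(raw: bytes) -> ET.Element:
    """Parse a response body after sanitizing it (hypothetical helper)."""
    clean = sanitize_xml_bytes(raw)
    return ET.fromstring(clean.encode("utf-8"))
```

Removing characters silently changes the quoted source sample, so this only suits clients that need a parseable document and do not rely on exact column offsets.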
Comment 1 Michael Fairchild 2014-03-19 16:46:59 UTC
Bump.

I'm having this issue as well.
See my example script to reproduce it: https://gist.github.com/mfairchild365/9645880
Comment 2 Brett Bieber 2014-03-19 19:37:35 UTC
This problem occurs in the web output as well as the soap12 output.

Perhaps when the error type is "Forbidden code point", the source sample should be altered to remove the invalid code point, or not shown at all.
Comment 3 Michael Fairchild 2014-03-19 22:14:51 UTC
Created attachment 1453
Don't include the forbidden code point
Comment 4 Michael Fairchild 2014-03-20 21:02:18 UTC
Created attachment 1454
Replace the invalid character instead of removing the entire line

This patch replaces the forbidden character with a question mark (?) before it is displayed.  This is an improvement over the last patch, which simply prevented the entire line of context from being displayed.  By showing the line with the question mark, it will hopefully be easier for people to find the location of the character and fix the problem.
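
The replacement approach can be illustrated with a short Python sketch. This is not the attached patch (its contents are not shown in this thread); it assumes that "forbidden" means any code point outside the XML 1.0 `Char` production, and `mask_forbidden` is a hypothetical name.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def mask_forbidden(line: str, placeholder: str = "?") -> str:
    """Replace each forbidden character with a placeholder, one for one."""
    return "".join(ch if is_xml_char(ch) else placeholder for ch in line)
```

Because each forbidden character is replaced by exactly one placeholder, the line keeps its length and the reported column still points at the problem position.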
Comment 5 Michael[tm] Smith 2015-08-23 07:37:37 UTC
The output=soap12 option is obsolete and no longer maintained; it should no longer be used or relied on.

We recommend instead using the current HTML checker at https://validator.w3.org/nu/ with the out=json option.
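
A minimal Python sketch of calling the checker with out=json. It assumes the page is publicly reachable so it can be passed through the `doc` query parameter, and that the response is a JSON object with a `messages` list whose entries carry a `type` field; the helper names and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NU_ENDPOINT = "https://validator.w3.org/nu/"


def build_check_url(page_url: str) -> str:
    """Build a checker URL that requests JSON output for page_url."""
    return NU_ENDPOINT + "?" + urlencode({"doc": page_url, "out": "json"})


def fetch_messages(page_url: str) -> list:
    """Fetch the checker's messages for page_url (performs a network call)."""
    request = Request(
        build_check_url(page_url),
        headers={"User-Agent": "example-checker/0.1"},  # hypothetical value
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response).get("messages", [])


def count_errors(messages: list) -> int:
    """Count the messages whose type is "error"."""
    return sum(1 for m in messages if m.get("type") == "error")
```

For example, `count_errors(fetch_messages("http://www.itrebon.cz/ubytovani-v-treboni-a-okoli_78.html"))` would report how many errors the checker finds on the sample page from the original report, assuming that page is still online.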