19241 – non-utf8 characters in SOAP1.2 output

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 19241 - non-utf8 characters in SOAP1.2 output

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Validator
Classification:	Unclassified
Component:	check
Version:	HEAD
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	This bug has no owner yet - up for the taking
QA Contact:	qa-dev tracking

Reported:	2012-10-03 08:23 UTC by Pavel Janda
Modified:	2015-08-23 07:37 UTC
CC List:	4 users

Attachments
Don't include the forbidden code point (743 bytes, patch) 2014-03-19 22:14 UTC, Michael Fairchild
Replace the invalid character instead of removing the entire line (687 bytes, patch) 2014-03-20 21:02 UTC, Michael Fairchild

Description Pavel Janda 2012-10-03 08:23:12 UTC

Hi all,

First, thank you for your well-done work. However, I found a bug in the SOAP output of the check script:

When an invalid page contains a non-UTF-8 character, that character is passed through into your SOAP output as well, which makes the XML invalid, so I cannot process it in PHP.

Sample page with bad character: http://www.itrebon.cz/ubytovani-v-treboni-a-okoli_78.html

I worked around this by removing the non-UTF-8 characters from your output before creating a SimpleXMLElement, so I am fine for now. I still wanted to let you know, because although the bad character does not originate in your code, I think the SOAP output should always be valid XML.

Thank you very much!
Pavel Janda
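
The workaround described above (strip the offending bytes, then parse) can be sketched roughly as follows. This is a minimal Python illustration, not the reporter's PHP code. It assumes the response is meant to be UTF-8 and treats anything outside the XML 1.0 `Char` production as removable. The function names `sanitize_xml_bytes` and `parse_soap_response` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: characters outside the XML 1.0 "Char" production are the ones
# that make the response unparseable and can simply be dropped.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_xml_bytes(raw: bytes) -> str:
    """Drop byte sequences that are not valid UTF-8, then drop characters
    that XML 1.0 does not allow."""
    text = raw.decode("utf-8", errors="ignore")
    return _XML_ILLEGAL.sub("", text)


def parse_soap_response(raw: bytes) -> ET.Element:
    """Parse a (hypothetical) SOAP response after sanitizing it."""
    return ET.fromstring(sanitize_xml_bytes(raw).encode("utf-8"))
```

Silently dropping bytes loses information about where the bad character was, so this is only suitable as a client-side stopgap.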

Comment 1 Michael Fairchild 2014-03-19 16:46:59 UTC

Bump.

I'm having this issue as well.
See my example script to reproduce it: https://gist.github.com/mfairchild365/9645880

Comment 2 Brett Bieber 2014-03-19 19:37:35 UTC

This problem occurs in the web output as well as the soap12 output.

Perhaps when the error type is "Forbidden code point", the source sample should be altered to remove the invalid code point, or not shown at all.

Comment 3 Michael Fairchild 2014-03-19 22:14:51 UTC

Created attachment 1453
Don't include the forbidden code point

Comment 4 Michael Fairchild 2014-03-20 21:02:18 UTC

Created attachment 1454
Replace the invalid character instead of removing the entire line

This patch replaces the forbidden character with a question mark (?) before it is displayed.  This is an improvement over the last patch, which simply prevented the entire line of context from being displayed.  By showing the line with the question mark, it will hopefully be easier for people to find the location of the character and fix the problem.
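
The idea of masking the character while keeping the rest of the line can be illustrated with a short Python sketch. This is not the attached patch (the validator itself is not written in Python). It assumes "forbidden" can be approximated by the complement of the XML 1.0 `Char` production, and the helper name `mask_forbidden` is hypothetical.

```python
import re

# Assumption: "forbidden code point" is approximated here by any character
# outside the XML 1.0 "Char" production.
_FORBIDDEN = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def mask_forbidden(line, placeholder="?"):
    """Return the line with each forbidden character replaced by the
    placeholder, plus the 1-based columns where replacements happened."""
    columns = [match.start() + 1 for match in _FORBIDDEN.finditer(line)]
    return _FORBIDDEN.sub(placeholder, line), columns
```

Because each character is replaced one-for-one, the masked line keeps its original length, so any reported column numbers still point at the right place.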

Comment 5 Michael[tm] Smith 2015-08-23 07:37:37 UTC

The output=soap12 option is obsolete and no longer maintained; it should no longer be used or relied on.

We recommend instead using the current HTML checker https://validator.w3.org/nu/ with the out=json option.
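
For anyone migrating a script, a request to the checker with out=json might look roughly like the Python sketch below. It assumes the checker's `doc` query parameter for the page URL and a top-level `messages` array whose items carry a `type` field. The helper names and the User-Agent string are hypothetical, so check the checker's own documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

NU_ENDPOINT = "https://validator.w3.org/nu/"


def build_check_url(page_url: str) -> str:
    """Build a checker URL that asks for JSON output for the given page."""
    query = urllib.parse.urlencode({"doc": page_url, "out": "json"})
    return f"{NU_ENDPOINT}?{query}"


def fetch_messages(page_url: str) -> list:
    """Fetch the checker's messages for a page (performs a network call)."""
    request = urllib.request.Request(
        build_check_url(page_url),
        headers={"User-Agent": "example-checker/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response).get("messages", [])


def count_errors(messages) -> int:
    """Count messages whose type is "error" (assumed field name)."""
    return sum(1 for message in messages if message.get("type") == "error")
```

Since the response is JSON rather than XML, a stray character in the checked page should no longer break an XML parser on the client side.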