
Unescape HTML Entities in Python

I'm not a programmer, I mean a real programmer. I do hack code sometimes, mostly in Python, to process files for recurring tasks. Recently I had to read XHTML files and send them to an XML parser (ElementTree).

This piece of code might be useful to someone, and there will certainly be people who think it was programmed with my feet and who will suggest fixes. I usually work in UTF-8, but sometimes there are character references and named entities in my files, so I wanted to convert everything to UTF-8 characters.

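import sys
import xml.etree.ElementTree as ET

def parsefile(path):
   """Read a UTF-8 XHTML file, unescape it and parse it."""
   try:
      file = open(path, "r")
      try:
         fileread = file.read()
      finally:
         # close the file even if reading fails
         file.close()
      # decode to unicode, unescape, then re-encode for the parser
      fileread = unescape(fileread.decode('utf-8')).encode('utf-8')
   except (IOError, UnicodeDecodeError):
      print "Reading File Bug"
      sys.exit(1)
   return ET.fromstring(fileread)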
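
Why the unescape step comes before ET.fromstring can be seen in a small sketch. It is written for Python 3 (an assumption, since the code in this post is Python 2), and the helper name parses is hypothetical. XML itself predefines only a handful of named entities such as &amp;, &lt; and &gt;, so a bare HTML entity like &eacute; is rejected when no DTD declares it, while numeric character references are accepted.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if ElementTree accepts the string as XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Numeric character references and &amp; are fine for an XML parser.
print(parses("<p>caf&#233; &amp; th&#xE9;</p>"))   # True

# An HTML named entity is undefined in plain XML, so parsing fails.
print(parses("<p>caf&eacute;</p>"))                # False
```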

The unescape routine was found on Fredrik Lundh's Web site. His code was doing too much for my purposes because it was also converting &amp;amp;, &amp;gt; and &amp;lt; into &, > and <. I wanted to keep those escaped in URLs and in the places where I have escaped code sections, so I slightly modified it for my own needs.

import re
import htmlentitydefs

def unescape(text):
   """Replaces HTML or XML character references
      and named entities in a unicode string.
      Keeps &amp;, &gt; and &lt; escaped in the source code.
   from Fredrik Lundh
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "value error"
      else:
         # named entity
         try:
            if text[1:-1] == "amp":
               text = "&amp;"
            elif text[1:-1] == "gt":
               text = "&gt;"
            elif text[1:-1] == "lt":
               text = "&lt;"
            else:
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
      return text # leave as is
   return re.sub(r"&#?\w+;", fixup, text)
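
For readers on Python 3, where unichr and htmlentitydefs no longer exist, the same idea might look roughly like the sketch below. It is not Fredrik Lundh's code: the name unescape_keep_markup and the KEEP set are hypothetical, and it adds one behaviour the routine above does not have, namely that numeric references which decode to &, < or > (for example &#38;) are also left alone so the result stays parseable as XML.

```python
import re
from html.entities import name2codepoint

# Named entities that stay escaped so the text remains well-formed XML.
KEEP = {"amp", "lt", "gt"}

def unescape_keep_markup(text):
    """Replace character references and named entities with characters,
    except those standing for &, < and >."""
    def fixup(match):
        ref = match.group(0)
        if ref.startswith("&#"):
            # numeric character reference, decimal or hexadecimal
            try:
                if ref[2] in "xX":
                    char = chr(int(ref[3:-1], 16))
                else:
                    char = chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point: leave as is
            # assumption: keep references to markup characters escaped
            return ref if char in "&<>" else char
        name = ref[1:-1]
        if name in KEEP or name not in name2codepoint:
            return ref  # protected or unknown entity: leave as is
        return chr(name2codepoint[name])
    return re.sub(r"&#?\w+;", fixup, text)

print(unescape_keep_markup("caf&eacute; &amp; cr&#232;me &lt;b&gt;"))
# café &amp; crème &lt;b&gt;
```

In Python 3 the file would be opened with encoding="utf-8", so the explicit decode and encode calls around the unescape step are no longer needed.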

Hope it helps.

Filed by Karl Dubost on April 8, 2008 2:21 AM in Bugs Life, Tools, XML

Comments

Richard Ishida # 2008-04-08

You may also be interested in the Unicode Converter at http://rishida.net/scripts/uniview/conversion which converts cut&pasted text simultaneously between various escape and other forms. The code is in JavaScript and can be viewed by downloading the .js file.

This blog is written by W3C staff and working group participants,
 and maintained by Karl Dubost and olivier Thereaux.