Unescape HTML Entities in Python

By: Karl Dubost

Published: 8 April 2008

I'm not a programmer, or at least not a real one. I do hack code sometimes, mostly Python, to process files for recurring tasks. Recently I had to read XHTML files and send them to an XML parser (ElementTree).

This piece of code might be useful to someone, and there will certainly be people who think it was programmed with my feet and who will suggest fixes. I usually work in UTF-8, but sometimes my files contain numeric character references and named entities, so I wanted to convert everything to UTF-8 characters.

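import sys
# Assumed import: the post only names ElementTree, not how it was imported.
import xml.etree.ElementTree as ET

def parsefile(path):
   """Read a UTF-8 XHTML file, unescape its entities and parse it (Python 2)."""
   try:
      source = open(path, "r")
      try:
         # decode to unicode for unescape(), then back to UTF-8 bytes for the parser
         fileread = unescape(source.read().decode('utf-8')).encode('utf-8')
      finally:
         source.close()
   except (IOError, UnicodeDecodeError):
      print "Reading File Bug"
      sys.exit(1)
   return ET.fromstring(fileread)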
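
The reason for unescaping before parsing is that an XML parser given a fragment without a DTD only knows the entities predefined by XML, so a named HTML entity such as &eacute; makes it fail. A minimal Python 3 sketch of that behaviour follows; the sample string is made up for illustration, and a plain string replace stands in for the unescape routine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an XHTML fragment with a named entity that
# XML does not predefine (&eacute;) and one that it does (&amp;).
raw = "<p>caf&eacute; &amp; bar</p>"

try:
    ET.fromstring(raw)
    parsed_directly = True
except ET.ParseError:
    # the parser reports &eacute; as an undefined entity
    parsed_directly = False

# Replacing the entity with the real character lets the fragment parse.
fixed = raw.replace("&eacute;", "\u00e9")
element = ET.fromstring(fixed)
```

In this sketch `parsed_directly` ends up `False`, while `element.text` holds the text with the real character in place of the entity.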

The unescape routine for HTML entities comes from Fredrik Lundh's Web site. It did too much for my purposes because it also converted &amp;, &gt; and &lt; to their literal characters. I wanted to keep those in URLs and in escaped code sections, so I slightly modified it for my own needs.

import re
import htmlentitydefs  # Python 2 module (html.entities in Python 3)

def unescape(text):
   """Removes HTML or XML character references 
      and entities from a text string.
      keep &amp;, &gt;, &lt; in the source code.
   from Fredrik Lundh
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "value error"  # not a valid code point: leave the reference as is
      else:
         # named entity
         try:
            if text[1:-1] == "amp":
               text = "&amp;amp;"
            elif text[1:-1] == "gt":
               text = "&amp;gt;"
            elif text[1:-1] == "lt":
               text = "&amp;lt;"
            else:
               print text[1:-1]
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
            pass
      return text # leave as is
   return re.sub(r"&#?\w+;", fixup, text)
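
For readers on Python 3, where `unichr` and `htmlentitydefs` no longer exist, the same idea can be sketched as follows. This is an adaptation under assumptions, not the code from this post: it uses `chr` and `html.entities.name2codepoint`, and it leaves &amp;, &lt; and &gt; (plus numeric references to those three characters) untouched instead of rewriting them, so that the result stays parseable as XML. `KEEP` and the sample string are hypothetical names.

```python
import re
from html.entities import name2codepoint

# Named entities left untouched (assumption: the output must remain XML).
KEEP = {"amp", "lt", "gt"}

def unescape(text):
    """Replace character references and named entities with characters."""
    def fixup(match):
        ref = match.group(0)
        if ref.startswith("&#"):
            # numeric character reference, hexadecimal or decimal
            try:
                if ref[2:3] in ("x", "X"):
                    char = chr(int(ref[3:-1], 16))
                else:
                    char = chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point: leave as is
            # keep references to markup characters escaped
            return ref if char in "&<>" else char
        name = ref[1:-1]
        if name in KEEP or name not in name2codepoint:
            return ref  # preserved or unknown entity: leave as is
        return chr(name2codepoint[name])
    return re.sub(r"&#?\w+;", fixup, text)

sample = "caf&eacute; &#233; &#xe9; &amp; &lt;b&gt; &bogus;"
result = unescape(sample)
```

With this sample, the three forms of é are all converted, while `&amp;`, `&lt;`, `&gt;` and the unknown `&bogus;` come back unchanged.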

Hope it helps.

Comments (4)

Richard Ishida - 8 April 2008 at 06:41:26 UTC

You may also be interested in the Unicode Converter at http://rishida.net/scripts/uniview/conversion which converts cut&pasted text simultaneously between various escape and other forms. The code is in JavaScript and can be viewed by downloading the .js file.
Renee - 28 August 2008 at 16:19:12 UTC

Hello - I am trying desperately to keep the '&' from converting to '&amp;' in some html/xml. I want to use a URL to pass a couple of querystring parameters to another page. I need to use the '&' for the query string, otherwise it isn't recognized as a parameter. Can you tell me how to keep this conversion from happening automatically?
Here is the URL that I need to keep as is. It is currently just a hyperlink on a page.
"http://MySite/sites/Myweb/mypage/Lists/Reference%20Accounts/Form.aspx?CustID=1&Customer=Some Customer&Source=http://MySite/sites/Myweb/mypage/pages/references.aspx"
The '&' in the above always converts to '&amp;' and this doesn't work.
I'm not much of a developer, and have been wracking my brain for days on this, so any help would be much appreciated!!
Thank you.
- Karl Dubost - 29 August 2008 at 06:45:05 UTC
  
  Hi Renee
  It really depends on the programming language, framework and/or tools you are using to develop the site. I would recommend asking the question in the community around the tools you are using.
  :)
Rak - 17 December 2008 at 07:35:38 UTC

How do I unescape a percent-encoded string using Python? Do you have code for it?

Comments for this post are closed.