I'm not a programmer, at least not a real one. I do hack code sometimes, mostly Python, to process files for recurring tasks. This time I had to read XHTML files and send them to an XML parser (ElementTree).
This piece of code might be useful for someone, and there will certainly be people who think it is really badly written and suggest fixes. I usually work in UTF-8, but sometimes there are character references and named entities in my files. So I wanted to convert everything to UTF-8 characters.
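    import sys
    # assumed import: the post only names ElementTree, not where ET comes from
    import xml.etree.ElementTree as ET

    def parsefile(path):
        try:
            file = open(path, "r")
            fileread = file.read()
            # decode the bytes, resolve the entities, re-encode for the parser
            fileread = unescape(fileread.decode('utf-8')).encode('utf-8')
            file.close()
        except (IOError, UnicodeError):
            print "Reading File Bug"
            sys.exit(1)
        return ET.fromstring(fileread)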
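To see why the unescaping step comes before the parser, here is a minimal sketch. It is written for Python 3 (the code in this post is Python 2), and the sample string `raw` and the helper `try_parse` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one named entity, one predefined entity,
# and one numeric character reference.
raw = "<p>caf&eacute; &amp; th&#233;</p>"

def try_parse(source):
    """Return the text of the root element, or None if the parser rejects it."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        return None

# &eacute; is not defined in plain XML, so the parser refuses the document.
rejected = try_parse(raw)

# Once the named entity is replaced by the real character, parsing works.
# The parser itself resolves &amp; and the numeric reference &#233;.
accepted = try_parse(raw.replace("&eacute;", "\u00e9"))
```

Here `rejected` ends up as `None`, while `accepted` holds the decoded text of the paragraph.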
The unescape HTML entities routine was found on Fredrik Lundh's Web site. The original code was doing too much for me because it was also converting &amp;, &gt; and &lt;. I wanted to keep those escaped in URLs and in places where I have escaped code sections. So I slightly modified it for my own needs.
    import re
    import htmlentitydefs

    def unescape(text):
        """Removes HTML or XML character references
        and entities from a text string.
        keep &amp;, &gt;, &lt; in the source code.
        from Fredrik Lundh
        http://effbot.org/zone/re-sub.htm#unescape-html
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    print "value error"
            else:
                # named entity
                try:
                    if text[1:-1] == "amp":
                        text = "&amp;"
                    elif text[1:-1] == "gt":
                        text = "&gt;"
                    elif text[1:-1] == "lt":
                        text = "&lt;"
                    else:
                        print text[1:-1]
                        text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    print "keyerror"
            return text # leave as is
        return re.sub(r"&#?\w+;", fixup, text)
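For anyone on Python 3, the same idea could look roughly like the sketch below. It is an adaptation, not Fredrik Lundh's code: it assumes `html.entities.name2codepoint` and `chr` in place of `htmlentitydefs` and `unichr`, and the `KEEP` set is a name I made up.

```python
import re
from html.entities import name2codepoint

# Hypothetical name: entities that stay escaped so the XML remains well-formed.
KEEP = {"amp", "gt", "lt"}

def unescape(text):
    """Replace character references and named entities with real characters,
    leaving &amp;, &gt; and &lt; untouched."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # numeric character reference, hexadecimal or decimal
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # not a valid code point, leave as is
        name = ref[1:-1]
        if name in KEEP:
            return ref
        codepoint = name2codepoint.get(name)
        # unknown entity names are left as is
        return chr(codepoint) if codepoint is not None else ref
    return re.sub(r"&#?\w+;", fixup, text)

sample = "caf&eacute; &#233; &#xE9; &amp; &lt;b&gt; &bogus;"
result = unescape(sample)
```

Like the routine above, this only protects the named forms: a numeric reference such as `&#38;` is still converted to a bare `&`, so files that contain those would need an extra check.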
Hope it helps.