Unicode Encoding Errors Python - Parsing XML can't encode a character (Star)

Problem description

I am a beginner to Python and am currently parsing a web-based XML file from the eventful.com API. However, I am receiving Unicode errors when retrieving certain elements of the data.

I am able to retrieve the 5 data elements I want from the XML file without any problems, but then the program terminates and produces the following error in the GAE error console:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2605' in position 0: ordinal not in range(128)

I know that the character tripping up my parser is a "★" character, which I would prefer not to retrieve from the XML file anyway.
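For reference, the error can be reproduced in isolation. The following is a minimal sketch in Python 3 syntax (the code in this question is Python 2), and the sample title string is made up for illustration:

```python
# Hypothetical title containing U+2605 BLACK STAR, the character named in the error.
title = "\u2605 Live Music Night"

try:
    # The ASCII codec only covers code points 0-127, so this call fails.
    title.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The same failure occurs whenever text containing the star is implicitly converted to ASCII, for example when it is written to an output stream that only accepts ASCII.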

My code is as follows:

import urllib
# assumed import: "mdom" is not defined in the original snippet
from xml.dom import minidom as mdom

import webapp2


class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'my xml file'
        # download the data from the XML file
        response = urllib.urlopen(base_url)
        # read the response body into a string
        data = response.read()

        # close the connection
        response.close()

        # parse the downloaded XML
        dom = mdom.parseString(data)
        node = dom.documentElement
        # print all event names (titles) found in the eventful XML
        event_main = dom.getElementsByTagName('event')

        event_names = []
        for event in event_main:
            eventObj = event.getElementsByTagName("title")[0]
            event_names.append(eventObj)

        for ev in event_names:
            nodes = ev.childNodes
            for node in nodes:
                if node.nodeType == node.TEXT_NODE:
                    print node.data

Is there any way to retrieve the "title" elements while ignoring unusual characters such as the ★ here? I would really appreciate any help on this. I have already tried solutions that use word.encode('us-ascii', 'ignore'), but this does not fix the issue.

-----------I HAVE FOUND THE SOLUTION:

After struggling with this problem and talking to a lecturer about it, I found that all it required was two lines of code to decode and then re-encode the downloaded XML data (after it is read into the program). Hope this helps someone else having the same issue!

unicode_data = data.decode('utf-8')
data = unicode_data.encode('ascii','ignore')
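To see what these two lines do to the star character, here is a small sketch in Python 3 syntax (the lines above are Python 2). The sample XML fragment is hypothetical, and the variable name ascii_data is introduced only for illustration:

```python
# Hypothetical raw bytes, standing in for what response.read() returns.
data = "<title>\u2605 Live Music Night</title>".encode("utf-8")

# Step 1: decode the UTF-8 bytes into text.
unicode_data = data.decode("utf-8")

# Step 2: encode to ASCII; "ignore" silently drops characters that do not fit.
ascii_data = unicode_data.encode("ascii", "ignore")

print(ascii_data)
```

Note that 'ignore' drops every non-ASCII character, not just the star, so accented letters in titles would be removed as well.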

Recommended answer

Where are you using your decoding methods?

I had this error in the past and had to decode the raw data. In other words, I would try doing

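data = response.read()
# close the file
response.close()
# convert the raw string before parsing; strings are immutable,
# so the result has to be assigned back
data = data.encode("us-ascii")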

That is, if it is in fact ASCII. My point is: make sure you encode/decode the raw results while they are still in string format, before you call parseString on them.
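Putting the advice together, a rough end-to-end sketch might look like the following. It uses Python 3 syntax, an inline sample document instead of the real eventful.com response, and a hypothetical helper name (ascii_titles), none of which come from the original posts. It also assumes the feed is UTF-8 encoded:

```python
from xml.dom import minidom

# Hypothetical stand-in for the bytes downloaded from the API.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<events>"
    "<event><title>\u2605 Live Music Night</title></event>"
    "<event><title>Farmers Market</title></event>"
    "</events>"
).encode("utf-8")


def ascii_titles(xml_bytes):
    """Return the text of each event's first <title>, with non-ASCII characters removed."""
    # Clean the raw data while it is still a string, before parsing.
    cleaned = xml_bytes.decode("utf-8").encode("ascii", "ignore")
    dom = minidom.parseString(cleaned)
    titles = []
    for event in dom.getElementsByTagName("event"):
        title = event.getElementsByTagName("title")[0]
        # Join the text nodes directly under <title>.
        text = "".join(
            node.data
            for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        titles.append(text.strip())
    return titles


print(ascii_titles(raw))
```

Cleaning the bytes before parseString means every text node the parser produces is already ASCII-safe, so later output of node.data does not need any further encoding step.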
