Python: difficulty converting ASCII to Unicode
My goal: given a page source URL, count the instances of a keyword within the page source.
How I'm doing it: getting the page source via urllib2, looping through each character of the page source, and comparing it to the keyword.
My problem: my keyword is encoded in UTF-8 while the page source is in ASCII... I'm running into errors whenever I try conversions.
Getting the page source:
import urllib2
response = urllib2.urlopen(myurl)
return response.read()
Comparing the page source and the keyword:
pagesource[i] == keyword[j]
I need to convert one of these strings to the other's encoding. Intuitively I felt that converting the ASCII (the page source) to UTF-8 (the keyword) would be best and easiest, so:
pagesource = unicode(pagesource)
UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)
I'll assume the remote "source page" contains more than ASCII, otherwise your comparison would already work (ASCII is a subset of UTF-8, i.e. "A" in ASCII is 0x41, the same as in UTF-8).
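To make the subset point concrete, here is a small sketch (the variable names are just for illustration). It shows that plain ASCII text produces identical bytes under both codecs, while a non-ASCII character such as the euro sign becomes multi-byte UTF-8 that the ASCII codec refuses to decode, which is the same error as in the question:

```python
# -*- coding: utf-8 -*-
# Pure-ASCII text yields the same bytes under both codecs.
ascii_text = u"A"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8") == b"\x41"

# A non-ASCII character needs several bytes in UTF-8 ...
euro_bytes = u"\u20ac".encode("utf-8")

# ... and those bytes cannot be decoded with the ASCII codec.
try:
    euro_bytes.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
```

Calling unicode(pagesource) without naming an encoding falls back to the ASCII codec in Python 2, so any byte above 0x7F in the page triggers that exception.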
You may find the Python requests library easier, as it will automatically decode remote content to unicode strings based on the server's headers (unicode strings are encoding neutral, so they can be compared without worrying about encoding).
import requests
resp = requests.get("http://www.example.com/utf8page.html")
resp.text
>> u'my unicode data €'
You will also need to decode your reference data:
keyword[j] = "€".decode("utf-8")
keyword[j]
>> u'€'
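Putting the two decoding steps together, a rough sketch of the keyword count might look like the following. It assumes both the page and the keyword arrive as UTF-8 bytes; count_keyword, sample_page and sample_keyword are made-up names for illustration, and the encoding default is an assumption you would replace with whatever the server actually declares:

```python
# -*- coding: utf-8 -*-
def count_keyword(page_bytes, keyword_bytes, encoding="utf-8"):
    """Count non-overlapping occurrences of a keyword in a page.

    Both arguments are raw byte strings; `encoding` is an assumption
    about how they were encoded.
    """
    # Decode both sides to unicode first, then compare unicode to unicode.
    page_text = page_bytes.decode(encoding)
    keyword_text = keyword_bytes.decode(encoding)
    return page_text.count(keyword_text)


# Stand-in for the bytes returned by response.read().
sample_page = u"price: 5 \u20ac, shipping: 2 \u20ac".encode("utf-8")
sample_keyword = u"\u20ac".encode("utf-8")
euro_count = count_keyword(sample_page, sample_keyword)
```

Once both strings are unicode, the built-in count method can replace the character-by-character loop, although it only counts non-overlapping matches.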
If you're embedding non-ASCII characters in your source code, you need to define the encoding you're using. For example, at the top of your source code/script:
# coding=utf-8