Python: difficulty converting ASCII to Unicode


My goal: given a URL, fetch the page source and count the instances of a keyword within it.

How I'm doing it: getting the page source via urllib2, looping through each character of the page source, and comparing it against the keyword.

My problem: the keyword is encoded in UTF-8 while the page source is in ASCII, and I run into errors whenever I try to convert between them.

Getting the page source:

    import urllib2
    response = urllib2.urlopen(myurl)
    return response.read()

Comparing the page source and the keyword:

pagesource[i] == keyword[j] 

I need to convert one of these strings to the other's encoding. Intuitively, I felt that going from ASCII (the page source) to UTF-8 (the keyword) would be best and easiest, so:

    pagesource = unicode(pagesource)
    UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)

I'll assume the remote "source page" contains more than just ASCII, otherwise your comparison would already work as is (ASCII is a subset of UTF-8, i.e. "A" in ASCII is 0x41, which is the same in UTF-8).
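To illustrate the point above, here is a minimal sketch of why the ASCII codec fails while an explicit UTF-8 decode succeeds. The byte strings are made-up sample data, not the asker's actual page, and the sketch assumes the page really is UTF-8 encoded.

```python
# Hypothetical sample data standing in for bytes read from the network.
ascii_bytes = b"A"
utf8_bytes = b"price: \xe2\x82\xac5"  # the euro sign takes 3 bytes in UTF-8

# Pure ASCII bytes decode to the same text under both codecs.
same_for_ascii = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Bytes outside the ASCII range make the ASCII codec raise.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding (assumed to be UTF-8 here) decodes cleanly.
text = utf8_bytes.decode("utf-8")
```

In Python 2, calling `unicode(pagesource)` with no encoding argument falls back to the ASCII codec, which is why the error message mentions `'ascii'`.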

You may find the Python requests library easier to work with, as it automatically decodes the remote content to unicode strings based on the server's headers (unicode strings are encoding neutral, so they can be compared without worrying about encoding).

    import requests
    resp = requests.get("http://www.example.com/utf8page.html")
    resp.text
    >> u'my unicode data €'

You will also need to decode your reference data:

    keyword = "€".decode("utf-8")
    keyword
    >> u'€'
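Once both sides are decoded, the counting itself can be sketched as follows. This swaps the character-by-character loop from the question for the built-in `count` method on strings. The function name `count_keyword` and the sample bytes are hypothetical, and UTF-8 is assumed for both inputs.

```python
def count_keyword(page_bytes, keyword_bytes, encoding="utf-8"):
    """Decode both inputs with the same codec, then count matches.

    Matches are non-overlapping, as with any use of the string
    count method.
    """
    page_text = page_bytes.decode(encoding)
    keyword_text = keyword_bytes.decode(encoding)
    return page_text.count(keyword_text)


# Hypothetical sample data: a page fragment containing two euro signs.
sample_page = b"<p>\xe2\x82\xac5 and \xe2\x82\xac10</p>"
sample_keyword = b"\xe2\x82\xac"

total = count_keyword(sample_page, sample_keyword)
```

Comparing decoded text rather than raw bytes means a multi-byte character such as the euro sign is treated as a single unit, so an index-based loop cannot land in the middle of it.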

If you're embedding non-ASCII characters in your source code, you need to declare the encoding you're using. For example, at the top of your source code/script:

# coding=utf-8 
