Python: parsing data from a website using regular expressions


I'm trying to parse data from this website: http://www.csfbl.com/freeagents.asp?leagueid=2237

I've written the following code:

import urllib.request
import re

name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'

sock = urllib.request.urlopen(url).read().decode("utf-8")

#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)

First question: player_id returns the whole URL, "player.asp?playerid=4209661". I'm unable to get just the number part. How can I do that? (My attempt is described in the commented-out #player_id_num pattern.)
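A rough sketch of one possible fix, assuming the href always has the form player.asp?playerid=NNNNNNN: escape the . and ? metacharacters and put the capture group around the digits only.

# Sketch only: capture just the numeric player id.
# Assumes the href is always of the form player.asp?playerid=NNNNNNN.
player_id_num = re.compile(r'<td><a href="player\.asp\?playerid=(\d+)" onclick=')
ids = player_id_num.findall(sock)  # e.g. ['4209661', ...]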

Second question: I'm not able to get stat_c when the span class is empty ("").
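Again only a sketch, and a guess at the cause since the exact markup isn't shown here: [^"]? only allows an empty or single-character class, so any longer class name makes the match fail, whereas [^"]* accepts a class attribute of any length, including class="".

# Sketch only: [^"]* tolerates class attributes of any length, including class="".
stat_c = re.compile('<td class="[^"]*" align="[^"]*"><span class="[^"]*">(.+?)</span><br><span class="[^"]*">')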

Is there a way these can be resolved? I'm not familiar with re (regular expressions); I've looked at tutorials online but it's still unclear what I'm doing wrong.

This is very simple to do using the pandas library.

Code:

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)

# print dfs[3]
# dfs[3].to_csv("stats.csv")  # Send to a CSV file.
print dfs[3].head()

Result:

    0                 1    2  3      4     5     6     7     8      9     10  \
0  pos              name  age  t     po    fi    co    sy    hr     ra    gl
1    p    george pacheco   38  r   4858  7484  8090  7888  6777   4353  6979
2    p     david montoya   34  r   3944  5976  6673  8699  6267   6685  5459
3    p       robert cole   34  r   5769  7189  7285  5863  6267   5868  5462
4    p  juanold mcdonald   32  r  69100  5772  4953  4866  5976  67100  5362

     11    12  13       14      15          16
0    ar    en  rl  fatigue  salary         nan
1  3747  6171  -3     100%     ---  $3,672,000
2  5257  5975  -4      96%      2%  $2,736,000
3  4953  5061  -4      96%      3%  $2,401,000
4  5982  5263  -4     100%     ---  $1,890,000

You can apply whatever cleaning methods you want from here onwards. The code is rudimentary; it's up to you to improve it.

More code:

import pandas as pd
import itertools

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3]  # The "first" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix the atrocity of the last column.
df.drop([15], axis=1, inplace=True)

# The last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1, :]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create 2 header rows without the
# DataFrame turning into a nightmare. Let's
# try an aesthetic change instead.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# http://stackoverflow.com/a/3678930/2548721
comb = [iter(orig), iter(clone)]
comb = list(it.next() for it in itertools.cycle(comb))

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow, but done cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header]  # Drops the other columns.

print df.head()

More result:

  pos              name age  t por  pop fir fip cor cop     ...      rap glr  \
0   p    george pacheco  38  r  48   58  74  84  80  90     ...       53  69
1   p     david montoya  34  r  39   44  59  76  66  73     ...       85  54
2   p       robert cole  34  r  57   69  71  89  72  85     ...       68  54
3   p  juanold mcdonald  32  r  69  100  57  72  49  53     ...      100  53
4   p      trevor white  37  r  61   66  62  64  67  67     ...       38  48

  glp arr  arp enr enp  rl fatigue      salary
0  79  37   47  61  71  -3    100%  $3,672,000
1  59  52   57  59  75  -4     96%  $2,736,000
2  62  49   53  50  61  -4     96%  $2,401,000
3  62  59   82  52  63  -4    100%  $1,890,000
4  50  70  100  62  69  -4    100%  $1,887,000

Obviously, what I did instead was separate the real values from the potential values. The tricks used get the job done, at least for the first table of players. The next few tables will require some degree of manipulation.
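A minimal sketch for taking stock of those remaining tables, assuming dfs is still the list returned by pd.read_html above:

# Sketch: inspect every table read_html found, to decide which
# index holds which table and what cleaning it will need.
for i, table in enumerate(dfs):
    print i, table.shape
    # print table.head()  # Uncomment to eyeball a table before cleaning it.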

