Python: parsing data from a website using regular expressions
I'm trying to parse data from this website: http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written this code:
    import re
    import urllib.request

    name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
    player_id = re.compile('<td><a href="(.+?)" onclick=')
    #player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
    stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
    stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

    url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
    sock = urllib.request.urlopen(url).read().decode("utf-8")

    #li = name.findall(sock)
    name = name.findall(sock)
    player_id = player_id.findall(sock)
    #player_id_num = player_id_num.findall(sock)
    #age = age.findall(sock)
    stat_c = stat_c.findall(sock)
    stat_p = stat_p.findall(sock)
First question: player_id returns the whole URL, "player.asp?playerid=4209661". I am unable to get just the number part. How can I do that? (My attempt is shown in the commented-out #player_id_num line.)
Second question: I am not able to get stat_c when the span class is empty, as in "".
Is there a way these can be resolved? I'm not familiar with re (regular expressions). I have looked at tutorials online, but it's still unclear what I am doing wrong.
The whole task is very simple using the pandas library.
Code:
    import pandas as pd

    url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
    dfs = pd.read_html(url)  # One DataFrame per <table> on the page.

    # print(dfs[3])
    # dfs[3].to_csv("stats.csv")  # Send to a CSV file.
    print(dfs[3].head())
Result:
         0                 1    2  3      4     5     6     7     8      9    10 \
    0  pos              name  age  t     po    fi    co    sy    hr     ra    gl
    1    p    george pacheco   38  r   4858  7484  8090  7888  6777   4353  6979
    2    p     david montoya   34  r   3944  5976  6673  8699  6267   6685  5459
    3    p       robert cole   34  r   5769  7189  7285  5863  6267   5868  5462
    4    p  juanold mcdonald   32  r  69100  5772  4953  4866  5976  67100  5362

         11    12  13       14      15          16
    0    ar    en  rl  fatigue  salary         NaN
    1  3747  6171  -3     100%     ---  $3,672,000
    2  5257  5975  -4      96%      2%  $2,736,000
    3  4953  5061  -4      96%      3%  $2,401,000
    4  5982  5263  -4     100%     ---  $1,890,000
You can apply whatever cleaning methods you want from here onwards. The code is rudimentary, so it's up to you to improve it.
More code:
    import itertools

    import pandas as pd

    url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
    dfs = pd.read_html(url)
    df = dfs[3]  # The "first" stats table.

    # The first row is the actual header.
    # Also, notice the NaN at the end.
    header = df.iloc[0][:-1].tolist()
    # Fix the atrocity of the last column.
    df.drop([15], axis=1, inplace=True)

    # The last row is all NaNs. This particular
    # table should end with Jeremy Dix.
    df = df.iloc[1:-1, :].copy()
    df.columns = header
    df.reset_index(drop=True, inplace=True)

    # pandas cannot create two header rows without the
    # DataFrame turning into a nightmare. Let's
    # try an aesthetic change instead.
    sub_header = header[4:13]
    orig = ["{}{}".format(h, "r") for h in sub_header]
    clone = ["{}{}".format(h, "p") for h in sub_header]

    # Interleave the two lists (idea from http://stackoverflow.com/a/3678930/2548721).
    # chain/zip is used here because the it.next() generator trick is Python 2 only.
    comb = list(itertools.chain.from_iterable(zip(orig, clone)))

    # Construct the new header.
    new_header = header[0:4]
    new_header += comb
    new_header += header[13:]

    # Slow, but it does the job cleanly.
    for s, o, c in zip(sub_header, orig, clone):
        df.loc[:, o] = df[s].apply(lambda x: x[:2])  # First two characters: real value.
        df.loc[:, c] = df[s].apply(lambda x: x[2:])  # The rest: potential value.

    df = df[new_header]  # Drop the other columns.

    print(df.head())
More result:
       pos              name  age  t  por  pop  fir  fip  cor  cop  ...  rap  glr \
    0    p    george pacheco   38  r   48   58   74   84   80   90  ...   53   69
    1    p     david montoya   34  r   39   44   59   76   66   73  ...   85   54
    2    p       robert cole   34  r   57   69   71   89   72   85  ...   68   54
    3    p  juanold mcdonald   32  r   69  100   57   72   49   53  ...  100   53
    4    p      trevor white   37  r   61   66   62   64   67   67  ...   38   48

       glp  arr  arp  enr  enp  rl  fatigue      salary
    0   79   37   47   61   71  -3     100%  $3,672,000
    1   59   52   57   59   75  -4      96%  $2,736,000
    2   62   49   53   50   61  -4      96%  $2,401,000
    3   62   59   82   52   63  -4     100%  $1,890,000
    4   50   70  100   62   69  -4     100%  $1,887,000
Obviously, what I did instead was separate the real values from the potential values. A few tricks were used, but it gets the job done, at least for the first table of players. The next few tables will require some degree of manipulation.
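On the two regex questions themselves, here is a minimal sketch. The html string is a hypothetical sample row inferred from the patterns in the question, so the real page markup may differ; the names player_id_num and stat_pair are only illustrative. The idea is to put the capture group around the digits alone (escaping the "." and "?" metacharacters), and to use [^"]* for the span class so that an empty attribute and a non-empty one both match.

```python
import re

# Hypothetical sample row; the markup on the real page may differ.
html = (
    '<td><a href="player.asp?playerid=4209661" onclick="return false;">'
    'George Pacheco</a></td>'
    '<td class="stat" align="center"><span class="">48</span><br>'
    '<span class="pot">58</span></td>'
)

# Capture only the digits after "playerid=". The "." and "?" are escaped
# because they are regex metacharacters.
player_id_num = re.compile(r'<td><a href="player\.asp\?playerid=(\d+)" onclick=')

# [^"]* accepts zero or more characters, so class="" and class="pot" both match.
# Both values are captured in a single pattern.
stat_pair = re.compile(
    r'<td class="[^"]+" align="[^"]+">'
    r'<span class="[^"]*">(.+?)</span><br>'
    r'<span class="[^"]*">(.+?)</span></td>'
)

ids = player_id_num.findall(html)
stats = stat_pair.findall(html)
print(ids)    # ['4209661']
print(stats)  # [('48', '58')]
```

Two details in the original patterns are worth checking against the real markup. The commented-out player_id_num attempt places the opening quote after playerid= rather than after href=, and leaves the "?" unescaped. The stat_p pattern contains "[^"]+" between the span tags, which only matches text wrapped in literal double quotes.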