nokogiri - Using the Ruby Mechanize "links_with" to grab text but getting extra content -

- January 15, 2011

When I grab a group of links using Mechanize's links_with method and want the text shown for each link, I'm getting a series of extra characters:

links = @some_page.links_with(text: /v\s.*(bench|earcx)|(bench|earcx).*v/)

links.each do |link|
  link.text
end

The links shown in the browser are "23409bench092834" and "20193bench092339", which is what I want, but when I go to save them in the database they are saved as:

\r\n\t\t\t\t\r\n\t\t\t\t\t 23409bench092834\r\n\t\t\t\t\r\n\t\t\t\t

Where did these characters come from, and what do they represent? I've tried using text and to_s on them, but that isn't getting rid of these random characters.

I think they may be escape codes. If so, how do I remove them?

You failed to give us example HTML showing the markup you're working against. That makes it difficult to help you. Don't do that; help us help you.

Mechanize uses Nokogiri internally and can return the Nokogiri document, so you'll want to use that. From there you're in Nokogiri's domain, which gives you more control over searching.

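Using Mechanize's links_with finds the matching links in the document and returns them as an array of Mechanize::Page::Link objects, each wrapping a Nokogiri node. Those nodes can contain a lot of other nodes, including the whitespace text responsible for the tabs and returns you're seeing. While links_with is useful, you have to be aware of what it's returning so you can react correctly.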
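The characters in the saved value are ordinary whitespace: \r is a carriage return, \n a newline, and \t a tab. As a quick illustration using only core Ruby (the variable names are made up for this sketch), String#strip removes them from both ends of the string:

```ruby
# The value from the question, written as a Ruby double-quoted string
# so the escape sequences become real whitespace characters.
saved = "\r\n\t\t\t\t\r\n\t\t\t\t\t 23409bench092834\r\n\t\t\t\t\r\n\t\t\t\t"

cleaned = saved.strip
cleaned # => "23409bench092834"

# Everything strip removed was whitespace.
only_whitespace = saved.sub(cleaned, '').match?(/\A\s+\z/)
only_whitespace # => true
```

This only cleans up the symptom. Selecting the right node avoids picking up the whitespace in the first place.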

The problem you're seeing is because you're not accessing the right tag when you extract the text, or the text inside the links isn't what you report.

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
|
<p>bar</p>
</body>
</html>
EOT

Extracting text from a tag higher up (a parent) than the exact one you want returns all the text inside that parent:

doc.search('body').text # => "\nfoo\n|\nbar\n"

Notice that it picked up the line breaks and the | between the tags. That's because text returns all the text nodes under the node, not only the ones inside a particular child tag. Being explicit about what you want is important.

Similarly, searching for the p tags returns the text found inside them:

doc.search('p').text # => "foobar"

This doesn't work well, since text concatenates the text of all the nodes found in the NodeSet returned by search, which usually isn't useful.

Instead, find the specific node you want and get its text:

doc.at('p').text # => "foo"

at returns the first matching node and is equivalent to search('p').first.

If you want the text of all the p nodes, iterate over them:

doc.search('p').map(&:text) # => ["foo", "bar"]

In more complex documents you'll have to find a specific landmark in the hierarchy of tags and navigate to it, then search further, but that's a separate issue.

Putting that together, here's a sample that helps visualize what you're encountering and how to deal with it:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
  <a href="http://example.com">
    <span class="hubbub">foo</span>
  </a>
  |
  <a href="http://example.com">
    <span class="hubbub">bar</span>
  </a>
</body>
</html>
EOT

Don't do these:

doc.search('body').text # => "\n  \n    foo\n  \n  |\n  \n    bar\n  \n"
doc.search('a').text # => "\n    foo\n  \n    bar\n  "

Do these:

doc.search('a span').map(&:text) # => ["foo", "bar"]

Or:

spans = doc.search('a').map { |link|
  link.at('span').text
}
spans # => ["foo", "bar"]

The first is faster because it relies on libxml2's code to find the matching span nodes defined by the 'a span' CSS selector. The second is slower but more flexible, and lets you use Ruby to iterate over the tags and peek into them.

See "How to avoid joining all text from Nodes when scraping" also.
