Python - Remove content between <div> and <a href> with Beautiful Soup
I have a piece of code that parses web pages. I want to remove the content between the div, a href, and h1 tags.
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    url = "http://en.wikipedia.org/wiki/viscosity"
    try:
        ourUrl = opener.open(url).read()
    except Exception, err:
        pass
    soup = BeautifulSoup(ourUrl)
    dem = soup.findAll('p')
    for i in dem:
        print i.text
I want to print the text without the content between the h1 and a href tags mentioned above.
Edit: From your comment, "I want to return the text that is not between <div> and </div> tags." This should strip out any blocks whose parent has a div tag:
    import bs4

    raw = '''
    <html>
    text
    <div> avoid </div>
    <p> nested <div> don't me either </div> </p>
    </html>
    '''

    def check_for_div_parent(mark):
        # Walk up the tree until we hit a <div> (reject) or <html> (accept)
        mark = mark.parent
        if 'div' == mark.name:
            return True
        if 'html' == mark.name:
            return False
        return check_for_div_parent(mark)

    soup = bs4.BeautifulSoup(raw)
    # Visit every text node and print only those with no <div> ancestor
    for text in soup.findAll(text=True):
        if not check_for_div_parent(text):
            print text.strip()
This results in the two pieces of text that sit outside the div tags; the div ones are ignored:
    text
    nested
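For readers on Python 3 with a current `bs4`, the same filter can be sketched without the recursive helper by asking each text node whether it has a `div` ancestor via `find_parent`. This is only an illustration under assumptions: the sample markup, the function name `text_outside_divs`, and the choice of the built-in `html.parser` backend are hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
RAW = """
<html>
keep this
<div> skip this </div>
<p> keep nested <div> skip this too </div> </p>
</html>
"""

def text_outside_divs(html):
    """Return the stripped, non-empty text nodes that have no <div> ancestor."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for node in soup.find_all(string=True):
        # find_parent returns None when no enclosing <div> exists
        if node.find_parent("div") is None and node.strip():
            kept.append(node.strip())
    return kept

for line in text_outside_divs(RAW):
    print(line)
```

Unlike the recursive version, this also skips whitespace-only text nodes, so no blank lines are printed.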
Original response
It's unclear what you are trying to do exactly. First up, you should try to post a full working example, as you seem to be missing headers. Secondly, Wikipedia seems to have a stance against "bots" or automated downloaders:
Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?
This can be avoided with the following lines of code:
    import urllib2, bs4

    url = r"http://en.wikipedia.org/wiki/viscosity"
    # Send an explicit User-Agent so the request is not rejected with a 403
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    con = urllib2.urlopen( req )
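The snippet above is Python 2. On Python 3, where `urllib2` was folded into `urllib.request`, a rough equivalent might look like the sketch below. The helper names `build_request` and `fetch` are hypothetical, and `fetch` is deliberately not called here because it needs network access.

```python
import urllib.request

URL = "http://en.wikipedia.org/wiki/viscosity"

def build_request(url, user_agent="Magic Browser"):
    """Create a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Download the page body as bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as con:
        return con.read()

req = build_request(URL)
# Request stores header names capitalized, hence "User-agent"
print(req.get_header("User-agent"))
```

Calling `fetch(URL)` would then return the raw page bytes, which can be passed to Beautiful Soup in the same way as `con.read()`.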
Now that we have the page, I think you just want to extract the main text using bs4. You could do it like this:
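    soup = bs4.BeautifulSoup(con.read())
    # Start from the element that contains the page's <h1> heading
    start_pos = soup.find('h1').parent
    for p in start_pos.findAll('p'):
        # Join every text node inside the paragraph, dropping the markup
        para = ''.join([text for text in p.findAll(text=True)])
        print para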
This gives me text that looks like:
The viscosity of a fluid is a measure of its resistance to gradual deformation by shear stress or tensile stress. For liquids, it corresponds to the informal notion of "thickness". For example, honey has a higher viscosity than water.[1] Viscosity is due to friction between neighboring parcels of the fluid that are moving at different velocities. When fluid is forced through a tube, the fluid generally moves faster near the axis and very slowly near the walls, therefore some stress (such as a pressure difference between the two ends of the tube) is needed to overcome the friction between layers and keep the fluid moving. For the same velocity pattern, the stress required is proportional to the fluid's viscosity. A liquid's viscosity depends on the size and shape of its particles and the attractions between the particles.[citation needed]
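The paragraph extraction can also be tried offline. The sketch below assumes Python 3 and a current `bs4`, and uses `get_text()` in place of joining the text nodes by hand. The sample markup and the function name `paragraphs_near_heading` are hypothetical stand-ins for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page
SAMPLE = """
<html><body>
<div id="content">
  <h1>Title</h1>
  <p>First <b>bold</b> sentence.<sup>[1]</sup></p>
  <p>Second sentence with a <a href="#">link</a> inside.</p>
</div>
<p>Footer paragraph outside the article.</p>
</body></html>
"""

def paragraphs_near_heading(html):
    """Return the text of each <p> under the element that contains the <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("h1").parent
    # get_text() concatenates all text nodes inside a tag, markup removed
    return [p.get_text() for p in container.find_all("p")]

for para in paragraphs_near_heading(SAMPLE):
    print(para)
```

Starting from the parent of the `h1` is what keeps the footer paragraph out of the result; on a real page the right container depends on that page's layout.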