Using BeautifulSoup with Python

February 22, 2014

What can you do with BeautifulSoup?
We can use this module to pick apart the DOM of an HTML page, and to add elements to the DOM or remove them from it.
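
To make the "add or remove elements" part concrete, here is a minimal sketch. The HTML snippet and the example.com URL are made up for illustration, and the parser is named explicitly as Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<div id='box'><p class='old'>remove me</p><p>keep me</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Remove an element (and everything inside it) from the tree.
soup.find("p", class_="old").decompose()

# Create a new element and append it inside the <div>.
link = soup.new_tag("a", href="http://example.com")
link.string = "a new link"
soup.find("div").append(link)

result = str(soup)
print(result)
```

`decompose()` destroys the tag it is called on; `extract()` is the alternative if you want to take the tag out of the tree and keep using it.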

Installing BeautifulSoup
>> pip install beautifulsoup4

Let's try it out
>> from bs4 import BeautifulSoup

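I'll load a web page from Google and take a look. (This is Python 2; on Python 3, urlopen lives in urllib.request instead of urllib2.)
>> import urllib2
>> p = urllib2.urlopen('http://google.com')
>> soup = BeautifulSoup(p, 'html.parser')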

Print it and you'll see the full content of the page
>> print(soup)
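
If you just want to experiment without hitting the network, BeautifulSoup also accepts a plain HTML string. A small sketch, assuming a made-up page and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for a downloaded page.
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string   # text inside the <title> tag
text = soup.get_text()      # all text in the document, with tags stripped
pretty = soup.prettify()    # the markup re-indented, one tag per line

print(title)
print(pretty)
```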

Now suppose we run this:

>> soup.find_all('table')

This gives us a list of all the tables on that page. The soup object has a lot of find_* functions. For example, if we do this:

>> soup.find_all('table')[0]

The result is a BeautifulSoup element object (a Tag), and we can read its attributes like this:

>> t0 = soup.find_all('table')[0]
>> print(t0.attrs)
{'cellpadding': '0', 'cellspacing': '0'}
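
As a rough sketch of the difference between `find_all` and `find`, and of a few ways to read a single attribute: the two tables below are invented for the example, and only the first mirrors the attributes shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables.
html = """
<table cellpadding="0" cellspacing="0"><tr><td>first</td></tr></table>
<table id="second"><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # every match, as a list-like ResultSet
first = soup.find("table")       # only the first match, or None if there is none

cellpadding = first["cellpadding"]  # dict-style access; KeyError if the attribute is missing
border = first.get("border")        # .get() returns None when the attribute is missing
second_id = tables[1].get("id")

print(first.attrs)
```

`soup.find('table')` is usually the shorter way to write `soup.find_all('table')[0]`, with the difference that it returns None instead of raising IndexError when nothing matches.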

Personally, I use bs (BeautifulSoup) to pull links out of a page so I can download them afterwards, because the result of calls like the ones above can be looped over, like this:

for hidden in soup.find_all('input', {'type': 'hidden'}):
    print(hidden.attrs)

This gives us a dict of attributes for each input element object, like this:

{'type': 'hidden', 'name': 'ie', 'value': 'windows-874'}
{'type': 'hidden', 'name': 'hl', 'value': 'th'}
{'type': 'hidden', 'name': 'source', 'value': 'hp'}
{'type': 'hidden', 'id': 'gbv', 'value': '1', 'name': 'gbv'}
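
One thing you might do with these dicts is collapse them into a single name-to-value mapping, for example to resubmit a form. This is only a sketch: the form below is a made-up, shortened stand-in for the real page, and `fields` is a name chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical form, loosely modelled on the hidden inputs shown above.
html = """
<form action="/search">
  <input type="hidden" name="ie" value="windows-874">
  <input type="hidden" name="hl" value="th">
  <input type="text" name="q">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

fields = {}
for tag in soup.find_all("input", {"type": "hidden"}):
    name = tag.get("name")
    if name is not None:  # skip inputs that have no name attribute
        fields[name] = tag.get("value", "")

print(fields)
```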

bs is a great tool for doing fun things along the lines of web crawling. Heh heh heh.
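
As a starting point for that kind of crawling, here is a minimal sketch of collecting the links on a page. It is written for Python 3 (`urllib.parse`); `extract_links` is a hypothetical helper, not part of BeautifulSoup, and the HTML and URLs are invented for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True matches only <a> tags that actually carry an href.
    for a in soup.find_all("a", href=True):
        # Resolve relative links against the URL of the page they came from.
        links.append(urljoin(base_url, a["href"]))
    return links


# Hypothetical page fragment and base URL.
html = '<a href="/about">About</a> <a href="http://example.org/x">X</a> <a name="top">no href</a>'
links = extract_links(html, "http://example.com/index.html")
print(links)
```

A real crawler would also need to fetch each link, remember which URLs it has already visited, and be polite to the server; none of that is shown here.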
