Extracting Information from Web Pages

  • This tutorial gives a quick overview of how to read a web page and extract relevant information from it.
  • Note that scraping certain information from web pages may be illegal; read the terms and conditions of the respective website before doing so.
  • This post is purely educational. It does not promote any kind of illegal web scraping and is not responsible if readers engage in such activity.
  • This tutorial uses the BeautifulSoup API.
  • If it is not installed, run pip install beautifulsoup4. The code below also uses the requests, lxml, and pandas packages.
In [1]:
import requests
from bs4 import BeautifulSoup
In [13]:
# URL of the Amazon product page that contains the product reviews
url = '''
In [18]:
review_html = requests.get(url).text
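The one-line fetch above assumes the request always succeeds. A slightly more defensive version could look like the sketch below. The `fetch_html` helper and its `get` parameter are hypothetical names introduced here for illustration (the injectable `get` exists only so the function can be tried without network access), and the 10-second timeout is an arbitrary assumption.

```python
def fetch_html(url, get=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a hypothetical hook so that a
    stand-in can be passed in when no network is available.
    """
    if get is None:
        import requests  # imported lazily so a stand-in `get` does not need it
        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text
```

With the notebook's variables this would be called as `review_html = fetch_html(url)`.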
In [23]:
soup = BeautifulSoup(review_html, 'lxml')
In [27]:
# Review titles are links whose class attribute is exactly this string
review_titles = soup.find_all('a', {'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'})
In [28]:
len( review_titles )
Out[28]:
10
In [29]:
for review_title in review_titles:
  print( review_title.text )
Five Star when you add Pros, Cons, and Price together
Oneplus two - got it via invite. Happy till now.
Good phone overall
Good specs for the money
It's nothing compared to it's predecessor .
The phone is nice and works good but
It's good but not very good.
OPO>>>>OPT
Great phone. Not perfect
THE BEST BUDGET PHONE OF 2016
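The class filter used for the titles is compared against the whole class attribute, so it depends on the page listing its classes in exactly that order. The sketch below uses made-up markup (not taken from Amazon) to compare that filter with matching on a single class name. It uses the built-in `html.parser` so that lxml is not required, and the `demo_` names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the second link lists its classes in a different order.
demo_html = """
<a class="a-size-base review-title a-text-bold">Good phone overall</a>
<a class="review-title a-size-base a-text-bold">Great phone. Not perfect</a>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# A string containing spaces is compared against the whole class attribute.
exact = demo_soup.find_all("a", {"class": "a-size-base review-title a-text-bold"})

# A single class name matches any tag that carries that class.
loose = demo_soup.find_all("a", class_="review-title")

titles = [tag.get_text(strip=True) for tag in loose]
print(len(exact), len(loose))
print(titles)
```

Matching on one distinctive class such as `review-title` is usually less brittle when a site reorders or adds styling classes.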
In [30]:
# Review bodies are spans whose class attribute is exactly this string
review_texts = soup.find_all('span', {'class': 'a-size-base review-text'})
len( review_texts )
Out[30]:
10
In [34]:
import re
# Replace every non-letter character in the review text with a space
r_texts = list( map( lambda x: re.sub("[^a-zA-Z]", " ", x.text ), review_texts ) )
In [41]:
r_texts[0:1]
Out[41]:
['My list of pros and cons  I think the cons are negated by the very low price  so five stars from me after all things considered     Very powerful  among the fastest out there         screen is just right for me     Oxygen  Android OS is even more customizable than vanilla Android  which allows for a few extra conveniences you don t get elsewhere     good battery life  I like the reversible usb c charger  but this does mean you need to be more aware of your battery life and usb c cables since you can t borrow other peoples chargers  but the battery life makes up for that     some people hate that there s no NFC  but NFC is still too immature inconvenient to be using in the wild  imo  Let s let the payment systems duke it out another year or two and then I ll decide if NFC is a must     a little bulky   get a case because this would fall with force     can run hot  but that comes with power comes heat ']
In [36]:
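# Note: the replacement here is "" (not " "), so spaces are removed too and the title words run together
r_titles = list( map( lambda x: re.sub("[^a-zA-Z]", "", x.text ), review_titles ) )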
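The two cleaning steps differ only in the replacement string passed to `re.sub`. A small sketch on one of the titles printed earlier shows the effect. The final whitespace-collapsing step is an optional addition and is not part of the original notebook.

```python
import re

title = "Oneplus two - got it via invite. Happy till now."

spaced = re.sub("[^a-zA-Z]", " ", title)  # non-letters become spaces
joined = re.sub("[^a-zA-Z]", "", title)   # non-letters, including spaces, are dropped

# Optional extra step, not in the original notebook: collapse runs of spaces.
tidy = re.sub(" +", " ", spaced).strip()

print(joined)
print(tidy)
```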
In [37]:
import pandas as pd
review_df = pd.DataFrame( { "title": r_titles, "text": r_texts } )
In [39]:
review_df[0:5]
Out[39]:
   text                                               title
0  My list of pros and cons I think the cons are...   FiveStarwhenyouaddProsConsandPricetogether
1  Almost yrs with smartphones had Motorola D...      OneplustwogotitviainviteHappytillnow
2  This is a great phone overall but its missing ...  Goodphoneoverall
3  Quiet nice phone feels good in hands top spe...    Goodspecsforthemoney
4  I have had my Oneplus One for over two years n...  Itsnothingcomparedtoitspredecessor
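Depending on the pandas version, columns built from a dict may come out sorted by name (as in the output above, where text precedes title) or in insertion order. Passing `columns` makes the order explicit. The sketch below uses two shortened stand-in rows rather than the scraped data, and the `demo_` names are illustrative only.

```python
import pandas as pd

# Shortened stand-ins for r_titles and r_texts.
demo_titles = ["Good phone overall", "Good specs for the money"]
demo_texts = ["This is a great phone overall", "Quiet nice phone"]

demo_df = pd.DataFrame(
    {"title": demo_titles, "text": demo_texts},
    columns=["title", "text"],  # explicit column order
)
print(demo_df.head())
```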