Posted on Friday, 15th January 2010 by Michael
Recently I was reading an article about using Ruby on Rails to create a web scraper. As I sat there learning Ruby, I got really excited and wanted to jump straight to building one. As any programmer knows, though, that is not possible until you have a basic understanding of the language. So to solve my dilemma, I set out to write one as a shell script.
I was not sure what I wanted to scrape, so after a few hours of thinking I decided to make a calculator using Google’s calculator feature. A user will be able to do basic arithmetic on any two numbers and get the answer via Google. If you want to try this manually, go to Google, type 1+2, and hit enter. It is that simple, or at least close to it.
To start off, I ran several manual tests to see what the URL should look like depending on the operator I used. I found that all operators behaved as expected except addition: the “+” gets converted to “%2B”. This posed a small issue, but nothing that a little extra scripting could not resolve.
To get around this, and to make the program interactive for the user, I did this:
#!/bin/bash
#######################################
## Simple Google Query and web scraper
## Written by Michael LaSalvia
## http://www.digitaloffensive.com
## Created: 1/15/09
#######################################
##Variables
tFile=gmath.txt
oFile=rmath.txt
# Remove any download left over from a previous run; -f keeps rm quiet if the file is missing
rm -f "$tFile"
echo "###################################"
echo "# Press (a) for addition #"
echo "# Press (s) for subtraction #"
echo "# Press (m) for multiplication #"
echo "# Press (d) for division #"
echo "###################################"
echo -e "What do you want to do:"
read Mmath
case "$Mmath" in
"a") dMath="%2B" && echo "You chose addition";;
"s") dMath="-" && echo "You chose subtraction";;
"m") dMath="*" && echo "You chose multiplication";;
"d") dMath="/" && echo "You chose division";;
*) echo "Unknown option: $Mmath" && exit 1;;
esac
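The case statement hard-codes “%2B” for the one operator that caused trouble. As a rough sketch of a more general approach, the helper below percent-encodes any character that is not a letter, a digit, or one of “-”, “_”, “.”, “~”. The function name urlencode is my own invention for illustration and is not part of the original script; it assumes bash, since it relies on printf -v and substring expansion.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): percent-encode a string
# so that characters such as "+" or "/" survive inside a query string.
urlencode() {
    local LC_ALL=C                      # work byte by byte
    local input="$1" output="" char i
    for (( i = 0; i < ${#input}; i++ )); do
        char="${input:i:1}"
        case "$char" in
            [a-zA-Z0-9.~_-]) output+="$char" ;;        # unreserved: keep as-is
            *) printf -v char '%%%02X' "'$char"        # everything else: %XX
               output+="$char" ;;
        esac
    done
    printf '%s\n' "$output"
}

urlencode "1+2"    # prints 1%2B2
urlencode "6/3"    # prints 6%2F3
```

With a helper like this, the whole expression could be encoded in one go instead of mapping each operator by hand.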
Now that we know which operation the user wants, we need to find out which two numbers to use. We prompt for them like this:
echo -e "Enter first number:"
read nNum1
echo -e "Enter Second number:"
read nNum2
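The script trusts whatever the user types. A minimal sketch of a sanity check, assuming bash and assuming we only want plain integers or decimals, could look like the following. The function name is_number is hypothetical and does not appear in the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeed only if the
# argument looks like an integer or decimal, with an optional leading minus.
is_number() {
    local re='^-?[0-9]+(\.[0-9]+)?$'
    [[ "$1" =~ $re ]]
}

# Example: check a value before it is placed into the URL.
value="12.5"
if is_number "$value"; then
    echo "$value looks like a number"
else
    echo "$value is not a number"
fi
```

A check like this could be run on nNum1 and nNum2 right after each read, before anything is sent to Google.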
Now that we have all the needed variables, the fun part begins. We need to construct the URL, but since Google does not allow automated requests, we also need to make our script look like a real browser. (WARNING: This may break Google’s AUP.) To do this we used the following code:
wget --header="User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)" "http://www.google.com/search?hl=en&safe=off&q=$nNum1$dMath$nNum2" -q -O $tFile
The user agent we chose to masquerade as was Internet Explorer 8. You will also notice that we wrote the output to a known file name. This makes the rest of the process much simpler to code.
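When debugging, it can help to see the URL before fetching anything. The sketch below only assembles and prints it, without contacting Google. The function name build_query_url is made up for illustration; the URL pattern is the one used in the wget command.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): assemble the search URL
# from the first number, the operator token, and the second number.
build_query_url() {
    local num1="$1" op="$2" num2="$3"
    printf 'http://www.google.com/search?hl=en&safe=off&q=%s%s%s\n' "$num1" "$op" "$num2"
}

build_query_url 1 %2B 2
# prints http://www.google.com/search?hl=en&safe=off&q=1%2B2
```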
Now that we have the full page downloaded, we need to find just the information we want. To do this I first manually reviewed the source code of the page and noticed that no matter what math problem I entered, the answer was always wrapped in the same markup. For example:
Code: style="font-size: 138%;"><b>999 + 998 = 1<font size="-2"> </font>997</b>
So to remove everything except what I wanted I used the following code:
# Keep only lines that contain the marker, then cut everything from </b> onward
awk -F '138%"><b>' 'NF > 1 {print $2}' "$tFile" | awk -F '</b>' '{print $1}' > "$oFile"
echo "Your answer is:" && cat "$oFile"
You will notice that I did not clean the output fully. When it was echoed to the terminal, the HTML that was left did not show, so instead of sitting there using “sed” to clean it up completely, I left it as is.
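For anyone who does want the “sed” cleanup, here is a minimal sketch. It simply deletes anything that looks like an HTML tag, which is a blunt approach and only an assumption about what is good enough for this output. The function name strip_tags is hypothetical, and the sample string is the one from the page source shown earlier.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): remove anything between
# "<" and ">" from standard input, leaving the surrounding text untouched.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

printf '%s\n' '999 + 998 = 1<font size="-2"> </font>997' | strip_tags
# prints 999 + 998 = 1 997
```

In the script itself this could sit between the last awk and the redirect to the output file. Note that the space Google places between the thousands groups is left in.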
I hope you have learned something from this. If you have any questions or concerns please feel free to contact me.
Here is a screen shot: