Friday, August 16, 2013

Web Parsing and Custom Alerting

Ok, so this post is about an amazing project in which I ended up solving a real-world problem. Inside CMU we have the CBDR website, which is used for posting research studies happening on the CMU campus that students can participate in, and most of them are paid studies. And since everyone at CMU, especially me, is almost broke because of the really high education fees (you can never be too greedy, CMU), this website is in high demand during summers.

Now, there are a few bad things about this website. One cannot in any way register for email updates when a study is added, and most of the studies need only a limited number of participants, while willing participants are in abundance, so regular FCFS (first come, first served) applies. And well, even though I was broke, I didn't really have the time to visit the website 20 times a day. So I ended up writing a small script which checks if a new study has been added on their website and then sends me an email.

So this script was required to do a few fundamental tasks:

1.) Parse the website automatically at regular intervals.
2.) Check the current list of studies against a database of studies that I have already been notified about, and notify me only about the newly added ones.
3.) Send me an email when it identifies that a new study has been added.
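
To make the three tasks concrete, here is a minimal sketch of how they can fit together in Node.js. It is not the code from my repository, just an illustration: the names findNewStudies, checkOnce and startPolling are made up for this post, and I am assuming every study can be reduced to an object with a unique `id` field. The fetching and notifying parts are passed in as functions, so this block only covers the "what is new?" logic and the periodic loop.

```javascript
// Returns the studies whose id is not yet in `seenIds` and marks them as seen.
// Assumption: each study object carries a unique `id` (hypothetical field name).
function findNewStudies(currentStudies, seenIds) {
  const fresh = [];
  for (const study of currentStudies) {
    if (!seenIds.has(study.id)) {
      seenIds.add(study.id);
      fresh.push(study);
    }
  }
  return fresh;
}

// One polling cycle: fetch the current list, diff it, notify only if needed.
// `fetchFn` and `notifyFn` are placeholders for the parsing and email steps.
async function checkOnce(fetchFn, notifyFn, seenIds) {
  const current = await fetchFn();
  const fresh = findNewStudies(current, seenIds);
  if (fresh.length > 0) {
    await notifyFn(fresh);
  }
  return fresh;
}

// Runs checkOnce every `intervalMs` milliseconds.
// Returns the timer so the caller can stop it with clearInterval.
function startPolling(fetchFn, notifyFn, seenIds, intervalMs) {
  return setInterval(() => {
    checkOnce(fetchFn, notifyFn, seenIds).catch((err) => console.error(err));
  }, intervalMs);
}
```

Here the set of seen ids lives in memory, which is fine for a sketch, but it is lost on every restart. That is exactly why a database is useful for the second task.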

For the first part I ended up using the Zombie.js module in Node.js: classic headless-browser-based web parsing and automation. Injecting jQuery-based code is a bit of a pain with the framework, but one can always use the good old DOM objects.
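
As a rough sketch of the "good old DOM objects" approach: the helper below pulls titles and links out of a page using only querySelectorAll, and a second function loads the page with Zombie.js first. The function names, the CSS selector and the shape of the returned objects are all assumptions for illustration. The real markup of the CBDR page is not shown here, so the URL and selector are left as parameters.

```javascript
// Plain DOM extraction: works on any object exposing querySelectorAll,
// for example the `document` of a page loaded in a headless browser.
// Assumption: each matched element is a link whose text is the study title.
function extractStudies(document, selector) {
  const nodes = Array.from(document.querySelectorAll(selector));
  return nodes.map((node) => ({
    title: node.textContent.trim(),
    link: node.getAttribute('href'),
  }));
}

// Loads `url` with Zombie.js and passes the extracted studies to `callback`.
// The require is deferred so extractStudies can be used without Zombie installed.
function fetchStudies(url, selector, callback) {
  const Browser = require('zombie'); // assumes `npm install zombie`
  const browser = new Browser();
  browser.visit(url, (err) => {
    if (err) {
      callback(err);
      return;
    }
    callback(null, extractStudies(browser.document, selector));
  });
}
```

Keeping the extraction separate from the browser makes it easy to try out against a saved copy of the page before pointing it at the live site.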

For the database part, MongoDB was really easy to set up and get working with JavaScript, as it can directly store and retrieve JavaScript objects, so there is no conversion from/to strings.
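
To illustrate the "no conversion" point, here is a small sketch using the official MongoDB Node.js driver as it looks today (the API in 2013 was different). The database name `parsing_alert`, the collection name `studies`, the `id` field and both function names are hypothetical and chosen only for this example.

```javascript
// Stores every study whose `id` is not already in the collection and
// returns the ones that were newly inserted. Plain JavaScript objects go
// in and come out as they are, with no string conversion.
// `collection` only needs findOne and insertOne, like a MongoDB driver collection.
async function recordNewStudies(collection, studies) {
  const fresh = [];
  for (const study of studies) {
    const existing = await collection.findOne({ id: study.id });
    if (existing === null) {
      await collection.insertOne(study);
      fresh.push(study);
    }
  }
  return fresh;
}

// Connects to MongoDB and returns the client plus the studies collection.
// The require is deferred so recordNewStudies can be used without the driver.
async function openStudies(uri) {
  const { MongoClient } = require('mongodb'); // assumes `npm install mongodb`
  const client = new MongoClient(uri);
  await client.connect();
  const collection = client.db('parsing_alert').collection('studies');
  return { client, collection };
}
```

Whatever recordNewStudies returns is the list worth emailing about. Everything else has already been seen.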

And finally, for sending emails, Randy at the CREATE Lab pointed me towards the Linux sendmail command, which needs mail utils to be set up, but once that is done one can easily use a shell script to send emails.
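
The same thing can also be done straight from Node.js instead of a shell script. The sketch below is one way to do it, assuming a working `sendmail` binary on the PATH. The function names and the message layout are my own choices for illustration, and `sendmail -t` tells sendmail to read the recipient from the message headers.

```javascript
const { spawn } = require('child_process');

// Builds a minimal plain-text message: headers, a blank line, then one
// line per study. Assumption: each study object has a `title` field.
function buildMessage(to, subject, studies) {
  const lines = studies.map((study) => `- ${study.title}`);
  return [`To: ${to}`, `Subject: ${subject}`, '', ...lines, ''].join('\n');
}

// Pipes the message to `sendmail -t`. Calls `callback` exactly once,
// with an error if sendmail is missing or exits with a non-zero code.
function sendAlert(to, subject, studies, callback) {
  let finished = false;
  const done = (err) => {
    if (!finished) {
      finished = true;
      callback(err);
    }
  };
  const child = spawn('sendmail', ['-t']);
  child.on('error', done); // for example, sendmail is not installed
  child.stdin.on('error', done); // the pipe closed before the message was written
  child.on('close', (code) => {
    done(code === 0 ? null : new Error(`sendmail exited with code ${code}`));
  });
  child.stdin.end(buildMessage(to, subject, studies));
}
```

A call such as sendAlert('me@example.com', 'New CBDR studies', fresh, console.log) would then send one email listing all the new studies (the address is a placeholder).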

For deployment, instead of using my crappy laptop, I ended up running the script on a t1.micro instance on Amazon EC2. Yes, I had to install Node.js, MongoDB and mail utils on the EC2 instance, but that's fairly straightforward.

For those of you interested in checking out the code, it is hosted on my GitHub account: https://github.com/tanejamohit/Parsing-Alert

Thankfully, I am now the first person to register for any study on CBDR. Sadly, as the summer comes to an end, I doubt I will have a lot of time for any of the studies.