Friday, August 16, 2013

Web Parsing and Custom Alerting

Ok, so this post is about an amazing project in which I ended up solving a real-world problem. Inside CMU we have the CBDR website, which is used for posting research studies happening on the CMU campus that students can participate in, and most of them are paid studies. And since everyone at CMU, especially me, is almost broke because of the really high tuition fees (you can never be too greedy, CMU), this website is in high demand during summers.

Now, there are a few bad things about this website: one cannot in any way register for email updates when a study is added, and most of the studies need a finite number of participants, who are in abundance, so regular FCFS (first come, first served) applies. And even though I was broke, I didn't really have the time to visit the website 20 times a day. So I ended up writing a small script which checks if a new study has been added on their website and, if so, sends me an email.

So this script was required to do a few fundamental tasks:

1.) Parse the website automatically at regular intervals.
2.) Check the current list of studies against a database of studies that I have already been notified about, and notify me only about the newly added studies.
3.) Send me an email when it identifies that a new study has been added.
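
The three tasks above fit together roughly as follows. This is only a minimal sketch under stated assumptions, not the code from the repository: `fetchStudies`, `notify`, and the `id`/`title` fields are hypothetical stand-ins for the scraper, the email step, and the study records, and an in-memory Set stands in for the database.

```javascript
// Hypothetical sketch of one polling cycle of the alert script.

// Task 2: keep only the studies whose id has not been reported before.
// `seenIds` is a Set standing in for the database of known studies.
function findNewStudies(currentStudies, seenIds) {
  return currentStudies.filter((study) => !seenIds.has(study.id));
}

// One full cycle: fetch the list, diff it, notify, remember.
// fetchStudies() and notify() are hypothetical callbacks that return promises.
async function checkOnce(fetchStudies, seenIds, notify) {
  const currentStudies = await fetchStudies();
  const newStudies = findNewStudies(currentStudies, seenIds);
  if (newStudies.length > 0) {
    await notify(newStudies);
    // Record ids only after the notification succeeded, so a failed
    // email is retried on the next cycle.
    newStudies.forEach((study) => seenIds.add(study.id));
  }
  return newStudies;
}

// Task 1: run a check every `intervalMs` milliseconds. Returns the timer
// handle so the caller can stop polling with clearInterval().
function startPolling(fetchStudies, seenIds, notify, intervalMs) {
  return setInterval(() => {
    checkOnce(fetchStudies, seenIds, notify).catch((err) => {
      console.error('check failed:', err.message);
    });
  }, intervalMs);
}
```

Keeping the diff in its own small function makes it easy to test without touching the network or the mail setup.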

For the first part I ended up using the Zombie.js module in Node.js: classic headless-browser-based web parsing and automation. Injecting jQuery-based code is a bit of a pain with the framework, but one can always use the good old DOM objects.

For the database part, MongoDB was really easy to set up and get working with JavaScript, as it can directly store JavaScript objects and retrieve them, so there is no conversion from/to strings.

And finally, for sending emails, Randy at the CREATE lab pointed me towards the Linux sendmail command, which needs mail utils to be set up, but once that is done one can easily use a shell script to send emails.
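
As a rough illustration of the email step, here is a minimal sketch that pipes a message into sendmail from Node.js instead of from a shell script. It assumes sendmail is installed and on the PATH; `buildMessage`, `sendAlert`, and the `title` field are hypothetical names, not taken from the repository.

```javascript
const { spawn } = require('child_process');

// Build a minimal plain-text message: headers, a blank line, then the body.
function buildMessage(to, subject, studies) {
  const body = studies.map((study) => `- ${study.title}`).join('\n');
  return `To: ${to}\nSubject: ${subject}\n\n${body}\n`;
}

// Pipe the message into `sendmail -t`, which reads the recipients from
// the message headers. Assumes sendmail is installed and configured.
function sendAlert(to, subject, studies) {
  return new Promise((resolve, reject) => {
    const child = spawn('sendmail', ['-t']);
    child.on('error', reject); // e.g. sendmail binary not found
    child.stdin.on('error', reject); // e.g. pipe closed early
    child.on('close', (code) => {
      if (code === 0) {
        resolve();
      } else {
        reject(new Error(`sendmail exited with code ${code}`));
      }
    });
    child.stdin.end(buildMessage(to, subject, studies));
  });
}
```

A call such as `sendAlert('me@example.com', 'New CBDR study', newStudies)` would then send one email listing every newly found study.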

For deployment, instead of using my crappy laptop I ended up deploying the script on a t1.micro instance on Amazon EC2. Yes, I had to install Node.js, MongoDB and mail utils on the EC2 instance, but that's fairly straightforward.

For those of you interested in checking out the code, it is hosted on my GitHub account: https://github.com/tanejamohit/Parsing-Alert

Thankfully, now I am the first person to register for any study on CBDR. Sadly, as the summer comes to an end, I doubt I will have a lot of time for any of the studies.

Tuesday, July 2, 2013

Practical Issues while Coding

Between the CREATE lab internship, GSoC work, and the Startup Engineering course, I have been doing quite a lot of coding these days. I am learning some good lessons from all this coding, and thought it might be a good idea to jot them down for myself and share them with anyone who is interested.

The most important thing I realized is that the amount of time spent reading and understanding code is much, much more than the amount of time spent writing it. So it makes sense to write code in such a way that the time spent reading and understanding it is minimized. Also, as a rule of thumb, 90% of the optimization happens in only 10% of the code, and sometimes it is better to write more readable code than optimized code.


  • Explaining what the code in a particular file does, along with the licensing info, at the beginning of the file is actually a pretty good idea. And so is describing what a function does at the beginning of the function.
  • It's better to use elaborate, self-explanatory names for variables and functions rather than concise ones. It's even better to use function names which don't include technical lingo and can be understood by a user who doesn't know much about the library you are using. chooseVisual() is a much better function name than createEGLConfiguration().
  • Instead of aiming to write the most optimized code from the beginning, it is better to write the most readable code and then try to optimize it. Also, using temporary variables is not really that bad an idea if they make your code easier to understand.
  • It makes a lot of sense to use the same convention for naming functions and variables throughout the code base (especially when multiple engineers are working on it). Also, if you use the same indentation style and commenting style, your code might start looking good too. As of now that seems to be the toughest thing to achieve: good-looking code. But I am working towards it, and hopefully I will get the hang of it.
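
To make the points about naming and temporary variables concrete, here is a small made-up comparison. Both functions compute the same thing; the function names and the `payment`/`signedUp` fields are invented purely for illustration.

```javascript
// Terse version: correct, but the reader has to decode every name.
function f(a, n) {
  return a.filter((x) => x.p > 0 && x.s < n).length;
}

// Readable version: descriptive names and temporary variables
// spell out each step of the same computation.
function countOpenPaidStudies(studies, maxParticipants) {
  const paidStudies = studies.filter((study) => study.payment > 0);
  const openPaidStudies = paidStudies.filter(
    (study) => study.signedUp < maxParticipants
  );
  return openPaidStudies.length;
}
```

The second version is longer, but someone reading it for the first time can tell what it does without looking up how it is called.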