Yesterday, while talking to my boss (BTW, he is the only one I still consider my boss, apart from my wife, as he is the only one I listened to at work ;)), he came up with an interesting idea. People have not changed in centuries. They have always been drifting apart. From whole villages living in one cave to rising divorce rates, we have come a long way from communal living. People are becoming more and more individualistic.
Now, back to the main topic: do social networks work? For the past decade, we have seen numerous social networks come and then fizzle out: MySpace, Friendster, Orkut and now Facebook. All have seen the same story. People get excited when they are launched. They call all their friends to join. They form groups and communities. They chat all night long. But then what? They move on to the next big thing... Twitter.
For fear of being called a hypocrite, I would like to admit that I too have been a part of all these bandwagons, be it Orkut, MySpace, blogs, Facebook or Twitter. You can find my profiles on all of them. But then I get bored like the rest of you and move on.
Now that Google is coming onto the scene with a stronger product (than Orkut) in Google+, I am not sure whether it is too late to make any dent in the SNSs' fortunes.
There are three things, and only three things, that really sell in this world: sex, knowledge and food. That is why you see porn sites being so popular and Google making tonnes of money. Online economies built around these three will always remain popular and viable. The rest, as they say, will become history.
BTW, with the much-hyped IPOs of all the SNSs, I seriously feel there is another dot-com burst brewing. So cash in your stocks and come back to the real world!!
Televisions have come a long way from their monochromatic ancestors of the 1930s to become the modern age's ultimate home entertainers. Consumers' demands and expectations have evolved tremendously, putting a strain on the traditional mediums of entertainment, i.e. audio and video. Moreover, the advent of the internet has created another dimension in home entertainment – let's call it "On Demand Entertainment" (ODE). True ODE means consumers will be able to watch, listen, play or read whatever they want, whenever they want.
Television consumers now have the option of going on the Internet and searching for ODE, including (but not restricted to) games, news, video and audio. Internet giants like Amazon, Hulu, Netflix and Youtube have started cutting into the Television industry's profits and have become a major force to reckon with in the home entertainment segment.
In a parallel development, consumers now want seamless integration and convergence between the new and the old media of entertainment. This expectation has led to innovation in the Television industry in the form of IPTV, satellite TV and internet-enabled TVs. These technologies strive to provide ODE and try to fulfill the demand for a unified entertainment device. However, true ODE is still a distant dream because of the strain it puts on the storage and computation power of the back-end data centers.
Cloud Computing, or internet-based computing, which provides on-demand storage and compute power billed on a pay-per-use basis, is a perfect strategic fit for the puzzle of ODE. It can solve the problem of the huge compute and storage requirements of true ODE.
This post describes how Cloud Computing can be used to deliver true On Demand Entertainment, using some specific use-cases of:
- On Demand Gaming
- Ubiquitous Media Playback
- Online Personal Media Store
Entertainment today includes much more than the traditional media of books, Television and Radio. As we discussed earlier, ODE has become a major expectation nowadays. People are also looking for a single device which can take care of all their entertainment needs. Televisions are facing serious competition in this race for a unified entertainment device from hand-held gadgets and the Internet. Televisions need the help of modern technology to break to the forefront of this race, and Cloud Computing is one such technology that can tremendously help the Television industry.
According to Wikipedia:
Cloud Computing is Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand, like the electricity grid.
This on-demand pool of servers, generally called the cloud, can provide the huge hardware resources needed for some interesting use cases in the Television industry:
On Demand Gaming:
Games are very compute-intensive applications, so much so that dedicated platforms are built for serious gamers. Televisions too have built-in games, but not of the class of "core games". This is because core games require huge compute power that Televisions can't provide.
Now that Televisions have become internet-enabled, we can use the compute power of the Cloud to do the computation at the back-end. We can push the gaming console onto the cloud: all the user interactions are sent to the Cloud, the cloud computes the results based on the game rules and sends them back for the Television to display.
This can be a disruptive product in the gaming industry, as it will give rise to true multi-player games, where players join and leave games as and when they will. Anyone with an internet-enabled Television set can join the game, as the dependency on expensive gaming consoles will end.
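To make this loop concrete, here is a toy sketch (all class and event names are invented for illustration; a real service would stream rendered frames over a low-latency protocol rather than return state dictionaries):

```python
# Toy sketch of cloud-side game computation (hypothetical names).
# The TV only forwards inputs and displays whatever the cloud returns.

class CloudGameSession:
    """Holds the authoritative game state on the cloud side."""

    def __init__(self):
        self.state = {"x": 0, "y": 0, "score": 0}

    def handle_input(self, event):
        # Apply the game rules on the server; the TV computes nothing.
        if event == "LEFT":
            self.state["x"] -= 1
        elif event == "RIGHT":
            self.state["x"] += 1
        elif event == "FIRE":
            self.state["score"] += 10
        # Return the result for the Television to display.
        return dict(self.state)

def tv_client(session, events):
    """The thin TV client: push each user interaction, display the reply."""
    frames = []
    for event in events:
        frames.append(session.handle_input(event))
    return frames

frames = tv_client(CloudGameSession(), ["RIGHT", "RIGHT", "FIRE"])
print(frames[-1])  # the last state the TV would render
```

The key design point is that the session object lives on the cloud, so any number of internet-enabled TVs could attach to the same session.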
Ubiquitous Media Playback:
Another interesting area of application of Cloud Computing in entertainment industry is of Ubiquitous Media Playback.
Let's take an example:
A user is watching a movie on his Television when he suddenly gets an urgent call to go somewhere. The movie is at a very interesting point and he doesn't want to miss it. He can simply activate the "Ubiquitous Media Playback" feature with the push of a button, and the movie starts playing on his hand-held gadget.
For this to become reality, all that is needed is that both his hand-held gadget and his Television have internet access. The Television starts uploading the movie (from the point where it was stopped) to the cloud; the cloud converts the movie to a format fit for playback on his hand-held gadget and streams it to the gadget. The gadget resumes playing the movie from the same point.
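This handoff can be sketched with the cloud modeled as a shared session store (the function names, the session id and the device profile below are purely illustrative, and the transcoding step is only represented by its metadata):

```python
# Hypothetical sketch: the TV checkpoints the playback position to the
# cloud, the cloud "transcodes" for the target device, and the gadget
# resumes the stream from the saved position.

cloud_sessions = {}  # session id -> playback checkpoint

def tv_pause_and_upload(session_id, movie, position_s):
    # The TV pushes the movie reference and stop position to the cloud.
    cloud_sessions[session_id] = {"movie": movie, "position_s": position_s}

def cloud_prepare_stream(session_id, device_profile):
    # The cloud converts the movie to fit the hand-held gadget and
    # prepares a stream starting at the saved position.
    checkpoint = cloud_sessions[session_id]
    return {
        "movie": checkpoint["movie"],
        "start_at_s": checkpoint["position_s"],
        "format": device_profile["format"],
        "resolution": device_profile["resolution"],
    }

tv_pause_and_upload("sess-1", "movie.mkv", position_s=3125)
stream = cloud_prepare_stream("sess-1", {"format": "mp4", "resolution": "480p"})
print(stream)
```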
Thus, a simple application of the power of cloud computing can enhance the viewer experience manifold.
Online Personal Media Store:
People now try to keep all their media in digital format. They store it on hard disks, CDs, DVDs and BDs. But it still forms a bulky collection, with the fear of losing their data always lingering at the back of their minds. What if they had a Personal Media Store on the Cloud? What if they could use their Televisions to store all they want on the cloud?
This is possible. Televisions connected to the Internet can be used to dump all the personal media onto the cloud. The Cloud can then sort and organise the media under various categories and build a searchable index of all the user's content. This data can then be customized according to the display and other capabilities of each of the user's devices. The user can then access this library of data using any device he wants.
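For instance, the cloud-side sort-and-index step could look roughly like this (the file names, extension table and substring search are made up for illustration; a real service would use richer metadata):

```python
# Illustrative sketch: categorize uploaded media by extension and
# build a simple searchable index of the user's content.

CATEGORIES = {
    ".mp3": "audio", ".flac": "audio",
    ".mp4": "video", ".mkv": "video",
    ".jpg": "photos", ".png": "photos",
}

def build_index(filenames):
    """Group uploaded files under category headings."""
    index = {}
    for name in filenames:
        ext = name[name.rfind("."):].lower()
        category = CATEGORIES.get(ext, "other")
        index.setdefault(category, []).append(name)
    return index

def search(index, term):
    # Naive substring search across all categories.
    return [f for files in index.values() for f in files if term in f.lower()]

index = build_index(["Holiday.jpg", "song.mp3", "trip.mkv", "notes.txt"])
print(search(index, "trip"))
```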
This can be a pay per use service which can be very easily commercialized.
These are just a few use cases which illustrate the power of Cloud Computing in home entertainment. If we take a sweeping look at the whole home entertainment landscape, there would be thousands of applications of this tremendously powerful technology.
We know that technology is pushing the edges day in and day out. Televisions need to adopt these new technologies rapidly in order to provide a better experience to consumers. Cloud Computing is one such technology, with the power to revolutionize the way entertainment is served. As this post suggests, applying it to different use cases can make the Television a true ODE provider and a unified entertainment device.
The TV, using a camera attached to it, first recognizes the viewer using "Face Recognition" techniques. It then relays statistics about the time the viewer spends watching a particular program, along with his facial expressions (happy, sad, interested, disinterested, etc.) while watching it, to a central hub.
The central hub (which may be on the cloud or in a self managed datacenter) then collates this data with the ontology of the program. [This ontology is built by data mining and supervised machine learning using a distributed cluster of computers.] Thus, over time, the central hub builds a profile of the viewer according to the viewer’s viewing pattern.
Meanwhile, the TV also monitors the viewer's facial expressions, looking for signs of boredom (drowsy eyes, long frowns, vacant stares, etc.). As soon as it recognizes a "disinterested pattern", it calls a web-service running on the central hub.
This web-service searches the ontology of programs currently on air and sorts it according to the viewer’s profile. Then it predicts a list of programs which will most likely interest the viewer and sends it back to the TV.
The TV then suggests to the viewer the programs (s)he can watch if (s)he is not interested in the current one. If the viewer changes the program, the TV sends information about the program switched to back to the central hub. The central hub adds this information to the viewer's profile.
In case of conflicts, i.e. more than one viewer, the central hub decides on the predicted list of interesting programs using a priority list of viewers, based on whose choice prevailed the last time such a conflict arose.
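The prediction step of this web-service amounts to ranking the on-air programs against the viewer's learned profile. A minimal sketch (the genre weights and program list are invented for illustration; the real profile would come from the data-mined ontology):

```python
# Hypothetical sketch of the central hub's prediction web-service:
# rank currently airing programs by how well their genres match the
# viewer's profile built from past viewing patterns.

def predict_programs(profile, on_air, top_n=3):
    def score(program):
        # Sum the viewer's learned interest in each of the program's genres.
        return sum(profile.get(genre, 0.0) for genre in program["genres"])
    ranked = sorted(on_air, key=score, reverse=True)
    return [p["title"] for p in ranked[:top_n]]

viewer_profile = {"comedy": 0.9, "sports": 0.7, "news": 0.1}
on_air = [
    {"title": "Evening News", "genres": ["news"]},
    {"title": "Stand-up Night", "genres": ["comedy"]},
    {"title": "Football Live", "genres": ["sports"]},
]
print(predict_programs(viewer_profile, on_air, top_n=2))
```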
This figure describes the architecture in a simple yet concise manner:
Note: If anyone wants to use this idea or wants further clarification then please contact me @ paritosh (dot) gunjan (at) gmail (dot) com.
I started by installing Hadoop on a single node, i.e. my machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say that those were simple assumptions and that anyone familiar with Linux is expected to understand them. So I decided to write installation instructions for dummies.
Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.
1. Linux
The first and foremost requirement is to get a PC with Linux installed on it. I used a machine with Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java-based and will work with any OS that can run a JVM (which in turn implies pretty much all modern OSes).
2. Sun Java6
Install the Sun Java6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
3. Create a new user "hadoop"
Create a new user hadoop (though not required, this is recommended in order to separate the Hadoop installation from other software applications and user accounts running on the same machine by having a dedicated user for Hadoop).
Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop
4. Configure SSH
Install OpenSSHServer on your system:
$ sudo apt-get install openssh-server
Then generate an SSH key for the hadoop user. As the hadoop user, do the following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
(fingerprint and randomart image omitted)
Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC
Now that the prerequisites are complete, let's go ahead with the Hadoop installation.
Install Hadoop from Cloudera
1. Add repository
Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents, taking care to replace DISTRO with the name of your distribution (find out by running lsb_release -c):

deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib

2. Add repository key (optional)
Add the Cloudera Public GPG Key to your repository by executing the following command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
This allows you to verify that you are downloading genuine packages.
Note: You may need to install curl:
$ sudo apt-get install curl
3. Update APT package index.
$ sudo apt-get update
4. Find and install packages.
You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop
Setting up a Hadoop Cluster
Here we will try to setup a Hadoop Cluster on a single node.
1. Set up the Hadoop home directory
Copy the hadoop-0.20 directory to the hadoop home folder:

$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/

Also, add the following to your .bashrc and .profile (the paths match the copy above):

# Hadoop home dir declaration
export HADOOP_HOME=/home/hadoop/hadoop-0.20
export PATH=$PATH:$HADOOP_HOME/bin

Now change the following in the configuration files under $HADOOP_HOME/conf:
Change the Java home in hadoop-env.sh, depending on where your Java is installed (the path below assumes the Ubuntu sun-java6 package from step 2 of the prerequisites):

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
Change your core-site.xml to reflect the following:
Change your mapred-site.xml to reflect the following:
Change your hdfs-site.xml to reflect the following:
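For a single-node (pseudo-distributed) setup, the minimal values for these three files, following the standard Hadoop quickstart, are roughly as shown below; the host and port values are the conventional defaults and can be changed:

core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

mapred-site.xml:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

hdfs-site.xml (replication of 1, since there is only one datanode):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```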
2. Format the NameNode
To format the Hadoop Distributed filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format
3. Start Hadoop
To start Hadoop, run start-all.sh from the $HADOOP_HOME/bin directory:

$ $HADOOP_HOME/bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
4. Verify the processes
To check whether all the processes are running fine, run the following:

$ jps

You should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker listed (along with Jps itself).
Then I read about a new open-source entrant in the Distributed Computing space called Hadoop. Actually, it's not a new entrant (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for the similar purpose of creating page indexes for Yahoo Web Search. Also, Apache Mahout, a machine learning project from the Apache Foundation, uses Hadoop as its compute horsepower.
Suddenly, I knew Hadoop was the way to go. It uses commodity PCs (#), gives petabytes of storage and the power of Distributed Computing. And the best part about it is that it is FOSS.
You can read more about Hadoop from the following places:
1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.
My next few posts would elaborate more on Hadoop's working.
# Commodity PC doesn't imply cheap PCs; a typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:
- Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
- Memory: 8 GB ECC RAM
- Storage: 4 × 1 TB SATA disks
- Network: Gigabit Ethernet
All this crawler does is take a seed blog URL (my blog), run through all the links on its front page and store the ones that look like blogpost URLs. I assume that every blog in this world is linked to on at least one other blog; thus all of them will eventually get indexed if the spider is given enough time and memory.
This is the code for the crawler. It's in Python and is quite simple. Please run through it and let me know if there is any way to optimise it further:
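A crawler along these lines might look like the following stdlib-only sketch (the blogpost-URL heuristic and the injectable fetcher are my assumptions for illustration, not a definitive implementation):

```python
# Simplified sketch of the blog crawler described above: fetch the seed
# page, collect links that look like blogpost URLs, and queue new pages.
import re
from urllib.request import urlopen

HREF_RE = re.compile(r'href=["\'](http[^"\']+)["\']')
# Crude heuristic for "looks like a blogpost URL" (illustrative only).
POST_RE = re.compile(r'(blogspot\.com|wordpress\.com|/\d{4}/\d{2}/)')

def extract_post_links(html):
    """Return all links in the page that look like blogpost URLs."""
    return [url for url in HREF_RE.findall(html) if POST_RE.search(url)]

def crawl(seed_url,
          fetch=lambda url: urlopen(url).read().decode("utf-8", "ignore"),
          max_pages=10):
    """Breadth-first crawl from the seed, storing blogpost-looking URLs."""
    seen, queue, found = set(), [seed_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        for link in extract_post_links(fetch(url)):
            if link not in found:
                found.append(link)
            if link not in seen:
                queue.append(link)
    return found
```

The fetch parameter is injectable, so the network layer can be swapped out for a stub when testing, and max_pages bounds the "enough time and memory" assumption.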