Thursday, April 30, 2015

Integrating Tableau and R for Data Analytics


Tableau is one of the most popular business intelligence tools, while R is the hottest statistical analysis software (at least on campus). Tableau can handle various data sources such as SQL, Google Analytics, Excel, CSV, and Hadoop. R performs quite well when the dataset is smaller than 1 GB. How can we combine Tableau's computation power with R's all-inclusive statistical packages? Here comes the solution.

Steps

Tableau 9 lets users connect to an Rserve server, through which you can call R functions.

  1. Set up Rserve. Open R or RStudio and run the following commands.

     install.packages("Rserve")   # install the Rserve package (only needed once)
     library("Rserve")
     Rserve(args="--save")        # the "--save" argument is needed if you are using a Mac
    
  2. Connect Rserve with Tableau.

    • Help → Settings and Performance → Manage R Connection (screenshot: Connection.png)
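In that dialog you enter the server and port where Rserve is listening; a local Rserve listens on port 6311 by default. The values below are what I would try, they are not spelled out in the original post. Test the connection from the dialog before closing it.

     Server: localhost
     Port:   6311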

Example

Now Tableau is connected to R, and you can use R functions to analyze Tableau data.

Here we are going to use Fisher's iris data as an example. This dataset contains five columns: sepal length, sepal width, petal length, petal width, and species. Fisher originally collected it to classify flowers by the first four attributes.

Drag the variables in Tableau like this:

Then add a calculated field called Cluster. Right-click Measures and select Create Calculated Field.

Create the field by writing the following script.

The SCRIPT_INT() function means the R script returns an integer. The quoted part is an R script. It takes SUM(F1), SUM(F2), SUM(F3), and SUM(F4) as inputs and returns the result of a k-means clustering.
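The exact script from the screenshot is not reproduced in the text, but a minimal sketch of what such a calculated field could look like is below. The column names F1 through F4 come from the paragraph above, while the choice of three clusters is my assumption; inside the R string, .arg1 through .arg4 are how Tableau passes the four SUM() expressions to Rserve.

     SCRIPT_INT(
         "# run k-means on the four measurements sent from Tableau
          km <- kmeans(data.frame(.arg1, .arg2, .arg3, .arg4), 3)
          km$cluster",
         SUM([F1]), SUM([F2]), SUM([F3]), SUM([F4])
     )

Because SCRIPT_INT is a table calculation, the view needs a dimension (for example a row ID) on Detail so that R receives one row per flower rather than a single aggregated value.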

Drag Cluster onto Color and you get the final result.

This is just a simple example of integrating R and Tableau; there is more wonderful stuff awaiting.

Monday, March 9, 2015

What did I think when I came back from the Hackathon?

 
This hackathon was held by Cornell and the Columbia Data Science Institute, and sponsored by Accenture, Capital One Lab, and a few other companies. If you asked me how I felt after the hackathon, I would say: tired, but it was worth it.

It was at this hackathon that I saw Hilary Mason, Claudia Perlich, and many other interesting people. It was at this hackathon that I got to see some wonderful ideas, such as lyartist (an app that helps lyric writers), an Xbox visualization, and an app that helps you stay away from dangerous areas of a city.

Our team analyzed 550 Indeed job posts hiring data scientists, and we got some interesting results.
Here are the top skills required of a data scientist. Want to do data science? Start learning Python first.


python        270
r             268
sql           258
hadoop        229
java          174
processing    157
excel         138
c             112
matlab         99
c++            88

Another piece of good news for newbie data scientists is that companies do not expect you to know everything about web development, algorithm design, SQL and NoSQL database management, the Hadoop family, and Scala. Such full-stack data science positions are few.

Besides, the top algorithms are clustering, sampling, ridge regression, and AdaBoost. In a word, you do not need to know the exact details of complicated algorithms; understanding and applying the simple ones is more important.

So what is the difference between a great and a good data scientist? Claudia Perlich gave me the answer: data intuition. She compared data scientists to detectives: a great detective always has an assumption, then confirms or rejects it with facts. A great data scientist must have a sufficient understanding of human behavior and the industry.

Another idea worth mentioning is that there is no clean data; you have to embrace the randomness of the world. Good data is data that truly reflects facts, not just data that is clean and well formatted.

I also liked the advice given by the data scientist panel:

1. When you are young, do the job you really like. Do not work for money.
2. Always keep writing.
3. Get some sleep after a hackathon.


Saturday, February 21, 2015

Download all the PDFs you need with Python.

Notebook

Yesterday I wanted to download PDF materials from Andrew Moore's tutorials, but I hated clicking each link and saving the PDFs one by one. So I wrote a script to do the job for me.

Here is the idea:

  • Get the URLs of all topic pages
  • Get the URL of each PDF
  • Download it

In []:
import requests
import os
import BeautifulSoup   #BeautifulSoup 3; with bs4 this would be: from bs4 import BeautifulSoup
#!~/anaconda/bin/pip install wget
#if you have not installed some package, install it!
import wget
#using wget to download the pdfs

def findlinksnot(url, spec):
    #Find the urls of all topic pages listed on http://www.autonlab.org/tutorials/list.html
    #Keep only relative links (those whose href does not contain spec, e.g. 'http')
    parent_dir = url[:url.rfind("/")] + "/"
    html = requests.get(url).text
    soup = BeautifulSoup.BeautifulSoup(html)
    links = [parent_dir + a['href'] for a in soup.findAll('a', href=True) if spec not in a['href']]
    return links

def findlink(url, spec):
    #Find all pdf links on a topic page, e.g. http://www.autonlab.org/tutorials/infogain.html
    parent_dir = url[:url.rfind("/")] + "/"
    html = requests.get(url).text
    soup = BeautifulSoup.BeautifulSoup(html)
    links = [parent_dir + a['href'] for a in soup.findAll('a', href=True) if spec in a['href']]
    return links

def getpdf(link, newpath=os.getcwd()):
    #Download a single pdf into newpath
    name = wget.download(link, out=newpath)
    print "%s downloaded in %s" % (name, newpath)

def main(url, newpath=os.getcwd()):
    #if newpath does not exist, create it
    if not os.path.exists(newpath): os.makedirs(newpath)
    topics = findlinksnot(url, 'http')
    print topics
    for topic in topics:
        links = findlink(topic, '.pdf')
        for link in links:
            getpdf(link, newpath=newpath)

#if __name__ == '__main__':
#    main()
main('http://www.autonlab.org/tutorials/list.html')

Future work:

  • Figure out how to make the script work with command-line arguments (a rough sketch is below)
  • Learn more about Scrapy and do the job more automatically
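For the first item, here is a rough sketch of how the notebook above could be wrapped with command-line arguments; the flag names --url and --out are my own choice, not something from the original script.

In []:
import argparse
import os

#a possible command-line wrapper around main() defined above;
#the flag names are assumptions, not from the original post
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Download all pdfs linked from a tutorial index page')
    parser.add_argument('--url',
                        default='http://www.autonlab.org/tutorials/list.html',
                        help='index page that links to the topic pages')
    parser.add_argument('--out', default=os.getcwd(),
                        help='directory where the pdfs are saved')
    args = parser.parse_args()
    main(args.url, newpath=args.out)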
