Wednesday, December 24, 2014

Installing python packages-- the easy way.












Installing a python package --the easy way


There are several package management tools for Python, such as easy_install and pip. These tools help you install and uninstall packages easily. Follow these steps:
1. Download setuptools from the following link:
https://pypi.python.org/pypi/setuptools#unix-including-mac-os-x-curl
2. Install the package you downloaded.
  1. Open your command line and cd to the directory that setup.py is in.
  2. Run: sudo python setup.py install
     You will be asked to type your computer password.
  3. setuptools is now installed.
One thing to clarify: on my machine, packages are installed under Python 2.7 by default, and your setup may well be the same. In that case, once you install a package, say statlab, you can only access it from the Python 2.7 IDLE.
3. Now you can use easy_install to make your life easier.
  1. Install a package: sudo easy_install statlab
  2. You can now use statlab in the Python 2.7 IDLE, but not in version 3.4.
4. The first package that I recommend installing is pip. I like it because it makes uninstalling a package much easier than doing it with easy_install.
  1. Install pip: sudo easy_install pip
  2. Then use pip to uninstall statlab: sudo pip uninstall statlab
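Because a package installed for one Python version is not visible to another, it can help to ask the interpreter itself what it can see. The snippet below is only a small sketch: the helper name is_installed is made up for this post, and it simply tries to import each package. Run it once with each Python version you have.

```python
import sys

def is_installed(package_name):
    """Return True if this interpreter can import the named package."""
    try:
        __import__(package_name)
        return True
    except ImportError:
        return False

# Show which interpreter is answering, then check a few packages.
print("Running Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
for name in ("setuptools", "pip", "statlab"):
    status = "installed" if is_installed(name) else "not installed"
    print("%s: %s" % (name, status))
```

If a package shows up as installed under Python 2.7 but not under 3.4, it was installed into the 2.7 path only.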

That's it.

Tuesday, November 25, 2014

General Idea of Userbased Recommendation

Userbased Recommendation


The idea of user-based recommendation is pretty straightforward: you will probably like the things that people similar to you like. Then the question is: how do we measure similarity between people? By the ratings of the items that both of you have rated!

Algorithm

  1. Input: a dataset that includes userID, itemID, and rating.

  2. Calculate similarity. The following measures are available.

    Here we have two users x and y, who both rated the same k items.
    (1) Pearson Similarity:

    $\frac{\sum_{i=1}^{k}(x_i-E(x))(y_i-E(y))}{\sqrt{\sum_{i=1}^{k}(x_i-E(x))^2}\sqrt{\sum_{i=1}^{k}(y_i-E(y))^2}}$

    (2) Euclidean Distance:

    $\sqrt{\sum_{i=1}^{k}(x_i-y_i)^2}$

    (3) Manhattan Distance:

    $\sum_{i=1}^{k}|x_i-y_i|$

    (4) Cosine Similarity: $\frac{\sum_{i=1}^{k}x_i y_i}{\sqrt{\sum_{i=1}^{k}x_i^2}\sqrt{\sum_{i=1}^{k}y_i^2}}$

    The question comes again: which similarity should I use?

    Generally speaking,

    If the data is subject to grade inflation (different users rate on different scales), use Pearson similarity. If the data is dense and the magnitude of the ratings is important, both Manhattan distance and Euclidean distance will work. If your data is sparse, cosine similarity works fine.

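  3. Suppose we have an N by N similarity matrix.

    Get the K people most similar to A. You can choose K by experiment.

    Person  K neighbours  Score  Weight
    A       B             0.7    0.35
            C             0.8    0.40
            D             0.5    0.25

    Each weight is the neighbour's score divided by the sum of the scores (0.7 + 0.8 + 0.5 = 2.0).

    If B, C and D have all rated item X, and their ratings are 3, 4 and 4 respectively, the predicted rating of A for X is 3*0.35 + 4*0.40 + 4*0.25 = 3.65.

    Choose the top i items by predicted rating and recommend them to user A.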
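The algorithm above can be sketched in a few lines of Python. This is only an illustration, not a full recommender: the function names are made up for this post, the similarity functions assume the two rating lists cover the same k items, and the last line redoes the B, C, D example.

```python
import math

def pearson(x, y):
    """Pearson similarity of two equal-length rating lists."""
    k = len(x)
    mean_x, mean_y = sum(x) / float(k), sum(y) / float(k)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den if den else 0.0

def cosine(x, y):
    """Cosine similarity of two equal-length rating lists."""
    num = sum(a * b for a, b in zip(x, y))
    den = (math.sqrt(sum(a * a for a in x))
           * math.sqrt(sum(b * b for b in y)))
    return num / den if den else 0.0

def predict(neighbours):
    """Weighted rating from (similarity score, rating) pairs.

    Each weight is the neighbour's score divided by the sum of scores.
    """
    total = sum(score for score, _ in neighbours)
    return sum(score / total * rating for score, rating in neighbours)

# Neighbours B, C, D with scores 0.7, 0.8, 0.5 and ratings 3, 4, 4.
print(round(predict([(0.7, 3), (0.8, 4), (0.5, 4)]), 2))  # 3.65
```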

Pros

1. It makes recommendations without knowing the details of the items.

Cons

1. Cold start: a large amount of data is needed to make accurate recommendations.

2. Computation: O(n^2) similarity computations are needed to build the similarity matrix.

3. Sparsity: few people rate items online.

Sunday, November 16, 2014

How to set up a web page under a domain?

Disclaimer: I wrote this article just for fun. If you are a geek, skip it.

We visit various websites every day, but how do they work?

Each website is served by a server, maybe a supercomputer. When you visit, say, www.newyorktime.com, the browser translates the domain name into the IP address of that server. Then, using the TCP/IP protocols, the browser contacts the server and gets the web page.
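The name-to-address translation can be tried from Python with the standard socket module. This is just a sketch of the lookup step; the helper name resolve is made up here, and "localhost" is used only because it resolves without a network connection.

```python
import socket

def resolve(hostname):
    """Ask the system resolver for the IPv4 address of a host name."""
    return socket.gethostbyname(hostname)

# "localhost" resolves without a network connection; a public
# domain name is looked up the same way when you are online.
print(resolve("localhost"))
```

Passing a public domain name instead returns the address of the server that the browser would contact.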

If you have a domain, how do you publish a web page?

  1. Log into the server machine with your account and password.

  2. cd ~

  3. You are now in your home directory, where you can see an html directory. Run chmod 711 on html and on the directories under it. This makes those directories executable (searchable) by anyone, while only you can read and write them.

  4. Put your *.html files in your html directory.
  5. chmod 644 *.html. The owner digit 6 is binary 110: readable and writable, not executable. Each 4 is binary 100: readable, not writable, not executable. So you can edit the files and everyone else can only read them.
  6. Type yourdomain/~youraccount in your web browser. You should see the web page that you just created.
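If the octal numbers in the steps above feel cryptic, Python's standard stat module can spell them out in the same style that ls -l uses. A small sketch (the helper name mode_string is made up for this post):

```python
import stat

def mode_string(mode, is_dir=False):
    """Render an octal permission mode the way `ls -l` shows it."""
    kind = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(kind | mode)

print(mode_string(0o644))               # -rw-r--r--
print(mode_string(0o711, is_dir=True))  # drwx--x--x
```

The three groups of letters are the permissions for the owner, the group, and everyone else.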

Tuesday, November 4, 2014

GLM and Natural parameter family

GLM and Exponential Family


GLM

Ordinary linear regression assumes that y changes by a constant amount when x changes by a constant amount. It fails in the following situations: 1. The response has a restricted range, say life or death {0, 1}. 2. We need to estimate a probability. 3. The data are not normally distributed, say, the number of new Ebola victims each day in Mali.

The generalized linear model solves this problem by specifying $g(\mu)=x\beta$, which means that a function of the mean, rather than the mean itself, changes by a constant amount when x changes by a constant amount.

I hope you are not bored yet. Let's look at the form of the generalized linear model, which will help you understand the logistic model, the Poisson model, and the probit model. A GLM is made up of the following:

1. $\eta=x\beta$, which is called the systematic component.

2. $g(\mu)=\eta$, which is called the link function.

3. Last but not least, the random component: the distribution of the response comes from the natural exponential family.
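To make the link function concrete, here is a small Python sketch using the logit link of the logistic model. The function names logit and inv_logit are just labels chosen for this post. The linear predictor can be any real number, while the mean it maps back to stays between 0 and 1.

```python
import math

def logit(mu):
    """Link function g: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse link: maps any linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# eta = x * beta can be any real number, but the mean stays in (0, 1).
for eta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(eta, round(inv_logit(eta), 4))
```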

Natural Exponential Family

Any distribution whose pdf can be written in the following form belongs to the natural exponential family.

$f(y_i,\theta,\phi)=exp(\frac{y_i\theta-b(\theta)}{a(\phi)}+c(y_i,\phi))$

$\theta$: natural parameter, a function of $\mu$

$\phi$: dispersion parameter, which enters through $a(\phi)$

The normal distribution is a member of this family: $\frac{1}{\sqrt{2\pi\sigma^2}}exp(-\frac{(y_i-\mu)^2}{2\sigma^2})$=$exp(\frac{y_i\theta-\theta^2/2}{\sigma^2}-\frac{y_i^2}{2\sigma^2}-\frac{1}{2}log(2\pi\sigma^2))$

Here we can see:

$\theta:\mu$

$b(\theta):\theta^2/2$

$a(\phi):\sigma^2$

$c(y_i,\phi):-\frac{y_i^2}{2\sigma^2}-\frac{1}{2}log(2\pi\sigma^2)$

The binomial distribution also belongs to the natural exponential family:

$\binom{n}{y_i} p^{y_i}(1-p)^{n-y_i}$=$exp[y_ilog(\frac{p}{1-p})+nlog(1-p)+log\binom{n}{y_i}]$

$\theta=log(\frac{p}{1-p})$

$\binom{n}{y_i} p^{y_i}(1-p)^{n-y_i}=exp[y_i\theta-nlog(1+e^\theta)+log\binom{n}{y_i}]$

Here:

$\theta:log(\frac{p}{1-p})$, where $p=\mu/n$ (for $n=1$ this is $log(\frac{\mu}{1-\mu})$)

$b(\theta):nlog(1+e^\theta)$

$a(\phi):1$

$c(y_i,\phi):log\binom{n}{y_i}$

Exponential distribution

$\lambda exp(-\lambda y_i)=exp[-\lambda y_i +log(\lambda)]$

$\mu=\frac{1}{\lambda}$

$\lambda exp(-\lambda y_i)=exp[-\frac{1}{\mu}y_i -log(\mu)]$

$\theta=-\frac{1}{\mu}$

$\lambda exp(-\lambda y_i)=exp[\theta y_i +log(-\theta)]$

Here:

$\theta:-\frac{1}{\mu}$

$b(\theta):-log(-\theta)$

$a(\phi):1$
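A quick way to convince yourself that these rewritings are right is to evaluate both sides numerically. The sketch below is only a sanity check with arbitrarily chosen parameter values. The helper expfam is a name made up here, and it uses the form exp((y*theta - b(theta))/a(phi) + c(y, phi)), which is the form that the b(theta) values in the three examples correspond to. It needs Python 3.8 or later for math.comb.

```python
import math

def expfam(y, theta, b, a_phi, c):
    """Density written as exp((y*theta - b(theta)) / a(phi) + c(y, phi))."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))

# Normal: theta = mu, b = theta^2 / 2, a(phi) = sigma^2
mu, sigma2, y = 1.5, 2.0, 0.7
normal_direct = (math.exp(-(y - mu) ** 2 / (2 * sigma2))
                 / math.sqrt(2 * math.pi * sigma2))
normal_family = expfam(
    y, mu, lambda t: t * t / 2, sigma2,
    lambda v: -v * v / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2))

# Binomial: theta = log(p / (1 - p)), b = n * log(1 + e^theta), a(phi) = 1
n, p, k = 10, 0.3, 4
binom_direct = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
binom_family = expfam(
    k, math.log(p / (1 - p)), lambda t: n * math.log(1 + math.exp(t)), 1.0,
    lambda v: math.log(math.comb(n, v)))

# Exponential: theta = -1/mu = -lambda, b = -log(-theta), a(phi) = 1
lam, t_obs = 2.0, 0.4
exp_direct = lam * math.exp(-lam * t_obs)
exp_family = expfam(t_obs, -lam, lambda t: -math.log(-t), 1.0, lambda v: 0.0)

for name, direct, family in (("normal", normal_direct, normal_family),
                             ("binomial", binom_direct, binom_family),
                             ("exponential", exp_direct, exp_family)):
    print(name, math.isclose(direct, family))
```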

Canonical link

If the link function $g(\mu)$ equals the natural parameter $\theta$ of the natural exponential family, that is $\eta=g(\mu)=\theta$, the link is called the canonical link.

Distribution   Canonical Link
Normal         $\theta=\mu$
Poisson        $\theta=log(\mu)$
Exponential    $\theta=-\frac{1}{\mu}$
Bernoulli      $\theta=log(\frac{\mu}{1-\mu})$

You can get the canonical link of a natural exponential family distribution by reorganizing its density in the way shown in the examples above.

PS: You might have heard of the identity link, $\eta=\mu$. The normal distribution's canonical link is the identity link.
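The Poisson row of the table has no worked example in this post, so here is the same reorganization for the Poisson pmf with mean $\mu$. It follows the same pattern as the other examples:

```latex
\begin{aligned}
\frac{\mu^{y_i}e^{-\mu}}{y_i!}
  &= \exp\left[\,y_i\log(\mu) - \mu - \log(y_i!)\,\right] \\
\theta &= \log(\mu), \qquad
b(\theta) = e^{\theta} = \mu, \qquad
a(\phi) = 1, \qquad
c(y_i,\phi) = -\log(y_i!)
\end{aligned}
```

Reading off $\theta=\log(\mu)$ gives the log link listed in the table.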

Why do we use the canonical link?

How do we estimate the parameters?

How do we do inference?

What is the relation between the GLM and the ordinary linear model?

See you next time!

Reference:

1. http://en.wikipedia.org/wiki/Exponential_distribution

2. http://en.wikipedia.org/wiki/Generalized_linear_model

3. Categorical Data Analysis by Alan Agresti (Wiley, 2013)

Thursday, October 16, 2014

Set Color for Your Terminal


Bored with your black and white Terminal?


Here is a quick way to color it ^ ^

cd ~                 # go to your home directory
ls -a                # you should find a .bash_profile here
vim .bash_profile    # open it in vim
# Add the following lines to your .bash_profile
# if you want a Terminal with a white background and dark text, just like mine.

export CLICOLOR=1
export PS1="\[\033[36m\]\u\[\033[m\]@\[\033[32m\]\h:\[\033[33;1m\]\w\[\033[m\]\$ "
export LSCOLORS=ExFxBxDxCxegedabagacad

alias ls='ls -GFh'

# To save the file and exit vim:
# 1. press Esc
# 2. type :x and press Enter
# woo hoo, then you get the fi


For details, please visit: http://osxdaily.com/2013/02/05/improve-terminal-appearance-mac-os-x/
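The odd-looking \033[36m pieces in PS1 are ANSI color escape sequences: 36 is cyan, 32 is green, 33;1 is bold yellow, and \033[m resets the color. (The \[ and \] around them tell bash that those characters take up no space in the prompt.) You can try the same codes from Python. This is only a sketch: colorize is a made-up helper, and "user", "host" and "~/work" are placeholder strings.

```python
RESET = "\033[m"  # same reset sequence the PS1 line uses

def colorize(text, code):
    """Wrap text in an ANSI color escape sequence."""
    return "\033[" + code + "m" + text + RESET

# Mimic the prompt: cyan user, green host, bold yellow directory.
prompt = (colorize("user", "36") + "@" + colorize("host", "32") + ":"
          + colorize("~/work", "33;1") + "$ ")
print(prompt)
```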

Another exciting finding, at least to me, that I want to share is how to turn on color in vim.

It takes just two lines:

cd ~          # go back to your home directory

vim .vimrc    # you may or may not have this file; do not worry, vim will create it for you

" Write the following lines in the file:

set number    " show line numbers
syntax on     " turn on syntax coloring


That's it.

Just write a new Python file in vim and see.





Sunday, September 21, 2014

draw map with R---something easy and amazing

How to visualize data on maps with R?

Hi, everyone! I am back! After several days of study I found something super interesting in R: drawing maps.

There are three ways of drawing maps: 1. Use your own map data. 2. Use data from the R maps and mapdata libraries. 3. Use the ggmap library.


Using your own data:
library(maps)
library(mapdata)
#install.packages("maptools")
library(maptools)
#install.packages("sp")
library(sp)
x=readShapeSpatial("/Users/leilei/Desktop/The Last Semester/ADA/china-province-border-data/bou2_4p.shp")
install.packages("gpclib")
gpclibPermit()
gpclibPermitStatus()
plot(x)

Using map library:
library(maps)
map("world", fill = TRUE, col = rainbow(200),
    ylim = c(-60, 90), mar = c(0, 0, 0, 0))
title("world map")
What a colorful map!~

Using GGmap:
library(ggmap)
library(ggplot2)
geocode("Fordham University", output = "more")

output:
       lon      lat       type     loctype                                  address
1 -73.8857 40.86204 university approximate fordham university, bronx, ny 10458, usa
     north    south      east      west postal_code       country
1 40.86469 40.85781 -73.88066 -73.89047       10458 united states
  administrative_area_level_2 administrative_area_level_1 locality street streetNo
1                bronx county                    new york     <NA>   <NA>       NA
  point_of_interest              query
1              <NA> Fordham University

You can also create an animated GIF with ggmap:

library(ggmap)
library(animation)
library(XML)
library(ggplot2)
webpage <-'http://www.ldeo.columbia.edu/cgi-bin/quake.cgi'
# download the earthquake table from this website
tables <- readHTMLTable(webpage,stringsAsFactors = FALSE)
raw <- tables[[1]]
data <- raw[-1,c('V1','V3','V4')]
# use a regular expression to strip trailing non-digit characters from the coordinates
data$V3<-sub("\\D*$","",as.vector(as.matrix(data$V3)))
data$V4<-sub("\\D*$","",as.vector(as.matrix(data$V4)))
names(data) <- c('date','lan','lon')
data$lan <- as.numeric(data$lan)
# negate the longitude (western longitudes are negative)
data$lon <- -as.numeric(data$lon)
# convert the date strings to Date objects so they can be compared and sorted
data$date <- as.Date(data$date,"%Y-%m-%d")
ggmap(get_googlemap(location="united states", zoom=4,maptype='terrain'),extent='device')+
  geom_point(data=data,aes(x=lon,y=lan),colour = 'red',alpha=0.7)+stat_density2d(aes(x=lon,y=lan,fill=..level..,alpha=..level..),
               size=2,bins=4,data=data,geom='polygon')+theme(legend.position = "none")

plotfunc <- function(x) {
  df <- subset(data,date <= x)
  df$lan <- as.numeric(df$lan)
  df$lon <- as.numeric(df$lon)
  p <- ggmap(get_googlemap(location="united states", zoom=4,maptype='terrain'),extent='device')+
    geom_point(data=df,aes(x=lon,y=lan),colour = 'red',alpha=0.7)
}
time <- sort(unique(data$date))
saveGIF(for( i in time) print(plotfunc(i)))
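
The coordinate cleanup is the least obvious step in the R code above, so here is the same idea sketched in Python for comparison. The sample strings "34.05N" and "118.25W" are made up for illustration (the real table may format its coordinates differently), and clean_coordinate is a hypothetical helper name.

```python
import re

def clean_coordinate(text):
    """Strip trailing non-digit characters, then convert to a number."""
    return float(re.sub(r"\D*$", "", text))

# Hypothetical sample values; a trailing N or W is removed by the regex.
lat = clean_coordinate("34.05N")
lon = -clean_coordinate("118.25W")  # western longitudes are negative
print(lat, lon)
```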





That's it. 





Monday, September 8, 2014

When you want to combine your mac disk, look here!



When I got my first Mac, I set up two disks on it. Several months later, the top disk was nearly full, so I wanted to combine them. Here is how to do it.






First: Applications ----> Utilities ----> Disk Utility
Second: Copy the files you want to keep off the unwanted disk.
Third: Delete the unwanted disk and resize the remaining one (details in the picture above).


That's it. Have a good day!

Sunday, June 15, 2014

Two ways of Getting Data from the web by R

      There are basically two ways of getting data from the web with R: the function get.hist.quote from the tseries package, and the functions getSymbols and setSymbolLookup from the quantmod package.
      (1) get.hist.quote
      usage:
      get.hist.quote(instrument = "^gdax", start, end,
               quote = c("Open", "High", "Low", "Close"),
               provider = c("yahoo", "oanda"), method = NULL,
               origin = "1899-12-30", compression = "d",
               retclass = c("zoo", "its", "ts"), quiet = FALSE, drop = FALSE) 
      instrument: a string with the name of the quote symbol you want to download
      quote: the variables that you want to download
      start: the start date of the time series
      end: the end date of the time series
      provider: 'yahoo' or 'oanda'
      compression: the sampling frequency, 'd' for daily or 'm' for monthly

       Something to remember: when you use oanda, the maximum time span is 500 days.


(2)   getSymbols and setSymbolLookup from the quantmod package
       install.packages('quantmod')
       library(quantmod)
       # use getSymbols to extract data; the result is stored in an object named GSPC
       getSymbols('^GSPC',from='1970-01-01',to='2009-09-15')
       colnames(GSPC)<-c('Open','High','Low','Close','Volume','AdjClose')
       # tell quantmod which source to use for each symbol
       setSymbolLookup(IBM=list(name='IBM',src='yahoo'),USDEUR=list(name='USD/EUR',src='oanda'))
       getSymbols(c('IBM','USDEUR'))