Tuesday, November 25, 2014

General Idea of User-Based Recommendation

User-Based Recommendation

1. The idea of user-based recommendation is pretty straightforward: you will probably like the things that people similar to you like. Then the question comes: how do we measure similarity between people? Through the ratings of the items that both of you have rated!

Algorithm

  1. Input: a dataset that includes userID, itemID, and rating triples.

  2. Calculate similarity. The following measures are available.

    Suppose we have two users, x and y, who have both rated the same k items.
    (1) Pearson Similarity:

    $\frac{\sum_{i=1}^{k}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{k}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{k}(y_i-\bar{y})^2}}$

    (2) Euclidean Distance:

    $\sqrt{\sum_{i=1}^{k}(x_i-y_i)^2}$

    (3) Manhattan Distance:

    $\sum_{i=1}^{k}|x_i-y_i|$

    (4) Cosine Similarity: $\frac{\sum_{i=1}^{k}x_iy_i}{\sqrt{\sum_{i=1}^{k}x_i^2}\sqrt{\sum_{i=1}^{k}y_i^2}}$

    The question comes again: which similarity should I use?

    Generally speaking: if the data is subject to grade inflation (some users rate consistently higher than others), use Pearson similarity. If the data is dense and the magnitude of the ratings matters, both Manhattan distance and Euclidean distance work well. If the data is sparse, cosine similarity works fine. (A code sketch of these measures follows this list.)

  3. Suppose we have an N-by-N similarity matrix.

    Find the K users most similar to A; K can be chosen by experiment.

    | Person | Neighbour | Similarity Score | Weight |
    | --- | --- | --- | --- |
    | A | B | 0.7 | 0.35 |
    | A | C | 0.8 | 0.40 |
    | A | D | 0.5 | 0.25 |

    (Each weight is the neighbour's similarity score divided by the total score, 0.7 + 0.8 + 0.5 = 2.0.)

    If B, C, and D have all rated item X, with ratings 3, 4, and 4 respectively, the predicted rating for A is 3 * 0.35 + 4 * 0.40 + 4 * 0.25 = 3.65.

    Choose the top items by predicted rating and recommend them to user A. (A code sketch of the whole procedure is given below.)
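Below is a minimal sketch of the whole procedure (assuming NumPy; the helper function names and the toy ratings dictionary are invented for illustration, not taken from any particular library):

```python
import numpy as np

# Distance measures (smaller means more similar); shown for completeness.
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    return np.sum(np.abs(x - y))

# Similarity measures (larger means more similar).
def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_similarity(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def predict_rating(target, item, ratings, k=3, sim=cosine_similarity):
    """Predict `target`'s rating of `item` as a weighted average of the
    ratings given by the k most similar users who have rated that item."""
    neighbours = []
    for user, user_ratings in ratings.items():
        if user == target or item not in user_ratings:
            continue
        # Similarity is computed over the items both users have rated.
        common = [i for i in ratings[target] if i in user_ratings]
        if len(common) < 2:
            continue
        x = np.array([ratings[target][i] for i in common], dtype=float)
        y = np.array([user_ratings[i] for i in common], dtype=float)
        neighbours.append((sim(x, y), user_ratings[item]))
    neighbours.sort(reverse=True)              # most similar users first
    top = neighbours[:k]                       # assumes at least one neighbour rated the item
    weights = np.array([s for s, _ in top])
    scores = np.array([r for _, r in top])
    return float(np.dot(weights / weights.sum(), scores))  # weights normalized to sum to 1

# Toy data: users B, C, D have rated item "X" with 3, 4, 4, as in the example above.
ratings = {
    "A": {"i1": 4, "i2": 5, "i3": 3},
    "B": {"i1": 4, "i2": 5, "i3": 2, "X": 3},
    "C": {"i1": 5, "i2": 4, "i3": 3, "X": 4},
    "D": {"i1": 2, "i2": 3, "i3": 5, "X": 4},
}
print(predict_rating("A", "X", ratings))       # a weighted average between 3 and 4
```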

Pros

1. Makes recommendations without knowing any details about the items themselves.

Cons

1. Cold start: a large amount of data is needed before the recommendations become accurate.

2. Computation: O(n^2) work is needed to compute the similarity matrix.

3. Sparsity: few people rate items online, so the user-item rating matrix is sparse.

Sunday, November 16, 2014

How to set up a web page under a domain?

Disclaimer: I am just writing this article for fun; if you are a geek, skip it.

We visit all kinds of websites every day, but how does it actually work?

Each website is served by a server, maybe a powerful computer. When you visit, say, www.newyorktime.com, the browser translates the domain name into the IP address of that server. Then, via the TCP/IP protocol, your browser contacts the server and gets the web page.

If you have a domain, how do you put up a web page?

  1. Log into the server machine with your account and password.

  2. cd ~

  3. You are now in your home directory, where you can see an html directory. Run chmod 711 on html and on the directories under it. This makes these directories traversable (executable) by anyone.

  4. Put your *.html files in the html directory.
  5. chmod 644 *.html. The digit 4 is binary 100: readable, not writable, not executable.
  6. Type yourdomain/~youraccount in a web browser; you should see the web page that you just created.
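Not part of the original steps, but as a small illustration, here is a minimal Python sketch of steps 3 to 5 (the index.html file name and its hello-world content are made up):

```python
import os
from pathlib import Path

# Create ~/html and make it traversable by anyone (rwx--x--x), as in step 3.
html_dir = Path.home() / "html"
html_dir.mkdir(exist_ok=True)
os.chmod(html_dir, 0o711)

# Drop a page into it and make it world-readable (rw-r--r--), as in steps 4-5.
index = html_dir / "index.html"
index.write_text("<html><body><h1>Hello, world!</h1></body></html>\n")
os.chmod(index, 0o644)
```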

Tuesday, November 4, 2014

GLM and Natural Parameter Family

GLM and Exponential Family

GLM

Ordinary linear regression assumes that y changes by a constant amount when x changes by a constant amount. It fails in the following situations:

1. The response has a restricted range, say life or death in [0, 1].

2. We need to estimate a probability.

3. The data are not normally distributed, say, the number of new Ebola victims each day in Mali.

The generalized linear model solves this problem by specifying $g(\mu)=x\beta$: a function of the mean, rather than the mean itself, varies linearly with $x\beta$.

I hope you are not getting bored; let's look at the form of the generalized linear model, which will help you understand the logistic, Poisson, and probit models well. A GLM is made up of the following:

1. $\eta=x\beta$, which is called the systematic component.

2. $g(\mu)=\eta$, which is called the link function.

3. Last but not least, the random component: the response distribution comes from the natural exponential family.
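As a minimal sketch of how the three components fit together in practice (the choice of statsmodels and the simulated data are my own assumptions, not part of the original post), here is a logistic GLM fit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # true mean via the inverse logit link
y = rng.binomial(1, p)                    # binomial (Bernoulli) random component

X = sm.add_constant(x)                    # design matrix for the systematic component eta = x*beta
model = sm.GLM(y, X, family=sm.families.Binomial())  # logit is the default (canonical) link
result = model.fit()
print(result.params)                      # estimates of beta
```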

Natural Parameter Family

Any distribution whose pdf can be written in the following form belongs to the natural parameter family:

$f(y_i;\theta,\phi)=exp(\frac{y_i\theta-b(\theta)}{a(\phi)}+c(y_i,\phi))$

$\theta$: the natural parameter, a function of $\mu$

$a(\phi)$: the dispersion parameter

The normal distribution is one member of this family: $\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_i-\mu)^2}{2\sigma^2})=exp(\frac{y_i\theta-\theta^2/2}{\sigma^2}-\frac{y_i^2}{2\sigma^2}-\frac{1}{2}log(2\pi\sigma^2))$

Here we can see:

$\theta:\mu$

$b(\theta):\theta^2/2$

$a(\phi):\sigma^2$

$c(y_i,\phi):-\frac{y_i^2}{2\sigma^2}-\frac{1}{2}log(2\pi\sigma^2)$

The binomial distribution is also in the natural exponential family:

$\binom{n}{y_i} p^{y_i}(1-p)^{n-y_i}=exp[y_ilog(\frac{p}{1-p})+nlog(1-p)+log\binom{n}{y_i}]$

$\theta=log(\frac{p}{1-p})$

$\binom{n}{y_i} p^{y_i}(1-p)^{n-y_i}=exp[y_i\theta-nlog(1+e^\theta)+log\binom{n}{y_i}]$

Here:

$\theta:\log(\frac{\mu}{1-\mu})$

$b(\theta):nlog(1+e^\theta)$

$a(\phi):1$

The exponential distribution is also a member of this family:

$\lambda exp(-\lambda y_i)=exp[-\lambda y_i +log(\lambda)]$

$\mu=\frac{1}{\lambda}$

$\lambda exp(-\lambda y_i)=exp[-\frac{1}{\mu}y_i -log(\mu)]$

$\theta=-\frac{1}{\mu}$

$\lambda exp(-\lambda y_i)=exp[\theta y_i +log(-\theta)]$

$\theta=-\frac{1}{\mu}$

$b(\theta)=-log(-\theta)$

Canonical Link

If the link function $g(\mu)$ equals the natural parameter $\theta$ of the natural exponential family, that is $g(\mu)=\theta$, this kind of link is called the canonical link.

| Distribution | Canonical Link |
| --- | --- |
| Normal | $\theta=\mu$ |
| Poisson | $\theta=log(\mu)$ |
| Exponential | $\theta=-\frac{1}{\mu}$ |
| Bernoulli | $\theta=log(\frac{\mu}{1-\mu})$ |

You can obtain the canonical link of a natural exponential family distribution by reorganizing its density in the way shown in the examples above.
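For example, the Poisson distribution with mean $\mu$ (listed in the table above but not worked out in the original examples) can be reorganized the same way:

$\frac{\mu^{y_i}e^{-\mu}}{y_i!}=exp[y_ilog(\mu)-\mu-log(y_i!)]$

so $\theta=log(\mu)$, $b(\theta)=e^\theta$ (since $\mu=e^\theta$), and $a(\phi)=1$, which gives the log link shown in the table.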

P.S. You might have heard of the identity link, $\mu=\theta$; the normal distribution's canonical link is also its identity link.

Why do we use the canonical link?

How do we estimate the parameters?

How do we do inference on them?

What is the relation between the GLM and the ordinary linear model?

See you next time!
