Order Statistic

Consider the following problem:

Given k bottles and 1 liter of water, we randomly split the water into these bottles. Let v_i be the volume of water in i^{th} bottle, what is the distribution of v_i?

The problem can be model by the following process. We random generated k-1 real numbers x_i in [0,1] with uniform distribution. Then the interval [0,1] is split into k intervals. We can regard length of i^{th} interval as the random variable v_i.

First let’s look at the first interval which corresponding to the first bottle. It’s length is the distance between 0 and the minimum number generated. So we have v_i = x_{(i)} = \min_i\{x_i\}. Therefore, v_i is the 1st order statistic of sample \{x_i\}.

Let \{x_{i}\} be iid. Let f(x) be the probability density function and F(x) be the cumulative distribution function of $x_i$. Then the probability density of the $i^{th}$ statistic is

f_i(x) = {n! \over (i-1)!(n-i)!} F(x)^{i-1} (1-F(x))^{n-i} f(x)

In this case, n=k-1, f(x)= 1, x\in [0,1], and F(x) = x, x\in [0,1], the first order statistic is

f_1(x) = (k-1)(1-x)^{k-2} , x\in [0,1]

This is a Beta distribution. It’s expectation is 1/k.

Notes in Implementing RVM

Due to the sparse property of RVM, many of the \alpha_i would approach infinity. This would cause the Hessian matrix to be singular and the inverse operation to be ill-posed. Therefore, we take the method in (Nabney,1999) to avoid such a problem.

Let \delta_{\mathbf{w}} = \mathbf{H}^{-1}\nabla E(\mathbf{w}). By multiplying both side with \mathbf{H} we have \mathbf{H} \delta_{\mathbf{w}} = \nabla E(\mathbf{w}). Suppose that \mathbf{H} = U^{\mathrm{T}}U and \nabla E(\mathbf{w}) = U^{\mathrm{T}}Y, then \delta_{\mathbf{w}} is the solution of (U^{\mathrm{T}}U)\delta_{\mathbf{w}} = U^{\mathrm{T}}Y where Y is the solution of equation U^{\mathrm{T}}Y = \nabla E(\mathbf{w}). Hence we convert the inverse operation into solving linear equations which is more stable.

An Interesting Discussion in Our Group

Things start from my email which sent to our group mail list on an interesting passage as following

Don’t delete this just because it looks weird. Believe it or not, you can read it

I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid Aoccdrnig to rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae.The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh?

Then people in our group began to post interesting comments on this passage

Read more »

Review for Gaussian Distribution

Gaussain distribution for univariate random variable x

N(x|\mu, \sigma^2)=\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\{-\frac{1}{2\sigma^2}(x-\mu)^2\}

Gaussain distribution for D-dimensional random variable \textbf{x}

N(\textbf{x}|\mu, \sigma^2)=\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\{-\frac{1}{2}(\textbf{x}-\mu)^T\Sigma^{-1}(\textbf{x}-\mu)

Conditional distribution:

p(x_a|x_b) = N(x_a|\mu_{a|b},\Lambda_{aa}^{-1})
where \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b-\mu_b)

p(x_b|x_a) = N(x_b|\mu_{b|a},\Lambda_{bb}^{-1})
where \mu_{b|a} = \mu_b - \Lambda_{bb}^{-1}\Lambda_{ba}(x_a-\mu_a)

Marginal distribution

p(x_a) = N(x_a|\mu_a,\Sigma_{aa})

p(x_b) = N(x_b|\mu_b,\Sigma_{bb})

Let p(\textbf{x}) = N(\textbf{x}|\mu,\Lambda^{-1}), and p(\textbf{y}|\textbf{x}) = \mathcal {N}(\textbf{y}|\textbf{Ax+b},\textbf{L}^{-1}),

Marginal distribution of \textbf{y},

p(\textbf{y}) = \mathcal {N}(\textbf{y}|\textbf{A}\mu+\textbf{b}, \textbf{L}^{-1}+\textbf{A}\Lambda^{-1}\textbf{A}^T)

Conditional distribution of x given y

p(x|y) = N(x|\Sigma{A^TL(y-b) +\Lambda\mu},\Sigma)

where \Sigma = (\Lambda + A^TLA)^{-1}

Correcting Sample Selection Bias by Unlabeled Data

nonparametric method which directly produces resampling weights without distribution estimation. Distribution matching; the means of the training and test points in a reproducing kernel Hilbert space are close.

The author even claim that their method can in some cases outperform reweighting using the true sample bias distribution. But why it is possible?

support of Pr’ is contained in the support of Pr

two major drawbacks: 1. Pr and Pr’ should be good. 2. overkill

Core theorem: if we find the solution of the following problem, then \beta is the resampling weight we needed.

image

 

An interesting workshop in NIPS

Analysis of Representations for Domain Adaptation

This is a nips paper in this year. Different from what we do about finding a good representation for domain adaptation, the authors want to propose an evaluation method for the representation and propose the question “under what conditions can we adapt a classifier trained on the source domain for use in the target domain?”.

As we known, it is critical point in transfer learning that how to measure the difference between two domains. In this paper, A-distance is proposed for this metric. A-distance is different from the usual distances of two distributions such as K-L divergence that it only care about the difference of probability on some subsets in the probability space. The definition is as following:

d_A(P,Q)=2\sup_{A\in S}|P(A)-Q(A)|

Sometimes it may be too strong to define the distance based every points as in KL-divergence. Therefore, A-distance may be a better choice.

When actually compute the A-distance, it is converted to a binary classification problem of discriminating source domain and target domain. The idea is straightforward, if two distributions are similar, it would be hard to classify the samples drawn from them, vise versa.

The paper provided a bound on the target domain error based on the assumption that there exist a good classifier in the chosen hypothesis set. The bound is controlled by the classification error on source domain data and the A-distance of the two distributions. It seems reasonable.

General speaking, this paper is interesting. However, it lucks some details in mathematics.

A trick on minimax problem

I learned a trick to convert a minimax problem into a minimum problem today. That is, if we have the problem of

\max_{x\in F} (\min_{i\in [1,n]}x_i)

Then it is equivalent to the following minimum problem

\max_{z\in F} z \ \ \ \  s.t.\ z < x_i

This is a very useful trick. Consider the problem of data selection in SVM. We’d like to select the subset of data which maximize the margin as well as the linear separator minimize the margin.

SVM:

minimize (1/2)||\mathbf{w}||^2, subject to c_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1, \quad 1 \le i \le n.

Subset SVM:

\min \max_{S\subset D} ||\textbf{w}||^2, \ \ \ s.t. \ c_i(\textbf{w}\cdot \textbf{x}_i-b)\geq  1, i\in S

Therefore, this problem can be converted to

\min z, \ \ \ s.t. \ c_i(\textbf{w}\cdot \textbf{x}_i-b)\geq  1, i\in S  \ \ z>||\textbf{w}||^2

But is it an easier problem to solve? I’m not sure at present.

An easy way to train a cost sensitive model by SVM^light

Although SVMlight supports giving each example a different weight in the cost function, it does not provide the interface in its Web site and related documents. But you can find it in its code. The interface is easy to use. You just need to change the input data file format as:

<line> .=. <target> cost:<weight> <feature>:<value> <feature>:<value> … <feature>:<value> # <info>

The weight by default is 1, of course.

Tutorial Notes


Hal Daume III gave an excellent tutorial on Bayesian Techniques for NLP at MSRA. Although the tutorial’s name is about NLP but actually it is all about graphical models. In this tutorial, Hal showed many detail usually you cannot read directly from the papers. He also gave us some vivid illusion of abstract ideas such as Metropolis-Hastings Sampling and Gibbs Sampling etc.


Why the accept probability in Metropolis-hastings Sampling is: \displaystyle \min\{1, \frac{p(x')q(x|x')}{p(x)q(x'|x)}\} ? Although there is mathematic reason why the formula is like above, but the following analysis give us more intuition why it is so.


At current step where we have sample x, suppose we get x’ from \displaystyle q(\cdot|x), the question is should we accept this sample? If p(x’) is high, it indicates it is probably a good sample for distribution p, therefore we’d better accepted. If p(x) is low, it indicates the current point is not a good stand point to generate samples for p, thus we’d better move to the new sample x’. If q(x’|x) is high and q(x|x’) is low, this means it is very likely to transfer from x to x’ but never come back, then probably it is to risk to transfer.