4.2 ECGS-Dynamic (2/3) - anobiidae's blog

An illustrated comparison of different sampling procedures is provided in Figure 4.

図4に異なるsampling処理の比較図を示す。

Given for instance an 8-tokens document with 5 times “cat” and 3 times “dog”. The standard CGS samples 8 times for every token. ECGS-Shortcut samples 2 times only, since there are only two types in the document.

"cat"5回、"dog"3回の8 tokenで構成された文書が与えられるものとする。通常のCGSであれば全てのトークンに対する、8回のsamplingが行われる。ECGS-Shortcutでは、documentには2つのtypeしか含まれていないことから、これが2回だけとなる。

In ECGS-Dynamic, Γcat is initialized as [0, 0, 0, 0, 1], the sampling procedure for each type involves two steps.

ECGS-DynamicではΓcatを[0,0,0,0,1]で初期化する。そして２つのステップをそれぞれのtypeに適応する。

For instance, to sample the topics for type “cat”, we first draw a “sampling times” from multinomial(0, 0, 0, 0, 1) (Figure 3 line 4), in this case the only possible outcome is Icat = 5.

例えば、cat typeのtopicをsamplingするためには、まずsampling timesを正規乱数(0,0,0,0,1)からdrawする。この場合、有効な出力はlcat=5だけである。

Thus, we iteratively sample the topics of “cat” for 5 times, which covers all five tokens in this document, and S1 cat = 1. Next, we find this 5-times sampling results in 3 unique topics, that is, |M| = 3.

そして、我々は5回catのtopicをsamplingする。このとき、documentの全ての5つのtokenをカバーし、S1 cat = 1とする。次に、5回のsamplingの結果から3つのuniqなtopicを見つけ、これを|M|=3とする。

Consequently, Γcat is updated to [0, 0, 1, 0, 1], the new Γcat will be used for generating the “sampling times” in the 2nd CGS iteration.

その結果、Γcatは[0,0,1,0,1]にアップデートされ、この新しいΓcatが２回目のCGS iterationでのsampling timesを決定するときに使われる。

If Icat = 3 in the 2nd CGS iteration, then we save the time cost by excluding two tokens of “cat” from the sampling procedure. Ergo, we get a sampling rate for the 2nd CGS iteration of S2 cat = 0.6.

もし、2回目のCGS iterationでlcat=3であれば、我々はsampling処理から"cat"の２つのtokenを除外する事で、処理時間を節約する。それゆえに、2回目のCGS iterationでのsampling rateは、S2 cat = 0.6である。

On the other hand, if Icat = 5 is drawn again,then we have to sample 5 times for the “cat”, which will give S2 cat = 1.

一方で、lcat = 5が再び選択された場合には、我々はS2 cat = 1であるように5回をsampleする。

This procedure continues and consequently more and more tokens are excluded. To sum up, the sampling times Idi for each type is generated from a multinomial distribution, which is the major difference between ECGS-Dynamic, ECGS-Shortcut and standard CGS algorithms.

この処理は繰り返され、非常に多くのtokenが除外されていく。合計すると、それぞれのtypeに対するsampling回数ldiは正規分布から決定される。これはECGS-Dynamic, ECGS-Shortcutそしてstandard CGSアルゴリズムの間で異なっている。