streaming estimation

2023-07-01 笔记 0 Comments

我的了解是极其有限的，但无论如何，从零开始学一点什么是让人快乐的。

数据流上的算法是多样的，例如它可以在和图结合Graph Stream Algorithms: A Survey。

最经典的问题必然是，询问每个元素的出现的频数。常见的估计方法是 sketch 和 sample。

类似的可以估计元素的 frequency moment，对于 $F_0$ 而言就是 unique 的元素个数， $F_1$ 就是元素个数， $F_2$ (Gini’s index of homogeneity , repeat rate) ，或许是有点用的，更高的 frequency moment 可以用于估计数据偏度。

STOC’96 The space complexity of approximating the frequency moment

这是一篇似乎非常经典的工作，它给出了上面的估计数据流的frequency moment 的一个经典方法，基于采样。

核心思想是构建一个随机变量 $X=m(r^k - (r-1)^k)$ ，使得 $X$ 的期望是 frequency moment ，且 $X$ 的方差较小。这里的 $m$ 是元素个数， $r$ 是这个位置之后出现的和这个位置相同的元素个数。

这个方法也被用在熵的估计中 SIGMETRICS06 Data streaming algorithms for estimating entropy of network traffic，它在估计时还使用了大小流分流的思路，对于大频数的直接使用count记录，对于小流部分使用构造随机变量的思路。空间是 $O(\log m \log (1/\delta) / \epsilon^2)$。

关于熵估计，可能有关的

The Power of Linear Estimators | IEEE Conference Publication | IEEE Xplore

给出的 $O(\frac{n}{\epsilon \log n})$ 的采样，失败概率 $o(1/poly(n))$ ，within additive accuracy $\epsilon$

2010 A near-optimal algorithm for estimating the entropy of a stream | ACM Transactions on Algorithms

2008 Sketching and streaming entropy via approximation theory

2010 A CLT and tight lower bounds for estimating entropy (purdue.edu)

我震惊了，这怎么还能和VCD扯上关系！！！Optimal Space Lower Bounds for all Frequency Moments

很神奇的一篇工作，试图学习一下。

好像和数据流没关系 SODA07 Approximating Entropy from Sublinear Samples

本文链接： http://emoairx.github.io/blog/2023/07/01/streaming_estimation/

版权声明： 本博客所有文章除特别声明外，均采用 CC BY 4.0 CN协议许可协议。转载请注明出处！

emoairxPKU,EECS

春天来了，冬天还会远吗~