Индексирование данных в пространстве большой размерности

Борис Асенович Новиков; Иван Викторович Судос

doi:10.15622/sp.33.2

Борис Асенович Новиков заведующий кафедрой информационно аналитических систем Математико-механический факультет Санкт-Петербургского Государственного Университета
Иван Викторович Судос аспирант Санкт-Петербургский Государственный Университет

DOI:

https://doi.org/10.15622/sp.33.2

Ключевые слова:

индексирование, поиск ближайшего соседа, данные большой размерности, анализ данных, редукция размерности

Аннотация

Индексирование данных является неотделимой частью задачи поиска. В то время как для данных в пространствах размерностей не более 5 существует хорошо изученный набор эффективных алгоритмов индексации и поиска, для пространств большой размерностей эти алгоритмы оказываются неэффективны или неприменимы. В этом обзоре мы приводим существующие обоснования проблем связанных с индексированием в пространствах большой размерности для задачи поиска ближайшего соседа. Рассматриваются возможные методы решения обозначенных проблем, применимые в областях анализа данных, таких как кластеризация и извлечение скрытых структур. Ставится вопрос о применимости различных методов размерностной редукции к задачи индексирования и поиска ближайшего соседа.

Литература

1. Donoho D. L. High-dimensional data analysis: the curses and blessings of dimensionality // American Mathematical Society Conf. Math Challenges of the 21st Century 2000.

2. Manning C.D., Raghavan P., Sch¨utze H. Introduction to Information Retrieval // American Mathematical Society Conf. Math Challenges of the 21st Century 2000.

3. Augenlicht L. H., Kobrin D. Cloning and Screening of Sequences Expressed in a Mouse Colon Tumor // Cancer Research. 1982. vol.42. pp. 1088-1093.

4. Dash M., Liu H. Feature Selection for Classification // Intelligent Data Analysis. 1997. vol.1. pp. 131-156.

5. Jain A.K., Duin R.P.W., Jianchang M. Statistical pattern recognition: a review // Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000. vol.22. pp. 4-37.

6. Koller D., Sahami M. Toward optimal feature selection // In 13th International Conference on Machine Learning. 1995. pp. 284–292.

7. Glazko G., Coleman M., Mushegian A. Similarity searches in genome-wide numerical data sets // Biology Direct. 2006. vol.1.

8. Yan T., Ganesan D., Manmatha R. Distributed Image Search in Sensor Networks // Proceeding SenSys ’08 Proceedings of the 6th ACM conference on Embedded network sensor systems. 2008.

9. Deshpande A., Guestrin C., Madden S.R., Hellerstein J.M., Hongn W. Model-driven data acquisition in sensor networks // Proceeding VLDB ’04 Proceedings of the Thirtieth international conference on Very large data bases. 2004. vol.30. pp. 588-599.

10. Russell S.J., Norvig P., Canny J.F., Malik J.M. Edwards D.D. Artificial intelligence: a modern approach // Prentice hall Englewood Cliffs. 1995.

11. Bellman R. E. Dynamic programming // Princeton University Press. 1957.

12. Beyer K., Goldstein J., Ramakrishnan R., Shaft U. When Is Nearest Neighbor Meaningful // ICDT ’99 Proceedings of the 7th International Conference on Database Theory. 1999. pp. 217-235.

13. Hinneburg A., Aggarwal C., Keim D.A. What Is the Nearest Neighbor in High Dimensional Spaces // VLDB ’00 Proceedings of the 26th International Conference on Very Large Data Bases. 2000. pp. 506-515.

14. Kriegel H.-P., Kr¨oger P., Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering // ACM Transactions on Knowledge Discovery from Data. 2009. vol.3 no.3. pp. 506-515.

15. Parsons L., Haque E., Liu H. Subspace clustering for high dimensional data: a review // ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced dataset. 2004. vol.6.

16. B¨ohm C., Berchtold S., Keim D.A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases // ACM Computing Surveys (CSUR). 2001. no.33. vol.3.

17. Guttman A. R-Trees: A Dynamic Index Structure for Spatial Searching // Proceedings of 1984 ACM SIGMOD International Conference on Management of Data. 1984. pp. 47-57.

18. Weber R., Schek H.-J., Blott S. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces // VLDB’98, Proceedings of 24rd International Conference on Very Large Data Bases. 1998. pp. 194-205.

19. Berchtold S., Keim D. A., Kriegel H.-P. The X-tree: An Index Structure for High-Dimensional Data // VLDB ’96 Proceedings of the 22th International Conference on Very Large Data Bases. 1996. pp. 28-39.

20. Pearson K. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Philosophical Magazine Bases. 1901. vol.2. pp. 559-572.

21. Gunnemann S., Kremer H., Lenhard D., Seidl T. Subspace Clustering for Indexing High Dimensional Data:A Main Memory Index based on Local Reductions and Individual Multi-Representations // Proc. International Conference on Extending Database Technology. 2011. pp. 237-248.

22. M¨uller E., Assent I., G¨unnemann S., Krieger R., Seidl T. Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Database // Proc. IEEE International Conference on Data Minings. 2009. pp. 377-386.

23. Assent I., Krieger R., Miller E., Seidl T. INSCY:Indexing subspace clusters with in-process-removal of redundancy // Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference. 2008. pp. 719-724.

24. Assent I., Krieger R., Miller E., Seidl T. EDSC: efficient density-based subspace clustering // Proceedings of the 17th ACM conference on Information and knowledge management. 2008. pp. 1093–1102.

25. Houle M. E., Kriegel H.-P., Kr¨oger P., Schubert E., Zimek A. Can Shared- Neighbor Distances Defeat the Curse of Dimensionality? // Scientific and Statistical Database Management. 2010. pp. 482-500.

26. Strehl A., Ghosh J. Cluster ensembles — a knowledge reuse framework for combining multiple partitions // The Journal of Machine Learning Research. 2003. pp. 583-617.

27. Aggarwal C.C., Wolf J.L., Phillip S., Yu C.P., Jong S.P. Fast algorithms for projected clustering // Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 1999. pp. 61–72.

28. Nagesh H., Goil S. Adaptive Grids for Clustering Massive Data Sets // SIAM International Conference on Data Mining - SDM. 2002.

29. Fan J., Fan Y., Wu Y. High-dimensional classification // World Scientific, New Jersey. 2011. pp. 3-37.

30. Silverman B. Density Estimation for Statistics and Data Analysis // Monographs on Statistics and Applied Probability, London: Chapman and Hall. 1986.

31. Huber P.J. Projection Pursuit // The Annalas of Statistics. 1985. vol.2. no.13. pp. 435—475.

32. Friedman J.H. Exploratory projection pursuit // Journal of American Statistical Association. 1987. no.82. pp. 249—266.

33. Dobkin D., Lipton R.J. Multidimensional Searching Problems // SIAM Journal on Computing. 1974. no.5. pp. 181–186.

34. Meiser S. Point location in arrangemenm of hyperplanes // Information and Computation. 1993. no.106. pp. 286–303.

35. Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Communications of the ACM - 50th anniversary issue. 2008. vol.51. pp. 117-122.

36. Indyk P., Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality // Proceedings of the thirtieth annual ACM symposium on Theory of computing. 1998. pp. 604-613.

37. Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions // Proceedings of the twentieth annual symposium on Computational geometry. 2004. pp. 253-262.

38. Deegalla S., Bostrom H. Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification // Proceedings of the 5th International Conference on Machine Learning and Applications. 2006. pp. 245-250.

39. Ian J. Principal component analysis // Wiley Online Library. 2005.

40. Gorban A. N. Principal manifolds for data visualization and dimension reduction // Springer. 2007. vol.58.

41. Tenenbaum J.B., De Silva V., Langford J.C. A global geometric framework for nonlinear dimensionality reduction // Science. 2000. vol.290. no.5500. pp. 2319–2323.

42. Kruskal J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis // Psychometrika. 1964. vol.29. no.1. pp. 1-27.

43. Cox T.F., Cox M.A.A. Multidimensional scaling // CRC Press. 2010.

44. Heisterkamp D.R., Peng J. Kernel VA-files for relevance feedback retrieval // Proceedings of the 1st ACM international workshop on Multimedia databases. 2003. pp. 48-54.

45. Heisterkamp D.R., Peng J. Sparse subspace clustering // IEEE Conference on Computer Vision and Pattern Recognition CVPR 2009. 2009. pp. 2790-2797.

46. Von Luxburg U. A tutorial on spectral clustering // Statistics and computing. 2009. vol.17. pp. 395-416.

47. Shi J., Malik J. Normalized cuts and image segmentation // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000. vol.22. pp. 888–905.

Просмотры	2257
Скачивания	1662

Статьи

Индексирование данных в пространстве большой размерности

DOI:

Ключевые слова:

Аннотация

Литература

Опубликован

Статистика

Как цитировать

Выпуск

Раздел

Импакт-фактор

Разделы

Мы в сети

Обратная связь