Индексирование данных в пространстве большой размерности
Ключевые слова:
индексирование, поиск ближайшего соседа, данные большой размерности, анализ данных, редукция размерностиАннотация
Индексирование данных является неотделимой частью задачи поиска. В то время как для данных в пространствах размерностей не более 5 существует хорошо изученный набор эффективных алгоритмов индексации и поиска, для пространств большой размерностей эти алгоритмы оказываются неэффективны или неприменимы. В этом обзоре мы приводим существующие обоснования проблем связанных с индексированием в пространствах большой размерности для задачи поиска ближайшего соседа. Рассматриваются возможные методы решения обозначенных проблем, применимые в областях анализа данных, таких как кластеризация и извлечение скрытых структур. Ставится вопрос о применимости различных методов размерностной редукции к задачи индексирования и поиска ближайшего соседа.Литература
1. Donoho D. L. High-dimensional data analysis: the curses and blessings of dimensionality // American Mathematical Society Conf. Math Challenges of the 21st Century 2000.
2. Manning C.D., Raghavan P., Sch¨utze H. Introduction to Information Retrieval // American Mathematical Society Conf. Math Challenges of the 21st Century 2000.
3. Augenlicht L. H., Kobrin D. Cloning and Screening of Sequences Expressed in a Mouse Colon Tumor // Cancer Research. 1982. vol.42. pp. 1088-1093.
4. Dash M., Liu H. Feature Selection for Classification // Intelligent Data Analysis. 1997. vol.1. pp. 131-156.
5. Jain A.K., Duin R.P.W., Jianchang M. Statistical pattern recognition: a review // Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000. vol.22. pp. 4-37.
6. Koller D., Sahami M. Toward optimal feature selection // In 13th International Conference on Machine Learning. 1995. pp. 284–292.
7. Glazko G., Coleman M., Mushegian A. Similarity searches in genome-wide numerical data sets // Biology Direct. 2006. vol.1.
8. Yan T., Ganesan D., Manmatha R. Distributed Image Search in Sensor Networks // Proceeding SenSys ’08 Proceedings of the 6th ACM conference on Embedded network sensor systems. 2008.
9. Deshpande A., Guestrin C., Madden S.R., Hellerstein J.M., Hongn W. Model-driven data acquisition in sensor networks // Proceeding VLDB ’04 Proceedings of the Thirtieth international conference on Very large data bases. 2004. vol.30. pp. 588-599.
10. Russell S.J., Norvig P., Canny J.F., Malik J.M. Edwards D.D. Artificial intelligence: a modern approach // Prentice hall Englewood Cliffs. 1995.
11. Bellman R. E. Dynamic programming // Princeton University Press. 1957.
12. Beyer K., Goldstein J., Ramakrishnan R., Shaft U. When Is Nearest Neighbor Meaningful // ICDT ’99 Proceedings of the 7th International Conference on Database Theory. 1999. pp. 217-235.
13. Hinneburg A., Aggarwal C., Keim D.A. What Is the Nearest Neighbor in High Dimensional Spaces // VLDB ’00 Proceedings of the 26th International Conference on Very Large Data Bases. 2000. pp. 506-515.
14. Kriegel H.-P., Kr¨oger P., Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering // ACM Transactions on Knowledge Discovery from Data. 2009. vol.3 no.3. pp. 506-515.
15. Parsons L., Haque E., Liu H. Subspace clustering for high dimensional data: a review // ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced dataset. 2004. vol.6.
16. B¨ohm C., Berchtold S., Keim D.A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases // ACM Computing Surveys (CSUR). 2001. no.33. vol.3.
17. Guttman A. R-Trees: A Dynamic Index Structure for Spatial Searching // Proceedings of 1984 ACM SIGMOD International Conference on Management of Data. 1984. pp. 47-57.
18. Weber R., Schek H.-J., Blott S. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces // VLDB’98, Proceedings of 24rd International Conference on Very Large Data Bases. 1998. pp. 194-205.
19. Berchtold S., Keim D. A., Kriegel H.-P. The X-tree: An Index Structure for High-Dimensional Data // VLDB ’96 Proceedings of the 22th International Conference on Very Large Data Bases. 1996. pp. 28-39.
20. Pearson K. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Philosophical Magazine Bases. 1901. vol.2. pp. 559-572.
21. Gunnemann S., Kremer H., Lenhard D., Seidl T. Subspace Clustering for Indexing High Dimensional Data:A Main Memory Index based on Local Reductions and Individual Multi-Representations // Proc. International Conference on Extending Database Technology. 2011. pp. 237-248.
22. M¨uller E., Assent I., G¨unnemann S., Krieger R., Seidl T. Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Database // Proc. IEEE International Conference on Data Minings. 2009. pp. 377-386.
23. Assent I., Krieger R., Miller E., Seidl T. INSCY:Indexing subspace clusters with in-process-removal of redundancy // Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference. 2008. pp. 719-724.
24. Assent I., Krieger R., Miller E., Seidl T. EDSC: efficient density-based subspace clustering // Proceedings of the 17th ACM conference on Information and knowledge management. 2008. pp. 1093–1102.
25. Houle M. E., Kriegel H.-P., Kr¨oger P., Schubert E., Zimek A. Can Shared- Neighbor Distances Defeat the Curse of Dimensionality? // Scientific and Statistical Database Management. 2010. pp. 482-500.
26. Strehl A., Ghosh J. Cluster ensembles — a knowledge reuse framework for combining multiple partitions // The Journal of Machine Learning Research. 2003. pp. 583-617.
27. Aggarwal C.C., Wolf J.L., Phillip S., Yu C.P., Jong S.P. Fast algorithms for projected clustering // Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 1999. pp. 61–72.
28. Nagesh H., Goil S. Adaptive Grids for Clustering Massive Data Sets // SIAM International Conference on Data Mining - SDM. 2002.
29. Fan J., Fan Y., Wu Y. High-dimensional classification // World Scientific, New Jersey. 2011. pp. 3-37.
30. Silverman B. Density Estimation for Statistics and Data Analysis // Monographs on Statistics and Applied Probability, London: Chapman and Hall. 1986.
31. Huber P.J. Projection Pursuit // The Annalas of Statistics. 1985. vol.2. no.13. pp. 435—475.
32. Friedman J.H. Exploratory projection pursuit // Journal of American Statistical Association. 1987. no.82. pp. 249—266.
33. Dobkin D., Lipton R.J. Multidimensional Searching Problems // SIAM Journal on Computing. 1974. no.5. pp. 181–186.
34. Meiser S. Point location in arrangemenm of hyperplanes // Information and Computation. 1993. no.106. pp. 286–303.
35. Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Communications of the ACM - 50th anniversary issue. 2008. vol.51. pp. 117-122.
36. Indyk P., Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality // Proceedings of the thirtieth annual ACM symposium on Theory of computing. 1998. pp. 604-613.
37. Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions // Proceedings of the twentieth annual symposium on Computational geometry. 2004. pp. 253-262.
38. Deegalla S., Bostrom H. Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification // Proceedings of the 5th International Conference on Machine Learning and Applications. 2006. pp. 245-250.
39. Ian J. Principal component analysis // Wiley Online Library. 2005.
40. Gorban A. N. Principal manifolds for data visualization and dimension reduction // Springer. 2007. vol.58.
41. Tenenbaum J.B., De Silva V., Langford J.C. A global geometric framework for nonlinear dimensionality reduction // Science. 2000. vol.290. no.5500. pp. 2319–2323.
42. Kruskal J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis // Psychometrika. 1964. vol.29. no.1. pp. 1-27.
43. Cox T.F., Cox M.A.A. Multidimensional scaling // CRC Press. 2010.
44. Heisterkamp D.R., Peng J. Kernel VA-files for relevance feedback retrieval // Proceedings of the 1st ACM international workshop on Multimedia databases. 2003. pp. 48-54.
45. Heisterkamp D.R., Peng J. Sparse subspace clustering // IEEE Conference on Computer Vision and Pattern Recognition CVPR 2009. 2009. pp. 2790-2797.
46. Von Luxburg U. A tutorial on spectral clustering // Statistics and computing. 2009. vol.17. pp. 395-416.
47. Shi J., Malik J. Normalized cuts and image segmentation // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000. vol.22. pp. 888–905.
2. Manning C.D., Raghavan P., Sch¨utze H. Introduction to Information Retrieval // American Mathematical Society Conf. Math Challenges of the 21st Century 2000.
3. Augenlicht L. H., Kobrin D. Cloning and Screening of Sequences Expressed in a Mouse Colon Tumor // Cancer Research. 1982. vol.42. pp. 1088-1093.
4. Dash M., Liu H. Feature Selection for Classification // Intelligent Data Analysis. 1997. vol.1. pp. 131-156.
5. Jain A.K., Duin R.P.W., Jianchang M. Statistical pattern recognition: a review // Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000. vol.22. pp. 4-37.
6. Koller D., Sahami M. Toward optimal feature selection // In 13th International Conference on Machine Learning. 1995. pp. 284–292.
7. Glazko G., Coleman M., Mushegian A. Similarity searches in genome-wide numerical data sets // Biology Direct. 2006. vol.1.
8. Yan T., Ganesan D., Manmatha R. Distributed Image Search in Sensor Networks // Proceeding SenSys ’08 Proceedings of the 6th ACM conference on Embedded network sensor systems. 2008.
9. Deshpande A., Guestrin C., Madden S.R., Hellerstein J.M., Hongn W. Model-driven data acquisition in sensor networks // Proceeding VLDB ’04 Proceedings of the Thirtieth international conference on Very large data bases. 2004. vol.30. pp. 588-599.
10. Russell S.J., Norvig P., Canny J.F., Malik J.M. Edwards D.D. Artificial intelligence: a modern approach // Prentice hall Englewood Cliffs. 1995.
11. Bellman R. E. Dynamic programming // Princeton University Press. 1957.
12. Beyer K., Goldstein J., Ramakrishnan R., Shaft U. When Is Nearest Neighbor Meaningful // ICDT ’99 Proceedings of the 7th International Conference on Database Theory. 1999. pp. 217-235.
13. Hinneburg A., Aggarwal C., Keim D.A. What Is the Nearest Neighbor in High Dimensional Spaces // VLDB ’00 Proceedings of the 26th International Conference on Very Large Data Bases. 2000. pp. 506-515.
14. Kriegel H.-P., Kr¨oger P., Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering // ACM Transactions on Knowledge Discovery from Data. 2009. vol.3 no.3. pp. 506-515.
15. Parsons L., Haque E., Liu H. Subspace clustering for high dimensional data: a review // ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced dataset. 2004. vol.6.
16. B¨ohm C., Berchtold S., Keim D.A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases // ACM Computing Surveys (CSUR). 2001. no.33. vol.3.
17. Guttman A. R-Trees: A Dynamic Index Structure for Spatial Searching // Proceedings of 1984 ACM SIGMOD International Conference on Management of Data. 1984. pp. 47-57.
18. Weber R., Schek H.-J., Blott S. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces // VLDB’98, Proceedings of 24rd International Conference on Very Large Data Bases. 1998. pp. 194-205.
19. Berchtold S., Keim D. A., Kriegel H.-P. The X-tree: An Index Structure for High-Dimensional Data // VLDB ’96 Proceedings of the 22th International Conference on Very Large Data Bases. 1996. pp. 28-39.
20. Pearson K. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Philosophical Magazine Bases. 1901. vol.2. pp. 559-572.
21. Gunnemann S., Kremer H., Lenhard D., Seidl T. Subspace Clustering for Indexing High Dimensional Data:A Main Memory Index based on Local Reductions and Individual Multi-Representations // Proc. International Conference on Extending Database Technology. 2011. pp. 237-248.
22. M¨uller E., Assent I., G¨unnemann S., Krieger R., Seidl T. Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Database // Proc. IEEE International Conference on Data Minings. 2009. pp. 377-386.
23. Assent I., Krieger R., Miller E., Seidl T. INSCY:Indexing subspace clusters with in-process-removal of redundancy // Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference. 2008. pp. 719-724.
24. Assent I., Krieger R., Miller E., Seidl T. EDSC: efficient density-based subspace clustering // Proceedings of the 17th ACM conference on Information and knowledge management. 2008. pp. 1093–1102.
25. Houle M. E., Kriegel H.-P., Kr¨oger P., Schubert E., Zimek A. Can Shared- Neighbor Distances Defeat the Curse of Dimensionality? // Scientific and Statistical Database Management. 2010. pp. 482-500.
26. Strehl A., Ghosh J. Cluster ensembles — a knowledge reuse framework for combining multiple partitions // The Journal of Machine Learning Research. 2003. pp. 583-617.
27. Aggarwal C.C., Wolf J.L., Phillip S., Yu C.P., Jong S.P. Fast algorithms for projected clustering // Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 1999. pp. 61–72.
28. Nagesh H., Goil S. Adaptive Grids for Clustering Massive Data Sets // SIAM International Conference on Data Mining - SDM. 2002.
29. Fan J., Fan Y., Wu Y. High-dimensional classification // World Scientific, New Jersey. 2011. pp. 3-37.
30. Silverman B. Density Estimation for Statistics and Data Analysis // Monographs on Statistics and Applied Probability, London: Chapman and Hall. 1986.
31. Huber P.J. Projection Pursuit // The Annalas of Statistics. 1985. vol.2. no.13. pp. 435—475.
32. Friedman J.H. Exploratory projection pursuit // Journal of American Statistical Association. 1987. no.82. pp. 249—266.
33. Dobkin D., Lipton R.J. Multidimensional Searching Problems // SIAM Journal on Computing. 1974. no.5. pp. 181–186.
34. Meiser S. Point location in arrangemenm of hyperplanes // Information and Computation. 1993. no.106. pp. 286–303.
35. Andoni A., Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions // Communications of the ACM - 50th anniversary issue. 2008. vol.51. pp. 117-122.
36. Indyk P., Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality // Proceedings of the thirtieth annual ACM symposium on Theory of computing. 1998. pp. 604-613.
37. Datar M., Immorlica N., Indyk P., Mirrokni V.S. Locality-sensitive hashing scheme based on p-stable distributions // Proceedings of the twentieth annual symposium on Computational geometry. 2004. pp. 253-262.
38. Deegalla S., Bostrom H. Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification // Proceedings of the 5th International Conference on Machine Learning and Applications. 2006. pp. 245-250.
39. Ian J. Principal component analysis // Wiley Online Library. 2005.
40. Gorban A. N. Principal manifolds for data visualization and dimension reduction // Springer. 2007. vol.58.
41. Tenenbaum J.B., De Silva V., Langford J.C. A global geometric framework for nonlinear dimensionality reduction // Science. 2000. vol.290. no.5500. pp. 2319–2323.
42. Kruskal J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis // Psychometrika. 1964. vol.29. no.1. pp. 1-27.
43. Cox T.F., Cox M.A.A. Multidimensional scaling // CRC Press. 2010.
44. Heisterkamp D.R., Peng J. Kernel VA-files for relevance feedback retrieval // Proceedings of the 1st ACM international workshop on Multimedia databases. 2003. pp. 48-54.
45. Heisterkamp D.R., Peng J. Sparse subspace clustering // IEEE Conference on Computer Vision and Pattern Recognition CVPR 2009. 2009. pp. 2790-2797.
46. Von Luxburg U. A tutorial on spectral clustering // Statistics and computing. 2009. vol.17. pp. 395-416.
47. Shi J., Malik J. Normalized cuts and image segmentation // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000. vol.22. pp. 888–905.
Опубликован
2014-06-02
Как цитировать
Новиков, Б. А., & Судос, И. В. (2014). Индексирование данных в пространстве большой размерности. Труды СПИИРАН, 2(33), 24-47. https://doi.org/10.15622/sp.33.2
Раздел
Статьи
Авторы, которые публикуются в данном журнале, соглашаются со следующими условиями:
Авторы сохраняют за собой авторские права на работу и передают журналу право первой публикации вместе с работой, одновременно лицензируя ее на условиях Creative Commons Attribution License, которая позволяет другим распространять данную работу с обязательным указанием авторства данной работы и ссылкой на оригинальную публикацию в этом журнале.
Авторы сохраняют право заключать отдельные, дополнительные контрактные соглашения на неэксклюзивное распространение версии работы, опубликованной этим журналом (например, разместить ее в университетском хранилище или опубликовать ее в книге), со ссылкой на оригинальную публикацию в этом журнале.
Авторам разрешается размещать их работу в сети Интернет (например, в университетском хранилище или на их персональном веб-сайте) до и во время процесса рассмотрения ее данным журналом, так как это может привести к продуктивному обсуждению, а также к большему количеству ссылок на данную опубликованную работу (Смотри The Effect of Open Access).