An Algorithm for Generating Thesaurus Expansions for Enterprise Information Retrieval

Dmitry Olegovich Dontsov

doi:10.15622/sp.30.12

Dmitry Olegovich Dontsov, postgraduate student, Laboratory of Information and Computing Systems, SPIIRAS

Keywords:

information retrieval, user query expansion, thesaurus expansions, synonym extraction, named entity recognition, string clustering

Abstract

The aim of this work is to develop an algorithm for generating a thesaurus of synonyms for product names. Modern search engines use such thesauri to expand the user's query and improve search quality. With this approach, the documents retrieved from the search index include not only those containing the query words but also those containing terms close in meaning. In the course of the work, a semi-automatic method for training a named entity recognizer was implemented. A semi-automatic method was also proposed for validating the extracted entities.
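To make the idea of thesaurus-based query expansion concrete, the following is a minimal sketch and not the algorithm described in the article. It assumes a toy inverted index built by whitespace tokenization and a hand-written synonym table; the names `THESAURUS`, `build_index`, `expand`, and `search`, as well as the sample documents, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: each term maps to its synonyms (illustrative data only).
THESAURUS = {
    "laptop": {"notebook"},
    "notebook": {"laptop"},
}


def build_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def expand(term, thesaurus):
    """Return the term together with its thesaurus synonyms."""
    return {term} | thesaurus.get(term, set())


def search(query, index, thesaurus):
    """AND over the query terms, OR over each term's expansions."""
    result = None
    for term in query.lower().split():
        matches = set()
        for variant in expand(term, thesaurus):
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()


# Sample documents (assumed data, not taken from the article).
docs = {
    1: "cheap laptop battery",
    2: "notebook battery replacement",
    3: "desktop power supply",
}
index = build_index(docs)

plain = search("laptop battery", index, {})            # no expansion
expanded = search("laptop battery", index, THESAURUS)  # with synonyms
```

In this sketch the unexpanded query matches only the document that literally contains "laptop", while the expanded query also retrieves the document that uses the synonym "notebook". A production search engine would typically apply the thesaurus inside its analysis chain instead of in application code.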
