A Study of the Applicability of Matrix Factorization to Ranking Large Language Models

Artem Andreevich Vyatkin; Alexander Vladimirovich Poptsov; Valerii Dmitrievich Oliseenko; Maxim Viktorovich Abramov

doi:10.15622/ia.25.2.1

Artem Andreevich Vyatkin, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)
Alexander Vladimirovich Poptsov, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)
Valerii Dmitrievich Oliseenko, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)
Maxim Viktorovich Abramov, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)

Keywords:

large language models, model quality evaluation, matrix factorization, finance

Abstract

In recent years, large language models (LLMs) have been widely adopted in finance. Comparing such models directly can be difficult: the datasets and the LLMs themselves may be closed, and evaluation settings may differ between studies. To fill in the unknown metric values, this work proposes using matrix factorization, a method from recommender systems originally designed to predict user preferences. The aim of the work is to assess whether matrix factorization is applicable to predicting LLM quality metrics on financial tasks and to develop a method for ranking LLMs based on aggregated quality metrics. An experiment applies matrix factorization to data on 34 LLMs and 42 financial datasets collected from published research. The mean absolute error (MAE) of the method, averaged over all runs, is 0.07 on the test set. The top positions in the ranking are held by DeepSeek R1, OpenAI GPT-4o, OpenAI o1-mini, Fin-R1, and Claude 3.5 Sonnet. The effect of prediction error on the final predictions is examined in two ways: using MAE and using the Monte Carlo method. The results are analyzed, and the main conclusions are: (a) matrix factorization can be used to predict unknown metric values of models on datasets; (b) the scores of the leading large language models have converged to the point that no clear leader can be identified; (c) large prediction errors make it possible to identify model-specific behavior on particular tasks. The proposed ranking method can simplify the choice of a suitable model for financial tasks.
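The pipeline summarized in the abstract (complete a sparse models-by-datasets score matrix, measure MAE on held-out cells, rank models by an aggregate score, and probe rank stability with Monte Carlo) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the factorization is a plain SGD variant with L2 regularization, the aggregate is a simple row mean, and the Monte Carlo step perturbs predicted cells with Gaussian noise scaled by the MAE. The abstract does not specify the authors' hyperparameters, aggregation rule, or noise model.

```python
import numpy as np

def factorize(R, mask, rank=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit R ~ P @ Q.T on observed cells (mask == True) with plain SGD.

    All hyperparameters here are illustrative assumptions, not the paper's.
    """
    rng = np.random.default_rng(seed)
    n_models, n_datasets = R.shape
    P = 0.1 * rng.standard_normal((n_models, rank))    # model factors
    Q = 0.1 * rng.standard_normal((n_datasets, rank))  # dataset factors
    observed = np.argwhere(mask)
    for _ in range(epochs):
        rng.shuffle(observed)  # visit observed cells in random order
        for i, j in observed:
            err = R[i, j] - P[i] @ Q[j]
            p_old = P[i].copy()
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P, Q

# Synthetic stand-in for a models x datasets score matrix (NOT the paper's data).
rng = np.random.default_rng(42)
n_models, n_datasets = 6, 8
A = rng.uniform(0.3, 1.0, (n_models, 2))
B = rng.uniform(0.3, 1.0, (n_datasets, 2))
R = (A @ B.T) / 2.0  # every score lies in (0, 1)

# Hold out a fixed pattern of cells to play the role of unknown metrics.
rows, cols = np.indices(R.shape)
test_mask = (rows + cols) % 4 == 0
train_mask = ~test_mask

P, Q = factorize(R, train_mask)
pred = P @ Q.T
mae = float(np.abs(pred[test_mask] - R[test_mask]).mean())

# Fill the unknown cells with predictions, then rank models by mean score.
completed = np.where(train_mask, R, pred)
ranking = np.argsort(-completed.mean(axis=1))  # best model first

# Hypothetical Monte Carlo check: add noise of scale MAE to the predicted
# cells only and count how often each model ends up in first place.
n_trials = 1000
tops = np.empty(n_trials, dtype=int)
for t in range(n_trials):
    noise = rng.normal(0.0, mae, R.shape) * test_mask
    tops[t] = np.argmax((completed + noise).mean(axis=1))
top_freq = np.bincount(tops, minlength=n_models) / n_trials

print(f"held-out MAE: {mae:.3f}")
print("ranking (best first):", ranking.tolist())
print("share of trials each model ranks first:", top_freq.round(2).tolist())
```

If several models share a noticeable fraction of first places in `top_freq`, the ranking at the top is not stable under the prediction error, which is the kind of situation conclusion (b) describes.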

Section: Artificial intelligence, data and knowledge engineering