Comparative Analysis of Scientific Journals Collections
Keywords:
Comparative Thematic Analysis, Comparative Text Model, Deep Text Analysis, Social Network Analysis, Graph Metrics
Abstract
The authors developed an approach to the comparative analysis of scientific journal collections based on analysis of the co-authorship graph and a text model. Time series of co-authorship graph metrics allowed the authors to analyze trends in the development of the journals' author communities. The text model was built using machine learning techniques. Journal content was classified with the text model to determine the degree of authenticity of different journals and of different issues of a single journal. The authors developed a metric, the Content Authenticity Ratio, which quantifies the authenticity of the journal collections being compared. Comparative thematic analysis of the journal collections was carried out using a topic model with additive regularization. Based on this topic model, the authors constructed thematic profiles of the journal archives in a single thematic basis. The approach was applied to the archives of two rheumatology journals for the period 2000–2018. Public data sets from the SNAP research laboratory at Stanford University were used as a benchmark for comparing the co-authorship metrics. As a result, the authors adapted existing examples of effectively functioning author collaborations in order to improve the work of journal editorial staff. A quantitative comparison of large volumes of texts and metadata of scientific articles was carried out. The experiment conducted with the developed methods showed that the content authenticity of the selected journals is 89% and that co-authorship in one of the journals has a pronounced centrality, which is a distinctive feature of its editorial policy. The clarity and consistency of the results confirm the effectiveness of the proposed approach. The Python code developed in the course of the experiment can be used for comparative analysis of other collections of Russian-language journals.
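As a rough illustration of what a time series of co-authorship graph metrics can look like, the following minimal Python sketch builds one co-authorship graph per year and computes two standard measures, density and Freeman degree centralization. This is not the authors' code: the function names, the choice of metrics, and the toy records are assumptions made for illustration only, and the abstract does not say which metrics the authors tracked.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_graph(papers):
    """Build an undirected co-authorship graph as {author: set of co-authors}."""
    graph = defaultdict(set)
    for authors in papers:
        unique = sorted(set(authors))
        for author in unique:
            graph[author]  # touch the key so single-author papers become isolated nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def density(graph):
    """Share of possible author pairs that actually co-authored a paper."""
    n = len(graph)
    if n < 2:
        return 0.0
    edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    return edges / (n * (n - 1) / 2)


def degree_centralization(graph):
    """Freeman degree centralization: 1.0 for a star, 0.0 when all degrees are equal."""
    n = len(graph)
    if n < 3:
        return 0.0
    degrees = [len(neighbours) for neighbours in graph.values()]
    max_degree = max(degrees)
    return sum(max_degree - d for d in degrees) / ((n - 1) * (n - 2))


def metrics_by_year(records):
    """Group (year, authors) records by year and compute graph metrics for each year."""
    papers_by_year = defaultdict(list)
    for year, authors in records:
        papers_by_year[year].append(authors)
    series = {}
    for year in sorted(papers_by_year):
        graph = coauthor_graph(papers_by_year[year])
        series[year] = {
            "authors": len(graph),
            "density": density(graph),
            "centralization": degree_centralization(graph),
        }
    return series


# Hypothetical toy data: (publication year, author list) for each article.
records = [
    (2000, ["A", "B"]), (2000, ["A", "C"]), (2000, ["A", "D"]),  # star around author A
    (2001, ["A", "B"]), (2001, ["B", "C"]), (2001, ["C", "A"]),  # evenly connected triangle
]
series = metrics_by_year(records)
for year, values in series.items():
    print(year, values)
```

In this toy data, the first year is a star centered on one author and the second is an evenly connected triangle, so the centralization value separates a collection dominated by a few central authors from one with evenly spread collaboration. That is the kind of contrast the abstract describes as pronounced centrality.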
Copyright (c) 2019 Федор Владимирович Краснов, Михаил Ефремович Шварцман, Александр Владимирович Диментов

This work is licensed under a Creative Commons Attribution 4.0 International License.