Proceedings of the 53rd Hawaii International Conference on System Sciences
Topic modeling is an unsupervised method for discovering semantically coherent combinations of words, called topics, in unstructured text. However, the human interpretability of topics discovered from non-natural language corpora, specifically Windows API call logs, is unknown. Our objective is to explore the coherence of topics and their ability to represent the themes of API calls from malware analysts’ perspective. Three Latent Dirichlet Allocation (LDA) models were fit to a collection of dynamic API call logs. Topics, or behavioral themes, were manually evaluated by malware analysts. The results were compared to existing automated quality measures. Participants were able to accurately determine API calls that did not belong in behavioral themes learned by the 20 topic model. Our results agree with topic coherence measures in terms of highest interpretable topics. The results are not compatible with log-perplexity, which concur with the findings of topic evaluation literature on natural language corpora.
Glendowne, Puntitra and Glendowne, Dae, "Interpretability of API Call Topic Models: An Exploratory Study" (2020). Faculty Publications. 3.