Port Poisson Matrix Factorization Topic Model to Scala

Abstract

ARL:UT, together with UT Austin (ECE and IROM), has developed a Poisson matrix factorization model for documents and related information. The model takes as input a set of documents, the words in each document, and the frequency with which each word appears. Its output is the probability that each topic appears in a document and the probability that each word appears in a topic. These probabilities can be parsed to reveal the topics most represented in the collection of documents and the words associated with each topic. However, the original C++ implementation was poorly written, with few comments and little documentation, and as the project grew in scope it was also limited by little support from the developer community. The goal of this project is therefore to move the model from a procedural C++ implementation to a more maintainable one running on the JVM. Originally, the model ran as a C++ executable that used the GNU Scientific Library (GSL) for probability function calls. Now all of the code has been ported to Scala, commented and properly documented, and optimized for better efficiency, with the build tool Maven managing all library dependencies. Moving forward, goals include applying this approach to other models of interactions and integrating the model with Apache Spark.
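
As an illustration of how the model's output might be parsed, the following is a minimal Scala sketch, not the project's actual code. It assumes the output is available as two row-major matrices, one of topic-by-word weights and one of document-by-topic weights; the names `TopWords`, `topWordsPerTopic`, and `dominantTopic`, as well as the toy vocabulary and values, are hypothetical.

```scala
// Hypothetical sketch: names, matrix layout, and toy values are assumptions
// made for illustration and are not taken from the project.
object TopWords {

  /** For each topic (one row of topicWord), return its k highest-weighted words. */
  def topWordsPerTopic(
      topicWord: Vector[Vector[Double]],
      vocab: Vector[String],
      k: Int
  ): Vector[Vector[String]] =
    topicWord.map { weights =>
      weights
        .zip(vocab)                    // pair each weight with its word
        .sortBy { case (w, _) => -w }  // sort by descending weight
        .take(k)
        .map { case (_, word) => word }
    }

  /** For each document (one non-empty row of docTopic), return the index of
    * its highest-weighted topic. */
  def dominantTopic(docTopic: Vector[Vector[Double]]): Vector[Int] =
    docTopic.map(row => row.indexOf(row.max))

  def main(args: Array[String]): Unit = {
    val vocab = Vector("sensor", "signal", "market", "price")
    val topicWord = Vector(
      Vector(0.50, 0.40, 0.05, 0.05), // topic 0
      Vector(0.05, 0.05, 0.30, 0.60)  // topic 1
    )
    val docTopic = Vector(
      Vector(0.9, 0.1), // document 0
      Vector(0.2, 0.8)  // document 1
    )

    // prints Vector(Vector(sensor, signal), Vector(price, market))
    println(topWordsPerTopic(topicWord, vocab, 2))
    // prints Vector(0, 1)
    println(dominantTopic(docTopic))
  }
}
```

The sketch uses only the Scala standard library, so it does not depend on how the ported model stores or serializes its results.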

First Name
Edward
Last Name
Babbe
Industry
Supervisor
Date
Spring 2017