Snap Machine Learning (Snap ML) is an efficient, scalable library for fast training of generalized linear models. It combines recent advances in machine learning systems and algorithms in a nested manner that reflects the hierarchical architecture of modern distributed systems.
Three main features distinguish Snap ML:
Distributed training: It’s built as a data-parallel framework, enabling you to scale out and train on massive data sets that exceed the memory capacity of a single machine, which is crucial for large-scale applications.
GPU acceleration: It has specialized solvers that are designed to use the massively parallel architecture of GPUs while respecting the data locality in GPU memory to avoid large data transfer overheads.
Sparse data structures: Many machine learning data sets are sparse, so Snap ML includes new optimizations for its algorithms when they are applied to sparse data structures.
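To make the sparse and data-parallel ideas above concrete, here is a minimal pure-Python sketch. It is not Snap ML's solver or API: the helper names (`to_csr`, `sparse_row_dot`, `squared_loss_gradient`) and the toy data are assumptions made for illustration. It stores a matrix in a compressed sparse row (CSR) layout so that only non-zero entries are touched, and then shows that summing partial gradients from two hypothetical workers reproduces the single-machine gradient.

```python
# Illustrative only: a CSR-style sparse matrix and a data-parallel gradient sum.
# This is not Snap ML's solver or API; all names here are hypothetical.

def to_csr(dense_rows):
    """Convert dense rows into CSR arrays: (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense_rows:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def sparse_row_dot(csr, i, w):
    """Dot product of row i with weights w, touching only stored non-zeros."""
    values, col_idx, row_ptr = csr
    return sum(values[k] * w[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))

def squared_loss_gradient(csr, y, w, rows):
    """Gradient of 0.5 * sum_i (x_i . w - y_i)^2 restricted to the given rows."""
    values, col_idx, row_ptr = csr
    grad = [0.0] * len(w)
    for i in rows:
        residual = sparse_row_dot(csr, i, w) - y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            grad[col_idx[k]] += residual * values[k]
    return grad

# Toy data: mostly zeros, as in many real feature matrices.
X = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 4.0, 0.0]]
y = [1.0, 0.0, 2.0, 1.0]
w = [0.5, -1.0, 0.25, 1.0]
csr = to_csr(X)

# Single-machine gradient over all rows.
full_grad = squared_loss_gradient(csr, y, w, range(len(X)))

# Data-parallel version: two hypothetical workers each own a slice of the rows,
# compute a partial gradient locally, and the partial results are summed.
partitions = [range(0, 2), range(2, 4)]
partials = [squared_loss_gradient(csr, y, w, part) for part in partitions]
combined_grad = [sum(g) for g in zip(*partials)]

print(full_grad, combined_grad)
```

In this layout the cost of each row operation scales with the number of stored non-zeros rather than with the number of columns, and because the loss is a sum over examples, each worker only needs its own rows to contribute its share of the gradient.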
Financial sector use cases
We have tested generalized linear models (GLMs) from the Snap ML library on three different use cases that are related to the financial services sector. These use cases are:
Credit card fraud detection
The purpose here is to identify fraudulent credit card transactions. Given certain characteristics of a transaction (for example, the time it was made or its amount), the task is to create a machine learning model that flags the transaction as potentially fraudulent. The data scientist is provided with a data set of about 285,000 transactions, each of which is characterized by 30 features (time of transaction, amount, and so on). Also provided are the labels of these transactions, that is, fraudulent or not. The task is to build a model that predicts fraudulent transactions in an unseen data set, that is, a data set that was not used to train the model and does not have labels.
This is a public data set, and it has been used in a Kaggle data science competition recently.
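The classification task described above can be sketched with a logistic regression model, which is one kind of generalized linear model. The example below is a minimal pure-Python illustration on synthetic, made-up data with two features and balanced classes (real fraud data is typically far more imbalanced). It does not use the Snap ML API, and the function names and hyperparameters are assumptions chosen for this sketch.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_transactions(n, rng):
    """Synthetic, made-up data: ([scaled amount, scaled time of day], label)."""
    rows = []
    for i in range(n):
        is_fraud = i % 2  # balanced classes, purely for illustration
        amount = rng.uniform(0.6, 1.0) if is_fraud else rng.uniform(0.0, 0.4)
        time_of_day = rng.uniform(0.0, 1.0)
        rows.append(([amount, time_of_day], is_fraud))
    return rows

def train_logistic_regression(rows, epochs=500, lr=1.0):
    """Full-batch gradient descent on the mean logistic loss."""
    n_features = len(rows[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, label in rows:
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - label
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * gj / len(rows) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b

def predict(w, b, x):
    """Return 1 (fraud) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

rng = random.Random(0)
train_rows = make_transactions(200, rng)
test_rows = make_transactions(100, rng)  # unseen during training

w, b = train_logistic_regression(train_rows)
accuracy = sum(predict(w, b, x) == label for x, label in test_rows) / len(test_rows)
print(f"held-out accuracy: {accuracy:.2f}")
```

The held-out rows play the role of the unseen data set: their labels are used only to score the predictions, never to fit the weights.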
Stock volatility prediction
The task is to predict the volatility of a company's stock from 10-K textual financial reports. The corpus contains 10-K reports from thousands of publicly traded US companies from 1996 to 2006, together with the measured volatility of stock returns for the 12-month periods preceding and following each report. The textual reports have been converted to vectors of real numbers, with each entry representing how often a word from a large dictionary occurs in the report. We construct regression models of volatility for the period following a report. We apply both linear regression (on the smaller data set) and ridge regression (on the larger data set).
This is a public data set; we used both the smaller version, E2006-tfidf, and the larger one, E2006-log1p.
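The difference between the two regression models mentioned above can be illustrated with a short pure-Python sketch. It minimizes (1/2n) * ||Xw - y||^2 + (l2/2) * ||w||^2 by gradient descent, where l2 = 0 corresponds to linear regression and l2 > 0 to ridge regression. The word-frequency vectors, volatility values, function names, and hyperparameters are all made up for illustration, and this is not how Snap ML itself solves the problem.

```python
def fit_linear_model(X, y, l2=0.0, epochs=2000, lr=1.0):
    """Gradient descent on (1/2n) * ||Xw - y||^2 + (l2/2) * ||w||^2.

    l2 = 0.0 gives ordinary linear regression; l2 > 0 gives ridge regression.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [l2 * wj for wj in w]  # gradient of the penalty term
        for xi, yi in zip(X, y):
            residual = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += residual * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

# Made-up frequencies of three hypothetical dictionary words per report,
# and made-up volatility values for the period following each report.
X = [[0.2, 0.5, 0.1],
     [0.6, 0.1, 0.4],
     [0.1, 0.7, 0.0],
     [0.5, 0.2, 0.5],
     [0.3, 0.3, 0.2]]
y = [0.10, 0.30, 0.05, 0.35, 0.15]

w_ols = fit_linear_model(X, y, l2=0.0)
w_ridge = fit_linear_model(X, y, l2=0.1)

new_report = [0.4, 0.2, 0.3]
print("linear regression prediction:", predict(w_ols, new_report))
print("ridge regression prediction: ", predict(w_ridge, new_report))
```

The penalty term shrinks the ridge weights toward zero relative to the unpenalized fit, which tends to help when the feature vectors are very high-dimensional, as word-frequency vectors over a large dictionary are.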
Credit default prediction
The task in this use case is to predict whether a person who has credit will default (that is, be unable to repay it). The data scientist is provided with a data set of 10 million transactions, each of which is characterized by 18 features (including account age, account type, credit history, owns car, transaction amount, and transaction category). Also provided are the labels of these transactions, that is, default or not. The task is to build a model that predicts whether transactions will default in an unseen data set, that is, a data set that was not used to train the model and does not have labels.
The data set is made available by IBM.
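Several of the features listed above, such as account type and owns car, are categorical or boolean rather than numeric, so they need a numeric encoding before a linear model can use them. The text does not describe the preprocessing that was applied, so the following is only a sketch of one common approach, one-hot encoding into a sparse feature map. The category values, index layout, and function name are hypothetical.

```python
# Hypothetical category vocabulary; the real data set's values are not given in the text.
ACCOUNT_TYPES = ["checking", "savings", "credit"]

def encode_record(record):
    """Map a mixed-type record to a sparse {feature_index: value} dict.

    Index layout (an assumption for this sketch):
      0: account_age, 1: transaction_amount, 2: owns_car,
      3..5: one-hot encoding of account_type
    """
    features = {
        0: float(record["account_age"]),
        1: float(record["transaction_amount"]),
    }
    if record["owns_car"]:
        features[2] = 1.0  # store the flag only when it is set
    # Exactly one of the one-hot positions is non-zero, so only that one is stored.
    features[3 + ACCOUNT_TYPES.index(record["account_type"])] = 1.0
    return features

example = {
    "account_age": 3,
    "transaction_amount": 250.0,
    "owns_car": False,
    "account_type": "savings",
}
encoded = encode_record(example)
print(encoded)
```

One-hot encoding widens the feature vector but leaves most entries at zero, which is the kind of sparsity the sparse data structures described earlier are designed to exploit.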
Here is a description of the results we have obtained from the above use cases.