Distributed Training of Embeddings using Graph Analytics

Published in IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021

Recommended citation: Gurbinder Gill, Roshan Dathathri, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, “Distributed Training of Embeddings using Graph Analytics,” Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021.

(Download publication here)

(Download earlier arXiv version here)

Abstract

Many applications today, such as natural language processing and network and code analysis, rely on semantically embedding objects into low-dimensional, fixed-length vectors. Such embeddings naturally support useful downstream tasks, such as identifying relations among objects and predicting objects for a given context. Unfortunately, training accurate embeddings is usually computationally intensive and requires processing large amounts of data.

This paper presents a distributed training framework for a class of applications that use Skip-gram-like models to generate embeddings. We call this class Any2Vec; it includes Word2Vec (Gensim) and Vertex2Vec (DeepWalk and Node2Vec), among others. We first formulate the Any2Vec training algorithm as a graph application. We then adapt D-Galois, a state-of-the-art distributed graph analytics framework, to support dynamic graph generation and re-partitioning, and we incorporate novel communication optimizations. We show that on a cluster of 32 hosts with 48 cores each, our framework GraphAny2Vec matches the accuracy of the state-of-the-art shared-memory implementations of Word2Vec and Vertex2Vec and gives geometric-mean speedups of 12x and 5x, respectively. Furthermore, GraphAny2Vec is on average 2x faster than DMTK, the state-of-the-art distributed Word2Vec implementation, on 32 hosts, while yielding much better accuracy.
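
To make the graph formulation concrete, the following is a minimal sketch, not taken from the paper, of how Skip-gram training with negative sampling can be viewed as an edge-centric graph computation: words are vertices that carry embedding vectors, (center, context) pairs are edges, and processing an edge updates the embeddings at its endpoints. All names, shapes, and hyperparameters here (`V`, `D`, `NEG`, `LR`, `train_edge`) are illustrative assumptions, not GraphAny2Vec's API.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, NEG, LR = 1000, 64, 5, 0.025  # vocab size, embedding dim, negatives, learning rate

# Vertex data: each word carries an input ("center") and output ("context") embedding.
syn0 = (rng.random((V, D)) - 0.5) / D
syn1 = np.zeros((V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_edge(center, context):
    """Skip-gram-with-negative-sampling update for one (center, context) edge."""
    targets = np.concatenate(([context], rng.integers(0, V, NEG)))
    labels = np.zeros(NEG + 1)
    labels[0] = 1.0                      # 1 for the true context, 0 for negatives
    h = syn0[center]                     # embedding at the center vertex
    scores = sigmoid(syn1[targets] @ h)
    g = (labels - scores) * LR           # per-target gradient scale
    syn0[center] += g @ syn1[targets]    # update the center vertex's embedding
    np.add.at(syn1, targets, np.outer(g, h))  # update context/negative vertices

# Toy edge list of (center, context) pairs; one training pass over the "graph".
edges = [(1, 2), (2, 3), (3, 1)]
for c, w in edges:
    train_edge(c, w)
```

In the distributed setting the abstract describes, this per-edge computation would run over a partitioned graph, with the embedding vectors at shared vertices synchronized across hosts by the framework's communication layer.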