Memory bandwidth and communication bandwidth limit the hardware efficiency of deep learning. Conventional model compression techniques save memory bandwidth, but they rely on hand-crafted features and require domain experts to explore a large design space, a process that is time-consuming and usually yields sub-optimal results. We propose to leverage reinforcement learning to sample the design space efficiently, which greatly improves compression quality. We applied this push-the-button compression pipeline to MobileNet and achieved a 2× reduction in FLOPs, with a speedup of 1.49× on a Titan Xp GPU and 1.65× on a mobile phone (Galaxy S7).
Large-scale distributed training requires significant communication bandwidth, which limits the scalability of multi-node training. We find that 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC), which reduces the communication bandwidth by 270× to 600× without losing accuracy. DGC enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile devices.
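
To make the idea of skipping redundant gradient exchange concrete, the following is a minimal sketch of one common form of gradient sparsification: each worker sends only the largest-magnitude gradient entries and keeps the rest in a local residual that is added to the next step's gradient. The text above does not spell out DGC's mechanism, so the top-k selection rule, the residual accumulation, the function name `sparsify_gradient`, and the parameter `keep_ratio` are all assumptions made for illustration, not the authors' implementation.

```python
import heapq


def sparsify_gradient(grad, residual, keep_ratio=0.001):
    """Pick the largest-magnitude entries to transmit (illustrative sketch).

    grad       -- list of floats: this step's local gradient
    residual   -- list of floats: entries withheld in earlier steps
    keep_ratio -- fraction of entries to transmit (hypothetical parameter)

    Returns (indices, values, new_residual).
    """
    if len(grad) != len(residual):
        raise ValueError("grad and residual must have the same length")

    # Add previously withheld values so small gradients are delayed, not lost.
    acc = [r + g for r, g in zip(residual, grad)]

    # Always transmit at least one entry.
    k = max(1, int(keep_ratio * len(acc)))

    # Indices of the k entries with the largest absolute value.
    indices = sorted(heapq.nlargest(k, range(len(acc)), key=lambda i: abs(acc[i])))
    values = [acc[i] for i in indices]

    # Transmitted entries are cleared locally; everything else stays behind.
    new_residual = list(acc)
    for i in indices:
        new_residual[i] = 0.0
    return indices, values, new_residual


if __name__ == "__main__":
    grad = [0.1, -2.0, 0.3, 0.05, 1.5, -0.2, 0.0, 0.4]
    residual = [0.0] * len(grad)
    idx, vals, residual = sparsify_gradient(grad, residual, keep_ratio=0.25)
    print(idx, vals)
```

In this sketch only the `(indices, values)` pairs would cross the network, so the traffic per step scales with `keep_ratio` rather than with the full model size. Entries that are too small to be sent keep accumulating in the residual until they become large enough to be selected.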