Deep Learning (DL) algorithms are multiply-accumulate (MAC) intensive, and it is common practice to quote the complexity of an algorithm, or the capability of DL hardware, in terms of "Tera MACs per second." This talk explores the many considerations that determine how efficiently the MACs available in hardware can be utilized to implement a given DL algorithm, and the pros and cons of different design approaches for matching the raw MAC throughput of the hardware to the specific requirements of different DL algorithms.
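As a concrete illustration of the utilization question the talk examines, MAC utilization can be expressed as the ratio of the achieved MAC rate to the hardware's peak MAC rate. The sketch below is a hypothetical example; the function name and all numbers are invented for illustration and do not come from the talk.

```python
# Hypothetical illustration: MAC utilization = achieved rate / peak rate.

def mac_utilization(achieved_tmacs_per_s: float, peak_tmacs_per_s: float) -> float:
    """Fraction of the hardware's raw MAC throughput actually used."""
    return achieved_tmacs_per_s / peak_tmacs_per_s

# Assumed example: a workload needing 2 Tera MACs runs in 0.5 s
# on an accelerator with a 10 Tera-MAC/s peak.
total_macs = 2.0e12                        # MACs required (assumed)
runtime_s = 0.5                            # measured runtime (assumed)
achieved = total_macs / runtime_s / 1e12   # 4 Tera MACs/s achieved
print(f"utilization = {mac_utilization(achieved, 10.0):.0%}")  # 40%
```

The gap between peak and achieved throughput in such a calculation is exactly what the design trade-offs discussed in the talk aim to close.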