This paper introduces a novel reconfigurable architecture for real-time execution of CNNs at the edge, next to the video cameras. The proposed architecture is based on direct convolution over streaming pixels. This avoids the large data redundancy and memory accesses imposed by GEMM-based CNN accelerators (such as the Google TPU). The core of the architecture is the Convolutional Processing Engine (CPE), which performs direct convolution as pixels arrive from the camera. Multiple CPE instances are combined into a macro-pipelined datapath of CPEs that exploits the inherent spatial and temporal parallelism of the network.
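To make the contrast between GEMM lowering and direct convolution over a pixel stream concrete, the following is a minimal software sketch under stated assumptions. It does not describe the CPE hardware. It assumes a single-channel image, a square k x k kernel, stride 1, and no padding. It also assumes that the GEMM path materializes an im2col matrix, which is one common lowering scheme and not necessarily what any particular accelerator does. The function names `im2col_size` and `stream_conv2d` are hypothetical and do not come from the paper. The streaming function consumes pixels one at a time in raster order and keeps only the k-1 most recent complete rows in a line buffer. It emits each output as soon as its k x k window is complete.

```python
from collections import deque


def im2col_size(h, w, k, channels=1):
    """Element count of a hypothetical im2col (GEMM input) matrix for an
    h x w image with a k x k kernel, stride 1, no padding."""
    oh, ow = h - k + 1, w - k + 1
    # Every output position gets its own copy of a k*k*channels patch.
    return oh * ow * k * k * channels


def stream_conv2d(pixels, width, kernel):
    """Direct 'valid' convolution (cross-correlation, as used in CNNs) over
    pixels arriving one at a time in raster order.

    Only the k-1 most recent complete rows plus the row currently being
    received are buffered; no im2col matrix is ever built.
    """
    k = len(kernel)
    prev_rows = deque(maxlen=k - 1)  # line buffer of completed rows
    current = []                     # row currently arriving
    out, out_row = [], []
    for p in pixels:
        current.append(p)
        c = len(current) - 1         # column of the pixel that just arrived
        # A window ends at this pixel once k rows and k columns are available.
        if len(prev_rows) == k - 1 and c >= k - 1:
            window_rows = list(prev_rows) + [current]
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += window_rows[i][c - k + 1 + j] * kernel[i][j]
            out_row.append(acc)
        if len(current) == width:    # row finished: rotate the line buffer
            if out_row:
                out.append(out_row)
                out_row = []
            prev_rows.append(current)
            current = []
    return out


if __name__ == "__main__":
    # 4x4 image 0..15 streamed in raster order, 2x2 kernel.
    print(stream_conv2d(list(range(16)), 4, [[1, 0], [0, -1]]))
    # For a 224x224 single-channel image and a 3x3 kernel, im2col holds
    # 443,556 elements versus 50,176 input pixels.
    print(im2col_size(224, 224, 3), 224 * 224)
```

Running the script prints `[[-5, -5, -5], [-5, -5, -5], [-5, -5, -5]]` for the toy stream, followed by `443556 50176`. The buffer in the streaming version grows with (k-1) x width rather than with the number of output positions. This is the kind of saving the direct-convolution approach targets, although the actual buffering scheme inside the CPE may differ.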