Performance improvements for microprocessors have traditionally been achieved by increasing their clock frequency. However, this technique has reached a point where further scaling is impractical.

This thesis describes and evaluates a novel System-on-Chip architecture that focuses on exploiting all forms of concurrency in programs. It does so by defining generic hardware concurrency management extensions in simple multi-threaded cores. These extensions enable low-overhead bulk creation and dynamic distribution of threads, and expose efficient dataflow-like primitives for both inter-thread and intra-thread communication and synchronization.

Additionally, this thesis describes a new cycle-accurate software simulator of the processor architecture, written in C++. The simulator is designed to be a valuable research and education tool by allowing a clean and relatively high-level description of the architecture.