With the turn to multicore in chip design and manufacturing, both consumer and high-performance applications can benefit from ubiquitous hardware parallelism. However, the achievable performance improvement is not always in the orders-of-magnitude range. In this paper, we present the challenging example of designing a parallel version of a model-fitting algorithm used to calibrate telescope observation data in radio astronomy. The complexity of the application, together with the limited opportunities for code modification, bounds the performance gain that any parallel system could achieve. Nevertheless, we show how classical “bound-and-bottleneck” analysis and optimization on multicore architectures help achieve up to a 2.3x “wall clock” speedup over the original sequential implementation. We further discuss the reasons for this limitation and suggest possible solutions to address it.