Distributed TensorFlow - [Intro]

cristi (70) in #deep-learning • 8 years ago

Matthew Rahtz, a Master's student in Neuroscience and Machine Learning in Switzerland, has posted a very detailed introduction to distributed TensorFlow on his website, amid.fish. In Matthew's words:

"Distributed TensorFlow allows us to share parts of a TensorFlow graph between multiple processes, possibly each on a different machine." [source]

One reason to do this is to use the power of more than one machine during training, with the parameters shared between all the machines.

Matthew does not linger on theory and goes straight to an implementation in TensorFlow. The key thing to understand about distributed TensorFlow is that, to share parameters between processes, the execution engines (possibly running on multiple machines) have to be linked together.

Thus:

for each process there will be a TensorFlow server (execution engine)
servers are linked together in clusters
each server in the cluster is known as a task
each task is associated with a job (a collection of related tasks)
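
To make the terminology above concrete, here is a minimal sketch in plain Python (no TensorFlow required) of how a cluster breaks down into jobs and tasks. The job names "ps" and "worker", the localhost addresses, and the helper list_tasks are assumptions made up for illustration; they do not come from Matthew's post. In TensorFlow 1.x, a dictionary of this shape is what gets passed to tf.train.ClusterSpec, and each process then starts its own tf.train.Server with a job name and a task index.

```python
# Hypothetical cluster layout: job name -> list of "host:port" addresses.
# The job names and addresses are illustrative assumptions.
cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}


def list_tasks(cluster):
    """Return (job_name, task_index, address) for every task in the cluster.

    Each address corresponds to one server (one process); its position in
    the job's list is its task index.
    """
    tasks = []
    for job_name in sorted(cluster):
        for task_index, address in enumerate(cluster[job_name]):
            tasks.append((job_name, task_index, address))
    return tasks


for job_name, task_index, address in list_tasks(cluster):
    # Print a device-style name for each task next to its address.
    print(f"/job:{job_name}/task:{task_index} -> {address}")
```

In this sketch there are three servers, so three tasks: one in the "ps" job and two in the "worker" job. Every process in the cluster would be given the same dictionary, plus its own job name and task index so it knows which entry it is.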

In the post, Matthew goes on to explain where variables are placed and how graphs work with distributed TF. He also shares some practical details that need to be taken into account, such as:

what happens when a server leaves the cluster
what happens if it returns to the cluster
who is responsible for variable initialization
and others.

So, if you want to learn more, you can read Matthew's entire post, linked below, or the official documentation for distributed TensorFlow:

Distributed Tensorflow - [Intro]