This is looking nice.
A few points to make sure I am understanding the encapsulation correctly: I do not see the code to combine results from different GPUs (logs and gradients); will that be in run() or in self.app? I do not see the code for moving averages, will that still be in run()?
A few questions: Can you give me a quick example of when multiple samplers are needed?
I see that you have removed the loop that makes multiple different sess.run() calls in an iteration, as well as the code that lets the application define which ops are run in which iterations (i.e. train_op_generator). How do you envisage alternately training the generator and then the discriminator? We should discuss the need for flexibility vs. complexity here.
The VAE (and probably others) has several different types of inference. Is the plan to generate all the ops for all the inference types in inference_ops()? Or should the app be looking at param to see which samplers and inference_ops to create?
@eli.gibson I think moving averages should be handled via `tf.get_collection` operations; these could be well decoupled. Averaging gradients from multiple GPUs would be tricky. I'm still trying different designs, but I believe the application doesn't need to know how/when to average the gradients.
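For reference, the usual multi-tower pattern averages the per-GPU `(gradient, variable)` pairs once in the driver, before a single apply step. A minimal sketch, with plain numbers standing in for tensors (all names hypothetical):

```python
def average_gradients(tower_grads):
    # tower_grads: one list per GPU of (grad, var) pairs, as returned by
    # optimizer.compute_gradients() on each tower
    averaged = []
    for pairs in zip(*tower_grads):  # the same variable across all towers
        grads = [g for g, _ in pairs]
        var = pairs[0][1]
        # with real tensors this would be tf.reduce_mean(tf.stack(grads), 0)
        averaged.append((sum(grads) / len(grads), var))
    return averaged
```

Only the driver would call this, which is consistent with the application not needing to know how or when the gradients get averaged.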
multiple samplers: for domain adaptation problems? With two samplers, one for the source domain and another for the target domain. This is based on pure imagination, I don't have any solid example at hand... Also, we may want another non-shuffled queue for evaluation purposes during the training process (see also https://cmiclab.cs.ucl.ac.uk/CMIC/NiftyNet/issues/108).
multiple sess.run(): For training GANs, `model_ops` could be a list of ops; for example, `[apply_lossD_grads, apply_lossD_grads, apply_lossD_grads, apply_lossG_grads]` means updating the discriminator three times, then updating the generator once.
Another option would be passing the `sess` variable as a parameter to `self.app`, and letting `self.app` decide what to run. The downside is that we break the design of the "controller" idea; `self.app` taking care of `sess` can be very ugly :D
inference_ops(): This function would take some parameters and let `self.app` take care of the details, so it should be fine.
To solve both the multiple `sess.run` and `inference_ops` concerns, how about having an app interface `model_op_at(iteration_i=0, type=a_string_for_training_or_inference)`, and then in the driver: `model_output = sess.run(self.app.model_op_at(iter_i, ...))`. I think this could be a good way to go.
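A minimal sketch of what that interface could look like (all names hypothetical; strings stand in for the real TensorFlow ops):

```python
class GANApplication:
    """Sketch of the proposed model_op_at() interface."""

    def __init__(self):
        # placeholders for the real apply-gradient / inference ops
        self.apply_lossD_grads = 'apply_lossD_grads'
        self.apply_lossG_grads = 'apply_lossG_grads'
        self.net_inference = 'net_inference'

    def model_op_at(self, iteration_i=0, type='training'):
        if type == 'inference':
            return self.net_inference
        # e.g. three discriminator updates, then one generator update
        if iteration_i % 4 == 3:
            return self.apply_lossG_grads
        return self.apply_lossD_grads
```

The driver then stays a plain controller: it only calls `sess.run(self.app.model_op_at(iter_i, 'training'))` and never needs app-specific logic.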
By the way, I'm looking at `tf.logging.*` (https://www.tensorflow.org/api_docs/python/tf/logging); maybe we could use this.
@wenqili re:gradient averaging: The current training.py handles gradient averaging in a way that seems to work. Is there a problem with that implementation you are trying to solve?
re:app.inference_ops(param.inference_type): I think the app can get this from the parameters at construction time, so that should be fine.
re:model_ops: I am not sure you want to have multiple gradient updates in the same sess.run. First, I think this will result in only one sample being taken, rather than one for each train op. Second, if the variables being updated are not mutually exclusive, I think there will be a race condition and the variables will not be updated correctly. If you are doing one discriminator and one generator update, they would be mutually exclusive for some (but not all) GANs, but if you have multiple discriminator updates they will not. (This may be solved by the model_op_at interface anyways)
re: multiple samplers: I think the multiple samplers could make the GAN a little more elegant for some designs (one sampler for real data and one for noise samples), and it could make the VAE more elegant for some designs (one sampler for inference types that need an image, and one sampler for inference types that need a low-dim. representation). My gut feeling is that there should be 1 net_model.model() function that accepts all the samples, rather than a bunch of separate functions, but we should discuss this in more detail.
re:model_op_at: The model_op_at() interface is sort of equivalent to my train_op_generator interface. Both are simple for the default case, but I suspect we won't know which is easier for the complex cases until we have many more applications developed. It also potentially solves the 'multiple gradient updates in the same sess.run' issue if you return the right type of op at the right iteration (i.e. return `apply_lossG_grads` if `iter % 4 == 3` else `apply_lossD_grads`).
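For comparison, the same three-discriminator/one-generator schedule expressed as a generator, including the itertools-based skip for resuming from a saved iteration K (names hypothetical; strings stand in for ops):

```python
import itertools

def train_op_generator(num_iterations):
    # three discriminator updates, then one generator update, repeated
    schedule = ['apply_lossD_grads', 'apply_lossD_grads',
                'apply_lossD_grads', 'apply_lossG_grads']
    for i in range(num_iterations):
        yield i, schedule[i % len(schedule)]

# resuming from iteration K: skip the first K yielded items
K = 10
resumed = itertools.islice(train_op_generator(1000), K, None)
```

Either interface can express this; the difference is mainly whether the iteration logic lives in a method call or a generator.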
Some part of me thinks there is another object we are missing, like 'TrainingSchedule', that handles that sort of logic. It shouldn't really be necessary to subclass GANApplication just to change the order in which the G and D are trained. That said, I don't think we should separate that out until we know it is necessary.
re: log_writer.write_with(log_ops): I know this snippet is just the general framework, but just to flag for future reference, the logging will need to pull values out of the sess.run. Some other future features might also need to do this, so having the code that builds up ops_to_run and then breaks it back down into parts needed by inference, logging, others will still be necessary. If you are concerned about the complexity of that part of the code, we can think about ways to achieve that outcome more elegantly.
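A sketch of that assemble-then-split step, using a stand-in for `sess.run` (which does accept a dict of fetches); all names here are hypothetical:

```python
def run_and_route(sess_run, ops_by_consumer):
    # ops_by_consumer: consumer name -> the op(s) that consumer needs,
    # e.g. {'trainer': train_op, 'logger': log_ops, 'inference': net_out}
    results = sess_run(ops_by_consumer)  # one combined sess.run call
    # break the combined results back down, one bundle per consumer
    return {name: results[name] for name in ops_by_consumer}

# stand-in for tf.Session.run called with a dict of fetches
fake_sess_run = lambda ops: {k: ('value_of', v) for k, v in ops.items()}
outputs = run_and_route(fake_sess_run, {'train': 'train_op', 'log': 'log_ops'})
```

Keying the fetches by consumer keeps the build-up and break-down symmetric, which may address the complexity concern without losing the single-`sess.run` structure.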
re: tf.logging: I think it would be a good idea to tie into tf.logging for our console output, warnings and errors. Most of engine.logging is about logging summaries to tensorboard, but there would be some overlap.
Sure, let me get my hands dirty first and then have more discussions with you :D
But now I understand `train_op_generator`! It was the itertools that confused me...
Refactoring that, we could have something like:

```python
def train_op_generator(self, num_iterations):
    i = 0
    while num_iterations > 0:
        num_iterations -= 1
        train_op = ...  # choose the op for this iteration
        i = i + 1
        yield i, train_op

# in the driver:
for (current_iter, model_op) in self.app.train_op_generator(1000):
    sess.run(model_op)
```
The only thing I was using itertools for was making sure it ignored the first K iterations when you were starting from iteration K after resuming from a saved training run.
I see, that could be solved by `num_iterations`; anyway, I'll stick to the op_generator idea.