Potential Extensions
The Compositional SynerGAN approach extends well beyond the vision domain that we have focused on above, and potentially encompasses a wide variety of perception, action and reinforcement tasks.
Recent literature has explored connecting deep neural networks for pattern recognition in multiple modalities to a common neural network with shared weights, in order to learn representations invariant to modality (Kaiser2017, Aytar2017). By training latent variable models for different modalities (convolutions for visual data, causal convolutions for auditory data, etc.) on their respective sensory input streams, and finding the best-fitting graph structures for each, we aim to discover relations between the modalities using the pattern miner and PLN. This would expose cross-modal relationships more transparently than these black-box neural network models can.
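The shared-weight idea above can be sketched in a few lines. This is a toy illustration, not the architecture from the cited papers: plain linear maps stand in for the modality-specific encoders (a CNN for vision, a causal convolution stack for audio), and all weight shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality-specific encoders projecting into a shared latent space.
# In practice these would be a CNN (vision) and a causal-convolution net (audio);
# plain linear maps stand in for them here.
W_vision = rng.normal(size=(8, 16))   # 16-dim visual input -> 8-dim latent
W_audio = rng.normal(size=(8, 12))    # 12-dim audio input  -> 8-dim latent

# Shared-weight head operating on the common latent space,
# regardless of which modality produced the code.
W_shared = rng.normal(size=(4, 8))

def encode(x, W_modality):
    z = np.tanh(W_modality @ x)       # modality-specific latent code
    return np.tanh(W_shared @ z)      # modality-invariant representation

r_v = encode(rng.normal(size=16), W_vision)
r_a = encode(rng.normal(size=12), W_audio)

# Both modalities land in the same 4-dim shared space, so symbolic tools
# (the pattern miner, PLN) can compare their representations directly.
assert r_v.shape == r_a.shape == (4,)
```

The point of the shared head is that downstream symbolic analysis only ever sees one representation space, whichever sense produced the input.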
A natural initial testing ground for these ideas would be combined visual and auditory perception. We have been exploring datasets that pair visual and auditory stimuli displaying various emotional expressions in the human face and voice. Training neural nets to recognize emotion in face and voice, and exploring the interactions between the latent variables of the corresponding nets, would be quite interesting. For instance, a visual variable corresponding to how widely the mouth opens might have a dependency on an auditory variable corresponding to volume.
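The kind of cross-modal dependency described above could be surfaced by something as simple as correlating latent codes across paired samples. The sketch below uses synthetic latents (the names `mouth_open` and `volume`, and the dependency strength, are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical latent variables extracted from two trained nets on paired
# audio-visual samples: mouth_open is a visual latent; volume is an auditory
# latent that (by assumption here) partially depends on it, plus noise.
mouth_open = rng.uniform(0.0, 1.0, size=n)
volume = 0.8 * mouth_open + 0.2 * rng.normal(size=n)

# A pattern miner could surface this dependency as a high correlation
# between the two latent codes.
r = np.corrcoef(mouth_open, volume)[0, 1]
assert r > 0.5  # a strong cross-modal dependency
```

In the real setting the dependency structure would be mined rather than planted, and could be nonlinear, but the workflow is the same: pair the latents across modalities and look for statistical relationships.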
We can apply these methods to generative models for action as well: instead of storing the results of inference purely for perception, we can modify and reason on the latent code to generate new action sequences. For example, a generative network for speech could produce different characteristics depending on the latent code it receives. Complex feedback loops between action and perception then become possible.
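The latent-editing step can be illustrated with a toy generative model. Everything here is hypothetical: a fixed random linear decoder stands in for a trained action-sequence generator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy generative model for an action sequence: a fixed decoder maps a
# latent code to a short trajectory. Shapes are illustrative only.
W_decode = rng.normal(size=(10, 4))   # 4-dim latent -> 10-step trajectory

def generate(z):
    return np.tanh(W_decode @ z)

z = rng.normal(size=4)
base = generate(z)

# Instead of only inferring z from perception, reason on it and perturb
# one coordinate to deliberately produce a different action sequence.
z_modified = z.copy()
z_modified[0] += 2.0
variant = generate(z_modified)

assert base.shape == variant.shape == (10,)
assert not np.allclose(base, variant)  # editing the latent changes the action
```

In the intended framework the perturbation would come from symbolic reasoning over the latent code rather than a hand-picked offset.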
Some of our recent work has involved training deep neural net models to recognize patterns in facial expressions, and then generate new facial expressions from these models. The models we are using in this work do not have probabilistic latent variables, but this would be a natural extension. Adjusting the latent variables would then control the parameters of the facial expressions generated. In this case one is driving action via imitation of perceived actions in a fairly direct way.
To apply a similar approach to robot walking, for example, one could train a (generative) neural network to output the desired next state of the sensors on a robot body, given the previous state of the body and variables representing the environmental state. If many of these sensors are kinesthetic, then a prediction of their next state naturally implies what movements to make to realize the desired future sensor values. In other words, predictive modeling is used to predict "where the parts of your body should be next," given where they are now, the goals, and the situation. If the difference between where the body parts are and where they should be next is not too large, then calculating the movements required to achieve the desired new body state is straightforward.
The job of the discriminative network here, in a GAN-type framework, would be to distinguish correct predictions of the next sensor state from incorrect ones.
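The predict-then-move loop described above can be sketched as follows. This is a minimal illustration under assumed shapes: a small linear map stands in for the trained predictive network, and the sensor and environment dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a trained predictive network: outputs the desired next state
# of the body's kinesthetic sensors given the current body state and an
# encoding of the environment/goal. Small weights keep the step modest.
W_body = 0.1 * rng.normal(size=(6, 6))
W_env = 0.1 * rng.normal(size=(6, 3))

def predict_next_sensors(body_state, env_state):
    return body_state + W_body @ body_state + W_env @ env_state

body_now = rng.normal(size=6)   # current kinesthetic sensor readings
env_now = rng.normal(size=3)    # goal / situation variables

body_desired = predict_next_sensors(body_now, env_now)

# If the predicted change is small, the required movement is simply the
# difference between where the body parts are and where they should be.
movement = body_desired - body_now
assert np.allclose(body_now + movement, body_desired)
```

A discriminative network in the GAN setup would then score candidate `body_desired` vectors, rejecting physically implausible next states.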
Reinforcement learning for complex behaviors based on complex perceptions may emerge naturally from the compositional aspect of Compositional SynerGAN. For instance, consider an animal that gets reward from obtaining food. Suppose its food occurs high up in trees. Then it should get some subsidiary reward from climbing a tree (because this generally brings it near food), some subsidiary reward from seeing a tree, and so on. This would happen in the proposed framework because the network for obtaining food, which would predict the "next move" an animal should make in pursuit of food, would depend (among other things) on context variables that tell whether the animal is high up in a tree or not. These context variables would be the output of an "in a tree or not" discriminative network learned for perception. The latent variables of the "in a tree or not" network might have dependencies with the latent variables of the "next move in a tree" network. In general, multiple action and perception networks will be compositionally interwoven in pursuit of a real-world reward, and the dependencies between their latent variables may become quite complex. Symbolic probabilistic reasoning should be very helpful for sorting out these dependencies.
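The subsidiary-reward mechanism can be made concrete with a small sketch. All function names, thresholds, and reward weights below are hypothetical, chosen only to illustrate how a perception network's context variable feeds a value estimate.

```python
# Stand-in for the learned "in a tree or not" discriminative network:
# maps a raw observation (height) to a context variable.
def in_tree_detector(height_m):
    return 1.0 if height_m > 2.0 else 0.0

# The food-seeking network's value estimate depends on that context
# variable: food occurs high in trees, so being in a tree carries
# subsidiary reward even before any food is visible.
def food_value(sees_food, in_tree):
    return 1.0 * sees_food + 0.3 * in_tree

on_ground = food_value(sees_food=0.0, in_tree=in_tree_detector(0.0))
up_a_tree = food_value(sees_food=0.0, in_tree=in_tree_detector(5.0))

assert up_a_tree > on_ground  # climbing is rewarded as a step toward food
```

In the full framework the detector and the value function would each be learned networks, and their latent variables, not just their scalar outputs, could carry the dependencies that PLN would reason over.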
A first step in this direction could be implementing a set of SynerGAN networks to control a small ground-based robot (say a quadruped or hexapod) that seeks "simulated food" in the world based on a combination of visual and auditory cues.