Currently the avatar does this based on the text: we map the text of the incoming audio to one of our emotion codes, which biases the generation toward that emotion. It's not foolproof, but we've found it works pretty well in practice.
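
To make the idea concrete, here is a minimal sketch of that text-to-emotion biasing step. Everything in it is hypothetical (the `EmotionCode` values, the keyword-based `classify_emotion` helper, and the `generate_avatar_response` wrapper are stand-ins for illustration, not the actual system):

```python
from enum import Enum


class EmotionCode(Enum):
    # Hypothetical set of emotion codes the generator can be biased toward.
    NEUTRAL = "neutral"
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"


# Naive keyword lookup standing in for whatever classifier maps
# the text of the incoming audio to an emotion code.
KEYWORDS = {
    "great": EmotionCode.HAPPY,
    "thanks": EmotionCode.HAPPY,
    "sorry": EmotionCode.SAD,
    "unacceptable": EmotionCode.ANGRY,
}


def classify_emotion(text: str) -> EmotionCode:
    """Map the incoming text to one of the emotion codes."""
    lowered = text.lower()
    for word, code in KEYWORDS.items():
        if word in lowered:
            return code
    return EmotionCode.NEUTRAL


def generate_avatar_response(incoming_text: str) -> dict:
    """Attach the detected emotion as a conditioning signal for generation."""
    emotion = classify_emotion(incoming_text)
    # The emotion code is passed alongside the text so the generation
    # model can be biased toward that emotion; the model itself is
    # out of scope for this sketch.
    return {"emotion_bias": emotion.value, "text": incoming_text}


if __name__ == "__main__":
    print(generate_avatar_response("Thanks, that was great!"))
```

In practice the keyword lookup would be replaced by a learned classifier, but the flow is the same: classify the text, then feed the resulting code to the generator as a bias rather than a hard constraint.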