Whichever motion generation technique is used, there must be a way of triggering the desired activity in the avatar. The motion may be specified by direct sensor tracking (where each joint is driven by a corresponding sensor input), by end-effector tracking (where inverse kinematics or other behaviors generate the ``missing'' joint data), or by external invocation via menu, speech, or button selection of actions (whether then synthesized or interpreted from pre-stored data). The interesting observation is that the only mechanism available to an ``unencumbered'' participant is actually speech! Any other avatar control mechanism requires either a hands-on device (mouse, keyboard, or glove input) or else external sensors and a limited field of movement. While there is considerable progress in using computer vision techniques to capture human motion [1,15,12,23], both user mobility and movement generality are still in the future. Our intention is not to promote speech input per se, but to use this observation to promote (in Section 3) a language-centered view of action ``triggering'' augmented and elaborated by lower-level motion synthesis or playback. (For example, this technique is used to great advantage in virtual environment applications such as the immersive interface to MediSim and in the responsive characters in Improv [33,34].) Although textual instructions can describe and trigger actions, details need not be explicitly communicated. Thus the agent/avatar architecture must include semantic interpretation of instructions and even a lower reactive level within the movement generators that allows motion generality and environmental context-sensitivity.
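As an illustration only, the following sketch (in Python, with hypothetical names throughout; it is not drawn from MediSim, Improv, or any cited system) shows one way such a layered architecture could be organized: a semantic interpreter maps a textual instruction to a parameterized action, while a reactive level inside the motion generator fills in unstated details from environmental context.

\begin{verbatim}
# Hypothetical sketch of a language-centered triggering architecture:
# instruction -> semantic interpretation -> reactive motion generation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    """A triggered action: a verb plus whatever parameters were stated."""
    verb: str
    target: Optional[str] = None

def interpret(instruction: str) -> Action:
    """Semantic interpretation: map a textual instruction to an action.
    Unstated details (grasp site, approach path, gait) are deliberately
    left to the lower levels rather than communicated explicitly."""
    words = instruction.lower().split()
    return Action(verb=words[0], target=words[-1] if len(words) > 1 else None)

class MotionGenerator:
    """Lower level: synthesizes or plays back motion for an action,
    with a reactive layer that consults environmental context."""

    def __init__(self, environment: dict):
        self.environment = environment  # e.g. obstacle flags, positions

    def execute(self, action: Action) -> list:
        frames = ["%s toward %s" % (action.verb, action.target)]
        # Reactive level: adapt the motion when context demands it,
        # without requiring the instruction to say so.
        if self.environment.get("obstacle_near", False):
            frames.insert(0, "step around obstacle")
        return frames

if __name__ == "__main__":
    generator = MotionGenerator({"obstacle_near": True})
    action = interpret("Walk to the table")
    for frame in generator.execute(action):
        print(frame)
\end{verbatim}

The point of the sketch is the division of labor: the instruction carries only the intent (``walk'' and ``table''), the interpreter produces a parameterized action, and the reactive layer supplies context-sensitive detail at execution time.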