Time Space Transformer (TimeSformer)

Instead of applying joint space-time attention, which is time-consuming, a divided space and time attention is proposed in here. In this approach, each patch in an image is first used to compute temporal attention, with all patches having the same spatial index. The resulting encoding is then used for computing spatial attention with patches having the same temporal index. This approach surpasses the usage of spatial attention only or joint space-time attention in terms of accuracy. In addition, the inference cost of this approach is minimal compared to other well-known approaches relying on 3D convolution, such as SlowFast.

Example

from dronevis.models import ActionRecognizer

model = ActionRecognizer()
model.load_model("facebook")
model.detect_webcam()