Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

Ziyun Wang1, Jinyuan Guo1, Kostas Daniilidis1, 2

Abstract

Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well-suited for processing fast motions that require rapid reactions. Event cameras have recently shown competitive performance in unsupervised optical flow estimation, but performance in detecting independently moving objects (IMOs) is lagging behind, even though the low latency and HDR properties of event cameras make them well suited to this task. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. Biological vision systems, however, have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event-based framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of objects that need not be specified in advance, and it scales easily to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.

Pipeline

Here is the pipeline of Un-EvMoSeg. Left Dotted Box: we train a network to directly predict IMO masks from events. Rest of Figure: we use a geometric self-labeling method to generate binary IMO pseudo-labels that supervise the IMO segmentation network. Our framework takes as input depth and optical flow from an off-the-shelf network (fine-tuned on image-based flow). Camera motion is fitted to the flow and depth with RANSAC and is then used to compute the rigid flow induced by the camera motion alone. Pseudo-labels are generated by adaptive thresholding on the magnitude of the estimated IMO motion field.
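The self-labeling step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard instantaneous motion-field model in normalized image coordinates, and it uses a median-plus-MAD rule as a stand-in for the adaptive thresholding, whose exact form is not given here. All function names and parameters (`inlier_tol`, `k`, `min_thresh`) are hypothetical.

```python
import numpy as np

def motion_field_matrix(x, y, inv_depth):
    """Stack the per-pixel 2x6 motion-field Jacobians into a (2N, 6) matrix.

    Columns correspond to camera velocity (vx, vy, vz, wx, wy, wz).
    x, y are normalized image coordinates, inv_depth is 1 / Z.
    """
    zeros = np.zeros_like(x)
    row_u = np.stack([-inv_depth, zeros, x * inv_depth,
                      x * y, -(1.0 + x**2), y], axis=1)
    row_v = np.stack([zeros, -inv_depth, y * inv_depth,
                      1.0 + y**2, -x * y, -x], axis=1)
    return np.concatenate([row_u, row_v], axis=0)

def fit_camera_motion(x, y, inv_depth, flow):
    """Least-squares camera motion from flow (N, 2) and inverse depth."""
    A = motion_field_matrix(x, y, inv_depth)
    b = np.concatenate([flow[:, 0], flow[:, 1]])
    motion, *_ = np.linalg.lstsq(A, b, rcond=None)
    return motion

def rigid_flow(x, y, inv_depth, motion):
    """Flow that the camera motion alone would induce at each pixel."""
    pred = motion_field_matrix(x, y, inv_depth) @ motion
    n = x.shape[0]
    return np.stack([pred[:n], pred[n:]], axis=1)

def ransac_camera_motion(x, y, inv_depth, flow, iters=200,
                         sample_size=6, inlier_tol=0.01, seed=0):
    """RANSAC over random pixel subsets, then a refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    best = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=sample_size, replace=False)
        m = fit_camera_motion(x[idx], y[idx], inv_depth[idx], flow[idx])
        err = np.linalg.norm(flow - rigid_flow(x, y, inv_depth, m), axis=1)
        inliers = err < inlier_tol
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < sample_size:  # fall back to all pixels if RANSAC failed
        best = np.ones(n, dtype=bool)
    return fit_camera_motion(x[best], y[best], inv_depth[best], flow[best]), best

def imo_pseudo_labels(x, y, inv_depth, flow, k=3.0, min_thresh=1e-3):
    """Binary labels: pixels whose residual flow magnitude is unusually large."""
    motion, _ = ransac_camera_motion(x, y, inv_depth, flow)
    residual = np.linalg.norm(flow - rigid_flow(x, y, inv_depth, motion), axis=1)
    med = np.median(residual)
    mad = np.median(np.abs(residual - med))
    # Assumed adaptive rule: median + k * robust std, with a small floor.
    thresh = max(med + k * 1.4826 * mad, min_thresh)
    return residual > thresh, motion
```

Because the motion field is linear in the six camera velocity parameters once depth is known, each RANSAC hypothesis reduces to a small least-squares problem. The residual between the input flow and the rigid flow plays the role of the IMO motion field that is thresholded.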

Qualitative Results

Qualitatively, our results are very similar in quality to those of supervised CNN methods, largely outperform optimization-based methods, and even outperform supervised SNNs. SpikeMS tends to sparsify the events and keep edges. EMSGC needs extensive tuning to get reasonable results, and even then it misclassifies IMOs as rigid areas. With the noisy predictions that SpikeMS and EMSGC scatter across the image, IMOs cannot be easily detected and handled, while our network produces spatially consistent segmentations.

Motion Segmentation Results in Action

We compute the motion segmentation results on the Wall test sequence in the EV-IMO dataset. Red events belong to the segmented IMOs and blue events are background events. Running inference with Un-EvMoSeg is simple and requires no parameter tuning: while the training process requires geometry-based labels, only events are used for prediction. We take the best of both worlds of deep learning and optimization: 1) simple and robust inference with a single feed-forward pass, and 2) scalability, since no expensive annotations are required to train the network.
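The red and blue coloring described above can be reproduced with a short sketch, assuming integer event pixel coordinates and a binary IMO mask of the same resolution as the sensor. The function name `colorize_events` and the white background are illustrative choices, not part of the released code.

```python
import numpy as np

def colorize_events(ev_x, ev_y, imo_mask):
    """Render events as an RGB image: red for IMO events, blue for background.

    ev_x, ev_y: integer pixel coordinates of the events, shape (N,).
    imo_mask:   boolean array of shape (H, W), True where an IMO is predicted.
    """
    h, w = imo_mask.shape
    canvas = np.full((h, w, 3), 255, dtype=np.uint8)  # white where no events fired
    is_imo = imo_mask[ev_y, ev_x]                     # look up the mask per event
    canvas[ev_y[~is_imo], ev_x[~is_imo]] = (0, 0, 255)
    canvas[ev_y[is_imo], ev_x[is_imo]] = (255, 0, 0)
    return canvas
```

Only the events and the predicted mask are needed here, which mirrors the fact that prediction itself uses events alone.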