For a full list of all papers click here

Tuesday, 29 October 2019

A camera that CNNs: Towards Embedded Neural Networks on Pixel Processor Arrays, (ICCV 2019 Oral)

We present a convolutional neural network implementation for pixel processor array (PPA) sensors. PPA hardware consists of a fine-grained array of general-purpose processing elements, each capable of light capture, data storage, program execution, and communication with neighboring elements. This allows images to be stored and manipulated directly at the point of light capture, rather than having to transfer images to external processing hardware. Our CNN approach divides this array up into 4x4 blocks of processing elements, essentially trading-off image resolution for increased local memory capacity per 4x4 "pixel". We implement parallel operations for image addition, subtraction and bit-shifting images in this 4x4 block format. Using these components we formulate how to perform ternary weight convolutions upon these images, compactly store results of such convolutions, perform max-pooling, and transfer the resulting sub-sampled data to an attached micro-controller. We train ternary weight filter CNNs for digit recognition and a simple tracking task, and demonstrate inference of these networks upon the SCAMP5 PPA system. This work represents a first step towards embedding neural network processing capability directly onto the focal plane of a sensor.

A Camera That CNNs: Towards Embedded Neural Networks on Pixel Processor Arrays, Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek and Walterio Mayol-Cuevas. International Conference on Computer Vision ICCV 2019. Seoul, Korea, 2019. Arxiv [PDF].

Sunday, 29 September 2019

Rebellion and Obedience: The Effects of Intention Prediction in Cooperative Handheld Robots (IROS 2019)

Within this work, we explore intention inference for user actions in the context of a handheld robot
setup. Handheld robots share the shape and properties of handheld tools while being able to process task information and aid manipulation. Here, we propose an intention prediction model to enhance cooperative task solving. The model derives intention from the user’s gaze pattern which is captured using a robot-mounted remote eye tracker. The proposed model yields real-time capabilities and reliable accuracy up to 1.5 s prior to predicted actions being executed. We assess the model in an assisted pick and place task and show how the robot’s intention obedience or rebellion affects the cooperation with the robot.

Janis Stolzenwald and Walterio Mayol-Cuevas, "Rebellion and Obedience: The Effects of Intention Prediction in Cooperative Handheld Robots", to be published at 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, 2019. [PDF]

Thursday, 30 May 2019

The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos CVPR 2019

We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Previous work formulates skill determination for common tasks as a ranking problem, yet measures skill from randomly sampled video segments. We believe this approach to be limiting since many parts of the video are irrelevant to assessing skill, and there may be variability in the skill exhibited throughout a video. Assessing skill from a single section may not reflect the overall skill in the video. We propose to train rank-specific temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skills. We evaluate the approach on the public EPIC-Skills dataset and additionally collect and annotate a larger dataset for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model's ability to attend to rank-aware parts of the video.

Wednesday, 19 September 2018

Who's Better, Who's Best: Skill Determination in Video using Deep Ranking. (CVPR 2018 Spotlight)

Who's Better, Who's Best: Skill Determination in Video using Deep Ranking. CVPR PDF | arxiv
CVPR Spotlight talk

Published at CVPR2018.

Here we explore a new problem for egocentric perception. The study of how to determine who is better at a task in an unsupervised manner. We use Deep methods to rank videos in both medical tasks -for which there is expert scoring for validation- and also for the more complex daily living tasks -for which the scoring is ambiguous-.

Saturday, 1 September 2018

I Can See Your Aim: Estimating User Attention From Gaze For Handheld Robot Collaboration (IROS2018)

Extended version of our demo video for our IEEE/RSJ IROS 2018 submission [Arxiv]
Janis StolzenwaldWalterio W. Mayol-Cuevas. "I Can See Your Aim: Estimating User Attention From Gaze For Handheld Robot Collaboration". IROS 2018.

Also see for extended information on this project line.

Tuesday, 1 May 2018

Where can i do this? Geometric Affordances from a Single Example with the Interaction Tensor (ICRA2018)

Where can i do this? Geometric Affordances from a Single Example with the Interaction Tensor. ICRA .

This paper introduces and evaluates a new tensor field representation to express the geometric affordance of one object over another. Evaluations also include crowdsourcing comparisons that confirm the validity of our affordance proposals, which agree on average 84% of the time with human judgments, which is 20-40% better than the baseline methods.

Tuesday, 24 October 2017

Visual Odometry for Pixel Processor Arrays (ICCV2017 Spotlight)

"Visual Odometry for Pixel Processor Arrays [PDF]", Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek and Walterio Mayol-Cuevas. International Conference on Computer Vision (ICCV), Venice, Italy, 2017.
ICCV2017 Spotlight Talk

We present an approach of estimating constrained egomotion on a Pixel Processor Array (PPA). These devices embed processing and data storage capability into the pixels of the image sensor, allowing for fast and low power parallel computation directly on the image plane. Rather than the standard visual pipeline whereby whole images are transferred to an external general processing unit, our approach performs all computation upon the PPA itself, with the camera’s estimated motion as the only information output. Our approach estimates 3D rotation and a 1D scale less estimate of translation. We introduce methods of image scaling, rotation and alignment which are performed solely upon the PPA itself and form the basis for conducting motion estimation. We demonstrate the algorithms on a SCAMP-5 vision chip, achieving frame rates higher than 1000Hz and at about 2W power consumption. project website

Friday, 20 October 2017

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video (ICCV2017)

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video. D Moltisanti, M Wray, W Mayol-Cuevas, D Damen. International Conference on Computer Vision (ICCV), 2017. pdf

Monday, 25 September 2017

Visual target tracking with a Pixel Processor Array (IROS2017)

"Tracking control of a UAV with a parallel visual processor", C. Greatwood, L. Bose, T. Richardson, W. Mayol-Cuevas, J. Chen, S.J. Carey and P. Dudek, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017 [PDF].

This paper presents a vision-based control strategy for tracking a ground target using a novel vision sensor featuring a processor for each pixel element. This enables computer vision tasks to be carried out directly on the focal plane in a highly efficient manner rather than using a separate general purpose computer. The strategy enables a small, agile quadrotor Unmanned Air Vehicle (UAV) to track the target from close range using minimal computational effort and with low power consumption. The tracking algorithm exploits the parallel nature of the visual sensor, enabling high rate image processing ahead of any communication bottleneck with the UAV controller. With the vision chip carrying out the most intense visual information processing, it is computationally trivial to compute all of the controls for tracking onboard. This work is directed toward visual agile robots that are power efficient and that ferry only useful data around the information and control pathways.
Visit the project website at:

Wednesday, 8 March 2017

Automated capture and delivery of assistive task guidance with an eyewear computer: The GlaciAR system (Augmented Human 2017)

An approach that allows both automatic capture and delivery of mixed reality guidance fully onboard Google Glass.

From the paper:

Teesid Leelasawassuk, Dima Damen, Walterio Mayol-Cuevas, Automated capture and delivery of assistive task guidance with an eyewear computer: The GlaciAR system. Augmented Human 2017.

In this paper we describe and evaluate an assistive mixed reality system that aims to augment users in tasks by combining automated and unsupervised information collection with minimally invasive video guides. The result is a fully self-contained system that we call GlaciAR (Glass-enabled Contextual Interactions for Augmented Reality). It operates by extracting contextual interactions from observing users performing actions. GlaciAR is able to i) automatically determine moments of relevance based on a head motion attention model, ii) automatically produce video guidance information, iii) trigger these guides based on an object detection method, iv) learn without supervision from observing multiple users and v) operate fully on-board a current eyewear computer (Google Glass). We describe the components of GlaciAR together with user evaluations on three tasks. We see this work as a first step toward scaling up the notoriously difficult authoring problem in guidance systems and an exploration of enhancing user natural abilities via minimally invasive visual cues.

Wednesday, 4 November 2015

Investigating Spatial Guidance for a Cooperative Handheld Robot

How to provide feedback information to guide users of a handheld robotic device when performing a spatial exploration task? We are investigating this issue via various feedback methods for communicating a 5 degree of freedom target to a user. We compare against a non-robotic handheld wand and use various display alternatives including a stereoscopic VR display (Oculus Rift), a monocular see-through AR display and a 2D screen, as well as simple robot arm gesturing. Our results indicate a significant improvement when using the handheld robot and some interesting conclusions on the effect of various popular display technologies. More to come on a follow up publication... Watch this space and also

Improving MAV Control by Predicting Aerodynamic Effects of Obstacles

Building on our previous work, we demonstrate how it is possible to improve flight control of a MAV that experiences aerodynamic disturbances caused by objects on its path. Predictions based on low resolution depth images taken at a distance are incorporated into the flight control loop on the throttle channel as this is adjusted to target undisrupted level flight. We demonstrate that a statistically significant improvement (p << 0:001) is possible for some common obstacles such as boxes and steps, compared to using conventional feedback-only control. Our approach and results are encouraging toward more autonomous MAV exploration strategies.

Beautiful vs Useful Maps

We are developing methods for compressing maps in a more intelligent way than only looking at geometry. Specifically we look at the relocalisation criteria as a more useful metric. Our methods in the above slide show that they are much better at relocalisation than traditional compression approaches. Some of them also preserve geometry well.

Tuesday, 1 September 2015

Estimating Visual Attention from a Head Mounted IMU

How to obtain attention for an eyewear computer without a gaze tracker?  We developed this method for Google Glass.
This paper concerns with the evaluation of methods for the estimation of both temporal and spatial visual attention using a head-worn inertial measurement unit (IMU). Aimed at tasks where there is a wearer-object interaction, we estimate the when and the where the wearer is interested in. We evaluate various methods on a new egocentric dataset from 8 volunteers and compare our results with those achievable with a commercial gaze tracker used as ground-truth. Our approach is primarily geared for sensor-minimal EyeWear computing.

Wednesday, 6 May 2015

Accurate Photometric and Geometric Error Minimisation with Inverse Depth & What to landmark?

Two papers related to RGBD mapping from a nice collaboration with Daniel Gutierrez and Josechu Guerrero from the University of Zaragoza. Both to be presented at ICRA 2015. One of them nominated for Awards:
  •  D. Gutiérrez-Gómez, W. Mayol-Cuevas, J.J. Guerrero. "Inverse Depth for Accurate Photometric and Geometric Error Minimisation in RGB-D Dense Visual Odometry", In IEEE International Conference on Robotics and Automation (ICRA), 2015. Nominated for Best Robotic Vision Paper Award. [pdf][video][code available] 

  •  D. Gutiérrez-Gómez, W. Mayol-Cuevas, J.J. Guerrero. "What Should I Landmark? Entropy of Normals in Depth Juts for Place Recognition in Changing Environments Using RGB-D Data", In IEEE International Conference on Robotics and Automation (ICRA), 2015.[pdf] 

Monday, 1 September 2014

Cognitive Handheld Robots

We have been working for 2.5 years on prototypes (and since 2006 in the concept!) on what we think is a new extended type of robot. Handheld robots have the shape of tools and are intended to have cognition and action while cooperating with people. This video is from our first prototype back in November 2013. We are also offering details of its construction and 3D CAD models at . We are currently developing a new prototype and more on this soon. Austin Gregg-Smith is sponsored by the James Dyson Foundation.

  • Austin Gregg-Smith and Walterio Mayol. The Design and Evaluation of a Cooperative Handheld Robot. IEEE  International Conference on Robotics and Automation (ICRA). Seattle, Washington, USA. May 25th-30th, 2015. [PDF] Nominated for Best Cognitive Robotics Paper Award.

Saturday, 9 August 2014

Recognition and Reconstruction of Transparent Objects for Augmented Reality

Dealing with real transparent objects for AR is challenging due to their lack of texture and visual features as well as the drastic changes in appearance as the background, illumination and camera pose change. In this work, we explore the use of a learning approach for classifying transparent objects from multiple images with the aim of both discovering such objects and building a 3D reconstruction to support convincing augmentations. We extract, classify and group small image patches using a fast graph-based segmentation and employ a probabilistic formulation for aggregating spatially consistent glass regions. We demonstrate our approach via analysis of the performance of glass region detection and example 3D reconstructions that allow virtual objects to interact with them.
From our paper: Alan Francisco Torres-Gomez, Walterio Mayol-Cuevas, Recognition and reconstruction of transparent objects for Augmented Reality. ISMAR 2014. PDF available at here.

Friday, 1 August 2014

Discovering Task Relevant Objects, their Usage and Providing Video Guides from Multi-User Egocentric Video

Using a wearable gaze tracking setup, we have developed a fully unsupervised method for the discovery of task relevant objects (TRO) and how these objects have been used. A TRO is an object, or part of an object that a person interacts with during a task. We combine visual appearance, motion and user attention to discover TROs. Both static objects such as a coffee machine as well as movable ones such as a cup are discovered. We introduce the term Mode of Interaction to refer to the different ways in which TROs are used. Say, a cup can be lifted, washed, or poured into. When harvesting interactions with the same object from multiple operators, common modes of interaction can be found. We also developed an online fully unsupervised prototype for automatically extracting video guides of how the objects are used. The method automatically selects suitable video segments that indicate how others have used that object before.
  • Damen, Dima and Leelasawassuk, Teesid and Haines, Osian and Calway, Andrew and Mayol-Cuevas, Walterio (2014). You-Do, I-Learn: Discovering Task Relevant Objects and their Modes of Interaction from Multi-User Egocentric Video. British Machine Vision Conference (BMVC), Nottingham, UK. [pdf]
  • Damen, Dima and Haines, Osian and Leelasawassuk, Teesid and Calway, Andrew and Mayol-Cuevas, Walterio (2014). Multi-user egocentric Online System for Unsupervised Assistance on Object Usage. ECCV Workshop on Assistive Computer Vision and Robotics (ACVR), Zurich, Switzerland. [preprint]  
We also have released the dataset available on this project webpage.

Wednesday, 23 April 2014

Learning to Predict Obstacle Aerodynamics from Depth Images for Micro Air Vehicles

This work develops a method to anticipate the aerodynamic ground effects a Micro Air Vehicle will have when going above different obstacles. This works by learning a mapping from depth images to the acceleration experienced from flying above a variety of objects. Computing full 3D aerodynamic effects using air flow simulation onboard a MAV is currently unfeasible. This work uses the alternative approach of learning the visual appearance of objects that produce a given effect which therefore turns the problem into one of regression rather than raw computation.
With the current "easiness" with which 3D maps are now possible to be constructed, this work in a way aims to enhance maps with information that is beyond purely geometric. We have also closed the control loop so we correct for the deviation in anticipation, but that is for another paper.

  • John Bartholomew, Andrew Calway and Walterio Mayol-Cuevas, Learning to Predict Obstacle Aerodynamics from Depth Images for Micro Air Vehicles by , IEEE ICRA 2014. [PDF]

Saturday, 12 October 2013

3D from Looking: Using Wearable Gaze Tracking for Hands-Free and Feedback-Free Object Modelling

How to get a 3D model of something one looks at without any clicks or even feedback to the user?

From our paper T. Leelasawassuk and W.W. Mayol-Cuevas. 3D from Looking: Using Wearable Gaze Tracking for Hands-Free and Feedback-Free Object Modelling. ISWC 2013.

Monday, 1 July 2013

Real-time 3D simultaneous localization and map-building for a dynamic walking humanoid robot

 Image from our Advanced Robotics Journal paper

 This work is with our Samsung Electronics colleagues where our real-time slam system has been used to provide Samsung's RoboRay with mapping capability. RoboRay is one of the most advanced humanoid robots and uses dynamic walking which imposes higher demands on the slam system as there is more agility and motion overall. More details in the paper:

Real-time 3D simultaneous localization and map-building for a dynamic walking humanoid robot
S Yoon, S Hyung, M Lee, KS Roh, SH Ahn, A Gee, P Bunnun, A Calway, & WW Mayol-Cuevas,
Advanced Robotics, published online on May 1st, 2013.

Friday, 28 June 2013

Real-Time Continuous 6D Relocalisation for Depth Cameras

In this work we present our fast (50Hz) relocalisation method based on simple visual descriptors plus a 3D geometrical test for a system performing visual 6-D relocalisation at every single frame and in real time. Continuous relocalisation  is useful in re-exploration of scenes or for loop-closure in earnest. Our experiments suggest the feasibility of this novel approach that benefits from depth camera data, with a relocalisation performance of 73% while running on a single core onboard a moving platform over trajectory segments of about 120m. The system also reduces in 95% the memory footprint compared to a system using conventional SIFT-like descriptors.

  • J. Martinez-Carranza, Walterio Mayol-Cuevas. Real-Time Continuous 6D Relocalisation for Depth Cameras. Workshop on Multi VIew Geometry in RObotics (MVIGRO), in conjunction with Robotics Science and Systems RSS. Berlin, Germany. June, 2013. PDF 
  • J. Martinez Carranza, A. Calway, W. Mayol-Cuevas, Enhancing 6D visual relocalisation with depth cameras. International Conference on Intelligent Robots and Systems IROS. November 2013.

Wednesday, 1 May 2013

Mapping and Auto Retrieval for Micro Air Vehicles (MAV) using RGBD Visual Odometry

These videos show work we been doing with our partners at Blue Bear for onboard visual mapping for MAVs. These are based on visual odometry mapping for working over large areas and build maps onboard the MAV using an asus xtion-pro RGBD camera mounted on the vehicle. One of the videos show autoretrieval of the vehicle where a human pilot first flies the vehicle through the space and then the map is used for relocalisation using the map built on the way back. The other video is on a nuclear reactor installation. These are works we been doing for a while on uses of our methods for industrial inspection.

Monday, 25 March 2013

March 2013 New Robotics MSc. I redesigned our joint MSc in Robotics which is now aimed to support students from various backgrounds in Engineering, Physics and Maths. I am looking forward to supervise MSc project here. Have a look at the programme  here. The application deadline is August 31st.

Thursday, 1 November 2012

Integrating 3D Object Detection, Modelling and Tracking on a Mobile Phone
Click image for video
 This is from a paper at ISMAR 2012 where we combine three things on a mobile phone: 1) in-situ interactive object modelling, 2) real-time tracking based on the wireframe model of the object and 3) re-detection of a lost object using our textureless object detector. It works on a Nokia 900, but look for another post for the Android App.
  • Pished Bunnun, Dima Damen, Andrew Calway, Walterio Mayol-Cuevas, Integrating 3D Object Detection, Modelling and Tracking on a Mobile Phone. International Symposium on Mixed and Augmented Reality (ISMAR). November 2012. PDF.

Sunday, 7 October 2012

Predicting Micro Air Vehicle Landing Behaviour from Visual Texture

How does a UAV can decide where is best to land and what to expect if landing on a particular material? Here we develop a framework to predict the landing behaviour of a Micro Air Vehicle (MAV) from the visual appearance of the landing surface. We approach this problem by learning a mapping from visual texture observed from an onboard camera to the landing behaviour on a set of sample materials. In this case we exemplify our framework by predicting the yaw angle of the MAV after landing.

Egocentric Real-time Workspace Monitoring using an RGB-D Camera

We developed an integrated system for personal workspace monitoring based around an RGB-D sensor. The approach is egocentric, facilitating full flexibility, and operates in real-time, providing object detection and recognition, and 3D trajectory estimation whilst the user undertakes tasks in the workspace. A prototype on-body system developed in the context of work-flow analysis for industrial manipulation and assembly tasks is described. We evaluated on two tasks with multiple users, and results indicate that the method is effective, giving good accuracy performance.
  • Dima Damen, Andrew Gee, Walterio Mayol-Cuevas and Andrew Calway. Egocentric Real-time Workspace Monitoring using an RGB-D Camera, IROS 2012. PDF, Video available from here.

Monday, 3 September 2012

6D RGBD relocalisation

Relocalisation is about finding out where the camera is in translation and rotation (6D) when it visits the space for the first time after a map has been created, or if gets lost during tracking due to occlusion. This is also known as the "kidnapped robot" problem in Robotics and appears frequently in SLAM at the loopclosing stage. Here, we develop a fast relocalisation method for RGB-D cameras that operates in workplaces where low texture and some occlusion can be present. Videos are available here.