GazeGNN: A Gaze-Guided Graph Neural Network for Chest X-ray Classification

Scene 1 (0s)

[Audio] Hello everyone, I am going to present our WACV 2024 paper, GazeGNN: A Gaze-Guided Graph Neural Network for Chest X-ray Classification.

Scene 2 (18s)

[Audio] Nowadays, chest X-ray has become a common tool for disease diagnosis, and more and more researchers are applying deep learning methods to analyze these images. But there are some limitations. Chest X-rays have limited soft tissue contrast, and many complex anatomical structures overlap in the planar (2D) view. As a result, many tissues, such as organs, blood vessels, and muscles, have similar intensity values in chest X-ray images. As you can see in the figure shown on the screen, the heart and the pneumonia region overlap. This can easily confuse a deep learning model trying to localize the abnormality. Hence, applying a deep learning model to the X-ray images alone is sometimes not enough for a correct diagnosis, because much expert knowledge is missing.

Scene 3 (1m 16s)

[Audio] To address the aforementioned problem, many works incorporate radiologists' expert knowledge into deep learning training. One way is to use eye-tracking technology to collect eye gaze from radiologists. The eye gaze records the locations radiologists focused on during diagnosis, which are highly likely to contain potential abnormalities or other important regions that are hard to identify in challenging X-ray cases. In this way, the deep learning model is supplemented with human attention and learns in an interpretable way.

Scene 4 (1m 54s)

[Audio] There are two main kinds of existing solutions that incorporate eye gaze into disease classification. The first one is called the attention consistency architecture. It treats eye gaze as a supervision source and minimizes the difference between the model attention and the human attention generated from the eye gaze. In the figure, VAM is the visual attention map, obtained from the eye-gaze map, and CAM is the class activation map, generated by the model. This forces the model attention to resemble the human attention and to focus more on the important abnormal regions. The drawback of this method is that no eye-gaze data is available during inference, so it easily loses robustness: we observe a large performance drop when testing on a dataset with a distribution shift from the training dataset.

Scene 5 (2m 49s)

[Audio] The other solution is called the two-stream architecture, with two branches dedicated to processing the image and the eye-gaze information separately. In the figure, we can see two encoders that separately encode the image and the eye-gaze data. However, this architecture still requires converting the eye gaze into a visual attention map (VAM) during inference, which takes time, so current methods pre-generate all VAMs before training or inference. The drawback is that VAM generation is time-consuming, around 10 seconds per case. This makes it impractical for real-world clinical diagnosis, because a system that cannot run in real time would disturb doctors' normal workflow.

Scene 6 (3m 40s)

[Audio] Our method fixes both problems. First, it embeds the eye gaze during inference, which solves the robustness problem. Second, it removes the visual attention map generation procedure and replaces it with the proposed gaze embedding, which is time-efficient.

Scene 7 (3m 58s)

[Audio] GazeGNN constructs a graph from an image and the eye-gaze data. Each node in the graph is represented as a combination of features from patch, gaze, and position embeddings. After the graph is constructed, a graph neural network conducts graph-level classification and predicts the class of the input image.
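To make the node construction concrete, here is a minimal PyTorch sketch of combining patch, gaze, and position embeddings into node features. The 224x224 image size, 16x16 patches, and embedding dimension are illustrative assumptions, not the paper's exact configuration; the per-patch gaze values are the fixation-time sums described in the next scene.

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Combine patch, gaze, and position embeddings into graph node features.

    Sizes are illustrative: a 224x224 image with 16x16 patches gives 196 nodes.
    """

    def __init__(self, dim=192, patch=16, img=224):
        super().__init__()
        self.num_patches = (img // patch) ** 2
        # Patch embedding: linear projection of each non-overlapping patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learned position embedding, one vector per patch location.
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        # Lift the scalar per-patch gaze value into the same feature space.
        self.gaze_proj = nn.Linear(1, dim)

    def forward(self, image, gaze):
        # image: (B, 3, 224, 224); gaze: (B, 196) per-patch fixation times.
        x = self.proj(image).flatten(2).transpose(1, 2)  # (B, 196, dim)
        g = self.gaze_proj(gaze.unsqueeze(-1))           # (B, 196, dim)
        return x + g + self.pos                          # one feature per node
```

Summing the three embeddings keeps every node at the same dimensionality, so image content, human attention, and spatial location all live in one feature vector per patch.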

Scene 8 (4m 21s)

[Audio] The gaze embedding is really simple: we sum the fixation times of all eye-gaze points falling inside a patch, and this sum represents the attention feature of that patch. This operation is much more time-efficient than generating a VAM.
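As a concrete illustration, here is a minimal Python sketch of this per-patch accumulation. The fixation format of (x, y, duration) in pixel coordinates and the 224x224 image with a 16x16 patch grid are assumptions for the example.

```python
import numpy as np

def gaze_embedding(fixations, image_size=224, patch_size=16):
    """Sum eye-gaze fixation durations inside each image patch.

    fixations: iterable of (x, y, duration) points in pixel coordinates
    of the resized image (hypothetical format for illustration).
    Returns a flat vector with one attention value per patch node.
    """
    n = image_size // patch_size                 # patches per side, e.g. 14
    grid = np.zeros((n, n), dtype=np.float32)
    for x, y, duration in fixations:
        row = min(int(y) // patch_size, n - 1)   # clamp points on the border
        col = min(int(x) // patch_size, n - 1)
        grid[row, col] += duration               # accumulate fixation time
    return grid.reshape(-1)                      # (n * n,) per-patch feature
```

A single pass over the fixation list like this replaces rendering a full-resolution attention map, which is where the time saving comes from.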

Scene 9 (4m 37s)

[Audio] After the graph is constructed, we use a graph neural network to conduct graph-level classification. The detailed architecture is shown in the figure..
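For illustration only, here is a generic graph-level classification head in the style this scene describes, written with PyTorch Geometric. The paper's actual GNN blocks differ; the GCN layers, hidden width, and three-class output below are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(nn.Module):
    """Generic stand-in: message passing, global pooling, linear head."""

    def __init__(self, dim=192, num_classes=3):
        super().__init__()
        self.conv1 = GCNConv(dim, dim)           # placeholder GNN layers
        self.conv2 = GCNConv(dim, dim)
        self.head = nn.Linear(dim, num_classes)  # graph-level prediction

    def forward(self, x, edge_index, batch):
        # x: (num_nodes, dim) node features from the gaze-guided graph
        # edge_index: (2, num_edges) graph connectivity
        # batch: (num_nodes,) maps each node to its graph in the mini-batch
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)           # one vector per graph
        return self.head(x)                      # class logits per image
```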

Scene 10 (4m 51s)

[Audio] In this study, the reason for applying a GNN is that GNNs are not as data-hungry as Transformer-based models and are more flexible in representing different types of information in a single graph node. As we can see in the table, the GNN achieves the best evaluation scores with limited data when compared to other backbone networks.

Scene 11 (5m 32s)

[Audio] Our method improves the classification accuracy considerably, as shown in the table. Here we compare with the existing state of the art, and the proposed GazeGNN achieves the best performance on all evaluation metrics.

Scene 12 (5m 48s)

[Audio] We also generate the AUC figure, and the proposed method outperforms the other models, achieving the best average AUC.

Scene 13 (5m 58s)

[Audio] Here, we show that we can improve the inference speed significantly. As shown in the table, the two-stream architecture takes the longest inference time. GazeGNN obtains an inference time comparable to the attention consistency architecture, even though the attention consistency architecture does not require gaze input at the inference stage while GazeGNN incorporates the eye gaze.

Scene 14 (6m 25s)

[Audio] We also show that model robustness can be improved by our method. Here, we test the models on a dataset with a distribution shift, and we find that the attention consistency architecture exhibits a much larger performance drop than our proposed method. This means our model is more robust than state-of-the-art methods.

Scene 15 (6m 48s)

[Audio] Here is the visualization of the results. We use Grad-CAM to generate the model attention. It shows that the model attention learns to align with the human attention, that is, the eye gaze.

Scene 16 (7m 3s)

[Audio] In conclusion, this work bypasses the time-consuming VAM generation, significantly increasing the inference speed, and incorporates the eye gaze during inference, showing strong robustness. Our method achieves state-of-the-art classification accuracy and strong robustness with time-efficient performance. Hence, it proves the feasibility of bringing real-time eye-tracking techniques into radiologists' daily work.

Scene 17 (7m 32s)

[Audio] Thank you for listening to this presentation. More information can be found on our project page.