Fig. 1
From: DCATNet: polyp segmentation with deformable convolution and contextual-aware attention network

Overall structure of the proposed model. GAM captures additional spatial features from the decoder while keeping the number of input and output channels equal. CAG aggregates encoder and decoder features to reduce the semantic gap between them. MSFE serves as the decoder, performing multi-scale feature extraction and fusion. The model incorporates 12 Transformer layers.