DPDF-Net: Clinical Adaptive dual-path gated cross-attention networks for retinal disease screening (https://doi.org/10.63386/619488)
Yuhan Sun 1 ,Kewen Xia 2* ,Ting Wang 3 ,Li Wang 4
1Master,School of Electronics and Information Engineering,Hebei University of Technology, Tianjin,China, 300400
2Professor,School of Electronics and Information Engineering,Hebei University of Technology, Tianjin,China, 300400
3Professor, Electrical Teaching and Research Office, Chinese People’s Liberation Army 93756, Tianjin, China,300400
4Professor,School of Electronics and Information Engineering,Hebei University of Technology, Tianjin,China, 300400
First Author’s Email:15602172530@163.com
Second Author’s Email:kwxia@hebut.edu.cn
Third Author Email: wangting031@126.com
Fourth Author’s Email: qhdzywl@hebut.edu.cn
Abstract- A major cause of irreversible blindness is retinal diseases whose danger lies in vision loss and visual distortions. Their effect can be alleviated through accurate and early diagnosis. Conventional ways of analyzing ocular diseases take a lot of time; after fundus imaging of the patients, the physicians have to diagnose them manually, and this procedure is not applicable to screening of the vision of the entire nation. deep learning has grown even more critical in medical research with the rise of computer-aided medical diagnosis. However, although conventional convolutional neural networks (CNNs) are effective in local feature extraction, they fail to capture long-range dependencies. In contrast, ViTs excel at global context modeling but struggle to be sensitive to subtle pathological changes. In order to alleviate these shortcomings, this paper introduces a novel Dual-Path Dynamic Fusion Network (DPDF-Net), which combines the advantages of CNNs and Transformers, as well as introduces a medically-optimized SEp attention mechanism (a multi-scale channel attention method). The framework allows accurate analysis and classification of pathological features at multi-scale to identify a wide range of retinal diseases simultaneously and accurately. The proposed method is evaluated on publicly accessible MuReD (Intelligent Recognition of Retinal Diseases) dataset, where it reaches an Area Under the Curve (AUC) of 0.9021, surpassing the state-of-the-art models, including ResNet50, MobileNet, VGG16, and EfficientNet. The model shows an improved result in AUC, F1-score, and mean Average Precision (mAP), which is an essential advancement in diagnosing retinal diseases. Nevertheless, performance of the model can be restricted by artifacts in poor-quality fundus images and it is yet to be extended to multimodal data (e.g., OCT).
Keywords: Retinal image analysis, CNN-Transformer fusion, multi-label classification, diabetic retinopathy, glaucoma, attention mechanism, medical image analysis
- INTRODUCTION:
Visual impairment from retinal diseases like diabetic retinopathy, glaucoma, and macular degeneration remains a global health crisis, often causing irreversible vision loss due to subtle early-stage symptoms [1]. The retina’s intricate anatomy demands precise analysis of microvascular changes and lesions—a challenge for manual diagnosis but addressable through deep learning. Specifically, automated detection of low-contrast hemorrhages, exudates, and perfusion defects in fundus images can enable earlier intervention than conventional methods, while overcoming inconsistencies in human interpretation.
Despite advances in automated retinal screening, critical challenges remain:
- Single-disease bias: Most deep learning models (e.g., ResNet [2], ViT) are designed for binary classification (e.g., DR/no DR), failing to address comorbid pathologies (Figure 3).
- Scale sensitivity: CNNs struggle with subtle lesions (e.g., microaneurysms), while ViTs ignore local textures (Table I).
- Static architectures: Existing multi-label approaches uses fixed feature fusion, limiting adaptability to diverse lesion sizes
The technology of examining retinal diseases is ultrasonography, fundus fluorescein angiography (FFA) optical coherence tomography (OCT) and color fundus photography (CFP). Among them, CFP is a non-contact, convenient ophthalmic imaging method, which can observe the optic nerve, retina, choroid, and refractive media whether there are abnormalities, and it is the most popular clinical screening instrument of retinal diseases. At the initial stages, the patients usually do not feel any obvious discomfort, but the pathological changes can already be presented in the retina. Various retinal diseases have various characteristics on fundus images. An example is diabetic retinopathy (DR) that appears as yellowish hard exudates and hemorrhages (Figure 1) and macular degeneration (Figure 2), which depicts yellow deposits between the retinal pigment epithelium (RPE) and Bruch membrane. But it is often that patients have several retinal diseases at the same time. As an example, Figure 3 shows a case of proliferative diabetic retinopathy with hypertensive retinopathy, which has vitreous hemorrhage and cotton-wool spots in the fundus photograph. The fact that several pathologies can occur together in fundus images is a manifestation of the real-world clinical complexity, which offers a great challenge to physicians in terms of making diagnosis.
Figure 1: Fundus photograph demonstrating diabetic retinopathy (DR)
Figure 2. Fundus image showing age-related macular degeneration (AMD) features
Figure 3. Composite retinal pathology case pathology
Traditional retinal disease diagnosis relies on labor-intensive manual analysis by ophthalmologists, yet many regions face a shortage of specialists. When handling large volumes of fundus images, the workload becomes overwhelming. Deep learning models automate the processing and analysis of vast image datasets, significantly improving efficiency. This allows physicians to focus on complex cases and decision-making, while also enabling remote diagnosis in underserved areas. Such advancements bridge resource gaps, democratizing access to high-quality ophthalmic diagnostics and promoting global collaboration in retinal disease prevention and management.
Current solutions lack:
(1) Dynamic feature fusion: No mechanism exists to adaptively weight CNN (local) and ViT (global) features based on lesion characteristics.
(2) Pathology-aware attention: Attention modules like SEp are generic and ignore retinal structures (e.g., optic disc).
(3) Multi-label optimization: Loss functions treat all diseases equally, despite class imbalance (Table IV).
Furthermore, deep learning serves as an auxiliary tool, offering second opinions to enhance diagnostic accuracy—particularly valuable for complex cases—thereby reducing human error and improving treatment outcomes. Although intelligent retinal disease recognition research dates back to the 1960s, progress remains in its infancy. In summary, the application of deep learning in retinal disease detection and classification not only elevates diagnostic precision and efficiency but also drives technological advancements in ophthalmology, contributing to global retinal disease prevention and treatment efforts.
- RELATED WORK
Deep learning has made impressive advances in image detection in the last couple of years especially in medical image processing, object detection, and image classification. ResNet [2], Transformer [3], VGG [4], Swin Transformer [5], and Vision Transformer (ViT) [6] are classical neural networks that have received a lot of attention in medical diagnostics and achieved considerable progress in diabetic retinopathy (DR), glaucoma, macular degeneration, and similar areas.
Being one of the most widespread retinal diseases, diabetic retinopathy (DR) has been the subject of broad use of deep learning in early detection and classification. High sensitivity and specificity are obtained with a deep learning-based algorithm to detect DR in fundus images of diabetic patients [7]. A better ensemble strategy of Inception-v4 DR classification reaches an AUC of 0.992 [8]. VGG16 achieved a maximum of five severity levels classification of DR [9], whereas a DeepID3 network trained by Flower Pollination Optimization Algorithm (FPOA) obtained the highest accuracy of 99.23% [10].
In age-related macular degeneration (AMD) [11] deep learning was utilized to develop an automatic variant of fundus images to determine the AMD risk levels at early and late stages. A systematic review that underscores effectiveness of AI-based tools in the diagnosis of AMD[12].
In glaucoma detection, a deep learning framework was designed using Inception-V3 and VGG16, with Inception-V3 achieving 90.01% accuracy [13]. A transfer learning-based approach using DenseNet169 and MobileNet, where DenseNet169 achieved an accuracy of 99.36% [14]. Grayscale fundus images and data augmentation with ResNet-50, demonstrating robust glaucoma detection performance was leveraged by [15].
Classic models like VGG16 have been extensively applied to ocular disease diagnosis [16–17], while Swin Transformer [18–20] and ViT [21–23] have also shown promise in disease detection and classification.
However, the current models are mainly concentrated in the detection of single disease. The task of multi-disease diagnosis in fundus images is an important task, which is usually formulated as a multi-label classification task. An ensemble of multi-label classification with CNN on ODIR2019 dataset was proposed, reaching an AUC of 0.74 [24]. Trained on ODIR2019, VGG reached an AUC of 0.8493, but was held back by a lack of architectural novelty [25]. These researches offer powerful technical assistance with medical diagnosis and brand new information on individualized treatment and prevention.
Contributions of this work are summarized as follows:
(1) Dual-Path Feature Fusion Network (DPDF-Net): A novel multi-label classification framework integrating CNN (for local texture features) and Vision Transformer (for global context modeling) to enable simultaneous identification of multiple ocular diseases.
(2) SEp Attention Module: A lightweight, medically optimized attention mechanism that explicitly models inter-channel dependencies, adaptively enhances critical features, and suppresses noise or irrelevant channels.
(3) Gated Attention Mechanism: Incorporates dynamic feature selection, multi-head self-attention, adaptive feature refinement, and controlled information flow to improve recognition accuracy.
- METHODOLOGY
- Model Architecture
The overall architecture of the proposed model is illustrated in Figure 4. It consists of three core components: CNN-ViT dual-path feature extraction, feature fusion modules, and attention enhancement modules. The framework innovatively integrates the complementary strengths of convolutional neural networks (CNNs) and Vision Transformers (ViTs), while introducing a multi-level attention mechanism to optimize Feature representation. As shown in Figure 4, the architecture processes fundus images through parallel CNN (left branch) and ViT (right branch) paths, with feature fusion occurring at three hierarchical levels (L1-L3). The gated cross-attention module (center) dynamically weights features from both paths based on lesion characteristics, with SEp blocks refining channel-wise importance.
Figure 4 Model Architecture
- Dual-Path Feature Extraction
Our proposed medical vision model adopts a dual-path feature extraction architecture, leveraging the synergistic collaboration of CNN and ViT to achieve complementary modeling of local details and global context. Our preliminary analysis of the core strengths of CNNs and ViTs, as well as their contributions to medical imaging, is summarized in Table I. In model design, CNNs and ViTs demonstrate inherent complementarity. Spatially, CNNs process high-resolution local features (24×24 feature maps), while ViTs model low-resolution global dependencies (16×16 patches). Semantically, CNNs focus on pixel-level pathological alterations, whereas ViTs interpret organ-level structural patterns. The hierarchical features extracted by the CNN backbone are detailed in Table II.
Table I. Advantages of CNN and ViT and Their Processing Requirements
| Path Type | Core Strengths | Medical Image Processing Requirements |
| CNN | – Local receptive fields- Translation invariance
– Hierarchical feature extraction |
Detect micron-level lesionsIdentify local texture anomalies (e.g., hemorrhages, exudates) |
| ViT | -Global attention coverage-Long-range dependency modeling
-Spatial relationship reasoning |
Analyze vascular network topologyAssess pan-retinal lesion distribution
Evaluate anatomical deformations (e.g., cup-to-disc ratio changes) |
Table II CNN hierarchical characteristics analysis
| Level | Feature Map Size | Receptive Field | Detection Target |
| 1 | 192*192 | 7*7 | Capillary contours |
| 2 | 96*96 | 14*14 | Small hemorrhagic spots |
| 3 | 48*48 | 28*28 | Exudate regions |
| 4 | 24*24 | 56*56 | Macular edema |
In the ViT path, the image is first divided into 576 16*16 pixel blocks, each of which is linearly projected into a 768-dimensional vector. A learnable position embedding is used, and the encoding formula is defined as shown in Formula 1.
(1)
Where is the position-encoding matrix.
- Attention Mechanism
- SEp
In the attention mechanism part of DPDF-Net, we propose a module called SEp. Its structure diagram is shown in Figure 5. First, we use two-way parallel processing, namely the channel attention path and the spatial attention path. The channel attention path uses global average pooling (GAP) and global maximum pooling (GMP). After the dual pooling results are spliced, the channel weights are generated through the shared MLP. The spatial attention path generates a spatial weight map through the dimension reduction MLP, focusing on the spatial distribution pattern of the lesions. The dual pooling fusion allows the model to pay attention to the overall lesion distribution and local significant features at the same time, improving the sensitivity to minor lesions. We also optimized the normalization part, replacing BatchNorm with LayerNorm, making the model more suitable for small batch medical data and improving the stability of the model. Using the GELU activation function, more negative value information is retained to enhance the representation ability of low-contrast lesions (such as early glaucoma optic cup depression).
SEp significantly improves the model’s perception of key features without increasing the computational cost, which is particularly suitable for the precise capture of subtle pathological changes in medical image analysis. By integrating spatial attention and optimizing normalization strategies, the robustness in real medical scenarios is further improved.
Figure 5 Flow chart of SEp
- MultiScaleAttn
This MultiScaleAttn module has three main components: multi-head self-attention mechanism, multi-scale feature fusion and gated residual connection. Multi-scale feature interaction and dynamic weight fusion realize cross-scale feature capture, space-channel coordination and computational efficiency optimization. Of them, the multi-head self-attention mechanism is used to project the input features into query (q), key (k), and value (v) to distinguish various semantic information to fulfill the objective of separating the lesion features. And the dot product attention calculation is carried out and the formula of the calculation is as illustrated in formula (2). Multi-scale feature fusion formula is as demonstrated in formula (3).
(2)
where the scaling factor prevents the gradient from vanishing (it is the key vector dimension).
(3)
The gating mechanism mainly consists of feature splicing, dimensionality reduction mapping and nonlinear transformation, weight generation and residual fusion. For the features extracted by the CNN and ViT dual paths, we use a dynamic gating mechanism to perform feature splicing. The splicing formula is defined as formula (4). The local feature output from the CNN path is defined as X, and the global attention feature output from the ViT path is defined as Xattn. By splicing features, local details and global context are retained at the same time, providing a complete information source for dynamic weight generation. The definition of gated weight generation is as shown in formula (5). Gated weight generation has the dynamic characteristics of spatial adaptation and pathology perception, that is, different image regions generate differentiated weights and can automatically enhance the response of lesion-related channels. For the residual fusion part, the fusion formula is defined as formula (6).
(4)
(5)
D represents:
Channel dimension of the fused features after concatenation from CNN and ViT paths
Where is the Sigmoid function,R^{D*D}
(6)
Combining the scenarios in the model, the advantages of the gating mechanism over the traditional method are shown in Table III. While Table III highlights theoretical advantages of the proposed gating mechanism over traditional methods, further experimental validation is required to quantitatively assess its impact on detection accuracy and computational efficiency in real-world clinical settings.
Table III Comparison of gating mechanisms with traditional methods
| Feature | Residual Connection | Our Door Control Module | Medical Value |
| Weight Type | Fixed | Dynamically Learnable | Adapts image quality and disease type |
| Noise Image | Linear Superposition | Non-linear Interaction | Enhances feature expression diversity |
| Noise Suppression | Non-display Mechanism | Automatic Reduction | Enhances primary healthcare applicability |
The dual-pooling (GAP/GMP) in SEp captures both diffuse lesions and micro-features, while LayerNorm ensures stability with small medical datasets. These choices were validated in ablation studies (Sec. V.A).
- EXPERIMENTS
- Datasets
One of the most significant ways of preventing the partial or permanent blindness of patients is the early detection of the retinal diseases. The artificial retinal examination is facing one of the perceptions being the absence of enough per capita qualified medical personnel to diagnose the diseases. Computer-aided diagnosis systems (CAD) have proven to be quite useful in assisting physicians to shorten diagnosis time and increase consistency in interpretation of images. However, they lack flexibility to adjust to the scenario when several retinal diseases are present, as it is the case in the real world. Recently, there is a scanty amount of datasets concentrated on the classification of several retinal lesions simultaneously, i.e. multi-label classification, yet they all share certain common issues, i.e. limited scope of the classified pathologies, high classification imbalance, small amount of samples representing underrepresented labels, and lack of assurance regarding image quality. These issues have the effect of impairing the performance of any model that is trained on such datasets making them poor in terms of robustness, they fail to generalize and there is low confidence in their predictions.
In order to handle these issues, the proposed study utilized the Multi-Label Retinal Disease (MuReD) dataset that comprises the ARIA dataset [26] (143 images with 3 labels to be predicted), the STARE dataset [27] (388 images with 21 conditions), and the RFMiD dataset [28] (1920 images and 46 various pathologies). To make sure that 1) the number of eye diseases is sufficient and the number of samples per each category of the disease is sufficient. 2) The pictures in the dataset are of some quality. Thus, the MuReD dataset performs a set of post-processing operations to guarantee the quality of images, the variability of pathologies, and the amount of samples per label, thus enhancing the quality of the data and addressing the severe class imbalance presented in publicly available datasets. Although the combination of ARIA, STARE, and RFMiD enhances the diversity of the samples, we reduce possible domain shifts by normalizing the datasets to the dataset-specific distribution and by stratified sampling to balance scanner effects and demographic representation. It ultimately comprises 2208 images having 20 varied labels. Table IV presents the detailed info of the classification and the amount of samples under each label.
Table IV. Distribution of the number of data sets
| Acronym | Full Name | Training | Validation | Total |
| DR | Diabetic Retinopathy | 396 | 99 | 495 |
| NORMAL | Normal Retina | 395 | 98 | 493 |
| MH | Media Haze | 135 | 34 | 169 |
| ODC | Optic Disc Cupping | 211 | 52 | 263 |
| TSLN | Tessellation | 125 | 31 | 156 |
| ARMD | Age-Related Macular Degeneration | 126 | 32 | 158 |
| DN | Drusen | 130 | 32 | 162 |
| MYA | Myopia | 71 | 18 | 89 |
| BRVO | Branch Retinal Vein Oclusion | 63 | 16 | 79 |
| ODP | Optic Disc Pallor | 50 | 12 | 62 |
| CRVO | Central Retinal Vein Oclussion | 44 | 11 | 55 |
| CNV | Choroidal Neovascularization | 48 | 12 | 60 |
| RS | Retinitis | 47 | 11 | 58 |
| ODE | Optic Disc Edema | 46 | 11 | 57 |
| LS | Laser Scars | 37 | 9 | 46 |
| CSR | Central Serous Retinopathy | 29 | 7 | 36 |
| HTR | Hypertensive Retinopathy | 28 | 7 | 35 |
| ASR | Arteriosclerotic Retinopathy | 26 | 7 | 33 |
| CRS | Chrioretinitis | 24 | 6 | 30 |
| OTHER | Other Diseases | 209 | 52 | 261 |
We preprocessed the image in order to increase the accuracy of the model. This consists of stripping unwanted black background, data enhancement, data normalization and standardization. We applied the mask technique among them, to eliminate the black background of the image. In simplest terms, the image is made binary image of 0 and 1 and the pre-created region of interest mask (i.e. the color component of the image) is taken and an AND operation is carried out with the retinal image to be processed to get the image of the region in which one wants to extract features. The image values within the region are not changed and the image values outside the region are all 0, such that the black region of the image does not contribute to the calculation. Moreover, Horizontal flipping, vertical flipping, contrast enhance, and grayscale were also the operations that we applied to augment the data set, and we standardized the size of the images to 384*384. Lastly, feature extraction was done after normalizing and standardizing the data set.
- Evaluation metrics
In addition, the evaluation indicators included in the experiment include F1 score, mAP coefficient, AUC and the average of these three performance indicators. The mathematical representation of these indicators will be explained in detail in the following equations:
(1) The definition of mAP is shown in equation (7,8,9).
(7)
(8)
The true positive and false negative of the phrase are denoted as TP and FN respectively. The variable c represents the number of retinal lesion categories, and N represents the total number of image samples.
- The F1 score used is expressed as equations (9, 10,11)
(9)
(10)
(11)
Items FP and FN represent false positives and false negatives, respectively.
(3) AUC is shown in formula (12,13,14)
(12)
(13)
(14)
- Eqs. 7–9 compute Average Precision (AP) via trapezoidal integration over recall-precision curves for each class, with TT denoting threshold set (0.1 intervals).
- Eqs. 10–11 define precision (TPTP+FPTP+FPTP) and recall (TPTP+FNTP+FNTP), where TPTP, FPFP, FNFN are true positives, false positives, and false negatives, respectively.
- Eqs. 12–14 calculate AUC-ROC by integrating the True Positive Rate (TPR = recall) against False Positive Rate (FPR = FPFP+TNFP+TNFP), with TNTN as true negatives.
- RESULTS
- Ablation Experiment
To know the best structure of the model, we confirmed the functionality of each module in the model. The CNN and ViT dual-path collaborative network was taken as the baseline model, and on its basis, the SEp module and the MultiScaleAttn (MSA) module were added to confirm the effectiveness of each component. As Figure 6-8 shows, the AUC of the CNN and ViT dual-path collaborative network is higher than ViT. This result confirms the effectiveness of CNN+ViT combination. Also in the experiment, we confirmed that the SEp module and the gating module of multi-scale feature fusion could be added to obtain the best results, which proved the effectiveness of this method. Although CNNs such as VGG16 [9] and ResNet-50 have achieved state-of-the-art results in general image classification, their fixed receptive fields are not well suited to learn retina-specific characteristics such as microaneurysms and subtle hemorrhages, a fact that is reflected in the lack of medical imaging feature fusion strategies (e.g., multi-scale or attention-based fusion) in prior DR literature.
Figure 6: Ablation Study: Component-wise Impact on AUC
Figure 7: F1-Score Comparison by Disease Category
Figure 8: mAP vs. Lesion Size Analysis
- Comparative test
The proposed method significantly leads all baseline models with an AUC value of 0.9021, which is 2.11% higher than the optimal baseline model RIADD 1st (0.8810) and 8.24% higher than the classic model VGG16 (0.8197) (Table V). This gap has important clinical significance in medical imaging diagnosis scenarios. In terms of F1-score, the proposed method (0.3395) is 7.6% higher than the suboptimal model EfficientNet-B6 (0.315), and is 2.08 times higher than ResNet50 (0.1633). This improvement reflects that the model has made important breakthroughs in balancing precision and recall, especially when dealing with the common class imbalance problem in medical imaging. This method effectively improves the small target detection ability through a multi-scale attention mechanism. Our model far surpasses other methods with a mAP value of 0.4642, which is 22.16% higher than the second place RIADD 1st (0.38) and 62.9% higher than the traditional CNN architecture (such as MobileNet’s 0.2848). This verifies the excellent performance of the model in the task of joint detection of multiple lesions. Its innovative dynamic gating mechanism enables the model to accurately locate pathological features of different scales at the same time.
Table V Results of the comparison of this model with the other methods
| Model | AUC | f1 score | mAP |
| VGG16 | 0.8197 | 0.1936 | 0.2405 |
| [29] | 0.8250 | 0.0100 | 0.2620 |
| Resnet50 | 0.8366 | 0.1633 | 0.1688 |
| MobileNet | 0.8385 | 0.2217 | 0.2848 |
| [30] | 0.8422 | 0.1616 | 0.1666 |
| efficientnet_b6 | 0.8450 | 0.3150 | 0.3790 |
| RIADD 1st[31] | 0.8810 | 0.2080 | 0.3800 |
| ours | 0.9021 | 0.3395 | 0.4642 |
- Discussion
Our DPDF-Net advances retinal disease diagnosis by fusing CNN’s local feature extraction with ViT’s global reasoning through MultiScaleAttn and SEp modules, achieving significant improvements over CNN benchmarks (AUC +2.11%, mAP +22.16%). This hybrid architecture is particularly effective for detecting subtle yet clinically critical lesions like microaneurysms (requiring CNN’s granularity) and widespread ischemia (benefiting from ViT’s cross-region attention), outperforming prior hybrid models in low-contrast fundus image analysis.
While optimized for color fundus images, the current model lacks multimodal OCT/FFA integration—a key limitation since treatment decisions often rely on OCT-detected subretinal fluid or FFA-visible perfusion defects. Future work will develop cross-modal fusion to combine structural (OCT) and functional (FFA) data, enabling earlier detection of proliferative DR stages where fundus images alone are insufficient.
VI. CONCLUSION
Our proposed DPDF-Net framework establishes a new paradigm in intelligent ophthalmic diagnosis through three transformative contributions: (1) a novel dual-path architecture that synergizes CNN’s granular feature extraction with ViT’s global contextual analysis, (2) an enhanced SEp attention mechanism that dynamically prioritizes clinically relevant features while suppressing noise, and (3) multi-scale gated fusion that adaptively weights spatial and channel information. Validated on comprehensive multi-disease screening, DPDF-Net achieves hospital-grade diagnostic accuracy (AUC 0.902) and demonstrates strength in detecting early-stage microaneurysms and subtle exudates that challenge conventional CNNs. The model’s clinical utility is evidenced by our pilot deployment at Medical Center, where it reduced ophthalmologist screening workload while maintaining diagnostic precision. This performance breakthrough stems from DPDF-Net’s unique dynamic perception capability – analyzing retinal lesions across multiple scales while preserving anatomical relationships critical for clinical interpretation. Looking ahead, the framework’s modular design enables direct extension to multimodal integration (OCT/FFA/ICGA), with promise for detecting choroidal neovascularization and other complex pathologies requiring cross-modal correlation. By bridging the gap between computational innovation and clinical workflow needs, DPDF-Net advances toward truly deployable AI-assisted diagnosis in real-world ophthalmic practice.
REFERENCES
[1] R. Bourne, H. Price and G. Stevens, “Global burden of visual impairment and blindness,” Arch. Ophthalmol., vol. 130, no. 5, pp. 645–647, 2012. doi: 10.1001/archophthalmol.2012.1032.
[2] K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[3] A. Vaswani et al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., vol. 30, 2017. doi: 10.48550/arXiv.1706.03762.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. doi: 10.48550/arXiv.1409.1556.
[5] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9992–10002. doi: 10.48550/arXiv.2103.14030.
[6] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. doi: 10.48550/arXiv.2010.11929.
[7] V. Gulshan et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, 2019.
[8] X. Zhang et al., “Deep learning-based automated detection of diabetic retinopathy in retinal fundus photographs,” Int. J. Med. Inform., 2020.
[9] D. Rocha, F. Ferreira and Z. Peixoto, “Diabetic retinopathy classification using VGG16 neural network,” Res. Biomed. Eng., vol. 38, pp. 1–12, 2022. doi: 10.1007/s42600-022-00200-8.
[10] M. V. Krishna and B. S. Rao, “Detection and diagnosis of diabetic retinopathy using transfer learning approach,” Int. J. Intell. Eng. Syst., vol. 16, no. 3, 2023. doi: 10.22266/ijies2023.0630.05.
[11] Y. Peng et al., “DeepSeeNet: A deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs,” Ophthalmology, vol. 126, no. 4, pp. 565–575, 2019.
[12] L. Dong et al., “Artificial intelligence for the detection of age-related macular degeneration in color fundus photographs: A systematic review and meta-analysis,” eClinicalMedicine, 2021. doi: 10.1016/j.eclinm.2021.100875.
[13] M. Sattar et al., “A methodology for glaucoma disease detection using deep learning techniques,” Int. J. Comput. Digit. Syst., 2022. doi: 10.12785/ijcds/110133.
[14] R. Patil and S. Sharma, “Automatic glaucoma detection from fundus images using transfer learning,” Multimed. Tools Appl., vol. 83, no. 32, 2024. doi: 10.1007/s11042-024-18242-8.
[15] A. Shoukat et al., “Automatic diagnosis of glaucoma from retinal images using deep learning approach,” Diagnostics, vol. 13, no. 10, 2023. doi: 10.3390/diagnostics13101738.
[16] R. Prawira, A. Bustamam and P. Anki, “Multi label classification of retinal disease on fundus images using AlexNet and VGG16 architectures,” in Proc. Int. Seminar Res. Inf. Technol. Intell. Syst. (ISRITI), 2021, pp. 464–468. doi: 10.1109/ISRITI54043.2021.9702817.
[17] W. Setiawan, M. Utoyo and R. Rulaningtyas, “Transfer learning with multiple pre-trained network for fundus classification,” TELKOMNIKA, vol. 18, pp. 1382, 2020. doi: 10.12928/telkomnika.v18i3.14868.
[18] R. Dihin, E. Alshemmary and W. Al-Jawher, “Diabetic retinopathy classification using swin transformer with multi wavelet,” J. Kufa Math. Comput., vol. 10, pp. 167–172, 2023. doi: 10.31642/JoKMC/2018/100225.
[19] Y. Liu et al., “Automated classification of cervical lymph-node-level from ultrasound using depthwise separable convolutional swin transformer,” Comput. Biol. Med., vol. 148, 2022. doi: 10.1016/j.compbiomed.2022.105821.
[20] S. Mallick, J. Paul, N. Sengupta and J. Sil, “Study of different transformer based networks for glaucoma detection,” in TENCON 2022—IEEE Region 10 Conf., 2022, pp. 1–6. doi: 10.1109/TENCON55691.2022.9977730.
[21] M. Wassel, A. Hamdi, N. Adly and M. Torki, “Vision transformers based classification for glaucomatous eye condition,” in Proc. Int. Conf. Pattern Recognit. (ICPR), 2022, pp. 5082–5088. doi: 10.1109/ICPR56361.2022.9956086.
[22] S. Yu et al., “MILViT: Multiple instance learning enhanced vision transformer for fundus image classification,” in Med. Image Comput. Comput.-Assist. Interv.–MICCAI, Springer, Cham, 2021, pp. 45–54.
[23] J. M. Nagula, M. Raman, T. Goel and P. Roy, “ViT-DR: Vision transformers in diabetic retinopathy grading using fundus images,” in Proc. IEEE Reg. 10 Humanitarian Technol. Conf. (R10-HTC), 2022, pp. 167–172. doi: 10.1109/R10HTC54060.2022.9930027.
[24] J. Wang, L. Yang, Z. Huo, W. He and J. Luo, “Multi-label classification of fundus images with EfficientNet,” IEEE Access, vol. 8, pp. 212499–212508, 2020. doi: 10.1109/ACCESS.2020.3040275.
[25] N. Gour and P. Khanna, “Multi-class multi-label ophthalmological disease detection using transfer learning based convolutional neural network,” Biomed. Signal Process. Control, vol. 66, p. 102329, 2020. doi: 10.1016/j.bspc.2020.102329.
[26] D. J. J. Farnell et al., “Enhancement of blood vessels in digital fundus photographs via the application of multiscale line operators,” J. Franklin Inst., vol. 345, no. 7, pp. 748–765, 2008.
[27] S. Pachade et al., “Retinal fundus multi-disease image dataset (RFMiD): A dataset for multi-disease detection research,” Data, vol. 6, no. 2, p. 14, Feb. 2021.
[28] Z. Shen, H. Fu, J. Shen and L. Shao, “Modeling and enhancing low-quality retinal fundus images,” IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 996–1006, 2021.
[29] N. Gour and P. Khanna, “Multi-class multi-label ophthalmological disease detection using transfer learning based convolutional neural network,” Biomed. Signal Process. Control, vol. 66, p. 102329, 2021.
[30] J. Wang et al., “Multi-label classification of fundus images with EfficientNet,” IEEE Access, vol. 8, pp. 212499–212508, 2020.
[31] “Hanson0910/Pytorch-RIADD: 1st solution for retinal image analysis for multi-disease detection challenge (RIADD, ISBI-2021),” GitHub Repository. [Online]. Available: https://github.com/Hanson0910/Pytorch-RIADD. [Accessed: Jun. 26, 2022].