Real or Synthetic? Dermatologist Agreement on Synthetic vs. Real Melanoma and Pattern Recognition
| Published in: | Epidemiology, biostatistics, and public health |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Milano University Press, 08.09.2025 |
| ISSN: | 2282-0930 |
| Online access: | Get full text |
Summary:

Background. The validation of synthetic dermatological images generated by Generative Adversarial Networks (GANs) [1] is crucial for their integration into clinical and research workflows. Despite rapid progress in image synthesis, a standardized framework for evaluating the realism and diagnostic utility of synthetic skin lesions through expert review is still lacking [2]. Existing automated evaluation metrics, while informative, do not always align with human perception and diagnostic expectations. Particularly in medical domains, subtle visual cues and contextual interpretation often elude algorithmic assessment [3]. Human evaluations remain the most direct means of determining whether synthetic images capture the nuanced features necessary for clinical utility. Without structured expert-based validation, synthetic images may introduce bias or mislead models and clinicians, hampering their responsible deployment in diagnostic support systems, training datasets, or educational tools.

Objectives. This study aims to conduct an expert-based qualitative evaluation of synthetic melanoma images. Specifically, it investigates the subjective perception of image realism, diagnostic quality, and the recognizability of key dermoscopic features. By engaging dermatologists in a blinded assessment of synthetic and real images, we seek to establish a foundation for systematically validating synthetic dermatological data for use in AI development, medical education, and clinical decision support. This work emphasizes the importance of subjective expert validation as a complement to technical performance metrics in assessing the fidelity of GAN-generated skin lesion images.

Materials and Methods. StyleGAN3-T [4] was trained on a dataset of dermoscopic images of melanoma [5–7] with adaptive discriminator augmentation and transfer learning. A total of 25 synthetic melanoma images were generated and randomly mixed with 25 real melanoma images, resulting in a 50-image dataset. Seventeen board-certified dermatologists with varying levels of experience (low <4 years, medium 5–8 years, high >8 years) participated in the evaluation. Participants were blinded to image origin and asked to classify each image as real or synthetic. They also assessed the presence of 16 defined dermoscopic patterns according to standardized definitions and rated four dimensions (image quality, skin texture, visual realism, and color realism) on a 7-point Likert scale. Additionally, participants reported their confidence in each classification decision. Statistical analyses included Chi-square tests for categorical comparisons, and Fleiss’ Kappa and Krippendorff’s Alpha were used to measure inter-rater agreement.

Results. Real images were consistently rated higher than synthetic images across all qualitative dimensions: image quality (rated high for 15.8% of real vs. 11.3% of synthetic images), skin texture (22.4% vs. 13.4%), and visual realism (22.6% vs. 13.2%), all with p < 0.001. Confidence in evaluations was also significantly greater for real images, with high confidence reported in 17.4% of real cases compared to 8.7% of synthetic ones (p < 0.001). Regarding the recognition of image origin, the overall classification accuracy was 64%. Real images were correctly identified in 73% of cases, while only 56% of synthetic images were correctly classified as synthetic. Accuracy increased with expertise, from 59% in the low-experience group to 71% among high-experience dermatologists. Similarly, higher self-reported confidence was associated with improved performance (74% accuracy at the high confidence level). Recognition of specific dermoscopic features differed between real and synthetic images. The blue-white veil was detected in 29.1% of real images compared to 13.8% of synthetic ones (p < 0.001), and shiny white streaks in 22.6% vs. 7.9% (p < 0.001). Conversely, synthetic images were more frequently associated with irregular pigmented blotches (45.0% vs. 30.9%, p < 0.001). The multicomponent pattern, typically indicative of melanoma complexity, was identified in 40.6% of real images versus only 23.2% of synthetic ones (p < 0.001), suggesting a gap in the synthetic images’ structural fidelity (Table 1). Inter-rater agreement for the classification of real versus synthetic images was low, with a Fleiss’ kappa of 0.183. Pattern-recognition agreement also remained weak (e.g., kappa < 0.3 for most features), underscoring variability in expert interpretations. Further subgroup analyses showed that images rated as highly realistic or evaluated with high confidence were more likely to be classified correctly, with accuracy rising to 74% in the highest-confidence subgroup.

Conclusions. Synthetic melanoma lesions generated using StyleGAN3-T demonstrate visually convincing features and were frequently perceived as real, yet consistently underperformed compared to real images in diagnostic quality and structural detail. Participants often struggled to distinguish synthetic from real lesions, particularly when realism ratings were medium to high. Critical diagnostic patterns, such as the blue-white veil and shiny white streaks, were significantly underrepresented in synthetic images. These limitations were reflected in the lower classification confidence and weaker inter-rater agreement. Despite these challenges, the study highlights the potential of synthetic data to approach realism levels sufficient for research and educational use. Qualitative validation by dermatologists is essential to benchmark the readiness of synthetic images for real-world medical applications. As generative models continue to evolve, expert evaluation should remain a key component of validation pipelines to ensure clinical and pedagogical reliability.
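The inter-rater agreement statistic reported in the abstract (Fleiss’ kappa of 0.183 for real-vs-synthetic classification) is computed from a table of per-image category counts. A minimal sketch of that calculation in plain Python; the vote counts below are illustrative placeholders, not the study’s data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories count table.

    counts: list of rows, one per image; each row holds the number of
    raters choosing each category (e.g. [votes_real, votes_synthetic]).
    Every row must sum to the same number of raters n.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # P_i: observed agreement among raters on subject i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects  # mean observed agreement

    # p_j: marginal proportion of each category over all ratings,
    # giving the chance-agreement term P_e
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 images, 17 raters each, with counts of
# ("classified real", "classified synthetic") votes per image.
ratings = [[12, 5], [9, 8], [14, 3], [7, 10]]
print(round(fleiss_kappa(ratings), 3))
```

Values near 0 indicate agreement barely above chance, which is how the reported kappa of 0.183 should be read; values near 1 indicate near-unanimous rating.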
| DOI: | 10.54103/2282-0930/29361 |
|---|---|