AI and ML in medical studies: evolving guidelines

Artificial intelligence (AI) and machine learning (ML) techniques are revolutionizing the way healthcare is delivered. ML is a subset of AI in which software algorithms are designed and trained to learn from and act on data. Such algorithms can be employed to develop new prognostic and diagnostic methods with the advantage of being quick and accurate. Starting from the data of many patient samples, these algorithms generate and structure datasets, identify relations between variables, and predict outcomes or divide patients into groups. For example, software can learn to distinguish tumor tissue from normal tissue in immunohistochemical slides, or a program might be used to assign patients to different risk categories for a pathology according to a set of diagnostic parameters.
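As a concrete illustration, the sketch below trains a classifier to assign patients to a high- or low-risk category from a handful of diagnostic parameters. The synthetic data, the two risk classes, and the choice of a random-forest model are illustrative assumptions, not taken from any specific study.

```python
# Minimal sketch (with synthetic data): learning risk categories from
# diagnostic parameters. Not a real clinical model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic "diagnostic parameters" for 500 patients (e.g. age, blood
# pressure, cholesterol, glucose) and an assumed binary high/low risk label.
X = rng.normal(size=(500, 4))
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the classifier on the training set and check it on held-out patients.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```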
ML can be broadly divided into supervised and unsupervised ML, and the most suitable methodology should be selected according to the clinical question that needs to be answered. Supervised ML starts from labelled data to make predictions on new samples. In this setting, a training set of labelled data is used to teach the algorithm, which is then tested on a separate validation set of samples it has not seen before. This means that a component of manual work is still present, as the labelled dataset has to be tagged by an expert. Unsupervised ML, instead, learns from data to find previously unknown patterns or groupings among patient samples. This method does not require previously labelled data: during the training phase, the algorithm self-organizes so as to capture patterns in the data or produce classifications.
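To make the contrast with the supervised example above concrete, the following sketch groups patients without using any labels at all. The synthetic data, the use of k-means, and the choice of two clusters are assumptions made purely for illustration.

```python
# Minimal sketch of unsupervised learning: discovering patient groupings
# without any outcome labels. Synthetic data only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic biochemical parameters for 300 patients; no labels are provided,
# but two latent subgroups are hidden in the data.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 4)),
    rng.normal(loc=2.0, scale=1.0, size=(150, 4)),
])

# Standardize the features, then let the algorithm find groupings on its own.
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print("patients per discovered group:", np.bincount(clusters))
```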
For every method, it is fundamental that the ML model is trained on a large and diversified dataset, in order to avoid potential biases and to make the results reproducible across different settings. For example, if an ML algorithm is being designed to distinguish patients at high or low risk of cardiovascular pathologies from a dataset of biochemical parameters, it is important that the patients included in the training set come from different centers and are, for instance, of diverse ethnic groups and of both sexes, so that the algorithm can be reproduced across different hospitals worldwide. Similarly, the validation set should consist of an independent set of patients, to prove the accuracy of the algorithm. Finally, for the generalizability of the tool, it is important that the included parameters are ones commonly measured across hospitals, so that the ML tool can be easily integrated into different settings.
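One way to approximate such an independent validation in practice is to hold out entire centers, so that performance is measured on patients from hospitals the model has never seen. The sketch below does this with scikit-learn's GroupShuffleSplit; the data, center identifiers, and logistic-regression model are placeholder assumptions.

```python
# Minimal sketch: validating on patients from centers not used for training.
# Data, center identifiers, and the model are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

n_patients = 600
X = rng.normal(size=(n_patients, 5))                   # biochemical parameters
y = (X[:, 0] + rng.normal(scale=0.8, size=n_patients) > 0).astype(int)
centers = rng.integers(0, 6, size=n_patients)          # hospital of origin

# Keep all patients from a given center together: the held-out centers act as
# an independent validation set, mimicking deployment at a new hospital.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=centers))

model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
auc = roc_auc_score(y[valid_idx], model.predict_proba(X[valid_idx])[:, 1])
print("AUC on unseen centers:", round(auc, 3))
```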
Given the great potential of ML tools, they are being increasingly utilized in clinical studies. However, the evaluation of these AI interventions can be poorly reported, and studies might omit key information needed to judge their validity. To demonstrate the efficacy of AI interventions, they must be assessed in randomized trials. A few published trials including an AI component exist, and, while this is certainly encouraging, the relative novelty of AI within trials meant that the SPIRIT and CONSORT guidelines were found to be underused for AI interventions. To overcome this issue, in 2020, a project was launched to develop AI-specific extensions and elaborations to the SPIRIT and CONSORT guidelines. The SPIRIT-AI and CONSORT-AI guidelines encourage not only the availability of data and algorithm code, but also a clear description of the AI intervention, especially regarding the human–AI interaction in the handling of input data and the role of the AI in clinical decision making. While these guidelines are certainly useful for editors and peer reviewers, they are also fundamental for investigators, as they can be used in the early stages of planning to optimally design trials involving AI.