The old-school formula that’s beating modern AI cancer tools

CO-EDP, VisionRI | Updated: 13-08-2025 18:29 IST | Created: 13-08-2025 18:29 IST

A newly published study has found that a decades-old statistical technique could dramatically improve the accuracy of machine learning models used to predict breast cancer outcomes. The research evaluates the Box–Cox transformation’s ability to handle skewed medical data and boost performance across a variety of algorithms, including logistic regression, support vector machines, random forests, XGBoost, and ensemble models.

The paper, titled “Machine Learning Techniques Improving the Box–Cox Transformation in Breast Cancer Prediction” and published in Electronics, compares the Box–Cox method with more common preprocessing strategies, such as logarithmic transformation and the Synthetic Minority Oversampling Technique (SMOTE). The findings indicate that Box–Cox, particularly with a transformation parameter of λ=1, consistently delivers higher accuracy and F1 scores across both synthetic and real-world datasets, positioning it as a potentially critical step in clinical data modeling workflows.
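
For context, the Box–Cox transform has a standard closed form: y(λ) = (y^λ − 1) / λ for λ ≠ 0, and ln(y) for λ = 0, defined only for positive values. The short sketch below uses scipy's implementation; the feature name and gamma parameters are illustrative, not taken from the study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    tumor_size = rng.gamma(shape=2.0, scale=3.0, size=1000)  # right-skewed, positive

    # Fixed lambda = 1, the paper's best-performing setting; note that
    # (y**1 - 1) / 1 simplifies to y - 1, i.e. a simple shift of the data.
    shifted = stats.boxcox(tumor_size, lmbda=1)

    # lmbda=None lets scipy estimate lambda by maximum likelihood instead.
    transformed, fitted_lambda = stats.boxcox(tumor_size)
    print(f"MLE-estimated lambda: {fitted_lambda:.3f}")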

Why skewed medical data limits AI accuracy

Medical datasets often present a major challenge for machine learning models: skewness in continuous variables such as age, tumor size, and survival months. This skewness can undermine the assumptions of certain algorithms, distort variance, and hinder the ability of models to draw clear separations between classes.

To test potential solutions, the study used two datasets: a synthetic gamma-distributed dataset of 1,000 samples to simulate right-skewed distributions, and a breast cancer dataset from the U.S. Surveillance, Epidemiology, and End Results (SEER) program containing 4,024 patient records from 2006 to 2010. The SEER cohort was split into “alive” and “dead” outcomes, with 3,408 and 616 cases respectively.
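
As a rough sketch of this experimental setup, the snippet below generates a 1,000-sample right-skewed gamma dataset like the study's synthetic one and performs a stratified train/test split of the kind typically used with an imbalanced cohort; the gamma parameters, outcome ratio, and split fraction are assumptions, not values reported in the paper.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Right-skewed synthetic features drawn from a gamma distribution;
    # the shape/scale values and three-feature layout are illustrative guesses.
    X = rng.gamma(shape=2.0, scale=2.0, size=(1000, 3))
    y = (rng.random(1000) < 0.15).astype(int)  # placeholder imbalanced outcome

    # A stratified split preserves the class ratio in both folds, which matters
    # for an imbalanced cohort like SEER's 3,408 "alive" vs. 616 "dead" cases.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )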

The study compared four preprocessing approaches: no transformation, the Box–Cox transformation at three λ settings (None, under which λ is estimated automatically from the data; 0.5; and 1), logarithmic transformation, and SMOTE for class-imbalance correction. Each approach was tested across six machine learning setups: logistic regression, support vector machines, random forest, XGBoost, a soft-voting ensemble, and a stacking ensemble that combines predictions from random forest and XGBoost using logistic regression as a meta-learner.
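
The two ensemble setups correspond to what scikit-learn calls soft voting and stacking; the sketch below is one plausible configuration, with hyperparameters left at illustrative defaults rather than the paper's settings.

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    base_models = [
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=0)),
    ]

    # Soft voting averages the base models' predicted class probabilities.
    voting = VotingClassifier(estimators=base_models, voting="soft")

    # Stacking feeds the base models' out-of-fold predictions into a
    # logistic regression meta-learner, as described in the study.
    stacking = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )

    # voting.fit(X_train, y_train); stacking.fit(X_train, y_train)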

How Box–Cox outperformed other preprocessing methods

The results were consistent and striking: the Box–Cox transformation, especially at λ=1, delivered superior results in both datasets and across almost all models tested.

On the SEER dataset, the stacking ensemble reached 94.53% accuracy and a 94.74% F1 score using Box–Cox at λ=1, outperforming all other configurations. Random forest achieved 94.29% accuracy under the same transformation, while XGBoost followed at 93.04%. Logistic regression and support vector machines, which are particularly sensitive to skewness, saw their performance rise from the high-80% range to around 92–93% when Box–Cox was applied.
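
For orientation, the accuracy and F1 numbers above are the standard classification metrics; the toy snippet below (with made-up labels, not the study's predictions) shows how they are computed in scikit-learn.

    from sklearn.metrics import accuracy_score, f1_score

    # Illustrative labels and predictions; the values are not the paper's.
    y_true = [1, 0, 1, 1, 0, 1, 0, 1]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
    print(f"accuracy: {accuracy_score(y_true, y_pred):.4f}")  # 0.8750
    print(f"F1 score: {f1_score(y_true, y_pred):.4f}")        # 0.8889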

In contrast, logarithmic transformation improved results compared to no preprocessing but lagged behind Box–Cox, with stacking ensembles achieving about 90.66% accuracy. SMOTE improved class balance but still underperformed, yielding stacking ensemble results closer to 87.33% accuracy.
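
SMOTE addresses class imbalance rather than skewness: it synthesizes minority-class samples by interpolating between nearest neighbors. A minimal sketch using the imbalanced-learn package follows; the package choice and class ratio are assumptions, not details from the paper.

    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    # Imbalanced toy data roughly mimicking the SEER outcome ratio (~85/15).
    X = rng.normal(size=(1000, 4))
    y = (rng.random(1000) < 0.15).astype(int)

    # SMOTE interpolates between minority-class nearest neighbors until
    # both classes are the same size (default sampling_strategy="auto").
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y), "->", np.bincount(y_res))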

The synthetic gamma dataset produced a similar trend, with Box–Cox again driving accuracy levels near or above 97% in some models. The method’s effectiveness across both simulated and real-world data reinforces its value as a generalizable preprocessing tool.

Implications for clinical AI and predictive healthcare

The findings have direct implications for the design of machine learning systems in healthcare, where data preprocessing is often overlooked in favor of algorithm selection or hyperparameter tuning. The study demonstrates that a well-chosen transformation can yield larger performance gains than more complex modeling strategies.

By stabilizing variance and reducing skew, the Box–Cox transformation produces a more symmetric, near-normal distribution of input features. This helps linear models such as logistic regression and SVM adhere more closely to their statistical assumptions, while also improving class separability for tree-based methods and ensemble architectures.
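
That variance-stabilizing effect is easy to verify numerically; the sketch below measures sample skewness before and after an MLE-fitted Box–Cox transform on synthetic right-skewed data (the gamma parameters are arbitrary).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.gamma(shape=2.0, scale=3.0, size=5000)  # strongly right-skewed

    transformed, lam = stats.boxcox(x)  # lambda fitted by maximum likelihood
    print(f"skewness before: {stats.skew(x):+.3f}")
    print(f"skewness after:  {stats.skew(transformed):+.3f} (lambda = {lam:.3f})")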

The research also highlights the importance of pairing preprocessing strategies with appropriate validation. The study employed statistical testing via ANOVA and Kruskal–Wallis to confirm that the improvements were not due to chance, alongside direct performance comparisons to verify consistency across the two datasets.
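
Both tests are one-liners in scipy; a minimal sketch comparing hypothetical per-fold accuracies across three preprocessing conditions (placeholder numbers, not the paper's) would look like this.

    from scipy import stats

    # Hypothetical per-fold accuracies for three preprocessing conditions.
    none_scores   = [0.88, 0.89, 0.87, 0.90, 0.88]
    log_scores    = [0.90, 0.91, 0.90, 0.92, 0.90]
    boxcox_scores = [0.94, 0.95, 0.94, 0.95, 0.94]

    # One-way ANOVA assumes normality; Kruskal-Wallis is its rank-based,
    # nonparametric counterpart, so running both hedges that assumption.
    f_stat, p_anova = stats.f_oneway(none_scores, log_scores, boxcox_scores)
    h_stat, p_kw = stats.kruskal(none_scores, log_scores, boxcox_scores)
    print(f"ANOVA p = {p_anova:.4g}, Kruskal-Wallis p = {p_kw:.4g}")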

In clinical contexts, these findings could translate into more reliable prediction systems for cancer prognosis, treatment response, or survival analysis. Enhanced model accuracy may lead to earlier detection of high-risk patients, better resource allocation, and more personalized treatment planning.

However, the study also notes limitations. The SEER dataset, while robust, is still limited in scope and may not capture all patient diversity. Moreover, Box–Cox requires parameter tuning (λ selection), and its computational cost could be significant when deployed in large-scale or real-time clinical decision support systems.

Future research directions include expanding the evaluation to other cancer datasets, integrating Box–Cox into federated learning pipelines to protect patient privacy, and testing multimodal data combinations such as genomics, imaging, and clinical notes.

First published in: Devdiscourse