Research on Mathematical Modeling and Statistical Analysis of Complex Datasets Based on Machine Learning

Authors

  • Songtao Ran, Sendelta International Academy Shenzhen, Shenzhen, China

DOI:

https://doi.org/10.54097/bz6ka354

Keywords:

Machine Learning; Mathematical Modeling; Statistical Analysis; Complex Datasets; Dynamic Statistical Constrained Deep Learning.

Abstract

With the explosive growth of data scale, the high-dimensional sparsity, dynamic evolution, and multi-source heterogeneity of complex datasets pose major challenges for traditional statistical modeling methods. Machine learning offers a new paradigm for modeling complex data through nonlinear function approximation and automatic feature learning. However, purely data-driven machine learning models suffer from poor interpretability and insufficient robustness. This paper proposes Dynamic Statistical Constrained Deep Learning (DSC-DL), a framework that integrates graph neural networks (GNNs), Bayesian dynamic regularization, and multimodal variational encoders to address these challenges. DSC-DL handles high-dimensional sparsity through structural feature selection and graph-embedding dimensionality reduction, captures dynamic evolution through time-aware latent variable modeling, and fuses multi-source heterogeneous data through joint variational inference and an adaptive fusion mechanism. A generalization error bound for DSC-DL under dependent identically distributed data is derived, providing a theoretical guarantee of its robustness. Experimental results on financial fraud detection, social topic prediction, and cancer typing diagnosis show that DSC-DL handles complex datasets effectively and achieves excellent prediction performance, robustness, and interpretability.

Published

23-12-2025

How to Cite

Ran, S. (2025). Research on Mathematical Modeling and Statistical Analysis of Complex Datasets Based on Machine Learning. Highlights in Science, Engineering and Technology, 159, 20-25. https://doi.org/10.54097/bz6ka354