ADAPTIVE HYBRID ENSEMBLE FRAMEWORK FOR REAL-TIME ANOMALY DETECTION IN LARGE-SCALE DATA STREAMS

Authors

  • Zaynidin Karshiyev
  • Mirzabek Sattarov
  • Farkhodjon Erkinov

DOI:

https://doi.org/10.47390/ts-v3i12y2025N09

Keywords:

anomaly detection, data streams, ensemble learning, concept drift, real-time processing, adaptive algorithms, machine learning.

Abstract

This paper presents an adaptive hybrid ensemble framework for real-time anomaly detection in large-scale data streams, addressing the challenges of concept drift, high-velocity data processing, and computational efficiency in modern distributed systems. We propose a Hybrid Statistical-Machine Learning Anomaly Detection (HSML-AD) algorithm that combines sliding-window statistical analysis with incremental machine learning techniques. The framework employs a three-tier architecture: (1) lightweight statistical pre-filtering using modified Z-score and interquartile range methods, (2) adaptive feature extraction through exponential moving averages, and (3) ensemble classification using an online random forest with dynamic weight adjustment based on recent prediction accuracy. Experimental evaluation on five benchmark datasets (KDD Cup 99, NSL-KDD, CICIDS2017, Yahoo S5, and the Numenta Anomaly Benchmark) demonstrates that HSML-AD achieves an average F1-score of 94.3%, precision of 93.8%, and recall of 94.7%, outperforming baseline methods including Isolation Forest (F1: 87.2%), LSTM-Autoencoder (F1: 89.6%), and SPOT (F1: 86.4%). The algorithm sustains a processing throughput of 127,000 records per second with an average latency of 7.8 milliseconds on commodity hardware. The novelty lies in the adaptive weighting mechanism, which dynamically adjusts ensemble components based on data stream characteristics and recent performance, coupled with a memory-efficient incremental learning strategy that caps the model size at 45 MB while maintaining detection accuracy.
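Since no implementation is reproduced here, the following Python sketch illustrates how the three tiers described above could fit together. Everything in it is an illustrative assumption rather than the authors' code: the class names, the 3.5 modified Z-score cutoff, the 1.5 IQR fence multiplier, the EMA smoothing factor, and the simple accuracy-decay weighting that stands in for the paper's online random forest.

```python
from collections import deque

class StatisticalPrefilter:
    """Tier 1 (sketch): lightweight pre-filter combining the modified
    Z-score (MAD-based) and interquartile-range fences over a sliding
    window. Thresholds are conventional defaults, not the paper's."""

    def __init__(self, window_size=256, mz_threshold=3.5, iqr_factor=1.5):
        self.window = deque(maxlen=window_size)
        self.mz_threshold = mz_threshold  # 3.5 is the usual modified Z-score cutoff
        self.iqr_factor = iqr_factor      # 1.5 is the textbook IQR fence multiplier

    def is_suspicious(self, x: float) -> bool:
        self.window.append(x)
        if len(self.window) < 30:
            return False                  # too little history to judge
        vals = sorted(self.window)
        n = len(vals)
        median = vals[n // 2]
        mad = sorted(abs(v - median) for v in vals)[n // 2]
        q1, q3 = vals[n // 4], vals[(3 * n) // 4]
        iqr = q3 - q1
        mz_hit = mad > 0 and abs(0.6745 * (x - median) / mad) > self.mz_threshold
        iqr_hit = iqr > 0 and not (q1 - self.iqr_factor * iqr <= x <= q3 + self.iqr_factor * iqr)
        return mz_hit or iqr_hit

class EmaFeatures:
    """Tier 2 (sketch): adaptive feature extraction via an exponential
    moving average; emits the raw value and its deviation from trend."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.ema = None

    def extract(self, x: float) -> tuple:
        self.ema = x if self.ema is None else self.alpha * x + (1 - self.alpha) * self.ema
        return (x, x - self.ema)

class WeightedEnsemble:
    """Tier 3 (sketch): ensemble scoring whose member weights decay toward
    recent accuracy; a simple stand-in for the online random forest."""

    def __init__(self, members, decay: float = 0.95):
        self.members = list(members)      # callables: features -> score in [0, 1]
        self.weights = [1.0] * len(self.members)
        self.decay = decay

    def score(self, features) -> float:
        total = sum(self.weights)
        return sum(w * m(features) for w, m in zip(self.weights, self.members)) / total

    def update(self, features, is_anomaly: bool) -> None:
        # Reward members whose vote matched the (possibly delayed) label.
        for i, m in enumerate(self.members):
            correct = (m(features) >= 0.5) == bool(is_anomaly)
            self.weights[i] = self.decay * self.weights[i] + (1 - self.decay) * float(correct)
```

In such a pipeline, every record passes the cheap tier-1 filter, and only flagged records are turned into tier-2 features and scored by the tier-3 ensemble; that staged design is what makes the low per-record latency reported in the abstract plausible.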

The proposed framework is applicable to network intrusion detection, IoT sensor monitoring, financial fraud detection, and industrial system health monitoring, particularly in resource-constrained environments requiring real-time processing.

Submitted

2025-12-26

Published

2025-12-27

How to Cite

Karshiyev, Z., Sattarov, M., & Erkinov, F. (2025). ADAPTIVE HYBRID ENSEMBLE FRAMEWORK FOR REAL-TIME ANOMALY DETECTION IN LARGE-SCALE DATA STREAMS. Techscience Uz - Topical Issues of Technical Sciences, 3(12), 74–93. https://doi.org/10.47390/ts-v3i12y2025N09
