Difference between revisions of "Concept drift"

<!-- Please do not remove or change this AfD message until the discussion has been closed. -->
{{survived}}
{{Article for deletion/dated|page=Concept drift|timestamp=20181124062117|year=2018|month=November|day=24|substed=yes|help=off}}
 
<!-- Once discussion is closed, please place on talk page: {{Old AfD multi|page=Concept drift|date=24 November 2018|result='''keep'''}} -->
 
<!-- End of AfD message, feel free to edit beyond this point -->
 
In [[predictive analytics]] and [[machine learning]], '''concept drift''' means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because predictions become less accurate as time passes.
 
 
 
The term ''concept'' refers to the quantity to be predicted. More generally, it can also refer to other phenomena of interest besides the target concept, such as an input, but, in the context of concept drift, the term commonly refers to the target variable.
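
One common way to write this definition down is sketched below. The notation is an assumption made for illustration and is not taken from this article: <code>X</code> denotes the model inputs, <code>y</code> the target variable, and <code>p_t</code> the joint distribution of the data-generating process at time <code>t</code>. Drift between two time points <code>t_0</code> and <code>t_1</code> then means that the joint distribution differs for at least some input:

```latex
\exists X : \quad p_{t_0}(X, y) \neq p_{t_1}(X, y),
\qquad \text{where} \qquad
p_t(X, y) = p_t(y \mid X)\, p_t(X)
```

Under this reading, a change in <code>p_t(y | X)</code> alters the relationship the model has learned, whereas a change in <code>p_t(X)</code> alone only shifts which inputs the model encounters.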
 
 
 
==Examples==
 
In a [[fraud detection]] application the target concept may be a [[Binary numeral system|binary]] attribute FRAUDULENT with values "yes" or "no" that indicates whether a given transaction is fraudulent. Or, in a [[weather prediction]] application, there may be several target concepts such as TEMPERATURE, PRESSURE, and HUMIDITY.
 
 
 
The behavior of the customers in an [[online shop]] may change over time. For example, suppose that weekly merchandise sales are to be predicted and that a [[predictive modelling|predictive model]] has been developed that works satisfactorily. The model may use inputs such as the amount of money spent on [[advertising]], [[Promotion (marketing)|promotions]] being run, and other metrics that may affect sales. The model is likely to become less and less accurate over time; this is concept drift. In the merchandise sales application, one reason for concept drift may be seasonality, which means that shopping behavior changes seasonally. For example, sales may be higher in the winter holiday season than during the summer.
 
 
 
==Possible remedies==
 
 
 
To prevent deterioration in [[prediction]] accuracy because of concept drift, both active and passive solutions can be adopted. Active solutions rely on triggering mechanisms, e.g., change-detection tests (Basseville and Nikiforov, 1993; Alippi and Roveri, 2007), to explicitly detect concept drift as a change in the statistics of the data-generating process. Under stationary conditions, any fresh information made available can be integrated to improve the model. When concept drift is detected, by contrast, the current model is no longer up to date and must be replaced with a new one to maintain prediction accuracy (Gama et al., 2004; Alippi et al., 2011). In passive solutions, the model is instead continuously updated, e.g., by retraining it on the most recently observed samples (Widmer and Kubat, 1996) or by maintaining an ensemble of classifiers (Elwell and Polikar, 2011).
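
As a rough illustration of an active solution, the following Python sketch monitors a stream of 0/1 prediction errors and raises an alarm when the error rate rises significantly above its best observed level. It is written in the spirit of the drift detection method of Gama et al. (2004), but the class name <code>SimpleDriftDetector</code>, the helper functions, the 30-sample warm-up, and the 2 and 3 standard deviation thresholds are assumptions chosen for this example, not a reference implementation.

```python
import math


class SimpleDriftDetector:
    """Illustrative error-rate monitor (hypothetical name, not a library class)."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples      # warm-up before any decision is made
        self.warning_level = warning_level  # assumed threshold, in standard deviations
        self.drift_level = drift_level      # assumed threshold, in standard deviations
        self.reset()

    def reset(self):
        """Forget all statistics, e.g. after the model has been replaced."""
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Feed one outcome (1 = wrong prediction, 0 = correct); return a status."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its binomial standard deviation
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) error level seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        # Strict comparisons avoid an alarm when the error rate is exactly zero.
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


def synthetic_errors(change_point=1000, length=1500):
    """Deterministic error stream: every 10th prediction is wrong before
    change_point, every 2nd prediction is wrong afterwards."""
    stream = []
    for i in range(length):
        period = 10 if i < change_point else 2
        stream.append(1 if i % period == 0 else 0)
    return stream


def first_drift_index(errors):
    """Return the position of the first drift alarm, or None if there is none."""
    detector = SimpleDriftDetector()
    for i, error in enumerate(errors):
        if detector.update(error) == "drift":
            return i
    return None


if __name__ == "__main__":
    print("first drift alarm at index:", first_drift_index(synthetic_errors()))
```

In a complete system, a "drift" status would trigger replacing or retraining the model, after which the detector starts again from empty statistics. The "warning" status can be used to start buffering recent samples for that retraining.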
 
 
 
Contextual information, when available, can be used to better explain the causes of the concept drift: for instance, in the sales prediction application, concept drift might be compensated for by adding information about the season to the model. Providing information about the time of the year is likely to decrease the rate of deterioration of the model, but concept drift is unlikely to be eliminated altogether. This is because actual shopping behavior does not follow any static, [[finite model]]. New factors that influence shopping behavior may arise at any time, and the influence of the known factors or their interactions may change.
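
One simple way to supply information about the season to a model is to encode the calendar month as a point on a circle, so that December and January end up close together. The sketch below is only an illustration of that idea; the function names <code>month_features</code> and <code>feature_distance</code> are hypothetical, and other encodings (for example one indicator per month) would serve the same purpose.

```python
import math


def month_features(month):
    """Map a calendar month (1-12) to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * math.pi * (month - 1) / 12.0
    return math.sin(angle), math.cos(angle)


def feature_distance(month_a, month_b):
    """Euclidean distance between the encodings of two months."""
    sin_a, cos_a = month_features(month_a)
    sin_b, cos_b = month_features(month_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


if __name__ == "__main__":
    # December is as close to January as January is to February,
    # which a plain 1-12 integer feature would not express.
    print(feature_distance(12, 1), feature_distance(1, 2), feature_distance(1, 7))
```

The two resulting values would be appended to the other model inputs, such as advertising spend and promotions.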
 
 
 
Concept drift cannot be avoided for complex phenomena that are not governed by fixed [[Physical law|laws of nature]]. Processes that arise from human activity, such as [[socioeconomic]] processes, as well as [[biological processes]], are likely to experience concept drift. Therefore, any model of such a process needs periodic retraining, also known as refreshing.
 
 
 
==Software==
 
* [[RapidMiner]] (formerly YALE, Yet Another Learning Environment): free open-source software for knowledge discovery, data mining, and machine learning. It also features data stream mining, learning time-varying concepts, and tracking drifting concepts when used in combination with its data stream mining plugin (formerly the concept drift plugin).
 
* [https://web.archive.org/web/20070322063617/http://iaia.lcc.uma.es/Members/mbaena/papers/eddm/ EDDM (Early Drift Detection Method)]: free open-source implementation of drift detection methods in [[Weka (machine learning)]].
 
* [[MOA (Massive Online Analysis)]]: free open-source software specifically for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radial basis functions. MOA supports bi-directional interaction with [[Weka (machine learning)]].
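
Prequential evaluation, mentioned in the MOA entry above, tests each incoming example on the current model before using it for training. The following Python sketch shows the idea on a toy stream. It does not use MOA's API: the class <code>WindowedMajorityClassifier</code>, the function <code>prequential_accuracy</code>, and the window size of 20 are hypothetical choices made for this illustration. The windowed learner also doubles as a minimal example of a passive solution that keeps only the most recently observed samples.

```python
from collections import Counter, deque


class WindowedMajorityClassifier:
    """Toy learner that predicts the most frequent label among recent samples.

    window_size=None keeps every label ever seen (no forgetting).
    """

    def __init__(self, window_size=20):
        self.labels = deque(maxlen=window_size)

    def predict(self, x):
        if not self.labels:
            return None  # no prediction possible before the first label
        return Counter(self.labels).most_common(1)[0][0]

    def learn(self, x, y):
        self.labels.append(y)  # the oldest label drops out once the window is full


def prequential_accuracy(model, stream):
    """Test-then-train: score each example first, then learn from it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        model.learn(x, y)
        total += 1
    return correct / total if total else 0.0


def abrupt_drift_stream(length=200, change_point=100):
    """Toy stream whose label switches from 0 to 1 at change_point."""
    return [(i, 0 if i < change_point else 1) for i in range(length)]


if __name__ == "__main__":
    stream = abrupt_drift_stream()
    print("windowed :", prequential_accuracy(WindowedMajorityClassifier(20), stream))
    print("no window:", prequential_accuracy(WindowedMajorityClassifier(None), stream))
```

On this toy stream the windowed learner recovers shortly after the change, while the learner without forgetting keeps predicting the old label.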
 
 
 
==Datasets==
 
 
 
===Real===
 
* '''Airline''', approximately 116 million flight arrival and departure records (cleaned and sorted) compiled by E.Ikonomovska. Reference: Data Expo 2009 Competition [http://stat-computing.org/dataexpo/2009/]. [http://kt.ijs.si/elena_ikonomovska/data.html Access]
 
* '''Chess.com''' (online games) and '''Luxembourg''' (social survey) datasets compiled by I.Zliobaite. [https://sites.google.com/site/zliobaite/resources-1 Access]
 
* '''ECUE spam''', two datasets, each consisting of more than 10,000 emails collected over a period of approximately 2 years by an individual. [http://www.comp.dit.ie/sjdelany/Dataset.htm Access] from S.J.Delany's webpage.
 
* '''Elec2''', electricity demand, 2 classes, 45312 instances. Reference: M.Harries, Splice-2 comparative evaluation: Electricity pricing, Technical report, The University of New South Wales, 1999. [http://www.inescporto.pt/~jgama/ales/ales_5.html Access] from J.Gama webpage. [https://arxiv.org/pdf/1301.3524v1.pdf Comment on applicability].
 
* '''PAKDD'09 competition''' data represents a credit evaluation task. It was collected over a five-year period. Unfortunately, the true labels were released only for the first part of the data. [http://sede.neurotech.com.br/PAKDD2009/ Access]
 
* '''Sensor stream''' and '''Power supply stream''' datasets are available from X. Zhu's Stream Data Mining Repository.  [http://www.cse.fau.edu/~xqzhu/stream.html Access]
 
* '''SMEAR''' is a benchmark data stream with many missing values. It contains environment observation data collected over 7 years; the task is to predict cloudiness. [https://github.com/zliobaite/paper-missing-values Access]
 
* '''Text mining''', a collection of text mining datasets with concept drift, maintained by I.Katakis. [https://web.archive.org/web/20100704072013/http://mlkd.csd.auth.gr/concept_drift.html Access]
 
* '''Gas Sensor Array Drift Dataset''', a collection of 13910 measurements from 16 chemical sensors utilized for drift compensation in a discrimination task of 6 gases at various levels of concentrations. [http://archive.ics.uci.edu/ml/datasets/Gas+Sensor+Array+Drift+Dataset Access]
 
 
 
===Other===
 
* '''KDD'99 competition''' data contains ''simulated'' intrusions in a military network environment. It is often used as a benchmark to evaluate handling concept drift. [http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html Access]
 
 
 
===Synthetic===
 
* '''Extreme verification latency benchmark''', Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A.: Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp.&nbsp;873–881, 2015. [https://sites.google.com/site/nonstationaryarchive/ Access] from Nonstationary Environments – Archive.
 
* '''Sine, Line, Plane, Circle and Boolean Data Sets''', L.L.Minku, A.P.White, X.Yao, The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift,  IEEE Transactions on Knowledge and Data Engineering, vol.22, no.5, pp.&nbsp;730–742, 2010. [http://www.cs.le.ac.uk/people/llm11/opensource/ArtificialConceptDriftDataSets.zip Access] from L.Minku webpage.
 
* '''SEA concepts''', N.W.Street, Y.Kim, A streaming ensemble algorithm (SEA) for large-scale classification, KDD'01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001. [https://web.archive.org/web/20080315131143/http://www.liaad.up.pt/~jgama/ales/ales_5.html Access] from J.Gama webpage.
 
* '''STAGGER''', J.C.Schlimmer, R.H.Granger, Incremental Learning from Noisy Data, Mach. Learn., vol.1, no.3, 1986.
 
* '''Mixed''', J.Gama, P.Medas, G.Castillo, P.Rodrigues, Learning with drift detection, 2004.
 
 
 
===Data generation frameworks===
 
* L.L.Minku, A.P.White, X.Yao, The Impact of Diversity on On-line Ensemble Learning in the Presence of Concept Drift,  IEEE Transactions on Knowledge and Data Engineering, vol.22, no.5, pp.&nbsp;730–742, 2010. [http://www.cs.le.ac.uk/people/llm11/opensource/DriftsGenerator.zip Download] from L.Minku webpage.
 
* Lindstrom P, SJ Delany & B MacNamee (2008) Autopilot: Simulating Changing Concepts in Real Data In: Proceedings of the 19th Irish Conference on Artificial Intelligence & Cognitive Science, D Bridge, K Brown, B O'Sullivan & H Sorensen (eds.) p272-263 [http://www.comp.dit.ie/sjdelany/publications/aics08-pl.pdf PDF]
 
* Narasimhamurthy A., L.I. Kuncheva, A framework for generating data to simulate changing environments, Proc. IASTED, Artificial Intelligence and Applications, Innsbruck, Austria, 2007, 384–389 [https://wayback.archive-it.org/all/20110401035628/http://www.bangor.ac.uk/~mas00a/papers/anlkAIA07.pdf PDF] [http://pages.bangor.ac.uk/~mas00a/EPSRC_simulation_framework/changing_environments_stage1a.htm Code]
 
 
 
==Projects==
 
* [http://www.infer.eu/ INFER]: Computational Intelligence Platform for Evolving and Robust Predictive Systems (2010–2014), Bournemouth University (UK), Evonik Industries (Germany), Research and Engineering Centre (Poland)
 
* [http://www.win.tue.nl/~mpechen/projects/hacdais/ HaCDAIS]: Handling Concept Drift in Adaptive Information Systems (2008–2012), Eindhoven University of Technology (the Netherlands)
 
* [http://www.liaad.up.pt/~kdus/ KDUS]: Knowledge Discovery from Ubiquitous Streams, INESC Porto and Laboratory of Artificial Intelligence and Decision Support (Portugal)
 
* [http://www.cs.man.ac.uk/~gbrown/adept/ ADEPT]: Adaptive Dynamic Ensemble Prediction Techniques, University of Manchester (UK), University of Bristol (UK)
 
* [https://web.archive.org/web/20090309132402/http://www.aladdinproject.org/ ALADDIN]: autonomous learning agents for decentralised data and information networks (2005–2010)
 
 
 
==Benchmarks==
 
* [https://github.com/numenta/NAB NAB]: The Numenta Anomaly Benchmark, benchmark for evaluating algorithms for anomaly detection in streaming, real-time applications. (2014–2018)
 
 
 
==Meetings==
 
*2014
 
** [http://www.ieee-wcci2014.org/accepted-ss.htm Special Session on "Concept Drift, Domain Adaptation & Learning in Dynamic Environments"] at IEEE IJCNN 2014
 
*2013
 
** [https://sites.google.com/site/realstream2013/ RealStream] Real-World Challenges for Data Stream Mining Workshop-Discussion at the [[ECML PKDD]] 2013, Prague, Czech Republic.
 
** [http://aiai2013.cut.ac.cy/leaps-2013/ LEAPS 2013] The 1st International Workshop on Learning stratEgies and dAta Processing in nonStationary environments
 
*2011
 
** [http://www.icmla-conference.org/icmla11/LEE.htm LEE 2011] Special Session on Learning in evolving environments and its application on real-world problems at ICMLA'11
 
** [http://wwwis.win.tue.nl/hacdais2011/ HaCDAIS 2011] The 2nd International Workshop on Handling Concept Drift in Adaptive Information Systems
 
** [https://web.archive.org/web/20101031152019/http://icais.uni-klu.ac.at/cfp.php ICAIS 2011] Track on Incremental Learning
 
** [https://web.archive.org/web/20110128002602/http://www.ijcnn2011.org/special_section.php IJCNN 2011] Special Session on Concept Drift and Learning Dynamic Environments
 
** [http://www.soft-computing.de/CIDUE2011.html CIDUE 2011] Symposium on Computational Intelligence in Dynamic and Uncertain Environments 
 
*2010
 
** [http://wwwis.win.tue.nl/hacdais2010/ HaCDAIS 2010] International Workshop on Handling Concept Drift in Adaptive Information Systems: Importance, Challenges and Solutions
 
** [http://www.icmla-conference.org/icmla10/CFP_SpecialSession9.html ICMLA10] Special Session on Dynamic learning in non-stationary environments
 
** [https://web.archive.org/web/20100425011804/http://www.liaad.up.pt/~jgama/SAC10/ SAC 2010] Data Streams Track at ACM Symposium on Applied Computing
 
** [https://web.archive.org/web/20100418214526/http://www.ornl.gov/sci/knowledgediscovery/SensorKDD-2010/ SensorKDD 2010] International Workshop on Knowledge Discovery from Sensor Data
 
** [https://web.archive.org/web/20100419123949/http://lyle.smu.edu/cse/dbgroup/IDA/StreamKDD2010/ StreamKDD 2010] Novel Data Stream Pattern Mining Techniques
 
** Concept Drift and Learning in Nonstationary Environments at [http://www.wcci2010.org/ IEEE World Congress on Computational Intelligence]
 
** [http://cig.iet.unipi.it/isda2010/files/MLMD.pdf MLMDS’2010] Special Session on Machine Learning Methods for Data Streams at the 10th International Conference on Intelligent Systems Design and Applications, ISDA’10
 
 
 
== Bibliographic references ==
 
Many papers have been published describing algorithms for concept drift detection. Only reviews, surveys, and overviews are listed here:
 
 
 
===Reviews===
 
 
 
* Krawczyk, B., Minku, L.L., Gama, J., Stefanowski, J., Wozniak, M. (2017). "Ensemble Learning for Data Stream Analysis: a survey", Information Fusion, Vol 37, pp.&nbsp;132–156,  [https://dx.doi.org/10.1016/j.inffus.2017.02.004 Access]
 
* Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., & Bontempi, G. (2015). Credit card fraud detection and concept-drift adaptation with delayed supervised information. In 2015 International Joint Conference on Neural Networks (IJCNN) (pp.&nbsp;1–8). IEEE. [http://www.ulb.ac.be/di/map/adalpozz/pdf/IJCNN2015_final.pdf PDF]
 
* C.Alippi, "Learning in Nonstationary and Evolving Environments", Chapter in ''Intelligence for Embedded Systems.'' Springer, 2014, 283pp, {{ISBN|978-3-319-05278-6}}.
 
* C.Alippi, R.Polikar, Special Issue on Learning In Nonstationary and Evolving Environments, IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 1, JANUARY 2014
 
* Dal Pozzolo, A., Caelen, O., Le Borgne, Y. A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert systems with applications, 41(10), 4915–4928. [http://www.ulb.ac.be/di/map/adalpozz/pdf/FraudDetectionPaper_8.pdf PDF]
 
* Zliobaite, I., Learning under Concept Drift: an Overview. Technical Report. 2009, Faculty of Mathematics and Informatics, Vilnius University: Vilnius, Lithuania. [http://zliobaite.googlepages.com/Zliobaite_CDoverview.pdf PDF]
 
* Jiang, J., A Literature Survey on Domain Adaptation of Statistical Classifiers. 2008. [http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/da_survey.pdf PDF]
 
* Kuncheva L.I. Classifier ensembles for detecting concept change in streaming data: Overview and perspectives, Proc. 2nd Workshop SUEMA 2008 (ECAI 2008), Patras, Greece, 2008, 5–10, [https://wayback.archive-it.org/all/20110401040229/http://www.bangor.ac.uk/~mas00a/papers/lkSUEMA2008.pdf PDF]
 
* Gaber, M. M., Zaslavsky, A., and Krishnaswamy, S., Mining Data Streams: A Review, in ACM SIGMOD Record, Vol. 34, No. 1, June 2005, {{ISSN|0163-5808}}
 
* Kuncheva L.I., Classifier ensembles for changing environments, Proceedings 5th International Workshop on Multiple Classifier Systems, MCS2004, Cagliari, Italy, in F. Roli, J. Kittler and T. Windeatt (Eds.), Lecture Notes in Computer Science, Vol 3077, 2004, 1–15, [https://wayback.archive-it.org/all/20110401040200/http://www.bangor.ac.uk/~mas00a/papers/lkMCS04.pdf PDF].
 
* Tsymbal, A., The problem of concept drift: Definitions and related work. Technical Report. 2004, Department of Computer Science, Trinity College: Dublin, Ireland. [https://www.cs.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf PDF]
 
 
 
==See also==
 
* [[Data stream mining]]
 
* [[Data mining]]
 
* [[Machine learning]]
 
 
 
[[Category:Data mining]]
 
[[Category:Machine learning]]
 

Latest revision as of 07:15, 8 December 2018
