M. Stonebraker and U. Cetintemel, "One size fits all": an idea whose time has come and gone, 21st International Conference on Data Engineering (ICDE'05), pp.2-11, 2005.

D. R. Turner, J. Gantz, and S. Minton, The digital universe of opportunities: Rich data and the increasing value of the internet of things, 2014.

Facts and Stats About The Big Data Industry, Webpage, http://cloudtweaks.com/2015/03/surprising-facts-and-stats-about-the-big-data-industry

M. Stonebraker, The case for shared nothing, Database Engineering, vol.9, pp.4-9, 1986.

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. Dewitt et al., A comparison of approaches to large-scale data analysis, Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD '09
DOI : 10.1145/1559845.1559865

I. F. Ilyas and X. Chu, Trends in Cleaning Relational Data: Consistency and Deduplication, Foundations and Trends in Databases, vol.5, issue.4, pp.281-393, 2015.
DOI : 10.1561/1900000045

D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar et al., GAMMA - a high performance dataflow database machine, Proceedings of the 12th International Conference on Very Large Data Bases, ser. VLDB '86, pp.228-237, 1986.

. Teradata,

. Greenplum,

. Netezza,

. Mysql-cluster,

A. Data,

. Postgres-xc,

. Stado,

A. Redshift,

ParAccel Analytic Platform, Webpage

A. Tez,

Streaming Data, Webpage, https://aws.amazon.com/streaming-data

S. Apache,

A. Samza,

A. Flink,

Spark Streaming, Webpage, http://spark.apache.org/streaming/

A. Storm,

R. Macnicol and B. French, Sybase IQ Multiplex - Designed For Analytics, Proceedings of the Thirtieth International Conference on Very Large Data Bases, ser. VLDB '04. VLDB Endowment, pp.1227-1230, 2004.
DOI : 10.1016/B978-012088469-8.50111-X

. Available,

A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver et al., The vertica analytic database, Proc. VLDB Endow, pp.1790-1801, 2012.
DOI : 10.14778/2367502.2367518

C. Baru and G. Fecteau, An overview of DB2 parallel edition, ACM SIGMOD Record, vol.24, issue.2, pp.460-462, 1995.
DOI : 10.1145/568271.223876

M. Gorawski, A. Gorawska, and K. Pasterak, A Survey of Data Stream Processing Tools, Information Sciences and Systems, p.295, 2014.
DOI : 10.1007/978-3-319-09465-6_31

. Deng, The data civilizer system, CIDR, 2017.

Improving Data Preparation for Business Analytics, Webpage, https://tdwi.org/research/2016/07/best-practices-report-improving-data-preparation-for-business-analytics

N. Swartz, Gartner warns firms of 'dirty data', Information Management Journal, 2007.

C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications), 2006.

T. White, Hadoop: The Definitive Guide, 2009.

A. Floratou, U. F. Minhas, and F. Özcan, SQL-on-Hadoop, Proceedings of the VLDB Endowment, vol.7, issue.12, pp.1295-1306, 2014.
DOI : 10.14778/2732977.2733002

M. Kornacker, Impala: A modern, open-source SQL engine for hadoop, CIDR, 2015.

J. Dean and L. A. Barroso, The tail at scale, Communications of the ACM, vol.56, issue.2, 2013.
DOI : 10.1145/2408776.2408794

Y. Tian, I. Alagiannis, E. Liarou, A. Ailamaki, P. Michiardi et al., DiNoDB, Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014), Data4U '14, 2014.
DOI : 10.1145/2658840.2658841

S. R. Labs,

A. Abouzeid, HadoopDB, VLDB, 2009.
DOI : 10.14778/1687627.1687731

I. Alagiannis, NoDB: efficient query execution on raw data files, SIGMOD, 2012.

J. Baker, C. Bond, J. Corbett, J. J. Furman, A. Khorlin et al., Megastore: Providing scalable, highly available storage for interactive services, CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research Online Proceedings. www.cidrdb.org, pp.223-234, 2011.

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, The Hadoop Distributed File System, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
DOI : 10.1109/MSST.2010.5496972

J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost et al., Spanner, ACM Transactions on Computer Systems, vol.31, issue.3, pp.1-822, 2013.
DOI : 10.1145/2518037.2491245

J. Dean, MapReduce: Simplified Data Processing on Large Clusters, USENIX OSDI, 2004.

J. Dittrich, Hadoop++, VLDB, 2010.
DOI : 10.14778/1920841.1920908

J. Dittrich, J. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal et al., Only aggressive elephants are fast elephants, Proc. of VLDB, pp.1591-1602, 2012.
DOI : 10.14778/2350229.2350272

M. Y. Eltabakh, CoHadoop, VLDB, 2011.
DOI : 10.14778/2002938.2002943

A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata, Column-oriented storage techniques for MapReduce, Proceedings of the VLDB Endowment, vol.4, issue.7, 2011.
DOI : 10.14778/1988776.1988778

Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain et al., RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems, Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pp.1199-1208, 2011.

S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki, Here are my data files. Here are my queries. Where are my results?, CIDR'11, pp.57-68, 2011.

J. Lin, MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!, Big Data, vol.1, issue.1, 2013.
DOI : 10.1089/big.2012.1501

J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey et al., F1, Proceedings of the VLDB Endowment, vol.6, issue.11, 2013.
DOI : 10.14778/2536222.2536232

K. Shvachko, The Hadoop distributed file system, Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST '10, 2010.

M. Stonebraker and A. Weisberg, The voltdb main memory dbms, IEEE Data Eng. Bull, vol.36, issue.2, pp.21-27, 2013.

R. S. Xin, Shark, Proceedings of the 2013 international conference on Management of data, SIGMOD '13, 2013.
DOI : 10.1145/2463676.2465288

M. Zaharia, Spark: Cluster Computing with Working Sets, USENIX HotCloud, 2010.

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, Pig Latin, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD '08, 2008.
DOI : 10.1145/1376616.1376726

A. Pig,

. Sqoop, Webpage, www.vertica.com/. [62] Hadoop

, Postgresql

M. Kornacker, Impala: A modern, open-source sql engine for hadoop, Proc. CIDR '15, 2015.

J. Giceva, T. Salomie, A. Schüpbach, G. Alonso, and T. Roscoe, Cod: Database / operating system co-design, CIDR, 2013.

M. Zaharia, Spark: Cluster computing with working sets, Proc. of USENIX HotCloud, 2010.

A. Spark,

A. Storm,

M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu et al., Spark SQL, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp.1383-1394, 2015.
DOI : 10.1007/3-540-59451-5_2

M. Zaharia, Discretized streams, Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, 2013.
DOI : 10.1145/2517349.2522737

, The Lambda Architecture Webpage

J. B. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.

M. Ester, A density-based algorithm for discovering clusters in large spatial databases with noise, Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.

Discardable Distributed Memory: Supporting Memory Storage in HDFS

A. Abouzied, Invisible loading, Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, 2013.
DOI : 10.1145/2452376.2452377

Y. W. Teh, D. Newman, and M. Welling, A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation, Advances in neural information processing systems, 2006.
DOI : 10.21236/ADA629956

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

Y. Chen, S. Alspaugh, and R. Katz, Interactive query processing in big data systems: A cross-industry study of MapReduce workloads, Proc. of VLDB, 2012.
DOI : 10.21236/ADA561769

K. Ren, Hadoop's adolescence, Proc. of VLDB, 2013.
DOI : 10.14778/2536206.2536213

H. Li, Tachyon, Proceedings of the ACM Symposium on Cloud Computing, SOCC '14, 2014.
DOI : 10.1145/2517349.2522737

P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier, HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, DMTCS Proceedings, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00406166

K. Krish, A. Anwar, and A. R. Butt, hatS: A Heterogeneity-Aware Tiered Storage for Hadoop, 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.502-511, 2014.
DOI : 10.1109/CCGrid.2014.51
URL : http://people.cs.vt.edu/~butta/docs/ccgrid2014-hats.pdf

U. One,

A. Parquet,

A. Hive and -. Files, Webpage

J. Lefevre, MISO, Proceedings of the 2014 ACM SIGMOD international conference on Management of data, SIGMOD '14, 2014.
DOI : 10.1145/2588555.2588568

M. Zaharia, Resilient Distributed Datasets, NSDI, 2012.
DOI : 10.1145/2886107.2886110

Y. Cheng and F. Rusu, Parallel in-situ data processing with speculative loading, Proceedings of the 2014 ACM SIGMOD international conference on Management of data, SIGMOD '14, p.14
DOI : 10.1145/2588555.2593673

Y. Cheng and F. Rusu, SCANRAW, ACM Transactions On Database Systems, 2015.
DOI : 10.1109/PACT.2011.9

Y. Cheng, C. Qin, and F. Rusu, GLADE, Proceedings of the 2012 international conference on Management of Data, SIGMOD '12
DOI : 10.1145/2213836.2213936

S. Melnik, Dremel: Interactive analysis of web-scale datasets, Proc. of the 36th Int'l Conf on Very Large Data Bases, 2010.

Webpage, https://en.wikipedia.org/wiki/FITS

Reservoir sampling

W. Fan, Incremental Detection of Inconsistencies in Distributed Data, 2012 IEEE 28th International Conference on Data Engineering, pp.1367-1383, 2014.
DOI : 10.1109/ICDE.2012.82

W. Fan, Incremental Detection of Inconsistencies in Distributed Data, 2012 IEEE 28th International Conference on Data Engineering, pp.318-329, 2012.
DOI : 10.1109/ICDE.2012.82

. Gracia-tinedo, Dissecting UbuntuOne, Proceedings of the 2015 ACM Conference on Internet Measurement Conference, IMC '15, pp.155-168, 2015.
DOI : 10.1109/ICC.2014.6883506

Z. Khayyat, BigDansing, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp.1215-1230, 2015.
DOI : 10.1145/2463676.2463706

. Arocena, Messing up with BART, Proc. of VLDB, pp.36-47, 2015.
DOI : 10.14778/2850578.2850579

. Dallachiesa, NADEEF, Proceedings of the 2013 international conference on Management of data, SIGMOD '13, pp.541-552, 2013.
DOI : 10.1145/2463676.2465327

X. Chu, Holistic data cleaning: Putting violations into context, 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp.458-469, 2013.
DOI : 10.1109/ICDE.2013.6544847

. Jeffery, A pipelined framework for online cleaning of sensor data streams, Tech. Rep, 2005.

. Zhao, A model-based approach for RFID data stream cleansing, Proc. of CIKM, pp.862-871, 2012.

S. Song, SCREEN, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp.827-841, 2015.
DOI : 10.1109/ICDE.2007.367867

Q. Lin, Scalable Distributed Stream Join Processing, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp.811-825, 2015.
DOI : 10.1109/TKDE.2015.2427795

. Elseidy, Scalable and adaptive online joins, Proc. of VLDB, pp.441-452, 2014.
DOI : 10.14778/2732279.2732281
URL : http://infoscience.epfl.ch/record/190035/files/paper.pdf

. Wang, Crowd-Based Deduplication, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pp.1263-1277, 2015.
DOI : 10.14778/2536336.2536337

X. Chu, KATARA: A data cleaning system powered by knowledge bases and crowdsourcing, Proc. of SIGMOD, pp.1247-1261, 2015.

. Beskales, Sampling the repairs of functional dependency violations under hard constraints, Proc. of VLDB, pp.197-207, 2010.
DOI : 10.14778/1920841.1920870

P. Bohannon, Conditional Functional Dependencies for Data Cleaning, 2007 IEEE 23rd International Conference on Data Engineering, pp.746-755, 2007.
DOI : 10.1109/ICDE.2007.367920

. Interlandi, Proof positive and negative in data cleaning, 2015 IEEE 31st International Conference on Data Engineering
DOI : 10.1109/ICDE.2015.7113269

Q. Chen, Repairing Functional Dependency Violations in Distributed Data, DASFAA, pp.441-457, 2015.
DOI : 10.1007/978-3-319-18120-2_26

. Kolahi, On approximating optimum repairs for functional dependency violations, Proceedings of the 12th International Conference on Database Theory, ICDT '09, pp.53-62, 2009.
DOI : 10.1145/1514894.1514901

M. Volkovs, Continuous data cleaning, 2014 IEEE 30th International Conference on Data Engineering, pp.244-255, 2014.
DOI : 10.1109/ICDE.2014.6816655

T. Akidau, The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing, Proc. of VLDB, pp.1792-1803, 2015.

. Fernandez, Liquid: Unifying nearline and offline big data integration, CIDR, 2015.

. Abedjan, Temporal rules discovery for web data cleaning, Proc. of VLDB, pp.336-347, 2015.
DOI : 10.14778/2856318.2856328

Webpage, https://spark-summit.org/east-2015/streaming-machine-learning-in-spark

. Kafka,

. Xin, GraphX, First International Workshop on Graph Data Management Experiences and Systems, GRADES '13, 2013.
DOI : 10.1145/2484425.2484427

. Recordedfuture,

. Gdeltproject,

, Spark stream cleaning Webpage

. Bohannon, A cost-based model and effective heuristic for repairing constraints by value modification, Proc. of SIGMOD, pp.143-154, 2005.
DOI : 10.1145/1066157.1066175
URL : http://homepages.inf.ed.ac.uk/wenfei/papers/sigmod05.pdf

. Stonebraker, The 8 requirements of real-time stream processing, ACM SIGMOD Record, vol.34, issue.4, pp.42-47, 2005.
DOI : 10.1145/1107499.1107504
URL : http://www.sigmod.org/publications/sigmod-record/0512/p42-article-stonebraker.pdf

. Trifacta,

OpenRefine

The LLUNATIC data-cleaning framework, Proc. of VLDB, pp.625-636, 2013.

S. R. Jeffery, M. Garofalakis, and M. J. Franklin, Adaptive cleaning for RFID data streams, Proceedings of the 32nd International Conference on Very Large Data Bases, ser. VLDB '06. VLDB Endowment, pp.163-174, 2006.

A. Zhang, S. Song, and J. Wang, Sequential Data Cleaning, Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pp.909-924, 2016.
DOI : 10.1145/2463676.2463706