Big Data Architecture:
One of the well known and widely implemented architectures in Big Data is the MapReduce. MapReduce is an architectural framework for processing huge amounts of data. It was initially proposed in 2004 by Google which is based on the No-SQL drift. It was implemented in tools such as Mars, Phoenix and Hadoop (which is most popular in the present day). Hadoop MapReduce is based on the shared-nothing architecture which doesn't allow sharing of altering data and stance challenges for some of its algorithms. It can run applications on parallel computing clusters by distributing the large amounts of data over the network. The main emphasis of the architecture is to distribute, parallelize and balance the data load in a system. Any library or tool which is built on Map reduce architecture is based on key-value pairs. It will have three major parts a Master, map and a reduce function.
The image table gives us an idea of how the functions operate and clustering is handled in MapReduce.
MapReduce is intended to withstand faults as it is a data-intensive application architecture. It is a good choice as it
- Can process petabytes of data by handling failures.
- Supports Data recovery.
- Does Parallel and distributed information processing
When it comes to testing any system Unit testing is the first and foremost. In addition, MapReduce systems need a tester to replicate the system behavior in all the Mapping, Combining and Reducing stages. Further, A MapReduce system calls for dynamic testing and distributed testing. It is possible to identify the performance, node failures, and network latency issues only during an active execution. Distributed testing can be performed as the testers would be assigned individual test cases which restricts fault injections in any state. Due to the size of data it handles, MapReduce applications potentially pose failures, thus increasing the weight on the testers. The testing team must be much aware of the processes, frameworks, and notions introduced. They should also work in collaboration with the Data and system architects. A need for enormous amounts of test data arises as the number of faults found may vary proportionally with size. Test data creation is moreover firmly identified with execution issues. Test cases and the test data must be created in accordance with the architecture, design, and workflow. Equivalence class partitioning can be used to partition the test data into meaningful clusters to verify the functionality and performance of the application. Boundary value analysis is another testing technique which focuses on the equivalence class boundaries. It is evident that testing the data around boundaries yields some good results.
Challenges of Testing:
The major issue while testing Big Data is having to deal with the unstructured or semistructured data because the traditional relational databases deal with structured data values. There are innumerable challenges handling large unstructured real-time data and just to understand the correlation with data between then and now makes Big Data testing an absolute nightmare. The test cases and test data with different permutations and combination are of humongous exponential power. A manual testing method would be tedious, so some type of automation test must be implemented. When testing, test data must be stored on the Big Data platform and the test environment must meet the standard of the source network to handle real-time data during testing. Big Data testing is always influenced by the below factors
Precision in integration
Data is collected from a wide variety of sources and integrated at some point, for the architecture tools to produce an efficient output the integration should be performed instantly and readily available.
Data collection and deployment
In a real-time environment where the data changes from time to time, data collection and deployment is always a challenge.
Feeding applications data in the right form and factor is essential for it to be processed and visualized in an efficient way, else the decision-making capabilities of the application will be affected. Scalability The applications must be scalable enough with the changing data to handle huge workloads based on the business expansions.
The data is distributed over nodes and clusters over a network in Big Data processing tools such as Hadoop. Failure to any node or a network would result in forfeiting consequence for companies.
Limitations & Future Trends:
The major limitation about current testing methods is that they are done on stationary data flow while Big Data testing must be performed in the continuous data flow. Process architectures are built around centralized data, but Big Data architectures are designed for distributed data. In the former data types are static, but dynamic with Big Data. The current need is to test at each stage with real-time data, to achieve this we need to be test-driven development and quality on all levels. Now we have the test data completeness, data values, how relevant is the data, sentiment, and behavior of unstructured real-time data. Multiple copies of Big Data are other challenges that must be handled in an environment where the efficiency of query executing is poor.https://www.youtube.com/watch?v=exzoBxm65ik Source: Eurostar Software Testing Conference
Testing both functional and non-functional requirements (such as performance, quality, and correctness of data) is required for any Big Data application to perform at its best. Though a system is tested for functional and non-functional requirements the test results are completely domain dependent and vary from each application domain. In the future, we focus to examine the correlation between the type of faults and generalize across all domains of Big Data.
About the Author: Ananth IyerThe writer of this article is a Ph.D. Student & Graduate Teaching Assistant for the Department Of Computer Science at The University of North Dakota. Sources:
 Zhengyi jin, “a software architecture-based testing technique”, dissertation submitted to the graduate faculty of George Mason University, 2000.  Hassan Reza, Suhas Lande,” Model-Based Testing Using Software Architecture”, Seventh International Conference on Information Technology: New Generations, 2010.  Juan Jin, Fei Xue, “Rethinking Software Testing Based on Software Architecture”, Seventh International Conference on Semantics, Knowledge, and Grids, 2011.  Antonia Bertolino, Paola Inverardi, Henry Muccini, Andrea Rosetti, “An Approach to Integration Testing Based on Architectural Descriptions”, Universita’ di L’Aquila, Dipartimento di Matematica Pura ed Applicata.  Ashlesha S. Nagdive, Dr. R. M. Tugnayat, Manish P. Tembhurkar, “Overview on Performance Testing Approach in Big Data”, International Journal of Advanced Research in Computer Science, 2014.  Muthuraman Thangaraj, Subramanian Anuradha, “State of the art in Testing for Big Data”, IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), 2015.  Swati Panwar, “How to Test Big Data Systems”, June 9, 2016.  Riya Rose, “Big Data Testing – How to Overcome Quality Challenges”, 2017.  Garg, Naveen, Sanjay Singla, and Surender Jangra. "Challenges and techniques for testing of Big Data." Procedia Computer Science 85 (2016): 940-948.  Khezr, Seyed Nima, and Nima Jafari Navimipour. "MapReduce and its applications, challenges, and architecture: a comprehensive review and directions for future research." Journal of Grid Computing 15.3 (2017): 295-321.  Aarnio, Tomi. "Parallel data processing with MapReduce." TKK T-110.5190, Seminar on Internetworking. 2009.  Tinetti, Fernando G., et al. "Hadoop Scalability and Performance Testing in H