Tuesday, February 2, 2016

Chapter 2. Data Aggregation : An Overview

 Wireless sensor network have recently been the focus of many research efforts and emerged as an important new area in wireless technology. As chapter 1 already discuss about the problem of data aggregation, this chapter presents an overview of data aggregation. This chapter first discusses in-network aggregation and then different approaches that are widely used for data aggregation.

2.1 In-Network Aggregation

In a typical sensor network scenario, different node collect data from the environment and then send it to some central node or sink which analyze and process the data and then send it to the application. But in many cases, data produced by different node can be jointly processed while being forwarded to the sink node. So in-network aggregation deals with this distributed processing of data within the network.

Data aggregation techniques explore how the data is to be routed in the network as well as the processing method that are applied on the packets received by a node. They have a great impact on the energy consumption of nodes and thus on network efficiency by reducing number of transmission or length of packet. Elena Fosolo et al in [7] defines the in-network aggregation process as follows: “In-network aggregation is the global process of gathering and routing information through a multi-hop network, processing data at intermediate nodes with the objective of reducing resource consumption (in particular energy), thereby increasing network lifetime.”
 There are two approaches for in-network aggregation: with size reduction and without size reduction. In-network aggregation with size reduction refers to the process of combining & compressing the data packets received by a node from its neighbors in order to reduce the packet length to be transmitted or forwarded towards sink. As an example, consider the situation when a node receives two packets which have a spatial correlated data. In this case it is worthless to send both packets. Instead of that one should apply any function like AVG, MAX, MIN and then send a single packet. This approach considerably reduces the amount of bits transmitted in the network and thus saving a lot of energy but on the other hand, it also reduces the precision of value of data received. In-network aggregation without size reduction refers to the process merging data packets received from different neighbors in to a single data packet but without processing the value of data. As an example, two packets may contain different physical quantities (like temperature & humidity) and they can be merged in to a single packet by keeping both values intact but keeping a single header. This approach preserves the value of data and thus transmit more bits in the network but still reduce the overhead by keeping single header.
This of the two approaches to use depends on many factors like the type of application, data rate, network characteristics and so on. There is also a trade-off between energy consumption and precision of data for the two approaches.
Most of the work done till now on in-network aggregation mainly deals with problem of forwarding packets from source to sink, to facilitate aggregation therein. Actually the main idea behind were to enhance existing routing protocols such that they can efficiently aggregate data. So till now, most of the data aggregation techniques fall under three categories. They are tree-based approaches, multi-path approaches, and cluster-based approaches. There also some hybrid approaches that combines any of the three techniques above. So, all the three approaches will be described in coming sections with giving details of some of the main techniques by different authors.

2.2 Tree-Based Approach

The simplest way to aggregate data is to organize the nodes in a hierarchical manner and then select some nodes as the aggregation point or aggregators. The tree-based approach perform aggregation by constructing an aggregation tree, which could be a minimum spanning tree, rooted at sink and source nodes are considered as leaves. Each node has a parent node to forward its data. Flow of data starts from leaves nodes up to the sink and therein the aggregation done by parent nodes. The way this approach operates has some drawbacks. As we know like any wireless network the wireless sensor networks are also not free from failures. In case of packet loss at any level of tree, the data will be lost not only for a single level but for whole related sub-tree as well. In spite of high cost for maintaining tree structure in dynamic networks and scarce robustness of the system, this approach is very much suitable for designing optimal aggregation technique and energy-efficient techniques.
S. Madden et al. in [14] proposed a data-centric protocol which is based on aggregation tress, known as Tiny Aggregation (TAG) approach [14]. TAG works in two phases: distribution phase and collection phase. In distribution phase, TAG organizes nodes in to a routing tree rooted at sink. The tree formation starts with broadcasting a message from sink specify level or distance from root. When a node receive this message it sets its own level to be the level of message plus one and elect parent as node from which it receives the message. After that, node rebroadcast this message with its own level. This process continues until all nodes elect their parent. After tree formation, sink send queries along structure to all nodes in the network. TAG uses database query language (SQL) for selection and aggregation functions. In collection phase, data is forwarded and aggregated from leaves nodes to root. A parent node has to wait for data from all its child node before it can send its aggregate up the tree. Apart from the simple aggregation function provided by SQL (eg: COUNT, MIN, MAX, SUM, and AVG), TAG also partitions aggregates according to the duplicate sensitivity, exemplary and summary, and monotonic properties. Though TAG periodically refresh tree structure of network but as most of the tree-based schemes are inefficient for dynamic network, so TAG may be.
C. Intanagonwiwat et al. in [3] proposed a reactive data-centric protocol for applications where sink ask some specific information by flooding, known as directed diffusion paradigm. The main idea behind directed diffusion paradigm is to combine data coming from different source and en-route them by eliminating redundancy, minimizing the number of data transmission; thus maximizing network lifetime. Directed diffusion consists of several elements: interests, data messages, gradients, and reinforcements.


Figure 2.1 Simplified schematic for directed diffusion. (a) Interest propagation. (b) Initial gradients setup. (c) Data delivery along reinforced path [3].

The base station (BS) requests data by broadcasting an interest message which contains a description of a sensing task. This interest message propagates through the network hop-by-hop and each node also broadcast interest message to its neighbor. As interest message propagates throughout the network, gradients are setup by every node within the network. The gradient direction is set toward the neighboring node from which the interest is received. This process continues until gradients are setup from source node to base station. Loops are not checked at this stage but removed at later stage. After this path of information flow are formed and then best path are reinforced to prevent further flooding according to a local rule. Data aggregation took place on the way of different paths from different sources to base station or sink. The base station periodically refresh & resend the interest message as soon as it start to receives data from sources to provide reliability. The problem with directed diffusion is that it may not be applied to applications (e.g. environmental monitoring) that require continuous data delivery to base station. This is because query driven on demand data model may not help in this regard. Also matching data to queries might require some extra overhead at the sensor nodes. Mobility of sink nodes can also degrade the performance as path from sources to sinks cannot be updated until next interest message is flooded throughout the network. To cope up with above issue if introduce frequent flooding then also too much overhead of bandwidth and battery power will be introduced. Furthermore, exploratory data follow all possible paths in the network following gradients which lead to unnecessary communications overhead.
M. Lee et al. in [2] proposed a new low-control-overhead data dissemination scheme, which they called as pseudo-distance data dissemination (PDDD), for efficiently disseminating data from all sensor nodes to mobile sink. Some assumption have been made, they are: (1) all source nodes maintain routes to mobile sink node, (2) no periodically messaging for topological changes due to mobile sink node, (3) all link are bi-directional and no control messages are lost, (4) mobile sink nodes have unlimited battery power, so no need to care about battery efficiency of sink node, and (5) network partitioning is not considered. Data dissemination process is influenced by directed diffusion [3]. Though mobile sink periodically broadcast interest message, sensor nodes do not send exploratory data and do not wait reinforcement message because each sensor node already has routes to the sink node. After getting interest message, adjacent nodes set a parent-child relationship using pseudo-distance of each node and finally a partial ordered graph (POG) has been build. Optimal data dissemination is achieved in terms of path length by forwarding packets to a parent node until topology is unchanged. Then each sensor node is assigned a level for a corresponding sink node with pseudo-distance. In order to overcome the shortcoming of POG, author used totally ordered graph (TOG) in place of POG. The problem identified in this approach is that due to mobility of sink node all sensor nodes have to maintain routes and for any change in topology nodes have to again change route accordingly which led to energy waste.
Marc Lee et al. in [8] proposed an energy-aware spanning tree algorithm for data aggregation, referred as E-Span. E-Span is a distributed protocol in which source node that has highest residual energy is chosen as root. Other source nodes choose their parent based on residual energy and distance to the root. The protocol uses configuration message to exchange information of node i.e., residual energy and distance to the root. Each node performs single-hop broadcast operation to send packets. Single-hop broadcast refers to the operation of sending a packet to all single-hop neighbors [8].

Figure 2.2 E-Span protocol [8].


2.3 Multi-path Approach

One of the main drawbacks of tree-based approach is the scarce robustness of the system. To overcome this drawback, a new approach was proposed by many researchers. Instead of sending partially aggregated data to a single parent node in aggregation tree, a node could send data over multiple paths. The idea behind is that each node can send the data to its possibly multiple neighbors by exploiting the wireless medium characteristic. Hence data will flow from sources to sink along multiple paths and aggregation can be performed by each intermediate node. Clearly schemes using this approach will make the system robust but with some extra overhead. One of the aggregation structures that fit well with this approach is ring topology, where network is divided in to concentric circles with defining levels according to the hop distance from sink.
S. Nath et al. in [15] presented a data aggregation technique using multi-path approach, known as synopsis diffusion. Synopsis diffusion works in two phases: distribution of queries and data retrieval phase. During distribution of queries phase, a node sends a query in the network. The network nodes then form a set of rings around the querying node. The node which is i hop away from querying node is considered is ring Ri. In the second phase, aggregation starts from outermost ring and propagate level by level towards the sink. Here a source node can have multiple paths towards sink.

Figure 2.3 Examples of aggregation paths over a ring structure [7].

L. Gatani et al. in [5] describe a new strategy for data gathering in wireless sensor network that consider both issues: energy efficiency and robustness. Authors first say that single path to connect each node to the base station is simple & energy-saving approach but expose a high risk of disconnection due to node/link failures. But multi-path approach would require more nodes to participate with consequent waste of energy. Authors present a clever use of multi-path only when there is loss of packet which is implemented by smart caching of data at sensor nodes. Authors also argue that in many practical situation data may be gathered only from a particular region, so they use a different approach that relies on a spanning tree and provides alternative paths only when a malfunctioning is detected. Algorithm adopts a tree-based approach for forwarding packets through the network. In the ideal situation when no failures occur, this is certainly the best choice, as the minimum number of nodes is engaged in the transmission phase. In the presence of link or node failures, the algorithm will discover alternative paths, so as ensure the delivery of as many packets as possible within the time constraints. The problem with this approach is that it may cause the arising of hot spots and nodes along preferred paths will consume their energy resources quickly, possibly causing disconnection in the network.

2.4 Cluster-Based Approach

We talked about hierarchical organization of the network in tree-based approach. Another scheme to organize the network in hierarchical manner is cluster-based approach. In cluster-based approach, whole network is divided in to several clusters. Each cluster has a cluster-head which is selected among cluster members. Cluster-heads do the role of aggregator which aggregate data received from cluster members locally and then transmit the result to sink. The advantages and disadvantages of the cluster-based approaches is very much similar to tree-based approaches.
K. Dasgupta et al. in [16] proposed a maximum lifetime data aggregation (MLDA) algorithm which finds data gathering schedule provided location of sensors and base-station, data packet size, and energy of each sensor. A data gathering schedule specifies how data packet are collected from sensors and transmitted to base station for each round. A schedule can be thought of as a collection of aggregation trees. In [6], they proposed heuristic-greedy clustering-based MLDA based on MLDA algorithm. In this they partitioned the network in to cluster and referred each cluster as super-sensor. They then compute maximum lifetime schedule for the super-sensors and then use this schedule to construct aggregation trees for the sensors.
W. Choi et al. in [1] present a two-phase clustering (TPC) scheme. Phase I of this scheme creates clusters with a cluster-head and each node within that cluster form a direct link with cluster-head. Phase I of this scheme is similar to various scheme used for clustering but differ in one way that the cluster-head rotation is localized and is done based on the remaining energy level of the sensor nodes which minimize time variance of sensors and this lead to energy saving from unnecessary cluster-head rotation. In phase II, each node within the cluster searches for a neighbor closer than cluster-head which is called data relay point and setup up a data relay link. Now the sensor nodes within a cluster either use direct link or data relay link to send their data to cluster head which is an energy efficient scheme. The data relay point aggregates data at forwarding time to another data relay point or cluster-head. In case of high network density, TPC phase II will setup unnecessary data relay link between neighbors as closely deployed sensor will sense same data and this lead to a waste of energy.
Figure 2.4 Illustration of Two Phase Clustering [1].

H. Cam et al. in [4] present energy efficient and secure pattern based data aggregation protocol which is designed for clustered environment. In conventional method data is aggregated at cluster-head and cluster-head eliminate redundancy by checking the content of data. This protocol says that instead of sending raw data to cluster-head, the cluster members send corresponding pattern codes to cluster-head for data aggregation. If multiple nodes send the same pattern code then only one of them is finally selected for sending actual data to cluster-head. For pattern matching, authors present a pattern comparison algorithm.



Figure 2.5 Data transmission using ESPDA [4].


2.5 Simulation Tools

We have plenty of simulation tools or simulators for simulating wireless networks. The simulators which are most popular are NS-2, OPNET, OMNet++, J-Sim, GlomoSim, Qualnet, TOSSIM and so on. Since wireless sensor networks are special type of wireless networks, most of the simulators available are not enough supported for simulating a wireless sensor network scenario. The literature shows that the simulators which are mostly used for wireless sensor network are NS-2, J-Sim, GlomoSim, OPNET, TOSSIM, PROWLER and even MATLAB is also used.
NS-2 is the most popular and powerful simulator. NS-2 is an object-oriented discrete time event simulator and its modular design made it to be extensible. The detail of NS-2 is provided in appendix A. To simulate sensor network, there had been made attempt to put some add-ons. The most appreciable extension to NS-2 for wireless sensor networks was developed in 2004 by Ian Downward of Naval Research Laboratory (NRL) [18]. In this extension, they had created phenomenon packet which trigger event for sensor nodes. They also designed sensor agent, sensor application and other application to support wireless sensor network. But there has been no improvement or work done since 2006. This is the reason why all research is going by using NS-2 without any extension support. Other simulators like J-Sim, GlomoSim, OPNET, OMNet++ are exceptionally used by researchers. Another simulators which is MATLAB based, known as PROWLER, is sometimes used for simulating routing protocol or some MAC issues in wireless sensor network. MATLAB is also used for simulating some physical layer issues. But still there is no efficient simulator which is purely dedicated to wireless sensor network. More details for comparative study on simulator for wireless sensor network could be found in [19].  

No comments: