DATA MINING APPROACHES FOR HABITATS AND STOPOVERS DISCOVERY OF MIGRATORY BIRDS

This paper focuses on using data mining technology to discover the habitats and stopovers of migratory birds efficiently and accurately. We use three methods: 1. A density-based clustering method, which detects stopovers of birds during their migration through density-based clustering of location points. 2. A location histories parser method, which detects areas where migratory birds have stayed during a set time period by applying time and distance thresholds. 3. A time-parameterized line segment clustering method, which clusters directed line segments to analyze the shared segments of the migratory pathways of different birds and discovers the habitats and stopovers of these birds. Finally, we analyze the migration data of bar-headed geese in the Qinghai Lake Area with the three methods, verify their effectiveness, and, by comparison, identify the scope and context of use of each method.


INTRODUCTION
One of the most important tasks in protecting migratory birds around the globe is to identify the ecological needs of birds in their breeding and wintering grounds as well as at the stopovers during their migration (Berthold & Terrill, 1991). Information on specific migration routes, the network structure of these routes, and the important stopovers along them is the key to studying how migratory birds select habitats and stopovers, their migration strategies, and the influence of global climate change on their migration. In addition, the role of migratory birds in the spread of avian influenza virus has become a widely discussed topic. Many of the wild birds that have been infected by the H5N1 highly pathogenic avian influenza virus are migratory, so migratory birds might be vectors of the virus. As the ecological environment and natural resources of the habitats and stopovers might set the stage for interspecific or intraspecific transmission of avian influenza virus among birds, studying the migration of wild birds and detecting their habitats and stopovers efficiently and precisely are of significant value for research on, and prevention of, the spread of avian influenza virus.
Traditional ways of studying bird migration, such as bird banding, are simple and easy to carry out, but their results depend on long-term observation, and the number and quality of returned birds fall short of expectations, so it is impossible to get a whole picture of the migration track in a short time (Zhang & Yang, 1997). In other words, the traditional approach can hardly meet the requirements of modern study. The development of satellite tracking technology and its application in biology in recent years provide new opportunities for bird migration study (Cagnacci, Boitani, Powell, & Boyce, 2010). A sample of the raw data obtained by satellite tracking is shown in Table 1. In this table, ID is the record number, Animal is the label of the migratory bird, Latitude and Longitude give the specific location, and the Date time field is the timestamp. Traditional data analysis methods such as dot plotting or manual statistics cannot process such high-resolution spatial-temporal data. This paper focuses on using data mining technology to discover the habitats and stopovers of migratory birds from the original satellite telemetry data efficiently and accurately. The methods are as follows:
1. Density-based clustering method. The habitats and stopovers of a migratory bird are the areas where the bird stays continuously for some time, which correspond to dense regions in space. We use a density-based clustering method to discover these dense regions. Although some location data of the bird may be lost for various reasons, these dense regions can still characterize its habitats or stopovers.
2. Location histories parser method. Given a time threshold and a distance threshold, we model the movement status (stay or move) of a migratory bird and then scan the bird's migration route point by point. This method yields the arrival and departure time of the bird at each of its stopovers.
3. Time-parameterized line segment clustering method. We measure the space-time density of moving objects by spatial distance, direction of movement, and time characteristics. We use a time-based plane-sweeping trajectory clustering algorithm to analyze the shared segments of the migratory pathways of different migratory birds and to discover the habitats and stopovers of these birds.
The rest of this paper is organized as follows: the second section introduces related research; the third section defines some specific terms; the fourth section elaborates three ways to discover stopovers from GPS data; the fifth section presents the experiments and the analysis of the results; the last section provides the major conclusions of the paper.

RELATED WORK
With the improvement of GPS-based radio telemetry and growing international concern about migratory birds, many international organizations have begun to trace bird migration through satellite positioning technology (Frisch, Vagg, & Hepworth, 2006). There is increasing interest in developing methods to analyze trajectory datasets (Schiller & Voisard, 2004; Stauffer & Grimson, 2000). A typical data analysis task is to detect the stopovers of moving objects. We use the same satellite telemetry datasets as Tang et al. (2009), who proposed a hierarchical spatial clustering method, HDBSCAN, to find the habitats or stopovers of migratory birds at different spatial scales. However, HDBSCAN measures the proximity of birds mainly by the Euclidean distance between two points and does not take time information into account. Hariharan & Toyama (2004), Zheng, Zhang, Ma, Xie, & Ma (2011), Zheng & Li (2008), and Zheng & Xie (2010) modeled the location histories of humans and proposed a method to find human stopovers, but their attention was focused on location-based personalized recommendation, so they did not study the stopovers in depth. Gaffney & Smyth (1999) and Gaffney, Robertson, Smyth, Camargo, & Ghil (2006) observed that existing trajectory clustering algorithms group similar trajectories as a whole, thus revealing common trajectories. Clustering trajectories as a whole, however, cannot detect similar portions of trajectories and may miss common sub-trajectories. The framework and algorithm proposed by Lee, Han, & Whang (2007) did not consider temporal information. Satellite telemetry datasets and GPS-based location datasets are essentially time series of spatial data. To measure the space-time density of moving objects, this paper defines distance functions different from those of Lee et al. (2007) to measure the similarity of line segments, so that we can find the shared segments of migratory pathways in both time and space. In this paper, we use three data mining methods to discover the habitats and stopovers of migratory birds, and we analyze in detail the characteristics and the context of use of each of the three algorithms.

PRELIMINARY
In this section, we clarify some terms used in this paper, such as point, line segment, and trajectory.
Point: a point P is indicated by a tuple $(Lat, Lng)$, which means that a bird was once present at the location whose latitude is Lat and whose longitude is Lng.
Point set: a point set PS consists of a series of points which are generated by one or more birds.
Trajectory: a trajectory TR is defined as a set of pairs ordered by timestamp, $TR = \{(P_1, t_1), (P_2, t_2), \ldots, (P_n, t_n)\}$, where $t_i$ is point $P_i$'s timestamp.
Line segment: given a trajectory TR, a line segment of TR is defined by two sequential pairs $(P_i, t_i)$ and $(P_{i+1}, t_{i+1})$ and represents the object moving from position $P_i$ to position $P_{i+1}$ during $[t_i, t_{i+1}]$. The displacement of the moving object over the segment is the distance between $P_i$ and $P_{i+1}$, and the duration of the segment is $t_{i+1} - t_i$.
Line segment set: the line segment set of a trajectory TR is defined as the collection of all line segments formed by two sequential pairs in TR.
Stop region: a stop region is an area where migratory birds stay for some time during their migration. The habitats and stopovers of migratory birds are all stop regions. In the following sections, we use the coordinates of a stop region's center to indicate the stop region.
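The definitions above can be sketched as simple data structures. The following is a minimal illustration rather than the authors' implementation; the names `Point`, `LineSegment`, and `line_segments` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Point:
    """A location where a bird was once present (hypothetical name)."""
    lat: float
    lng: float


# A trajectory is a list of (point, timestamp) pairs for one bird.
Trajectory = List[Tuple[Point, datetime]]


@dataclass(frozen=True)
class LineSegment:
    """Movement between two sequential pairs of a trajectory."""
    start: Point
    end: Point
    t_start: datetime
    t_end: datetime

    @property
    def duration_seconds(self) -> float:
        # Duration of the segment: difference between the two timestamps.
        return (self.t_end - self.t_start).total_seconds()


def line_segments(tr: Trajectory) -> List[LineSegment]:
    """Build the line segment set of a trajectory.

    The pairs are sorted by timestamp first, so the input does not
    need to be ordered already.
    """
    ordered = sorted(tr, key=lambda pair: pair[1])
    return [
        LineSegment(p, q, tp, tq)
        for (p, tp), (q, tq) in zip(ordered, ordered[1:])
    ]
```

A trajectory with n pairs yields n - 1 line segments, each carrying its own start and end time, which is what the time-parameterized clustering method operates on.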

THREE METHODS TO DISCOVER THE STOP REGIONS
The migratory routes of migratory birds are long and complex paths (Figure 1), and the raw GPS data cannot be used conveniently because of its large scale and high complexity. In this section, we present three methods to solve this problem and explain their principles in detail.
Figure 1. Migratory pathway of one bar-headed goose captured in the Qinghai Lake Area.

Density-based clustering method
As depicted in Figure 1, the visually dense regions of the pathway are likely to be stop regions.
We therefore assume that dense regions in the spatial-temporal data correspond to stop regions. The satellite telemetry device sampled the GPS position about once every 2 hours during the day. If a bird stays in a small area for more than a certain period of time, the sampling points in this area will be denser than elsewhere. It is therefore possible to detect the stop regions of migratory birds by finding the dense areas in the GPS location history data.
To find dense clusters in spatial data, Ester, Kriegel, Sander, & Xu (1996) proposed the DBSCAN algorithm. This density-based algorithm relies on the following notions (Han & Kamber, 2000):
The ε-neighborhood of an object is the neighborhood within a radius ε of that object.
An object is a core object if its ε-neighborhood contains at least a minimum number (MinPts) of objects.
An object p is directly density-reachable from an object q if p is within the ε-neighborhood of q and q is a core object.
An object p is density-reachable from an object q with respect to ε and MinPts in a set of objects D if there is a chain of objects $p_1, \ldots, p_n$, where $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$ with respect to ε and MinPts, for $1 \le i < n$, $p_i \in D$.
An object p is density-connected to an object q with respect to ε and MinPts in a set of objects D if there is an object $o \in D$ such that both p and q are density-reachable from o with respect to ε and MinPts.
All points within a cluster are mutually density-connected. If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
The stop region detection algorithm based on DBSCAN (Ester et al., 1996) is called DBS_SR_DETECTION. Its time complexity is $O(n^2)$, where n is the number of points in PS. If an appropriate spatial index is used, the time complexity reduces to $O(n \log n)$. If ε and MinPts are set appropriately, the algorithm can detect arbitrarily shaped clusters, but there is no generally good way to choose these two parameters. PS can be the location history of either one bird or multiple birds. Here, each point in PS is a $(Lat, Lng)$ tuple, so the points contain only the spatial dimension, and we use the great-circle distance as the geographical distance between two points. Furthermore, the points labeled NOISE by DBS_SR_DETECTION may still be significant to the ornithologist, because the bird may have been flying fast at those locations.
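The density-based approach can be illustrated with a short sketch. This is a plain, index-free DBSCAN in the spirit of Ester et al. (1996), not the authors' DBS_SR_DETECTION listing, which is not reproduced in this text. The function names, the haversine form of the great-circle distance, the Earth radius of 6371 km, and the use of the mean coordinate as the stop region center are all assumptions made for the example.

```python
import math
from typing import Dict, List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius
NOISE = -1

LatLng = Tuple[float, float]


def great_circle_km(a: LatLng, b: LatLng) -> float:
    """Great-circle distance between two (lat, lng) points (haversine)."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def region_query(ps: List[LatLng], i: int, eps_km: float) -> List[int]:
    """Indices of all points in the eps-neighborhood of ps[i] (linear scan)."""
    return [j for j in range(len(ps)) if great_circle_km(ps[i], ps[j]) <= eps_km]


def dbs_sr_detection(ps: List[LatLng], eps_km: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels: List[Optional[int]] = [None] * len(ps)
    cluster = -1
    for i in range(len(ps)):
        if labels[i] is not None:
            continue
        neighbors = region_query(ps, i, eps_km)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(ps, j, eps_km)
            if len(j_neighbors) >= min_pts:  # j is a core object: expand
                seeds.extend(j_neighbors)
    return [NOISE if lab is None else lab for lab in labels]


def stop_region_centers(ps: List[LatLng], labels: List[int]) -> Dict[int, LatLng]:
    """Mean coordinate of each cluster, used as the stop region center."""
    sums: Dict[int, List[float]] = {}
    for (lat, lng), c in zip(ps, labels):
        if c == NOISE:
            continue
        s = sums.setdefault(c, [0.0, 0.0, 0.0])
        s[0] += lat
        s[1] += lng
        s[2] += 1
    return {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items()}
```

The linear scan in `region_query` gives the quadratic running time mentioned above; replacing it with a spatial index lookup is what brings the cost down. Points that end up labeled `NOISE` are kept in the output so that they can be inspected separately.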

Location histories parser method
As stated before, DBS_SR_DETECTION takes only the spatial dimension into account and ignores the time dimension. In fact, the migration routes of birds are complex and irregular (Figure 1). To solve this problem, we need to take the time dimension into account. Hariharan et al. (2004), Zheng et al. (2011), Zheng et al. (2008), and Zheng et al. (2010) proposed a method based on time and distance thresholds to discover human stay points from historical location data. This method may also be useful for detecting the stop regions of migratory birds. The stops of migratory birds can be divided into two kinds:
1. As stop region 1 in Figure 3 shows, birds may remain stationary for some time during the migration because of bad weather or because they need a rest.
2. As stop region 2 in Figure 3 shows, birds may stay within a small area for some time because they need to find food or for some other reason.
Both kinds of stop can be defined as follows. Given a trajectory $TR = \{(P_1, t_1), \ldots, (P_n, t_n)\}$, if there is a subset $sTR = \{(P_m, t_m), \ldots, (P_k, t_k)\}$ of TR where $Distance(P_m, P_i) \le Dr$ for $m < i \le k$ and $Interval(t_m, t_k) \ge Tr$, in which $Distance(P_m, P_i)$ denotes the geospatial distance between the two points $P_m$ and $P_i$, $Interval(t_m, t_k)$ is the time interval between two points, and Dr and Tr are the distance and time thresholds, then the area where the points of sTR are located is a stop region S (Zheng & Xie, 2010). We can also use a 4-tuple to indicate a stop region, $S = (Lat, Lng, ts, te)$. Lat is the average latitude of the collection sTR; Lng is the average longitude of the collection sTR; ts is the bird's arrival time at stop region S; te is the bird's departure time. We can compute them as: $S.Lat = \frac{1}{|sTR|} \sum_{i=m}^{k} P_i.Lat$, $S.Lng = \frac{1}{|sTR|} \sum_{i=m}^{k} P_i.Lng$, $S.ts = t_m$, $S.te = t_k$. The algorithm that detects all stop regions from a trajectory is called LHP_SR_DETECTION. It processes the trajectory of one bird, and the bird's location history data must be sorted by timestamp before the algorithm is used. The algorithm cannot deal with the trajectories of multiple birds. A simple solution is to combine the two algorithms: first use LHP_SR_DETECTION to detect the stop regions of each bird separately, and then use DBS_SR_DETECTION to cluster the stop regions of all birds.
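The threshold-based scan can be sketched as follows. This is a simplified variant written for illustration under stated assumptions, not the authors' LHP_SR_DETECTION listing: it anchors a window at a point, extends the window while later points stay within the distance threshold of the anchor, and records a stop region when the window spans at least the time threshold. The names `StopRegion`, `great_circle_km`, `dr_km`, and `tr_time`, as well as the haversine distance and the 6371 km Earth radius, are hypothetical choices.

```python
import math
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

Fix = Tuple[float, float, datetime]  # (lat, lng, timestamp)


class StopRegion(NamedTuple):
    lat: float    # average latitude of the points in the stop
    lng: float    # average longitude of the points in the stop
    ts: datetime  # arrival time
    te: datetime  # departure time


def great_circle_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres (haversine formula)."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lng1, lat2, lng2))
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def lhp_sr_detection(trajectory: List[Fix], dr_km: float,
                     tr_time: timedelta) -> List[StopRegion]:
    """Detect the stop regions of ONE bird from its location history."""
    fixes = sorted(trajectory, key=lambda f: f[2])  # sort by timestamp
    stops: List[StopRegion] = []
    i, n = 0, len(fixes)
    while i < n:
        # Extend the window while points stay within dr_km of the anchor i.
        j = i + 1
        while j < n and great_circle_km(fixes[i][0], fixes[i][1],
                                        fixes[j][0], fixes[j][1]) <= dr_km:
            j += 1
        window = fixes[i:j]
        if len(window) > 1 and window[-1][2] - window[0][2] >= tr_time:
            lat = sum(f[0] for f in window) / len(window)
            lng = sum(f[1] for f in window) / len(window)
            stops.append(StopRegion(lat, lng, window[0][2], window[-1][2]))
            i = j   # continue scanning after the stop
        else:
            i += 1  # the bird was moving; try the next anchor
    return stops
```

Because each stop carries its own arrival and departure time, the stops of several birds can afterwards be pooled and clustered spatially, as suggested above.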

Time-parameterized line segment clustering method
Birds in the same region usually share their habitats or stopovers. As indicated in Figure 4, when different birds fly from the same place to the same destination, many similar line segments are generated between these two places. These similar line segments form a cluster, and the sets of starting points and end points of the line segments in this cluster may be stopovers or habitats of the migratory birds.

Figure 3.
Figure 2. A typical migratory route

Table 1. Relational representation of raw GPS data.
Input: A trajectory: TR; Distance threshold: Dr; Time threshold: Tr
Output: A set of all stop regions SS
LHP_SR_DETECTION (TR, Dr, Tr):
    i=0, n=|TR|; //the number of GPS points in a GPS log
    While i < n do:
        j=i+1;
        While j < n do: