Due to the difficulties described in current CV approaches, tracking pig activity is a challenging task without considerable labor effort. The objective of this paper is to develop a semi-supervised pipeline, Virtual Tag (VTag), to automate long-term tracking of group-housed pigs. In this pipeline, established tracking algorithms are implemented, including sparse optical flow proposed by Lucas and Kanade (LK), multiple instance learning (MIL), and channel and spatial reliability tracking (CSRT), which learn representations of the object of interest and find the most similar image region in the next video frame. These algorithms are lightweight and require no specialized computing resources such as graphics processing units (GPUs). The implemented trackers substantially reduce the effort of labeling pig positions in every single frame. To start tracking, users can either assign initial positions or let VTag predict the positions based on pig motion, which is anticipated to be an effective feature under different monitoring environments. We validated VTag on four three-hundred-frame videos collected from our farming trials, and a benchmark test was performed to compare the precision and processed frames per second (FPS) of the implemented trackers against other state-of-the-art models, such as YOLOv5 and Mask R-CNN. In addition, VTag is released as a user-friendly software tool with both a graphical user interface (GUI) and a Python library, allowing users to freely utilize the labeled data in their subsequent research. Therefore, neither hard-coded features selected by human experts nor large training datasets labeled through massive manual work are required in our pipeline.

All animal experiments were approved and carried out in accordance with the Virginia Tech Institutional Animal Care and Use Committee under protocol #19-182. The demonstrated video recordings were obtained from a previous trial, which reported image-based live body weight prediction of non-restrained grower pigs. The pigs entered the trial at 5 wk post-weaning. The imaging system was built with a laptop-controlled camera that captured RGB and depth videos at a resolution of 848 × 480 pixels.
The camera was installed at a height of 2.25 m, perpendicular to the floor, in each 5 × 7 ft pen, where pigs could move and walk freely during the entire recording. Each day, each monitored pen was recorded in a three-hundred-frame video at a rate of 6 frames per second. Raw videos were saved in Robot Operating System (ROS) bag format, and the decoder Intel RealSense Viewer was applied to obtain sequential image files as the input data. In this study, only grayscale images converted from the RGB channels were used; depth and color information were excluded from the pipeline. Each video clip had 300 time frames. Four video clips were evaluated for the performance of the presented pipeline: three clips contain 1, 2, and 3 pigs, respectively, and the last clip also contains 2 pigs but with more motion observed.

To evaluate precision, we manually labeled the central position of each pig body as the ground truth. The precision error was determined by the Euclidean distance between the ground truth and the centroid of the predicted bounding box. To make the results comparable with other studies, the error was standardized by dividing it by the diagonal length of the video frame, yielding values between 0 and 1. In addition, because the tracking process may fail when the similarity between two consecutive frames is low, human supervision is needed to provide new tracking positions and resume tracking. Hence, we also evaluated the number of supervisions needed to complete tracking the 300 frames in each dataset. To evaluate computing time, the elapsed time to track a single frame was measured for 100 iterations and is reported in FPS by inverting the observed elapsed time. In addition to the implemented trackers, the object detection models YOLOv5 and Mask R-CNN, pre-trained on the COCO dataset, were also included in the evaluation of computation time. This helps explore the possibility of adapting these pre-trained deep learning models to pig tracking tasks.
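The implemented trackers are standard, lightweight algorithms available in OpenCV (CSRT and MIL as built-in trackers, and Lucas-Kanade sparse optical flow via cv2.calcOpticalFlowPyrLK). The sketch below is a minimal illustration of how the precision and FPS evaluations described above could be reproduced for one such tracker; the helper names and data layout are ours for illustration and are not part of VTag.

```python
# Minimal evaluation sketch (helper names are illustrative, not VTag's API).
# It reports the standardized tracking error per frame and the speed in FPS
# for an OpenCV tracker such as CSRT or MIL.
import time
import numpy as np
import cv2  # the CSRT tracker requires the opencv-contrib-python build

def standardized_error(pred_xy, truth_xy, frame_w=848, frame_h=480):
    """Euclidean error divided by the frame diagonal (values between 0 and 1)."""
    diag = np.hypot(frame_w, frame_h)
    return np.hypot(pred_xy[0] - truth_xy[0], pred_xy[1] - truth_xy[1]) / diag

def evaluate_tracker(frames, init_box, truths):
    """frames: frames as read by OpenCV; init_box: (x, y, w, h) ints;
    truths: manually labeled (x, y) body centers, one per frame."""
    tracker = cv2.TrackerCSRT_create()            # or cv2.TrackerMIL_create()
    tracker.init(frames[0], init_box)
    errors, elapsed = [], []
    for frame, truth in zip(frames[1:], truths[1:]):
        t0 = time.perf_counter()
        ok, box = tracker.update(frame)           # predicted bounding box
        elapsed.append(time.perf_counter() - t0)
        if ok:
            x, y, w, h = box
            centroid = (x + w / 2.0, y + h / 2.0) # centroid of the predicted box
            errors.append(standardized_error(centroid, truth))
    fps = 1.0 / float(np.mean(elapsed))           # FPS = inverse of mean time per frame
    return np.asarray(errors), fps
```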
The evaluation was run on a personal laptop, a MacBook Pro with an Apple M1 Max chip, 10 CPU cores, and 32 GB of RAM. GPU resources were not utilized during the evaluation.

The VTag pipeline is released as Python software and can be accessed through a GUI or an interactive Python session. There are three components that users can interact with in the GUI: the video previewer, the playback controller, and the configuration panel. The previewer shows the video overlaid with the tracking results, which are presented as a centroid and its tracking window area. Tracking points are colored differently to distinguish pig identities. The video can be played, paused, and traversed to any frame through the playback controller. Each frame in the progress bar is colored on a gradient scale from yellow to blue, showing the tracking errors estimated by the implemented tracker. In the configuration panel, the parameters needed for the tracking task are tunable: users can load a directory containing the video to start a tracking task, adjust the number of tracked objects and the tracking window size, and adjust the display quality of the tracking results. Users who need to run their own analysis in an interactive programming session can load VTag in Python as a library; the library provides commands corresponding to all the actions in the VTag GUI. In sum, VTag provides a friendly platform to annotate video data and generate informative farming guidance on pig activity.

The precision evaluation is presented in Figure 3, where the standardized errors over frames are plotted as boxplots. Every 0.1 of the standardized error corresponds to 26.29 cm in the presented datasets. The colors represent different numbers of human supervisions; for example, the results shown in red were evaluated after 8 human supervisions.
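For reference, this conversion factor can be applied directly to translate the standardized errors in Figure 3 back to physical distances; a minimal sketch:

```python
# Conversion implied by the study: a standardized error of 0.1 equals 26.29 cm,
# i.e., the 848 x 480 px frame diagonal spans roughly 262.9 cm of the pen floor
# (consistent with the 5 x 7 ft pen dimensions).
CM_PER_UNIT_ERROR = 26.29 / 0.1   # ~262.9 cm per unit of standardized error

def error_to_cm(standardized_error):
    """Translate a standardized (0-1) tracking error into centimeters."""
    return standardized_error * CM_PER_UNIT_ERROR

# e.g., a standardized error of about 0.069 corresponds to roughly 18 cm.
```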
With adequate human supervision, all trackers precisely tracked pig activity with errors of less than 22.82 cm in all 4 datasets. In particular, the tracker LK completed the tasks without any resuming supervision, with median errors of 18.03 and 13.81 cm for the one-pig and three-pig datasets, respectively. The tracker CSRT performed similarly well with only one additional supervision, yielding a median error of 16.3 cm in the studied datasets except the two-pig dataset. Among the studied trackers, MIL showed similar precision but required more human supervision than the others in all datasets. It is noted that the number of tracked objects is not a major limiting factor for tracking precision. In this study, more supervision was needed when the objects moved rapidly and created blurry image features; when the pigs moved rapidly, the low-FPS input video could not display object positions in a timely manner. In the 2-pig dataset, although the precision was similar, 7, 5, and 13 supervisions were needed to complete tracking the 300 frames for the 3 trackers, respectively.

The computing time is presented in FPS, which indicates how many frames the tracker can process per second. LK achieved an average of 900 FPS and outperformed the other trackers in computing speed by more than 100-fold. CSRT was the second-fastest tracker, with performance ranging from 9.9 to 60.81 FPS when tracking different numbers of pigs. MIL was the slowest tracker, dropping to 1.8 FPS when tracking six pigs. It was also found that for CSRT and MIL, the number of tracked objects affects the tracking speed non-linearly. Additionally, the pre-trained object detection models were evaluated in this study as well. Without GPU resources, both models processed the studied videos more slowly than the presented trackers: only 4 FPS and 0.17 FPS were achieved by YOLOv5 and Mask R-CNN, respectively.

The distance between studied subjects implies 2 types of general social interaction: separated or engaged. When the subjects engage closely, the distance values are low during that period of time frames; otherwise, the subjects are separated without much interaction. A line chart of the distance against the 300 time frames was visualized to monitor such patterns, showing 4 peaks and 4 valleys in the 2-pig data. To examine whether the distance is an effective indicator of interaction, video frames with peak and valley values were displayed. In the frames with peak values, no interaction was observed among the pigs, which stayed in 2 different corners of the pen at the examined time frames. On the other hand, in the frames with valley values, social interactions were observed for all inspected frames: pigs were taking feed side by side or chasing each other. From the examined 300 frames, the estimated distance between pigs is an effective indicator for filtering time frames where social interactions may occur. By knowing the track of each pig, pixel movement per time frame was also studied to monitor activity individually. In the presented data, the 2 studied pigs were denoted "Pig_1" and "Pig_2". The median movements of Pig_1 and Pig_2 were 21.1 and 21.98 pixels per frame, respectively, showing no significant difference in overall activity. However, individual-specific temporal patterns can be discovered by dissecting the activity at certain time frames, as illustrated in the sketch and the example below.
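Both the pairwise distance and the per-frame movement discussed above follow directly from the tracked centroids. The sketch below shows one way to derive them, assuming the positions are stored as a (frames × pigs × 2) array of pixel coordinates; the array layout and names are illustrative assumptions rather than VTag's actual output format.

```python
# Minimal sketch of the derived activity metrics, assuming an array "tracks"
# of shape (n_frames, n_pigs, 2) holding the tracked centroids in pixels.
import numpy as np

def pairwise_distance(tracks, i=0, j=1):
    """Euclidean distance between pig i and pig j at every time frame."""
    return np.linalg.norm(tracks[:, i, :] - tracks[:, j, :], axis=1)

def movement_per_frame(tracks, i=0):
    """Pixel displacement of pig i between consecutive frames."""
    return np.linalg.norm(np.diff(tracks[:, i, :], axis=0), axis=1)

# Synchronicity between two pigs can be summarized as the correlation of
# their per-frame movements:
# r = np.corrcoef(movement_per_frame(tracks, 0), movement_per_frame(tracks, 1))[0, 1]
```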
For example, during the first 50 frames, Pig_2 was much more active, and the difference between Pig_1 and Pig_2 was especially apparent in the peak movements. Moreover, after the 50th frame, Pig_2 continuously accumulated greater movement than Pig_1: the gap was 1739.7 pixels at the 50th frame and later widened to 3612.9 pixels at the 250th frame. Finally, we inspected the synchronicity between pigs by comparing their movements per frame. A moderate correlation was observed in the studied data, implying that the activity of each individual was not independent and was partially determined by its neighboring pig. In addition to monitoring temporal activity, spatial patterns of pig movement can be informative for herd management. Heat maps generated from the pixel-wise variation across all time frames provided an insightful guide to which areas were visited most. In the one-pig data, the middle-top and bottom-left regions were found to be the hot spots, corresponding to the place where the pig engaged with neighboring pigs and to the feeding area, respectively. In the 2-pig data, by contrast, there was no clear spatial trend in the subjects' activity: most corners of the pen were visited by both pigs, except the central area.

Continuously tracking pig activity from videos is an important initial step in monitoring farming conditions in the swine industry. Such complex monitoring tasks, including animal diseases, welfare, and pen-scale social interactions, require detailed observation of pig activity. Many existing works have automated these tasks with the aid of CV technology but required massive human effort to prepare the datasets needed to build an effective system. In contrast, this paper presented a semi-supervised pipeline, VTag, which does not require laborious work to set up a training system. Relying solely on top-view grayscale video, VTag provides an efficient approach to continuously track the positions of group-housed pigs, with an average error of 17.99 cm in the presented datasets. The results can serve as preliminary farming guidance to infer complex traits that used to require intensive labor. For example, by continuously tracking pig positions with VTag, individual-level activity per unit time and walking speed can be estimated. This is important information for assessing pig lameness, which can be a potential indicator of fractures, lesions, and developmental disease, and which diminishes welfare in pigs. Hence, effectively evaluating lameness allows farmers to control the economic losses of losing pigs with poor body condition. Another important monitoring task that can be improved with VTag is tail biting. Because tail biting is linked to stressful farming conditions and lower body weights, detecting these negative events at an early growing stage can benefit both animal welfare and production. As real-time pig positions are obtained automatically, the relative distance between individuals in the pen can be estimated. Behavioral researchers can use this information to filter a specific time range from an hour-long video: when the relative distance is low, it is more likely to observe tail-biting events.
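A minimal sketch of such distance-based filtering, reusing a per-frame distance signal like the one derived above, is shown below; the threshold value and variable names are illustrative assumptions rather than values reported in this study.

```python
# Minimal sketch: extract candidate time ranges from a per-frame distance
# signal (e.g., the pairwise_distance() output above). Frames inside these
# ranges are the ones worth reviewing first for close-contact interactions
# such as tail biting.
import numpy as np

def low_distance_ranges(distance, threshold):
    """Return (start, end) frame indices of runs where distance < threshold."""
    close = distance < threshold
    edges = np.diff(close.astype(int))            # +1 where a run starts, -1 where it ends
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if close[0]:
        starts = np.r_[0, starts]                 # run already active at the first frame
    if close[-1]:
        ends = np.r_[ends, close.size]            # run still active at the last frame
    return list(zip(starts, ends))
```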