Computer Vision
R. Iranpoor; S. H. Zahiri
Abstract
Background and Objectives: Person re-identification, the task of matching a person across non-overlapping cameras, is a significant application in computer vision. It is challenging, however, because large numbers of pedestrians with varied poses and appearances are observed from different camera viewpoints. Consequently, various learning approaches have been employed to overcome these challenges. Using methods that strike an appropriate balance between speed and accuracy is also a key consideration in this research.
Methods: Since one of the key challenges is reducing computational cost, the initial focus is on evaluating various methods. These methods are then improved by adding components with low computational cost to the networks. The most significant of these modifications is the addition of an Image Re-Retrieval Layer (IRL) to the backbone network to investigate changes in accuracy.
Results: Given that increasing computational speed is a fundamental goal of this work, the MobileNetV2 architecture is used as the backbone network. The IRL block is designed to have minimal impact on computational speed. With this component, on the CUHK03 dataset there was a 5% increase in mAP and a 3% increase in Rank-1 accuracy; on the Market-1501 dataset the improvement is partially evident. Comparisons with more complex architectures show a significant increase in computational speed for these methods.
Conclusion: Reducing computational cost and increasing recognition accuracy are interdependent objectives; depending on the context and priorities, one may be emphasized over the other when selecting a method. The changes applied in this research can lead to more optimal method selection, striking a balance between computational efficiency and recognition accuracy.
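The mAP and Rank-1 figures above follow the standard re-identification evaluation protocol. The sketch below (not the authors' code; names are illustrative) computes both metrics from a query-gallery distance matrix, assuming every query identity appears at least once in the gallery:

```python
import numpy as np

def evaluate_reid(dist, query_ids, gallery_ids):
    """Compute Rank-1 accuracy and mAP for a re-ID distance matrix.

    dist: (num_query, num_gallery) array of pairwise distances.
    Assumes each query identity has at least one gallery match.
    """
    rank1_hits, aps = 0, []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])              # gallery sorted by distance
        matches = gallery_ids[order] == qid      # True where identity matches
        if matches[0]:                           # nearest gallery item correct?
            rank1_hits += 1
        hit_positions = np.where(matches)[0]
        # average precision over the ranked gallery list
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        aps.append(np.mean(precisions))
    return rank1_hits / len(query_ids), float(np.mean(aps))
```

For example, with two queries where one is ranked correctly at position 1 and the other only at position 2, this returns Rank-1 = 0.5 and mAP = 0.75.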
Computer Vision
N. Rahimpour; A. Azadbakht; M. Tahmasbi; H. Farahani; S.R. Kheradpishe; A. Javaheri
Abstract
Background and Objectives: Cadastral boundary detection deals with locating the boundaries of land ownership and use. Recently, there has been high demand for accelerating and improving automatic cadastral mapping. As this problem is at its starting point, few studies have used deep learning algorithms.
Methods: In this paper, we develop an algorithm with a Mask R-CNN core followed by geometric post-processing methods that improve the quality of the output. Many studies use classification or semantic segmentation, but our algorithm employs instance segmentation. The algorithm consists of two parts, each with a few phases. In the first part, we use Mask R-CNN with a ResNet-50 backbone pre-trained on the ImageNet dataset. In the second part, we apply three geometric post-processing methods to the output of the first part to improve the overall result. We also use computational geometry to introduce a new line-simplification method, which we call the pocket-based simplification algorithm.
Results: We used three Google Maps images of sizes 4963 × 2819, 3999 × 3999, and 5520 × 3776 pixels, divided into overlapping and non-overlapping 400 × 400 patches, to train the algorithm. We then tested it on a Google Maps image of the Famenin region in Iran. To evaluate the performance of our algorithm, we use the popular metrics Recall, Precision, and F-score. The highest Recall is 95%, with a Precision of 72%, giving an F-score of 82%.
Conclusion: Using segmentation to derive region boundaries is a new idea. We used Mask R-CNN, known as a very suitable tool for segmentation, as the core of our algorithm. The geometric post-processing improves the F-score by almost 10 percentage points. The scores for a region in Iran containing many small farms are very good.
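Splitting large aerial images into fixed-size training patches, as described above, can be sketched as a sliding window; a stride smaller than the window gives overlapping patches, a stride equal to it gives non-overlapping ones (the stride value here is an assumption, not taken from the paper):

```python
import numpy as np

def extract_patches(image, size=400, stride=200):
    """Slide a size x size window over the image and collect patches.

    stride < size -> overlapping patches; stride == size -> non-overlapping.
    Border regions smaller than the window are skipped in this sketch.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(image[y:y + size, x:x + size])
    return patches
```

On an 800 × 800 image, a stride of 400 yields 4 non-overlapping patches, while a stride of 200 yields 9 overlapping ones.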
Computer Vision
Z. Raisi; J. Zelek
Abstract
Background and Objectives: Signage is everywhere, and a robot should be able to take advantage of signs to help it localize (including Visual Place Recognition (VPR)) and map. Robust text detection and recognition in the wild is challenging due to pose, irregular text instances, illumination variations, viewpoint changes, and occlusion.
Methods: This paper proposes an end-to-end scene text spotting model that simultaneously outputs the text string and bounding boxes. The proposed model leverages a pre-trained Vision Transformer (ViT) architecture combined with a multi-task transformer-based text detector better suited to the VPR task. Our central contribution is an end-to-end scene text spotting framework that adequately captures irregular and occluded text regions in challenging places. We first equip the ViT backbone with a masked autoencoder (MAE) to capture partially occluded characters and thereby address the occlusion problem. We then use a multi-task prediction head to handle arbitrarily shaped text instances with polygon bounding boxes.
Results: The proposed architecture was evaluated for VPR through several experiments on the challenging Self-Collected Text Place (SCTP) benchmark dataset, using the well-known Precision-Recall metric. The final model achieved Recall = 0.93 and Precision = 0.8 on this benchmark.
Conclusion: The initial experimental results show that the proposed model outperforms state-of-the-art (SOTA) methods on the SCTP dataset, confirming the robustness of the proposed end-to-end scene text detection and recognition model.
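Detection Precision and Recall of the kind reported above are typically computed by matching predicted regions to ground truth at an IoU threshold. The sketch below uses axis-aligned boxes for simplicity (the paper uses polygon boxes, so this is a simplified stand-in, not the authors' evaluation code):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def precision_recall(preds, gts, thr=0.5):
    """Greedily match each prediction to an unmatched ground truth."""
    matched, tp = set(), 0
    for p in preds:
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) >= thr:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```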
Computer Vision
S. H. Safavi; M. Sadeghi; M. Ebadpour
Abstract
Background and Objectives: Persian Road Surface Marking (PRSM) recognition is a prerequisite for future intelligent vehicles in Iran. First, the presence of Persian text on Road Surface Markings (RSMs) makes recognition challenging. Second, RSMs can appear on the road in different qualities, such as poor, fair, and excellent. Since the appearance of poor-quality RSMs varies from one province to another (i.e., varying road structure and scene complexity), recognizing unforeseen poor-quality RSMs is an essential and challenging task. Third, almost all existing datasets have imbalanced classes, which affects recognition accuracy.
Methods: To address the first challenge, the proposed Persian Road Surface Recognizer (PRSR) approach hierarchically separates texts and symbols before recognition. To this end, the Symbol Text Separator Network (STS-Net) is proposed; the proposed Text Recognizer Network (TR-Net) and Symbol Recognizer Network (SR-Net) then recognize the text and symbol, respectively. To investigate the second challenge, we introduce two scenarios. Scenario A: conventional random splitting of training and testing data. Scenario B: since the PRSM dataset includes several images taken at different distances from each RSM scene, it is highly probable that at least one of these images appears in the training set, making recognition easy. Because any province of Iran may present a new, previously unseen type of poor-quality RSM, we design the realistic and challenging Scenario B, in which the network is trained on excellent- and fair-quality RSMs and tested on poor-quality ones. In addition, we use data augmentation to overcome the class-imbalance challenge.
Results: The proposed approach achieves reliable performance (precision of 73.37% for Scenario B) on the PRSM dataset and improves recognition accuracy by up to 15% across scenarios.
Conclusion: Since PRSMs include both Persian texts (in different styles) and symbols, separating text and symbols with the proposed STS-Net before recognition can increase the recognition rate. Deploying new, more powerful networks and investigating new techniques for handling class-imbalanced data on the PRSM dataset, as well as further data augmentation, would be interesting future work.
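Using augmentation against class imbalance, as proposed above, usually means oversampling minority classes with augmented copies until all classes reach the size of the largest one. This is a minimal sketch of that idea (the `augment` placeholder stands in for real transforms such as flips or rotations; it is not the authors' pipeline):

```python
import random

def augment(sample):
    # placeholder: a real pipeline would apply flips, rotations,
    # brightness jitter, etc. to the image sample
    return sample + "_aug"

def balance_classes(dataset):
    """Oversample minority classes with augmented copies until every
    class matches the size of the largest class.

    dataset: list of (sample, label) pairs.
    """
    by_class = {}
    for sample, label in dataset:
        by_class.setdefault(label, []).append(sample)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for label, samples in by_class.items():
        balanced += [(s, label) for s in samples]
        for _ in range(target - len(samples)):
            balanced.append((augment(random.choice(samples)), label))
    return balanced
```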
Computer Vision
M. Taheri; M. Rastgarpour; A. Koochari
Abstract
Background and Objectives: Medical image segmentation is a challenging task due to low contrast between the region of interest and other textures, hair artifacts in dermoscopic images, illumination variations in images such as chest X-rays, and varied acquisition conditions.
Methods: In this paper, we utilize a novel method based on convolutional neural networks (CNNs) for medical image segmentation and compare our results with two well-known architectures, U-Net and FCN. For loss functions, we use both the Jaccard distance and binary cross-entropy, and the optimizer is SGD with Nesterov momentum. We apply two preprocessing steps: resizing image dimensions to speed up processing, and image augmentation to improve the network's results. Finally, we apply a threshold technique as postprocessing to the network outputs to improve image contrast. We evaluate our model on the well-known, publicly available PH2 database for melanoma lesion segmentation and on chest X-ray images because, as mentioned, these two types of medical images contain hair artifacts and illumination variations; we demonstrate the robustness of our method in segmenting them and compare it with other methods.
Results: Experimental results show that this method outperforms the two other well-known architectures, U-Net and FCN. Additionally, we improve on the performance metrics previously reported for dermoscopic and chest X-ray segmentation.
Conclusion: In this work, we have proposed an encoder-decoder framework based on deep convolutional neural networks for segmenting dermoscopic and chest X-ray medical images. Two image augmentation techniques, rotation and horizontal flipping, are applied to the training dataset before it is fed to the network. The predictions produced by the model on test images are postprocessed with the threshold technique to remove blurry boundaries around the predicted lesions.
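The Jaccard-distance loss and the threshold postprocessing mentioned above can be written compactly in numpy. This is a minimal sketch of the standard formulations, not the paper's implementation; the epsilon and the 0.5 threshold are conventional defaults assumed here:

```python
import numpy as np

def jaccard_distance(y_true, y_pred, eps=1e-7):
    """Soft Jaccard (IoU) distance between a binary ground-truth mask
    and a probability map; 0 means perfect overlap."""
    inter = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - inter
    return 1.0 - (inter + eps) / (union + eps)

def threshold_postprocess(prob_map, thr=0.5):
    """Binarize the network's probability output, sharpening blurry
    boundaries around predicted lesions."""
    return (prob_map >= thr).astype(np.uint8)
```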
Computer Vision
M. Fakhredanesh; S. Roostaie
Abstract
Background and Objectives: Action recognition, the process of labeling an unknown action in a query video, is a challenging problem due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. A number of solutions have been proposed for the action recognition problem. Many of these frameworks assume that each video sequence contains only one action class; a video sequence must therefore be broken into sub-sequences, each containing a single action class.
Methods: In this paper, we develop an unsupervised action change detection method that detects when actions change, without classifying the actions. The method uses a silhouette-based framework for action representation based on xt patterns. An xt pattern is a selected frame of the xty volume, which is obtained by rotating the traditional space-time volume and displacing its axes: in the xty volume, each frame has a spatial axis (x) and a time axis (t), and the y value specifies the frame number.
Results: To test the performance of the proposed method, we created 105 artificial videos using the Weizmann dataset, as well as a time-continuous camera-captured video, and conducted experiments on this dataset. The precision of the proposed method was 98.13% and the recall was 100%.
Conclusion: The proposed unsupervised approach can detect action changes with high precision. It can therefore be combined with an action recognition method to design an integrated action recognition system.
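Reading the xty-volume description above as an axis permutation of a (t, y, x) space-time volume, the rearrangement can be sketched in one numpy call (this is my interpretation of the abstract's description, not the authors' code):

```python
import numpy as np

def xt_patterns(video):
    """Rearrange a (t, y, x) space-time volume into an xty volume whose
    frames are x-t slices indexed by the row coordinate y.

    After the move, xty[y, x, t] == video[t, y, x], so frame k of the
    result is the xt pattern at image row y = k.
    """
    # send axis t (0) to the last position, y (1) to the front, x (2) next
    return np.moveaxis(video, [0, 1, 2], [2, 0, 1])
```

An xt pattern for a chosen row is then simply `xt_patterns(video)[k]`.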