S13: Image Analysis - Computer Vision
Friday, 22.02.2019:
9:00 - 10:30

Session chair: Martin Weinmann
Location: EH 01 (Exnerhaus)
EXNH-EG/58 Peter-Jordan-Straße 82, ground floor EAST, 150 seats

Encoder-Decoder Network With Dilated Convolution For Local Structure Preserving Stereo Matching

J. Kang1,2, L. Chen1, F. Deng2, C. Heipke1

1Leibniz Universität Hannover, Germany; 2Wuhan University, China

After many years of research, stereo matching remains a challenging task in photogrammetry and computer vision. Recent work has shown great progress by formulating dense stereo matching as a pixel-wise learning task to be solved with deep convolutional neural networks (CNNs). However, most estimation methods, both traditional and deep learning based, still struggle in real-world scenes, especially in areas with large displacements, strong depth discontinuities, and low texture.

To tackle these problems, we investigate a recently proposed end-to-end disparity learning network, DispNet [1], and improve it to yield better results in these problematic areas. The improvements stem from three major contributions. First, we use dilated convolution [2-3] to develop a global context feature extraction module. Dilated convolution expands the receptive field when extracting features and aggregates more contextual information without losing spatial resolution, which makes our network robust in weakly textured areas. Second, we construct the matching cost volume with patch-based correlation to handle large disparity displacements, and we modify the basic encoder-decoder module to regularize and regress detailed disparity images. Third, instead of using post-processing steps to impose smoothness and handle depth discontinuities, we incorporate disparity gradient information as a smoothness constraint to preserve local structure details in areas with strong depth discontinuities when estimating disparity images.
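The effect of dilation on the receptive field can be illustrated with a minimal one-dimensional sketch. This is not the authors' network: the function names `dilated_conv1d` and `receptive_field`, the toy signal, and the dilation rates are assumptions chosen only for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D correlation with gaps of `dilation` between kernel taps."""
    # Number of input samples covered by a single output value.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilation rates."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]  # simple difference filter, hypothetical weights

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples

# Three 3-tap layers: plain stacking vs. exponentially growing dilation.
plain_field = receptive_field(3, [1, 1, 1])
dilated_field = receptive_field(3, [1, 2, 4])
```

With the same three weights per layer, the dilated stack covers 15 input samples instead of 7. The sketch uses valid mode, so the output shrinks; with suitable padding the output keeps the input length, which is the sense in which resolution is preserved.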

We evaluate our model on several challenging stereo datasets such as Scene Flow, Sintel and KITTI. Experimental results demonstrate that our model reduces the error by more than 35% compared to DispNet on the Scene Flow dataset. Moreover, our approach is CNN-based and works without any post-processing, especially in inherently ill-posed regions.

[1] Mayer, Nikolaus, et al. "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[2] Holschneider, Matthias, et al. "A real-time algorithm for signal analysis with the help of the wavelet transform." Wavelets. Springer, Berlin, Heidelberg, 1990. 286-297.

[3] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).

Road Detection From Aerial Images By Mask R-CNN

P. Humburg, F. Rottensteiner, C. Heipke

Institute of Photogrammetry and GeoInformation, Leibniz Universität Hannover, Germany

Road data needs to be kept updated for a number of applications, such as navigation or urban planning. However, the description of roads in geospatial databases can change quite frequently, due to construction or deconstruction, and therefore the extraction of a correct and complete representation of the road network is required for updating.

To achieve this goal using aerial and satellite images, this paper employs an approach based on a Convolutional Neural Network (CNN). More specifically, Mask R-CNN, first proposed by He et al. (2017), is used, and its suitability for the given task is discussed. The approach contains a Region Proposal Network (RPN), followed by three network heads: a classification network, a bounding box regression network that fits a bounding box to the desired object, and a network extracting the binary mask of the object. This mask prediction would ideally predict the correct and complete course of a road object, thus verifying the existing road data and removing smaller geometrical inaccuracies. The bounding box regression, however, focuses exclusively on non-rotated bounding boxes, whose axes are parallel to the image rows and columns. This becomes an issue, as roads can have any direction in the image. Thus, a relative rotation between the box and the image needs to be taken into account for this task.
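Why axis-parallel boxes fit roads poorly can be seen from a small geometric sketch. It models a road segment as a thin rectangle and computes how much of its enclosing axis-aligned box it actually fills. The 100 m by 6 m segment and the function names are illustrative assumptions, not values from the paper.

```python
import math


def axis_aligned_box(length, width, angle_deg):
    """Width and height of the axis-aligned box enclosing a rotated rectangle."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return length * c + width * s, length * s + width * c


def fill_ratio(length, width, angle_deg):
    """Fraction of the enclosing axis-aligned box covered by the rectangle."""
    box_w, box_h = axis_aligned_box(length, width, angle_deg)
    return (length * width) / (box_w * box_h)


# Hypothetical road segment: 100 m long, 6 m wide.
aligned = fill_ratio(100.0, 6.0, 0.0)    # road parallel to the image rows
diagonal = fill_ratio(100.0, 6.0, 45.0)  # road running diagonally
```

For the axis-parallel segment the box is filled completely, while for the diagonal one only about 11 % of the box is road, so most of the box content is background.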

The approach is evaluated on a dataset from northern Germany, consisting mainly of suburban regions characterised by detached houses, as well as some more rural regions. The ground sampling distance is 20 cm, and the images contain both RGB and infrared information. Additionally, a digital terrain model (DTM) of the area is available, as well as a land cover classification produced with the method of Yang et al. (2018). The road database used is the German Authoritative Real Estate Cadastre Information System (ALKIS).


He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017): Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV). 22-29 October 2017. Venice, Italy. Pages: 2980 – 2988.

Yang, C., Rottensteiner, F., Heipke, C. (2018): Classification of land cover and land use based on convolution neural networks. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences IV-3. Pages: 251 – 258.

Complementary Feature Learning from RGB and Depth Information for Semantic Image Segmentation

L. Chen, D. Zhao, C. Heipke

Institute of Photogrammetry and GeoInformation, Leibniz Universität Hannover, Germany

Semantic segmentation of aerial images, called classification in remote sensing, is the process of assigning an object class label to each pixel in an image. Objects on the Earth's surface, e.g. buildings or roads, are much larger than the typical ground sampling distance (GSD) of modern aerial images (5-30 cm). Therefore, a context window is often considered when extracting the features used for classification. In computer vision, Fully Convolutional Networks (FCNs) [1], which directly output a pixel-wise classification map for an input image, are now a standard tool for semantic segmentation.

Spectral information of images, e.g. RGB or IRRG, is normally the first source provided for classification. However, other information such as depth can provide complementary information for the classification process. For example, although shadow can lead to ambiguity in classification because it changes the normal spectral appearance of objects, a digital surface model derived from laser scanning is not influenced by shadows. Typically, features computed from RGB and depth information are fused, by addition or concatenation, to improve the classification performance [2, 3]. To the best of our knowledge, there is still a lack of research on how the distribution of features computed from RGB and depth in feature space affects the classification performance.
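The two fusion operations mentioned above differ in how they treat the feature dimensions. A minimal sketch on plain Python lists, with hypothetical feature vectors and function names, might look as follows.

```python
def fuse_add(rgb_feat, depth_feat):
    """Element-wise sum; both vectors must have the same length."""
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("addition requires feature vectors of equal length")
    return [r + d for r, d in zip(rgb_feat, depth_feat)]


def fuse_concat(rgb_feat, depth_feat):
    """Concatenation; keeps both sources in separate dimensions."""
    return list(rgb_feat) + list(depth_feat)


# Hypothetical per-pixel feature vectors from the two branches.
rgb_feat = [0.2, 0.7, 0.1]
depth_feat = [0.5, 0.0, 0.3]

added = fuse_add(rgb_feat, depth_feat)            # length stays 3
concatenated = fuse_concat(rgb_feat, depth_feat)  # length becomes 6
```

Addition keeps the dimensionality but mixes the two sources in every channel, whereas concatenation doubles the dimensionality and leaves the mixing to the following layers.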

The general structure of an FCN for image classification can be divided into three parts: feature extraction for each separate information source, feature fusion, and classification. In this paper, we hypothesize that if the mid-level features (those extracted at the end of the feature extraction stage of the network) computed from RGB and depth information are complementary, then the two sources will reinforce each other and thus provide more distinctive features for classification. Based on this hypothesis, we formulate a complementarity constraint for the mid-level features of RGB and depth information. The constraint states that the features should be perpendicular to each other in the high-dimensional feature space, so that they span different dimensions of that space and thus capture “different” useful features for semantic segmentation.
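One way to express such a perpendicularity requirement is a penalty that vanishes when two feature vectors are orthogonal. The abstract does not give the exact formulation, so the squared cosine similarity used here, and the name `complementarity_penalty`, are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def complementarity_penalty(rgb_feat, depth_feat):
    """Assumed penalty: squared cosine, 0 for perpendicular, 1 for parallel."""
    return cosine_similarity(rgb_feat, depth_feat) ** 2


perpendicular = complementarity_penalty([1.0, 0.0], [0.0, 1.0])
parallel = complementarity_penalty([1.0, 2.0], [2.0, 4.0])
```

Added to the classification loss with some weight, a term of this kind would push the RGB and depth branches toward features that point in different directions of feature space.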

We modify ResNet-50 [4] to adapt it to an encoder-decoder FCN architecture and use this modified network as our baseline. The classification results based on different fusion strategies are investigated. Specifically, fusion of mid-level features from RGB and depth information with and without our proposed constraint is compared. Further results and analysis will be reported in the full paper.


[1] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).

[2] Audebert, N., Le Saux, B., Lefèvre, S. (2018). Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing, 140, 20-32.

[3] Marmanis, D., Schindler, K., Wegner, J. D., Galliani, S., Datcu, M., & Stilla, U. (2018). Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 158-172.

[4] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

Deep Learning for the Analysis of Images of Silk Fabrics for Applications in the Context of Cultural Heritage Preservation

M. Dorozynski, D. Wittich, F. Rottensteiner

Institute of Photogrammetry and GeoInformation, Germany

This contribution addresses the supervised classification of images of silk fabrics with the goal of predicting the epoch in which they were produced. Such information is of interest for the scholarly classification of works of art. Some museums hold collections of images of silk fabrics with accompanying descriptive text, which, however, is often free-form, so that the above information is not available in standardized form. A computer-aided analysis of these collections requires converting this information into a standardized format. Within the EU project SILKNOW, this problem is to be solved by classifying images with Convolutional Neural Networks (CNNs).

CNNs can learn both the extraction of suitable features from images and the mapping of these features to a set of classes. Often the aim is to derive a single class label for an entire image (Krizhevsky et al., 2012). For the classification of art-historical works, mainly images of paintings have been studied so far with respect to artist, genre and epoch of creation. Besides predefined features (Saleh & Elgammal, 2016), CNNs have also been used for this purpose (Bar et al., 2014), with pre-trained networks being adapted to the new problem (re-training; Yosinski et al., 2014; Bar et al., 2014).

In this contribution, a CNN is used, by way of example, to predict the production epoch of a silk fabric from an image of that fabric. The basis is the Residual Network ResNet-152 V2 (He et al., 2016), pre-trained on the ImageNet dataset (Deng et al., 2009). This network is adapted to the given task by re-training with training data. One challenge is the heterogeneity of the descriptive texts, which sometimes state even the day of production but often only the century. In addition, some of the stated periods overlap in time. A class structure that distinguishes intervals of 50 years was chosen. The number of training samples is generally rather small, so only the parameters of the last layer of the network are re-estimated. Initially, this is done using only training examples that can be assigned unambiguously to one class. The softmax cross-entropy (Bishop, 2006) is used as the training loss here.
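How heterogeneous date annotations lead to unambiguous and ambiguous samples can be sketched by mapping a stated production period to all 50-year classes it overlaps. The start year 1700, the number of classes, and the function name are hypothetical; the abstract does not state the actual class boundaries.

```python
def classes_for_interval(start_year, end_year,
                         first_year=1700, bin_width=50, n_classes=3):
    """Indices of all classes whose interval overlaps [start_year, end_year].

    `first_year` is a hypothetical lower bound of the first class.
    """
    hits = []
    for k in range(n_classes):
        lo = first_year + k * bin_width
        hi = lo + bin_width - 1
        if start_year <= hi and end_year >= lo:
            hits.append(k)
    return hits


exact = classes_for_interval(1760, 1760)      # a precisely dated fabric
century = classes_for_interval(1700, 1799)    # "18th century" only
straddling = classes_for_interval(1740, 1760) # period crossing a class border
```

A sample with exactly one overlapping class can be used for standard training, while samples with several overlapping classes are the ambiguous ones.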

To make additional data usable for training, the loss function was extended so that samples not unambiguously assigned to one class also contribute. For such samples, the loss is computed from the sum of the softmax activations of all assigned classes, whereas it remains unchanged for samples assigned unambiguously to one class. Samples assigned to several classes thus serve primarily as counter-examples for all other classes.
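A plausible reading of this extension is to take the negative logarithm of the summed softmax activations over all assigned classes, which reduces to the usual cross-entropy when exactly one class is assigned. The sketch below follows that reading; the exact form used by the authors is not given in the abstract, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ambiguity_aware_loss(logits, assigned_classes):
    """Assumed loss: -log of the summed softmax over all assigned classes."""
    probs = softmax(logits)
    return -math.log(sum(probs[k] for k in assigned_classes))


logits = [0.0, 0.0, 0.0]  # hypothetical network output for three classes

single = ambiguity_aware_loss(logits, [0])     # standard cross-entropy case
double = ambiguity_aware_loss(logits, [0, 1])  # sample assigned to two classes
```

For the two-class sample the loss only falls when probability mass moves away from the third class, which matches the description of such samples as counter-examples for all non-assigned classes.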

The method is evaluated on a test dataset; 2876 samples can be assigned to the base classes, whereas an unambiguous class assignment is not possible for 442 samples. The images show silk fabrics from one and a half centuries, so three classes are distinguished. The experiments are carried out with 10-fold cross-validation. With the standard training procedure, an overall accuracy of 67 % is achieved, and the class-specific indices show that the quality depends substantially on the amount of training data. With the newly developed loss function, the overall accuracy for three classes can be increased to about 74 %.
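The evaluation protocol can be sketched as follows, assuming a simple contiguous split without shuffling and overall accuracy defined as the share of correctly classified samples. The function names and the toy labels are illustrative; how the authors actually formed the folds is not stated.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def overall_accuracy(predicted, reference):
    """Fraction of samples whose predicted label equals the reference label."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


folds = list(k_fold_indices(2876, k=10))  # 2876 unambiguous samples
fold_sizes = [len(test) for _, test in folds]

# Toy labels only, to show the accuracy computation.
toy_accuracy = overall_accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```

In practice the sample order would usually be shuffled before splitting, so that each fold contains a similar mix of classes.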


Bar Y., Levy N., Wolf L., 2014. Classification of artistic styles using binarized features derived from a Deep Neural Network. In: Agapito L., Bronstein M., Rother C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer Science, vol 8925. Springer, Cham, pp. 71-84.

Bishop, C., 2006. Pattern Recognition and Machine Learning. 1st edition, Springer, New York, NY, 2006, pp. 235-236.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: CVPR 2009 - IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving Deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034.

Krizhevsky, A., Sutskever, I., Hinton, G. E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (NIPS’12), Volume 1, pp. 1097-1105.

Saleh, B., Elgammal, A., 2016. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. International Journal for Digital Art History 2(2016), pp. 70-93.

Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27 (NIPS’14), Volume 2, pp. 3320-3328.

Conference: DLT 2019