Building Detection U-Net

Brief Problem description

This repository contains PyTorch-based Deep Learning pipelines designed to extract building footprints from high-resolution aerial imagery.

To explore the effects of different learning rates and especially of different data augmentation techniques, several versions of the model were created. The first, baseline model already implemented a lot of data augmentation and a cosine annealing learning rate. This was taken even further in the next version, which used a cosine annealing learning rate with warm restarts, even heavier augmentation and three times the number of epochs. To also show the other direction, the third, barebones model has a constant learning rate and very little augmentation.

Motivation for modification / Clarification:

The heavy focus on data augmentation is in part caused by a last-minute pivot in application. Inspired by Guyot et al. 2021, I first attempted predicting potential archaeological sites from DEMs. However, the labels derived from the Bavarian Denkmalatlas proved to be too inaccurate for DL purposes. In conjunction with the low signal-to-noise ratio of the elevation indices used, all versions of the model produced few meaningful results and large amounts of overfitting. I tried fighting the overfitting for a long time before eventually giving up and switching to a more promising application where the premise is less flawed and changes to the model architecture can actually make a difference.

General Deep Learning Pipeline

The files on the disk are initially just considered raw data. When read, they are structured into a dataset and randomly divided into Training, Validation, and Testing sets, typically in a 70/20/10 split. To prevent the network from memorizing exact images, the training data is dynamically augmented on-the-fly as it is loaded. This training dataset is fed into the GPU in batches by the data loader, which operates on the CPU. Because the GPU computes much faster than the CPU can read files, the data loader utilizes multiple workers to simultaneously prefetch the next batches. This ensures the GPU never has to wait for new data and experiences as little downtime as possible.
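
As a rough sketch of this split and loader setup (the dummy dataset, worker count and batch size below are illustrative stand-ins, not the repository's exact values):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in dataset: 200 random 3x256x256 "images" with binary building masks.
images = torch.rand(200, 3, 256, 256)
masks = torch.randint(0, 2, (200, 1, 256, 256)).float()
full_dataset = TensorDataset(images, masks)

# Random 70/20/10 split into training, validation and testing sets.
n = len(full_dataset)
n_train, n_val = int(0.7 * n), int(0.2 * n)
train_ds, val_ds, test_ds = random_split(
    full_dataset, [n_train, n_val, n - n_train - n_val]
)

# Several CPU workers prefetch and pin batches so the GPU never has to wait.
train_loader = DataLoader(
    train_ds, batch_size=128, shuffle=True,
    num_workers=8, pin_memory=True,
    persistent_workers=True, prefetch_factor=2,
)
```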

During the forward pass, the model attempts to make predictions from this input data using an encoder and a decoder. The encoder runs a series of filters to reduce the spatial resolution of the image while simultaneously increasing the number of channels, revealing deeper mathematical features. This is the work of the neurons: the convolutional filters are essentially sliding windows, and their internal values are the network's weights. The decoder then uses a different set of neurons to scale the compressed image back up to its original resolution, generating a final output that hopefully resembles the ground truth label.

After each batch, the loss function evaluates how accurate the prediction was by comparing it to the label. During the backward pass, the network calculates how much each weight contributed to that error (the gradients). The optimizer then uses these gradients to adjust the model's weights by an amount determined by the learning rate. Once the entire training dataset has run through this process, the model's weights are temporarily frozen, and the validation dataset is fed through instead. The model still predicts and the loss function still evaluates, but the weights are not adjusted. The purpose here is strictly to provide unbiased performance metrics. After running through both the entire training and validation datasets, one epoch is complete.
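
A minimal sketch of one such epoch (assuming the model, loss function, optimizer and loaders have already been created; the function name is illustrative):

```python
import torch

def run_one_epoch(model, train_loader, val_loader, loss_fn, optimizer, device):
    # Training phase: forward pass, loss, backward pass, weight update.
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)            # forward pass
        loss = loss_fn(logits, labels)    # compare prediction to label
        optimizer.zero_grad()
        loss.backward()                   # backward pass: compute gradients
        optimizer.step()                  # optimizer adjusts the weights

    # Validation phase: weights stay frozen, only metrics are collected.
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            val_loss += loss_fn(model(images), labels).item()
    return val_loss / len(val_loader)
```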

The testing dataset is only used once the final version of the model is reached, to evaluate it. After it has been used, no adjustments or retraining should be made to the model, as that would also be a form of backpropagation, even if indirect and by human hand. The testing dataset would at that point be contaminated and lose its purpose.

What the models have in common

Data

Data sources

RGB imagery

For larger areas the Bayerische Vermessungsverwaltung doesn't offer the TIFFs directly, but instead a .meta4 file. This is essentially an XML file containing the download links for all the TIFF tiles in the area. The files were then mass-downloaded using the dataprep notebook.

Building footprint shapefiles

The building footprints for the districts of Lower Franconia, Upper Franconia and Middle Franconia were downloaded manually as zipped shapefiles.

Data preparation

  1. The script reads the download links out of the .meta4 file and downloads the RGB TIFF files into the appropriate directory. To speed up the process, multiple download workers are used (see the sketch after this list).
  2. To reduce GPU downtime by improving data loader speeds, the original images were converted into Cloud-Optimized GeoTIFFs (COGs). Standard TIFFs store data sequentially in horizontal rows. If the model requests a small square crop, the system is forced to read the entire width of every intersecting row into memory. COGs solve this by structuring the data internally as a grid of tiles, allowing the system to read only the specific tiles that overlap the requested area. In practice, however, this didn't make much of a difference, as the original TIFFs weren't that big to begin with.
  3. The building footprint shapefiles were rasterised and saved as heavily compressed binary TIFFs so they could be used as labels.
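
A rough sketch of step 1 (the .meta4 path, output directory and worker count are placeholders; the actual logic lives in the dataprep notebook):

```python
import urllib.request
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

META4_FILE = "tiles.meta4"   # placeholder path to the .meta4 file
OUT_DIR = Path("rgb_tiles")  # placeholder output directory
OUT_DIR.mkdir(exist_ok=True)

# A .meta4 file is Metalink XML; every <url> element holds one TIFF download link.
root = ET.parse(META4_FILE).getroot()
urls = [el.text for el in root.iter() if el.tag.endswith("url") and el.text]

def download(url: str) -> None:
    target = OUT_DIR / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, target)

# Multiple workers download tiles in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(download, urls))
```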

Data sample

The data loader and dataset

To ensure the single-GPU setup remains constantly fed, the loader utilizes 15 persistent CPU workers with pinned memory, processing and transferring the data in batches of 128 with a standard prefetch size of 2. The data loader of all models randomly crops a 256x256 tile out of each TIFF. If a crop does not contain any buildings, it tries again, up to 10 times in total. If none of the 10 crops contain buildings, there is a 10% chance the empty tile will be kept in the training data. This is supposed to keep the model from defaulting to guessing "no building here" everywhere, without completely ignoring the unpopulated areas within the cities' administrative boundaries.

Additionally, all models employ basic horizontal and vertical flips as light data augmentation.
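
A sketch of this sampling logic inside a hypothetical crop function (the retry count and the 10% keep-probability mirror the description above; the exact implementation in the repository may differ):

```python
import random
import numpy as np

def sample_training_crop(image: np.ndarray, mask: np.ndarray, size: int = 256):
    """image: HxWx3 RGB array, mask: HxW binary building raster."""
    h, w = mask.shape
    for _ in range(10):
        y = random.randint(0, h - size)
        x = random.randint(0, w - size)
        img_crop = image[y:y + size, x:x + size]
        msk_crop = mask[y:y + size, x:x + size]
        # Accept immediately if the crop contains at least one building pixel.
        if msk_crop.any():
            break
    else:
        # After 10 empty crops, keep the empty tile with 10 % probability,
        # otherwise signal the caller to resample from another image.
        if random.random() >= 0.1:
            return None

    # Light augmentation shared by all models: random horizontal/vertical flips.
    if random.random() < 0.5:
        img_crop, msk_crop = np.flip(img_crop, axis=1), np.flip(msk_crop, axis=1)
    if random.random() < 0.5:
        img_crop, msk_crop = np.flip(img_crop, axis=0), np.flip(msk_crop, axis=0)
    return img_crop.copy(), msk_crop.copy()
```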

The model

Backbone

The model itself is an smp.Unet using a ResNet-18 encoder pretrained on ImageNet as its backbone. Since it already comes with a basic "understanding" of textures and edges in RGB images, this should speed up the training of the model.

The model takes in the 3 channels of the RGB imagery and compresses them in the encoder section of the U-Net. In each layer the resolution is reduced, but new channels are created, highlighting more and more mathematical features. The decoder then scales these feature maps back up, combining the mathematical features with the higher-resolution data they were derived from. In the end the network returns a single channel of logits and hands them to the loss function.
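
Creating such a model with segmentation_models_pytorch looks roughly like this (the keyword values mirror the description above, not necessarily the repository's code):

```python
import segmentation_models_pytorch as smp

# U-Net with a ResNet-18 encoder pretrained on ImageNet:
# 3 RGB channels in, 1 channel of building logits out.
model = smp.Unet(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)
```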

Optimizer and loss

Loss Function

The models use a hybrid between Binary Cross Entropy (BCE) and Dice loss. The Dice loss operates similarly to IoU and evaluates the overlap between the predicted and the labelled features. The BCE instead penalizes uncertain predictions on individual pixels. Essentially, the Dice loss gets the model to detect the buildings, and the BCE makes it get the shape right.
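
A minimal sketch of such a hybrid loss (the 50/50 weighting is an assumption, not taken from the repository; smp ships a binary Dice loss that works directly on logits):

```python
import torch
import segmentation_models_pytorch as smp

bce = torch.nn.BCEWithLogitsLoss()          # per-pixel penalty on uncertain predictions
dice = smp.losses.DiceLoss(mode="binary")   # overlap measure, similar to IoU

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Equal weighting of both terms; the real weighting may differ.
    return 0.5 * bce(logits, target) + 0.5 * dice(logits, target)
```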

Optimizer

The models use the AdamW optimizer, which applies a strict mathematical penalty (weight decay) to keep the network's internal weights small. This stops the model from focusing too much on any specific feature and overfitting in the process.

It also applies different learning rates to the encoder and the decoder. The encoder is updated much more carefully in order to protect the pretrained weights it comes with.
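
Continuing from the smp.Unet sketched above, this can be expressed with per-parameter-group learning rates (the concrete rates and the weight decay below are placeholders, not the repository's values):

```python
import torch

# Separate parameter groups: the pretrained encoder gets a smaller learning
# rate than the randomly initialised decoder and segmentation head.
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-4},
        {"params": model.decoder.parameters(), "lr": 1e-3},
        {"params": model.segmentation_head.parameters(), "lr": 1e-3},
    ],
    weight_decay=1e-2,  # keeps the weights small to discourage overfitting
)
```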

What was adjusted between the models

Learning Rate

Barebones Model: Constant (3e-4)

Baseline Model: Cosine Annealing to quickly get to the "valley" and then fine-tune into the lowest spot

Fancy Model: Cosine Annealing with Warm Restarts to escape local minima in case the model landed in the wrong "valley" initially
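
The two non-constant schedules map onto standard PyTorch schedulers (T_max, T_0, T_mult and eta_min below are illustrative values, and the optimizer is the AdamW sketched earlier):

```python
from torch.optim.lr_scheduler import CosineAnnealingLR, CosineAnnealingWarmRestarts

# Baseline: one long cosine decay down to a small learning rate.
baseline_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# Fancy: cosine decay that periodically jumps back up ("warm restarts"),
# giving the model a chance to escape a bad local minimum.
fancy_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=25, T_mult=2, eta_min=1e-6)

# Typically stepped once per epoch, after the optimizer has updated the weights:
# scheduler.step()
```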

Augmentations

Barebones Model:

  • Random Crops (256x256)
  • Random Flips (Horizontal and Vertical)

Medium Model:

  • Random Crops (256x256)
  • Random Flips (Horizontal and Vertical)
  • Random Rotations (0-360°)
  • Random Scaling (0.85x - 1.15x)
  • Random Gaussian Noise
  • Random Cutout (erasing a 32x32 block of pixels)

Fancy Model:

  • Random Crops (256x256)
  • Random Flips (Horizontal and Vertical)
  • Random Rotations (0-360°)
  • Random Scaling (0.85x - 1.15x)
  • Random Gaussian Noise
  • Random Cutout (erasing a 32x32 block of pixels)
  • Random Brightness
  • Random Contrast
  • Saturation Jitter
  • Random Gaussian Blur
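
The fancy model's list could be composed with a library such as albumentations, which applies the same transform to image and mask so both stay aligned. The choice of library, the probabilities and the parameter ranges below are assumptions; only the list of transforms mirrors the bullets above.

```python
import albumentations as A

# Rough equivalent of the fancy model's augmentation stack.
fancy_augmentations = A.Compose([
    A.RandomScale(scale_limit=0.15, p=0.5),   # 0.85x - 1.15x scaling
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=180, p=0.5),               # any rotation angle
    A.GaussNoise(p=0.3),
    A.CoarseDropout(p=0.3),                   # cutout-style erasing; hole-size args vary by version
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.0, p=0.3),
    A.GaussianBlur(p=0.2),
])

# augmented = fancy_augmentations(image=image, mask=mask)
```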

Results & Discussion

| Metric               | Barebones | Baseline | Fancy   |
|----------------------|-----------|----------|---------|
| Best result at epoch | 68/71     | 89/100   | 225/267 |
| Best Val Loss        | 0.1631    | 0.1665   | 0.1592  |
| Generalization Gap   | 0.0288    | 0.0119   | 0.0147  |
| Best Val IoU         | 0.8255    | 0.8189   | 0.8273  |
| Best Val F1          | 0.9044    | 0.9004   | 0.9055  |

Neither the Barebones nor the Fancy model got to run for all the planned epochs, as they were much slower than expected and the SLURM job timed out before they were done. So the comparison is not entirely fair at this point.

The heavier augmentations push down the generalization gap and thus allow for more epochs without overfitting. However, the actual improvements in the various metrics are mostly negligible. This makes sense, considering that even the "barebones" model is not all that barebones, with random crops and flips to augment the data. Data augmentation and overfitting weren't the bottleneck yet, so in hindsight the effort should have gone into other changes, like trying out other backbones (e.g. ResNet-34) or loss functions (e.g. Dice only, BCE only).

The biggest change between the models seems to be the precision and recall. For the barebones model, precision is noticeably higher than recall; for the baseline model they are a bit closer; and for the fancy model they are essentially the same, with recall slightly in the lead. This is caused directly by the augmentations. The barebones model works with pure data and quickly learns the average building; if a building doesn't follow the norm, it will miss it. The fancy model, on the other hand, has to deal with weird buildings all the time due to the destructive augmentations, so it is more used to edge cases and misses less. Overall the values aren't much better, though.

[Figure: barebones metrics]

[Figure: baseline metrics]

[Figure: fancy metrics]
