Research Scientist at the Department of Primary Industries and Regional Development Western Australia
How to efficiently create millions of overlapping raster tiles
When dealing with large amounts of raster spatial data, you will often find that operations become very slow to perform or just won’t run at all. Often this is the result of single-core processes or simply running out of RAM. Fortunately, in most situations, there is a solution to this issue: chop your raster data into smaller parts and run multiple simultaneous operations. In this post, I will be covering the first half of this workflow: chopping up your data, AKA ‘tiling’.
This script is one of a handful of tools I use for preprocessing raster data for deep learning models; however, it works just as well for other raster datasets such as DEMs. The script expects your input data to be in the GeoTIFF format with the extension ‘.tif’, although it should handle any raster that GDAL will open. It accepts one or more input rasters; if you have multiple, they should be edge matched and share the same resolution and projection. If they do not, you may get unexpected results.
It is worth noting that I have only run this code on Linux and macOS. I believe the multiprocessing library (which this script uses extensively) behaves slightly differently on different operating systems (beware, Windows users).
Also thanks to Carlos Babativa Rodriguez for the help with this one.
The notebook starts by importing the necessary libraries. You will need to have all of these installed within your environment to run this notebook. In my particular case, I’m using Anaconda, so if my environment was lacking ‘tqdm’, I would head over to https://anaconda.org/conda-forge/tqdm and run the first install line on that page.
Keep in mind you should avoid mixing conda channels, so either stick to the default channel for all your third-party installs or stick to the ‘conda-forge’ channel. If you mix channels you may end up breaking your environment; ask me how I know…
This next cell requires you to set some values. First, set the desired x and y pixel counts of your tiles, as well as the desired overlap. All of these values should be entered as integers, as you can’t create part of a pixel. Next, enter the folder which contains your rasters; this folder may have a flat structure or be a nested directory, either will work. The output folder will be the location that contains the output tiles; if you define a location that does not yet exist, it will be created later. If you only wish to use a subset of the rasters in your directory, you can also define a path to a .csv file containing a list of the raster filenames you want to use. If you want to use all of your raster files, set this to an empty string. You will also need to define the output raster format; most of the time ‘.tif’ should do the job. Lastly, the output compression must be defined. If you are using RGB imagery, you should probably set ‘JPEG’, assuming you can deal with lossy compression. For almost any other raster data type, ‘LZW’ should do a good job.
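As a rough sketch, the settings cell looks something like this; all of the variable names and values here are illustrative, not the notebook’s exact ones:

```python
# Illustrative settings only; substitute your own values and paths.
tile_x = 512          # tile width in pixels (integer)
tile_y = 512          # tile height in pixels (integer)
tile_overlap = 64     # overlap between neighbouring tiles, in pixels

input_folder = "rasters/input"    # flat or nested folder of rasters
output_folder = "rasters/tiles"   # created later if it does not exist
valid_rasters_csv = ""            # "" means use every raster found

input_file_ext = ".tif"
output_file_ext = ".tif"          # most of the time '.tif' does the job
compression = "LZW"               # 'JPEG' for RGB imagery, if lossy is OK
```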
This cell allows you to set the output folder structure and tile filename.
The settings can now be written out as a ‘json’ file; this is helpful if you forget the specific settings used to create a dataset.
Now that the input raster folder has been validated, this cell will look through it and return all rasters with the ‘input_file_ext’ extension.
If you have set ‘valid_rasters_csv’ to an empty string, the next three cells will not run. Sometimes you may not wish to tile all of your input raster files, in which case you can subset this list by providing a ‘valid_rasters_csv’. It is expected that this file contains a list of filenames which you would like to keep. Any raster that is not in this list will be removed from the input raster list and tiles will not be made from it.
This cell will open the ‘valid_rasters_csv’ file with pandas and print the start of the pandas dataframe. Note the column heading of the filenames, in the example; ‘raster_name’.
This cell converts a dataframe column into a set. Note that you will need to replace ‘raster_name’ with your own column heading. A set is used here because you should not have any repeating values, and because the next cell performs many ‘in’ checks, which are very fast on sets and much slower on lists.
Now that ‘valid_raster_names’ is created, this next cell loops over the raster list and checks whether each raster is in the set. Any raster that is not in the set is removed.
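Putting the last few cells together, a minimal sketch might look like the following; the ‘raster_name’ column matches the example above, while the dataframe contents and paths are made up for illustration:

```python
import os

import pandas as pd

# Hypothetical stand-in for pd.read_csv(valid_rasters_csv).
valid_df = pd.DataFrame({"raster_name": ["a.tif", "b.tif", "a.tif"]})

# A set drops duplicates and makes the many 'in' checks below O(1).
valid_raster_names = set(valid_df["raster_name"])

# Keep only the rasters whose filename appears in the set.
raster_list = ["/data/a.tif", "/data/b.tif", "/data/c.tif"]
raster_list = [p for p in raster_list
               if os.path.basename(p) in valid_raster_names]
```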
The following cell defines a function which opens a given raster, extracts its bounds, and returns the bounds along with the raster path. This is defined as a function as it will be called in the next cell using multiprocessing.
Note that here we are calling the ‘get_bounds’ function with multiprocessing. This allows as many copies of the function to run simultaneously as your computer has threads. In addition, a progress bar for the current process is displayed via ‘tqdm’.
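The multiprocessing-plus-tqdm pattern itself looks roughly like this; a toy function stands in for ‘get_bounds’ here, and on Windows the Pool creation would need to sit under an ‘if __name__ == "__main__"’ guard:

```python
from multiprocessing import Pool, cpu_count

from tqdm import tqdm

def slow_square(x):
    # Stand-in for get_bounds: any picklable top-level function works.
    return x * x

inputs = list(range(100))
with Pool(processes=cpu_count()) as pool:
    # imap yields results in order as they complete, so tqdm can show
    # a live progress bar over the whole job.
    results = list(tqdm(pool.imap(slow_square, inputs), total=len(inputs)))
```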
Now that the bounds of every raster are known, this next cell extracts the maximum geographical extent in each direction, defining one bounding box for all of the input rasters.
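With the bounds tuples in hand, deriving the combined bounding box is just a handful of min/max calls, for example:

```python
# Each element is (xmin, ymin, xmax, ymax, path), as per get_bounds.
all_bounds = [
    (0.0, 0.0, 10.0, 10.0, "a.tif"),
    (10.0, 0.0, 20.0, 10.0, "b.tif"),
    (0.0, 10.0, 10.0, 20.0, "c.tif"),
]

# One bounding box covering every input raster.
total_xmin = min(b[0] for b in all_bounds)
total_ymin = min(b[1] for b in all_bounds)
total_xmax = max(b[2] for b in all_bounds)
total_ymax = max(b[3] for b in all_bounds)
```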
To know how to cut the tiles, it is necessary to know the pixel size of the input rasters. As we are assuming all the rasters are the same pixel resolution, this cell opens up the first raster in the list and extracts the pixel size.
While we are at it, the next cell extracts the projection.
Using the values supplied and derived above, this next cell calculates the x and y distance between each tile and the size of each tile. Note that if you are creating any overlap, these values will be different from each other.
Now that the above has been calculated, the following cells can work out the number of rows and columns that will be tiled. This is needed to cut the tiles in a multiprocessing manner.
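The arithmetic behind these two steps can be sketched as follows; the pixel size and extent figures are made up for illustration:

```python
import math

pixel_size = 0.5            # metres per pixel, read from the first raster
tile_x, tile_y = 512, 512   # tile size in pixels
tile_overlap = 64           # overlap in pixels

# The step between neighbouring tile origins is smaller than the tile
# itself whenever there is any overlap.
step_x = (tile_x - tile_overlap) * pixel_size
step_y = (tile_y - tile_overlap) * pixel_size
tile_size_x = tile_x * pixel_size
tile_size_y = tile_y * pixel_size

# Total extent of the combined bounding box (made-up figures).
extent_x, extent_y = 10_000.0, 5_000.0

# Tile columns and rows needed to cover the whole extent.
n_cols = math.ceil(extent_x / step_x)
n_rows = math.ceil(extent_y / step_y)
```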
This next cell defines a function that checks which tiles intersect which input rasters. This function will reduce the complexity of the next cell.
Now a function is made which defines the extent of each tile and also checks which input rasters are needed to create it. A critical part of this is identifying if a tile intersects more than one raster. Depending on your particular situation this could be every tile or none of them. It is important to identify these tiles as they will need to be dealt with slightly differently than the other tiles.
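A pure-Python sketch of the tile-extent and intersection logic, with hypothetical helper names, might be:

```python
def tile_extent(row, col, origin_x, origin_y, step, tile_size):
    """Bounds of the tile at (row, col); origin is the top-left corner."""
    xmin = origin_x + col * step
    ymax = origin_y - row * step
    return xmin, ymax - tile_size, xmin + tile_size, ymax

def intersects(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]

def rasters_for_tile(tile_bounds, raster_bounds):
    """Paths of the input rasters this tile needs pixels from."""
    return [b[4] for b in raster_bounds if intersects(tile_bounds, b)]
```

A tile that gets more than one path back from ‘rasters_for_tile’ is one of the special cases mentioned above.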
Yet again, this cell uses multiprocessing to call multiple copies of the above function at the same time.
Now that we have the extent of each tile, a geodataframe is created. This geodataframe is used to visualise the extent of each tile which can be useful if something unexpected is happening. This geodataframe contains the bounds of each tile as well as its row, column and name. In the next couple of cells, this geodataframe is displayed in multiple different ways and then exported out to disk.
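Assuming you have geopandas and shapely available, building such a geodataframe could look like this; the tile records and the CRS are placeholders, so substitute your own rasters’ projection:

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical tile records: (row, col, xmin, ymin, xmax, ymax).
tiles = [(0, 0, 0.0, 10.0, 10.0, 20.0),
         (0, 1, 8.0, 10.0, 18.0, 20.0)]

tile_gdf = gpd.GeoDataFrame(
    {"row": [t[0] for t in tiles],
     "col": [t[1] for t in tiles],
     "name": [f"tile_{t[0]}_{t[1]}" for t in tiles]},
    geometry=[box(t[2], t[3], t[4], t[5]) for t in tiles],
    crs="EPSG:32750",  # placeholder; use your rasters' projection
)

# Export for inspection in a GIS if something unexpected is happening.
# tile_gdf.to_file("tile_extents.gpkg", driver="GPKG")
```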
A naive approach to cutting tiles would be to open a tile location, open each intersecting input raster, cut out the necessary parts of each raster, join the parts together, and move on to the next tile. This would work, but it would result in each input raster being read many times, which could be quite slow. Instead, if each input raster is opened and multiple tiles are cut from it one after another, each raster only needs to be opened once, reducing load times. To operate in this manner, the function below is necessary. It identifies which tiles intersect each input raster. Fortunately, most of this work was already done in the intersection step above, so this function just looks through the geodataframe to work out the intersections.
This intersect function is also called using multiprocessing.
The tile cutting function is now defined. This function takes a list as input, the first element being a path to a raster and the second being a list of tiles to cut. The raster is opened, and each tile is then cut out as long as it does not already exist. The function also checks whether the tile being cut intersects another raster; if so, the tile’s filename is marked with the string ‘_incomplete.tif’. At the end of the function, a list of incomplete tiles is returned.
Now another function is defined which will split a list into roughly equally sized sublists; this is necessary to batch the tiling process.
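One simple way to write such a splitter:

```python
def split_list(items, n_chunks):
    """Split items into n_chunks contiguous, roughly equal sublists."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first 'remainder' chunks each take one extra item.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```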
This cell simply counts the number of CPU cores available.
This cell works out the average number of tiles each input raster will be cut into.
Depending on your CPU core count and the number of input rasters, there are two different approaches, each optimised for speed in a different situation. For example, if you only have one input raster and 10 CPU cores, then each core will get a copy of the same input raster and a sublist of tiles to cut out. However, if you have 10 CPU cores and 100 input rasters, then each core will receive its own raster, cutting out all of that raster’s tiles before moving on to the next raster. Again, all of this is done using multiprocessing for speed.
The ‘incomplete_tile_list’ returned above is a list of lists; this cell flattens it into one long list.
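Flattening can be done with a single comprehension, for example:

```python
# Each worker returned its own list of incomplete tiles.
incomplete_tile_list = [["t_0_1_incomplete.tif"], [],
                        ["t_2_3_incomplete.tif", "t_2_4_incomplete.tif"]]

# One comprehension flattens the per-worker lists into a single list.
flat_incomplete = [tile for sub in incomplete_tile_list for tile in sub]
```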
The next two cells build a dataframe that contains two columns, one containing the paths to the incomplete tiles and the other containing the row and column for that tile.
This cell is quite handy, and I actually use it quite a bit outside of this script as well. It utilises the ‘gdal_merge.py’ Python script to drastically simplify the GDAL merging process.
This cell takes a row and column and finds all the incomplete tiles associated with that location, which are then sent to the merging script. When the merging is completed, the incomplete files are removed as they are no longer required.
Finally, multiprocessing is called one last time to join incomplete tiles.
Now that the notebook is complete, if you navigate to your output folder you will find a collection of subfolders. Each subfolder contains one row of tiles. The output is broken up like this because many file navigators start to slow down if an open folder contains many thousands of images.