Applying an Azure AutoML model to raster GIS data
This is a walkthrough of a Jupyter Notebook I created to run a vegetation classification model over the Nullarbor Land Conservation District. This notebook assumes you are trying to execute your own geographic classification task and that you already have a trained model from Azure AutoML. It is also assumed that your input raster data is already prepared, with every layer sharing the same extent, pixel size and projection. For my particular application I was using 100 raster layers at 80m resolution, covering 150,000 km², which equates to about 20,000,000 pixels per layer.
The onerous but necessary setup:
Warning: these setup steps are likely to change over time as Azure updates its dependencies. These particular instructions were valid as of 28 April 2021.
If you haven’t already, go ahead and download your model from Azure. You should receive a zip archive containing three files; extract them into a new folder. Within your new folder you should have “conda_env_v_1_0_0.yml” (the conda environment setup file), “model.pkl” (the trained model) and “scoring_file_v_1_0_0.py” (which contains an example of how to use the model).
Before moving ahead, it is necessary to set up a conda environment with the specific libraries needed to run the Azure model. The ‘conda_env_v_1_0_0.yml’ file you have just extracted contains most of what you need to make the environment, however you do need to make some slight tweaks. Open ‘conda_env_v_1_0_0.yml’ in a text editor, add ‘- pip’ on the line above the python entry, then set the Python version number to 3.7.
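After the edit, the top of the dependencies block should look something like this (the Azure package list below it varies between models, so leave the rest of the file as supplied):

```yaml
# Fragment of conda_env_v_1_0_0.yml after the tweaks
dependencies:
- pip
- python=3.7
- pip:
  # ...leave the Azure package list here as supplied
```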
Next, open a terminal and ‘cd’ into your extracted folder, then run the following command to build your environment:
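A sketch of the command, assuming the default file name from the archive (the environment name ‘azure_automl’ is my choice; pick whatever you like):

```bash
conda env create -f conda_env_v_1_0_0.yml -n azure_automl
```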
Your environment now contains all of the Azure-specific ML libraries, however you still need to add a couple of extra libraries. Run the following commands one by one to finish off the install:
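The exact extras depend on which libraries your notebook imports; for the one described in this walkthrough, something like the following should cover it (the package list is an assumption based on the imports used later):

```bash
conda activate azure_automl
pip install rasterio tqdm matplotlib jupyter
```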
Your environment is now set up. The following commands will change your current directory back to the root directory and then open Jupyter.
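Assuming the root directory here means your home directory, that looks like:

```bash
cd ~
jupyter notebook
```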
The ‘fun’ part:
After running the last command above, you should see a web browser window open. Within this window, navigate to the Jupyter Notebook that you have downloaded from here.
Try running the first cell, which imports all the necessary libraries. If it executes without errors, your environment is probably set up correctly and you can move on to the second cell.
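For reference, a plausible version of that imports cell (your notebook’s exact list may differ):

```python
import os
import pickle

import numpy as np
import pandas as pd
import rasterio
import matplotlib.pyplot as plt
from tqdm import tqdm
```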
The second cell requires you to input the paths to your raster dataset, the raster file format, your desired location of the output raster file and the path to the Azure model file.
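Something like this, with hypothetical paths (the names ‘raster_folder’, ‘input_file_ext’ and ‘output_raster_file’ are used in later cells; ‘model_path’ is my placeholder name for the model file):

```python
raster_folder = '/data/rasters'                     # folder containing the input rasters
input_file_ext = '.tif'                             # raster file format
output_raster_file = '/data/output/prediction.tif'  # where the classified raster will go
model_path = '/data/azure_model/model.pkl'          # the trained Azure AutoML model
```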
The third cell performs some sanity checks on the above data and lets you know if it sees any issues.
It is possible that your raster data has one or more values that represent NaN/null/no data, which need to be addressed as your trained model will not enjoy dealing with them. If you built your raster dataset, you probably know what this value is, but if not, load some of your rasters into QGIS or similar and find out. It’s probably something like -9999 or -3.3e+38. If you have one or more of these values, enter them as a list; if you don’t have any, just leave ‘raster_nan_values’ as an empty list. There are two options for dealing with these NaN values: you can mask them out of the final product, and/or you can replace them with the layer medians. If you set ‘nan_threshold’ to a number higher than 0, each pixel must have more than that many NaN values before it is masked out, which is useful if you have inconsistent raster coverage. However, keep in mind the ability of the model to make correct predictions may be significantly degraded in areas which have had NaN values replaced with medians. Finally, set the data type of the output raster; ‘uint8’ should almost always be sufficient for this purpose.
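As an example, a dataset with a single -9999 placeholder might be configured like this (‘output_dtype’ is a hypothetical name for the last setting):

```python
raster_nan_values = [-9999]  # [] if your rasters have no no-data placeholder
nan_threshold = 0            # pixels with more than this many NaN layers are masked out
output_dtype = 'uint8'       # data type of the exported raster
```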
In this next cell you need to enter the classification classes that you trained on. They do not need to be in any particular order; just keep in mind the output raster will have these labels replaced with their index numbers. You must also specify an output NaN value that is a valid ‘uint8’ number. The default is set to 255, which should be fine for most cases.
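For example, with made-up vegetation classes (each pixel in the output raster will hold its label’s index in this list):

```python
classes = ['woodland', 'shrubland', 'grassland', 'bare']  # hypothetical labels
export_raster_nan = 255                                   # must fit in uint8
```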
This next cell searches the defined ‘raster_folder’ and subfolders and returns a list of files which end with the ‘input_file_ext’. This is useful as your raster folder structure can be nested or flat and this will work regardless.
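A minimal sketch of that search using ‘os.walk’:

```python
raster_files = []
for root, dirs, files in os.walk(raster_folder):
    for file_name in files:
        if file_name.endswith(input_file_ext):
            raster_files.append(os.path.join(root, file_name))
raster_files.sort()  # keep the column order deterministic
print(f'Found {len(raster_files)} rasters')
```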
The next step is to load the raster data into a pandas dataframe. Depending on the size of your raster data, this step may soak up all of your RAM, so keep an eye on your RAM usage in System Monitor/Task Manager/Activity Monitor (depending on your OS). If you find yourself running out of RAM, your only choice is to tile your input rasters before running this script. This cell starts by creating an empty pandas dataframe and then iterates over each found raster and extracts the first band of data. At this point, the defined no data values within each raster are replaced with ‘numpy.nan’ before being added to the dataframe.
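A sketch of that loop, assuming ‘rasterio’ for the raster I/O:

```python
raster_df = pd.DataFrame()
for path in raster_files:
    with rasterio.open(path) as src:
        band = src.read(1).astype('float32').flatten()  # first band only, as 1D
    for nan_value in raster_nan_values:
        band[band == nan_value] = np.nan  # swap the placeholder for a real NaN
    column_name = os.path.splitext(os.path.basename(path))[0]
    raster_df[column_name] = band
```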
Now that all the raster data is in the dataframe, the smallest and largest values are printed to make sure no NaN values were missed. Assuming these numbers look reasonable, move on to the next cell.
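Something as simple as this will do it (pandas skips NaNs, so any missed placeholder like -9999 will show up as the minimum):

```python
print(raster_df.min().min(), raster_df.max().max())
```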
In this cell the number of NaN values in each row in the dataframe is counted. If this number is above the NaN threshold it gets a True tag in the ‘Mask’ column and will be masked out later.
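In pandas terms:

```python
raster_df['nan_count'] = raster_df.isnull().sum(axis=1)     # NaNs per pixel
raster_df['Mask'] = raster_df['nan_count'] > nan_threshold  # True = mask out later
```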
It is now time to actually replace the NaN values with something that the model will accept. The solution here is to simply replace every NaN value with the median of its column/raster layer. You could also use the mean, however this could create problems if you have a column of integers or booleans. This step should be possible with a one-liner, however I found that approach used a lot of RAM, so a for loop is used instead.
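The loop looks something like this:

```python
for column in raster_df.columns:
    if column not in ('Mask', 'nan_count'):  # skip the helper columns
        raster_df[column] = raster_df[column].fillna(raster_df[column].median())
```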
Now that the mask column is complete, this cell extracts the mask as a list and then removes the ‘Mask’ and ‘nan_count’ columns from the dataframe. This is done to prevent the model from throwing an error when it is passed extra columns that it was not trained on.
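Which amounts to:

```python
mask = raster_df['Mask'].tolist()
raster_df = raster_df.drop(columns=['Mask', 'nan_count'])
```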
This next cell requires you to open up the ‘scoring_file_v_1_0_0.py’ file, which should be in your extracted folder from Azure. You should see a dataframe called ‘input_sample’ just under the import lines, which contains all your training data variable names. Copy this entire block of code and paste it into the current Jupyter Notebook cell, overwriting the existing ‘input_sample’ line.
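The ‘input_sample’ block in the scoring file looks something like this (the column names and dtypes below are placeholders; yours will match your training data):

```python
input_sample = pd.DataFrame({'layer_01': pd.Series([0.0], dtype='float64'),
                             'layer_02': pd.Series([0.0], dtype='float64')})
```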
The next two cells check that the raster dataframe and the ‘input_sample’ have the same column headings. Again, this is critical, otherwise the model will throw an error.
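A quick way to check, for example:

```python
print(list(raster_df.columns) == list(input_sample.columns))  # should print True
```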
At this point the raster dataframe could be passed into the pickled model, however considering the size of the input data it may take a long time for the model to complete. This cell breaks the raster dataframe up into chunks of ‘n’ length, which will be processed one by one so that you get some sense of progress.
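A sketch of the chunking (the chunk length here is arbitrary; tune it to your data and RAM):

```python
n = 500_000  # rows per chunk
chunks = [raster_df[i:i + n] for i in range(0, len(raster_df), n)]
```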
Everything is now ready for the model to be run. As the dataframe is now in chunks, the model is called in a for loop which will be tracked with ‘tqdm’. The output is a list of lists with each item being one of your defined classes.
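Assuming the model is loaded with a plain pickle call (check ‘scoring_file_v_1_0_0.py’ for the exact loading code Azure uses), the loop looks like:

```python
with open(model_path, 'rb') as f:
    model = pickle.load(f)

predictions = []
for chunk in tqdm(chunks):
    predictions.append(model.predict(chunk))  # one list of class labels per chunk
```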
As the final export file is going to be a raster, the predictions must be converted to integers. This is done by looping over each item in each sublist and finding the index of that item in the ‘classes’ list. The result of this is appended to ‘pred_nums’.
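That is:

```python
pred_nums = []
for sublist in predictions:
    pred_nums.append([classes.index(item) for item in sublist])
```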
It is now time to use the mask list created earlier, which is combined with the modelled predictions into a new dataframe. This cell also replaces any predictions that need to be masked with the predefined ‘export_raster_nan’ value.
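A sketch, with the chunked predictions flattened back into one list:

```python
flat_preds = [item for sublist in pred_nums for item in sublist]
pred_df = pd.DataFrame({'pred': flat_preds, 'mask': mask})
pred_df.loc[pred_df['mask'], 'pred'] = export_raster_nan  # mask out bad pixels
```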
To export the predictions as a geotiff, the geographic metadata must first be defined. This could be done manually; instead, this cell opens one of the input rasters and extracts most of the required info.
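With rasterio this is a two-liner:

```python
with rasterio.open(raster_files[0]) as src:
    meta = src.meta.copy()  # driver, dtype, nodata, width, height, crs, transform
```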
This next cell makes a couple of changes to the metadata to reduce file size and to set the correct no data value.
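Something like the following (the LZW compression flag is my assumption as a common way to shrink geotiffs):

```python
meta.update(count=1, dtype=output_dtype, nodata=export_raster_nan, compress='lzw')
```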
Now that the metadata has been extracted, the prediction list can be reshaped into a 2D array to match the input rasters.
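Using the height and width from the metadata:

```python
pred_array = np.array(pred_df['pred'], dtype=output_dtype).reshape(
    meta['height'], meta['width'])
```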
Now that the array has been reshaped, it can be visualised with ‘matplotlib’. This is a little more complicated than normal because we have a no data value, so ‘np.ma.masked_where’ is used to mask out the ‘export_raster_nan’ value so that it displays as white.
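A minimal sketch; masked pixels are simply not drawn, so they take on the (white) figure background:

```python
masked = np.ma.masked_where(pred_array == export_raster_nan, pred_array)
plt.figure(figsize=(10, 8))
plt.imshow(masked)
plt.show()
```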
The final cell simply exports out the 2D array to the location defined by ‘output_raster_file’.
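Which is just:

```python
with rasterio.open(output_raster_file, 'w', **meta) as dst:
    dst.write(pred_array, 1)  # write the predictions as band 1
```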
You should now be able to navigate to the export location and drag the geotiff into QGIS to visualise it. Hopefully it looks reasonable! My output looked like this.