Point sampling multiple raster files
When performing spatial modeling it is often necessary to extract raster values at xy point locations. If you’re working with small to moderate amounts of data, this operation can be done within QGIS or alike, however, if you are working with larger datasets or many small datasets it becomes useful to use a script to do the work for you. I ran into this issue recently when I needed to extract 150,000 point values from several hundred raster files. Needless to say, QGIS did not appreciate me trying to load all of this data into it, so I resorted to building a jupyter notebook instead.
This notebook intakes a point file and a folder of rasters and will efficiently extract the raster values at the specified point locations.
The notebook starts by importing the necessary libraries, using ‘os’ for file path manipulation, ‘multiprocessing’ for iterating over multiple rasters at once, ‘tqdm’ for displaying a loading bar, ‘rasterio’ for extracting the raster values and finally ‘geopandas’ for importing, displaying and exporting the point files.
Next, the input point file, export file name and raster folder must be defined. Geopandas is pretty flexible and should be able to load just about any vector file format. The export file will be placed in the same folder as the input file so make sure the names are not the same or you may override your input data. The raster folder should contain the raster files you would like to sample, the notebook will scan this folder and all sub folders for raster files ending with the specified file extensions.
At this point the notebook will try to load the point data. It’s worth noting that to intersect the point data with the raster data, they must have the same crs. If this is not the case it’s almost always more efficient to reproject the point data rather than the raster data. To do this, change the ‘EPSG’ code in this cell to match that of the raster data. If all your data is in the same projection you can just comment out this line.
It is often useful to visualise data to make sure nothing has gone wrong. Geopandas has a ‘plot’ function which shows the entire geodataframe, make sure this appears as you expect.
The next cell will strip out the x and y coordinates from the geodataframe and compile them into a list of tuples. To do this, both x and y are extracted as lists then combined with the ‘zip’ function. The first 10 tuples are displayed to make sure this worked as expected.
Now that the xy data is formatted as needed, the notebook searches for the raster files. The ‘os.walk’ function is used in a loop to recursively search subfolders for all files ending with the predefined file extensions. All the files found are appended to the ‘raster_list’.
The notebook now defines a function to extract the raster data. Within this function, rasterio is used to open the raster and the ‘sample’ module is called on that open data to extract all the raster values at the predefined xy locations. This function finishes by returning the name of the raster file and the extracted values as a dictionary.
At this point the ‘point_samp’ function could be called in a for loop to extract values one at a time from the raster path list. However this would be pretty slow, so instead the notebook uses ‘Pool’ from the multiprocessing library to extract values from multiple rasters at the same time. This multiprocessing is done from within tqdm to also produce a loading bar.
Now that all the data has been extracted, the notebook combines all the outputs back together into the initial ‘points’ geodataframe with the raster filename as the column heading.
Assuming the data above looks as expected, these next few cells export out the dataframe as a ‘.gpkg’ geopackage and then a ‘.csv’ comma-separated values.
You are now finished and should be able to load either of the exported files into your GIS viewing software of choice to check your results.