Adding Place Node Information to Boundary Relations¶
How to merge information from different OSM objects.
Task¶
Administrative areas often represented with two different objects in OSM: a node describes the central point and a relation that contains all the ways that make up the boundary. The task is to find all administrative boundaries and their matching place nodes and output both togther in a geojson file. Relations and place nodes should be matched when they have the same wikidata tag.
Quick solution¶
import osmium
from dataclasses import dataclass
import json
@dataclass
class PlaceInfo:
id: int
tags: dict[str, str]
coords: tuple[float, float]
geojsonfab = osmium.geom.GeoJSONFactory()
class BoundaryHandler(osmium.SimpleHandler):
def __init__(self, outfile):
self.places = {}
self.outfile = outfile
# write the header of the geojson file
self.outfile.write('{"type": "FeatureCollection", "features": [')
# This is just to make sure, we place the commas on the right place.
self.delim = ''
def finish(self):
self.outfile.write(']}')
def node(self, n):
self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))
def area(self, a):
# Find the corresponding place node
place = self.places.get(a.tags.get('wikidata', 'not found'), None)
# Geojsonfab creates a string with the geojson geometry.
# Convert to a Python object to make it easier to add data.
geom = json.loads(geojsonfab.create_multipolygon(a))
if geom:
# print the array delimiter, if necessary
self.outfile.write(self.delim)
self.delim = ','
tags = dict(a.tags)
# add the place information to the propoerties
if place is not None:
tags['place_node:id'] = str(place.id)
tags['place_node:lat'] = str(place.coords[1])
tags['place_node:lon'] = str(place.coords[0])
for k, v in place.tags.items():
tags['place_node:tags:' + k] = v
# And wrap everything in proper GeoJSON.
feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}
self.outfile.write(json.dumps(feature))
# We are interested in boundary relations that make up areas and not in ways at all.
filters = [osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),
osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),
osmium.filter.EntityFilter(~osmium.osm.WAY),
osmium.filter.TagFilter(('boundary', 'administrative')).enable_for(osmium.osm.AREA | osmium.osm.RELATION)]
with open('../data/out/boundaries.geojson', 'w') as outf:
handler = BoundaryHandler(outf)
handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)
handler.finish()
Background¶
Whenever you want to look at more than one OSM object at the time, you need to cache objects. Before starting such a task, it is always worth taking a closer look at the objects of interest. Find out how many candidates there are for you to look at and save and how large these objects are. There are always multiple ways to cache your data. Sometimes, when the number of candidates is really large, it is even more sensible to reread the file instead of caching the information.
For the boundary problem, the calculation is relatively straightforward. Boundary relations are huge, so we do not want to cache them if it can somehow be avoided. That means we need to cache the place nodes. A quick look at TagInfo tells us that there are about 7 million place nodes in the OSM planet. That is not a lot in the grand scheme of things. We could just read them all into memory and be done with it. It is still worth to take a closer look. The place nodes are later matched up by their wikidata
tag. Looking into the TagInfo combinations table, only 10% of the place nodes have such a tag. That leaves 850.000 nodes to cache. Much better!
Next we need to consider what information actually needs caching. In our case we want it all: the ID, the tags and the coordinates of the node. This information needs to be copied out of the node. You cannot just cache the entire node. Pyosmium won't let you do this because it wants to get rid of it as soon as the handler has seen it. Lets create a dataclass to receive the information we need:
@dataclass
class PlaceInfo:
id: int
tags: dict[str, str]
coords: tuple[float, float]
This class can now be filled from the OSM file:
class PlaceNodeReader:
def __init__(self):
self.places = {}
def node(self, n):
self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))
reader = PlaceNodeReader()
osmium.apply('../data/liechtenstein.osm.pbf',
osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),
osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),
reader)
print(f"{len(reader.places)} places cached.")
29 places cached.
We use the osmium.apply()
function here with a handler instead of a FileProcessor. The two approaches are equivalent. Which one you choose, depends on your personal taste. FileProcessor loops are less verbose and quicker to write. Handlers tend to yield more readable code when you want to do very different things with the different kinds of objects.
As you can see in the code, it is entirely possible to use filter functions with the apply() functions. In our case, the filters make sure that only objects pass which have a place
tag and a wikidata
tag. This leaves exactly the objects we need already, so no further processing needed in the handler callback.
Next the relations need to be read. Relations can be huge, so we don't want to cache them but write them directly out into a file. If we want to create a geojson file, then we need the geometry of the relation in geojson format. Getting geojson format itself is easy. Pyosmium has a converter built-in for this, the GeoJSONFactory:
geojsonfab = osmium.geom.GeoJSONFactory()
The factory only needs to be instantiated once and can then be used globally.
To get the polygon from a relation, the special area handler is needed. It is easiest to invoke by writing a SimpleHandler class with an area()
callback. When apply_file()
is called on the handler, it will take the necessary steps in the background to build the polygon geometries.
class BoundaryHandler(osmium.SimpleHandler):
def __init__(self, places, outfile):
self.places = places
self.outfile = outfile
# write the header of the geojson file
self.outfile.write('{"type": "FeatureCollection", "features": [')
# This is just to make sure, we place the commas on the right place.
self.delim = ''
def finish(self):
self.outfile.write(']}')
def area(self, a):
# Find the corresponding place node
place = self.places.get(a.tags.get('wikidata', 'not found'), None)
# Geojsonfab creates a string with the geojson geometry.
# Convert to a Python object to make it easier to add data.
geom = json.loads(geojsonfab.create_multipolygon(a))
if geom:
# print the array delimiter, if necessary
self.outfile.write(self.delim)
self.delim = ','
tags = dict(a.tags)
# add the place information to the propoerties
if place is not None:
tags['place_node:id'] = str(place.id)
tags['place_node:lat'] = str(place.coords[1])
tags['place_node:lon'] = str(place.coords[0])
for k, v in place.tags.items():
tags['place_node:tags:' + k] = v
# And wrap everything in proper GeoJSON.
feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}
self.outfile.write(json.dumps(feature))
# We are interested in boundary relations that make up areas and not in ways at all.
filters = [osmium.filter.EntityFilter(osmium.osm.RELATION | osmium.osm.AREA),
osmium.filter.TagFilter(('boundary', 'administrative'))]
with open('../data/out/boundaries.geojson', 'w') as outf:
handler = BoundaryHandler(reader.places, outf)
handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)
handler.finish()
There are two things you should keep in mind, when working with areas:
- When the area handler is invoked, the input file is always read twice. The first pass checks the relations and find out which ways it contains. The second pass assembles all necessary ways and builds the geometries.
- The area handler automatically enables caching of node locations. You don't need to worry about this when working with small files like our Liechtenstein example. For larger files of continent- or planet-size, the node cache can become quite large. You should read up about the alternative implementations that can write out the node cache on disk to save RAM.
This is already it. In the long version, we have read the input file twice, once to get the nodes and in the second pass to get the relations. This is not really necessary because the nodes come always before the relations in the file. The quick solution shows how to combine both handlers to create the geojson file in a single pass. The only part to pay attention to is the use of filters. Given that we have very different filters for nodes and relations, it is important to call enable_for()
with the correct OSM type.