Filtering an OSM File By Tags

How to create a thematic extract from an OSM file.

Task

Given the country extract of Liechtenstein, create a fully usable OSM file that contains only the schools from that extract.

Quick Solution

In [1]:

import osmium

In [2]:

fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(osmium.filter.KeyFilter('amenity'))

with osmium.BackReferenceWriter("../data/out/schools_full.osm.pbf", ref_src='../data/liechtenstein.osm.pbf', overwrite=True) as writer:
    for obj in fp:
        if obj.tags['amenity'] == 'school':
            writer.add(obj)

When filtering objects from a file, it is important to include all objects that are referenced by the filtered objects. The BackReferenceWriter collects the references and writes out a complete file.

Background

Filtering school objects from a file is fairly easy. We need a file processor for the target file which returns all objects with an amenity key:

In [3]:

fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(osmium.filter.KeyFilter('amenity'))

The additional filtering for the school value can then be done in the processing loop.

Let's first check how many school objects there are:

In [4]:

from collections import Counter

cnt = Counter()

for obj in fp:
    if obj.tags['amenity'] == 'school':
        cnt.update([obj.type_str()])

f"Nodes: {cnt['n']}   Ways: {cnt['w']}  Relations: {cnt['r']}"

Out[4]:

'Nodes: 3   Ways: 19  Relations: 1'

The counter distinguishes by OSM object types. As we can see, schools exist as nodes (point geometries), ways (polygon geometries) and relations (multipolygon geometries). All of them need to appear in the output file.

The simple solution seems to be to write them all out into a file:

In [5]:

with osmium.SimpleWriter('../data/out/schools.opl', overwrite=True) as writer:
    for obj in fp:
        if obj.tags['amenity'] == 'school':
            writer.add(obj)

However, if you try to use the resulting file in another program, you may find that it complains that the data is incomplete. The schools that are saved as ways in the file reference nodes which are now missing. The school relation references ways which are missing. And these again reference nodes, which need to appear in the output file as well. The file needs to be made referentially complete.
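To make "referentially complete" concrete, here is a minimal sketch that works on a toy data structure instead of a real OSM file. The dictionary layout, the IDs and the helper name `missing_references` are assumptions made for illustration only; none of them are part of pyosmium.

```python
# Toy stand-in for an OSM extract (hypothetical layout, made-up IDs):
# object type -> {object id: list of (type, id) references}.
def missing_references(objects):
    """Return the set of (type, id) pairs that are referenced but not present."""
    missing = set()
    for members in objects.values():
        for refs in members.values():
            for ref_type, ref_id in refs:
                if ref_id not in objects.get(ref_type, {}):
                    missing.add((ref_type, ref_id))
    return missing


extract = {
    'n': {1: []},                               # a school mapped as a node
    'w': {10: [('n', 2), ('n', 3), ('n', 2)]},  # a school way; its nodes are absent
    'r': {100: [('w', 11)]},                    # a school relation; its way is absent
}

# Lists every reference that would make another program complain.
print(sorted(missing_references(extract)))
```

An extract is referentially complete when this set is empty. Note that the check is only one level deep: once way 11 is added, its own nodes would show up as missing next.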

Finding backward references manually

Let's try to collect the IDs of the missing nodes and ways manually first. This helps to understand how the process works. In a first pass, we can simply collect all the IDs we encounter when processing the schools:

In [6]:

references = {'n': set(), 'w': set(), 'r': set()} # save references by their object type

for obj in fp:
    if obj.tags['amenity'] == 'school':
        if obj.is_way():
            references['n'].update(n.ref for n in obj.nodes)
        elif obj.is_relation():
            for member in obj.members:
                references[member.type].add(member.ref)

f"Nodes: {len(references['n'])}   Ways: {len(references['w'])}  Relations: {len(references['r'])}"

Out[6]:

'Nodes: 325   Ways: 3  Relations: 0'

This gives us a set of all the direct references: the nodes of the school ways and the ways in the school relations. We are still missing the indirect references: the nodes from the ways of the school relations. It is not possible to collect those while scanning the file for the first time. By the time the relations are scanned and we know which additional ways are of interest, the ways have already been read. We could cache the node lists of all ways when scanning the file for the first time, but that can become quite a lot of data to remember. It is faster to simply scan the file again once we know which ways are of interest:

In [7]:

for obj in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.WAY):
    if obj.id in references['w']:
        references['n'].update(n.ref for n in obj.nodes)

f"Nodes: {len(references['n'])}   Ways: {len(references['w'])}  Relations: {len(references['r'])}"

Out[7]:

'Nodes: 395   Ways: 3  Relations: 0'

This time it is not possible to use a key filter because the ways that are part of the relations are not necessarily tagged with amenity=school. They might not have any tags at all. However, we can use a different trick and tell the file processor to only scan the ways in the file. This is the second parameter in the FileProcessor() constructor.

After this second scan of the file, we know the IDs of all the objects that need to go into the output file. The data we are interested in doesn't have nested relations. When relations contain other relations, then another scan of the file is required to collect the triple indirection. This part shall be left as an exercise to the reader for now.

Once all the necessary IDs are collected, the objects need to be extracted from the original file. This can be done with the IdFilter. It gets a list of all object IDs it is supposed to let pass. Given that we need nodes and ways from the original file, two filters are necessary:

In [8]:

ref_fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.NODE | osmium.osm.WAY)\
               .with_filter(osmium.filter.IdFilter(references['n']).enable_for(osmium.osm.NODE))\
               .with_filter(osmium.filter.IdFilter(references['w']).enable_for(osmium.osm.WAY))

The data from this FileProcessor needs to be merged with the filtered data originally written out. We cannot just concatenate the two files because the order of elements matters. Most applications that process OSM data expect the elements in a well-defined order: first nodes, then ways, then relations, all sorted by ID. When the input files are already ordered correctly, the zip_processors() function can be used to iterate over multiple FileProcessors in parallel and write out the data:

In [9]:

filtered_fp = osmium.FileProcessor('../data/out/schools.opl')

with osmium.SimpleWriter('../data/out/schools_full.osm.pbf', overwrite=True) as writer:
    for filtered_obj, ref_obj in osmium.zip_processors(filtered_fp, ref_fp):
        if filtered_obj:
            writer.add(filtered_obj)
        else:
            writer.add(ref_obj.replace(tags={}))

This writes the data from the filtered file if any exists, and otherwise takes the data from the original file. Objects from the original file have their tags removed. This avoids unwanted first-class objects in your file. All additionally added objects now exist for the sole purpose of completing the ones you have filtered.
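The merge logic can be illustrated without pyosmium. The following sketch walks two sorted lists in parallel and yields pairs in which one side is `None` when only the other list has the object. The function `zip_sorted`, the tuple layout `(type, id, tags)` and the sample IDs are hypothetical; this is an illustration of the idea, not the way zip_processors() is implemented.

```python
TYPE_ORDER = {'n': 0, 'w': 1, 'r': 2}  # nodes before ways before relations


def zip_sorted(first, second):
    """Walk two sorted lists of (type, id, tags) tuples in parallel.

    Yields pairs (a, b). An entry is None when that list has no object
    with the current type and ID.
    """
    def key(obj):
        return (TYPE_ORDER[obj[0]], obj[1])

    i = j = 0
    while i < len(first) or j < len(second):
        a = first[i] if i < len(first) else None
        b = second[j] if j < len(second) else None
        if b is None or (a is not None and key(a) < key(b)):
            yield a, None          # object only in the first list
            i += 1
        elif a is None or key(b) < key(a):
            yield None, b          # object only in the second list
            j += 1
        else:
            yield a, b             # same type and ID in both lists
            i += 1
            j += 1


# Made-up sample data: filtered schools and the objects they reference.
filtered = [('n', 1, {'amenity': 'school'}), ('w', 10, {'amenity': 'school'})]
referenced = [('n', 2, {'highway': 'crossing'}), ('n', 3, {}),
              ('w', 10, {'amenity': 'school'})]

merged = []
for kept, ref in zip_sorted(filtered, referenced):
    if kept:
        merged.append(kept)                    # keep the filtered object as is
    else:
        merged.append((ref[0], ref[1], {}))    # referenced only: drop its tags

print(merged)
```

The result stays sorted by type and ID, and node 2 loses its `highway` tag because it is only present to complete way 10.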

Finding backward references with the IdTracker

The IdTracker class will track backward references for you, just as described in the previous section.

In [11]:

references = osmium.IdTracker()

with osmium.SimpleWriter('../data/out/schools.opl', overwrite=True) as writer:
    for obj in fp:
        if obj.tags['amenity'] == 'school':
            writer.add(obj)
            references.add_references(obj)

references.complete_backward_references('../data/liechtenstein.osm.pbf', relation_depth=10)

The function complete_backward_references() repeatedly reads from the file to collect all referenced objects. In contrast to the simpler solution above, it can also collect references in nested relations. The relation_depth parameter controls how far the nesting should be followed. In this case, we have set it to 10, which should be sufficient even for the most complex relations in OSM. It is a good idea not to set this parameter too high because every level of depth requires an additional scan of the relations in the reference file.
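The effect of a depth limit can be sketched on toy data. In the example below, each iteration of the outer loop stands for one scan over the relations. The function `collect_references`, the dictionary layout and all IDs are hypothetical, and the exact semantics of relation_depth in pyosmium may differ from this simplified model.

```python
def collect_references(start_relations, relations, ways, relation_depth):
    """Collect the IDs referenced by the given relations.

    relations: {relation id: list of (type, id) members}
    ways:      {way id: list of node ids}
    Nested relations are followed at most relation_depth levels deep.
    """
    needed = {'n': set(), 'w': set(), 'r': set()}
    frontier = set(start_relations)
    seen = set(start_relations)

    for level in range(relation_depth + 1):
        next_frontier = set()
        for rel_id in frontier:
            for member_type, member_id in relations.get(rel_id, []):
                if member_type != 'r':
                    needed[member_type].add(member_id)
                elif level < relation_depth and member_id not in seen:
                    # Follow the nested relation in the next "scan".
                    needed['r'].add(member_id)
                    seen.add(member_id)
                    next_frontier.add(member_id)
        if not next_frontier:
            break  # nothing new was found, further scans are pointless
        frontier = next_frontier

    # One final pass over the ways resolves their nodes.
    for way_id in needed['w']:
        needed['n'].update(ways.get(way_id, []))
    return needed


# Made-up data: relation 100 contains relation 101, which contains 102.
relations = {
    100: [('w', 10), ('r', 101)],
    101: [('w', 11), ('r', 102)],
    102: [('n', 5)],
}
ways = {10: [1, 2, 1], 11: [3, 4, 3]}

for depth in (0, 1, 2):
    result = collect_references({100}, relations, ways, depth)
    print(depth, {k: sorted(v) for k, v in result.items()})
```

With a depth of 0 only way 10 and its nodes are found. Each additional level pulls in one more layer of nested relations, and the loop stops early once a scan discovers nothing new.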

With all the IDs collected, the final file can be written out as above. IdTracker can directly pose as a filter to a FileProcessor, so that the code can be slightly simplified:

In [12]:

fp1 = osmium.FileProcessor('../data/out/schools.opl')
fp2 = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(references.id_filter())

with osmium.SimpleWriter('../data/out/schools_full.opl', overwrite=True) as writer:
    for o1, o2 in osmium.zip_processors(fp1, fp2):
        if o1:
            writer.add(o1)
        else:
            writer.add(o2.replace(tags={}))

Using BackReferenceWriter to collect references

The BackReferenceWriter encapsulates a SimpleWriter and an IdTracker and writes out the referenced objects when close() is called. This reduces the task of filtering schools to the simple solution shown at the beginning.