Merge pull request #798 from colouring-cities/os-data-loading
Document & test Ordnance Survey data loading
This commit is contained in:
commit a4771eaac0

.github/workflows/etl.yml (new file, +25 lines)
@@ -0,0 +1,25 @@
+name: etl
+
+on: [pull_request]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v2
+    - uses: actions/setup-python@v2
+      with:
+        python-version: '3.7'
+    - name: Install dependencies
+      run: |
+        sudo apt-get install libgeos-dev
+        python -m pip install --upgrade pip
+        python -m pip install pytest
+        python -m pip install flake8
+        python -m pip install -r etl/requirements.txt
+    - name: Run Flake8
+      run: |
+        ls etl/*py | grep -v 'join_building_data' | xargs flake8 --exclude etl/__init__.py
+    - name: Run tests
+      run: |
+        python -m pytest
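
The same checks can be reproduced locally before opening a pull request; a sketch, assuming it is run from the repository root with the same Python available:

```bash
# install the same dependencies the workflow uses
sudo apt-get install libgeos-dev
python -m pip install --upgrade pip
python -m pip install pytest flake8
python -m pip install -r etl/requirements.txt

# lint the etl scripts, skipping join_building_data and the package __init__
ls etl/*py | grep -v 'join_building_data' | xargs flake8 --exclude etl/__init__.py

# run the test suite
python -m pytest
```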

.gitignore (5 lines changed)
@@ -18,6 +18,11 @@ etl/**/*.txt
 etl/**/*.xls
 etl/**/*.xlsx
 etl/**/*.zip
+etl/**/*.gml
+etl/**/*.gz
+etl/**/5690395*
+postgresdata
+*/__pycache__/*

 .DS_Store

@@ -49,7 +49,9 @@ ssh <linuxusername>@localhost -p 4022
 - [:rainbow: Installing Colouring London](#rainbow-installing-colouring-london)
 - [:arrow_down: Installing Node.js](#arrow_down-installing-nodejs)
 - [:large_blue_circle: Configuring PostgreSQL](#large_blue_circle-configuring-postgresql)
+- [:space_invader: Create an empty database](#space_invader-create-an-empty-database)
 - [:arrow_forward: Configuring Node.js](#arrow_forward-configuring-nodejs)
+- [:snake: Set up Python](#snake-set-up-python)
 - [:house: Loading the building data](#house-loading-the-building-data)
 - [:computer: Running the application](#computer-running-the-application)
 - [:eyes: Viewing the application](#eyes-viewing-the-application)
@@ -66,7 +68,7 @@ sudo apt-get upgrade -y
 Now install some essential tools.

 ```bash
-sudo apt-get install -y build-essential git wget curl
+sudo apt-get install -y build-essential git wget curl parallel rename
 ```

 ### :red_circle: Installing PostgreSQL
@@ -157,7 +159,7 @@ Ensure the `en_US` locale exists.
 sudo locale-gen en_US.UTF-8
 ```

-Configure the database to listen on network connection.
+Configure postgres to listen on network connection.

 ```bash
 sudo sed -i "s/#\?listen_address.*/listen_addresses '*'/" /etc/postgresql/12/main/postgresql.conf
@@ -189,6 +191,10 @@ If you intend to load the full CL database from a dump file into your dev environment
 </details><p></p>

+### :space_invader: Create an empty database
+
+Now create an empty database configured with geo-spatial tools. The database name (`<colouringlondondb>`) is arbitrary.
+
 Set environment variables, which will simplify running subsequent `psql` commands.

 ```bash
@@ -198,7 +204,7 @@ export PGHOST=localhost
 export PGDATABASE=<colouringlondondb>
 ```

-Create a colouring london database if none exists. The name (`<colouringlondondb>`) is arbitrary.
+Create the database.

 ```bash
 sudo -u postgres psql -c "SELECT 1 FROM pg_database WHERE datname = '<colouringlondondb>';" | grep -q 1 || sudo -u postgres createdb -E UTF8 -T template0 --locale=en_US.utf8 -O <username> <colouringlondondb>
@@ -228,10 +234,22 @@ cd ~/colouring-london/app
 npm install
 ```

+### :snake: Set up Python
+
+Install python and related tools.
+
+```bash
+sudo apt-get install -y python3 python3-pip python3-dev python3-venv
+```
+
 ## :house: Loading the building data

+There are several ways to create the Colouring London database in your environment. The simplest way, if you are just trying out the application, is to use test data from OSM; otherwise you should follow one of the sets of instructions below to create the full database, either from scratch or from a previously made db (via a dump file).
+
+To create the full database from scratch, follow [these instructions](../etl/README.md), otherwise choose one of the following:
+
 <details>
-<summary> With a database dump </summary><p></p>
+<summary> Create database from dump </summary><p></p>

 If you are a developer on the Colouring London project (or another Colouring Cities project), you may have a production database (or staging etc) that you wish to duplicate in your development environment.
@@ -261,22 +279,16 @@ ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration;
 </details>

 <details>
-<summary> With test data </summary><p></p>
+<summary> Create database with test data </summary><p></p>

 This section shows how to load test buildings into the application from OpenStreetMap (OSM).

-#### Set up Python
+#### Load OpenStreetMap test polygons

-Install python and related tools.
+Create a virtual environment for python in the `etl` folder of your repository. In the following example we have named the virtual environment *colouringlondon* but it can have any name.

-```bash
-sudo apt-get install -y python3 python3-pip python3-dev python3-venv
-```
-
-Now set up a virtual environment for python. In the following example we have named the
-virtual environment *colouringlondon* but it can have any name.
-
 ```bash
+cd ~/colouring-london/etl
 pyvenv colouringlondon
 ```
|
|||||||
pip install --upgrade setuptools wheel
|
pip install --upgrade setuptools wheel
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Load OpenStreetMap test polygons
|
Install the required python packages.
|
||||||
|
|
||||||
First install prerequisites.
|
|
||||||
```bash
|
|
||||||
sudo apt-get install -y parallel
|
|
||||||
```
|
|
||||||
|
|
||||||
Install the required python packages. This relies on the `requirements.txt` file located
|
|
||||||
in the `etl` folder of your local repository.
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd ~/colouring-london/etl/
|
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
|

etl/README.md (142 lines changed)
@@ -1,91 +1,109 @@
-# Data loading
+# Extract, transform and load

-The scripts in this directory are used to extract, transform and load (ETL) the core datasets
-for Colouring London:
+The scripts in this directory are used to extract, transform and load (ETL) the core datasets for Colouring London. This README acts as a guide for setting up the Colouring London database with these datasets and updating it.

-1. Building geometries, sourced from Ordnance Survey MasterMap (Topography Layer)
-1. Unique Property Reference Numbers (UPRNs), sourced from Ordnance Survey AddressBase
+# Contents
+
+- :arrow_down: [Downloading Ordnance Survey data](#arrow_down-downloading-ordnance-survey-data)
+- :penguin: [Making data available to Ubuntu](#penguin-making-data-available-to-ubuntu)
+- :new_moon: [Creating a Colouring London database from scratch](#new_moon-creating-a-colouring-london-database-from-scratch)
+- :full_moon: [Updating the Colouring London database with new OS data](#full_moon-updating-the-colouring-london-database-with-new-os-data)
+
+# :arrow_down: Downloading Ordnance Survey data
+
+The building geometries are sourced from Ordnance Survey (OS) MasterMap (Topography Layer). To get the required datasets, you'll need to complete the following steps:
+
+1. Sign up for the Ordnance Survey [Data Exploration License](https://www.ordnancesurvey.co.uk/business-government/licensing-agreements/data-exploration-sign-up). You should receive an e-mail with a link to log in to the platform (this could take up to a week).
+2. Navigate to https://orders.ordnancesurvey.co.uk/orders and click the button for: ✏️ Order. From here you should be able to click another button to add a product.
+3. Drop a rectangle or Polygon over London and make the following selections, clicking the "Add to basket" button for each:
+
+![](screenshot/MasterMap.png)
+<p></p>
+
+4. You should then be able to check out your basket and download the files. Note: there may be multiple `.zip` files to download for MasterMap due to the size of the dataset.
+5. Unzip the MasterMap `.zip` files and move all the `.gz` files from each to a single folder in a convenient location. We will use this folder in later steps.
+
+# :penguin: Making data available to Ubuntu
+
+Before creating or updating a Colouring London database, you'll need to make sure the downloaded OS files are available to the Ubuntu machine where the database is hosted. If you are using Virtualbox, you could share folder(s) containing the OS files with the VM (e.g. [see these instructions for Mac](https://medium.com/macoclock/share-folder-between-macos-and-ubuntu-4ce84fb5c1ad)).
+
+# :new_moon: Creating a Colouring London database from scratch
+
 ## Prerequisites

-Install PostgreSQL and create a database for colouringlondon, with a database
-user that can connect to it. The [PostgreSQL
-documentation](https://www.postgresql.org/docs/12/tutorial-start.html) covers
-installation and getting started.
-
-Install the [PostGIS extension](https://postgis.net/).
-
-Connect to the colouringlondon database and add the PostGIS, pgcrypto and
-pg_trgm extensions:
-
-```sql
-create extension postgis;
-create extension pgcrypto;
-create extension pg_trgm;
+You should already have set up PostgreSQL and created a database in an Ubuntu environment. Make sure to create environment variables to use `psql` if you haven't already:
+
+```bash
+export PGPASSWORD=<pgpassword>
+export PGUSER=<username>
+export PGHOST=localhost
+export PGDATABASE=<colouringlondondb>
 ```

 Create the core database tables:

 ```bash
-psql < ../migrations/001.core.up.sql
+cd ~/colouring-london
+psql < migrations/001.core.up.sql
 ```

 There is some performance benefit to creating indexes after bulk loading data.
 Otherwise, it's fine to run all the migrations at this point and skip the index
 creation steps below.

-Install GNU parallel, this is used to speed up loading bulk data.
+You should already have installed GNU parallel, which is used to speed up loading bulk data.

-## Process and load Ordance Survey data
+## Processing and loading Ordnance Survey data

-Before running any of these scripts, you will need the OS data for your area of
-interest. AddressBase and MasterMap are available directly from [Ordnance
-Survey](https://www.ordnancesurvey.co.uk/). The alternative setup below uses
-OpenStreetMap.
-
-The scripts should be run in the following order:
+Move into the `etl` directory and set execute permission on all scripts.

 ```bash
-# extract both datasets
-extract_addressbase.sh ./addressbase_dir
-extract_mastermap.sh ./mastermap_dir
-# filter mastermap ('building' polygons and any others referenced by addressbase)
-filter_transform_mastermap_for_loading.sh ./addressbase_dir ./mastermap_dir
-# load all building outlines
-load_geometries.sh ./mastermap_dir
-# index geometries (should be faster after loading)
-psql < ../migrations/002.index-geometries.sql
-# create a building record per outline
-create_building_records.sh
-# add UPRNs where they match
-load_uprns.py ./addressbase_dir
-# index building records
-psql < ../migrations/003.index-buildings.sql
+cd ~/colouring-london/etl
+chmod +x *.sh
 ```

-## Alternative, using OpenStreetMap
-
-This uses the [osmnx](https://github.com/gboeing/osmnx) python package to get OpenStreetMap data. You will need python and osmnx to run `get_test_polygons.py`.
-
-To help test the Colouring London application, `get_test_polygons.py` will attempt to save a
-small (1.5km²) extract from OpenStreetMap to a format suitable for loading to the database.
-
-In this case, run:
+Extract the MasterMap data (this step could take a while).
+
+```bash
+sudo ./extract_mastermap.sh /path/to/mastermap_dir
+```
+
+Filter MasterMap 'building' polygons.
+
+```bash
+sudo ./filter_transform_mastermap_for_loading.sh /path/to/mastermap_dir
+```
+
+Load all building outlines. Note: you should ensure that `mastermap_dir` has permissions that will allow the linux `find` command to work without using sudo.
+
+```bash
+./load_geometries.sh /path/to/mastermap_dir
+```
+
+Index geometries.

 ```bash
-# download test data
-python get_test_polygons.py
-# load all building outlines
-./load_geometries.sh ./
-# index geometries (should be faster after loading)
 psql < ../migrations/002.index-geometries.up.sql
-# create a building record per outline
-./create_building_records.sh
-# index building records
-psql < ../migrations/003.index-buildings.up.sql
 ```

-## Finally
+<!-- TODO: Drop outside limit. -->

-Run the remaining migrations in `../migrations` to create the rest of the database structure.
+<!-- ```bash
+./drop_outside_limit.sh /path/to/boundary_file
+``` -->
+
+Create a building record per outline.
+
+```bash
+./create_building_records.sh
+```
+
+Run the remaining migrations in `../migrations` to create the rest of the database structure.
+
+```bash
+ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;
+```
+
+# :full_moon: Updating the Colouring London database with new OS data
+
+TODO: this section should instruct how to update an existing db
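
For the ":penguin: Making data available to Ubuntu" step above, a VirtualBox shared folder can be mounted from inside the guest; a sketch, where the share name `os_data` and the mount point are assumptions:

```bash
# requires VirtualBox Guest Additions in the VM; "os_data" is a hypothetical share name
sudo mkdir -p /mnt/os_data
sudo mount -t vboxsf -o uid=$(id -u),gid=$(id -g) os_data /mnt/os_data
```

Similarly, for the note about `mastermap_dir` permissions before `load_geometries.sh`, one way to let `find` traverse the folder without sudo is:

```bash
# make the MasterMap folder readable and its directories searchable for all users
sudo chmod -R a+rX /path/to/mastermap_dir
```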

etl/__init__.py (new file, +1 line)
@@ -0,0 +1 @@
+from .filter_mastermap import filter_mastermap

etl/check_ab_mm_match.py (deleted)
@@ -1,60 +0,0 @@
-"""Check if AddressBase TOIDs will match MasterMap
-"""
-import csv
-import glob
-import os
-import sys
-
-from multiprocessing import Pool
-
-csv.field_size_limit(sys.maxsize)
-
-
-def main(ab_path, mm_path):
-    ab_paths = sorted(glob.glob(os.path.join(ab_path, "*.gml.csv.filtered.csv")))
-    mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
-
-    try:
-        assert len(ab_paths) == len(mm_paths)
-    except AssertionError:
-        print(ab_paths)
-        print(mm_paths)
-
-    zipped_paths = zip(ab_paths, mm_paths)
-
-    # parallel map over tiles
-    with Pool() as p:
-        p.starmap(check, zipped_paths)
-
-
-def check(ab_path, mm_path):
-    tile = str(os.path.basename(ab_path)).split(".")[0]
-    output_base = os.path.dirname(ab_path)
-    ab_toids = set()
-    mm_toids = set()
-
-    with open(ab_path, 'r') as fh:
-        r = csv.DictReader(fh)
-        for line in r:
-            ab_toids.add(line['toid'])
-
-    with open(mm_path, 'r') as fh:
-        r = csv.DictReader(fh)
-        for line in r:
-            mm_toids.add(line['fid'])
-
-    missing = ab_toids - mm_toids
-    print(tile, "MasterMap:", len(mm_toids), "Addressbase:", len(ab_toids), "AB but not MM:", len(missing))
-
-    with open(os.path.join(output_base, 'missing_toids_{}.txt'.format(tile)), 'w') as fh:
-        for toid in missing:
-            fh.write("{}\n".format(toid))
-
-    with open(os.path.join(output_base, 'ab_toids_{}.txt'.format(tile)), 'w') as fh:
-        for toid in ab_toids:
-            fh.write("{}\n".format(toid))
-
-
-if __name__ == '__main__':
-    if len(sys.argv) != 3:
-        print("Usage: check_ab_mm_match.py ./path/to/addressbase/dir ./path/to/mastermap/dir")
-        exit(-1)
-    main(sys.argv[1], sys.argv[2])

etl/extract_addressbase.sh (deleted)
@@ -1,63 +0,0 @@
-#!/usr/bin/env bash
-
-#
-# Extract address points from OS Addressbase GML
-# - as supplied in 5km tiles, zip/gz archives
-#
-: ${1?"Usage: $0 ./path/to/data/dir"}
-
-data_dir=$1
-
-#
-# Unzip to GML
-#
-
-find $data_dir -type f -name '*.zip' -printf "%f\n" | \
-parallel \
-unzip -u $data_dir/{} -d $data_dir
-
-#
-# Extract to CSV
-#
-# Relevant fields:
-#   WKT
-#   crossReference (list of TOID/other references)
-#   source (list of cross-reference sources: 7666MT refers to MasterMap Topo)
-#   uprn
-#   parentUPRN
-#   logicalStatus: 1 (one) is approved (otherwise historical, provisional)
-#
-
-find $data_dir -type f -name '*.gml' -printf "%f\n" | \
-parallel \
-ogr2ogr -f CSV \
--select crossReference,source,uprn,parentUPRN,logicalStatus \
-$data_dir/{}.csv $data_dir/{} BasicLandPropertyUnit \
--lco GEOMETRY=AS_WKT
-
-#
-# Filter
-#
-find $data_dir -type f -name '*.gml.csv' -printf "%f\n" | \
-parallel \
-python filter_addressbase_csv.py $data_dir/{}
-
-
-#
-# Transform to 3857 (web mercator)
-#
-find $data_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
-parallel \
-ogr2ogr \
--f CSV $data_dir/{}.3857.csv \
--s_srs "EPSG:4326" \
--t_srs "EPSG:3857" \
-$data_dir/{} \
--lco GEOMETRY=AS_WKT
-
-#
-# Update to EWKT (with SRID indicator for loading to Postgres)
-#
-find $data_dir -type f -name '*.3857.csv' -printf "%f\n" | \
-parallel \
-cat $data_dir/{} "|" sed "'s/^\"POINT/\"SRID=3857;POINT/'" "|" cut -f 1,3,4,5 -d "','" ">" $data_dir/{}.loadable

etl/extract_mastermap.sh
@@ -1,29 +1,29 @@
 #!/usr/bin/env bash

-#
-# Extract MasterMap
-#
 : ${1?"Usage: $0 ./path/to/mastermap/dir"}

 data_dir=$1

-#
-# Extract buildings from *.gz to CSV
-#
+echo "Extract buildings from *.gz..."
 # Features where:
 #   descriptiveGroup = '(1:Building)'
 #
 # Use `fid` as source ID, aka TOID.
-#

 find $data_dir -type f -name '*.gz' -printf "%f\n" | \
 parallel \
 gunzip $data_dir/{} -k -S gml

+echo "Rename extracted files to .gml..."
 rename 's/$/.gml/' $data_dir/*[^gzvt]

-find $data_dir -type f -name '*.gml' -printf "%f\n" | \
+# Note: previously the rename cmd above resulted in some temp files being renamed to .gml
+# so I have specified the start of the filename (appears to be consistent for all OS MasterMap downloads)
+# we may need to update this below for other downloads
+echo "Convert .gml files to .csv"
+find $data_dir -type f -name '*5690395*.gml' -printf "%f\n" | \
 parallel \
 ogr2ogr \
 -select fid,descriptiveGroup \
@@ -32,5 +32,6 @@ ogr2ogr \
 TopographicArea \
 -lco GEOMETRY=AS_WKT

+echo "Remove .gfs and .gml files from previous steps..."
 rm $data_dir/*.gfs
 rm $data_dir/*.gml

etl/filter_addressbase_csv.py (deleted)
@@ -1,42 +0,0 @@
-#!/usr/bin/env python
-"""Read ogr2ogr-converted CSV, filter to get OSMM TOID reference, only active addresses
-"""
-import csv
-import json
-import sys
-
-
-def main(input_path):
-    output_path = "{}.filtered.csv".format(input_path)
-    fieldnames = (
-        'wkt', 'toid', 'uprn', 'parent_uprn'
-    )
-    with open(input_path) as input_fh:
-        with open(output_path, 'w') as output_fh:
-            w = csv.DictWriter(output_fh, fieldnames=fieldnames)
-            w.writeheader()
-            r = csv.DictReader(input_fh)
-            for line in r:
-                if line['logicalStatus'] != "1":
-                    continue
-
-                refs = json.loads(line['crossReference'])
-                sources = json.loads(line['source'])
-                toid = ""
-                for ref, source in zip(refs, sources):
-                    if source == "7666MT":
-                        toid = ref
-
-                w.writerow({
-                    'uprn': line['uprn'],
-                    'parent_uprn': line['parentUPRN'],
-                    'toid': toid,
-                    'wkt': line['WKT'],
-                })
-
-
-if __name__ == '__main__':
-    if len(sys.argv) != 2:
-        print("Usage: filter_addressbase_csv.py ./path/to/data.csv")
-        exit(-1)
-    main(sys.argv[1])

etl/filter_mastermap.py
@@ -1,60 +1,44 @@
-"""Filter MasterMap to buildings and addressbase-matches
+"""Filter MasterMap to buildings

 - WHERE descriptiveGroup includes 'Building'
-- OR toid in addressbase_toids
 """
 import csv
 import glob
-import json
 import os
 import sys

-from multiprocessing import Pool
-
 csv.field_size_limit(sys.maxsize)


-def main(ab_path, mm_path):
-    mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
-    toid_paths = sorted(glob.glob(os.path.join(ab_path, "ab_toids_*.txt")))
-
-    try:
-        assert len(mm_paths) == len(toid_paths)
-    except AssertionError:
-        print(mm_paths)
-        print(toid_paths)
-
-    zipped_paths = zip(mm_paths, toid_paths)
-
-    # parallel map over tiles
-    with Pool() as p:
-        p.starmap(filter, zipped_paths)
+def main(mastermap_path):
+    mm_paths = sorted(glob.glob(os.path.join(mastermap_path, "*.gml.csv")))
+    for mm_path in mm_paths:
+        filter_mastermap(mm_path)


-def filter(mm_path, toid_path):
-    with open(toid_path, 'r') as fh:
-        r = csv.reader(fh)
-        toids = set(line[0] for line in r)
-
-    output_path = "{}.filtered.csv".format(str(mm_path).replace(".gml.csv", ""))
-    alt_output_path = "{}.filtered_not_building.csv".format(str(mm_path).replace(".gml.csv", ""))
+def filter_mastermap(mm_path):
+    output_path = str(mm_path).replace(".gml.csv", "")
+    output_path = "{}.filtered.csv".format(output_path)
     output_fieldnames = ('WKT', 'fid', 'descriptiveGroup')
+    # Open the input csv with all polygons, buildings and others
     with open(mm_path, 'r') as fh:
         r = csv.DictReader(fh)
+        # Open a new output csv that will contain just buildings
         with open(output_path, 'w') as output_fh:
             w = csv.DictWriter(output_fh, fieldnames=output_fieldnames)
             w.writeheader()
-            with open(alt_output_path, 'w') as alt_output_fh:
-                alt_w = csv.DictWriter(alt_output_fh, fieldnames=output_fieldnames)
-                alt_w.writeheader()
-                for line in r:
-                    if 'Building' in line['descriptiveGroup']:
-                        w.writerow(line)
-                    elif line['fid'] in toids:
-                        alt_w.writerow(line)
+            for line in r:
+                try:
+                    if 'Building' in line['descriptiveGroup']:
+                        w.writerow(line)
+                # when descriptiveGroup is missing, ignore this Polygon
+                except TypeError:
+                    pass


 if __name__ == '__main__':
-    if len(sys.argv) != 3:
-        print("Usage: filter_mastermap.py ./path/to/addressbase/dir ./path/to/mastermap/dir")
+    if len(sys.argv) != 2:
+        print("Usage: filter_mastermap.py ./path/to/mastermap/dir")
         exit(-1)
-    main(sys.argv[1], sys.argv[2])
+    main(sys.argv[1])
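
The effect of the new `filter_mastermap` (including the `TypeError` branch for rows with no `descriptiveGroup` field) can be checked by hand on a throwaway CSV; a sketch, run from the repository root so the `etl` package is importable, with made-up TOIDs:

```bash
# a tiny MasterMap-style CSV: one Building row and one short row with no descriptiveGroup
cat > /tmp/demo.gml.csv <<'EOF'
WKT,fid,descriptiveGroup
"POLYGON ((0 0,1 0,1 1,0 0))",osgb0000000000000001,"[ ""Building"" ]"
"POLYGON ((2 0,3 0,3 1,2 0))",osgb0000000000000002
EOF

python -c "from etl import filter_mastermap; filter_mastermap('/tmp/demo.gml.csv')"
cat /tmp/demo.filtered.csv  # header plus the one Building row; the short row is skipped
```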

etl/filter_transform_mastermap_for_loading.sh
@@ -1,29 +1,13 @@
 #!/usr/bin/env bash

-#
-# Filter and transform for loading
-#
-: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
-: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
+: ${1?"Usage: $0 ./path/to/mastermap/dir"}

-addressbase_dir=$1
-mastermap_dir=$2
+mastermap_dir=$1

-#
-# Check which TOIDs are matched against UPRNs
-#
-python check_ab_mm_match.py $addressbase_dir $mastermap_dir
-
-#
-# Filter
-# - WHERE descriptiveGroup = '(1:Building)'
-# - OR toid in addressbase_toids
-#
-python filter_mastermap.py $addressbase_dir $mastermap_dir
-
-#
-# Transform to 3857 (web mercator)
-#
+echo "Filter WHERE descriptiveGroup = '(1:Building)'... "
+python filter_mastermap.py $mastermap_dir
+
+echo "Transform to 3857 (web mercator)..."
 find $mastermap_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
 parallel \
 ogr2ogr \
@@ -34,13 +18,13 @@ ogr2ogr \
 $mastermap_dir/{} \
 -lco GEOMETRY=AS_WKT

-#
-# Update to EWKT (with SRID indicator for loading to Postgres)
-#
+echo "Update to EWKT (with SRID indicator for loading to Postgres)..."
+echo "Updating POLYGONs.."
 find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
 parallel \
 sed -i "'s/^\"POLYGON/\"SRID=3857;POLYGON/'" $mastermap_dir/{}

+echo "Updating MULTIPOLYGONs.."
 find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
 parallel \
 sed -i "'s/^\"MULTIPOLYGON/\"SRID=3857;MULTIPOLYGON/'" $mastermap_dir/{}
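
The EWKT update in the last two steps just prefixes each WKT value with its SRID so PostGIS can interpret the geometry on COPY; for a single CSV line the rewrite looks like this (illustrative geometry and TOID):

```bash
echo '"POLYGON ((0 0,1 0,1 1,0 0))",osgb0000000000000001' \
    | sed 's/^"POLYGON/"SRID=3857;POLYGON/'
# "SRID=3857;POLYGON ((0 0,1 0,1 1,0 0))",osgb0000000000000001
```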

etl/get_test_polygons.py
@@ -25,11 +25,12 @@ gdf = osmnx.footprints_from_point(point=point, dist=dist)

 # preview image
 gdf_proj = osmnx.projection.project_gdf(gdf, to_crs={'init': 'epsg:3857'})
-gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')]
+gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')]  # noqa

-fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333', color='w', figsize=(4,4),
-                                save=True, show=False, close=True,
-                                filename='test_buildings_preview', dpi=600)
+fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333',
+                                color='w', figsize=(4, 4),
+                                save=True, show=False, close=True,
+                                filename='test_buildings_preview', dpi=600)

 # save
 test_dir = os.path.dirname(__file__)
@@ -50,7 +51,13 @@ gdf_to_save.rename(
 # convert to CSV
 test_data_csv = str(os.path.join(test_dir, 'test_buildings.3857.csv'))
 subprocess.run(["rm", test_data_csv])
-subprocess.run(["ogr2ogr", "-f", "CSV", test_data_csv, test_data_geojson, "-lco", "GEOMETRY=AS_WKT"])
+subprocess.run(
+    ["ogr2ogr", "-f", "CSV", test_data_csv,
+     test_data_geojson, "-lco", "GEOMETRY=AS_WKT"]
+)

 # add SRID for ease of loading to PostgreSQL
-subprocess.run(["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/", test_data_csv])
+subprocess.run(
+    ["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/",
+     test_data_csv]
+)

etl/load_geometries.sh
@@ -1,27 +1,25 @@
 #!/usr/bin/env bash

-#
 # Load geometries from GeoJSON to Postgres
 # - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
-#
 : ${1?"Usage: $0 ./path/to/mastermap/dir"}

 mastermap_dir=$1

-#
 # Create 'geometry' record with
 #     id: <polygon-guid>,
 #     source_id: <toid>,
 #     geom: <geom>
-#
+echo "Copy geometries to db..."
 find $mastermap_dir -type f -name '*.3857.csv' \
 -printf "$mastermap_dir/%f\n" | \
 parallel \
 cat {} '|' psql -c "\"COPY geometries ( geometry_geom, source_id ) FROM stdin WITH CSV HEADER;\""

-#
 # Delete any duplicated geometries (by TOID)
-#
+echo "Delete duplicate geometries..."
 psql -c "DELETE FROM geometries a USING (
     SELECT MIN(ctid) as ctid, source_id
     FROM geometries
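
For a single tile, the `find | parallel` COPY pipeline above boils down to the following (a sketch; assumes the `PG*` environment variables from the README are set, and the filename is illustrative):

```bash
# stream one transformed CSV straight into the geometries table
cat /path/to/mastermap_dir/tile.3857.csv \
    | psql -c "COPY geometries ( geometry_geom, source_id ) FROM stdin WITH CSV HEADER;"
```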

etl/load_uprns.sh (deleted)
@@ -1,36 +0,0 @@
-#!/usr/bin/env bash
-
-#
-# Load UPRNS from CSV to Postgres
-# - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
-#
-: ${1?"Usage: $0 ./path/to/addressbase/dir"}
-
-data_dir=$1
-
-#
-# Create 'building_properties' record with
-#     uprn: <uprn>,
-#     parent_uprn: <parent_uprn>,
-#     toid: <toid>,
-#     uprn_geom: <point>
-#
-find $data_dir -type f -name '*.3857.csv.loadable' \
--printf "$data_dir/%f\n" | \
-parallel \
-cat {} '|' psql -c "\"COPY building_properties ( uprn_geom, toid, uprn, parent_uprn ) FROM stdin WITH CSV HEADER;\""
-
-#
-# Create references
-#
-
-# index essential for speeed here
-psql -c "CREATE INDEX IF NOT EXISTS building_toid_idx ON buildings ( ref_toid );"
-# link to buildings
-psql -c "UPDATE building_properties
-    SET building_id = (
-        SELECT b.building_id
-        FROM buildings as b
-        WHERE
-        building_properties.toid = b.ref_toid
-    );"

@@ -3,13 +3,11 @@
 #
 # Extract, transform and load building outlines and property records
 #
-: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
-: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
-: ${3?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
+: ${1?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}
+: ${2?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}

-addressbase_dir=$1
-mastermap_dir=$2
-boundary_file=$3
+mastermap_dir=$1
+boundary_file=$2
 script_dir=${0%/*}

 #
@@ -17,10 +15,9 @@ script_dir=${0%/*}
 #

 # extract both datasets
-$script_dir/extract_addressbase.sh $addressbase_dir
 $script_dir/extract_mastermap.sh $mastermap_dir
 # filter mastermap ('building' polygons and any others referenced by addressbase)
-$script_dir/filter_transform_mastermap_for_loading.sh $addressbase_dir $mastermap_dir
+$script_dir/filter_transform_mastermap_for_loading.sh $mastermap_dir

 #
 # Load
@@ -33,7 +30,5 @@ psql < $script_dir/../migrations/002.index-geometries.up.sql
 $script_dir/drop_outside_limit.sh $boundary_file
 # create a building record per outline
 $script_dir/create_building_records.sh
-# add UPRNs where they match
-$script_dir/load_uprns.sh $addressbase_dir
-# index building records
-psql < $script_dir/../migrations/003.index-buildings.up.sql
+# Run remaining migrations
+ls $script_dir/../migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;

@@ -3,11 +3,8 @@
 #
 # Filter and transform for loading
 #
-: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
-: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
+: ${1?"Usage: $0 ./path/to/mastermap/dir"}

-addressbase_dir=$1
-mastermap_dir=$2
+mastermap_dir=$1

-rm -f $addressbase_dir/*.{csv,gml,txt,filtered,gfs}
 rm -f $mastermap_dir/*.{csv,gml,txt,filtered,gfs}

etl/screenshot/MasterMap.png (new binary file, 38 KiB; not shown)

tests/test_filter.py (new file, +23 lines)
@@ -0,0 +1,23 @@
+import csv
+import pytest
+from etl import filter_mastermap
+
+
+def test_filter_mastermap():
+    """Test that MasterMap CSV can be correctly filtered to include only buildings."""
+    input_file = "tests/test_mastermap.gml.csv"  # Test csv with two buildings and one non-building
+    output_file = input_file.replace('gml', 'filtered')
+    filter_mastermap(input_file)  # creates output_file
+    with open(output_file, newline='') as csvfile:
+        csv_array = list(csv.reader(csvfile))
+    assert len(csv_array) == 3  # assert that length is 3 because just two building rows after header
+
+
+def test_filter_mastermap_missing_descriptivegroup():
+    """Test that MasterMap CSV can be correctly filtered when the polygon does not have a type specified."""
+    input_file = "tests/test_mastermap_missing_descriptivegroup.gml.csv"  # Test csv with one polygon missing its descriptiveGroup
+    output_file = input_file.replace('gml', 'filtered')
+    filter_mastermap(input_file)  # creates output_file
+    with open(output_file, newline='') as csvfile:
+        csv_array = list(csv.reader(csvfile))
+    assert len(csv_array) == 1  # assert that length is 1 because just header
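
These tests are what the new `etl` workflow runs on each pull request; locally they can be run the same way (assuming pytest is installed and you are at the repository root):

```bash
python -m pytest                            # whole suite, as in CI
python -m pytest tests/test_filter.py -v    # just the new filter tests
```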

tests/test_mastermap.filtered.csv (new file, +3 lines)
@@ -0,0 +1,3 @@
+WKT,fid,descriptiveGroup
+"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
+"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"

tests/test_mastermap.gml.csv (new file, +4 lines)
@@ -0,0 +1,4 @@
+WKT,fid,descriptiveGroup
+"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
+"POLYGON ((484703.76 184849.9,484703.46 184849.7,484703.26 184849.4,484703.06 184849.2,484702.86 184848.9,484702.76 184848.6,484702.66 184848.2,484702.66 184847.3,484702.76 184847.0,484702.96 184846.7,484703.06 184846.4,484703.36 184846.2,484703.56 184846.0,484704.16 184845.6,484704.46 184845.5,484705.46 184845.5,484706.06 184845.7,484706.26 184845.8,484706.76 184846.3,484706.96 184846.6,484707.16 184846.8,484707.26 184847.2,484707.36 184847.5,484707.36 184848.4,484707.26 184848.7,484707.16 184848.9,484706.76 184849.5,484706.46 184849.7,484706.26 184849.9,484705.66 184850.2,484704.66 184850.2,484703.76 184849.9))",osgb1000000152730957,"[ ""General Surface"" ]"
+"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"

tests/test_mastermap_missing_descriptivegroup.filtered.csv (new file, +1 line)
@@ -0,0 +1 @@
+WKT,fid,descriptiveGroup

tests/test_mastermap_missing_descriptivegroup.gml.csv (new file, +2 lines)
@@ -0,0 +1,2 @@
+WKT,fid,descriptiveGroup
+"POLYGON ((517896.1 186250.8,517891.7 186251.6,517891.1 186248.7,517890.75 186246.7,517890.65 186246.35,517890.45 186245.95,517890.25 186245.8,517889.95 186245.75,517889.65 186245.75,517878.3 186247.9,517874.61 186248.55,517872.9 186239.5,517873.4 186239.7,517873.95 186239.8,517874.25 186239.75,517874.65 186239.7,517875.05 186239.6,517878.35 186238.95,517889.1 186236.85,517892.769 186236.213,517903.2 186234.4,517919.55 186231.4,517932.25 186229.1,517942.1 186227.25,517954.65 186225.05,517968.75 186222.45,517985.25 186219.5,518000.0 186216.65,518021.7 186212.7,518026.7 186211.75,518029.1 186211.3,518029.68 186211.173,518033.65 186210.3,518046.1 186207.65,518058.45 186204.95,518063.3 186203.6,518068.1 186202.25,518068.9 186202.05,518079.6 186198.95,518081.4 186198.3,518083.2 186197.55,518084.95 186196.8,518086.7 186196.0,518088.45 186195.25,518097.85 186191.05,518099.15 186190.45,518108.3 186186.2,518108.375 186186.175,518108.45 186186.15,518108.477 186186.132,518114.5 186183.6,518114.65 186183.55,518114.85 186183.45,518115.05 186183.4,518115.25 186183.3,518115.35 186183.2,518115.45 186183.15,518141.85 186171.55,518142.0 186171.5,518142.15 186171.4,518142.45 186171.3,518142.6 186171.2,518142.7 186171.1,518142.8 186171.05,518142.9 186170.95,518143.05 186170.85,518143.15 186170.75,518143.25 186170.6,518143.4 186170.5,518143.5 186170.4,51814