Merge pull request #798 from colouring-cities/os-data-loading

Document & test Ordnance Survey data loading
Ed Chalstrey 2022-04-13 13:28:06 +01:00 committed by GitHub
commit a4771eaac0
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
22 changed files with 233 additions and 383 deletions

25
.github/workflows/etl.yml vendored Normal file

@ -0,0 +1,25 @@
name: etl
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
      with:
        python-version: '3.7'
    - name:
        Install dependencies
      run: |
        sudo apt-get install libgeos-dev
        python -m pip install --upgrade pip
        python -m pip install pytest
        python -m pip install flake8
        python -m pip install -r etl/requirements.txt
    - name: Run Flake8
      run: |
        ls etl/*py | grep -v 'join_building_data' | xargs flake8 --exclude etl/__init__.py
    - name: Run tests
      run: |
        python -m pytest

5
.gitignore vendored

@ -18,6 +18,11 @@ etl/**/*.txt
etl/**/*.xls etl/**/*.xls
etl/**/*.xlsx etl/**/*.xlsx
etl/**/*.zip etl/**/*.zip
etl/**/*.gml
etl/**/*.gz
etl/**/5690395*
postgresdata
*/__pycache__/*
.DS_Store .DS_Store


@ -49,7 +49,9 @@ ssh <linuxusername>@localhost -p 4022
- [:rainbow: Installing Colouring London](#rainbow-installing-colouring-london) - [:rainbow: Installing Colouring London](#rainbow-installing-colouring-london)
- [:arrow_down: Installing Node.js](#arrow_down-installing-nodejs) - [:arrow_down: Installing Node.js](#arrow_down-installing-nodejs)
- [:large_blue_circle: Configuring PostgreSQL](#large_blue_circle-configuring-postgresql) - [:large_blue_circle: Configuring PostgreSQL](#large_blue_circle-configuring-postgresql)
- [:space_invader: Create an empty database](#space_invader-create-an-empty-database)
- [:arrow_forward: Configuring Node.js](#arrow_forward-configuring-nodejs) - [:arrow_forward: Configuring Node.js](#arrow_forward-configuring-nodejs)
- [:snake: Set up Python](#snake-set-up-python)
- [:house: Loading the building data](#house-loading-the-building-data) - [:house: Loading the building data](#house-loading-the-building-data)
- [:computer: Running the application](#computer-running-the-application) - [:computer: Running the application](#computer-running-the-application)
- [:eyes: Viewing the application](#eyes-viewing-the-application) - [:eyes: Viewing the application](#eyes-viewing-the-application)
@ -66,7 +68,7 @@ sudo apt-get upgrade -y
Now install some essential tools. Now install some essential tools.
```bash ```bash
sudo apt-get install -y build-essential git wget curl sudo apt-get install -y build-essential git wget curl parallel rename
``` ```
### :red_circle: Installing PostgreSQL ### :red_circle: Installing PostgreSQL
@ -157,7 +159,7 @@ Ensure the `en_US` locale exists.
sudo locale-gen en_US.UTF-8 sudo locale-gen en_US.UTF-8
``` ```
Configure the database to listen on network connection. Configure postgres to listen on network connection.
```bash ```bash
sudo sed -i "s/#\?listen_address.*/listen_addresses '*'/" /etc/postgresql/12/main/postgresql.conf sudo sed -i "s/#\?listen_address.*/listen_addresses '*'/" /etc/postgresql/12/main/postgresql.conf
@ -189,6 +191,10 @@ If you intend to load the full CL database from a dump file into your dev enviro
</details><p></p> </details><p></p>
### :space_invader: Create an empty database
Now create an empty database configured with geo-spatial tools. The database name (`<colouringlondondb>`) is arbitrary.
Set environment variables, which will simplify running subsequent `psql` commands. Set environment variables, which will simplify running subsequent `psql` commands.
```bash ```bash
@ -198,7 +204,7 @@ export PGHOST=localhost
export PGDATABASE=<colouringlondondb> export PGDATABASE=<colouringlondondb>
``` ```
Create a colouring london database if none exists. The name (`<colouringlondondb>`) is arbitrary. Create the database.
```bash ```bash
sudo -u postgres psql -c "SELECT 1 FROM pg_database WHERE datname = '<colouringlondondb>';" | grep -q 1 || sudo -u postgres createdb -E UTF8 -T template0 --locale=en_US.utf8 -O <username> <colouringlondondb> sudo -u postgres psql -c "SELECT 1 FROM pg_database WHERE datname = '<colouringlondondb>';" | grep -q 1 || sudo -u postgres createdb -E UTF8 -T template0 --locale=en_US.utf8 -O <username> <colouringlondondb>
@ -228,10 +234,22 @@ cd ~/colouring-london/app
npm install npm install
``` ```
### :snake: Set up Python
Install python and related tools.
```bash
sudo apt-get install -y python3 python3-pip python3-dev python3-venv
```
## :house: Loading the building data ## :house: Loading the building data
There are several ways to create the Colouring London database in your environment. If you are just trying out the application, the simplest way is to use test data from OSM. Otherwise, follow one of the sets of instructions below to create the full database, either from scratch or from a previously made database (via a dump file).
To create the full database from scratch, follow [these instructions](../etl/README.md), otherwise choose one of the following:
<details> <details>
<summary> With a database dump </summary><p></p> <summary> Create database from dump </summary><p></p>
If you are a developer on the Colouring London project (or another Colouring Cities project), you may have a production database (or staging etc) that you wish to duplicate in your development environment. If you are a developer on the Colouring London project (or another Colouring Cities project), you may have a production database (or staging etc) that you wish to duplicate in your development environment.
@ -261,22 +279,16 @@ ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration;
</details> </details>
<details> <details>
<summary> With test data </summary><p></p> <summary> Create database with test data </summary><p></p>
This section shows how to load test buildings into the application from OpenStreetMap (OSM). This section shows how to load test buildings into the application from OpenStreetMap (OSM).
#### Set up Python #### Load OpenStreetMap test polygons
Install python and related tools. Create a virtual environment for python in the `etl` folder of your repository. In the following example we have named the virtual environment *colouringlondon* but it can have any name.
```bash
sudo apt-get install -y python3 python3-pip python3-dev python3-venv
```
Now set up a virtual environment for python. In the following example we have named the
virtual environment *colouringlondon* but it can have any name.
```bash ```bash
cd ~/colouring-london/etl
pyvenv colouringlondon pyvenv colouringlondon
``` ```
@ -293,18 +305,9 @@ pip install --upgrade pip
pip install --upgrade setuptools wheel pip install --upgrade setuptools wheel
``` ```
#### Load OpenStreetMap test polygons Install the required python packages.
First install prerequisites.
```bash
sudo apt-get install -y parallel
```
Install the required python packages. This relies on the `requirements.txt` file located
in the `etl` folder of your local repository.
```bash ```bash
cd ~/colouring-london/etl/
pip install -r requirements.txt pip install -r requirements.txt
``` ```


@ -1,91 +1,109 @@
# Data loading # Extract, transform and load
The scripts in this directory are used to extract, transform and load (ETL) the core datasets The scripts in this directory are used to extract, transform and load (ETL) the core datasets for Colouring London. This README acts as a guide for setting up the Colouring London database with these datasets and updating it.
for Colouring London:
1. Building geometries, sourced from Ordnance Survey MasterMap (Topography Layer) # Contents
1. Unique Property Reference Numbers (UPRNs), sourced from Ordnance Survey AddressBase
- :arrow_down: [Downloading Ordnance Survey data](#arrow_down-downloading-ordnance-survey-data)
- :penguin: [Making data available to Ubuntu](#penguin-making-data-available-to-ubuntu)
- :new_moon: [Creating a Colouring London database from scratch](#new_moon-creating-a-colouring-london-database-from-scratch)
- :full_moon: [Updating the Colouring London database with new OS data](#full_moon-updating-the-colouring-london-database-with-new-os-data)
# :arrow_down: Downloading Ordnance Survey data
The building geometries are sourced from Ordnance Survey (OS) MasterMap (Topography Layer). To get the required datasets, you'll need to complete the following steps:
1. Sign up for the Ordnance Survey [Data Exploration License](https://www.ordnancesurvey.co.uk/business-government/licensing-agreements/data-exploration-sign-up). You should receive an e-mail with a link to log in to the platform (this could take up to a week).
2. Navigate to https://orders.ordnancesurvey.co.uk/orders and click the button for: ✏️ Order. From here you should be able to click another button to add a product.
3. Drop a rectangle or Polygon over London and make the following selections, clicking the "Add to basket" button for each:
![](screenshot/MasterMap.png)
<p></p>
4. You should be then able to check out your basket and download the files. Note: there may be multiple `.zip` files to download for MasterMap due to the size of the dataset.
5. Unzip the MasterMap `.zip` files and move all the `.gz` files from each to a single folder in a convenient location. We will use this folder in later steps.
# :penguin: Making data available to Ubuntu
Before creating or updating a Colouring London database, you'll need to make sure the downloaded OS files are available to the Ubuntu machine where the database is hosted. If you are using VirtualBox, you could share the folder(s) containing the OS files with the VM (e.g. [see these instructions for Mac](https://medium.com/macoclock/share-folder-between-macos-and-ubuntu-4ce84fb5c1ad)).
# :new_moon: Creating a Colouring London database from scratch
## Prerequisites ## Prerequisites
Install PostgreSQL and create a database for colouringlondon, with a database You should already have set up PostgreSQL and created a database in an Ubuntu environment. Make sure to create environment variables to use `psql` if you haven't already:
user that can connect to it. The [PostgreSQL
documentation](https://www.postgresql.org/docs/12/tutorial-start.html) covers
installation and getting started.
Install the [PostGIS extension](https://postgis.net/). ```bash
export PGPASSWORD=<pgpassword>
Connect to the colouringlondon database and add the PostGIS, pgcrypto and export PGUSER=<username>
pg_trgm extensions: export PGHOST=localhost
export PGDATABASE=<colouringlondondb>
```sql
create extension postgis;
create extension pgcrypto;
create extension pg_trgm;
``` ```
Create the core database tables: Create the core database tables:
```bash ```bash
psql < ../migrations/001.core.up.sql cd ~/colouring-london
psql < migrations/001.core.up.sql
``` ```
There is some performance benefit to creating indexes after bulk loading data. There is some performance benefit to creating indexes after bulk loading data.
Otherwise, it's fine to run all the migrations at this point and skip the index Otherwise, it's fine to run all the migrations at this point and skip the index
creation steps below. creation steps below.
Install GNU parallel, this is used to speed up loading bulk data. You should already have installed GNU parallel, which is used to speed up loading bulk data.
## Processing and loading Ordnance Survey data
## Process and load Ordance Survey data Move into the `etl` directory and set execute permission on all scripts.
Before running any of these scripts, you will need the OS data for your area of
interest. AddressBase and MasterMap are available directly from [Ordnance
Survey](https://www.ordnancesurvey.co.uk/). The alternative setup below uses
OpenStreetMap.
The scripts should be run in the following order:
```bash ```bash
# extract both datasets cd ~/colouring-london/etl
extract_addressbase.sh ./addressbase_dir chmod +x *.sh
extract_mastermap.sh ./mastermap_dir
# filter mastermap ('building' polygons and any others referenced by addressbase)
filter_transform_mastermap_for_loading.sh ./addressbase_dir ./mastermap_dir
# load all building outlines
load_geometries.sh ./mastermap_dir
# index geometries (should be faster after loading)
psql < ../migrations/002.index-geometries.sql
# create a building record per outline
create_building_records.sh
# add UPRNs where they match
load_uprns.py ./addressbase_dir
# index building records
psql < ../migrations/003.index-buildings.sql
``` ```
## Alternative, using OpenStreetMap Extract the MasterMap data (this step could take a while).
This uses the [osmnx](https://github.com/gboeing/osmnx) python package to get OpenStreetMap data. You will need python and osmnx to run `get_test_polygons.py`. ```bash
sudo ./extract_mastermap.sh /path/to/mastermap_dir
To help test the Colouring London application, `get_test_polygons.py` will attempt to save a ```
small (1.5km²) extract from OpenStreetMap to a format suitable for loading to the database.
Filter MasterMap 'building' polygons.
In this case, run:
```bash
sudo ./filter_transform_mastermap_for_loading.sh /path/to/mastermap_dir
```
Load all building outlines. Note: you should ensure that `mastermap_dir` has permissions that will allow the linux `find` command to work without using sudo.
```bash
./load_geometries.sh /path/to/mastermap_dir
```
Index geometries.
```bash ```bash
# download test data
python get_test_polygons.py
# load all building outlines
./load_geometries.sh ./
# index geometries (should be faster after loading)
psql < ../migrations/002.index-geometries.up.sql psql < ../migrations/002.index-geometries.up.sql
# create a building record per outline
./create_building_records.sh
# index building records
psql < ../migrations/003.index-buildings.up.sql
``` ```
## Finally <!-- TODO: Drop outside limit. -->
<!-- ```bash
./drop_outside_limit.sh /path/to/boundary_file
``` -->
Create a building record per outline.
```bash
./create_building_records.sh
```
Run the remaining migrations in `../migrations` to create the rest of the database structure. Run the remaining migrations in `../migrations` to create the rest of the database structure.
```bash
ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;
```
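Reviewer note: the loop above relies on the shell glob listing the `*.up.sql` files in lexical order, so the numeric prefixes decide the order in which migrations run. A minimal Python sketch of the same idea follows. The function names `up_migrations` and `run_migrations` are hypothetical and not part of this PR, and the sketch assumes `psql` is on the `PATH` with the `PG*` environment variables already exported.

```python
import glob
import os
import subprocess


def up_migrations(migrations_dir):
    """Return the *.up.sql files in lexical order (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(migrations_dir, '*.up.sql')))


def run_migrations(migrations_dir):
    """Feed each migration to psql on stdin, one file at a time."""
    for path in up_migrations(migrations_dir):
        with open(path) as sql:
            # check=True raises on the first failing migration
            subprocess.run(['psql'], stdin=sql, check=True)
```

Unlike the shell loop, this sketch stops at the first migration that fails, because of `check=True`. Whether that is preferable here is a judgment call.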
# :full_moon: Updating the Colouring London database with new OS data
TODO: this section should explain how to update an existing db

1
etl/__init__.py Normal file

@ -0,0 +1 @@
from .filter_mastermap import filter_mastermap


@ -1,60 +0,0 @@
"""Check if AddressBase TOIDs will match MasterMap
"""
import csv
import glob
import os
import sys
from multiprocessing import Pool
csv.field_size_limit(sys.maxsize)
def main(ab_path, mm_path):
    ab_paths = sorted(glob.glob(os.path.join(ab_path, "*.gml.csv.filtered.csv")))
    mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
    try:
        assert len(ab_paths) == len(mm_paths)
    except AssertionError:
        print(ab_paths)
        print(mm_paths)
    zipped_paths = zip(ab_paths, mm_paths)
    # parallel map over tiles
    with Pool() as p:
        p.starmap(check, zipped_paths)
def check(ab_path, mm_path):
    tile = str(os.path.basename(ab_path)).split(".")[0]
    output_base = os.path.dirname(ab_path)
    ab_toids = set()
    mm_toids = set()
    with open(ab_path, 'r') as fh:
        r = csv.DictReader(fh)
        for line in r:
            ab_toids.add(line['toid'])
    with open(mm_path, 'r') as fh:
        r = csv.DictReader(fh)
        for line in r:
            mm_toids.add(line['fid'])
    missing = ab_toids - mm_toids
    print(tile, "MasterMap:", len(mm_toids), "Addressbase:", len(ab_toids), "AB but not MM:", len(missing))
    with open(os.path.join(output_base, 'missing_toids_{}.txt'.format(tile)), 'w') as fh:
        for toid in missing:
            fh.write("{}\n".format(toid))
    with open(os.path.join(output_base, 'ab_toids_{}.txt'.format(tile)), 'w') as fh:
        for toid in ab_toids:
            fh.write("{}\n".format(toid))
if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: check_ab_mm_match.py ./path/to/addressbase/dir ./path/to/mastermap/dir")
        exit(-1)
    main(sys.argv[1], sys.argv[2])


@ -1,63 +0,0 @@
#!/usr/bin/env bash
#
# Extract address points from OS Addressbase GML
# - as supplied in 5km tiles, zip/gz archives
#
: ${1?"Usage: $0 ./path/to/data/dir"}
data_dir=$1
#
# Unzip to GML
#
find $data_dir -type f -name '*.zip' -printf "%f\n" | \
parallel \
unzip -u $data_dir/{} -d $data_dir
#
# Extract to CSV
#
# Relevant fields:
# WKT
# crossReference (list of TOID/other references)
# source (list of cross-reference sources: 7666MT refers to MasterMap Topo)
# uprn
# parentUPRN
# logicalStatus: 1 (one) is approved (otherwise historical, provisional)
#
find $data_dir -type f -name '*.gml' -printf "%f\n" | \
parallel \
ogr2ogr -f CSV \
-select crossReference,source,uprn,parentUPRN,logicalStatus \
$data_dir/{}.csv $data_dir/{} BasicLandPropertyUnit \
-lco GEOMETRY=AS_WKT
#
# Filter
#
find $data_dir -type f -name '*.gml.csv' -printf "%f\n" | \
parallel \
python filter_addressbase_csv.py $data_dir/{}
#
# Transform to 3857 (web mercator)
#
find $data_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
parallel \
ogr2ogr \
-f CSV $data_dir/{}.3857.csv \
-s_srs "EPSG:4326" \
-t_srs "EPSG:3857" \
$data_dir/{} \
-lco GEOMETRY=AS_WKT
#
# Update to EWKT (with SRID indicator for loading to Postgres)
#
find $data_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \
cat $data_dir/{} "|" sed "'s/^\"POINT/\"SRID=3857;POINT/'" "|" cut -f 1,3,4,5 -d "','" ">" $data_dir/{}.loadable


@ -1,29 +1,29 @@
#!/usr/bin/env bash #!/usr/bin/env bash
#
# Extract MasterMap
#
: ${1?"Usage: $0 ./path/to/mastermap/dir"} : ${1?"Usage: $0 ./path/to/mastermap/dir"}
data_dir=$1 data_dir=$1
#
# Extract buildings from *.gz to CSV echo "Extract buildings from *.gz..."
#
# Features where:: # Features where::
# descriptiveGroup = '(1:Building)' # descriptiveGroup = '(1:Building)'
# #
# Use `fid` as source ID, aka TOID. # Use `fid` as source ID, aka TOID.
#
find $data_dir -type f -name '*.gz' -printf "%f\n" | \ find $data_dir -type f -name '*.gz' -printf "%f\n" | \
parallel \ parallel \
gunzip $data_dir/{} -k -S gml gunzip $data_dir/{} -k -S gml
echo "Rename extracted files to .gml..."
rename 's/$/.gml/' $data_dir/*[^gzvt] rename 's/$/.gml/' $data_dir/*[^gzvt]
find $data_dir -type f -name '*.gml' -printf "%f\n" | \ # Note: previously the rename cmd above resulted in some temp files being renamed to .gml
# so I have specified the start of the filename (appears to be consistent for all OS MasterMap downloads)
# we may need to update this below for other downloads
echo "Convert .gml files to .csv"
find $data_dir -type f -name '*5690395*.gml' -printf "%f\n" | \
parallel \ parallel \
ogr2ogr \ ogr2ogr \
-select fid,descriptiveGroup \ -select fid,descriptiveGroup \
@ -32,5 +32,6 @@ ogr2ogr \
TopographicArea \ TopographicArea \
-lco GEOMETRY=AS_WKT -lco GEOMETRY=AS_WKT
echo "Remove .gfs and .gml files from previous steps..."
rm $data_dir/*.gfs rm $data_dir/*.gfs
rm $data_dir/*.gml rm $data_dir/*.gml


@ -1,42 +0,0 @@
#!/usr/bin/env python
"""Read ogr2ogr-converted CSV, filter to get OSMM TOID reference, only active addresses
"""
import csv
import json
import sys
def main(input_path):
    output_path = "{}.filtered.csv".format(input_path)
    fieldnames = (
        'wkt', 'toid', 'uprn', 'parent_uprn'
    )
    with open(input_path) as input_fh:
        with open(output_path, 'w') as output_fh:
            w = csv.DictWriter(output_fh, fieldnames=fieldnames)
            w.writeheader()
            r = csv.DictReader(input_fh)
            for line in r:
                if line['logicalStatus'] != "1":
                    continue
                refs = json.loads(line['crossReference'])
                sources = json.loads(line['source'])
                toid = ""
                for ref, source in zip(refs, sources):
                    if source == "7666MT":
                        toid = ref
                w.writerow({
                    'uprn': line['uprn'],
                    'parent_uprn': line['parentUPRN'],
                    'toid': toid,
                    'wkt': line['WKT'],
                })
if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: filter_addressbase_csv.py ./path/to/data.csv")
        exit(-1)
    main(sys.argv[1])


@ -1,60 +1,44 @@
"""Filter MasterMap to buildings and addressbase-matches """Filter MasterMap to buildings
- WHERE descriptiveGroup includes 'Building' - WHERE descriptiveGroup includes 'Building'
- OR toid in addressbase_toids
""" """
import csv import csv
import glob import glob
import json
import os import os
import sys import sys
from multiprocessing import Pool
csv.field_size_limit(sys.maxsize) csv.field_size_limit(sys.maxsize)
def main(ab_path, mm_path):
mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
toid_paths = sorted(glob.glob(os.path.join(ab_path, "ab_toids_*.txt")))
try: def main(mastermap_path):
assert len(mm_paths) == len(toid_paths) mm_paths = sorted(glob.glob(os.path.join(mastermap_path, "*.gml.csv")))
except AssertionError: for mm_path in mm_paths:
print(mm_paths) filter_mastermap(mm_path)
print(toid_paths)
zipped_paths = zip(mm_paths, toid_paths)
# parallel map over tiles
with Pool() as p:
p.starmap(filter, zipped_paths)
def filter(mm_path, toid_path): def filter_mastermap(mm_path):
with open(toid_path, 'r') as fh: output_path = str(mm_path).replace(".gml.csv", "")
r = csv.reader(fh) output_path = "{}.filtered.csv".format(output_path)
toids = set(line[0] for line in r)
output_path = "{}.filtered.csv".format(str(mm_path).replace(".gml.csv", ""))
alt_output_path = "{}.filtered_not_building.csv".format(str(mm_path).replace(".gml.csv", ""))
output_fieldnames = ('WKT', 'fid', 'descriptiveGroup') output_fieldnames = ('WKT', 'fid', 'descriptiveGroup')
# Open the input csv with all polygons, buildings and others
with open(mm_path, 'r') as fh: with open(mm_path, 'r') as fh:
r = csv.DictReader(fh) r = csv.DictReader(fh)
# Open a new output csv that will contain just buildings
with open(output_path, 'w') as output_fh: with open(output_path, 'w') as output_fh:
w = csv.DictWriter(output_fh, fieldnames=output_fieldnames) w = csv.DictWriter(output_fh, fieldnames=output_fieldnames)
w.writeheader() w.writeheader()
with open(alt_output_path, 'w') as alt_output_fh: for line in r:
alt_w = csv.DictWriter(alt_output_fh, fieldnames=output_fieldnames) try:
alt_w.writeheader()
for line in r:
if 'Building' in line['descriptiveGroup']: if 'Building' in line['descriptiveGroup']:
w.writerow(line) w.writerow(line)
# when descriptiveGroup is missing, ignore this Polygon
elif line['fid'] in toids: except TypeError:
alt_w.writerow(line) pass
if __name__ == '__main__': if __name__ == '__main__':
if len(sys.argv) != 3: if len(sys.argv) != 2:
print("Usage: filter_mastermap.py ./path/to/addressbase/dir ./path/to/mastermap/dir") print("Usage: filter_mastermap.py ./path/to/mastermap/dir")
exit(-1) exit(-1)
main(sys.argv[1], sys.argv[2]) main(sys.argv[1])
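Reviewer note: the new `try`/`except TypeError` covers rows where `descriptiveGroup` is missing, because `csv.DictReader` fills absent trailing fields with `None` and `'Building' in None` raises. A minimal sketch of the same rule with an explicit `None` check is shown below. The helper name `filter_building_rows` and the sample rows (including the `osgb1` to `osgb3` ids) are made up for illustration and are not part of this PR.

```python
import csv
import io


def filter_building_rows(rows):
    """Yield only rows whose descriptiveGroup mentions 'Building'."""
    for row in rows:
        group = row.get('descriptiveGroup')
        # DictReader sets missing trailing fields to None; skip those rows
        if group is not None and 'Building' in group:
            yield row


# Hypothetical sample: one building, one non-building, one row with no type
sample = io.StringIO(
    'WKT,fid,descriptiveGroup\n'
    '"POLYGON ((0 0,1 0,1 1,0 0))",osgb1,"[ ""Building"" ]"\n'
    '"POLYGON ((0 0,2 0,2 2,0 0))",osgb2,"[ ""Road Or Track"" ]"\n'
    '"POLYGON ((0 0,3 0,3 3,0 0))",osgb3\n'
)
kept = [row['fid'] for row in filter_building_rows(csv.DictReader(sample))]
print(kept)
```

The explicit check behaves the same way as the `except TypeError` branch for missing values, but it would not silently swallow a `TypeError` raised elsewhere in the loop body.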


@ -1,29 +1,13 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# : ${1?"Usage: $0 ./path/to/mastermap/dir"}
# Filter and transform for loading
#
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
addressbase_dir=$1 mastermap_dir=$1
mastermap_dir=$2
# echo "Filter WHERE descriptiveGroup = '(1:Building)'... "
# Check which TOIDs are matched against UPRNs python filter_mastermap.py $mastermap_dir
#
python check_ab_mm_match.py $addressbase_dir $mastermap_dir
# echo "Transform to 3857 (web mercator)..."
# Filter
# - WHERE descriptiveGroup = '(1:Building)'
# - OR toid in addressbase_toids
#
python filter_mastermap.py $addressbase_dir $mastermap_dir
#
# Transform to 3857 (web mercator)
#
find $mastermap_dir -type f -name '*.filtered.csv' -printf "%f\n" | \ find $mastermap_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
parallel \ parallel \
ogr2ogr \ ogr2ogr \
@ -34,13 +18,13 @@ ogr2ogr \
$mastermap_dir/{} \ $mastermap_dir/{} \
-lco GEOMETRY=AS_WKT -lco GEOMETRY=AS_WKT
# echo "Update to EWKT (with SRID indicator for loading to Postgres)..."
# Update to EWKT (with SRID indicator for loading to Postgres) echo "Updating POLYGONs.."
#
find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \ find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \ parallel \
sed -i "'s/^\"POLYGON/\"SRID=3857;POLYGON/'" $mastermap_dir/{} sed -i "'s/^\"POLYGON/\"SRID=3857;POLYGON/'" $mastermap_dir/{}
echo "Updating MULTIPOLYGONs.."
find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \ find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \ parallel \
sed -i "'s/^\"MULTIPOLYGON/\"SRID=3857;MULTIPOLYGON/'" $mastermap_dir/{} sed -i "'s/^\"MULTIPOLYGON/\"SRID=3857;MULTIPOLYGON/'" $mastermap_dir/{}
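Reviewer note: the two `sed` passes above turn a leading quoted WKT geometry into EWKT by prepending the SRID. A minimal Python sketch of the same transformation, done in a single pass, is shown below. The function name `add_srid` is hypothetical, and the sketch assumes each data line starts with a double-quoted `POLYGON` or `MULTIPOLYGON`, as in the test CSVs in this PR.

```python
import re

# A quoted (MULTI)POLYGON at the very start of a CSV line
_WKT_START = re.compile(r'^"(MULTIPOLYGON|POLYGON)')


def add_srid(line, srid=3857):
    """Prefix a leading quoted (MULTI)POLYGON with an EWKT SRID marker."""
    return _WKT_START.sub(
        lambda m: '"SRID={};{}'.format(srid, m.group(1)), line
    )


print(add_srid('"POLYGON ((0 0,1 0,1 1,0 0))",osgb1'))
```

Because the pattern is anchored at the start of the line, header lines pass through unchanged, and applying the function a second time leaves an already prefixed line as it is.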


@ -25,11 +25,12 @@ gdf = osmnx.footprints_from_point(point=point, dist=dist)
# preview image # preview image
gdf_proj = osmnx.projection.project_gdf(gdf, to_crs={'init': 'epsg:3857'}) gdf_proj = osmnx.projection.project_gdf(gdf, to_crs={'init': 'epsg:3857'})
gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')] gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')] # noqa
fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333', color='w', figsize=(4,4), fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333',
save=True, show=False, close=True, color='w', figsize=(4, 4),
filename='test_buildings_preview', dpi=600) save=True, show=False, close=True,
filename='test_buildings_preview', dpi=600)
# save # save
test_dir = os.path.dirname(__file__) test_dir = os.path.dirname(__file__)
@ -50,7 +51,13 @@ gdf_to_save.rename(
# convert to CSV # convert to CSV
test_data_csv = str(os.path.join(test_dir, 'test_buildings.3857.csv')) test_data_csv = str(os.path.join(test_dir, 'test_buildings.3857.csv'))
subprocess.run(["rm", test_data_csv]) subprocess.run(["rm", test_data_csv])
subprocess.run(["ogr2ogr", "-f", "CSV", test_data_csv, test_data_geojson, "-lco", "GEOMETRY=AS_WKT"]) subprocess.run(
["ogr2ogr", "-f", "CSV", test_data_csv,
test_data_geojson, "-lco", "GEOMETRY=AS_WKT"]
)
# add SRID for ease of loading to PostgreSQL # add SRID for ease of loading to PostgreSQL
subprocess.run(["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/", test_data_csv]) subprocess.run(
["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/",
test_data_csv]
)


@ -1,27 +1,25 @@
#!/usr/bin/env bash #!/usr/bin/env bash
#
# Load geometries from GeoJSON to Postgres # Load geometries from GeoJSON to Postgres
# - assume postgres connection details are set in the environment using PGUSER, PGHOST etc. # - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
#
: ${1?"Usage: $0 ./path/to/mastermap/dir"} : ${1?"Usage: $0 ./path/to/mastermap/dir"}
mastermap_dir=$1 mastermap_dir=$1
#
# Create 'geometry' record with # Create 'geometry' record with
# id: <polygon-guid>, # id: <polygon-guid>,
# source_id: <toid>, # source_id: <toid>,
# geom: <geom> # geom: <geom>
#
echo "Copy geometries to db..."
find $mastermap_dir -type f -name '*.3857.csv' \ find $mastermap_dir -type f -name '*.3857.csv' \
-printf "$mastermap_dir/%f\n" | \ -printf "$mastermap_dir/%f\n" | \
parallel \ parallel \
cat {} '|' psql -c "\"COPY geometries ( geometry_geom, source_id ) FROM stdin WITH CSV HEADER;\"" cat {} '|' psql -c "\"COPY geometries ( geometry_geom, source_id ) FROM stdin WITH CSV HEADER;\""
#
# Delete any duplicated geometries (by TOID) # Delete any duplicated geometries (by TOID)
# echo "Delete duplicate geometries..."
psql -c "DELETE FROM geometries a USING ( psql -c "DELETE FROM geometries a USING (
SELECT MIN(ctid) as ctid, source_id SELECT MIN(ctid) as ctid, source_id
FROM geometries FROM geometries


@ -1,36 +0,0 @@
#!/usr/bin/env bash
#
# Load UPRNS from CSV to Postgres
# - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
#
: ${1?"Usage: $0 ./path/to/addressbase/dir"}
data_dir=$1
#
# Create 'building_properties' record with
# uprn: <uprn>,
# parent_uprn: <parent_uprn>,
# toid: <toid>,
# uprn_geom: <point>
#
find $data_dir -type f -name '*.3857.csv.loadable' \
-printf "$data_dir/%f\n" | \
parallel \
cat {} '|' psql -c "\"COPY building_properties ( uprn_geom, toid, uprn, parent_uprn ) FROM stdin WITH CSV HEADER;\""
#
# Create references
#
# index essential for speeed here
psql -c "CREATE INDEX IF NOT EXISTS building_toid_idx ON buildings ( ref_toid );"
# link to buildings
psql -c "UPDATE building_properties
SET building_id = (
SELECT b.building_id
FROM buildings as b
WHERE
building_properties.toid = b.ref_toid
);"


@ -3,13 +3,11 @@
# #
# Extract, transform and load building outlines and property records # Extract, transform and load building outlines and property records
# #
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"} : ${1?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"} : ${2?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}
: ${3?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
addressbase_dir=$1 mastermap_dir=$1
mastermap_dir=$2 boundary_file=$2
boundary_file=$3
script_dir=${0%/*} script_dir=${0%/*}
# #
@ -17,10 +15,9 @@ script_dir=${0%/*}
# #
# extract both datasets # extract both datasets
$script_dir/extract_addressbase.sh $addressbase_dir
$script_dir/extract_mastermap.sh $mastermap_dir $script_dir/extract_mastermap.sh $mastermap_dir
# filter mastermap ('building' polygons and any others referenced by addressbase) # filter mastermap ('building' polygons and any others referenced by addressbase)
$script_dir/filter_transform_mastermap_for_loading.sh $addressbase_dir $mastermap_dir $script_dir/filter_transform_mastermap_for_loading.sh $mastermap_dir
# #
# Load # Load
@ -33,7 +30,5 @@ psql < $script_dir/../migrations/002.index-geometries.up.sql
$script_dir/drop_outside_limit.sh $boundary_file $script_dir/drop_outside_limit.sh $boundary_file
# create a building record per outline # create a building record per outline
$script_dir/create_building_records.sh $script_dir/create_building_records.sh
# add UPRNs where they match # Run remaining migrations
$script_dir/load_uprns.sh $addressbase_dir ls $script_dir/../migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;
# index building records
psql < $script_dir/../migrations/003.index-buildings.up.sql


@ -3,11 +3,8 @@
# #
# Filter and transform for loading # Filter and transform for loading
# #
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"} : ${1?"Usage: $0 ./path/to/mastermap/dir"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
addressbase_dir=$1 mastermap_dir=$1
mastermap_dir=$2
rm -f $addressbase_dir/*.{csv,gml,txt,filtered,gfs}
rm -f $mastermap_dir/*.{csv,gml,txt,filtered,gfs} rm -f $mastermap_dir/*.{csv,gml,txt,filtered,gfs}

Binary file not shown (image, 38 KiB).

23
tests/test_filter.py Normal file

@ -0,0 +1,23 @@
import csv
import pytest
from etl import filter_mastermap
def test_filter_mastermap():
    """Test that MasterMap CSV can be correctly filtered to include only buildings."""
    input_file = "tests/test_mastermap.gml.csv"  # Test csv with two buildings and one non-building
    output_file = input_file.replace('gml', 'filtered')
    filter_mastermap(input_file)  # creates output_file
    with open(output_file, newline='') as csvfile:
        csv_array = list(csv.reader(csvfile))
    assert len(csv_array) == 3  # assert that length is 3 because just two building rows after header
def test_filter_mastermap_missing_descriptivegroup():
    """Test that MasterMap CSV can be correctly filtered when the polygon does not have a type specified."""
    input_file = "tests/test_mastermap_missing_descriptivegroup.gml.csv"  # Test csv with one building and one non-building
    output_file = input_file.replace('gml', 'filtered')
    filter_mastermap(input_file)  # creates output_file
    with open(output_file, newline='') as csvfile:
        csv_array = list(csv.reader(csvfile))
    assert len(csv_array) == 1  # assert that length is 1 because just header


@ -0,0 +1,3 @@
WKT,fid,descriptiveGroup
"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"


@ -0,0 +1,4 @@
WKT,fid,descriptiveGroup
"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
"POLYGON ((484703.76 184849.9,484703.46 184849.7,484703.26 184849.4,484703.06 184849.2,484702.86 184848.9,484702.76 184848.6,484702.66 184848.2,484702.66 184847.3,484702.76 184847.0,484702.96 184846.7,484703.06 184846.4,484703.36 184846.2,484703.56 184846.0,484704.16 184845.6,484704.46 184845.5,484705.46 184845.5,484706.06 184845.7,484706.26 184845.8,484706.76 184846.3,484706.96 184846.6,484707.16 184846.8,484707.26 184847.2,484707.36 184847.5,484707.36 184848.4,484707.26 184848.7,484707.16 184848.9,484706.76 184849.5,484706.46 184849.7,484706.26 184849.9,484705.66 184850.2,484704.66 184850.2,484703.76 184849.9))",osgb1000000152730957,"[ ""General Surface"" ]"
"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"


@ -0,0 +1 @@
WKT,fid,descriptiveGroup


@ -0,0 +1,2 @@
WKT,fid,descriptiveGroup
"POLYGON ((517896.1 186250.8,517891.7 186251.6,517891.1 186248.7,517890.75 186246.7,517890.65 186246.35,517890.45 186245.95,517890.25 186245.8,517889.95 186245.75,517889.65 186245.75,517878.3 186247.9,517874.61 186248.55,517872.9 186239.5,517873.4 186239.7,517873.95 186239.8,517874.25 186239.75,517874.65 186239.7,517875.05 186239.6,517878.35 186238.95,517889.1 186236.85,517892.769 186236.213,517903.2 186234.4,517919.55 186231.4,517932.25 186229.1,517942.1 186227.25,517954.65 186225.05,517968.75 186222.45,517985.25 186219.5,518000.0 186216.65,518021.7 186212.7,518026.7 186211.75,518029.1 186211.3,518029.68 186211.173,518033.65 186210.3,518046.1 186207.65,518058.45 186204.95,518063.3 186203.6,518068.1 186202.25,518068.9 186202.05,518079.6 186198.95,518081.4 186198.3,518083.2 186197.55,518084.95 186196.8,518086.7 186196.0,518088.45 186195.25,518097.85 186191.05,518099.15 186190.45,518108.3 186186.2,518108.375 186186.175,518108.45 186186.15,518108.477 186186.132,518114.5 186183.6,518114.65 186183.55,518114.85 186183.45,518115.05 186183.4,518115.25 186183.3,518115.35 186183.2,518115.45 186183.15,518141.85 186171.55,518142.0 186171.5,518142.15 186171.4,518142.45 186171.3,518142.6 186171.2,518142.7 186171.1,518142.8 186171.05,518142.9 186170.95,518143.05 186170.85,518143.15 186170.75,518143.25 186170.6,518143.4 186170.5,518143.5 186170.4,51814