Merge pull request #798 from colouring-cities/os-data-loading

Document & test Ordnance Survey data loading
Ed Chalstrey 2022-04-13 13:28:06 +01:00 committed by GitHub
commit a4771eaac0
22 changed files with 233 additions and 383 deletions

.github/workflows/etl.yml (new file, 25 lines)

@ -0,0 +1,25 @@
name: etl
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
      with:
        python-version: '3.7'
    - name: Install dependencies
      run: |
        sudo apt-get install libgeos-dev
        python -m pip install --upgrade pip
        python -m pip install pytest
        python -m pip install flake8
        python -m pip install -r etl/requirements.txt
    - name: Run Flake8
      run: |
        ls etl/*py | grep -v 'join_building_data' | xargs flake8 --exclude etl/__init__.py
    - name: Run tests
      run: |
        python -m pytest
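
To reproduce these checks locally before pushing, the same commands can be run from the repository root (assuming the `etl` requirements are installed):

```bash
ls etl/*py | grep -v 'join_building_data' | xargs flake8 --exclude etl/__init__.py
python -m pytest
```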

.gitignore (5 lines changed)

@ -18,6 +18,11 @@ etl/**/*.txt
etl/**/*.xls
etl/**/*.xlsx
etl/**/*.zip
etl/**/*.gml
etl/**/*.gz
etl/**/5690395*
postgresdata
*/__pycache__/*
.DS_Store


@ -49,7 +49,9 @@ ssh <linuxusername>@localhost -p 4022
- [:rainbow: Installing Colouring London](#rainbow-installing-colouring-london)
- [:arrow_down: Installing Node.js](#arrow_down-installing-nodejs)
- [:large_blue_circle: Configuring PostgreSQL](#large_blue_circle-configuring-postgresql)
- [:space_invader: Create an empty database](#space_invader-create-an-empty-database)
- [:arrow_forward: Configuring Node.js](#arrow_forward-configuring-nodejs)
- [:snake: Set up Python](#snake-set-up-python)
- [:house: Loading the building data](#house-loading-the-building-data)
- [:computer: Running the application](#computer-running-the-application)
- [:eyes: Viewing the application](#eyes-viewing-the-application)
@ -66,7 +68,7 @@ sudo apt-get upgrade -y
Now install some essential tools.
```bash
sudo apt-get install -y build-essential git wget curl
sudo apt-get install -y build-essential git wget curl parallel rename
```
### :red_circle: Installing PostgreSQL
@ -157,7 +159,7 @@ Ensure the `en_US` locale exists.
sudo locale-gen en_US.UTF-8
```
Configure the database to listen on network connection.
Configure postgres to listen for network connections.
```bash
sudo sed -i "s/#\?listen_address.*/listen_addresses '*'/" /etc/postgresql/12/main/postgresql.conf
@ -189,6 +191,10 @@ If you intend to load the full CL database from a dump file into your dev enviro
</details><p></p>
### :space_invader: Create an empty database
Now create an empty database configured with geo-spatial tools. The database name (`<colouringlondondb>`) is arbitrary.
Set environment variables, which will simplify running subsequent `psql` commands.
```bash
@ -198,7 +204,7 @@ export PGHOST=localhost
export PGDATABASE=<colouringlondondb>
```
Create a colouring london database if none exists. The name (`<colouringlondondb>`) is arbitrary.
Create the database.
```bash
sudo -u postgres psql -c "SELECT 1 FROM pg_database WHERE datname = '<colouringlondondb>';" | grep -q 1 || sudo -u postgres createdb -E UTF8 -T template0 --locale=en_US.utf8 -O <username> <colouringlondondb>
@ -228,10 +234,22 @@ cd ~/colouring-london/app
npm install
```
### :snake: Set up Python
Install python and related tools.
```bash
sudo apt-get install -y python3 python3-pip python3-dev python3-venv
```
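
Optionally confirm the interpreter is available (any recent Python 3 should work here; the CI workflow above pins 3.7):

```bash
python3 --version
```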
## :house: Loading the building data
There are several ways to create the Colouring London database in your environment. The simplest way, if you are just trying out the application, is to use test data from OSM; otherwise, follow one of the sets of instructions below to create the full database, either from scratch or from a previously made db (via a dump file).
To create the full database from scratch, follow [these instructions](../etl/README.md), otherwise choose one of the following:
<details>
<summary> With a database dump </summary><p></p>
<summary> Create database from dump </summary><p></p>
If you are a developer on the Colouring London project (or another Colouring Cities project), you may have a production database (or staging etc) that you wish to duplicate in your development environment.
@ -261,22 +279,16 @@ ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration;
</details>
<details>
<summary> With test data </summary><p></p>
<summary> Create database with test data </summary><p></p>
This section shows how to load test buildings into the application from OpenStreetMap (OSM).
#### Set up Python
#### Load OpenStreetMap test polygons
Install python and related tools.
```bash
sudo apt-get install -y python3 python3-pip python3-dev python3-venv
```
Now set up a virtual environment for python. In the following example we have named the
virtual environment *colouringlondon* but it can have any name.
Create a virtual environment for python in the `etl` folder of your repository. In the following example we have named the virtual environment *colouringlondon* but it can have any name.
```bash
cd ~/colouring-london/etl
pyvenv colouringlondon
```
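
Note: `pyvenv` is deprecated and has been removed from newer Python releases. If it is not available on your system, the built-in `venv` module is the usual equivalent (a sketch, assuming `python3` points at the version you want):

```bash
cd ~/colouring-london/etl
python3 -m venv colouringlondon
```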
@ -293,18 +305,9 @@ pip install --upgrade pip
pip install --upgrade setuptools wheel
```
#### Load OpenStreetMap test polygons
First install prerequisites.
```bash
sudo apt-get install -y parallel
```
Install the required python packages. This relies on the `requirements.txt` file located
in the `etl` folder of your local repository.
Install the required python packages.
```bash
cd ~/colouring-london/etl/
pip install -r requirements.txt
```
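
As an optional sanity check (assuming the *colouringlondon* environment is active), confirm that a key package such as `osmnx` imports cleanly:

```bash
python -c "import osmnx; print(osmnx.__version__)"
```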


@ -1,91 +1,109 @@
# Data loading
# Extract, transform and load
The scripts in this directory are used to extract, transform and load (ETL) the core datasets
for Colouring London:
The scripts in this directory are used to extract, transform and load (ETL) the core datasets for Colouring London. This README acts as a guide for setting up the Colouring London database with these datasets and updating it.
1. Building geometries, sourced from Ordnance Survey MasterMap (Topography Layer)
1. Unique Property Reference Numbers (UPRNs), sourced from Ordnance Survey AddressBase
# Contents
- :arrow_down: [Downloading Ordnance Survey data](#arrow_down-downloading-ordnance-survey-data)
- :penguin: [Making data available to Ubuntu](#penguin-making-data-available-to-ubuntu)
- :new_moon: [Creating a Colouring London database from scratch](#new_moon-creating-a-colouring-london-database-from-scratch)
- :full_moon: [Updating the Colouring London database with new OS data](#full_moon-updating-the-colouring-london-database-with-new-os-data)
# :arrow_down: Downloading Ordnance Survey data
The building geometries are sourced from Ordnance Survey (OS) MasterMap (Topography Layer). To get the required datasets, you'll need to complete the following steps:
1. Sign up for the Ordnance Survey [Data Exploration License](https://www.ordnancesurvey.co.uk/business-government/licensing-agreements/data-exploration-sign-up). You should receive an e-mail with a link to log in to the platform (this could take up to a week).
2. Navigate to https://orders.ordnancesurvey.co.uk/orders and click the button for: ✏️ Order. From here you should be able to click another button to add a product.
3. Drop a rectangle or polygon over London and make the following selections, clicking the "Add to basket" button for each:
![](screenshot/MasterMap.png)
<p></p>
4. You should then be able to check out your basket and download the files. Note: there may be multiple `.zip` files to download for MasterMap due to the size of the dataset.
5. Unzip the MasterMap `.zip` files and move all the `.gz` files from each to a single folder in a convenient location (see the sketch below). We will use this folder in later steps.
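
A minimal sketch of that last step, assuming the `.zip` files were saved to `~/Downloads` and using `~/mastermap_data` as the collection folder (both paths are illustrative):

```bash
mkdir -p ~/mastermap_data
for z in ~/Downloads/*.zip; do
    # -j flattens paths so all .gz files land in one folder
    unzip -j -o "$z" '*.gz' -d ~/mastermap_data
done
```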
# :penguin: Making data available to Ubuntu
Before creating or updating a Colouring London database, you'll need to make sure the downloaded OS files are available to the Ubuntu machine where the database is hosted. If you are using VirtualBox, you could share the folder(s) containing the OS files with the VM (e.g. [see these instructions for Mac](https://medium.com/macoclock/share-folder-between-macos-and-ubuntu-4ce84fb5c1ad)).
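
If the shared folder does not appear automatically in the VM, a manual mount along these lines usually works (assuming Guest Additions are installed and the share is named `os_data`; both names here are illustrative):

```bash
sudo mkdir -p /media/os_data
sudo mount -t vboxsf -o uid=$(id -u),gid=$(id -g) os_data /media/os_data
```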
# :new_moon: Creating a Colouring London database from scratch
## Prerequisites
Install PostgreSQL and create a database for colouringlondon, with a database
user that can connect to it. The [PostgreSQL
documentation](https://www.postgresql.org/docs/12/tutorial-start.html) covers
installation and getting started.
You should already have set up PostgreSQL and created a database in an Ubuntu environment. Make sure to set the environment variables for `psql` if you haven't already:
Install the [PostGIS extension](https://postgis.net/).
Connect to the colouringlondon database and add the PostGIS, pgcrypto and
pg_trgm extensions:
```sql
create extension postgis;
create extension pgcrypto;
create extension pg_trgm;
```bash
export PGPASSWORD=<pgpassword>
export PGUSER=<username>
export PGHOST=localhost
export PGDATABASE=<colouringlondondb>
```
Create the core database tables:
```bash
psql < ../migrations/001.core.up.sql
cd ~/colouring-london
psql < migrations/001.core.up.sql
```
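
To confirm the core tables were created, you could list them with a `psql` meta-command (an optional check, not part of the original steps):

```bash
psql -c "\dt"
```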
There is some performance benefit to creating indexes after bulk loading data.
Otherwise, it's fine to run all the migrations at this point and skip the index
creation steps below.
Install GNU parallel, this is used to speed up loading bulk data.
You should already have installed GNU parallel, which is used to speed up loading bulk data.
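
If it is missing, it can be installed as in the main setup guide:

```bash
sudo apt-get install -y parallel
```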
## Processing and loading Ordnance Survey data
## Process and load Ordnance Survey data
Before running any of these scripts, you will need the OS data for your area of
interest. AddressBase and MasterMap are available directly from [Ordnance
Survey](https://www.ordnancesurvey.co.uk/). The alternative setup below uses
OpenStreetMap.
The scripts should be run in the following order:
Move into the `etl` directory and set execute permission on all scripts.
```bash
# extract both datasets
extract_addressbase.sh ./addressbase_dir
extract_mastermap.sh ./mastermap_dir
# filter mastermap ('building' polygons and any others referenced by addressbase)
filter_transform_mastermap_for_loading.sh ./addressbase_dir ./mastermap_dir
# load all building outlines
load_geometries.sh ./mastermap_dir
# index geometries (should be faster after loading)
psql < ../migrations/002.index-geometries.sql
# create a building record per outline
create_building_records.sh
# add UPRNs where they match
load_uprns.py ./addressbase_dir
# index building records
psql < ../migrations/003.index-buildings.sql
cd ~/colouring-london/etl
chmod +x *.sh
```
## Alternative, using OpenStreetMap
This uses the [osmnx](https://github.com/gboeing/osmnx) python package to get OpenStreetMap data. You will need python and osmnx to run `get_test_polygons.py`.
To help test the Colouring London application, `get_test_polygons.py` will attempt to save a
small (1.5km²) extract from OpenStreetMap to a format suitable for loading to the database.
In this case, run:
Extract the MasterMap data (this step could take a while).
```bash
sudo ./extract_mastermap.sh /path/to/mastermap_dir
```
Filter MasterMap 'building' polygons.
```bash
sudo ./filter_transform_mastermap_for_loading.sh /path/to/mastermap_dir
```
Load all building outlines. Note: you should ensure that `mastermap_dir` has permissions that will allow the Linux `find` command to work without using sudo (see the sketch below).
```bash
./load_geometries.sh /path/to/mastermap_dir
```
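
One way to ensure that (a sketch; adjust the path to your own data folder) is to grant read and directory-traverse permissions recursively before loading:

```bash
sudo chmod -R a+rX /path/to/mastermap_dir
```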
Index geometries.
```bash
# download test data
python get_test_polygons.py
# load all building outlines
./load_geometries.sh ./
# index geometries (should be faster after loading)
psql < ../migrations/002.index-geometries.up.sql
# create a building record per outline
./create_building_records.sh
# index building records
psql < ../migrations/003.index-buildings.up.sql
```
## Finally
<!-- TODO: Drop outside limit. -->
<!-- ```bash
./drop_outside_limit.sh /path/to/boundary_file
```` -->
Create a building record per outline.
```bash
./create_building_records.sh
```
Run the remaining migrations in `../migrations` to create the rest of the database structure.
```bash
ls ~/colouring-london/migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;
```
# :full_moon: Updating the Colouring London database with new OS data
TODO: this section should explain how to update an existing db

etl/__init__.py (new file, 1 line)

@ -0,0 +1 @@
from .filter_mastermap import filter_mastermap


@ -1,60 +0,0 @@
"""Check if AddressBase TOIDs will match MasterMap
"""
import csv
import glob
import os
import sys
from multiprocessing import Pool
csv.field_size_limit(sys.maxsize)
def main(ab_path, mm_path):
ab_paths = sorted(glob.glob(os.path.join(ab_path, "*.gml.csv.filtered.csv")))
mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
try:
assert len(ab_paths) == len(mm_paths)
except AssertionError:
print(ab_paths)
print(mm_paths)
zipped_paths = zip(ab_paths, mm_paths)
# parallel map over tiles
with Pool() as p:
p.starmap(check, zipped_paths)
def check(ab_path, mm_path):
tile = str(os.path.basename(ab_path)).split(".")[0]
output_base = os.path.dirname(ab_path)
ab_toids = set()
mm_toids = set()
with open(ab_path, 'r') as fh:
r = csv.DictReader(fh)
for line in r:
ab_toids.add(line['toid'])
with open(mm_path, 'r') as fh:
r = csv.DictReader(fh)
for line in r:
mm_toids.add(line['fid'])
missing = ab_toids - mm_toids
print(tile, "MasterMap:", len(mm_toids), "Addressbase:", len(ab_toids), "AB but not MM:", len(missing))
with open(os.path.join(output_base, 'missing_toids_{}.txt'.format(tile)), 'w') as fh:
for toid in missing:
fh.write("{}\n".format(toid))
with open(os.path.join(output_base, 'ab_toids_{}.txt'.format(tile)), 'w') as fh:
for toid in ab_toids:
fh.write("{}\n".format(toid))
if __name__ == '__main__':
if len(sys.argv) != 3:
print("Usage: check_ab_mm_match.py ./path/to/addressbase/dir ./path/to/mastermap/dir")
exit(-1)
main(sys.argv[1], sys.argv[2])


@ -1,63 +0,0 @@
#!/usr/bin/env bash
#
# Extract address points from OS Addressbase GML
# - as supplied in 5km tiles, zip/gz archives
#
: ${1?"Usage: $0 ./path/to/data/dir"}
data_dir=$1
#
# Unzip to GML
#
find $data_dir -type f -name '*.zip' -printf "%f\n" | \
parallel \
unzip -u $data_dir/{} -d $data_dir
#
# Extract to CSV
#
# Relevant fields:
# WKT
# crossReference (list of TOID/other references)
# source (list of cross-reference sources: 7666MT refers to MasterMap Topo)
# uprn
# parentUPRN
# logicalStatus: 1 (one) is approved (otherwise historical, provisional)
#
find $data_dir -type f -name '*.gml' -printf "%f\n" | \
parallel \
ogr2ogr -f CSV \
-select crossReference,source,uprn,parentUPRN,logicalStatus \
$data_dir/{}.csv $data_dir/{} BasicLandPropertyUnit \
-lco GEOMETRY=AS_WKT
#
# Filter
#
find $data_dir -type f -name '*.gml.csv' -printf "%f\n" | \
parallel \
python filter_addressbase_csv.py $data_dir/{}
#
# Transform to 3857 (web mercator)
#
find $data_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
parallel \
ogr2ogr \
-f CSV $data_dir/{}.3857.csv \
-s_srs "EPSG:4326" \
-t_srs "EPSG:3857" \
$data_dir/{} \
-lco GEOMETRY=AS_WKT
#
# Update to EWKT (with SRID indicator for loading to Postgres)
#
find $data_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \
cat $data_dir/{} "|" sed "'s/^\"POINT/\"SRID=3857;POINT/'" "|" cut -f 1,3,4,5 -d "','" ">" $data_dir/{}.loadable


@ -1,29 +1,29 @@
#!/usr/bin/env bash
#
# Extract MasterMap
#
: ${1?"Usage: $0 ./path/to/mastermap/dir"}
data_dir=$1
#
# Extract buildings from *.gz to CSV
#
echo "Extract buildings from *.gz..."
# Features where:
# descriptiveGroup = '(1:Building)'
#
# Use `fid` as source ID, aka TOID.
#
find $data_dir -type f -name '*.gz' -printf "%f\n" | \
parallel \
gunzip $data_dir/{} -k -S gml
echo "Rename extracted files to .gml..."
rename 's/$/.gml/' $data_dir/*[^gzvt]
find $data_dir -type f -name '*.gml' -printf "%f\n" | \
# Note: previously the rename cmd above resulted in some temp files being renamed to .gml
# so I have specified the start of the filename (appears to be consistent for all OS MasterMap downloads)
# we may need to update this below for other downloads
echo "Covert .gml files to .csv"
find $data_dir -type f -name '*5690395*.gml' -printf "%f\n" | \
parallel \
ogr2ogr \
-select fid,descriptiveGroup \
@ -32,5 +32,6 @@ ogr2ogr \
TopographicArea \
-lco GEOMETRY=AS_WKT
echo "Remove .gfs and .gml files from previous steps..."
rm $data_dir/*.gfs
rm $data_dir/*.gml


@ -1,42 +0,0 @@
#!/usr/bin/env python
"""Read ogr2ogr-converted CSV, filter to get OSMM TOID reference, only active addresses
"""
import csv
import json
import sys


def main(input_path):
    output_path = "{}.filtered.csv".format(input_path)
    fieldnames = (
        'wkt', 'toid', 'uprn', 'parent_uprn'
    )
    with open(input_path) as input_fh:
        with open(output_path, 'w') as output_fh:
            w = csv.DictWriter(output_fh, fieldnames=fieldnames)
            w.writeheader()
            r = csv.DictReader(input_fh)
            for line in r:
                if line['logicalStatus'] != "1":
                    continue
                refs = json.loads(line['crossReference'])
                sources = json.loads(line['source'])
                toid = ""
                for ref, source in zip(refs, sources):
                    if source == "7666MT":
                        toid = ref
                w.writerow({
                    'uprn': line['uprn'],
                    'parent_uprn': line['parentUPRN'],
                    'toid': toid,
                    'wkt': line['WKT'],
                })


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: filter_addressbase_csv.py ./path/to/data.csv")
        exit(-1)
    main(sys.argv[1])


@ -1,60 +1,44 @@
"""Filter MasterMap to buildings and addressbase-matches
"""Filter MasterMap to buildings
- WHERE descriptiveGroup includes 'Building'
- OR toid in addressbase_toids
"""
import csv
import glob
import json
import os
import sys
from multiprocessing import Pool
csv.field_size_limit(sys.maxsize)
def main(ab_path, mm_path):
mm_paths = sorted(glob.glob(os.path.join(mm_path, "*.gml.csv")))
toid_paths = sorted(glob.glob(os.path.join(ab_path, "ab_toids_*.txt")))
try:
assert len(mm_paths) == len(toid_paths)
except AssertionError:
print(mm_paths)
print(toid_paths)
zipped_paths = zip(mm_paths, toid_paths)
def main(mastermap_path):
mm_paths = sorted(glob.glob(os.path.join(mastermap_path, "*.gml.csv")))
for mm_path in mm_paths:
filter_mastermap(mm_path)
# parallel map over tiles
with Pool() as p:
p.starmap(filter, zipped_paths)
def filter(mm_path, toid_path):
with open(toid_path, 'r') as fh:
r = csv.reader(fh)
toids = set(line[0] for line in r)
output_path = "{}.filtered.csv".format(str(mm_path).replace(".gml.csv", ""))
alt_output_path = "{}.filtered_not_building.csv".format(str(mm_path).replace(".gml.csv", ""))
def filter_mastermap(mm_path):
output_path = str(mm_path).replace(".gml.csv", "")
output_path = "{}.filtered.csv".format(output_path)
output_fieldnames = ('WKT', 'fid', 'descriptiveGroup')
# Open the input csv with all polygons, buildings and others
with open(mm_path, 'r') as fh:
r = csv.DictReader(fh)
# Open a new output csv that will contain just buildings
with open(output_path, 'w') as output_fh:
w = csv.DictWriter(output_fh, fieldnames=output_fieldnames)
w.writeheader()
with open(alt_output_path, 'w') as alt_output_fh:
alt_w = csv.DictWriter(alt_output_fh, fieldnames=output_fieldnames)
alt_w.writeheader()
for line in r:
try:
if 'Building' in line['descriptiveGroup']:
w.writerow(line)
elif line['fid'] in toids:
alt_w.writerow(line)
# when descriptiveGroup is missing, ignore this Polygon
except TypeError:
pass
if __name__ == '__main__':
if len(sys.argv) != 3:
print("Usage: filter_mastermap.py ./path/to/addressbase/dir ./path/to/mastermap/dir")
if len(sys.argv) != 2:
print("Usage: filter_mastermap.py ./path/to/mastermap/dir")
exit(-1)
main(sys.argv[1], sys.argv[2])
main(sys.argv[1])


@ -1,29 +1,13 @@
#!/usr/bin/env bash
#
# Filter and transform for loading
#
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
: ${1?"Usage: $0 ./path/to/mastermap/dir"}
addressbase_dir=$1
mastermap_dir=$2
mastermap_dir=$1
#
# Check which TOIDs are matched against UPRNs
#
python check_ab_mm_match.py $addressbase_dir $mastermap_dir
echo "Filter WHERE descriptiveGroup = '(1:Building)'... "
python filter_mastermap.py $mastermap_dir
#
# Filter
# - WHERE descriptiveGroup = '(1:Building)'
# - OR toid in addressbase_toids
#
python filter_mastermap.py $addressbase_dir $mastermap_dir
#
# Transform to 3857 (web mercator)
#
echo "Transform to 3857 (web mercator)..."
find $mastermap_dir -type f -name '*.filtered.csv' -printf "%f\n" | \
parallel \
ogr2ogr \
@ -34,13 +18,13 @@ ogr2ogr \
$mastermap_dir/{} \
-lco GEOMETRY=AS_WKT
#
# Update to EWKT (with SRID indicator for loading to Postgres)
#
echo "Update to EWKT (with SRID indicator for loading to Postgres)..."
echo "Updating POLYGONs.."
find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \
sed -i "'s/^\"POLYGON/\"SRID=3857;POLYGON/'" $mastermap_dir/{}
echo "Updating MULTIPOLYGONs.."
find $mastermap_dir -type f -name '*.3857.csv' -printf "%f\n" | \
parallel \
sed -i "'s/^\"MULTIPOLYGON/\"SRID=3857;MULTIPOLYGON/'" $mastermap_dir/{}


@ -25,9 +25,10 @@ gdf = osmnx.footprints_from_point(point=point, dist=dist)
# preview image
gdf_proj = osmnx.projection.project_gdf(gdf, to_crs={'init': 'epsg:3857'})
gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')]
gdf_proj = gdf_proj[gdf_proj.geometry.apply(lambda g: g.geom_type != 'MultiPolygon')] # noqa
fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333', color='w', figsize=(4,4),
fig, ax = osmnx.plot_footprints(gdf_proj, bgcolor='#333333',
color='w', figsize=(4, 4),
save=True, show=False, close=True,
filename='test_buildings_preview', dpi=600)
@ -50,7 +51,13 @@ gdf_to_save.rename(
# convert to CSV
test_data_csv = str(os.path.join(test_dir, 'test_buildings.3857.csv'))
subprocess.run(["rm", test_data_csv])
subprocess.run(["ogr2ogr", "-f", "CSV", test_data_csv, test_data_geojson, "-lco", "GEOMETRY=AS_WKT"])
subprocess.run(
["ogr2ogr", "-f", "CSV", test_data_csv,
test_data_geojson, "-lco", "GEOMETRY=AS_WKT"]
)
# add SRID for ease of loading to PostgreSQL
subprocess.run(["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/", test_data_csv])
subprocess.run(
["sed", "-i", "s/^\"POLYGON/\"SRID=3857;POLYGON/",
test_data_csv]
)


@ -1,27 +1,25 @@
#!/usr/bin/env bash
#
# Load geometries from GeoJSON to Postgres
# - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
#
: ${1?"Usage: $0 ./path/to/mastermap/dir"}
mastermap_dir=$1
#
# Create 'geometry' record with
# id: <polygon-guid>,
# source_id: <toid>,
# geom: <geom>
#
echo "Copy geometries to db..."
find $mastermap_dir -type f -name '*.3857.csv' \
-printf "$mastermap_dir/%f\n" | \
parallel \
cat {} '|' psql -c "\"COPY geometries ( geometry_geom, source_id ) FROM stdin WITH CSV HEADER;\""
#
# Delete any duplicated geometries (by TOID)
#
echo "Delete duplicate geometries..."
psql -c "DELETE FROM geometries a USING (
SELECT MIN(ctid) as ctid, source_id
FROM geometries


@ -1,36 +0,0 @@
#!/usr/bin/env bash
#
# Load UPRNS from CSV to Postgres
# - assume postgres connection details are set in the environment using PGUSER, PGHOST etc.
#
: ${1?"Usage: $0 ./path/to/addressbase/dir"}
data_dir=$1
#
# Create 'building_properties' record with
# uprn: <uprn>,
# parent_uprn: <parent_uprn>,
# toid: <toid>,
# uprn_geom: <point>
#
find $data_dir -type f -name '*.3857.csv.loadable' \
-printf "$data_dir/%f\n" | \
parallel \
cat {} '|' psql -c "\"COPY building_properties ( uprn_geom, toid, uprn, parent_uprn ) FROM stdin WITH CSV HEADER;\""
#
# Create references
#
# index essential for speed here
psql -c "CREATE INDEX IF NOT EXISTS building_toid_idx ON buildings ( ref_toid );"
# link to buildings
psql -c "UPDATE building_properties
SET building_id = (
SELECT b.building_id
FROM buildings as b
WHERE
building_properties.toid = b.ref_toid
);"


@ -3,13 +3,11 @@
#
# Extract, transform and load building outlines and property records
#
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
: ${3?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir ./path/to/boundary"}
: ${1?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}
: ${2?"Usage: $0 ./path/to/mastermap/dir ./path/to/boundary"}
addressbase_dir=$1
mastermap_dir=$2
boundary_file=$3
mastermap_dir=$1
boundary_file=$2
script_dir=${0%/*}
#
@ -17,10 +15,9 @@ script_dir=${0%/*}
#
# extract both datasets
$script_dir/extract_addressbase.sh $addressbase_dir
$script_dir/extract_mastermap.sh $mastermap_dir
# filter mastermap ('building' polygons and any others referenced by addressbase)
$script_dir/filter_transform_mastermap_for_loading.sh $addressbase_dir $mastermap_dir
$script_dir/filter_transform_mastermap_for_loading.sh $mastermap_dir
#
# Load
@ -33,7 +30,5 @@ psql < $script_dir/../migrations/002.index-geometries.up.sql
$script_dir/drop_outside_limit.sh $boundary_file
# create a building record per outline
$script_dir/create_building_records.sh
# add UPRNs where they match
$script_dir/load_uprns.sh $addressbase_dir
# index building records
psql < $script_dir/../migrations/003.index-buildings.up.sql
# Run remaining migrations
ls $script_dir/../migrations/*.up.sql 2>/dev/null | while read -r migration; do psql < $migration; done;


@ -3,11 +3,8 @@
#
# Filter and transform for loading
#
: ${1?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
: ${2?"Usage: $0 ./path/to/addressbase/dir ./path/to/mastermap/dir"}
: ${1?"Usage: $0 ./path/to/mastermap/dir"}
addressbase_dir=$1
mastermap_dir=$2
mastermap_dir=$1
rm -f $addressbase_dir/*.{csv,gml,txt,filtered,gfs}
rm -f $mastermap_dir/*.{csv,gml,txt,filtered,gfs}

Binary file not shown (new image, 38 KiB).

tests/test_filter.py (new file, 23 lines)

@ -0,0 +1,23 @@
import csv

import pytest

from etl import filter_mastermap


def test_filter_mastermap():
    """Test that MasterMap CSV can be correctly filtered to include only buildings."""
    input_file = "tests/test_mastermap.gml.csv"  # test csv with two buildings and one non-building
    output_file = input_file.replace('gml', 'filtered')
    filter_mastermap(input_file)  # creates output_file
    with open(output_file, newline='') as csvfile:
        csv_array = list(csv.reader(csvfile))
    assert len(csv_array) == 3  # header plus the two building rows


def test_filter_mastermap_missing_descriptivegroup():
    """Test that MasterMap CSV can be correctly filtered when the polygon does not have a type specified."""
    input_file = "tests/test_mastermap_missing_descriptivegroup.gml.csv"  # test csv with one polygon missing its descriptiveGroup
    output_file = input_file.replace('gml', 'filtered')
    filter_mastermap(input_file)  # creates output_file
    with open(output_file, newline='') as csvfile:
        csv_array = list(csv.reader(csvfile))
    assert len(csv_array) == 1  # only the header row remains
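
These are the tests run by the new `etl` GitHub Actions workflow; they can also be run locally from the repository root:

```bash
python -m pytest
```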


@ -0,0 +1,3 @@
WKT,fid,descriptiveGroup
"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"


@ -0,0 +1,4 @@
WKT,fid,descriptiveGroup
"POLYGON ((484704.003 184721.2,484691.62 184729.971,484688.251 184725.214,484700.633 184716.443,484704.003 184721.2))",osgb5000005129953843,"[ ""Building"" ]"
"POLYGON ((484703.76 184849.9,484703.46 184849.7,484703.26 184849.4,484703.06 184849.2,484702.86 184848.9,484702.76 184848.6,484702.66 184848.2,484702.66 184847.3,484702.76 184847.0,484702.96 184846.7,484703.06 184846.4,484703.36 184846.2,484703.56 184846.0,484704.16 184845.6,484704.46 184845.5,484705.46 184845.5,484706.06 184845.7,484706.26 184845.8,484706.76 184846.3,484706.96 184846.6,484707.16 184846.8,484707.26 184847.2,484707.36 184847.5,484707.36 184848.4,484707.26 184848.7,484707.16 184848.9,484706.76 184849.5,484706.46 184849.7,484706.26 184849.9,484705.66 184850.2,484704.66 184850.2,484703.76 184849.9))",osgb1000000152730957,"[ ""General Surface"" ]"
"POLYGON ((530022.138 177486.29,530043.695 177498.235,530043.074 177499.355,530042.435 177500.509,530005.349 177480.086,529978.502 177463.333,529968.87 177457.322,529968.446 177457.057,529968.199 177455.714,529968.16 177455.504,529966.658 177454.566,529958.613 177449.543,529956.624 177448.301,529956.62 177448.294,529956.08 177447.4,529954.238 177444.351,529953.197 177442.624,529953.186 177442.609,529950.768 177438.606,529950.454 177438.086,529949.47 177434.209,529950.212 177434.038,529954.216 177433.114,529955.098 177437.457,529952.714 177437.98,529953.55 177441.646,529953.842 177442.008,529957.116 177446.059,529957.449 177446.471,529968.508 177453.375,529974.457 177451.966,529976.183 177458.937,530003.157 177475.772,530020.651 177485.466,530021.257 177484.372,530022.744 177485.196,530022.138 177486.29))",osgb5000005283023887,"[ ""Building"" ]"


@ -0,0 +1 @@
WKT,fid,descriptiveGroup


@ -0,0 +1,2 @@
WKT,fid,descriptiveGroup
"POLYGON ((517896.1 186250.8,517891.7 186251.6,517891.1 186248.7,517890.75 186246.7,517890.65 186246.35,517890.45 186245.95,517890.25 186245.8,517889.95 186245.75,517889.65 186245.75,517878.3 186247.9,517874.61 186248.55,517872.9 186239.5,517873.4 186239.7,517873.95 186239.8,517874.25 186239.75,517874.65 186239.7,517875.05 186239.6,517878.35 186238.95,517889.1 186236.85,517892.769 186236.213,517903.2 186234.4,517919.55 186231.4,517932.25 186229.1,517942.1 186227.25,517954.65 186225.05,517968.75 186222.45,517985.25 186219.5,518000.0 186216.65,518021.7 186212.7,518026.7 186211.75,518029.1 186211.3,518029.68 186211.173,518033.65 186210.3,518046.1 186207.65,518058.45 186204.95,518063.3 186203.6,518068.1 186202.25,518068.9 186202.05,518079.6 186198.95,518081.4 186198.3,518083.2 186197.55,518084.95 186196.8,518086.7 186196.0,518088.45 186195.25,518097.85 186191.05,518099.15 186190.45,518108.3 186186.2,518108.375 186186.175,518108.45 186186.15,518108.477 186186.132,518114.5 186183.6,518114.65 186183.55,518114.85 186183.45,518115.05 186183.4,518115.25 186183.3,518115.35 186183.2,518115.45 186183.15,518141.85 186171.55,518142.0 186171.5,518142.15 186171.4,518142.45 186171.3,518142.6 186171.2,518142.7 186171.1,518142.8 186171.05,518142.9 186170.95,518143.05 186170.85,518143.15 186170.75,518143.25 186170.6,518143.4 186170.5,518143.5 186170.4,51814