Merge pull request #447 from tomalrussell/feature/extract_readme

Extract data update
This commit is contained in:
mz8i 2019-10-02 16:49:15 +01:00 committed by GitHub
commit 4cc7b59027
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 170 additions and 54 deletions

View File

@ -0,0 +1,130 @@
# Colouring London Data Extract
This extract contains a snapshot of contributions to Colouring London
(https://colouring.london).
Colouring London is a citizen science platform collecting information on every building in
London, to help make the city more sustainable.
The data included are open data, licensed under the Open Data Commons Open Database License
(ODbL, http://opendatacommons.org/licenses/odbl/) by Colouring London contributors.
You are free to copy, distribute, transmit and adapt the data, as long as you credit Colouring
London and our contributors. If you alter or build upon our data, you may distribute the
result only under the same licence.
## Contents
This extract contains four files:
- README.txt
- building_attributes.csv
- building_uprns.csv
- edit_history.csv
## Building Attributes
This is the main table, containing almost all data collected by Colouring London. Apart from
`building_id`, `revision_id` and `ref_toid`, all of these fields are optional.
- `building_id`: unique building ID for Colouring London buildings
- `revision_id`: unique revision ID for Colouring London, cross-references to our edit history
- `ref_toid`: cross-reference to Ordnance Survey MasterMap TOID
- `ref_osm_id`: cross-reference to OpenStreetMap feature osm_id
- `location_name`: building name
- `location_number`: building number
- `location_street`: street name
- `location_line_two`: additional address line
- `location_town`: town
- `location_postcode`: postcode
- `location_latitude`: latitude
- `location_longitude`: longitude
- `date_year`: year built
- `date_lower`: lower bound on year built
- `date_upper`: upper bound on year built
- `date_source`: type of source for building dates
- `date_source_detail`: details of source for building dates
- `date_link`: list of links to further information relating to building dates
- `facade_year`: facade date
- `facade_upper`: upper bound on facade date
- `facade_lower`: lower bound on facade date
- `facade_source`: type of source for facade dates
- `facade_source_detail`: details of source for facade dates
- `size_storeys_attic`: number of attic storeys
- `size_storeys_core`: number of core storeys
- `size_storeys_basement`: number of basement storeys
- `size_height_apex`: height in metres to the building apex
- `size_floor_area_ground`: ground floor floor area in square metres
- `size_floor_area_total`: total floor area in square metres
- `size_width_frontage`: width of frontage in metres
- `likes_total`: number of times the building has been liked by Colouring London users
- `planning_portal_link`: link to an entry on https://www.planningportal.co.uk/
- `planning_in_conservation_area`: in a conservation area? (True/False)
- `planning_conservation_area_name`: conservation area name
- `planning_in_list`: in the National Heritage List for England? (True/False)
- `planning_list_id`: National Heritage List for England ID
- `planning_list_cat`: National Heritage List for England listing type
- `planning_list_grade`: National Heritage List for England listing grade
- `planning_heritage_at_risk_id`: on the Heritage at Risk list? (True/False)
- `planning_world_list_id`: UNESCO World Heritage list ID
- `planning_in_glher`: in the Greater London Historic Environment Record? (True/False)
- `planning_glher_url`: Greater London Historic Environment Record link
- `planning_in_apa`: in an Architectural Priority Area? (True/False)
- `planning_apa_name`: Architectural Priority Area name
- `planning_apa_tier`: Architectural Priority Area tier
- `planning_in_local_list`: in a local list? (True/False)
- `planning_local_list_url`: local list reference link
- `planning_in_historic_area_assessment`: within a historic area assessment? (True/False)
- `planning_historic_area_assessment_url`: historic area assessment reference link
## Building UPRNs
Buildings are matched to UPRNs (Unique Property Reference Numbers), which should help link
Colouring London data against other datasets.
Read more about UPRNs: https://www.ordnancesurvey.co.uk/business-government/tools-support/uprn
`building_uprns.csv` looks something like this:
building_id,uprn,parent_uprn
2810432,10091093495,100023038313
2810432,10091093496,100023038313
2810432,10091093497,
- `building_id`: Colouring London unique building ID, references the building_id in
building_attributes.csv
- `uprn`: Unique Property Reference Number associated with the building. In some cases
multiple UPRNs are associated with a single Colouring London building, for example in
blocks of flats or mixed-use buildings.
- `parent_uprn`: optional. Some UPRNs are grouped by a parent-child relationship, so while
each UPRN is unique, multiple UPRNs may share the same parent.
## Edit History
Each change to the Colouring London database is recorded, so it is possible to explore how the
dataset evolves over time.
The edit history logs changes made by users, with the following fields:
- `revision_id`: unique change id, referenced by building_attributes
- `revision_timestamp`: date and time of the change
- `building_id`: Colouring London building ID, references building_attributes
- `forward_patch`: the changes made, encoded as a JSON string where keys are attribute/column
names, and values are the values set by this change.
- `reverse_patch`: the reverse of the change, encoded as a JSON string. This shows what the
values were before this change was made.
- `user`: username of the user who made the change
For example a forward patch might show a building date being provided, along with some source
details:
{"date_year": 1911, "date_source_details": "Survey of London Marylebone draft text"}
Where the reverse patch shows that there was no previous data stored:
{"date_year": None, "date_source_details": None}

View File

@ -1,4 +1,4 @@
SELECT
COPY (SELECT
building_id,
ref_toid,
ref_osm_id,
@ -16,6 +16,7 @@ SELECT
date_upper,
date_source,
date_source_detail,
date_link,
facade_year,
facade_upper,
facade_lower,
@ -34,6 +35,8 @@ SELECT
planning_conservation_area_name,
planning_in_list,
planning_list_id,
planning_list_cat,
planning_list_grade,
planning_heritage_at_risk_id,
planning_world_list_id,
planning_in_glher,
@ -44,8 +47,7 @@ SELECT
planning_in_local_list,
planning_local_list_url,
planning_in_historic_area_assessment,
planning_historic_area_assessment_url,
planning_list_cat,
planning_list_grade,
date_link
FROM buildings
planning_historic_area_assessment_url
FROM buildings)
TO '/tmp/building_attributes.csv'
WITH CSV HEADER

View File

@ -1,3 +1,12 @@
SELECT log_id as revision_id, log_timestamp as revision_timestamp, building_id, forward_patch, reverse_patch, u.username as user
COPY(SELECT
log_id as revision_id,
date_trunc('second', log_timestamp) as revision_timestamp,
building_id,
forward_patch,
reverse_patch,
u.username as user
FROM logs l
JOIN users u ON l.user_id = u.user_id
JOIN users u
ON l.user_id = u.user_id)
TO '/tmp/edit_history.csv'
WITH CSV HEADER

View File

@ -1,3 +1,8 @@
SELECT building_id, uprn, parent_uprn
COPY(SELECT
building_id,
uprn,
parent_uprn
FROM building_properties
WHERE building_id IS NOT NULL
WHERE building_id IS NOT NULL)
TO '/tmp/building_uprns.csv'
WITH CSV HEADER

View File

@ -22,39 +22,6 @@ def get_connection():
)
def fetch_with_server_side_cursor(
connection,
query,
on_row,
row_batch_size=10000
):
with connection.cursor('server_side') as cur:
cur.itersize = row_batch_size
cur.execute(query)
header_saved = False
for row in cur:
if not header_saved:
columns = [c[0] for c in cur.description]
on_row(columns)
header_saved = True
on_row(row)
def db_to_csv(connection, query):
string_io = StringIO()
writer = csv.writer(string_io)
fetch_with_server_side_cursor(
connection,
query,
lambda row: writer.writerow(row)
)
return string_io.getvalue()
def get_extract_zip_file_path(current_time):
base_dir = Path(os.environ['EXTRACTS_DIRECTORY'])
file_name = f"data-extract-{current_time:%Y-%m-%d-%H_%M_%S}.zip"
@ -79,27 +46,30 @@ def read_sql(rel_path_from_script):
return sql_path.read_text()
building_attr_query = read_sql('./export_attributes.sql')
building_uprn_query = read_sql('./export_uprns.sql')
edit_history_query = read_sql('./export_edit_history.sql')
def make_data_extract(current_time, connection, zip_file_path):
if zip_file_path.exists():
raise ZipFileExistsError('Archive file under specified name already exists')
# Execute data dump as Postgres COPY commands, write from server to /tmp
with connection.cursor() as cur:
cur.execute(read_sql('./export_attributes.sql'))
with connection.cursor() as cur:
cur.execute(read_sql('./export_uprns.sql'))
with connection.cursor() as cur:
cur.execute(read_sql('./export_edit_history.sql'))
zip_file_path.parent.mkdir(parents=True, exist_ok=True)
try:
with zipfile.ZipFile(zip_file_path, mode='w') as newzip:
newzip.writestr('building_attributes.csv',
db_to_csv(connection, building_attr_query))
newzip.writestr('building_uprns.csv',
db_to_csv(connection, building_uprn_query))
newzip.writestr('edit_history.csv',
db_to_csv(connection, edit_history_query))
# TODO: add README
newzip.write('README.txt')
newzip.write('/tmp/building_attributes.csv', arcname='building_attributes.csv')
newzip.write('/tmp/building_uprns.csv', arcname='building_uprns.csv')
newzip.write('/tmp/edit_history.csv', arcname='edit_history.csv')
add_extract_record_to_database(connection, zip_file_path, current_time)
except: