Merge pull request #447 from tomalrussell/feature/extract_readme

Extract data update
2019-10-02 16:49:15 +01:00 · 2019-10-02 16:49:15 +01:00 · 4cc7b59027
commit 4cc7b59027
parent 13495ab495 53f3f230fd
5 changed files with 170 additions and 54 deletions
--- a/maintenance/extract_data/README.txt
+++ b/maintenance/extract_data/README.txt
@ -0,0 +1,130 @@
+# Colouring London Data Extract
+
+This extract contains a snapshot of contributions to Colouring London
+(https://colouring.london).
+
+Colouring London is a citizen science platform collecting information on every building in
+London, to help make the city more sustainable.
+
+The data included are open data, licensed under the Open Data Commons Open Database License
+(ODbL, http://opendatacommons.org/licenses/odbl/) by Colouring London contributors.
+
+You are free to copy, distribute, transmit and adapt the data, as long as you credit Colouring
+London and our contributors. If you alter or build upon our data, you may distribute the
+result only under the same licence.
+
+
+## Contents
+
+This extract contains four files:
+
+- README.txt
+- building_attributes.csv
+- building_uprns.csv
+- edit_history.csv
+
+
+## Building Attributes
+
+This is the main table, containing almost all data collected by Colouring London. Apart from
+`building_id`, `revision_id` and `ref_toid`, all of these fields are optional.
+
+- `building_id`: unique building ID for Colouring London buildings
+- `revision_id`: unique revision ID for Colouring London, cross-references to our edit history
+- `ref_toid`: cross-reference to Ordnance Survey MasterMap TOID
+- `ref_osm_id`: cross-reference to OpenStreetMap feature osm_id
+- `location_name`: building name
+- `location_number`: building number
+- `location_street`: street name
+- `location_line_two`: additional address line
+- `location_town`: town
+- `location_postcode`: postcode
+- `location_latitude`: latitude
+- `location_longitude`: longitude
+- `date_year`: year built
+- `date_lower`: lower bound on year built
+- `date_upper`: upper bound on year built
+- `date_source`: type of source for building dates
+- `date_source_detail`: details of source for building dates
+- `date_link`: list of links to further information relating to building dates
+- `facade_year`: facade date
+- `facade_upper`: upper bound on facade date
+- `facade_lower`: lower bound on facade date
+- `facade_source`: type of source for facade dates
+- `facade_source_detail`: details of source for facade dates
+- `size_storeys_attic`: number of attic storeys
+- `size_storeys_core`: number of core storeys
+- `size_storeys_basement`: number of basement storeys
+- `size_height_apex`: height in metres to the building apex
+- `size_floor_area_ground`: ground floor area in square metres
+- `size_floor_area_total`: total floor area in square metres
+- `size_width_frontage`: width of frontage in metres
+- `likes_total`: number of times the building has been liked by Colouring London users
+- `planning_portal_link`: link to an entry on https://www.planningportal.co.uk/
+- `planning_in_conservation_area`: in a conservation area? (True/False)
+- `planning_conservation_area_name`: conservation area name
+- `planning_in_list`: in the National Heritage List for England? (True/False)
+- `planning_list_id`: National Heritage List for England ID
+- `planning_list_cat`: National Heritage List for England listing type
+- `planning_list_grade`: National Heritage List for England listing grade
+- `planning_heritage_at_risk_id`: on the Heritage at Risk list? (True/False)
+- `planning_world_list_id`: UNESCO World Heritage list ID
+- `planning_in_glher`: in the Greater London Historic Environment Record? (True/False)
+- `planning_glher_url`: Greater London Historic Environment Record link
+- `planning_in_apa`: in an Archaeological Priority Area? (True/False)
+- `planning_apa_name`: Archaeological Priority Area name
+- `planning_apa_tier`: Archaeological Priority Area tier
+- `planning_in_local_list`: in a local list? (True/False)
+- `planning_local_list_url`: local list reference link
+- `planning_in_historic_area_assessment`: within a historic area assessment? (True/False)
+- `planning_historic_area_assessment_url`: historic area assessment reference link
+
+
+## Building UPRNs
+
+Buildings are matched to UPRNs (Unique Property Reference Numbers), which should help link
+Colouring London data to other datasets.
+
+Read more about UPRNs: https://www.ordnancesurvey.co.uk/business-government/tools-support/uprn
+
+`building_uprns.csv` looks something like this:
+
+    building_id,uprn,parent_uprn
+    2810432,10091093495,100023038313
+    2810432,10091093496,100023038313
+    2810432,10091093497,
+
+- `building_id`: Colouring London unique building ID, references the building_id in
+  building_attributes.csv
+- `uprn`: Unique Property Reference Number associated with the building. In some cases
+  multiple UPRNs are associated with a single Colouring London building, for example in
+  blocks of flats or mixed-use buildings.
+- `parent_uprn`: optional. Some UPRNs are grouped by a parent-child relationship, so while
+  each UPRN is unique, multiple UPRNs may share the same parent.
+
+
+## Edit History
+
+Each change to the Colouring London database is recorded, so it is possible to explore how the
+dataset evolves over time.
+
+The edit history logs changes made by users, with the following fields:
+
+- `revision_id`: unique change id, referenced by building_attributes
+- `revision_timestamp`: date and time of the change
+- `building_id`: Colouring London building ID, references building_attributes
+- `forward_patch`: the changes made, encoded as a JSON string where keys are attribute/column
+  names, and values are the values set by this change.
+- `reverse_patch`: the reverse of the change, encoded as a JSON string. This shows what the
+  values were before this change was made.
+- `user`: username of the user who made the change
+
+
+For example, a forward patch might show a building date being provided, along with some source
+details:
+
+    {"date_year": 1911, "date_source_detail": "Survey of London Marylebone draft text"}
+
+The corresponding reverse patch shows that no previous data was stored:
+
+    {"date_year": null, "date_source_detail": null}
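As a rough illustration of how the edit history described in README.txt might be consumed, the sketch below decodes `forward_patch` and `reverse_patch` from rows shaped like `edit_history.csv`, using only Python's standard library. The sample row is invented for illustration (the timestamp and the `example_user` name are hypothetical), `read_edits` and `changed_fields` are hypothetical helpers, and the sketch assumes both patches are valid JSON with `null` for missing values.

```python
import csv
import io
import json

# Hypothetical sample in the shape of edit_history.csv; the values are invented.
SAMPLE = '''revision_id,revision_timestamp,building_id,forward_patch,reverse_patch,user
1,2019-10-02 16:49:15+01,2810432,"{""date_year"": 1911}","{""date_year"": null}",example_user
'''


def read_edits(csv_file):
    """Yield one dict per edit, with both patches decoded from JSON."""
    for row in csv.DictReader(csv_file):
        row["forward_patch"] = json.loads(row["forward_patch"])
        row["reverse_patch"] = json.loads(row["reverse_patch"])
        yield row


def changed_fields(edit):
    """Map each changed attribute to a (before, after) pair."""
    before = edit["reverse_patch"]
    after = edit["forward_patch"]
    return {key: (before.get(key), after[key]) for key in after}


edits = list(read_edits(io.StringIO(SAMPLE)))
print(changed_fields(edits[0]))  # {'date_year': (None, 1911)}
```

To read a real extract, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle on `edit_history.csv`. Note that `csv.DictReader` returns every plain column, including `building_id`, as a string.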
--- a/maintenance/extract_data/export_attributes.sql
+++ b/maintenance/extract_data/export_attributes.sql
@ -1,4 +1,4 @@
-SELECT
+COPY (SELECT
    building_id,
    ref_toid,
    ref_osm_id,
@ -16,6 +16,7 @@ SELECT
    date_upper,
    date_source,
    date_source_detail,
+    date_link,
    facade_year,
    facade_upper,
    facade_lower,
@ -34,6 +35,8 @@ SELECT
    planning_conservation_area_name,
    planning_in_list,
    planning_list_id,
+    planning_list_cat,
+    planning_list_grade,
    planning_heritage_at_risk_id,
    planning_world_list_id,
    planning_in_glher,
@ -44,8 +47,7 @@ SELECT
    planning_in_local_list,
    planning_local_list_url,
    planning_in_historic_area_assessment,
-    planning_historic_area_assessment_url,
-    planning_list_cat,
-    planning_list_grade,
-    date_link
-FROM buildings
+    planning_historic_area_assessment_url
+FROM buildings)
+TO '/tmp/building_attributes.csv'
+WITH CSV HEADER
--- a/maintenance/extract_data/export_edit_history.sql
+++ b/maintenance/extract_data/export_edit_history.sql
@ -1,3 +1,12 @@
-SELECT log_id as revision_id, log_timestamp as revision_timestamp, building_id, forward_patch, reverse_patch, u.username as user
+COPY(SELECT
+    log_id as revision_id,
+    date_trunc('second', log_timestamp) as revision_timestamp,
+    building_id,
+    forward_patch,
+    reverse_patch,
+    u.username as user
 FROM logs l
-JOIN users u ON l.user_id = u.user_id
+JOIN users u
+    ON l.user_id = u.user_id)
+TO '/tmp/edit_history.csv'
+WITH CSV HEADER
--- a/maintenance/extract_data/export_uprns.sql
+++ b/maintenance/extract_data/export_uprns.sql
@ -1,3 +1,8 @@
-SELECT building_id, uprn, parent_uprn
+COPY(SELECT
+    building_id,
+    uprn,
+    parent_uprn
 FROM building_properties
-WHERE building_id IS NOT NULL
+    WHERE building_id IS NOT NULL)
+TO '/tmp/building_uprns.csv'
+WITH CSV HEADER
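The `COPY (...) TO '/tmp/...'` statements above write their output on the database server's filesystem, which in Postgres requires superuser rights or an equivalent server-file role, and only works here if the extract script runs on the same host as the database. As an alternative sketch only (not what this PR does), the same statements could be streamed to the client with `COPY ... TO STDOUT`. This assumes the connection comes from psycopg2, whose cursors provide `copy_expert(sql, file)`; `to_stdout` and `copy_query_to_file` are hypothetical helpers, and `FakeCursor` is a stand-in so the sketch runs without a database.

```python
import io
import re


def to_stdout(copy_sql):
    """Rewrite "COPY (...) TO '/path'" so rows stream to the client instead."""
    return re.sub(r"TO\s+'[^']*'", "TO STDOUT", copy_sql, count=1)


def copy_query_to_file(cursor, copy_sql, out_file):
    """Run a COPY statement and write its output to out_file.

    Assumes `cursor` behaves like a psycopg2 cursor, i.e. it provides
    copy_expert(sql, file).
    """
    cursor.copy_expert(to_stdout(copy_sql), out_file)


class FakeCursor:
    """Stand-in for a database cursor so the sketch runs without Postgres."""

    def copy_expert(self, sql, file):
        self.sql = sql  # record the statement that would be sent
        file.write("building_id,uprn,parent_uprn\n")


sql = (
    "COPY (SELECT building_id, uprn, parent_uprn FROM building_properties) "
    "TO '/tmp/building_uprns.csv' WITH CSV HEADER"
)
cursor = FakeCursor()
buffer = io.StringIO()
copy_query_to_file(cursor, sql, buffer)
print(cursor.sql)
```

With a real connection, `out_file` could be a file opened for writing, which would keep the export on the client and avoid depending on `/tmp` on the server. The regular expression is deliberately simple and assumes the only `TO '...'` in the statement is the output path.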
--- a/maintenance/extract_data/extract_data.py
+++ b/maintenance/extract_data/extract_data.py
@ -22,39 +22,6 @@ def get_connection():
    )


-def fetch_with_server_side_cursor(
-    connection,
-    query,
-    on_row,
-    row_batch_size=10000
-):
-    with connection.cursor('server_side') as cur:
-        cur.itersize = row_batch_size
-        cur.execute(query)
-
-        header_saved = False
-
-        for row in cur:
-            if not header_saved:
-                columns = [c[0] for c in cur.description]
-                on_row(columns)
-                header_saved = True
-            on_row(row)
-
-
-def db_to_csv(connection, query):
-    string_io = StringIO()
-    writer = csv.writer(string_io)
-
-    fetch_with_server_side_cursor(
-        connection,
-        query,
-        lambda row: writer.writerow(row)
-    )
-
-    return string_io.getvalue()
-
-
 def get_extract_zip_file_path(current_time):
    base_dir = Path(os.environ['EXTRACTS_DIRECTORY'])
    file_name = f"data-extract-{current_time:%Y-%m-%d-%H_%M_%S}.zip"
@ -79,27 +46,30 @@ def read_sql(rel_path_from_script):
    return sql_path.read_text()


-building_attr_query = read_sql('./export_attributes.sql')
-building_uprn_query = read_sql('./export_uprns.sql')
-edit_history_query = read_sql('./export_edit_history.sql')


 def make_data_extract(current_time, connection, zip_file_path):
    if zip_file_path.exists():
        raise ZipFileExistsError('Archive file under specified name already exists')

+    # Execute data dump as Postgres COPY commands, write from server to /tmp
+    with connection.cursor() as cur:
+        cur.execute(read_sql('./export_attributes.sql'))
+
+    with connection.cursor() as cur:
+        cur.execute(read_sql('./export_uprns.sql'))
+
+    with connection.cursor() as cur:
+        cur.execute(read_sql('./export_edit_history.sql'))
+
    zip_file_path.parent.mkdir(parents=True, exist_ok=True)

    try:
        with zipfile.ZipFile(zip_file_path, mode='w') as newzip:
-            newzip.writestr('building_attributes.csv',
-                            db_to_csv(connection, building_attr_query))
-            newzip.writestr('building_uprns.csv',
-                            db_to_csv(connection, building_uprn_query))
-            newzip.writestr('edit_history.csv',
-                            db_to_csv(connection, edit_history_query))
-
-            # TODO: add README
+            newzip.write(Path(__file__).parent / 'README.txt', arcname='README.txt')
+            newzip.write('/tmp/building_attributes.csv', arcname='building_attributes.csv')
+            newzip.write('/tmp/building_uprns.csv', arcname='building_uprns.csv')
+            newzip.write('/tmp/edit_history.csv', arcname='edit_history.csv')

        add_extract_record_to_database(connection, zip_file_path, current_time)
    except: