dp-hierarchy-builder
The hierarchy builder is a service that forms part of the dataset import process. It requires a 'full' hierarchy to be available for the dataset you are importing
Getting started
Run the dp-hierarchy-builder
service:
make debug
Hierarchy import scripts
The import scripts rely on the CLI tool for the database you're targeting:
cypher-shell
for Neo4j / Cypher import
gremlin.sh
for Neptune / Gremlin imports
You may also need an SSH tunnel to the database if it's not running locally.
Example import for the CPIH full hierarchy: (replace the filename for other hierarchy import files)
Cypher:
cypher-shell < import-scripts/cypher/cpih1dim1aggid.cypher
Gremlin:
./gremlin-import.sh import-scripts/gremlin/cpih1dim1aggid.grm
For Cypher imports, you can use additional flags if running against an environment other than localhost:
cypher-shell -u USER -p PASSWORD -a bolt://localhost:7687 < .....
Kafka scripts
Scripts for updating and debugging Kafka can be found here(dp-data-tools)
Configuration
Environment variable |
Default |
Description |
BIND_ADDR |
:22700 |
The host and port to bind to |
KAFKA_ADDR |
localhost:9092 |
A list of Kafka host addresses |
KAFKA_VERSION |
1.0.2 |
The kafka version that this service expects to connect to |
KAFKA_OFFSET_OLDEST |
true |
sets kafka offset to oldest if true |
KAFKA_SEC_PROTO |
unset |
if set to TLS , kafka connections will use TLS [1] |
KAFKA_SEC_CLIENT_KEY |
unset |
PEM for the client key [1] |
KAFKA_SEC_CLIENT_CERT |
unset |
PEM for the client certificate [1] |
KAFKA_SEC_CA_CERTS |
unset |
CA cert chain for the server cert [1] |
KAFKA_SEC_SKIP_VERIFY |
false |
ignores server certificate issues if true [1] |
CONSUMER_GROUP |
dp-hierarchy-builder |
The name of the Kafka consumer group |
CONSUMER_TOPIC |
observations-imported |
The name of the topic to consumes messages from |
PRODUCER_TOPIC |
hierarchy-built |
The name of the topic to produces messages to |
ERROR_PRODUCER_TOPIC |
import-error |
The name of the topic to send error messages to |
GRACEFUL_SHUTDOWN_TIMEOUT |
time.Second * 10 |
Time time to wait when gracefully shutting down before closing |
HEALTHCHECK_INTERVAL |
30s |
The time between doing health checks |
HEALTHCHECK_CRITICAL_TIMEOUT |
90s |
The time taken for the health changes from warning state to critical due to subsystem check failures |
Plus the graph database vars from dp-graph - namely GRAPH_DRIVER_TYPE
and GRAPH_ADDR
Notes:
- For more info, see the kafka TLS examples documentation
Graph / Neptune Configuration
Environment variable |
Default |
Description |
GRAPH_DRIVER_TYPE |
"" |
string identifier for the implementation to be used (e.g. 'neptune' or 'mock') |
GRAPH_ADDR |
"" |
address of the database matching the chosen driver type (web socket) |
NEPTUNE_TLS_SKIP_VERIFY |
false |
flag to skip TLS certificate verification, should only be true when run locally |
⚠ to connect to a remote Neptune environment on MacOSX using Go 1.18 or higher you must set NEPTUNE_TLS_SKIP_VERIFY
to true. See our Neptune guide for more details.
Healthcheck
The /healthcheck
endpoint returns the current status of the service. Dependent services are health checked on an interval defined by the HEALTHCHECK_INTERVAL
environment variable.
On a development machine a request to the health check endpoint can be made by:
curl localhost:22700/healthcheck
There are a number of utility applications for manual tasks (found under the cmd directory):
- v4-transformer - take a V4 file and create a full hierarchy input file / cypher script
- geography-transformer - take a geography input CSV file and output a full hierarchy input file / cypher script
- hierarchy-transformer - take a hierarchy input CSV file and generate cypher script
- builder - builds an instance hierarchy from a full hierarchy
- producer - produces a Kafka message for the dp-hierarchy-builder process to consume
Manually building instance hierarchies
- Manually create instance hierarchy for CPIH - note you will have to replace the value for 'instance-id'
go run cmd/builder/main.go --instance-id 27a4019f-6491-4876-bbdd-1439a40e5bb9 --dimension-name aggregate --code-list-id e44de4c4-d39e-4e2f-942b-3ca10584d078
- Manually create instance hierarchy for mid-year-pop-est - note you will have to replace the value for 'instance-id'
go run cmd/builder/main.go --instance-id 34b8c139-a1fe-45b1-95e2-e77df3682256 --dimension-name geography --code-list-id mid-year-pop-geography
If running one of the above commands against an environment, you can specify the neo4j URL with the flag (replacing USER, PASSSWORD, host, and port as required):
--neo-url="bolt://USER:PASSWORD@localhost:7687"
make FILE=./cmd/hierarchy-transformer/hierarchy.csv generate-full
output is written to ./cmd/hierarchy-transformer/output
make FILE=./cmd/geography-transformer/WD16_LAD16_CTY16_OTH_UK_LU.csv generate-full-from-geography
output is written to ``./cmd/geography-transformer/output`
codelistid=cpih1dim1aggid
go run cmd/code-label-parent-transformer/main.go --file import-scripts/code-label-parent-csv/$codelistid.csv --code-list-id $codelistid --output import-scripts/$codelistid.csv`
produce a Kafka message for the dp-hierarchy-builder process to consume
go run cmd/producer/main.go --instance-id '58004716-a2d4-4dd1-a6c3-6accab30ad2a' --code-list-id 'cpih1dim1aggid' --dimension-name 'aggregate'
Contributing
See CONTRIBUTING for details.
License
Copyright © 2016-2021, Office for National Statistics (https://www.ons.gov.uk)
Released under MIT license, see LICENSE for details.