Support#457: Add Flightpath #871

Open
wants to merge 16 commits into
base: master
15 changes: 6 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -189,16 +189,9 @@ fab pull-production-images

## Data anonymisation

[Django birdbath](https://pypi.org/project/django-birdbath/) is being used to anonymise data locally. Ensure you have exported the following variables into the VM:
[Django birdbath](https://pypi.org/project/django-birdbath/) is being used to anonymise data locally.

```
export ALLOWS_ANONYMISATION='rca-staging'
export HEROKU_APP_NAME='rca-staging'
```

After pulling data, the cli will show a warning about birdbath needing to run. Which you should do.

Birdbath is on by default, set in settings/base.py and is turned off on live environments using environment variables.
Birdbath is on by default and will run after `fab pull-production-data` is run.

## Deployments

Expand Down Expand Up @@ -231,3 +224,7 @@ Or if you want to push your local media files.
fab push-staging-media
fab push-production-media
```

## Synchronising a production environment to a development environment

See the [reset development environment](docs/reset_development.md) documentation.
1 change: 1 addition & 0 deletions docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ services:
SECURE_SSL_REDIRECT: 'false'
STATIC_DIR: /app/static/
STATIC_URL: /static/
HEROKU_APP_NAME: local # this stops Birdbath's HerokuNotProductionCheck complaining
command: tail -f /dev/null # do nothing forever - exec commands elsewhere
ports:
- 8000:8000 # runserver
Expand Down
37 changes: 37 additions & 0 deletions docs/anonymised-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# RCA - Anonymising data

When pulling data from any hosted instance, consider carefully whether you need the full details of potentially personally-identifying, confidential or sensitive data.

## General principles:

- Pull data from staging rather than production servers, if this is good enough for your needs
- If it is necessary to pull data from production, e.g. for troubleshooting, consider whether anonymising personal data is possible and compatible with your needs
- If it is necessary to pull non-anonymised data from production, consider destroying this copy of the data as soon as you no longer need it

In more sensitive cases, consider a data protection policy to prevent access to production data except for authorised users.

## Anonymise

`django-birdbath` provides a management command (`run_birdbath`) that will anonymise the database.

As and when models/fields are added that may be populated with sensitive data (such as email addresses) a processor should be added to ensure that the data can be anonymised or deleted when it is copied from the production environment.

For full documentation see https://git.torchbox.com/internal/django-birdbath/-/blob/master/README.md.

The `flightpath` tool can be used to copy production data (and media) from the production environment to staging. It will automatically `run_birdbath` immediately following this sync operation. A manual CI action is included that will trigger flightpath to sync the environments.

Intended workflow:

1. Production data is synced to rca-development by flightpath
2. Birdbath anonymises the rca-development database
3. Anonymised data is pulled from **rca-development** to developers' local environments

This workflow should mean that un-anonymised data is never present on a developer's machine. If data directly from **production** is required, then the `run_birdbath` command should be run immediately after download.

## Student User Account Anonymisation

Student User Accounts are anonymised in the usual way by updating the fields using fake data.

As a student account can also have a related StudentPage, these are also anonymised by using fake data or removing related personal records.

As user groups and collections are created when each student is created, they are also anonymised, using the fake username generated for the student user account.
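
The naming scheme described above can be sketched without Django or Faker. This is a minimal illustration only: the hard-coded names stand in for Faker output, and `make_fake_username` is a hypothetical helper rather than part of the codebase. It mirrors the `first.last-count` pattern used in `rca/users/birdbath.py`, where the loop counter keeps usernames distinct even when two accounts draw the same fake name.

```python
def make_fake_username(first_name, last_name, count):
    """Build a lowercase username; the counter keeps duplicates distinct."""
    return f"{first_name}.{last_name}-{count}".lower()


# Hypothetical stand-in for Faker output, including a deliberate duplicate.
fake_names = [("Ada", "Smith"), ("Ada", "Smith"), ("Ben", "Jones")]

usernames = [
    make_fake_username(first, last, count)
    for count, (first, last) in enumerate(fake_names)
]

# Student groups follow the "Student: <username>" convention.
group_names = [f"Student: {username}" for username in usernames]
```

Because the counter comes from `enumerate`, the derived group names are unique for as long as the usernames are.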
66 changes: 66 additions & 0 deletions docs/reset_development.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
## Resetting the Development site

Steps for resetting the `dev` git branch, and deploying it with a cloned/anonymised copy of the production site database and media files.

### Pre-flight checks

1. Is this okay with the client, and other developers?
2. Is there any test content on `dev` that may need to be recreated, or be a reason to delay?
3. What branches are currently merged to dev?

```bash
$ git branch -a --merged origin/dev > branches_on_dev.txt
$ git branch -a --merged origin/master > branches_on_master.txt
$ diff branches_on_{master,dev}.txt
```

Take note of any of the above that are stale and do not need to be recreated.

4. Are there any user accounts on dev only, which will need to be recreated? Check with the client, and record them.
5. Make a copy of the staging dev `Wagtail Site records` urls and site names.
6. Take a backup of staging
```bash
$ heroku pg:backups:capture -a rca-development
```
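
The branch comparison in step 3 can also be expressed as a set difference. The sketch below assumes the two `git branch -a --merged` listings have already been captured as text; the sample listings are made up for illustration, and `parse_branches` and `branches_only_on_dev` are hypothetical helpers, not part of the project.

```python
def parse_branches(listing):
    """Return the set of branch names in `git branch -a` style output."""
    branches = set()
    for line in listing.splitlines():
        name = line.strip()
        # The currently checked-out branch is prefixed with "* ".
        if name.startswith("* "):
            name = name[2:]
        # Skip blank lines and symbolic refs such as "HEAD -> origin/master".
        if not name or " -> " in name:
            continue
        branches.add(name)
    return branches


def branches_only_on_dev(dev_listing, master_listing):
    """Branches merged to dev that are not merged to master, sorted."""
    return sorted(parse_branches(dev_listing) - parse_branches(master_listing))


# Illustrative sample data only.
dev_listing = """\
* dev
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/feature/123-extra-spangles
  remotes/origin/master
"""
master_listing = """\
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
"""

only_on_dev = branches_only_on_dev(dev_listing, master_listing)
```

Anything left in `only_on_dev` is a candidate for re-merging after the reset, unless it is stale.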

### Git

1. Reset the dev branch
```bash
$ git checkout dev && git fetch && git reset --hard origin/master && git push --force
```
2. Tell your colleagues
> @here I have reset the dev branch. Please delete your local dev branches
>
> ```
> $ git branch -D dev
> ```
>
> to avoid accidentally merging in the old version
3. Force-push to Heroku, otherwise CI will later fail: `$ git push --force heroku-development master`. This will trigger a deployment. Bear in mind that there may be incompatibilities between the old staging database and the new code from master; these will be resolved in the Database step below.
4. Merge in the relevant branches that need to be added back on staging.
```bash
$ git merge --no-ff origin/feature/123-extra-spangles
```
You may need to create merge migrations here depending on the type of work you need to merge.

### Database

1. To copy the production database over to dev, run the management command `./manage.py copy_db_to_dev` from the Heroku console on dev.
   This is a destructive action. Proofread the command thoroughly before running it.
2. To copy media across as well as data, you can use the optional argument `--media` with the above command, using a value of 1, e.g. `--media=1`
3. By default, a backup of the staging database is created before the database is copied. If you don't want this behaviour, you can set the optional argument `--backup` with a value of 0, e.g. `--backup=0`
4. By default, a snapshot of the production database is taken before the database is copied. If you don't want this behaviour, you can set the optional argument `--snapshot` with a value of 0, e.g. `--snapshot=0`
5. Flightpath will anonymise the database for you at the end of the copy process.
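
Since the command is destructive, it can help to assemble it explicitly before pasting it into the Heroku console. The helper below is a hypothetical sketch, not part of the project: it only uses the flags and defaults described in the steps above (`--media=1`, `--backup=0`, `--snapshot=0`) and assumes that omitting a flag leaves the documented default in place.

```python
def build_copy_command(media=False, backup=True, snapshot=True):
    """Return the copy_db_to_dev command line for the chosen options."""
    parts = ["./manage.py", "copy_db_to_dev"]
    if media:
        # Copy media files as well as the database.
        parts.append("--media=1")
    if not backup:
        # Skip the default backup of the staging database.
        parts.append("--backup=0")
    if not snapshot:
        # Skip the default snapshot of the production database.
        parts.append("--snapshot=0")
    return " ".join(parts)


default_command = build_copy_command()
with_media_no_backup = build_copy_command(media=True, backup=False)
```

Printing the result and reading it back before running it is one way to follow the advice to proofread the command.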

### Cleanup

1. Check the staging site loads
2. Update the Wagtail Site records, as the database will contain the production URLs

### Comms

1. Inform the client of the changes, e.g.
> All user accounts have been copied across, so your old dev password will no longer work. Log in with your production password (and then change it), or use the 'forgot password' feature.
> Any test content has been reset. This is probably the biggest inconvenience. Sorry.
> I have deleted the personally-identifying data from form submissions **and anywhere else relevant**. If there's any more on production (there shouldn't be) then please let me know and I'll remove it from dev.
7 changes: 5 additions & 2 deletions fabfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -218,7 +218,7 @@ def pull_production_images(c):
@task
def pull_production_data(c):
    """Pull database from production Heroku Postgres"""
    pull_database_from_heroku(c, PRODUCTION_APP_INSTANCE)
    pull_database_from_heroku(c, PRODUCTION_APP_INSTANCE, anonymise=True)


# @task
Expand Down Expand Up @@ -420,7 +420,7 @@ def pull_media_from_s3_heroku(c, app_instance):
    )


def pull_database_from_heroku(c, app_instance):
def pull_database_from_heroku(c, app_instance, anonymise=False):
    datestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    local(
Expand All @@ -438,6 +438,9 @@ def pull_database_from_heroku(c, app_instance):
        ),
    )

    if anonymise:
        dexec("./manage.py run_birdbath --skip-checks")


def open_heroku_shell(c, app_instance, shell_command="bash"):
    subprocess.call(["heroku", "run", shell_command, "-a", app_instance])
Expand Down
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,4 @@ nav:
- 'Content relationship via API': 'implementation-specs/content-relationships.md'
- 'Support runbook': 'support-runbook.md'
- 'Upgrading guidelines': 'upgrading.md'
- 'Anonymising data': 'anonymised-data.md'
11 changes: 11 additions & 0 deletions rca/enquire_to_study/birdbath.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
from birdbath.processors import BaseModelDeleter

from rca.enquire_to_study.models import EnquiryFormSubmission


class EnquiryFormSubmissionDeleter(BaseModelDeleter):
    """
    Delete all EnquiryFormSubmission records
    """

    model = EnquiryFormSubmission
11 changes: 11 additions & 0 deletions rca/scholarships/birdbath.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
from birdbath.processors import BaseModelDeleter

from rca.scholarships.models import ScholarshipEnquiryFormSubmission


class ScholarshipEnquiryFormSubmissionDeleter(BaseModelDeleter):
    """
    Delete all ScholarshipEnquiryFormSubmission records
    """

    model = ScholarshipEnquiryFormSubmission
16 changes: 15 additions & 1 deletion rca/settings/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -739,11 +739,25 @@
# Birdbath
BIRDBATH_CHECKS = [
    "birdbath.checks.contrib.heroku.HerokuNotProductionCheck",
    "birdbath.checks.contrib.heroku.HerokuAnonymisationAllowedCheck",
]
BIRDBATH_PROCESSORS = [
    "birdbath.processors.users.UserEmailAnonymiser",
    "birdbath.processors.users.UserPasswordAnonymiser",
    "birdbath.processors.contrib.wagtail.SearchQueryCleaner",
    "birdbath.processors.contrib.wagtail.FormSubmissionCleaner",
    "rca.enquire_to_study.birdbath.EnquiryFormSubmissionDeleter",
    "rca.scholarships.birdbath.ScholarshipEnquiryFormSubmissionDeleter",
    "rca.users.birdbath.StudentAccountAnonymiser",
]
BIRDBATH_REQUIRED = env.get("BIRDBATH_REQUIRED", "true").lower() == "true"
BIRDBATH_USER_ANONYMISER_EXCLUDE_SUPERUSERS = True
BIRDBATH_USER_ANONYMISER_EXCLUDE_EMAIL_RE = r"torchbox\.com$"

# Flightpath command settings
FLIGHTPATH_AUTH_KEY = os.environ.get("FLIGHTPATH_AUTH_KEY", None)
FLIGHTPATH_SOURCE_KEY = os.environ.get("FLIGHTPATH_SOURCE_KEY", None)
FLIGHTPATH_DESTINATION_KEY = os.environ.get("FLIGHTPATH_DESTINATION_KEY", None)

# Django Countries
# https://pypi.org/project/django-countries
COUNTRIES_FIRST = ["GB", "IE"]
Expand Down
4 changes: 4 additions & 0 deletions rca/settings/production.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,3 +14,7 @@
# Ensure that the CSRF cookie is only sent by browsers under an HTTPS connection.
# https://docs.djangoproject.com/en/stable/ref/settings/#csrf-cookie-secure
CSRF_COOKIE_SECURE = True

# Don't use Birdbath in production
# https://git.torchbox.com/internal/django-birdbath#common-settings
BIRDBATH_REQUIRED = False
98 changes: 98 additions & 0 deletions rca/users/birdbath.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
from birdbath.processors import BaseProcessor
from django.conf import settings
from faker import Faker

from rca.people.models import StudentPage
from rca.users.models import User


class BaseUserAnonymiser(BaseProcessor):
    def get_queryset(self):
        users = User.objects.all()

        if settings.BIRDBATH_USER_ANONYMISER_EXCLUDE_SUPERUSERS:
            users = users.exclude(is_superuser=True)

        if settings.BIRDBATH_USER_ANONYMISER_EXCLUDE_EMAIL_RE:
            users = users.exclude(
                email__regex=settings.BIRDBATH_USER_ANONYMISER_EXCLUDE_EMAIL_RE
            )

        # Exclude users whose email address ends with @rca.ac.uk.
        # The remaining users are all students, as student email addresses
        # end with @network.rca.ac.uk
        users = users.exclude(email__endswith="@rca.ac.uk")

        return users


class StudentAccountAnonymiser(BaseUserAnonymiser):
    """For each user account, anonymise the account along with:

    1. the student page
    2. the student group name
    3. the student image collection name

    """

    def run(self):
        fake = Faker()
        users = self.get_queryset()

        for count, user in enumerate(users):
            # Generate fake data for each user account
            fake_first = fake.first_name()
            fake_last = fake.last_name()
            # The count is appended for uniqueness
            fake_username = f"{fake_first}.{fake_last}-{count}".lower()

            self.rename_student_group(user, fake_username)
            self.anonymise_student_page(count, user, fake_first, fake_last)
            self.update_user_account(user, fake_first, fake_last, fake_username)

    def update_user_account(self, user, fake_first, fake_last, fake_username):
        """Update the user account"""
        user.username = fake_username
        user.first_name = fake_first
        user.last_name = fake_last
        user.save()

    def anonymise_student_page(self, count, user, fake_first, fake_last):
        """Anonymise the student page and related image collection name"""
        student_pages = StudentPage.objects.filter(student_user_account=user)
        for student_page in student_pages:

            # Change the image collection name to match the fake student name.
            # The count is appended for uniqueness.
            image_collection = student_page.student_user_image_collection
            image_collection.name = f"{fake_first} {fake_last} {count}"
            image_collection.save()

            # Change the student page fields so they match the student user account
            # and remove any personal information that may be in the fields
            student_page.title = f"{fake_first} {fake_last}"
            student_page.first_name = fake_first
            student_page.last_name = fake_last
            student_page.email = ""

            rev = student_page.save_revision()
            rev.publish()

    def rename_student_group(self, user, fake_username):
        """Rename the student group to match the student user account"""

        # If a group name begins with "Student: ", rename it to match the
        # student user account
        student_user_group = None
        for group in user.groups.all():
            if group.name.startswith("Student: "):
                # Student group names take the form "Student: firstname.lastname"
                group.name = f"Student: {fake_username}"
                group.save()
                student_user_group = group

        if student_user_group:
            student_user_pages = student_user_group.page_permissions.all()
            for student_user_page in student_user_pages:
                student_page = StudentPage.objects.get(id=student_user_page.page_id)
                if student_page.student_user_account == user:
                    student_user_group.name = f"Student: {fake_username}"
                    student_user_group.save()
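
The exclusion rules applied in `BaseUserAnonymiser.get_queryset` can be illustrated in plain Python. This is a sketch under stated assumptions: `will_be_anonymised` is a hypothetical helper, the email addresses are made up, and `re.search` is used as an approximation of the database-level `email__regex` lookup, whose exact behaviour depends on the database backend.

```python
import re

# Same pattern as BIRDBATH_USER_ANONYMISER_EXCLUDE_EMAIL_RE in settings/base.py.
EXCLUDE_EMAIL_RE = r"torchbox\.com$"


def will_be_anonymised(email, is_superuser=False):
    """Return True if an account would remain in the anonymiser queryset."""
    if is_superuser:
        # Superusers are excluded (BIRDBATH_USER_ANONYMISER_EXCLUDE_SUPERUSERS).
        return False
    if re.search(EXCLUDE_EMAIL_RE, email):
        # Addresses matching the exclusion pattern are left untouched.
        return False
    if email.endswith("@rca.ac.uk"):
        # The "@" means this does not match @network.rca.ac.uk addresses.
        return False
    return True


student = will_be_anonymised("a.student@network.rca.ac.uk")
staff = will_be_anonymised("a.tutor@rca.ac.uk")
agency = will_be_anonymised("dev@torchbox.com")
```

Note that the pattern is anchored only at the end of the string, so under this approximation any address ending in `torchbox.com` is excluded, not just addresses on that exact domain.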