Unflatten is silently dropping all but last index of supplier array from UK government procurement CSVs #384

Open
poulson opened this issue Apr 21, 2021 · 3 comments

Comments

@poulson

poulson commented Apr 21, 2021

2021.04.20.csv

Despite the clear existence of 31 separate suppliers (including Palantir Technologies) in this single-row input CSV (extracted from yesterday's export of UK government procurement contracts), unflatten is only preserving the last supplier (Workday).

I have been unflattening using a command of the form

flatten-tool unflatten input-dir --root-id=ocid --root-is-list --input-format csv --encoding ascii --output-name unflattened.json

The relevant -- and incomplete -- portion of the output is:

                "awards": [
                    {
                        "id": "6df7e3ce-54f4-4151-a3ed-0dfc7aead845",
                        "description": "See description of related tender",
                        "status": "active",
                        "date": "2021-03-26T00:00:00Z",
                        "value": {
                            "amount": "1200000000.0",
                            "currency": "GBP"
                        },
                        "suppliers": [
                            {
                                "id": "0.0",
                                "identifier": {
                                    "scheme": "GB-COH",
                                    "id": "521013.0"
                                },
                                "name": "WORKDAY LIMITED",
                                "address": {
                                    "streetAddress": "THE KING'S BUILDING,MAY LANE\nDUBLIN 7\nIE"
                                },
                                "x_awardValue": {
                                    "currency": "GBP"
                                },
                                "sme": "1.0"
                            }
                        ],
                        "contractPeriod": {
                            "startDate": "2021-04-06T00:00:00Z",
                            "endDate": "2024-12-09T00:00:00Z"
                        }
                    }
                ]

While I understand that a schema would be of use, I don't understand after reading https://flatten-tool.readthedocs.io/en/latest/unflatten/ why most of the supplier columns are being entirely ignored. I am therefore posting here because this looks like a bug in unflatten.

@jpmckinney
Contributor

I believe the issue is that the UK assigns the same id to all awards, which the unflatten routine ends up merging into a single row. I believe the command outputs a warning about this, but I could be wrong.

I think if you delete the id values, then the command will work as expected.
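To make the suspected mechanism concrete, here is a toy sketch of id-based merging. It is not flatten-tool's actual code; `merge_by_id` is a hypothetical helper, and the two supplier records are illustrative, reusing only the names and the `"0.0"` id that appear in this issue. It assumes items with the same `id` are folded into one object, with later values overwriting earlier ones.

```python
def merge_by_id(items):
    """Toy merge (hypothetical, not flatten-tool's implementation):
    items sharing an id collapse into one dict, later values win."""
    merged = {}
    for item in items:
        key = item.get("id")
        if key is None:
            # No id to match on, so give the item a unique key and keep it.
            key = object()
        merged.setdefault(key, {}).update(item)
    return list(merged.values())


# Illustrative data: two suppliers that share the id "0.0".
suppliers = [
    {"id": "0.0", "name": "Palantir Technologies"},
    {"id": "0.0", "name": "WORKDAY LIMITED"},
]

with_ids = merge_by_id(suppliers)
without_ids = merge_by_id([{"name": s["name"]} for s in suppliers])

print(len(with_ids), len(without_ids))
```

Under that assumption, the shared id leaves a single supplier carrying the last name seen, while removing the ids keeps both entries.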

@poulson
Author

poulson commented Apr 21, 2021

When run from the command line, I don't see any warnings, but I will indeed look into preprocessing out the id values and appreciate the tip.

@poulson
Author

poulson commented Apr 21, 2021

FWIW, I have confirmed that dropping the suppliers/id columns before unflattening fixes the problem. I agree that one would hope a warning would have been printed about this and appreciate the help.
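For anyone hitting the same thing, a minimal sketch of that preprocessing step follows. It assumes the CSV headers are slash-separated paths with numeric indices, such as `awards/0/suppliers/3/id`; the attached file's exact headers are not reproduced here, so the pattern may need adjusting. `drop_supplier_id_columns` is a hypothetical helper name, not part of flatten-tool.

```python
import csv
import re

# Assumed header shape: "awards/0/suppliers/3/id". This deliberately does not
# match "awards/0/suppliers/3/identifier/id", which should be kept.
SUPPLIER_ID_COLUMN = re.compile(r"(^|/)suppliers/\d+/id$")


def drop_supplier_id_columns(src_path, dst_path, encoding="utf-8"):
    """Copy a CSV, omitting columns whose header matches SUPPLIER_ID_COLUMN.

    Returns the number of columns dropped.
    """
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty input, nothing to copy
        keep = [i for i, name in enumerate(header)
                if not SUPPLIER_ID_COLUMN.search(name)]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Tolerate short rows by skipping indices past the row's end.
            writer.writerow([row[i] for i in keep if i < len(row)])
    return len(header) - len(keep)
```

Writing the cleaned file into the input directory in place of the original and rerunning the same `flatten-tool unflatten` command should then keep each supplier as its own array entry. Pass `encoding="ascii"` to match the `--encoding ascii` option used above.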
