
Replace images.json approach as it's not scalable, not transactional, and is fragile #636

@spwoodcock

Description

  • Create a database table project_images:

Enums

-- 1. Create enum type for image status
CREATE TYPE image_status AS ENUM (
  'staged',         -- uploaded but not yet classified
  'classified',     -- successfully classified and moved
  'invalid_exif',   -- EXIF unreadable or missing
  'unmatched',      -- no task intersects this photo
  'duplicate'       -- hash collision within project
);
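
The statuses above describe the classifier's decision path for each staged image. A minimal sketch of that decision (hypothetical helper, not existing backend code):

```python
# Hypothetical sketch of the classifier's status decision, mirroring the
# image_status enum above. Function name and signature are illustrative only.

def decide_status(has_exif_gps: bool, matched_task: bool, is_duplicate: bool) -> str:
    """Return the terminal image_status value for a staged image."""
    if is_duplicate:
        return "duplicate"       # hash collision within project
    if not has_exif_gps:
        return "invalid_exif"    # EXIF unreadable or missing
    if not matched_task:
        return "unmatched"       # no task intersects this photo
    return "classified"          # successfully classified and moved
```

Every image starts as 'staged' on upload and transitions to exactly one of the other statuses during classification.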

Create the table (not partitioned for now; see the note below):

CREATE TABLE project_images (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  task_id UUID REFERENCES tasks(id) ON DELETE SET NULL,

  filename TEXT NOT NULL,
  s3_key TEXT NOT NULL,
  hash_md5 CHAR(32) NOT NULL,

  location GEOMETRY(POINT, 4326),  -- extracted from exif lat/lon
  exif JSONB,  -- The raw EXIF info if we need to process it again at some point

  uploaded_by UUID REFERENCES users(id) ON DELETE SET NULL,
  uploaded_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  classified_at TIMESTAMP WITH TIME ZONE,
  status image_status NOT NULL DEFAULT 'staged',

  -- optional metadata
  duplicate_of UUID REFERENCES project_images(id) ON DELETE SET NULL
);

(Future plan, only if scalability becomes a problem) Partition by year:

CREATE TABLE project_images_partitioned (
  LIKE project_images INCLUDING ALL
) PARTITION BY RANGE (uploaded_at);
-- Note: on a partitioned table, unique constraints (including the PK)
-- must contain the partition key, so the copied constraints would need
-- uploaded_at added.

CREATE TABLE project_images_2025
  PARTITION OF project_images_partitioned
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

(For now performance should be fine with good indexes, so don't do this.)

Indexes

CREATE INDEX idx_project_images_project ON project_images(project_id);
CREATE INDEX idx_project_images_status ON project_images(status);
CREATE INDEX idx_project_images_hash ON project_images(hash_md5);
CREATE INDEX idx_project_images_task ON project_images(task_id);

-- Composite index to speed up common queries (project + status, ordered by upload time)
CREATE INDEX IF NOT EXISTS idx_project_status_uploaded_at
  ON project_images(project_id, status, uploaded_at);

-- To prevent duplicates:
CREATE UNIQUE INDEX uniq_project_hash
  ON project_images(project_id, hash_md5);

-- Speed up classifier scans for staged images in `user-uploads` dir
CREATE INDEX idx_project_images_staged ON project_images(uploaded_at)
  WHERE status = 'staged';

-- Geospatial index
CREATE INDEX idx_project_images_location ON project_images USING GIST (location);
  • Create a migration that, for each task, downloads the related images.json and inserts the records into the database in batches.

Something like this (the unique index on (project_id, hash_md5) makes the insert idempotent, so duplicates are skipped):

INSERT INTO project_images (
  project_id, task_id, filename, s3_key, hash_md5,
  location, exif, uploaded_by, uploaded_at, status
)
VALUES
  (:project_id, :task_id, :filename, :s3_key, :hash_md5,
   ST_SetSRID(ST_MakePoint(:exif_lon, :exif_lat), 4326),
   :exif::jsonb, :uploaded_by, :uploaded_at, :status)
ON CONFLICT (project_id, hash_md5) DO NOTHING;
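
The batch insert can be driven from Python. A sketch assuming psycopg2's `execute_values` (the chunking helper and SQL string below are illustrative, not existing backend code):

```python
# Illustrative batching helper for the migration; the DB call itself is
# sketched in the comment at the bottom.
from itertools import islice

INSERT_SQL = """
    INSERT INTO project_images (
        project_id, task_id, filename, s3_key, hash_md5,
        location, exif, uploaded_by, uploaded_at, status
    )
    VALUES %s
    ON CONFLICT (project_id, hash_md5) DO NOTHING
"""

def chunked(records, size=500):
    """Yield lists of at most `size` records, for batched inserts."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# With a live connection this would be roughly:
#   from psycopg2.extras import execute_values
#   for batch in chunked(rows):
#       execute_values(cur, INSERT_SQL, batch)
# using a VALUES template that wraps lon/lat in
# ST_SetSRID(ST_MakePoint(%s, %s), 4326) for the location column.
```

Batching keeps each transaction small while avoiding one round trip per image.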
  • Refactor the backend to use this table instead of the images.json file.

#635 has additional details on the strategy going forward.
In summary, we should insert the EXIF for each uploaded image into our database, in batches during the sorting / classification process on the backend.
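
For the location column, the EXIF GPS tags store latitude/longitude as degrees/minutes/seconds plus a hemisphere ref, which must be converted to decimal degrees before building the POINT. A self-contained sketch (hypothetical helper, not existing backend code):

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert EXIF GPS degrees/minutes/seconds + hemisphere ref ('N'/'S'/'E'/'W')
    to signed decimal degrees, as expected by ST_MakePoint."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative
    return -value if ref in ("S", "W") else value
```

For example, `dms_to_decimal(27, 42, 18.0, "N")` gives 27.705, which would be passed as `:exif_lat` in the insert above.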

Metadata

Labels: backend (Related to backend code), bug (Something isn't working), effort:low (Likely a few hours), priority:high (Should be addressed as a priority), repo:drone-tm
