
Replace images.json approach as it's not scalable, not transactional, and is fragile #636

@spwoodcock

Description

  • Create a database table project_images:

Enums

-- 1. Create enum type for image status
CREATE TYPE image_status AS ENUM (
  'staged',         -- uploaded but not yet classified
  'classified',     -- successfully classified and moved
  'invalid_exif',   -- EXIF unreadable or missing
  'unmatched',      -- no task intersects this photo
  'duplicate'       -- hash collision within project
);
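
The statuses above describe the classifier's decision path for each staged image. A minimal sketch of that decision (hypothetical helper, not existing backend code):

```python
# Hypothetical sketch of the classifier's status decision, mirroring the
# image_status enum above. Function name and signature are illustrative only.

def decide_status(has_exif_gps: bool, matched_task: bool, is_duplicate: bool) -> str:
    """Return the terminal image_status value for a staged image."""
    if is_duplicate:
        return "duplicate"       # hash collision within project
    if not has_exif_gps:
        return "invalid_exif"    # EXIF unreadable or missing
    if not matched_task:
        return "unmatched"       # no task intersects this photo
    return "classified"          # successfully classified and moved
```

Every image starts as 'staged' on upload and transitions to exactly one of the other statuses during classification.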

Create the table (not partitioned for now; see the note below):

CREATE TABLE project_images (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  task_id UUID REFERENCES tasks(id) ON DELETE SET NULL,

  filename TEXT NOT NULL,
  s3_key TEXT NOT NULL,
  hash_md5 CHAR(32) NOT NULL,

  location GEOMETRY(POINT, 4326),  -- extracted from exif lat/lon
  exif JSONB,  -- The raw EXIF info if we need to process it again at some point

  uploaded_by UUID REFERENCES users(id) ON DELETE SET NULL,
  uploaded_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  classified_at TIMESTAMP WITH TIME ZONE,
  status image_status NOT NULL DEFAULT 'staged',

  -- optional metadata
  duplicate_of UUID REFERENCES project_images(id) ON DELETE SET NULL
);

(Future plan, only if scalability becomes a problem) Partition by year:

CREATE TABLE project_images_partitioned (
  LIKE project_images INCLUDING ALL
) PARTITION BY RANGE (uploaded_at);
-- Note: on a partitioned table, unique constraints (including the PK)
-- must contain the partition key, so the copied constraints would need
-- uploaded_at added.

CREATE TABLE project_images_2025
  PARTITION OF project_images_partitioned
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

(For now performance should be fine with good indexes, so don't do this.)

Indexes

CREATE INDEX idx_project_images_project ON project_images(project_id);
CREATE INDEX idx_project_images_status ON project_images(status);
CREATE INDEX idx_project_images_hash ON project_images(hash_md5);
CREATE INDEX idx_project_images_task ON project_images(task_id);

-- Composite index to speed up common queries (project + status, ordered by upload time)
CREATE INDEX IF NOT EXISTS idx_project_status_uploaded_at
  ON project_images(project_id, status, uploaded_at);

-- To prevent duplicates:
CREATE UNIQUE INDEX uniq_project_hash
  ON project_images(project_id, hash_md5);

-- Speed up classifier scans for staged images in `user-uploads` dir
CREATE INDEX idx_project_images_staged ON project_images(uploaded_at)
  WHERE status = 'staged';

-- Geospatial index
CREATE INDEX idx_project_images_location ON project_images USING GIST (location);
  • Create a migration that, for each task, downloads the related images.json and inserts the records into the database in batches.

Something like this (the unique index on (project_id, hash_md5) makes the insert idempotent, so duplicates are skipped):

INSERT INTO project_images (
  project_id, task_id, filename, s3_key, hash_md5,
  location, exif, uploaded_by, uploaded_at, status
)
VALUES
  (:project_id, :task_id, :filename, :s3_key, :hash_md5,
   ST_SetSRID(ST_MakePoint(:exif_lon, :exif_lat), 4326),
   :exif::jsonb, :uploaded_by, :uploaded_at, :status)
ON CONFLICT (project_id, hash_md5) DO NOTHING;
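
The batch insert can be driven from Python. A sketch assuming psycopg2's `execute_values` (the chunking helper and SQL string below are illustrative, not existing backend code):

```python
# Illustrative batching helper for the migration; the DB call itself is
# sketched in the comment at the bottom.
from itertools import islice

INSERT_SQL = """
    INSERT INTO project_images (
        project_id, task_id, filename, s3_key, hash_md5,
        location, exif, uploaded_by, uploaded_at, status
    )
    VALUES %s
    ON CONFLICT (project_id, hash_md5) DO NOTHING
"""

def chunked(records, size=500):
    """Yield lists of at most `size` records, for batched inserts."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# With a live connection this would be roughly:
#   from psycopg2.extras import execute_values
#   for batch in chunked(rows):
#       execute_values(cur, INSERT_SQL, batch)
# using a VALUES template that wraps lon/lat in
# ST_SetSRID(ST_MakePoint(%s, %s), 4326) for the location column.
```

Batching keeps each transaction small while avoiding one round trip per image.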
  • Refactor the backend to use this table instead of the images.json file.

#635 has additional details on the strategy going forward.
In summary, we should insert the EXIF for each uploaded image into our database, in batches during the sorting / classification process on the backend.
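
For the location column, the EXIF GPS tags store latitude/longitude as degrees/minutes/seconds plus a hemisphere ref, which must be converted to decimal degrees before building the POINT. A self-contained sketch (hypothetical helper, not existing backend code):

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert EXIF GPS degrees/minutes/seconds + hemisphere ref ('N'/'S'/'E'/'W')
    to signed decimal degrees, as expected by ST_MakePoint."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative
    return -value if ref in ("S", "W") else value
```

For example, `dms_to_decimal(27, 42, 18.0, "N")` gives 27.705, which would be passed as `:exif_lat` in the insert above.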

Metadata

Labels: backend (Related to backend code), bug (Something isn't working), effort:low (Likely a few hours), priority:high (Should be addressed as a priority), repo:drone-tm
