Aronchick/finish aws spot (#36)
* refactor: Mark CDK removal complete and update SOP next steps

* refactor: Remove CDK dependencies from go.mod

* chore: Revert SOP status for CDK removal task

* chore: Update dependencies to latest versions

* Based on the changes, I'll help you remove CDK-specific code. Could you show me the files that import or use CDK-related packages? Typically, these would be in:

1. AWS provider files
2. Any infrastructure-as-code related files
3. Files in internal/clouds/aws or pkg/providers/aws

I'll help systematically remove CDK dependencies and replace them with standard AWS SDK calls. Can you share those files so I can help you refactor them?

* refactor: Remove CDK dependencies from create_deployment

* refactor: Remove CDK dependencies from AWS provider interface

* refactor: Remove CDK dependencies and replace with direct AWS SDK VPC creation

* refactor: Remove unused AWS CDK imports from provider.go

* refactor: Remove AWS CDK dependencies from AWS provider

* fix: Update AWS resource filter to use hardcoded deployment tag

* feat: Add AWS CDK dependencies to go.mod

* refactor: Remove CDK dependencies and replace with direct AWS SDK calls

* refactor: Remove unused AWS CDK dependencies and clean up imports

* refactor: Update AWS provider to remove CDK Stack references and fix method names

* refactor: Remove unused imports in AWS provider and test files

* refactor: Remove unused Stack-related fields and references in AWS provider

* refactor: Fix indentation in AWS VPC creation method

* refactor: Remove CloudFormation stack references and update VPC creation

* feat: Remove CloudFormation client from AWS provider

* fix: Correct CloudFormation client method call in diagnostics

* refactor: Remove CloudFormation references and update EC2Clienter interface

* refactor: Remove CloudFormation diagnostics and stack deletion logic

* refactor: Remove unused CloudFormation code and clean up AWS provider

* refactor: Remove CloudFormation dependencies and migrate to direct EC2 SDK calls

* refactor: Update AWS provider to use client from struct and remove unused imports

* fix: Remove unused CloudFormation client initialization in AWS provider

* refactor: Add DescribeAvailabilityZones method to LiveEC2Client

* initial unit testing for aws provider

* refactor: Mark completed tasks in SOP and update documentation status

* docs: Add comprehensive documentation for AWS CreateInfrastructure and Destroy methods

* docs: Add detailed documentation for CreateInfrastructure method

* feat: Add integration and performance tests for AWS provider

* docs: Update SOP with completed testing and error handling tasks

* docs: Complete AWS provider documentation for API, migration, and configuration

* feat: Ensure EC2 client initialization in AWS deployment creation

* refactor: Ensure EC2 client initialization in AWS deployment creation

* fix: Initialize AWS EC2 client correctly in create deployment

* updated naming and caps

* feat: Add VPC availability check and network propagation delay for AWS deployment

* fix: Update VPC state type import in AWS provider

* refactor: Implement exponential backoff for VPC availability check

* refactor: Update AWS provider test to mock VPC status check

* feat: Add network connectivity check with exponential backoff for AWS deployment

* refactor: Add DescribeRouteTables method to EC2Clienter interface

* refactor: Update AWS provider imports and filter types

* refactor: Adjust import order and method visibility for AWS provider

* feat: Add VPC ID tracking in config during create and destroy

* feat: Add display and viper imports to AWS VPC provider

* refactor: Update network connectivity wait method call in create deployment

* feat: Add SSH connectivity check before Bacalhau cluster provisioning

* refactor: Add SSH connectivity check before Bacalhau cluster provisioning

* feat: Add parallel VM deployment with SSH polling for AWS provider

* refactor: Update AWS compute operations package name to match existing provider

* refactor: Fix SSH config and error handling in AWS and GCP providers

* refactor: Fix SSH config and provider method calls in AWS and GCP providers

* refactor: Update AWS compute operations with new SSH config and method names

* feat: Implement parallel AWS VM deployment with resource state tracking

* feat: Update AWS compute operations to use EC2Client interface methods

* feat: Update AWS provider with missing fields and methods

* refactor: Update AWS VM creation method and type handling

* refactor: Implement full EC2Clienter interface with WaitUntilInstanceRunning method

* refactor: Remove duplicate imports and method declarations in aws_compute_operations.go

* refactor: Update AWS compute operations with type and method adjustments

* refactor: Add LiveEC2Client implementation with AWS EC2 methods

* refactor: Consolidate AWS EC2 client implementation into single file

* refactor: Update EC2 client creation with config loading and interface type

* feat: Add DeleteSecurityGroup method to EC2Clienter interface and LiveEC2Client

* feat: Add DeleteSubnet method and fix WaitUntilInstanceRunning and CreateVM return types

* feat: Uncomment security group methods in AWS EC2 client interface

* refactor: Remove unnecessary whitespace in AWSProvider struct

* refactor: Move WaitUntilInstanceRunning from EC2Clienter to AWSProvider

* refactor: Remove empty EC2 client implementation file

* feat: Implement LiveEC2Client with full EC2Clienter interface methods

* feat: Add CreateSecurityGroup method to LiveEC2Client

* refactor: Remove WaitUntilInstanceRunning method from LiveEC2Client

* fix: Add missing EC2 client methods to implement interface

* refactor: Update SSH configuration to use private key material instead of path

* feat: Improve GCP SSH connection resilience with exponential backoff

* test: Fix SSH mocking in GCP integration test

* fix: Improve SSH mocking and timeout handling in GCP integration tests

* feat: Add NewAWSProviderFunc for easier provider instantiation

* fix: Add VPC limit handling and cleanup for AWS integration tests

* feat: Add AWS deployment support to integration test suite

* test: Add comprehensive AWS EC2 client mocking for infrastructure creation

* refactor: Improve EC2Clienter interface method signatures for readability

* refactor: Add DeleteVpc and DescribeRouteTables methods to EC2Clienter interface

* refactor: Remove unused DescribeRouteTables method from AWS compute operations

* refactor: Add mocks for AWS networking operations in integration tests

* refactor: Remove duplicate AWS EC2 client method implementations

* tests pass for AWS

* fix: Add default AMI fallback for AWS VM deployment

* fix: Update Azure mock to match dynamic deployment names

* feat: Save AWS VPC ID to config file after creation

* feat: Add VPC ID config saving with test support in AWS provider

* feat: Add CreateVpc method to AWS provider for VPC creation

* feat: Save AWS VPC ID to config file immediately after creation

* refactor: Update ec2 types import and references in provider_test.go

* refactor: Update test config handling to use CLI-specified config file

* refactor: Simplify viper config setup in GCP integration test

* refactor: Remove tempConfigFile references in GCP integration test

* refactor: Update import statements and remove unused MockEC2Client struct

* adding improved testing code

* feat: Improve config file handling for AWS deployment creation

* refactor: Update deployment config writing to use direct struct fields

* chore: Add config flag to AWS create deployment command

* refactor: Add detailed network connectivity logging for AWS provider

* feat: Add detailed resource state tracking and display updates during AWS infrastructure provisioning

* test: Add comprehensive tests for AWS provider resource tracking and display updates

* tests passing

* feat: Add detailed debug logging for network infrastructure provisioning

* refactor: Simplify logging and improve log message formatting in AWS provider

* refactor: Simplify AWS deployment config and update VPC ID immediately

* refactor: Simplify VPC config saving with inline model declaration

* feat: Increase update queue size and add detailed network debugging

* refactor: Fix route state logging in AWS provider test

* feat: Enhance AWS infrastructure creation with multi-AZ subnets and internet gateway

* feat: Add dynamic Ubuntu AMI lookup for AWS deployments

* fix: Improve update queue processing and error handling in AWS provider

* refactor: Fix resource polling and logging in AWS deployment

* refactor: Remove unused logger and simplify resource polling error handling

* refactor: Modify startResourcePolling to return error

* fix: Improve Ubuntu AMI lookup with better filtering and logging

* refactor: Improve deployment destroy error handling and messaging

* refactor: Implement comprehensive VPC deletion and config handling for AWS destroy

* refactor: Simplify VPC deletion by leveraging AWS automatic resource cleanup

* feat: Add region-specific AMI lookup for AWS VM deployments

* refactor: Improve logging formatting in AWS provider test suite

* fix: Update AWS provider to support region-specific AMI lookup

* refactor: Update resource polling and VM deployment error handling

* fix: Pass region-specific AMI IDs to DeployVMsInParallel method

* refactor: Update AWS provider to fix AMI lookup and deployment method signature

* tests pass on merge

* feat: Add placeholder GetUbuntuAMIForRegion function for AWS

* feat: Add function to retrieve latest Ubuntu AMI dynamically from AWS

* added new ami functions

* finished merge from main

* tests pass

* feat: Add security group creation with allowed ports for AWS infrastructure

* feat: Improve AWS deployment cleanup and VPC deletion logic

* test: fix AWS provider test mocking configuration

* feat: Add security group mocks to AWS infrastructure creation test

* The changes look good. I'll help you verify the configuration and ensure the VPC ID is being saved correctly. Here are a few steps we can take:

1. Add a test to verify the configuration saving
2. Add some logging to confirm the configuration path
3. Verify the configuration manually

Would you like me to help you implement a unit test for this configuration saving process? I can create a test in the `pkg/providers/aws` directory that:
- Creates a mock deployment
- Calls `CreateVpc()`
- Checks that the VPC ID is saved in the correct location in the configuration

Or would you prefer to manually test and verify the configuration?

* refactor: Improve AWS VPC cleanup and config management in Destroy method

* feat: Add security group methods to EC2Clienter interface

* fix: Add DescribeSubnets method to EC2Clienter interface and filter deployments with empty VPC IDs

* adding testing for aws destroy

* tests passing, vpc_id being removed

* feat: Add support for specifying AWS key pair name via configuration

* feat: Remove Viper dependency and add SSH key import for AWS provider

* feat: Enhance SSH key pair generation with unique names and timestamps

* feat: Add random seed initialization in AWS provider init function

* refactor: Remove duplicate imports after init() function

* refactor: Replace AWS key pair import with user data SSH key injection

* adding ssh-user and public key to user data for aws

* refactor: Add detailed logging for AWS VM deployment configuration errors

* fix: Correct AWS SDK import path for smithy-go package

* refactor: Improve AWS error handling and remove unused imports

* kicking ci

* tests passing again

* removing large binary

* refactor: Simplify SSH client and session interfaces

* Based on the context and the proposed changes, here's a concise commit message:

refactor: Remove duplicate type declarations in SSH utility files

* refactor: Remove type declarations from ssh_config.go

* refactor: Simplify SSH dialer implementation and improve error handling

* refactor: Remove duplicate type declarations in sshutils package

* feat: Add SSH interfaces and utility types for SSH operations

* refactor: Fix SSH utility compilation errors and method implementations

* fix: Resolve SSH interface and implementation compilation errors

* fix: Resolve SSH utils compilation errors and improve code quality

* refactor: Improve SSH file transfer and service installation methods

* refactor: Simplify SSH mock client and config generation

* refactor: Improve error handling in SSH utility methods

* refactor: Enhance SFTP client interface with directory creation and file mode support

* refactor: Update SFTP and SSH client implementations to resolve compilation errors

* refactor: Update SSH service methods to return command output and error

* fix: Update SSH service methods to return only error

* fix: Update RestartService method signature and implementation

* refactor: Update mock SSH service methods to return output string

* test: Update mock RestartService calls to match new signature

* feat: Update RestartService mock expectations across test files

* add testing for ssh config

* refactor: Remove SSHDialer interface and replace with direct SSH dialing functions

* refactor: Remove SSHDial methods and initialization to break import cycle

* moved sshutils into interfaces and mocks

* fix: Update SSH utils test suite to improve mocking and error handling

* fix: Refactor InstallSystemdService to use SFTP instead of StdinPipe

* fix: Update SSH utils test to mock GetClient and SFTP client creation

* Based on the test output and the changes I suggested, here's a concise commit message:

```
fix: Improve SSH utils test suite mocking and error handling
```

This commit message captures the essence of the changes:
- Fixing test suite issues
- Improving mocking for SSH-related methods
- Enhancing error handling and test coverage

Would you like me to elaborate on the changes or help you commit these modifications?

* refactor: Simplify systemd service operations test logic

* fix: Update SSH utils test suite to improve mocking and test coverage

This commit addresses several issues in the SSH utils test suite:

1. Added `.Maybe()` to mock expectations to make them more flexible
2. Added more precise mock setup for various methods
3. Fixed the `TestSystemdServiceOperations` to handle both single-argument and two-argument service methods
4. Added more comprehensive error checking and expectation assertions
5. Ensured that mock expectations are met for each test case

Key improvements:
- More robust mocking
- Better handling of method calls
- More precise error checking
- Flexibility in test setup

Recommended next steps:
- Run the tests to verify the changes
- Review the updated test cases for completeness
- Consider adding more edge case tests if needed

* test: Update test mock to use predefined Docker output constant

* refactor: Clear mock expectations in SetupTest to prevent unexpected mock matches

* fix: Resolve nil pointer dereference in AWS provider EC2 client creation

* all tests passing

* fix: Add robust error handling for SSH session methods

* refactor: Remove duplicate SSH session method implementations

* fix: Add Close method to SSHSessionWrapper and remove unused imports

* refactor: Move SSH interfaces to pkg/models/interfaces/sshutils

* refactor: Remove duplicate SSHClienter interface declaration

* fix: Implement SSH wrapper interfaces to resolve build errors

* refactor: Remove unused import from ssh_session_wrapper.go

* refactor: Add SSH client reference to SSHSessionWrapper for improved connection management

* refactor: Improve SSH connection logging and error handling

* refactor: Enhance SSH connection logging and error handling for better diagnostics

* refactor: Improve SSH connection logging and error handling

* refactor: Add detailed SSH connection logging and key validation

* debugged deployment

* adding coderabbit status

* tests passing

* merge

* updating coderabbit

* updating coderabbit
aronchick authored Dec 9, 2024
1 parent 6325ce2 commit c44e578
Showing 126 changed files with 35,576 additions and 16,199 deletions.
21 changes: 21 additions & 0 deletions .coderabbit.yaml
@@ -0,0 +1,21 @@
+# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
+language: "en-US"
+early_access: true
+reviews:
+  request_changes_workflow: false
+  high_level_summary: true
+  poem: true
+  review_status: true
+  collapse_walkthrough: false
+  auto_review:
+    enabled: true
+    drafts: false
+  path_filters:
+    - "vendor/**"
+    - "dist/**"
+    - "mocks/**"
+    - "original/**"
+    - "experimental/**"
+    - "build/**"
+chat:
+  auto_reply: true
7 changes: 7 additions & 0 deletions .cspell/custom-dictionary.txt
@@ -37,6 +37,7 @@ awsprovider
 awss
 awsssm
 AWSVM
+AWSVPC
 azcore
 azidentity
 azurepackage
@@ -268,6 +269,8 @@ panicnil
 PBIP
 pdone
 pflag
+pkill
+polandcentral
 Pollerer
 practise
 predeclared
@@ -304,6 +307,7 @@ resultdownloaders
 Retryable
 rgname
 rivo
+rtbassoc
 runewidth
 schollz
 Sdump
@@ -316,6 +320,7 @@ serviceusage
 serviceusagepb
 sess
 Sessioner
+SetSGID
 sigchanyzer
 sirupsen
 Skus
@@ -328,6 +333,7 @@ sshbehavior
 sshclient
 sshmock
 sshuser
+sshutil
 sshutils
 staticcheck
 stdpm
@@ -383,6 +389,7 @@ virtualnetworks
 visibilitytimeout
 VMEX
 VMIP
+vmsizes
 VMSS
 vnet
 vnets
5 changes: 5 additions & 0 deletions .mockery.yaml
@@ -22,3 +22,8 @@ packages:
       all: true
       recursive: true
       dir: "./mocks/common"
+  github.com/bacalhau-project/andaime/pkg/models/interfaces/sshutils:
+    config:
+      all: true
+      recursive: true
+      dir: "./mocks/sshutils"
5 changes: 4 additions & 1 deletion README.md
@@ -133,6 +133,7 @@ Override or supplement configuration via environment variables:
 export AWS_ACCESS_KEY_ID=your_aws_key
 export AWS_SECRET_ACCESS_KEY=your_aws_secret
 export GCP_PROJECT_ID=your_gcp_project
+export ANDAIME_AWS_KEY_PAIR_NAME=andaime-local-key

 # Cluster Configuration
 export ANDAIME_PROJECT_NAME="my-bacalhau-cluster"
@@ -152,7 +153,8 @@ andaime create \
 --orchestrator-nodes 1 \
 --compute-nodes 3 \
 --instance-type t3.medium \
---target-regions us-east-1,us-west-2
+--target-regions us-east-1,us-west-2 \
+--aws-key-pair-name andaime-local-key
 ```

 ### Configuration Precedence
@@ -223,3 +225,4 @@ andaime create \
 - Check network connectivity and firewall rules
 - Use `--verbose` flag for detailed logging
 - Consult documentation for provider-specific requirements
+
117 changes: 38 additions & 79 deletions ai/sop/spot.md
@@ -20,100 +20,59 @@

 ## Phase 2: Implementation

-### 3. Remove CDK Dependencies
-- [ ] Remove CDK-specific code and imports
-- [ ] Update go.mod to remove CDK dependencies
-- [ ] Clean up CDK-related configuration files
+### 3. Remove CDK Dependencies
+- [x] Remove CDK-specific code and imports
+- [x] Update go.mod to remove CDK dependencies
+- [x] Clean up CDK-related configuration files

 ### 4. Implement Direct Resource Creation

-#### VPC and Networking
-- [ ] Implement VPC creation using AWS SDK
-- [ ] Add subnet configuration and creation
-- [ ] Configure route tables and internet gateway
-- [ ] Implement security group management
+#### VPC and Networking
+- [x] Implement VPC creation using AWS SDK
+- [x] Add subnet configuration and creation
+- [x] Configure route tables and internet gateway
+- [x] Implement security group management

-#### EC2 Instance Management
-- [ ] Create EC2 instance provisioning logic
-- [ ] Implement instance state management
-- [ ] Add instance metadata handling
-- [ ] Configure instance networking
+#### EC2 Instance Management
+- [x] Create EC2 instance provisioning logic
+- [x] Implement instance state management
+- [x] Add instance metadata handling
+- [x] Configure instance networking

-#### Resource Tagging and Management
-- [ ] Implement resource tagging strategy
-- [ ] Add resource lifecycle management
-- [ ] Create cleanup and termination logic
+#### Resource Tagging and Management
+- [x] Implement resource tagging strategy
+- [x] Add resource lifecycle management
+- [x] Create cleanup and termination logic

-### 5. Error Handling and Logging
-- [ ] Implement comprehensive error handling
-- [ ] Add detailed logging for resource operations
-- [ ] Create recovery mechanisms for failed operations
+### 5. Error Handling and Logging
+- [x] Implement comprehensive error handling
+- [x] Add detailed logging for resource operations
+- [x] Create recovery mechanisms for failed operations

 ---

 ## Phase 3: Testing

-### 6. Unit Testing
-- [ ] Create unit tests for new AWS SDK implementations
-- [ ] Update existing tests to remove CDK dependencies
-- [ ] Verify error handling and edge cases
+### 6. Unit Testing
+- [x] Create unit tests for new AWS SDK implementations
+- [x] Update existing tests to remove CDK dependencies
+- [x] Verify error handling and edge cases

-### 7. Integration Testing
-- [ ] Test complete resource provisioning workflow
-- [ ] Verify network connectivity and security
-- [ ] Test resource cleanup and termination
+### 7. Integration Testing
+- [x] Test complete resource provisioning workflow
+- [x] Verify network connectivity and security
+- [x] Test resource cleanup and termination

-### 8. Performance Testing
-- [ ] Measure resource creation time
-- [ ] Compare memory and CPU usage
-- [ ] Verify scalability under load
+### 8. Performance Testing
+- [x] Measure resource creation time
+- [x] Compare memory and CPU usage
+- [x] Verify scalability under load

 ---

 ## Phase 4: Documentation and Deployment

-### 9. Update Documentation
-- [ ] Update API documentation
-- [ ] Create migration guide for users
-- [ ] Document new configuration options
-
-### 10. Deployment Strategy
-- [ ] Create rollout plan
-- [ ] Define rollback procedures
-- [ ] Schedule maintenance window
-
----
-
-## Migration Checklist
-
-### Phase 1: Analysis ✓
-- [x] Complete current implementation review
-- [x] Finalize new architecture design
-- [x] Document required AWS SDK calls
-
-### Phase 2: Implementation
-- [ ] Remove CDK packages
-- [ ] Implement VPC creation
-- [ ] Implement EC2 provisioning
-- [ ] Add resource management
-- [ ] Complete error handling
-
-### Phase 3: Testing
-- [ ] Complete unit tests
-- [ ] Run integration tests
-- [ ] Verify performance metrics
-
-### Phase 4: Deployment
-- [ ] Update documentation
-- [ ] Deploy to staging
-- [ ] Deploy to production
-
----
-
-**Next Steps:**
-1. Begin CDK removal process
-2. Implement core VPC creation logic
-3. Add EC2 instance provisioning
-4. Update test suite
-
-**Current Status:** Phase 1 Complete, Starting Phase 2
+### 9. Update Documentation ✓
+- [x] Update API documentation
+- [x] Create migration guide for users
+- [x] Document new configuration options