
Reference Architecture to automate the process of transferring new or modified files from remote SFTP Servers to your local S3 environment.


aws-samples/file-transfer-sync-solution

File Transfer Synchronization solution

Introduction

This solution implements an automated strategy for synchronizing remote SFTP repositories with local S3 buckets. It orchestrates the process of listing remote directories, detecting changes, and transferring files. It can run on a schedule or on demand.

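The solution leverages the following AWS services: AWS Transfer Family (SFTP connectors), AWS Step Functions, AWS Lambda, Amazon EventBridge Scheduler, Amazon S3, AWS Secrets Manager, AWS KMS, Amazon SNS, and Amazon CloudWatch.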

Key features:

  • Monitors remote SFTP servers using SFTP Connectors' List capabilities
  • Transfers missing or updated files using SFTP Connector Retrieve action
  • Supports recursive synchronization of entire folder structures
  • Fully serverless architecture for cost-effective and scalable operations

Architecture

A combination of Lambda, Step Functions and Transfer Family features facilitates data movement. Only the Transfer Family connectors move the data. Step Functions and Lambda determine what needs to be copied, and EventBridge Scheduler triggers the process based on your needs. The solution is completely stateless: it compares file modification times to detect new or changed files that need to be transferred.
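As a rough illustration of the stateless comparison, the sketch below filters a directory listing down to the files whose modification time falls inside a sync window. The listing shape (`files`, `filePath`, `modifiedTimestamp`) is assumed to resemble the JSON that the connector's list operation writes to S3, and `select_changed_files` is a hypothetical helper, not a function taken from this repository.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2024-01-30T20:34:54Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def select_changed_files(listing: dict, from_ts: datetime, to_ts: datetime) -> list[str]:
    """Return paths of files modified after from_ts and up to to_ts (inclusive)."""
    return [
        entry["filePath"]
        for entry in listing.get("files", [])
        if from_ts < parse_ts(entry["modifiedTimestamp"]) <= to_ts
    ]


# Hypothetical listing content, for illustration only.
listing = {
    "files": [
        {"filePath": "/inbound/old.csv", "modifiedTimestamp": "2024-03-01T08:00:00Z", "size": 10},
        {"filePath": "/inbound/new.csv", "modifiedTimestamp": "2024-03-02T08:00:00Z", "size": 12},
    ],
    "paths": [],
}

previous_run = datetime(2024, 3, 2, 0, 0, tzinfo=timezone.utc)
current_run = datetime(2024, 3, 3, 0, 0, tzinfo=timezone.utc)
changed = select_changed_files(listing, previous_run, current_run)
```

Because the window is derived from execution timestamps alone, no record of previously copied files has to be stored between runs.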

High level architecture:

[image: high-level architecture diagram]

Step Function visualization:

[image: Step Function workflow diagram]

Component Interactions

  1. Execution Phase

    a. On-Demand Execution

    • You can manually execute the Step Function by using the event structure stored in SSM Parameter Store.
    • While executing the Step Function, you can modify the FromTimestamp parameter in the event to specify the starting date and time for the file copy process.

    b. EventBridge Scheduler

    • EventBridge Scheduler triggers the Step Function execution based on the configured schedule (e.g., daily, hourly, or a custom cron expression).
    • Multiple schedules can exist, one for each schedule defined in this project's configuration files. The event passed to Step Functions includes the parameters required by each schedule configuration.
  2. Step Function

    • The Step Function orchestrates the entire process and coordinates the interaction between different components.
    • For each event, it invokes the RemoteFoldersList Lambda function, which interacts with the Transfer Family SFTP Connector to asynchronously retrieve a list of files in the remote folders to be synchronized.
    • It then uses the GetListStatus Lambda function to check whether the List process has finished and, if Recursive is enabled, to get the list of child folders so that a new List can be run for those subfolders.
    • The SyncRemoteFolder Lambda function detects if new or modified files are available in the remote server, and then invokes the Transfer Family SFTP Connector to asynchronously transfer those files from the remote repository to the local S3 bucket.
    • If any errors occur during the synchronization process, the Step Function captures the error and sends a notification to the configured SNS topic.
  3. Transfer Family SFTP Connector

    • The Transfer Family SFTP connector is responsible for establishing a secure connection to the remote SFTP server.
    • It handles the listing of files in the remote repository and the transfer of files between the remote repository and the local S3 bucket.
    • The connector uses the configured security policy, trusted public keys and secrets stored in AWS Secrets Manager to ensure secure communication with the remote server.
  4. S3 Buckets

    • The solution creates only one bucket, which stores the results generated by the Transfer Family SFTP Connector when listing the remote SFTP directories. This bucket is encrypted using a KMS customer managed key created by the solution.
    • The solution can use as many S3 buckets as needed as targets for the Transfer Family SFTP Connector sync process, in which files are copied from the remote SFTP server to the local S3 bucket. These S3 buckets are defined in the configuration files.
  5. AWS Secrets Manager

    • Securely stores access credentials based on Username, Password and/or Certificate.
  6. SNS Topic and CloudWatch

    • The SNS topic is used for sending notifications in case of errors or failures during the execution process.
    • CloudWatch publishes messages to the SNS topic when an error occurs executing Step Functions, allowing subscribers (e.g., email addresses) to be notified.
    • A CloudWatch dashboard also centralizes all the important metrics and logs generated by the solution.
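The list-then-recurse loop that the Step Function coordinates can be sketched in plain Python. This is only an approximation of the flow described above: `list_folder` is a hypothetical stand-in for the List request plus the status check, and the result shape (`files` and `paths` entries) is an assumption.

```python
from collections import deque


def collect_files(list_folder, root: str, recursive: bool) -> list[str]:
    """Walk remote folders breadth-first and return every file path found.

    list_folder stands in for the List + GetListStatus round trip: it takes a
    folder path and returns a dict with 'files' and 'paths' entries.
    """
    pending = deque([root])
    found = []
    while pending:
        folder = pending.popleft()
        result = list_folder(folder)
        found.extend(f["filePath"] for f in result.get("files", []))
        if recursive:
            # Child folders are queued so that each one gets its own List call.
            pending.extend(p["path"] for p in result.get("paths", []))
    return found


# Fake remote tree used only to exercise the sketch.
FAKE_REMOTE = {
    "/data": {"files": [{"filePath": "/data/a.csv"}], "paths": [{"path": "/data/2024"}]},
    "/data/2024": {"files": [{"filePath": "/data/2024/b.csv"}], "paths": []},
}

all_files = collect_files(FAKE_REMOTE.__getitem__, "/data", recursive=True)
top_only = collect_files(FAKE_REMOTE.__getitem__, "/data", recursive=False)
```

In the deployed solution the equivalent loop is asynchronous, since each List request has to finish before its child folders are known.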

Static Public IPs

Each Transfer Family Connector is created with 3 static public IPs that you can share with a third party that requires an allow list. These IPs don't change during the connector lifecycle.

Usage

The reference architecture is fully defined as a dynamic CDK Application, which lets you define the whole infrastructure in code and, if you wish, automate the deployment through a simple pipeline.

To do so, push new configuration changes as JSON files to ./configuration/sftp/ and deploy the project. Each file defines the configuration for a specific remote SFTP Server and can contain multiple configuration rules.

The recommended approach is to use the provided CLI script to automatically generate and modify the config files. See the Deployment section for more details.

The configuration file structure and content needs the following data:

{
    "Description": <Connection Description>,
    "Name": <Identifying name for resources, no spaces allowed>,
    "Schedule": <Tag, AWS Cron Expression or "on-demand">,
    "Url": <Remote SFTP Server URL, FQDN and Port allowed>,
    "SecurityPolicyName": <TransferSFTPConnectorSecurityPolicy-2024-03 or TransferSFTPConnectorSecurityPolicy-2023-07>,
    "SyncSettings": [
        {
            "LocalRepository": {
                "BucketName": <Local Bucket Name>,
                "Prefix": <Local Prefix>,
                (OPTIONAL) "KmsKeyArn": <KMS Key ARN used for the Bucket default encryption configuration>
            },
            "RemoteFolders": {
                "Folder": <Remote Folder to Sync>,
                "Recursive": <true / false>
            },
            (OPTIONAL) "Schedule": <Tag, AWS Cron Expression or "on-demand">
        },
        { ... }
    ],
    "PublicKey": [
        "ssh-rsa AAAAB[...]5yQ==",
        "..."
    ]
}

You can check the example configuration file. Within AWS account service limits, you can have as many configuration files as you need. In the SyncSettings list you can define as many remote-to-local pairs as you wish, and unless an item defines its own schedule, all of them run on the same schedule for the same remote SFTP Server. The CDK Application automatically resolves all the IAM Role permissions needed for the process to work and creates all the required resources, including the EventBridge Scheduler schedules, the SFTP Connector and the Secrets Manager secret.
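If you prefer to inspect a configuration before deploying, a small check along these lines may help. It is a minimal sketch, not the validation performed by the provided CLI: `validate_config` is a hypothetical helper, the set of required keys is inferred from the template above, and every value in `example` (names, bucket, URL, key) is a placeholder.

```python
import json

# Keys shown without an (OPTIONAL) marker in the template above.
REQUIRED_TOP_LEVEL = (
    "Description", "Name", "Schedule", "Url",
    "SecurityPolicyName", "SyncSettings", "PublicKey",
)


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape looks right."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in config]
    if " " in config.get("Name", ""):
        problems.append("Name must not contain spaces")
    for index, item in enumerate(config.get("SyncSettings", [])):
        if "BucketName" not in item.get("LocalRepository", {}):
            problems.append(f"SyncSettings[{index}]: LocalRepository.BucketName is required")
        if "Folder" not in item.get("RemoteFolders", {}):
            problems.append(f"SyncSettings[{index}]: RemoteFolders.Folder is required")
    return problems


# All values below are placeholders for illustration only.
example = {
    "Description": "Nightly partner feed",
    "Name": "partner-feed",
    "Schedule": "@daily",
    "Url": "sftp://sftp.example.com:22",
    "SecurityPolicyName": "TransferSFTPConnectorSecurityPolicy-2024-03",
    "SyncSettings": [
        {
            "LocalRepository": {"BucketName": "example-target-bucket", "Prefix": "partner/"},
            "RemoteFolders": {"Folder": "/outbound", "Recursive": True},
        }
    ],
    "PublicKey": ["ssh-rsa AAAA...placeholder"],
}

problems = validate_config(example)
config_json = json.dumps(example, indent=4)  # text you could save under ./configuration/sftp/
```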

Schedule Configuration

The file synchronization process can be configured to run based on a schedule or on-demand. The solution supports both global schedules for entire configurations and individual schedules for specific sync settings.

Scheduling Strategies

  1. Cron / Tag Schedule:

    • Only considers files created in the remote repository between the current execution timestamp and the previous execution timestamp.
    • Useful for regular, periodic synchronization while avoiding duplicate transfers.
  2. On-Demand execution:

    • When the Schedule value is configured as on-demand, you can set an additional optional parameter called FromTimestamp when you execute the Step Function. It defines the point in time (UTC timestamp) from which files are considered for copying.
    • By default, FromTimestamp is set to 0, meaning that all files (newer than 1 January 1970 00:00:00) will be compared.
    • Copies any files modified since the specified timestamp, including files that may have been deleted from S3 between runs but still exist on the remote SFTP server.
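For an on-demand run, the execution input can be prepared with a few lines of Python. The sketch below assumes FromTimestamp is expressed in whole seconds since the Unix epoch (the documented default of 0 suggests an epoch value, but the unit is an assumption). `with_from_timestamp` is a hypothetical helper, and `base_event` is a placeholder for the event structure stored in SSM Parameter Store.

```python
import json
from datetime import datetime, timezone


def with_from_timestamp(base_event: dict, start: datetime) -> str:
    """Return the execution input as JSON, with FromTimestamp set to `start`.

    Assumes FromTimestamp is whole seconds since the Unix epoch (UTC).
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware to avoid local-time surprises")
    event = dict(base_event)  # shallow copy so the stored event is left untouched
    event["FromTimestamp"] = int(start.timestamp())
    return json.dumps(event)


# Placeholder for the event structure stored in SSM Parameter Store.
base_event = {"FromTimestamp": 0}

execution_input = with_from_timestamp(
    base_event, datetime(2024, 3, 1, tzinfo=timezone.utc)
)
```

The resulting string can be pasted into the Step Functions console or passed as the `input` argument of a StartExecution call.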

Individual Sync Setting Schedules

From version 1.3.0, you can define specific schedules for each item in your SyncSettings configuration list. This allows for:

  • Different schedules for multiple folders within a single SFTP connection
  • More granular control over synchronization timing
  • Reduced resource consumption by eliminating the need for multiple configuration files

Individual item level schedules take precedence over the general Schedule setting.

Example scenario: you need to synchronize a remote SFTP Server with 10 folders. Of those, 5 are updated once a day at midnight, 2 are updated hourly, 1 is updated weekly, and for the remaining 2 you are notified when there are new files so you can run an on-demand copy. Before this update, you would have needed to create 4 configuration files, each with its dedicated Transfer Family Connector, public IPs and secrets. Now you can create a single configuration file (and its resources) with a different Schedule parameter for each item in the SyncSettings array, according to the business needs.

{
  "Schedule": "@daily",
  "SyncSettings": [
    {
      "LocalRepository": { ... },
      "RemoteFolders": { ... },
      "Schedule": "@hourly"
    },
    {
      "LocalRepository": { ... },
      "RemoteFolders": { ... }
    },
    {
      "LocalRepository": { ... },
      "RemoteFolders": { ... },
      "Schedule": "@weekly"
    },
    {
      "LocalRepository": { ... },
      "RemoteFolders": { ... },
      "Schedule": "on-demand"
    }
  ]
}
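The precedence rule can be expressed in a few lines. This is a minimal sketch of the rule as described, with a hypothetical helper name, and the `config` dictionary mirrors the example above with the repository and folder details left out.

```python
def effective_schedules(config: dict) -> list[str]:
    """Return the schedule that applies to each SyncSettings item, in order.

    An item-level Schedule wins; otherwise the top-level Schedule is used.
    """
    default = config["Schedule"]
    return [item.get("Schedule", default) for item in config["SyncSettings"]]


config = {
    "Schedule": "@daily",
    "SyncSettings": [
        {"Schedule": "@hourly"},
        {},  # no item-level schedule: falls back to the top-level value
        {"Schedule": "@weekly"},
        {"Schedule": "on-demand"},
    ],
}

schedules = effective_schedules(config)
```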

Available Schedule Options

  • Predefined TAGs: @monthly, @daily, @hourly, @minutely, @sunday, @monday, @tuesday, @wednesday, @thursday, @friday, @saturday, @every10min
  • Custom cron expressions. Keep in mind that these must use the Amazon EventBridge cron expression format
  • "on-demand" for manual execution
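To catch obvious mistakes before deploying, a Schedule value can be classified with a sketch like the following. EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year). Whether this project expects the bare fields or the `cron(...)` wrapper is not stated here, so the hypothetical `classify_schedule` helper accepts both, and it does not check that a tag is one of the predefined ones.

```python
import re

CRON_WRAPPER = re.compile(r"^cron\((.*)\)$")


def classify_schedule(value: str) -> str:
    """Return 'on-demand', 'tag', or 'cron'; raise ValueError for anything else."""
    value = value.strip()
    if value == "on-demand":
        return "on-demand"
    if value.startswith("@"):
        return "tag"
    match = CRON_WRAPPER.match(value)
    fields = (match.group(1) if match else value).split()
    if len(fields) == 6:
        return "cron"
    raise ValueError(f"not a six-field EventBridge cron expression: {value!r}")


kinds = [classify_schedule(v) for v in ("@daily", "cron(0 12 * * ? *)", "on-demand")]
```

A five-field Unix-style expression such as `0 12 * * *` is rejected, which is the most common mistake when porting schedules from crontab.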

Best Practices

  • Choose schedules that align with your data update frequency
  • Use individual schedules for folders with different update patterns
  • Consider resource usage and costs when setting frequent schedules
  • Test your configuration to ensure it meets your synchronization needs

Target Bucket KMS Encryption

The solution supports target S3 Buckets that use server-side encryption with AWS KMS (SSE-KMS). If your target S3 Bucket is encrypted using KMS, you must specify the ARN of the KMS Key used for encryption in your configuration file under: SyncSettings > LocalRepository > KmsKeyArn

Note: The KmsKeyArn parameter is optional. Only include it if your target bucket uses KMS encryption.

Replaceable Tags in Remote Folder Paths

The solution supports the use of replaceable tags in the remote folder paths. This feature allows for dynamic folder selection based on the current date (in UTC). The following tags are available:

  • %year%: Replaced with the current four-digit year (e.g., 2024)
  • %month%: Replaced with the current two-digit month (e.g., 03 for March)
  • %day%: Replaced with the current two-digit day of the month (e.g., 15)

You can use these tags individually or in combination within the RemoteFolders > Folder path in your configuration. The solution allows you to define your own format. For example:

  • "/data/%year%": Lists only the folder for the current year
  • "/data/%year%/%month%": Lists the folder for the current year and month
  • "/data/%year%-%month%-%day%": Lists the folder for the specific date

This feature allows for more flexible and automated folder synchronization based on the current date, which is particularly useful when the remote server organizes its data in folders partitioned by time period.
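The substitution itself is straightforward. The following sketch shows one way the tags could be resolved. `render_folder` is a hypothetical name, and the solution's own implementation may differ.

```python
from datetime import datetime, timezone


def render_folder(template: str, now: datetime) -> str:
    """Replace %year%, %month% and %day% with zero-padded UTC date parts."""
    now = now.astimezone(timezone.utc)
    return (
        template.replace("%year%", f"{now.year:04d}")
        .replace("%month%", f"{now.month:02d}")
        .replace("%day%", f"{now.day:02d}")
    )


sample = datetime(2024, 3, 15, tzinfo=timezone.utc)
rendered = render_folder("/data/%year%-%month%-%day%", sample)
```

Because the date is taken in UTC, a run shortly after local midnight may still resolve to the previous day's folder, depending on your time zone.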

Trusted Public Key configuration

The Transfer Family Connector service allows you to validate the identity of the remote server by configuring an expected trusted host key for the connection. You can optionally add more than one key. Define them in the PublicKey field of the JSON configuration file for each remote server, as a list of strings containing the trusted host keys. You can follow this guide on how to get the key and the expected format.

Access Credentials

Important Note: After deploying the configuration for a new remote server, you need to log in to AWS and manually define the content of the Secrets Manager secret. This keeps your secret credentials out of your git repository and your CloudFormation deployment. It also protects the infrastructure from unwanted changes to working credentials when new changes are deployed to the solution.

When you modify the secret content, the key/value pairs and format depend on the remote server's authentication strategy. You can follow this guide to understand how to do it.

Monitoring

As part of the solution, a CloudWatch Dashboard is deployed that centralizes all important metrics and log filters for troubleshooting. An SNS topic is also created for email notifications if Step Functions fail to run for any reason. You just need to subscribe to it (Topic Name: TransferSyncServiceStack-NotificationTopic###). It's especially important to monitor the Transfer Family Connector logs, as those expose problems with connectivity, authentication, unavailable remote folders, etc.

Lambda Powertools is also used to improve logging capabilities. By default every Lambda function logs the same type of information, but a debug mode is available that gives you more data, including the request context and event. To enable it, modify the Lambda function environment variable POWERTOOLS_LOG_LEVEL. You can also change this through a new deployment by modifying the CDK project configuration parameters. Debug mode includes Lambda events, which can be very verbose, so use it with caution. Additionally, all the Lambda executions within a Step Function use the SFN Execution ID as a Correlation ID, so you can easily filter single runs in the logs.

Solution Cost

The solution is fully serverless, meaning that you pay for what you use. The main cost factors are the number of files being monitored, the frequency at which the process runs and the number of GB transferred. See the Transfer Family SFTP Connectors public pricing for current rates.

Example Scenario:

You are monitoring one remote SFTP Server. You are running a recursive sync once per day, and the server has a total of 10 folders and 5000 files, evenly distributed, at 20 MB per file. The first time the schedule runs, everything will be synced and you'll have a one-time cost of:

Pricing Category | Cost (N. Virginia) | Unit | Total
SFTP Connector Calls - List | $0.001 / API call | 10 API calls | $0.01
SFTP Connector Calls - Transfer | $0.001 / API call | 500 API calls | $0.50
SFTP Connector Calls - Data Transfer | $0.4 / GB | 98 GB | $39.06
Total: $39.57

After the first replication, the solution will only copy new or modified files from the remote server once per day. In this example, 5% of the files get modified each day:

Pricing Category | Cost (N. Virginia) | Unit | Total
SFTP Connector Calls - List | $0.001 / API call | 10 API calls | $0.01
SFTP Connector Calls - Transfer | $0.001 / API call | 25 API calls | $0.03
SFTP Connector Calls - Data Transfer | $0.4 / GB | 5 GB | $1.95
Total: $1.99 per day
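The arithmetic behind both tables can be reproduced with a short script, which is handy for plugging in your own numbers. The prices are the ones quoted in the tables above and may not match current pricing. The batch size of 10 files per transfer call is an assumption implied by the example (5000 files, 500 calls), as is the use of 1 GB = 1024 MB. `estimate_cost` is a hypothetical helper.

```python
import math

# Prices quoted in the example tables above (N. Virginia); check current pricing.
PRICE_PER_CALL = 0.001
PRICE_PER_GB = 0.40
FILES_PER_TRANSFER_CALL = 10  # assumption implied by 5000 files -> 500 calls


def estimate_cost(folders: int, files: int, mb_per_file: float) -> float:
    """Estimate the connector cost in USD for one sync run."""
    list_cost = folders * PRICE_PER_CALL  # one List call per folder
    transfer_cost = math.ceil(files / FILES_PER_TRANSFER_CALL) * PRICE_PER_CALL
    data_cost = files * mb_per_file / 1024 * PRICE_PER_GB  # MB -> GB
    return round(list_cost + transfer_cost + data_cost, 2)


initial = estimate_cost(folders=10, files=5000, mb_per_file=20)
daily = estimate_cost(folders=10, files=250, mb_per_file=20)  # 5% of 5000 files
```

Data transfer dominates in both cases, so file size and change rate matter far more than the number of folders being listed.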

Prerequisites

This project is built using Python3 and CDK. Before you start, make sure all the prerequisites are properly installed in your environment.

Deployment

Tagging resources

By default, the solution applies some native CDK and Name tags. If your deployment requires additional tags (e.g., Cost Allocation, Environment, Team), you can update the solution parameters file by modifying the JSON for additional_tags with your specific keys and values.

Permission Boundaries

If you are enforcing the usage of IAM Permission Boundaries for IAM Roles created in the account, you can update the solution parameters file and add the managed policy ARN you are using to the permission_boundary_policy_arn parameter.

Note:

  • This step is optional. Only add this parameter if you're enforcing IAM Permission Boundaries in your account.
  • Ensure you have the correct ARN for your permission boundary policy.
  • If you're not using permission boundaries, you can omit this parameter or leave it as an empty string.

By setting this parameter, all IAM roles created by this solution will adhere to the specified permission boundary, enhancing your security posture and compliance with organizational policies.

Step-by-Step deployment

To deploy the solution, follow these steps:

MacOS and Linux:

Step 1: To manually create a virtualenv on MacOS and Linux:

python3 -m venv .venv

Step 2: After the init process completes and the virtualenv is created, you can use the following command to activate your virtualenv

source .venv/bin/activate

Go to Step 4 for MacOS and Linux Users.

Windows:

Step 3: Windows users can activate the virtualenv with this command:

.venv\Scripts\activate.bat

Step 4: Once the virtualenv is activated, you can install the required dependencies.

pip install -r requirements.txt

Step 5: Create new configuration files running the provided CLI and following the instructions.

python3 cli.py

Step 6: If this is your first time using CDK with the target AWS Account you'll need to bootstrap the environment.

cdk bootstrap

Step 7: At this point you can now synthesize the CloudFormation template for this code and deploy it.

cdk synth
cdk deploy

Useful commands

  • cdk ls: list all stacks in the app
  • cdk synth: emit the synthesized CloudFormation template
  • cdk deploy: deploy this stack to your default AWS account/region
  • cdk diff: compare the deployed stack with the current state
  • cdk docs: open the CDK documentation

Troubleshooting and Support

For issues or questions, please open an issue in this repository.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
