Scraper for SoundCloud that extracts audio metadata and download URLs using Node.js and Puppeteer. Designed for collecting audio data to train AI models in music generation, audio enhancement, and speech processing.

Decodo/soundcloud-scraper

SoundCloud Scraper for AI Training

A Node.js-based SoundCloud scraper designed for collecting audio data to train AI models. Includes three specialized scripts for music generation, audio enhancement, and speech processing use cases.

Features

  • Playlist scraping. Extract metadata from SoundCloud playlists including artist names, track titles, play counts, and URLs.
  • Search results scraping. Find Creative Commons licensed tracks with download availability filtering.
  • Profile scraping. Collect podcast episodes and long-form spoken content from user profiles.
  • Proxy support. Built-in residential proxy integration for reliable, undetected scraping.
  • Headless browser automation. Uses Puppeteer to handle JavaScript-heavy pages and dynamic content.
  • Download detection. Automatically identifies tracks with enabled download buttons.

Installation

Prerequisites

  • Node.js 14 or later, with npm

Setup

git clone https://github.com/Decodo/soundcloud-scraper.git
cd soundcloud-scraper
npm install

Configure proxies

Get your proxy credentials from the Decodo dashboard and update them in each script file:

await page.authenticate({
  username: 'YOUR_PROXY_USERNAME',  // Replace with your username
  password: 'YOUR_PROXY_PASSWORD'   // Replace with your password
});
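
One way to keep credentials out of the script files is to read them from environment variables and pass the result to page.authenticate. This is a minimal sketch, not code from the repository: the helper name and the variable names DECODO_PROXY_USERNAME and DECODO_PROXY_PASSWORD are assumptions made for illustration.

```javascript
// Hypothetical helper: reads proxy credentials from environment variables.
// The variable names below are assumptions, not part of the repository.
function getProxyCredentials(env = process.env) {
  const username = env.DECODO_PROXY_USERNAME;
  const password = env.DECODO_PROXY_PASSWORD;
  if (!username || !password) {
    throw new Error('Set DECODO_PROXY_USERNAME and DECODO_PROXY_PASSWORD first');
  }
  return { username, password };
}

// Usage inside a script, once `page` exists:
//   await page.authenticate(getProxyCredentials());
```

Failing early with a clear message tends to be easier to debug than a proxy authentication error that surfaces later during page load.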

Usage

1. Music generation AI training

Train music generation models, in the vein of Suno AI, AIVA, or Stable Audio, on metadata from curated playlists that represent successful musical patterns across genres.

File: music-generation.js

Scrape trending playlists to collect metadata for training music generation models.

node music-generation.js

What it does:

  • Targets SoundCloud playlist pages
  • Extracts artist names, track titles, play counts
  • Outputs structured data with rankings

Customize the target:

// Edit line 43 in music-generation.js
await page.goto('https://soundcloud.com/YOUR-PLAYLIST-URL', {

Output example:

Found 50 playlist tracks

1. Artist Name - "Track Title" (1.2M plays)
   https://soundcloud.com/artist/track
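
Play counts in the output above are abbreviated strings such as "1.2M plays". If you want to sort or rank tracks numerically, a conversion step helps. The following is a sketch under assumptions: the function names and the `plays` field are hypothetical and are not taken from music-generation.js.

```javascript
// Converts an abbreviated count such as "1.2M plays" into a number.
// Returns null when no number can be found in the text.
function parsePlayCount(text) {
  const match = /([\d.,]+)\s*([KMB])?/i.exec(String(text));
  if (!match) return null;
  const value = parseFloat(match[1].replace(/,/g, ''));
  if (Number.isNaN(value)) return null;
  const multipliers = { K: 1e3, M: 1e6, B: 1e9 };
  const suffix = (match[2] || '').toUpperCase();
  return Math.round(value * (multipliers[suffix] || 1));
}

// Sorts tracks by numeric play count, highest first, and adds a 1-based rank.
// Assumes each track object carries the raw count text in a `plays` field.
function rankByPlays(tracks) {
  return tracks
    .map((track) => ({ ...track, playCount: parsePlayCount(track.plays) ?? 0 }))
    .sort((a, b) => b.playCount - a.playCount)
    .map((track, index) => ({ rank: index + 1, ...track }));
}
```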

2. Audio enhancement AI training

Collect Creative Commons tracks to train models that clean degraded recordings, remove noise, and restore audio quality, similar to tools such as Adobe Enhance Speech or Descript.

File: audio-enhancement.js

Find Creative Commons tracks with download availability for audio cleanup model training.

node audio-enhancement.js

What it does:

  • Searches SoundCloud with custom queries
  • Auto-scrolls to load more results
  • Filters tracks with download buttons enabled

Customize the search:

// Edit line 49 in audio-enhancement.js
await page.goto('https://soundcloud.com/search/sounds?q=YOUR-SEARCH-QUERY', {
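
Search queries that contain spaces or special characters need to be URL-encoded before they go into the `q` parameter. A small helper can do this; `buildSearchUrl` is a hypothetical name used here for illustration, not a function from the repository.

```javascript
// Builds a SoundCloud search URL with a safely encoded query string.
function buildSearchUrl(query) {
  return 'https://soundcloud.com/search/sounds?q=' + encodeURIComponent(query);
}

// Example:
//   buildSearchUrl('creative commons jazz')
//   returns 'https://soundcloud.com/search/sounds?q=creative%20commons%20jazz'
```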

Output example:

Total items found: 85
Found 42 downloadable tracks:

1. Artist Name - "Track Title"
   https://soundcloud.com/artist/track
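
The filtering step behind this output can be pictured as a plain array filter over the scraped items. The sketch below assumes a hypothetical item shape of `{ artist, title, url, hasDownloadButton }`; the actual field names in audio-enhancement.js may differ.

```javascript
// Keeps only the items whose download button was detected.
// The `hasDownloadButton` field is an assumption for this example.
function filterDownloadable(items) {
  return items.filter((item) => item.hasDownloadButton === true);
}

// Formats a report in the same layout as the output example above.
function formatReport(items) {
  const downloadable = filterDownloadable(items);
  const lines = [
    `Total items found: ${items.length}`,
    `Found ${downloadable.length} downloadable tracks:`,
    '',
  ];
  downloadable.forEach((item, index) => {
    lines.push(`${index + 1}. ${item.artist} - "${item.title}"`);
    lines.push(`   ${item.url}`);
  });
  return lines.join('\n');
}
```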

3. Speech/voice AI training

Extract podcast episodes to train speech recognition, voice cloning, and natural language processing models, similar to Whisper (OpenAI) or ElevenLabs.

File: voice-training.js

Extract podcast episodes and lectures for speech recognition and voice AI training.

node voice-training.js

What it does:

  • Scrapes user profile pages
  • Focuses on long-form spoken content
  • Identifies downloadable episodes

Customize the target:

// Edit line 43 in voice-training.js
await page.goto('https://soundcloud.com/YOUR-PROFILE/tracks', {

Configure limits:

// Edit lines 62-63 in voice-training.js
const maxTracksToScrape = 50;  // Maximum tracks to process
const maxScrollAttempts = 20;  // Scroll depth
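
To illustrate how two limits like these can bound a scroll loop, here is a sketch under assumptions. It is not the code from the script: `scrollForTracks`, the `countTracks` callback, and the `delayMs` option are hypothetical names. Only `page.evaluate` is a Puppeteer call.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scrolls until enough tracks are loaded, the scroll budget is spent,
// or a scroll loads nothing new. `countTracks` is a hypothetical callback
// that returns how many track elements are currently on the page.
async function scrollForTracks(page, countTracks, options = {}) {
  const { maxTracksToScrape = 50, maxScrollAttempts = 20, delayMs = 1500 } = options;
  let loaded = await countTracks(page);
  let attempts = 0;
  while (loaded < maxTracksToScrape && attempts < maxScrollAttempts) {
    // Jump to the bottom of the page to trigger lazy loading.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await sleep(delayMs);
    const now = await countTracks(page);
    attempts += 1;
    if (now === loaded) break; // nothing new appeared, assume end of list
    loaded = now;
  }
  return { loaded: Math.min(loaded, maxTracksToScrape), attempts };
}
```

Stopping when a scroll adds no new tracks avoids spending the whole scroll budget on short profiles.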

Output example:

Found 68 total tracks
24 downloadable tracks:

1. Podcast Name - "Episode Title"
   https://soundcloud.com/podcast/episode

Configuration

Each script includes configurable parameters at the top of the file:

Proxy settings:

'--proxy-server=http://gate.decodo.com:7000'  // Proxy endpoint
username: 'YOUR_PROXY_USERNAME'               // Your credentials
password: 'YOUR_PROXY_PASSWORD'

Scraping behavior:

headless: true          // Run browser invisibly
timeout: 45000          // Page load timeout (ms)
maxScrollAttempts: 5    // Pagination depth
targetResults: 500      // Result limit
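
As a rough picture of where these settings end up, the sketch below passes them to Puppeteer's `launch` and `goto` calls. The helper names are hypothetical, and the `waitUntil: 'networkidle2'` choice is an assumption; the scripts may wait on a different condition.

```javascript
// Builds the options object for puppeteer.launch() from the settings above.
function buildLaunchOptions({ headless = true, proxyServer } = {}) {
  const args = [];
  if (proxyServer) args.push(`--proxy-server=${proxyServer}`);
  return { headless, args };
}

// Opens a page with a 45-second load timeout by default.
// Requires the `puppeteer` package; it is loaded lazily so that
// buildLaunchOptions() can be used without a browser installed.
async function openPage(url, { timeout = 45000, ...launch } = {}) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions(launch));
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2', timeout });
  return { browser, page };
}

// Usage:
//   const { browser, page } = await openPage('https://soundcloud.com/YOUR-PLAYLIST-URL', {
//     proxyServer: 'http://gate.decodo.com:7000',
//   });
```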

Resource blocking:

// Scripts automatically block images, media, fonts
// to improve performance and reduce bandwidth
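
Resource blocking of this kind is commonly done with Puppeteer's request interception. The sketch below shows one way it could look. It is not necessarily how the scripts implement it, and `blockHeavyResources` is a hypothetical name.

```javascript
// Resource types to skip, matching the list in the comment above.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font']);

// Aborts requests for blocked resource types and lets everything else through.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

// Usage: call once after browser.newPage() and before page.goto().
```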

Best practices

  • Start with small limits (10-20 items) to test your setup
  • Use residential proxies to avoid IP bans
  • Add delays between requests to respect rate limits
  • Monitor console output for errors and warnings
  • Store credentials securely, never commit them to Git
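
The advice about delays can be sketched as a randomized pause between page visits. The function names, the `visit` callback, and the 2 to 5 second default range are assumptions chosen for illustration, not values from the repository.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Picks a random whole-millisecond delay in [minMs, maxMs]
// so that requests are not evenly spaced.
function randomDelayMs(minMs, maxMs, random = Math.random) {
  return Math.floor(minMs + random() * (maxMs - minMs + 1));
}

// Visits URLs one at a time, pausing between them.
// `visit` is a hypothetical callback that scrapes a single URL.
async function visitPolitely(urls, visit, { minMs = 2000, maxMs = 5000 } = {}) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    results.push(await visit(url));
    if (index < urls.length - 1) await sleep(randomDelayMs(minMs, maxMs));
  }
  return results;
}
```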

Proxy setup

Get residential proxies from Decodo:

  1. Sign up at dashboard.decodo.com
  2. Navigate to Residential Proxies → Proxy Setup
  3. Copy your Username and Password
  4. Update credentials in each script file

Troubleshooting

Browser won't launch?

  • Verify that Node.js 14 or later is installed
  • Install dependencies, including Puppeteer: npm install
  • Try setting headless: false to watch the browser window

Getting blocked?

  • Check that your proxy credentials are correct
  • Increase the delays between requests
  • Verify your proxy quota on the Decodo dashboard
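
One way to increase delays automatically after a failure is to retry with exponential backoff. This is a sketch under assumptions: `withRetry` and its options are hypothetical and are not part of the repository's scripts.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries an async task, doubling the wait after each failed attempt.
// Rethrows the last error once all attempts are used up.
async function withRetry(task, { attempts = 3, baseDelayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Usage, assuming `page` and `url` already exist:
//   await withRetry(() => page.goto(url, { timeout: 45000 }));
```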

No results found?

  • SoundCloud's HTML may have changed, so the selectors in the script may need updating
  • Check that the target URL is accessible
  • Review the browser console output

Documentation

Related projects

🗺️ Google Maps Scraper

🔍 Google Lens Scraper

📰 Google News Scraper

💬 Reddit Scraper
