WordCloud from Excel Generator

📖 Overview

WordCloud Architect is a text analysis engine that turns local Excel files into word clouds and term-frequency reports. It works with Western languages (English, Portuguese, etc.) and also includes Japanese Natural Language Processing (NLP) support.

Japanese text has no spaces between words, so the engine uses an integrated morphological analysis module built on the Janome library to segment continuous text into meaningful units before counting. The same word cloud and statistical report pipeline then applies regardless of the language.

🖼️ System Preview

Japanese Example of a high-definition output for the "ガーデニング" keyword.

English Example of a high-definition output for the "Gardening" keyword.

🌟 Key Features

Universal WordCloud Generation: Effortlessly process any .xlsx file by pointing to the desired text and keyword columns.
Advanced Japanese Module: Uses Janome for deep morphological analysis, accurately extracting Nouns (名詞) and Adjectives (形容詞) from non-spaced text.
Local-First & Secure: Operates entirely on your local machine. No internet, cloud APIs, or Google Sheets credentials required.
Smart Stopword Filtering: Integrated support for an external stopwords.txt file (UTF-8) to eliminate noise across multiple languages simultaneously.
Automated Data Reporting: Automatically exports a comprehensive frequency report (.xlsx) containing the top 300 terms for each analyzed keyword.
Professional CJK Rendering: Font management ensures Japanese characters render as proper glyphs rather than "tofu" placeholder boxes.
Safe Filename Sanitization: Automated regex cleaning to ensure generated images are saved correctly, even if keywords contain illegal OS characters.
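
The filename sanitization described above could look something like the following minimal sketch. The function name `safe_filename`, the replacement character, and the exact character class are assumptions for illustration, not the project's actual code. The class covers the characters Windows rejects in filenames, which also includes the ones Unix-like systems reject.

```python
import re

# Characters that are illegal in Windows filenames, plus ASCII control characters.
_ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def safe_filename(keyword, replacement="_", fallback="keyword"):
    """Replace illegal characters and trim trailing dots/spaces (hypothetical helper)."""
    cleaned = _ILLEGAL_CHARS.sub(replacement, keyword).strip(" .")
    # An empty result (e.g. the keyword was only dots) falls back to a fixed name.
    return cleaned or fallback

print(safe_filename('AC/DC: "live"?'))  # AC_DC_ _live__
print(safe_filename("ガーデニング"))      # ガーデニング (non-ASCII text is left untouched)
```

Non-ASCII keywords pass through unchanged, so a Japanese keyword can still be used directly as the image name.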

🔄 Workflow Logic

```mermaid
graph TD
    Data[Local Excel File] -->|Pandas| Loader[Data Loader]
    Loader -->|Text Stream| NLP[Multilingual / Japanese Engine]
    NLP -->|POS Tagging| Filter[Multi-Layer Filter]
    Filter -->|Stopwords Filter| Counter[Frequency Counter]
    Counter -->|Frequency Dict| Cloud[WordCloud Generator]
    Cloud -->|Matplotlib| Render[Visual Export .png]
    Counter -->|Counter List| Report[Excel Report .xlsx]
```
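
The counting stage of this workflow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's actual code: the row keys `"text"` and `"keyword"`, the function name `top_terms_by_keyword`, and the whitespace tokenizer are all hypothetical. Rows in this shape could come from a pandas DataFrame via `to_dict("records")`.

```python
from collections import Counter, defaultdict

def top_terms_by_keyword(rows, tokenize, stopwords=frozenset(), top_n=300):
    """Group row texts by keyword and return the top_n (term, count) pairs for each."""
    counters = defaultdict(Counter)
    for row in rows:
        # Drop stopwords before counting so they never reach the report.
        tokens = (t for t in tokenize(row["text"]) if t not in stopwords)
        counters[row["keyword"]].update(tokens)
    return {kw: counter.most_common(top_n) for kw, counter in counters.items()}

# Hypothetical rows; real data would be loaded from the Excel file.
rows = [
    {"keyword": "Gardening", "text": "roses need sun and roses love water"},
    {"keyword": "Gardening", "text": "water the roses"},
    {"keyword": "Cooking", "text": "boil the water"},
]
report = top_terms_by_keyword(rows, tokenize=str.split, stopwords={"and", "the"}, top_n=2)
print(report["Gardening"])  # [('roses', 3), ('water', 2)]
```

The per-keyword frequency pairs serve both branches of the diagram: as a dict they can feed the word cloud, and as a list they can be written to the report.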

🛠️ Project Structure

```
wordcloud_from_excel/
├── input/                     # Source Data (Excel & Stopwords)
├── output/                    # Generated PNGs and Reports
├── venv/                      # Python Virtual Environment
├── generate_wordcloud.py      # Core Engine & Config
├── requirements.txt           # Project Dependencies
└── README.md                  # Documentation
```

🚀 Quick Start

Prerequisites

Python 3.9+
CJK Font: For Japanese support, make sure a font with Japanese glyphs is installed (e.g., msgothic.ttc on Windows or a Noto Sans CJK font on Linux).
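
One way to check the font prerequisite before running the script is a small lookup like the one below. It is a sketch only: the helper name `find_cjk_font` is hypothetical, and both candidate paths are assumptions about typical install locations that may differ on your system. The resulting path is the kind of value the `wordcloud` library's `WordCloud` class accepts as its `font_path` argument.

```python
import os

# Assumed typical locations; adjust these for your own system.
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\msgothic.ttc",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]

def find_cjk_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

font_path = find_cjk_font()
if font_path is None:
    print("No CJK font found; Japanese text may render as empty boxes.")
else:
    print(f"Using font: {font_path}")
```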

Installation & Deployment

Clone the Repository

Initialize Virtual Environment

```
python -m venv venv
# Activate (Windows)
.\venv\Scripts\activate
# Activate (Unix)
source venv/bin/activate
```

Install Dependencies
```
pip install -r requirements.txt
```
Execute Analysis
```
python generate_wordcloud.py
```

📘 NLP Strategy & Logic

1. The Tokenization Challenge

Japanese text is written as a continuous stream with no spaces between words. WordCloud Architect acts as a "Linguistic Chef", slicing the stream into tokens based on morphological rules. It prioritizes:

名詞 (Nouns): To capture the core subjects.
形容詞 (Adjectives): To capture sentiments and qualities.
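
A minimal sketch of this part-of-speech selection is shown below. The helper names are hypothetical and this is not the project's actual code. The Janome calls used are `Tokenizer().tokenize(text)` and the `surface` and `part_of_speech` attributes of each token; Janome is imported lazily so the filtering logic can be tried without it installed.

```python
KEEP_POS = {"名詞", "形容詞"}  # nouns and adjectives

def keep_content_words(pairs, keep_pos=KEEP_POS):
    """Keep surfaces whose top-level part of speech is in keep_pos.

    `pairs` is an iterable of (surface, part_of_speech) tuples, where
    part_of_speech is a comma-separated tag string such as "名詞,一般,*,*".
    """
    return [surface for surface, pos in pairs if pos.split(",")[0] in keep_pos]

def tokenize_japanese(text):
    """Segment text with Janome and return (surface, part_of_speech) pairs."""
    from janome.tokenizer import Tokenizer  # lazy import; requires `pip install janome`
    return [(tok.surface, tok.part_of_speech) for tok in Tokenizer().tokenize(text)]

# Hand-written pairs in the shape Janome produces, so this runs without Janome.
sample = [
    ("美しい", "形容詞,自立,*,*"),
    ("庭", "名詞,一般,*,*"),
    ("を", "助詞,格助詞,一般,*"),
    ("作る", "動詞,自立,*,*"),
]
print(keep_content_words(sample))  # ['美しい', '庭']
```

With Janome installed, the two helpers chain as `keep_content_words(tokenize_japanese(text))`.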

2. Multi-Layer Noise Reduction

The engine applies four distinct filters:

Grammatical Filter: Removes particles (助詞) and auxiliary verbs (助動詞).
Internal Dictionary Filter: Removes words found in a built-in set of standard Japanese stopwords.
External User Filter: Processes stopwords.txt to remove custom noise (e.g., "PDF", "Click", "Views").
Structural Filter: Removes single-character tokens and numeric strings.
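
The four filters above could be chained roughly as follows. This is a sketch under assumptions rather than the engine's real implementation: the function names, the tiny internal stopword set, the case-insensitive matching, and the one-term-per-line format for stopwords.txt are all illustrative guesses.

```python
DROP_POS = {"助詞", "助動詞"}                 # particles and auxiliary verbs
INTERNAL_STOPWORDS = {"こと", "もの", "これ"}  # tiny illustrative set only

def parse_stopwords(text):
    """Parse stopword file contents: one term per line, blank lines ignored."""
    return {line.strip().casefold() for line in text.splitlines() if line.strip()}

def filter_tokens(pairs, external_stopwords=frozenset()):
    """Apply the four filters in order to (surface, part_of_speech) pairs."""
    kept = []
    for surface, pos in pairs:
        if pos.split(",")[0] in DROP_POS:              # 1. grammatical filter
            continue
        if surface in INTERNAL_STOPWORDS:              # 2. internal dictionary filter
            continue
        if surface.casefold() in external_stopwords:   # 3. external user filter
            continue
        if len(surface) < 2 or surface.isnumeric():    # 4. structural filter
            continue
        kept.append(surface)
    return kept

external = parse_stopwords("PDF\nClick\n\nViews\n")
pairs = [
    ("庭園", "名詞,一般,*,*"),
    ("は", "助詞,係助詞,*,*"),
    ("こと", "名詞,非自立,一般,*"),
    ("PDF", "名詞,固有名詞,組織,*"),
    ("2024", "名詞,数,*,*"),
    ("花", "名詞,一般,*,*"),
    ("美しい", "形容詞,自立,*,*"),
]
print(filter_tokens(pairs, external))  # ['庭園', '美しい']
```

Note that the structural filter also drops meaningful single-character nouns such as 花, which is the trade-off of removing one-character tokens.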

⚖️ License (CC BY-NC 4.0)

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

👨‍💻 Author

Rubens Braz
