This project is a big data analysis application for tourism websites: it scrapes, cleans, stores, analyzes, and visualizes hotel data from travel sites such as Ctrip. The goal is to extract useful insights from a large volume of unstructured data so that users can make data-driven decisions when planning their travels. The pipeline collects data by web scraping with Jsoup, cleans it, stores it in HBase, analyzes it with MapReduce, and visualizes the results with ECharts.
This project includes the following main features:
- Data Scraping: Uses Jsoup to scrape hotel data from tourism websites (such as Ctrip), including city information, hotel names, prices, ratings, and more.
- Data Cleaning: Cleans the scraped HTML data using Jsoup, removing irrelevant elements and structuring the data for further analysis.
- Data Storage: Utilizes HBase to store structured hotel data in a distributed, column-oriented database, taking advantage of HBase's scalability and performance.
- Data Analysis: Implements data analysis using Hadoop's MapReduce, including tasks like calculating the average hotel price by city and performing word frequency analysis on hotel reviews.
- Data Visualization: Uses ECharts to create interactive visualizations such as hotel price distribution, average price comparison, and room type statistics.
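To make the "average hotel price by city" analysis concrete, here is a minimal plain-Java sketch of the map, group, and reduce steps that such a job performs. It deliberately does not use the Hadoop API and is not the project's actual job code. The class name `AveragePriceSketch`, its method names, and the `city,hotelName,price` record layout are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: mimics the phases of a MapReduce job in plain Java.
class AveragePriceSketch {

    // "Map" phase: turn one record line into a (city, price) pair.
    // The "city,hotelName,price" layout is an assumption of this sketch.
    static Map.Entry<String, Double> map(String line) {
        String[] fields = line.split(",");
        if (fields.length != 3) {
            return null; // skip records that do not match the assumed layout
        }
        try {
            double price = Double.parseDouble(fields[2].trim());
            return new SimpleEntry<>(fields[0].trim(), price);
        } catch (NumberFormatException e) {
            return null; // skip records whose price is not a number
        }
    }

    // "Reduce" phase: average all prices collected for a single city.
    static double reduce(List<Double> prices) {
        double sum = 0;
        for (double p : prices) {
            sum += p;
        }
        return sum / prices.size();
    }

    // Driver: group mapped pairs by city (the "shuffle"), then reduce each group.
    static Map<String, Double> averageByCity(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> pair = map(line);
            if (pair == null) {
                continue;
            }
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            averages.put(e.getKey(), reduce(e.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        // Made-up sample records, not real scraped data.
        List<String> records = List.of(
            "Beijing,Hotel A,400",
            "Beijing,Hotel B,300",
            "Shanghai,Hotel C,500",
            "Shanghai,Hotel D,not-a-price");
        averageByCity(records)
            .forEach((city, avg) -> System.out.println(city + "\t" + avg));
    }
}
```

In a real Hadoop job, the `map` and `reduce` steps would live in Mapper and Reducer classes and the grouping would be handled by the framework's shuffle; the sketch only shows the logic. Note that a hotel name containing a comma would break the assumed record layout.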
- Jsoup: A Java library for HTML parsing and web scraping.
- HBase: A distributed, column-oriented NoSQL database for storing structured data.
- Hadoop: A framework for distributed storage and processing of large data sets, used for MapReduce jobs.
- ECharts: A JavaScript library for creating interactive and customizable data visualizations.
- Java: The primary programming language used to implement the solution.
- Java 8 or higher
- HBase 2.0 or higher
- Hadoop 3.x
- Maven (for dependency management)
- ECharts (for visualization)
Contributions to this project are welcome! If you have suggestions, bug fixes, or improvements, feel free to fork this repository and submit a pull request.
- Fork this repository.
- Create a new branch (`git checkout -b feature-xyz`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature-xyz`).
- Create a new pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Special thanks to the creators of ECharts and Jsoup.
- Thanks to the Hadoop and HBase communities for providing the necessary big data tools.
- Thanks to the tourism website data source for providing rich datasets for analysis.