Our customer is a used car sales service that is developing an application to attract new clients. The app helps determine the market price of a user's car.
The following criteria are essential for the customer:
- prediction quality
- time required for model training
- time required for prediction
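A minimal sketch of how these three criteria could be measured for any model that exposes `fit` and `predict`. The `evaluate` helper and the `MeanBaseline` class are hypothetical names used for illustration only, and the tiny inline data is made up; this is not the code used in the project.

```python
import math
import time


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return fit time and predict time (in seconds) and RMSE for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_time = time.perf_counter() - start

    # RMSE: square root of the mean squared difference between
    # predicted and true prices.
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, y_test)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return {"fit_time": fit_time, "predict_time": predict_time, "rmse": rmse}


class MeanBaseline:
    """Toy stand-in for a dummy model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


# Made-up data, only to show the call shape.
result = evaluate(
    MeanBaseline(),
    [[1], [2], [3]], [1000, 2000, 3000],
    [[4], [5]], [1500, 2500],
)
print(result)
```

Because the helper relies only on `fit` and `predict`, the same function should work unchanged for scikit-learn, LightGBM, or CatBoost regressors, which makes the timings comparable across models.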
Dataset objects are entries crawled from car profile forms:
- DateCrawled: date the form was downloaded from the database
- VehicleType: car body type
- RegistrationYear: year of car registration
- Gearbox: gearbox type
- Power: horsepower (hp)
- Model: car model
- Kilometer: car mileage (km)
- RegistrationMonth: month of car registration
- FuelType: fuel type
- Brand: car brand
- Repaired: whether the car has been repaired
- DateCreated: car profile creation date
- NumberOfPictures: number of car photos
- PostalCode: postal code of the user who owns the car profile
- LastSeen: date of the user's last activity
Target feature:
- Price: price (euro)
- The dataset has many outliers and missing values.
- I marked outliers as NaN and removed only objects with 3 or more missing values.
- For the remaining objects, I replaced NaNs in categorical features with an "unknown" value.
- All models except the dummy baselines achieved an RMSE that meets the requirements.
- Random Forest Regressor was the slowest model.
- Linear Regression was the fastest to fit and to predict. It has the worst RMSE, but it still meets the requirements.
- CatBoost fits much more slowly than LightGBM but makes predictions faster.
- LightGBM has the second-fastest fit time, after Linear Regression.
The LightGBM model was chosen, with an RMSE of 1429.
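
The preprocessing described above (mark outliers as NaN, drop only objects with 3 or more missing values, fill categorical NaNs with an "unknown" value) could look roughly like the pandas sketch below. The bounds in `VALID_RANGES`, the choice of columns in `CATEGORICAL`, and the `clean` function name are assumptions made for illustration; the actual thresholds used in the project are not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds; the real thresholds are not given.
VALID_RANGES = {
    "RegistrationYear": (1950, 2016),
    "Power": (20, 500),
}
# Assumed set of categorical columns that may contain missing values.
CATEGORICAL = ["VehicleType", "Gearbox", "Model", "FuelType", "Repaired"]


def clean(df: pd.DataFrame, max_missing: int = 2) -> pd.DataFrame:
    df = df.copy()
    # 1. Mark out-of-range values as NaN instead of dropping the rows.
    for col, (low, high) in VALID_RANGES.items():
        df[col] = df[col].where(df[col].between(low, high))
    # 2. Keep rows with at most `max_missing` NaNs,
    #    i.e. drop only rows with 3 or more missing values.
    df = df[df.isna().sum(axis=1) <= max_missing].copy()
    # 3. Replace remaining NaNs in categorical columns with "unknown".
    df[CATEGORICAL] = df[CATEGORICAL].fillna("unknown")
    return df


# Made-up rows, only to show the behaviour.
raw = pd.DataFrame({
    "RegistrationYear": [2005, 1000, 2010],
    "Power": [100, 0, 9999],
    "VehicleType": ["sedan", None, None],
    "Gearbox": ["manual", None, "auto"],
    "Model": ["golf", "a4", "corsa"],
    "FuelType": ["petrol", "petrol", "petrol"],
    "Repaired": ["no", None, "no"],
})
cleaned = clean(raw)
print(cleaned)
```

In this sketch numeric NaNs (such as an implausible `Power`) are left in place after the row filter; gradient boosting libraries such as LightGBM can usually handle them natively, while Linear Regression would need an additional imputation step.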