|
1 | 1 | ## Ingest Data using Python scripts |
2 | 2 |
|
3 | | -If you want to ingest data into Elasticsearch starting with the raw CSV data files follow the instructions below: |
| 3 | +If you want to ingest data into Elasticsearch starting with the raw CSV data |
| 4 | +files, follow the instructions below: |
4 | 5 |
|
5 | | -##### 1. Download the following files: <br> |
6 | | -- `ingestRestaurantData.py` - Python script to process and ingest. This script downloads the required dataset. |
| 6 | +#### 1. Download the following files: |
| 7 | + |
| 8 | +- `ingestRestaurantData.py` - Python script to process and ingest. Note that this script downloads the required dataset. |
7 | 9 | - `inspection_mapping.json` contains mapping for Elasticsearch index |
8 | 10 |
|
9 | 11 | #### 2. Install and Configure Python |
10 | 12 |
|
11 | 13 | Requires Python 3. |
12 | | -Install Dependencies using pip i.e. `pip install -r requirements.txt` |
| 14 | +Install the dependencies using pip: |
| 15 | +```shell |
| 16 | +pip install -r requirements.txt |
| 17 | +``` |
| 18 | + |
| 19 | +Note that macOS users may need to `brew install python3`, |
| 20 | +which would change the pip command to |
| 21 | +```shell |
| 22 | +pip3 install -r requirements.txt |
| 23 | +``` |
| 24 | +#### 3. Optionally, configure the Python script for SSL |
| 25 | + |
| 26 | +If your instance of Elasticsearch requires SSL, is not running locally, or both, |
| 27 | +you can tweak the script to enable it. |
| 28 | + |
| 29 | +Inside the script you will notice the connection string for Elasticsearch: |
| 30 | + |
| 31 | +```python |
| 32 | + es = elasticsearch.Elasticsearch( |
| 33 | + # ['host1'], |
| 34 | + # http_auth=('myuser', 'mypassword'), |
| 35 | + # port=443, |
| 36 | + # use_ssl=True |
| 37 | +) |
| 38 | +``` |
13 | 39 |
|
| 40 | +Replace the host entry with the name of your Elasticsearch endpoint (if you have more |
| 41 | +than one endpoint, you can use a comma-separated list). For additional arguments, |
| 42 | +see the Elasticsearch Python Client documentation |
| 43 | +(https://elasticsearch-py.readthedocs.io/en/master/api.html). |
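
As a rough illustration of the SSL step, the connection arguments could be gathered in one place before they are passed to the client. This is only a sketch: `build_es_kwargs` is a hypothetical helper that is not part of `ingestRestaurantData.py`, the host name and credentials are placeholders, and the keyword names (`hosts`, `http_auth`, `port`, `use_ssl`) are simply the ones that appear in the commented-out lines of the script, so they may differ in other versions of the client.

```python
def build_es_kwargs(hosts, user=None, password=None, port=9200, use_ssl=False):
    """Collect keyword arguments for elasticsearch.Elasticsearch(...).

    Hypothetical helper: the argument names mirror the commented-out
    options shown in the script's connection block.
    """
    kwargs = {"hosts": list(hosts), "port": port, "use_ssl": use_ssl}
    # Only add credentials when both parts are supplied.
    if user is not None and password is not None:
        kwargs["http_auth"] = (user, password)
    return kwargs


# Placeholder endpoint and credentials for a secured, remote instance.
secure = build_es_kwargs(
    ["host1"], user="myuser", password="mypassword", port=443, use_ssl=True
)

# With the client library installed, the script's connection would become:
#   es = elasticsearch.Elasticsearch(**secure)
print(secure)
```

Calling `build_es_kwargs(["localhost"])` with no credentials leaves out `http_auth` and SSL, which corresponds to a plain local instance.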
| 44 | + |
| 45 | +#### 4. Run Python script to process, join data and index data |
| 46 | + |
| 47 | +Run `ingestRestaurantData.py` (requires Python 3). When the script is done |
| 48 | +running, you will have a `nyc_restaurants` index in your Elasticsearch instance. |
14 | 49 |
|
15 | | -##### 2. Run Python script to process, join data and index data<br> |
16 | | -Run `ingestRestaurantData.py` (requires Python 3). When the script is done running, you will have a `nyc_restaurants` index in your Elasticsearch instance |
17 | 50 | ``` |
18 | | - python3 ingestRestaurantData.py |
| 51 | +python3 ingestRestaurantData.py |
19 | 52 | ``` |
| 53 | + |
20 | 54 | NOTE: |
21 | 55 | - The script calls the Google geocoding API to get the lat/lon information for restaurant addresses. (a) You might need to sign up for an API key to avoid hitting usage limits. (b) Depending on your internet connection and the size of the inspection dataset, this step might take from 30 minutes to a few hours to complete. |
22 | 56 | - We have also included an IPython Notebook version of the script, `ingestRestaurantData.ipynb`, in case you prefer running in a cell-by-cell mode. |
23 | 57 |
|
24 | | -##### 3. Check if data is available in Elasticsearch |
| 58 | +#### 5. Check if data is available in Elasticsearch |
| 59 | + |
25 | 60 | Check to see if all the data is available in Elasticsearch. If all goes well, you should get a `count` response of `473039` when you run the following command. |
26 | 61 |
|
27 | | - ```shell |
28 | | - curl -H "Content-Type: application/json" -XGET localhost:9200/nyc_restaurants/_count -d '{ |
29 | | - "query": { |
30 | | - "match_all": {} |
31 | | - } |
32 | | - }' |
33 | | - ``` |
| 62 | +```shell |
| 63 | +curl -H "Content-Type: application/json" -XGET localhost:9200/nyc_restaurants/_count -d '{ |
| 64 | +"query": { |
| 65 | + "match_all": {} |
| 66 | +} |
| 67 | +}' |
| 68 | +``` |
| 69 | + |
| 70 | +NOTE: |
| 71 | + |
| 72 | +If you are using HTTPS, you will likely also need to use the |
| 73 | +`--user username:password` option with your curl command. |
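
The same count check can be sketched in Python using only the standard library, which may be handy if curl is not available. The helpers `build_count_request` and `fetch_count` are hypothetical names introduced here, the base URL assumes a local instance on port 9200, and the optional credentials stand in for curl's `--user username:password` option (sent as HTTP Basic auth).

```python
import base64
import json
import urllib.request


def build_count_request(base_url, index, user=None, password=None):
    """Build the _count request used in the curl command above."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if user is not None and password is not None:
        # Equivalent of curl's --user option: HTTP Basic authentication.
        token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
        headers["Authorization"] = "Basic " + token.decode("ascii")
    return urllib.request.Request(
        f"{base_url}/{index}/_count", data=body, headers=headers, method="GET"
    )


def fetch_count(request):
    """Send the request and return the `count` field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["count"]


# Assumed local endpoint; adjust the URL for a remote or HTTPS instance.
req = build_count_request("http://localhost:9200", "nyc_restaurants")
# Uncomment once Elasticsearch is running and the index has been created:
# print(fetch_count(req))
```

Building the request separately from sending it makes it possible to inspect the URL, body, and headers before any network call is made.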