"""
NYC open data experiments
NYC Open Data
https://nycopendata.socrata.com/
https://dev.socrata.com/docs/endpoints.html
Directions: search for dataset, click on export, right click on csv for url,
then kill everything after the question mark.
Prepared for Data Bootcamp course at NYU
* https://github.com/NYUDataBootcamp/Materials
* https://github.com/NYUDataBootcamp/Materials/Code/Lab
Written by Dave Backus, February 2016
Created with Python 3.5
"""
import pandas as pd
import time
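# The directions in the header say to copy the csv export link and drop
# everything after the question mark. A minimal sketch of that step, assuming
# the link is an ordinary URL with a query string. The helper name strip_query
# and the accessType parameter in the example link are illustrative
# assumptions, not part of the original script.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return url without its query string or fragment."""
    parts = urlsplit(url)
    # keep scheme, host and path; blank out the query and fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

# hypothetical export link; the query parameter is only for illustration
export_url = ('https://nycopendata.socrata.com/api/views/xx67-kt59/rows.csv'
              '?accessType=DOWNLOAD')
print(strip_query(export_url))
```

# The result can be passed straight to pd.read_csv, as in the cells below.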
#%%
"""
restaurant inspections
takes 7-8 minutes with an ethernet connection, delivers 476k observations
"""
# perf_counter() is a wall-clock timer; process_time() counts only CPU time,
# so it would miss most of the time spent waiting on the download
start = time.perf_counter()
print('\nLocal time at start:', time.ctime())
url = 'https://nycopendata.socrata.com/api/views/xx67-kt59/rows.csv'
insp = pd.read_csv(url, low_memory=False)
print('\nNYC restaurant inspections')
print('Dimensions:', insp.shape)
print('Variables and types:\n', insp.dtypes, sep='')
print('\nLocal time at finish:', time.ctime())
print('\nTiming for restaurant input (seconds):', time.perf_counter()-start)
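# A download that runs for several minutes gives no feedback until it ends.
# One possible alternative, sketched here under assumptions, is to read the
# csv in pieces with the chunksize argument of pd.read_csv and print progress
# along the way. The helper name read_csv_in_chunks and the tiny in-memory
# sample (made-up rows standing in for the real file) are hypothetical.

```python
import io
import time
import pandas as pd

def read_csv_in_chunks(source, chunksize=100000):
    """Read a csv piece by piece, print progress, return one DataFrame."""
    start = time.perf_counter()
    chunks = []
    rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        chunks.append(chunk)
        rows += len(chunk)
        print('rows read:', rows,
              ' elapsed seconds:', round(time.perf_counter() - start, 1))
    # stack the pieces once at the end
    return pd.concat(chunks, ignore_index=True)

# made-up sample standing in for the download; not real inspection data
sample = io.StringIO('id,score\n1,10\n2,12\n3,7\n4,21\n5,9\n')
demo = read_csv_in_chunks(sample, chunksize=2)
print('Dimensions:', demo.shape)
```

# For the real file, the url from the cell above would replace sample. Column
# types are then inferred chunk by chunk, so they may need checking afterwards.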
#%%
"""
311 complaints
"""
url = 'https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv'
df311 = pd.read_csv(url, nrows=10)
print('\n311 calls')
print('Dimensions:', df311.shape)
print('Variables:', list(df311))
#%%
"""
Itamar's 311 code
50k observations takes 7 minutes with ethernet connection
Paul's explanation:
The data server runs a program that tells it how to deliver the data.
The inputs to this program are passed to the server in the url after the ?.
The format is described in the SODA Query docs, which we access from the
wrench at the top of the NYC open data page:
https://nycopendata.socrata.com/
https://dev.socrata.com/docs/queries/
"""
print('\nLocal time at start:', time.ctime())
size = 1000         # rows per request
start = 0           # offset of the first row in the current request
base_url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.json'
frames = []
for i in range(50):
    url = base_url + '?$limit=%d&$offset=%d' % (size, start)
    frames.append(pd.read_json(url))
    start += size
# DataFrame.append was removed in pandas 2.0, so collect the pages in a list
# and concatenate once at the end
df_final = pd.concat(frames, ignore_index=True)
print('\n311 calls')
print('Dimensions:', df_final.shape)
print('Variables and types:\n', df_final.dtypes, sep='')
print('\nLocal time at finish:', time.ctime())
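# Paul's explanation above says the inputs go in the url after the ?. A small
# sketch of building those paged urls with urllib.parse.urlencode instead of
# string formatting. The helper name page_urls is hypothetical, and the
# $order=:id clause is an assumption based on the SODA paging guidance (a
# fixed order keeps pages from overlapping); the original code does not use it.

```python
from urllib.parse import urlencode

def page_urls(base_url, size, pages, order=':id'):
    """Yield one url per page of `size` rows."""
    for page in range(pages):
        params = {'$limit': size, '$offset': page * size, '$order': order}
        # safe='$:' leaves $ and : as typed instead of percent-encoding them
        yield base_url + '?' + urlencode(params, safe='$:')

base_url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.json'
for url in page_urls(base_url, size=1000, pages=3):
    print(url)
```

# Each url can be handed to pd.read_json as in the loop above; nothing here
# touches the network, so the urls can be checked before any data is requested.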