Question on bigdata.py #14
Comments
Lines 57 through 68 randomly sample the large dataset using NumPy code. Only 8 datapoints are loaded at each step of gradient descent. However, with each pass of the loop a different random sample is drawn.
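A minimal sketch of the kind of sampling described above (the array and names here are illustrative stand-ins, not the repository's actual lines 57–68):

```python
import numpy as np

# Stand-in for the generated "big data" population: 8M rows of (x, y) pairs.
_N = 8_000_000
_BATCH = 8
data = np.random.randn(_N, 2).astype(np.float32)

# One step of the loop: draw 8 random row indices and load only those rows.
idx = np.random.randint(0, _N, _BATCH)
batch = data[idx]   # shape (8, 2); this is all that one gradient step sees
```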
Thanks a lot, Jared. I might not have made myself clear. The "bigdata.py" script is supposed to demonstrate that we can handle a "large" volume of data with TensorFlow. Because we only sample 8 points at a time, TensorFlow still deals with a very small amount of data, so I fail to see how it scales up to a "large" volume in the "bigdata.py" script. Both "bigdata.py" and "tensor.py" run "_EPOCH" times; the only difference is that "tensor.py" uses the same data for each loop, while "bigdata.py" samples different data from a large population.
Tensor.py shows you how to process samples in parallel. If you have a GPU, you can increase _BATCH to something like 100, 1,000, or even 10,000, and using tensors it will run in parallel. There are two reasons why you might want a bigger _BATCH. (1) You can get away with a larger step size and use fewer _EPOCHs. (2) You can handle data with more "variance": if you are classifying between 100 outcomes, at a minimum you want a batch size on the order of 100 (otherwise convergence with stochastic gradient descent will be slow). But you're right, you will still need to run through many _EPOCHs.
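As a rough illustration of that point: because a placeholder can be declared with a `None` leading dimension, the same graph accepts a batch of 8 or 10,000 without any change, and only the fed array grows (TF 1.x-style API shown via `tensorflow.compat.v1`; the names here are assumptions, not the repository's code):

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Leading dimension left as None, so any batch size can be fed per step.
x = tf.placeholder(tf.float32, shape=[None, 2])
mean = tf.reduce_mean(x)

data = np.random.randn(100_000, 2).astype(np.float32)
with tf.Session() as sess:
    for _BATCH in (8, 1_000, 10_000):
        idx = np.random.randint(0, data.shape[0], _BATCH)
        print(_BATCH, sess.run(mean, feed_dict={x: data[idx]}))
```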
Many thanks for your time and patience, Jared. I think I was stuck on the fact that "bigdata.py" generates a big dataset, but the regression only samples 8 points 10,000 times. Therefore, the script uses at most 80,000 of the 8 million data points and leaves 7.92M unused. Perhaps, as you suggested, the script could use a larger _BATCH, like 1,000, to simulate getting a data feed from a large dataset.
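For concreteness, the arithmetic behind that estimate, using the figures mentioned in this thread:

```python
_N = 8_000_000      # points generated by bigdata.py (per the thread)
_BATCH = 8          # points sampled per step
_EPOCH = 10_000     # number of steps

touched = _BATCH * _EPOCH               # at most 80,000 points (ignoring repeated draws)
print(touched, f"{touched / _N:.1%}")   # 80000, 1.0%
```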
It works because the extra 7.92M datapoints are very similar to the first 80,000 datapoints. In circumstances where this is not the case, you might consider running stochastic gradient descent much, much longer to cover all the datapoints (serial) or using a larger batch size (parallel).
That's my point. Except for generating the extra 7.92M random points, the "bigdata.py" script is identical to the "tensor.py" script. Therefore, I am not sure I get the purpose of the "bigdata.py" script. :-)
The point is to explain how to use placeholders. If you don't use placeholders, the amount of data that can be handled by TensorFlow is limited.
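A minimal sketch of the placeholder pattern being described (TF 1.x-style, shown through `tensorflow.compat.v1`; the toy loss and variable names are illustrative, not the repository's code). Without a placeholder the whole dataset would have to be baked into the graph as a constant; with one, only the current mini-batch crosses into the session on each step:

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

data = np.random.randn(8_000_000, 2).astype(np.float32)   # the "large" population

x = tf.placeholder(tf.float32, shape=[None, 2])   # data enters here, one batch at a time
w = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(x[:, 1] - w * x[:, 0]))   # toy linear regression
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10_000):                               # _EPOCH-style loop
        idx = np.random.randint(0, data.shape[0], 8)      # fresh batch of 8 per step
        sess.run(train, feed_dict={x: data[idx]})         # only 8 rows feed the graph
```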
Thanks for setting this up. I am wondering about "bigdata.py". It appears to me that the code does not use all the data from the "big data" population and only samples 8 points at a time. That is no different from "tensor.py", which just uses the same 8 points over and over. Can you elaborate? Thanks.