This week seems to have been all about data. I populated my bundle with 7 popular, open source datasets for examples and experimentation. They are widely used by the Python neural network crowd and ship with many of the libraries, so I thought it would be valuable to include them in an HPCC format for use with the bundle.
I also included the Python code (in the form of Jupyter notebooks) that I used to convert the datasets from their original format into a form that is easily sprayed onto an HPCC cluster.
For example, the MNIST dataset consists of 28x28 images of handwritten digits, each labeled with the digit it shows (0-9). It is originally in a ubyte format (see the source for more details: http://yann.lecun.com/exdb/mnist/ ). The converted output, when sprayed (Fixed, size=785), produces two HPCC datasets, one with 60k rows for training and one with 10k rows for testing, where each row has an integer for the label and the pixel data stored as DATA. The record size of 785 is consistent with 28x28 = 784 pixel bytes plus one byte for the label.
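To give a feel for the conversion described above, here is a minimal sketch of turning the MNIST ubyte (IDX) files into fixed-width 785-byte records. It is not the code from the notebooks in the bundle. The function name `idx_to_fixed_records` is made up for illustration, and putting the label byte first in each record is an assumption about the layout. The header layout (big-endian magic numbers 2051 for images and 2049 for labels) comes from the MNIST source page.

```python
import struct

# Assumed record layout: 1 label byte followed by 28*28 pixel bytes.
RECORD_SIZE = 1 + 28 * 28  # 785, matching the fixed spray size


def idx_to_fixed_records(image_bytes: bytes, label_bytes: bytes) -> bytes:
    """Interleave IDX labels and images into fixed-width records.

    Hypothetical helper for illustration; the label-first ordering
    is an assumption, not something confirmed by the bundle.
    """
    # Image file header: magic, image count, rows, cols (big-endian uint32).
    magic, n_images, rows, cols = struct.unpack(">IIII", image_bytes[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")

    # Label file header: magic, label count (big-endian uint32).
    magic, n_labels = struct.unpack(">II", label_bytes[:8])
    if magic != 2049:
        raise ValueError("not an IDX label file")
    if n_images != n_labels:
        raise ValueError("image and label counts differ")

    pixels_per_image = rows * cols
    pixels = image_bytes[16:]
    labels = label_bytes[8:]
    if len(pixels) != n_images * pixels_per_image or len(labels) != n_labels:
        raise ValueError("file body does not match header counts")

    out = bytearray()
    for i in range(n_images):
        out.append(labels[i])  # one unsigned byte for the digit label
        out += pixels[i * pixels_per_image:(i + 1) * pixels_per_image]
    return bytes(out)
```

To use it, read the decompressed image and label files as bytes, pass them in, and write the result to a single binary file. That file can then be sprayed as fixed-width with a record size of 785.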
The datasets are: CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST, Reuters, IMDB, and Boston Housing.