Introduction to NumPy
Downloading and extracting the data
You can start by downloading this week’s volcano data zip file, which we will be using for this lesson.
- The Firefox browser will likely download the file to the Downloads directory in the home directory. If so, you should move the file into the home directory using the file browser.
Once you have moved the volcano-data.zip file into your home directory, you can unzip the file using the unzip command in the Terminal window. In Linux or on a Mac, you can do:

    $ cd $HOME
    $ unzip volcano-data.zip
    $ ls volcano-data
    GVP-Volcano-Lat-Lon-Elev.csv  GVP-Volcano-List.csv  GVP_Volcano_List.xls
You should now have a directory titled volcano-data in your home directory. It contains three files from version 4.5.1 of the Smithsonian Institution’s Holocene volcano database (updated 23.9.2016). The three files are:
- GVP-Volcano-List.csv: The complete Holocene volcano database file in .csv format and with header lines. Note: Values are separated by semicolons (;) in this version of the file rather than commas (,).
- GVP-Volcano-List.xls: The Excel version of the complete Holocene volcano database.
- GVP-Volcano-Lat-Lon-Elev.csv: The volcano ID, latitude, longitude and elevation for all volcanoes in the Holocene volcano database. There is no header in this file and values are separated by commas (,).
Introducing NumPy
NumPy is a library for Python designed for efficient scientific (numerical) computing. It is an essential library in Python that is used under the hood in many other modules (including Pandas). Here, we will get a sense of a few things NumPy can do.
To start using the NumPy library we will need to import it.

    In [1]: import numpy as np
Note that we’ve imported NumPy, shortening numpy to np.
Now we’ll import an example data file.

    In [2]: data = np.loadtxt(fname='GVP-Volcano-Lat-Lon-Elev.csv', delimiter=',')

    In [3]: print(data)
    [[ 2.10010e+05  5.01700e+01  6.85000e+00  6.00000e+02]
     [ 2.10020e+05  4.57750e+01  2.97000e+00  1.46400e+03]
     [ 2.10030e+05  4.21700e+01  2.53000e+00  8.93000e+02]
     ...
     [ 3.90829e+05 -6.41500e+01 -5.77500e+01  1.63000e+03]
     [ 3.90847e+05 -6.20200e+01 -5.76700e+01  5.49000e+02]
     [ 6.00000e+05  0.00000e+00  0.00000e+00  0.00000e+00]]

The data above is probably not very clear at this point, but it is an example of data from the Smithsonian Institution’s Global Volcanism Program. Here, we have the ID number, latitude, longitude, and elevation of Holocene volcanoes in the database. Let’s see what we can do with this information.
First off, you may notice we’ve used NumPy to read in the data. Reading data in NumPy is quite similar to how things are done in Pandas, but the data are stored in a different type of data structure than a Pandas DataFrame.

    In [4]: type(data)
    Out[4]: numpy.ndarray

OK, so we have something new here. NumPy has its own data types that are part of the module. In this case, our data is stored in a NumPy n-dimensional array (ndarray).
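If the data file is not at hand, the same np.loadtxt() call can be tried on in-memory text. The sketch below is only an illustration: the three rows are copied from the output printed above, and the names sample_text and sample are made up for this example.

```python
import io

import numpy as np

# Three comma-separated rows copied from the printed output above
# (ID, latitude, longitude, elevation). sample_text stands in for the file.
sample_text = """210010,50.17,6.85,600
210020,45.775,2.97,1464
210030,42.17,2.53,893
"""

# np.loadtxt accepts a file-like object as well as a file name
sample = np.loadtxt(io.StringIO(sample_text), delimiter=',')

print(type(sample))  # <class 'numpy.ndarray'>
print(sample.ndim)   # 2 (rows and columns)
```

Wrapping the text in io.StringIO is just a convenience for experimenting; with the real file you would pass the file name as in the lesson.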
As in Pandas, we can also check how much data we have in our data variable.

    In [5]: print(data.shape)
    (1520, 4)

There are 1520 rows of data and 4 columns. shape is a member, or attribute, of data, and is part of any NumPy ndarray. Printing data.shape tells us the size of the array.
We can also check the data type of our data columns with the data.dtype attribute, which is again similar to Pandas.

    In [6]: print(data.dtype)
    float64

OK, so it seems that all the data in our file is of the float data type, i.e., decimal numbers (stored with a precision of 64 bits).
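The storage size behind float64 can be checked directly. A small sketch, using a made-up array named values rather than the volcano data:

```python
import numpy as np

values = np.array([50.17, 45.775, 42.17])  # hypothetical example values

print(values.dtype)           # float64
print(values.dtype.itemsize)  # 8 (bytes per value, i.e. 64 bits)
print(values.nbytes)          # 24 (3 values x 8 bytes)
```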
It is also possible to change the data type of the data, which can be useful sometimes. Let’s take a copy of our data and convert our dataset into integer numbers.

    # Take a copy of the data
    In [7]: copy = data.copy()

    # Convert to integer values
    In [8]: copy = copy.astype(int)

    In [9]: print(copy)
    [[210010     50      6    600]
     [210020     45      2   1464]
     [210030     42      2    893]
     ...
     [390829    -64    -57   1630]
     [390847    -62    -57    549]
     [600000      0      0      0]]

This is again quite similar to how things work in Pandas.
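Note from the output above that astype(int) drops the fractional part rather than rounding (for example, 6.85 became 6 and -57.75 became -57). If rounding is what you want, one option is to round first and convert afterwards. A minimal sketch, with a made-up array named coords holding a few values from the output above:

```python
import numpy as np

coords = np.array([50.17, 6.85, -64.15, -57.75])  # illustrative values

truncated = coords.astype(int)         # fractional part is dropped
rounded = np.rint(coords).astype(int)  # round to the nearest integer first

print(truncated.tolist())  # [50, 6, -64, -57]
print(rounded.tolist())    # [50, 7, -64, -58]
```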
Within the array, we can find any value by using its index.

    In [10]: data[0,0]
    Out[10]: 210010.0

This gives us the value stored in the first row and first column of data. Note that to refer to a location in an array you use the square brackets [ ] just like for lists. Remember, index values start at zero, not one, and the first row and column refers to the top left value in the array. What will happen if we try to find data[1520,0]? Try it!
1520 volcanoes is quite a few to deal with at the same time. We can explore our data more easily by using index slicing to extract part of the array. Let’s start with just the latitude and longitude for the first five rows.

    In [11]: data[0:5, 1:3]
    Out[11]:
    array([[50.17 ,  6.85 ],
           [45.775,  2.97 ],
           [42.17 ,  2.53 ],
           [38.87 , -4.02 ],
           [43.25 , 10.87 ]])

Nice! Note that in this case, the range of index values for the first 5 rows is written 0:5. The data extracted will start at index 0 and go up to, but not include, index 5. Be careful with this. We can also extract data for all columns without listing any index range at all.

    In [12]: data[0:2, :]
    Out[12]:
    array([[2.1001e+05, 5.0170e+01, 6.8500e+00, 6.0000e+02],
           [2.1002e+05, 4.5775e+01, 2.9700e+00, 1.4640e+03]])

This is useful when you want complete rows without having to spell out the range of columns.
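The slicing rules used above can be tried on a small array where the result is easy to check by eye. The 4 x 3 array grid below is made up for this sketch and is not part of the volcano data.

```python
import numpy as np

# A small 4 x 3 stand-in array holding the values 0..11
grid = np.arange(12).reshape(4, 3)

print(grid[0:2, :].shape)     # (2, 3): rows 0 and 1, all columns
print(grid[:, 0:2].shape)     # (4, 2): all rows, columns 0 and 1
print(grid[2:, 1].tolist())   # [7, 10]: rows 2 onwards, column 1
print(grid[-1, :].tolist())   # [9, 10, 11]: the last row
```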
We can also use index slicing to separate our data into different variables to make it easier to work with.

    In [13]: Latitude = data[:,1]

    In [14]: print(Latitude)
    [ 50.17   45.775  42.17  ... -64.15  -62.02    0.   ]

For many data files, this is a nice way to work with only the data you are interested in.
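One detail worth knowing when separating columns this way: a basic slice of a NumPy array is a view that shares memory with the original array, so changing the slice also changes the original. The sketch below uses a made-up two-row array called table to show the difference between a view and an explicit copy.

```python
import numpy as np

table = np.array([[1.0, 50.17],
                  [2.0, 45.775]])  # made-up stand-in for data

lat_view = table[:, 1]         # slice: shares memory with table
lat_copy = table[:, 1].copy()  # independent copy

lat_view[0] = -999.0           # this also changes table[0, 1]

print(table[0, 1])                        # -999.0
print(lat_copy[0])                        # 50.17
print(np.shares_memory(lat_view, table))  # True
```

If you plan to modify a column without touching the full array, taking a copy as in lat_copy is the safer choice.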
Attention
Create a list called dataStr where you append all of our data array columns one by one in string (str) format. Use a for loop for iterating over the columns.
Useful functions
It is common to need to create your own arrays not from a data file, but to make a variable that has a range from one value to another. If we wanted to calculate the sin() of a variable x at 10 points from \(0\) to \(2\pi\), we could do the following.

    In [15]: x = np.linspace(0., 2 * np.pi, 10)

    In [16]: print(x)
    [0.         0.6981317  1.3962634  2.0943951  2.7925268  3.4906585
     4.1887902  4.88692191 5.58505361 6.28318531]

    In [17]: y = np.sin(x)

    In [18]: print(y)
    [ 0.00000000e+00  6.42787610e-01  9.84807753e-01  8.66025404e-01
      3.42020143e-01 -3.42020143e-01 -8.66025404e-01 -9.84807753e-01
     -6.42787610e-01 -2.44929360e-16]

In this case, x starts at zero and goes to \(2\pi\) in 10 evenly spaced points (that is, 9 equal increments). Alternatively, if we wanted to specify the size of the increments for a new variable x2, we could use the np.arange() function.

    In [19]: x2 = np.arange(0.0, 2 * np.pi, 0.5)

    In [20]: print(x2)
    [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6. ]

In this case, x2 starts at zero and goes to the largest value that is smaller than \(2\pi\) by increments of 0.5. Both of these types of array options are useful in different situations.
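The difference between the two functions can be summarised in a short sketch: np.linspace() fixes the number of points and includes the end value, while np.arange() fixes the step size and excludes the end value. The names pts, step and steps are made up for this example.

```python
import numpy as np

# linspace: fix the number of points; the end value is included
pts, step = np.linspace(0.0, 2 * np.pi, 10, retstep=True)
print(len(pts))                          # 10
print(np.isclose(step, 2 * np.pi / 9))   # True: 10 points span 9 intervals

# arange: fix the step size; the end value is excluded
steps = np.arange(0.0, 2 * np.pi, 0.5)
print(len(steps))                        # 13
print(steps[-1])                         # 6.0
```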
Like normal variables, array variables can also be used for various mathematical operations.

    In [21]: doublex = x * 2.0

    In [22]: print(doublex)
    [ 0.          1.3962634   2.7925268   4.1887902   5.58505361  6.98131701
      8.37758041  9.77384381 11.17010721 12.56637061]

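Arithmetic also works between two arrays of the same shape, in which case the operation is applied element by element. As a quick sketch, the identity \(\sin^2 x + \cos^2 x = 1\) can be checked at every point of x at once (the variable name identity is made up for this example):

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 10)

# Each operation below acts on all 10 elements at once
identity = np.sin(x) ** 2 + np.cos(x) ** 2

print(identity.shape)              # (10,)
print(np.allclose(identity, 1.0))  # True
```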
In addition to the attributes we saw previously for NumPy ndarray variables, there are also many methods that are part of the ndarray data type.

    In [23]: print(x.mean())
    3.141592653589793

    In [24]: print(doublex.mean())
    6.283185307179586

No surprises here. If we think of variables as nouns, methods are verbs, actions for the variable values.
Note
When using methods, you always include the parentheses () to be clear we are referring to a method and not an attribute. There are many other useful ndarray methods, such as x.min(), x.max(), and x.std() (standard deviation).
Methods can also act on part of an array.

    In [25]: print(x[0:5].mean())
    1.3962634015954634
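For two-dimensional arrays such as data, most of these methods also accept an axis argument, so a value can be computed per column or per row instead of over the whole array. A minimal sketch with a made-up 3 x 2 array named coords:

```python
import numpy as np

# Made-up 3 x 2 array standing in for a few latitude/longitude rows
coords = np.array([[50.0, 6.0],
                   [46.0, 3.0],
                   [42.0, 3.0]])

print(coords.mean())                 # 25.0: mean of all six values
print(coords.mean(axis=0).tolist())  # [46.0, 4.0]: one mean per column
print(coords.max(axis=1).tolist())   # [50.0, 46.0, 42.0]: one maximum per row
```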