Warning: This document is for an old version of IntroQG.

Introduction to NumPy

Downloading and extracting the data

  1. You can start by downloading this week’s volcano data zip file, which we will be using for this lesson.

    • The Firefox browser will likely download the file to the Downloads directory in the home directory. If so, you should move the file into the home directory using the file browser.
  2. Once you have moved the volcano-data.zip file into your home directory, you can unzip the file using the unzip command in the Terminal window. In Linux or on a Mac, you can do:

    $ cd $HOME
    $ unzip volcano-data.zip
    $ ls volcano-data
    GVP-Volcano-Lat-Lon-Elev.csv GVP-Volcano-List.csv GVP_Volcano_List.xls
    

    You should now have a directory titled volcano-data in your home directory. It contains three files from version 4.5.1 of the Smithsonian Institution’s Holocene volcano database (updated 23.9.2016). The three files are:

    • GVP-Volcano-List.csv: The complete Holocene volcano database file in .csv format and with header lines. Note: Values are separated by semicolons (;) rather than commas (,) in this version of the file.
    • GVP-Volcano-List.xls: The Excel version of the complete Holocene volcano database.
    • GVP-Volcano-Lat-Lon-Elev.csv: The volcano ID, latitude, longitude and elevation for all volcanoes in the Holocene volcano database. There is no header in this file and values are separated by commas (,).

Introducing NumPy

NumPy is a library for Python designed for efficient scientific (numerical) computing. It is an essential library in Python that is used under the hood in many other modules (including Pandas). Here, we will get a sense of a few things NumPy can do.

  1. To start using the NumPy library we will need to import it.

    In [1]: import numpy as np
    

    Note that we’ve imported NumPy under the shorter alias np, so from here on we write np instead of numpy.

  2. Now we’ll import an example data file.

    In [2]: data = np.loadtxt(fname='GVP-Volcano-Lat-Lon-Elev.csv', delimiter=',')
    
    In [3]: print(data)
    [[ 2.10010e+05  5.01700e+01  6.85000e+00  6.00000e+02]
     [ 2.10020e+05  4.57750e+01  2.97000e+00  1.46400e+03]
     [ 2.10030e+05  4.21700e+01  2.53000e+00  8.93000e+02]
     ...
     [ 3.90829e+05 -6.41500e+01 -5.77500e+01  1.63000e+03]
     [ 3.90847e+05 -6.20200e+01 -5.76700e+01  5.49000e+02]
     [ 6.00000e+05  0.00000e+00  0.00000e+00  0.00000e+00]]
    

    Again, the data above is probably not very clear at this point, but is an example of data from the Smithsonian Institution’s Global Volcanism Program. Here, we have the ID number, latitude, longitude, and elevation of Holocene volcanoes in the database. Let’s see what we can do with this information.
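If you want to try np.loadtxt() before the download is in place, the following is a minimal sketch. It assumes a hypothetical in-memory stand-in for the file (the name sample_text and the use of io.StringIO are not part of the lesson) holding the first three rows printed above.

```python
import io

import numpy as np

# Three rows copied from the printed output above:
# volcano ID, latitude, longitude, elevation.
# io.StringIO stands in for the real file (an assumption for this sketch).
sample_text = """210010,50.17,6.85,600
210020,45.775,2.97,1464
210030,42.17,2.53,893
"""

# np.loadtxt accepts a file name or a file-like object
sample = np.loadtxt(io.StringIO(sample_text), delimiter=',')

print(sample.shape)  # (3, 4)
print(sample[0])     # first row of the sample
```

With the real file, the only change would be passing the file name as in the lesson.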

  3. First off, you may notice we’ve used NumPy to read in the data. Reading data in NumPy is quite similar to how things are done in Pandas, but the data are stored in a different type of data structure than a Pandas DataFrame.

    In [4]: type(data)
    Out[4]: numpy.ndarray
    

    OK, so we have something new here. NumPy has its own data types that are part of the module. In this case, our data is stored in a NumPy n-dimensional array.

  4. As in Pandas, we can also check how much data we have in our data variable.

    In [5]: print(data.shape)
    (1520, 4)
    

    1520 rows of data, 4 columns. shape is a member or attribute of data, and is part of any NumPy ndarray. Printing data.shape tells us the size of the array.
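As a small illustration of what shape holds, here is a sketch using a made-up 3 x 4 array (the name small is hypothetical). It also shows two related ndarray attributes, ndim and size.

```python
import numpy as np

# A made-up array with 3 rows and 4 columns, filled with zeros
small = np.zeros((3, 4))

# shape is a tuple, so it can be unpacked into rows and columns
n_rows, n_cols = small.shape
print(n_rows, n_cols)  # 3 4

print(small.ndim)  # 2  (number of dimensions)
print(small.size)  # 12 (total number of values, rows * columns)
```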

  5. We can also check the data type of our data-columns by calling data.dtype, which is again similar to Pandas.

    In [6]: print(data.dtype)
    float64
    

    OK, so it seems that all the data in our file is float data type, i.e., decimal numbers (stored with a precision of 64 bits, which is 8 bytes per value).
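The size of each stored value can be checked directly. This sketch uses a short made-up array (values is a hypothetical name) and the itemsize attribute, which reports bytes per element.

```python
import numpy as np

# Python floats become float64 values by default
values = np.array([50.17, 45.775, 42.17])

print(values.dtype)         # float64
print(values.itemsize)      # 8  (bytes per value)
print(values.itemsize * 8)  # 64 (bits per value)
```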

  6. It is also possible to change the data type of the data which can be useful sometimes. Let’s take a copy of our data and convert our dataset into integer numbers.

    # Take a copy of the data
    In [7]: copy = data.copy()
    
    # Convert to integer values
    In [8]: copy = copy.astype(int)
    
    In [9]: print(copy)
    [[210010     50      6    600]
     [210020     45      2   1464]
     [210030     42      2    893]
     ...
     [390829    -64    -57   1630]
     [390847    -62    -57    549]
     [600000      0      0      0]]
    

    This is again quite similar to how things work in Pandas.
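Note that the conversion above drops the decimal part rather than rounding (the longitude -57.75 became -57). The sketch below, using a few made-up variable names, compares plain astype(int) with rounding first via np.rint().

```python
import numpy as np

# Three coordinate values taken from the data shown above
coords = np.array([50.17, -64.15, -57.75])

# astype(int) discards the fractional part (truncates toward zero)
truncated = coords.astype(int)

# Rounding to the nearest whole number first gives a different result
rounded = np.rint(coords).astype(int)

print(truncated)  # [ 50 -64 -57]
print(rounded)    # [ 50 -64 -58]
```

Which behavior you want depends on the data, so it is worth checking before converting.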

  7. Within the array, we can find any value by using its index.

    In [10]: data[0,0]
    Out[10]: 210010.0
    

    This gives us the value stored in the first row and first column of data. Note that to refer to a location in an array you use the square brackets [ ] just like for lists. Remember, index values start at zero, not one, and the first row and column refers to the top left value in the array. What will happen if we try to find data[1520,0]? Try it!
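Once you have tried it, the following sketch may help explain the result. It uses a small made-up array (grid is a hypothetical name) with 3 rows, so the valid row indices are 0, 1 and 2.

```python
import numpy as np

# 3 rows and 4 columns holding the numbers 0 to 11
grid = np.arange(12).reshape(3, 4)

print(grid[2, 0])   # 8, first value in the last row
print(grid[-1, 0])  # 8, negative indices count from the end

# Asking for a row that does not exist raises an IndexError
try:
    grid[3, 0]
except IndexError as err:
    print("IndexError:", err)
```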

  8. 1520 volcanoes is quite a few to deal with at the same time. We can explore our data more easily by using index slicing to extract part of the array. Let’s start with just the latitude and longitude for the first five rows.

    In [11]: data[0:5, 1:3]
    Out[11]: 
    array([[50.17 ,  6.85 ],
           [45.775,  2.97 ],
           [42.17 ,  2.53 ],
           [38.87 , -4.02 ],
           [43.25 , 10.87 ]])
    

    Nice! Note that in this case, the range of index values for the first 5 rows is written 0:5. The data extracted will start at index 0 and go up to, but not include, index 5. Be careful with this. We can also extract data for all columns by giving a bare colon (:) in place of an index range.

    In [12]: data[0:2, :]
    Out[12]: 
    array([[2.1001e+05, 5.0170e+01, 6.8500e+00, 6.0000e+02],
           [2.1002e+05, 4.5775e+01, 2.9700e+00, 1.4640e+03]])
    

    This is handy when you want complete rows, such as every value stored for the first two volcanoes.
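The slicing rules can be checked on a small made-up array. This is only a sketch; grid and the other names are hypothetical.

```python
import numpy as np

# 5 rows and 4 columns holding the numbers 0 to 19
grid = np.arange(20).reshape(5, 4)

# Rows 0 and 1, all columns (the stop index 2 is not included)
first_two = grid[0:2, :]
print(first_two.shape)  # (2, 4)

# All five rows, columns 1 and 2 only
middle_cols = grid[0:5, 1:3]
print(middle_cols.shape)  # (5, 2)

# A start index of 0 and a trailing ':' can both be left out
print(np.array_equal(grid[0:2, :], grid[:2]))  # True
```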

  9. We can also use index slicing to separate our data into different variables to make it easier to work with.

    In [13]: Latitude = data[:,1]
    
    In [14]: print(Latitude)
    [ 50.17   45.775  42.17  ... -64.15  -62.02    0.   ]
    

    For many data files, this is a nice way to interact with only the data of your own interest.
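The same idea extends to every column. The sketch below assumes a three-row sample copied from the output shown earlier and uses hypothetical variable names; transposing with .T and unpacking is one possible shortcut, not the only way to do it.

```python
import numpy as np

# Three rows copied from the printed data: ID, latitude, longitude, elevation
sample = np.array([[210010, 50.17, 6.85, 600],
                   [210020, 45.775, 2.97, 1464],
                   [210030, 42.17, 2.53, 893]])

# One column at a time, as in the lesson
latitude = sample[:, 1]
print(latitude)        # the three latitude values
print(latitude.shape)  # (3,)

# All columns at once: .T swaps rows and columns, so each
# row of sample.T is one column of sample
volcano_id, lat, lon, elevation = sample.T
print(np.array_equal(lat, latitude))  # True
```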

Attention

Create a list called dataStr where you append all of our data array columns one by one in string (str) format. Use a for loop for iterating over the columns.

Useful functions

  1. Often you need to create an array yourself rather than read it from a data file, for example a variable that spans a range from one value to another. If we wanted to calculate the sin() of a variable x at 10 points from \(0\) to \(2\pi\), we could do the following.

    In [15]: x = np.linspace(0., 2 * np.pi, 10)
    
    In [16]: print(x)
    [0.         0.6981317  1.3962634  2.0943951  2.7925268  3.4906585
     4.1887902  4.88692191 5.58505361 6.28318531]
    
    In [17]: y = np.sin(x)
    
    In [18]: print(y)
    [ 0.00000000e+00  6.42787610e-01  9.84807753e-01  8.66025404e-01
      3.42020143e-01 -3.42020143e-01 -8.66025404e-01 -9.84807753e-01
     -6.42787610e-01 -2.44929360e-16]
    

    In this case, x starts at zero and goes to \(2\pi\) in 10 evenly spaced values, with both end points included. Alternatively, if we wanted to specify the size of the increments for a new variable x2, we could use the np.arange() function.

    In [19]: x2 = np.arange(0.0, 2 * np.pi, 0.5)
    
    In [20]: print(x2)
    [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5.  5.5 6. ]
    

    In this case, x2 starts at zero and goes to the largest value that is smaller than \(2\pi\) by increments of 0.5. Both of these types of array options are useful in different situations.
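The difference between the two functions can be summarized in a short sketch: np.linspace() fixes the number of values and includes the end point, while np.arange() fixes the step size and stops before the end point.

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 10)   # 10 values, end point included
x2 = np.arange(0.0, 2 * np.pi, 0.5)   # step of 0.5, end point excluded

print(len(x))       # 10
print(x[1] - x[0])  # spacing is 2*pi / 9, since 10 values make 9 gaps

print(len(x2))      # 13
print(x2[-1])       # 6.0, the last value below 2*pi
```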

  2. Like normal variables, array variables can also be used for various mathematical operations.

    In [21]: doublex = x * 2.0
    
    In [22]: print(doublex)
    [ 0.          1.3962634   2.7925268   4.1887902   5.58505361  6.98131701
      8.37758041  9.77384381 11.17010721 12.56637061]
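Operations like this are applied to every value in the array at once. A minimal sketch with a few more operations (shifted and squared are hypothetical names) follows.

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 10)

doublex = x * 2.0    # multiply every value by 2
shifted = x - np.pi  # subtract pi from every value
squared = x ** 2     # square every value

# Arrays of the same shape are combined value by value
print(np.allclose(doublex, x + x))  # True
print(shifted[0], shifted[-1])      # -pi and +pi
print(squared.shape)                # (10,)
```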
    
  3. In addition to the attributes we saw previously for NumPy ndarray variables, there are also many methods that are part of the ndarray data type.

    In [23]: print(x.mean())
    3.141592653589793
    
    In [24]: print(doublex.mean())
    6.283185307179586
    

    No surprises here. If we think of variables as nouns, methods are verbs, actions for the variable values.

    Note

    When using methods, you always include the parentheses () to be clear we are referring to a method and not an attribute. There are many other useful ndarray methods, such as x.min(), x.max(), and x.std() (standard deviation).

  4. Methods can also act on part of an array.

    In [25]: print(x[0:5].mean())
    1.3962634015954634
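The other methods mentioned in the note work the same way. This sketch applies them to the same x as above, both to the whole array and to a slice.

```python
import numpy as np

x = np.linspace(0.0, 2 * np.pi, 10)

# Methods need parentheses
print(x.min())  # 0.0
print(x.max())  # 6.283185307179586
print(x.std())  # standard deviation of all 10 values

# Methods on a slice use only the selected values
print(x[0:5].max())  # largest of the first five values

# Attributes do not take parentheses
print(x.shape)  # (10,)
```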