NumPy#

NumPy (Numerical Python) is the core module for numerical computation in Python. NumPy contains a fast and memory-efficient implementation of a list-like array data structure and it contains useful linear algebra and random number functions. A large portion of NumPy is actually written in the C programming language.

A NumPy array is similar to Python’s list data structure. A Python list can contain any combination of element types: integers, floats, strings, functions, objects, etc. A NumPy array, on the other hand, must contain only one element type at a time. This way, NumPy arrays can be much faster and more memory efficient.
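As a small illustration of the single-element-type rule, the sketch below builds an array from a list that mixes integers and floats. The variable names are made up for this example.

```python
import numpy as np

mixed = [1, 2.5, 3]        # a Python list may mix ints and floats
arr = np.array(mixed)      # NumPy converts everything to one common type

print(arr.dtype)           # float64
print(arr.itemsize)        # bytes used per element
print(arr.nbytes)          # total bytes of element data (itemsize * size)
```

Because every element has the same fixed size, NumPy can store the data in one contiguous block of memory instead of a collection of separate Python objects.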

Both the Pandas module (for data analysis) and the Scikit-Learn module (for machine learning) are built upon the NumPy module. The Matplotlib module (for plotting) also plays nicely with NumPy. These four modules plus base Python are practically all you need for basic to intermediate machine learning.

Two other fundamental Python modules are closely related to machine learning, though we will not cover them in our tutorials:

  • SciPy: This module is for numerical computing, including integration, differentiation, optimization, and probability distributions.

  • StatsModels: This module provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Let’s import NumPy with the usual convention of np.

import numpy as np

Creating arrays with NumPy#

NumPy’s array class is called ndarray (the n-dimensional array). It is also known by the name array.

  • In a NumPy array, each dimension is called an axis and the number of axes is called the rank.

    • For example, a 3x4 matrix is an array of rank 2 (it is 2-dimensional).

    • The first axis has length 3, the second has length 4.

  • An array’s list of axis lengths is called the shape of the array.

    • For example, a 3x4 matrix’s shape is (3, 4).

    • The rank is equal to the shape’s length.

  • The size of an array is the total number of elements, which is the product of all axis lengths (e.g., 3*4=12).

np.array#

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.

arr1 = np.array([2, 10.2, 5.4, 80, 0])
arr1
array([ 2. , 10.2,  5.4, 80. ,  0. ])

Nested sequences, like a list of equal-length lists, will be converted into a multi-dimensional array:

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
arr2.shape
(2, 4)
arr2.ndim  # equal to len(arr2.shape)
2
arr2.size
8

Other functions to create arrays#

There are several other convenience NumPy functions to create arrays.

np.zeros#

Creates an array containing any number of zeros.

np.zeros(5)
array([0., 0., 0., 0., 0.])

It’s just as easy to create a 2-D array (i.e., a matrix) by providing a tuple with the desired number of rows and columns. For example, here’s a 2x3 matrix:

np.zeros((2, 3))  # notice the double parentheses
array([[0., 0., 0.],
       [0., 0., 0.]])

np.ones#

Produces an array of all ones.

np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])

To create an array filled with a single repeated value, you can multiply an array of ones by that value:

(np.pi * np.ones((3,4))).round(2)
array([[3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14]])
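NumPy also provides np.full, which takes the shape and the fill value directly. A minimal sketch, with illustrative variable names:

```python
import numpy as np

filled = np.full((3, 4), 3.14)   # shape first, then the fill value
print(filled)

# Same result as multiplying an array of ones and rounding
same = np.allclose(filled, (np.pi * np.ones((3, 4))).round(2))
print(same)   # True
```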

np.arange#

This is similar to Python’s built-in range function, but it returns a NumPy array instead of a range object.

np.arange(5)
array([0, 1, 2, 3, 4])
np.arange(1, 5)
array([1, 2, 3, 4])

It also works with floats:

np.arange(1.0, 5.0)
array([1., 2., 3., 4.])

Of course, you can provide a step parameter:

np.arange(1, 5, step = 0.5)
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

np.linspace#

This is similar to seq() in R. Its inputs are (start, stop, number of elements) and it returns evenly-spaced numbers over a specified interval. By default, the stop value is included.

np.linspace(0, 10, 6)
array([ 0.,  2.,  4.,  6.,  8., 10.])
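When the step is a float, np.arange is subject to floating-point rounding, so the number of elements it returns can be surprising. If you know how many points you want, np.linspace is usually the safer choice. The sketch below prints the length of an np.arange result rather than assuming it, and shows the endpoint argument of np.linspace.

```python
import numpy as np

# With a float step, check the length instead of assuming it.
print(len(np.arange(1, 1.3, 0.1)))

# np.linspace takes the number of points directly.
pts = np.linspace(0, 1, 5)                         # stop value included
half_open = np.linspace(0, 1, 5, endpoint=False)   # stop value excluded
print(pts)         # [0.   0.25 0.5  0.75 1.  ]
print(half_open)
```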

np.quantile#

Computes the q-th quantile of its input. It plays nicely with np.linspace.

a = np.arange(1, 21)
print('a =', a)
quartiles = np.linspace(0, 1, 5)
print('quartiles =', quartiles)
a = [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
quartiles = [0.   0.25 0.5  0.75 1.  ]
np.quantile(a, 0.5)  # how to compute the median
np.float64(10.5)
np.quantile(a, quartiles)
array([ 1.  ,  5.75, 10.5 , 15.25, 20.  ])

np.random.rand and np.random.randn#

A number of functions are available in NumPy’s random module to create arrays initialized with random values. For example, here is a matrix initialized with random floats between 0 and 1 (uniform distribution):

np.random.rand(2,3).round(3)
array([[0.867, 0.295, 0.074],
       [0.419, 0.619, 0.623]])

Here’s a matrix containing random floats sampled from a univariate normal distribution (Gaussian distribution) with mean 0 and variance 1:

np.random.randn(2,3).round(3)
array([[-0.102, -1.267,  0.367],
       [ 0.564,  1.524,  0.681]])
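The values above change on every run. If you need reproducible results, one option is NumPy's Generator interface, created with np.random.default_rng and a seed. This is a minimal sketch; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # 42 is an arbitrary seed
u = rng.random((2, 3))                 # uniform floats in [0, 1)
n = rng.standard_normal((2, 3))        # normal with mean 0 and variance 1

# A second generator with the same seed reproduces the same draws.
rng2 = np.random.default_rng(seed=42)
u2 = rng2.random((2, 3))
print(np.array_equal(u, u2))   # True
```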

Data types for arrays#

| Type | Description |
|---|---|
| int16 | 16-bit integer |
| int32 | 32-bit integer |
| int64 | 64-bit integer |
| float16 | Half-precision floating point |
| float32 | Standard single-precision floating point |
| float64 | Standard double-precision floating point |
| bool | Boolean (True or False) |
| bytes_ | Fixed-length byte string (named string_ before NumPy 2.0) |
| object | A value can be any Python object |

np.array.dtype#

NumPy’s arrays are also efficient in part because all their elements must have the same type (usually numbers). You can check what the data type is by looking at the dtype attribute.

arr1 = np.array([1, 2, 3], dtype = np.float64)
print("Data type name:", arr1.dtype.name)
Data type name: float64
arr2 = np.array([1, 2, 3], dtype = np.int32)
print(arr2.dtype, arr2)
int32 [1 2 3]

np.array.astype#

You can explicitly convert or cast an array from one dtype to another using the astype method.

arr2.dtype
dtype('int32')
arr2 = arr2.astype(np.float64)
arr2.dtype # integers are now cast to floating point
dtype('float64')

If you have an array of strings representing numbers, you can convert them to numeric form with a list comprehension or, more directly, with astype.

arr3 = np.array(['1.25', '-9.6', '42'])

numeric_strings = np.array([float(x) for x in arr3])

numeric_strings
array([ 1.25, -9.6 , 42.  ])
arr3.astype(float)  # same result; astype returns a new array and leaves arr3 unchanged
array([ 1.25, -9.6 , 42.  ])
numeric_strings.dtype
dtype('float64')
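One thing to watch for when casting: converting floats to an integer dtype with astype drops the fractional part rather than rounding. A short sketch, with illustrative variable names:

```python
import numpy as np

x = np.array([1.7, -2.9, 3.2])
truncated = x.astype(np.int64)          # fractional part is dropped
rounded = np.rint(x).astype(np.int64)   # round to the nearest integer first

print(truncated)   # [ 1 -2  3]
print(rounded)     # [ 2 -3  3]
```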

Arithmetic operations on arrays#

All the usual arithmetic operators (+, -, *, /, //, **, etc.) can be used with arrays. They apply element-wise.

a = np.array([14, 23, 32, 41])
b = np.array([5,  4,  3,  2])
print("a + b  =", a + b)
print("a - b  =", a - b)
print("a * b  =", a * b)
print("a / b  =", a / b)
print("a // b  =", a // b)
print("a % b  =", a % b)
print("a ** b =", a ** b)
a + b  = [19 27 35 43]
a - b  = [ 9 19 29 39]
a * b  = [70 92 96 82]
a / b  = [ 2.8         5.75       10.66666667 20.5       ]
a // b  = [ 2  5 10 20]
a % b  = [4 3 2 1]
a ** b = [537824 279841  32768   1681]

Note that the multiplication is not a matrix multiplication.

When the arrays have the same shape, the operation pairs up elements position by position. If the shapes differ, NumPy applies the broadcasting rules, which are discussed further below.

Reshaping arrays#

In many cases, you can convert an array from one shape to another without copying any data.

np.array.shape#

Changing the shape of an array is as simple as setting its shape attribute. However, the array’s size must remain the same.

g = np.arange(12)
print(g)
print("Rank:", g.ndim)
[ 0  1  2  3  4  5  6  7  8  9 10 11]
Rank: 1
g.shape = (6, 2)
print(g)
print("Rank:", g.ndim)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
Rank: 2

np.array.reshape#

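Another way to change an array’s shape is to use the reshape() method, which returns a new array object and leaves the original’s shape unchanged. Where possible, the new array is a view that shares its data with the original.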

g2 = g.reshape(4,3)  # you need to set this to a new variable to take effect!
print(g2)
print("Rank:", g2.ndim)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Rank: 2

How about we get lazy and let NumPy figure out the details? Passing -1 for one dimension tells NumPy to infer its length from the array’s size.

g2 = g.reshape(4, -1)  
print(g2)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

To convert a multi-dimensional array back to 1-dimensional (a.k.a. array flattening), you can use the flatten method.

f = np.arange(6).reshape(3,2)
print(f)
f = f.flatten()  # you need to set this to a new variable to take effect!
print(f)
print(f.shape)
[[0 1]
 [2 3]
 [4 5]]
[0 1 2 3 4 5]
(6,)
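It can matter whether a reshaped or flattened array shares memory with the original. The sketch below uses illustrative variable names. flatten always returns a copy, while reshape and ravel return a view of the same data when that is possible.

```python
import numpy as np

g = np.arange(6)
view = g.reshape(2, 3)       # shares memory with g in this case
flat_copy = view.flatten()   # always an independent copy
flat_view = view.ravel()     # a view when possible

view[0, 0] = 100
print(g[0])                               # 100: the change shows through to g
print(flat_copy[0])                       # 0: the copy was made before the change
print(np.shares_memory(g, flat_view))     # True
```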

Adding and removing elements#

np.append and np.insert#

a = np.arange(6)
print('original array:\n', a)

b = np.append(a, 111)
print('appending an element to the end:\n', b)

c = np.insert(a, 0, 111) 
print('inserting an element at a specific position:\n', c)

# watch out: arrays have no append or insert methods, so a.append(111) and a.insert(0, 111) raise AttributeError
original array:
 [0 1 2 3 4 5]
appending an element to the end:
 [  0   1   2   3   4   5 111]
inserting an element at a specific position:
 [111   0   1   2   3   4   5]
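With a 2-D array, np.append flattens its inputs unless you pass the axis argument. A small sketch, with illustrative variable names:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])
flat = np.append(m, [7, 8, 9])             # no axis: the result is flattened
rows = np.append(m, [[7, 8, 9]], axis=0)   # axis=0: adds a new row

print(flat.shape)   # (9,)
print(rows.shape)   # (3, 3)
```

Note that with axis=0 the appended values must have the same number of dimensions as m, hence the double brackets.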

np.delete#

a = np.arange(6)
a
c = np.delete(a, [0,1])
print('deleting the first two elements:\n', c)

a.resize(2,3)
print('a after resize():\n', a)

e = np.delete(a, 0, axis=1) # you can delete an entire column by specifying axis=1
print('first column deleted:\n', e)

f = np.delete(a, 0, axis=0) # or you can delete an entire row by specifying axis=0
print('first row deleted:\n', f)
deleting the first two elements:
 [2 3 4 5]
a after resize():
 [[0 1 2]
 [3 4 5]]
first column deleted:
 [[1 2]
 [4 5]]
first row deleted:
 [[3 4 5]]

Copying arrays#

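For efficiency, NumPy avoids copying data unless asked. Assigning an array to a new name (b = a) only creates a second reference to the same object, and basic slicing returns a view of the same data. If you want an independent copy, you need to say so.

You can use either the array’s copy method or the np.copy function.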

b = a = np.arange(6)
a_copy = a.copy()
# alternatively,
a_copy = np.copy(a)
a
b
a_copy
print(a == a_copy)  # element-wise comparison
print(a is a_copy)  # this is False
print(a is b)  # this is True
a[0] = -111  # changing a has no effect on a_copy
a
a_copy
[ True  True  True  True  True  True]
False
True
array([0, 1, 2, 3, 4, 5])
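The same distinction applies to slices. In the sketch below (variable names are illustrative), modifying a slice changes the original array, while modifying a copied slice does not.

```python
import numpy as np

a = np.arange(6)
s_view = a[1:4]          # basic slicing returns a view of a
s_copy = a[1:4].copy()   # an explicit, independent copy

s_view[0] = -1           # also changes a[1]
print(a)        # [ 0 -1  2  3  4  5]
print(s_copy)   # [1 2 3]
```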

Broadcasting#

Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Broadcasting can get complicated, so we recommend you avoid it altogether if you can and stick to one of the two cases below:

  • Broadcast only a scalar with an array

  • Broadcast arrays of the same shape

A = np.arange(6).reshape(3,2)
B = np.arange(6, 12).reshape(3,2)
A
B
array([[ 6,  7],
       [ 8,  9],
       [10, 11]])
A + B
array([[ 6,  8],
       [10, 12],
       [14, 16]])
3 * A
array([[ 0,  3],
       [ 6,  9],
       [12, 15]])
(A / 3).round(2)  # float division
array([[0.  , 0.33],
       [0.67, 1.  ],
       [1.33, 1.67]])
A // 3  # integer division
array([[0, 0],
       [0, 1],
       [1, 1]])
11 + A
array([[11, 12],
       [13, 14],
       [15, 16]])
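If you do need to go beyond these two cases, it helps to know that NumPy compares shapes from the last axis backwards, and that two axis lengths are compatible when they are equal or one of them is 1. The following sketch uses made-up arrays row and col to show one shape combination that works, one that fails, and a way to fix the failing one.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)
row = np.array([10, 20])      # shape (2,)
col = np.array([1, 2, 3])     # shape (3,)

by_row = A + row              # (3, 2) + (2,): row is added to every row of A
print(by_row)

try:
    A + col                   # (3, 2) + (3,): last axes 2 and 3 do not match
except ValueError as err:
    print("cannot broadcast:", err)

by_col = A + col[:, np.newaxis]   # (3, 2) + (3, 1): col is added to every column
print(by_col)
```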

Element-wise matrix multiplication is done by *.

A * B
array([[ 0,  7],
       [16, 27],
       [40, 55]])

For the usual matrix multiplication, use np.dot or, for 2-D arrays, the equivalent @ operator.

B_new = B.reshape(2,-1)
B_new
np.dot(A, B_new)
array([[ 9, 10, 11],
       [39, 44, 49],
       [69, 78, 87]])
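For 2-D arrays, Python's @ operator gives the same result as np.dot. The sketch below rebuilds the two matrices used above so that it runs on its own.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)
B_new = np.arange(6, 12).reshape(2, 3)

prod = A @ B_new                                # matrix product, shape (3, 3)
print(prod)
print(np.array_equal(prod, np.dot(A, B_new)))   # True
```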

Conditional expressions with arrays#

x = np.array([10,20,30,40,50])
x >= 30
array([False, False,  True,  True,  True])
x[x >= 30]
array([30, 40, 50])

np.where#

Returns the indices of elements in an input array where the given condition is satisfied.

y = np.arange(10)
print(y)
np.where(y < 5)
[0 1 2 3 4 5 6 7 8 9]
(array([0, 1, 2, 3, 4]),)

Extremely useful: You can use where for vectorised if-else statements.

compared_to_5 = list(np.where(y < 5, 'smaller', 'bigger'))
print(compared_to_5)
[np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger')]
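Two small variations on the example above, offered as a sketch: calling tolist() on the result gives plain Python strings instead of np.str_ objects, and the two choices passed to np.where can also be arrays or numbers.

```python
import numpy as np

y = np.arange(10)

labels = np.where(y < 5, 'smaller', 'bigger').tolist()   # plain Python strings
print(labels)

capped = np.where(y < 5, y, 5)   # keep values below 5, replace the rest with 5
print(capped)                    # [0 1 2 3 4 5 5 5 5 5]
```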

Mathematical and statistical functions#

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible both as NumPy functions (np.max(a)) and as methods of the array class (a.max()).

a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
print(a)
[[-2.5  3.1  7. ]
 [10.  11.  12. ]]
np.max(a)
np.float64(12.0)
np.min(a)
np.float64(-2.5)
np.mean(a).round(3)
np.float64(6.767)
np.prod(a)
np.float64(-71610.0)
np.std(a).round(3)
np.float64(5.085)
np.var(a).round(3)
np.float64(25.856)
np.sum(a)
np.float64(40.6)

These functions accept an optional argument axis which lets you ask for the operation to be performed on elements along the given axis. For example:

b = np.arange(12).reshape(2,-1)
b
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
b.sum(axis=0)  # column sums: adds the rows together (collapses axis 0)
array([ 6,  8, 10, 12, 14, 16])
b.sum(axis=1)  # row sums: adds the columns together (collapses axis 1)
array([15, 51])
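A way to remember what axis does is that the axis you name is the one that disappears from the result's shape. The sketch below checks this with the same array, and also shows the keepdims argument, which keeps the collapsed axis with length 1.

```python
import numpy as np

b = np.arange(12).reshape(2, -1)          # shape (2, 6)

col_sums = b.sum(axis=0)                  # axis 0 is collapsed
row_sums = b.sum(axis=1)                  # axis 1 is collapsed
kept = b.sum(axis=1, keepdims=True)       # collapsed axis kept with length 1

print(col_sums.shape)   # (6,)
print(row_sums.shape)   # (2,)
print(kept.shape)       # (2, 1)
```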

Universal functions#

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

Many ufuncs are simple element-wise transformations, like sqrt or exp. These are referred to as unary ufuncs.

z = np.array([[-2.5, 3.1, 7], [10, 11, 12]])

np.square#

Element-wise square of the input.

np.square(z)
array([[  6.25,   9.61,  49.  ],
       [100.  , 121.  , 144.  ]])

np.exp#

Calculate the exponential of all elements in the input array.

np.exp(z)
array([[8.20849986e-02, 2.21979513e+01, 1.09663316e+03],
       [2.20264658e+04, 5.98741417e+04, 1.62754791e+05]])

Binary universal functions#

Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:

x = np.array([3, 6, 1])
y = np.array([4, 2, 9])
print(x)
print(y)
[3 6 1]
[4 2 9]

np.maximum#

Element-wise maximum of array elements - do not confuse with np.max which finds the max element in the array.

np.maximum(x,y)
array([4, 6, 9])

np.minimum#

Element-wise minimum of array elements - do not confuse with np.min which finds the min element in the array.

np.minimum(x,y)
array([3, 2, 1])

np.power#

First array elements raised to powers from second array, element-wise.

np.power(x,y)
array([81, 36,  1])
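Several binary ufuncs are what the arithmetic operators call behind the scenes, and they also accept a scalar as one of the two inputs. A brief sketch using the same x and y:

```python
import numpy as np

x = np.array([3, 6, 1])
y = np.array([4, 2, 9])

print(np.array_equal(np.add(x, y), x + y))      # True
print(np.array_equal(np.power(x, y), x ** y))   # True

floor_at_4 = np.maximum(x, 4)   # compare every element with the scalar 4
print(floor_at_4)               # [4 6 4]
```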

Array indexing and slicing#

One-dimensional arrays#

One-dimensional NumPy arrays can be accessed more or less like regular Python lists:

a = np.array([1, 5, 3, 19, 13, 7, 3])
a[3]
np.int64(19)
a[2:5]
array([ 3, 19, 13])
a[2:-1]
array([ 3, 19, 13,  7])
a[:2]
array([1, 5])
a[2::2]
array([ 3, 13,  3])
a[::-1]
array([ 3,  7, 13, 19,  3,  5,  1])

Of course, you can modify elements:

a[3]=999
a
array([  1,   5,   3, 999,  13,   7,   3])

You can also modify an array slice:

a[2:5] = [997, 998, 999]
a
array([  1,   5, 997, 998, 999,   7,   3])

Multi-dimensional arrays#

Multi-dimensional arrays can be accessed in a similar way by providing an index or slice for each axis, separated by commas:

b = np.arange(12).reshape(4, 3)
b
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
b[1, 1]  # row 2, col 2 (recall that Python indexing starts at 0)
np.int64(4)
b[0, :]  # row 1, all columns
array([0, 1, 2])
b[:, 0]  # all rows, column 1
array([0, 3, 6, 9])

Caution: Note the subtle difference between these two expressions:

c = b[0, :]
print(c)
print(c.shape)
[0 1 2]
(3,)
d = b[0:1, :]
print(d)
print(d.shape)
[[0 1 2]]
(1, 3)

The first expression returns row 1 as a 1D array of shape (3,), while the second returns that same row as a 2D array of shape (1, 3).
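The same applies to columns. As a sketch, here are two ways to pull out the first column while keeping it two-dimensional:

```python
import numpy as np

b = np.arange(12).reshape(4, 3)

col_1d = b[:, 0]                      # shape (4,)
col_2d = b[:, 0:1]                    # shape (4, 1): slice instead of index
col_2d_alt = b[:, 0][:, np.newaxis]   # shape (4, 1): add an axis afterwards

print(col_1d.shape, col_2d.shape, col_2d_alt.shape)
```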

Transposing arrays#

An array’s transpose() method transposes the array.

a = np.arange(10).reshape(5,-1)
a
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
a = a.transpose()  # transpose() returns a new array object; assign it to keep the result
a
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])
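The T attribute is a shorthand for transpose(). The sketch below also shows that the transpose is a view of the same data, and that transposing a 1-D array leaves it unchanged.

```python
import numpy as np

a = np.arange(10).reshape(5, -1)
t = a.T                          # same result as a.transpose()
print(t.shape)                   # (2, 5)
print(np.shares_memory(a, t))    # True: t is a view of a

v = np.arange(3)
print(v.T.shape)                 # (3,): a 1-D array has only one axis
```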

Combining arrays#

np.vstack: stack arrays vertically#

a = 1 + np.arange(3)
b = -1 * a
c = 10 + a
print(a)
print(b)
print(c)
d = np.vstack((a, b, c))  # notice the double parentheses
print('stack vertically:\n', d)
[1 2 3]
[-1 -2 -3]
[11 12 13]
stack vertically:
 [[ 1  2  3]
 [-1 -2 -3]
 [11 12 13]]

np.hstack: stack arrays horizontally#

d = np.hstack((a, b, c))  # notice the double parentheses
print('stack horizontally:\n', d)
stack horizontally:
 [ 1  2  3 -1 -2 -3 11 12 13]
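As the output shows, np.hstack joins 1-D arrays end to end. If you want each 1-D array to become a column of a 2-D array instead, one option is np.column_stack, sketched here with the same three arrays:

```python
import numpy as np

a = 1 + np.arange(3)
b = -1 * a
c = 10 + a

cols = np.column_stack((a, b, c))   # each input becomes one column
print(cols)
print(np.array_equal(cols, np.vstack((a, b, c)).T))   # True
```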

Sorting arrays#

You can use an array’s sort method, but pay attention as sorting is done in-place!

a = np.array([3, 5, -1, 0, 11])
print(a)
sort_output = a.sort()
print('a has been sorted in place:\n', a)
print(sort_output) # tricky: this will print None!
[ 3  5 -1  0 11]
a has been sorted in place:
 [-1  0  3  5 11]
None

If you do not want to sort in place, you need to use np.sort.

a = np.array([3, 5, -1, 0, 11])
print(a)
b = np.sort(a)
print(b)
print('Notice a is not changed:\n', a)
[ 3  5 -1  0 11]
[-1  0  3  5 11]
Notice a is not changed:
 [ 3  5 -1  0 11]

If you want reverse sort, you need to do it indirectly as there is no direct option for it inside the sort methods.

a_reverse_sorted = np.sort(a)[::-1]
print(a_reverse_sorted)
[11  5  3  0 -1]
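A related function is np.argsort, which returns the indices that would sort the array instead of the sorted values. This is handy when you need to reorder a second array in the same way. A minimal sketch:

```python
import numpy as np

a = np.array([3, 5, -1, 0, 11])
order = np.argsort(a)        # indices that would sort a in ascending order

print(order)                 # [2 3 0 1 4]
print(a[order])              # [-1  0  3  5 11]
print(a[order[::-1]])        # [11  5  3  0 -1]
```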

Exercises#

1- Initialize a 5 \(\times\) 3 2D array with all numbers divisible by 3 from 3 to 45, then slice the last column of the array. HINT: use np.arange’s step argument. For example, np.arange(0, 10, step = 2) creates an array of 0, 2, 4, 6, 8; note that the stop value is excluded.

2- Create an array, say a = np.random.uniform(1, 10, 10). Find the location (index) of the maximum value in a. How about the location of the minimum value? HINT: use the argmax and argmin methods.

3- Create the following array and find the maximum values in each row. How about column-wise maximum values? HINT: use np.amax.

\[\begin{split}A = \begin{bmatrix} 1 & 3 & 4 \\ 2 & 7 & -1 \end{bmatrix}\end{split}\]

4- Missing values such as NA and nan are not uncommon in data science (technically, nan is not a missing value. It stands for not-a-number.) Create the following matrix which contains one nan using np.nan.

\[\begin{split}B = \begin{bmatrix} 1 & 3 & \text{nan} \\ 2 & 7 & -1 \end{bmatrix}\end{split}\]

5- Find the column-wise and the row-wise maximum values in B created in the previous question. Does np.amax return any value? HINT: Try the np.nanmax function.

Possible solutions#

1- Initializing and slicing arrays

import numpy as np
# Create and reshape the array
myarray = np.arange(3, 48, step = 3)
myarray.shape = (5, 3)

# Slice the last column
myarray[:,2]

2- Indexing the maximum and minimum

import numpy as np
a = np.random.uniform(1, 10, 10)
a
a.argmax() # Index of the maximum value
a.argmin() # Index of the minimum value

3- Column-wise and row-wise maximum values.

import numpy as np
A = np.array([[1, 3, 4],[2, 7, -1]])
np.amax(A, axis = 0) # Column-wise
np.amax(A, axis = 1) # Row-wise

4- Creating nan with numpy.

import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])

5- Column-wise and row-wise maximum values in the presence of nan values.

import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])
np.nanmax(B, axis = 0) # Column-wise
np.nanmax(B, axis = 1) # Row-wise