NumPy#
NumPy (Numerical Python) is the core module for numerical computation in Python. It provides a fast, memory-efficient n-dimensional array data structure, along with useful linear algebra and random number functions. A large portion of NumPy is written in the C programming language.
A NumPy array is similar to Python’s list data structure. A Python list can contain any combination of element types: integers, floats, strings, functions, objects, etc. A NumPy array, on the other hand, must contain only one element type at a time. Because the element type is fixed, NumPy can store the values compactly and run operations in compiled code, which makes arrays much faster and more memory efficient than lists.
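As a small illustration of the single-element-type rule, the sketch below builds arrays from mixed inputs and inspects the resulting dtype. The variable names are made up for this example.

```python
import numpy as np

# A Python list keeps each element as its own object with its own type.
mixed_list = [1, 2.5, 3]
print([type(x).__name__ for x in mixed_list])

# np.array picks one dtype that can hold every element (here float64),
# so the integers are upcast to floats.
mixed_arr = np.array(mixed_list)
print(mixed_arr.dtype, mixed_arr)

# Mixing numbers and strings stores everything as strings.
text_arr = np.array([1, "two", 3.0])
print(text_arr.dtype.kind)  # 'U' marks a Unicode string dtype
```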
Both the Pandas module (for data analysis) and the Scikit-Learn module (for machine learning) are built upon the NumPy module. The Matplotlib module (for plotting) also plays nicely with NumPy. These four modules plus base Python are practically all you need for basic to intermediate machine learning.
Two other fundamental Python modules closely related to machine learning are as follows, though we will not cover these in our tutorials:
SciPy: This module is for numerical computing including integration, differentiation, optimization, probability distributions, and parallel programming.
StatsModels: This module provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
Let’s import numpy with the usual convention of np.
import numpy as np
Creating arrays with NumPy#
NumPy’s array class is called ndarray (the n-dimensional array). It is also known by the name array.
In a NumPy array, each dimension is called an axis and the number of axes is called the rank.
For example, a 3x4 matrix is an array of rank 2 (it is 2-dimensional).
The first axis has length 3, the second has length 4.
An array’s list of axis lengths is called the shape of the array.
For example, a 3x4 matrix’s shape is (3, 4). The rank is equal to the shape’s length.
The size of an array is the total number of elements, which is the product of all axis lengths (e.g., 3*4 = 12).
np.array
#
The easiest way to create an array is to use the array
function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.
arr1 = np.array([2, 10.2, 5.4, 80, 0])
arr1
array([ 2. , 10.2, 5.4, 80. , 0. ])
Nested sequences, like a list of equal-length lists, will be converted into a multi-dimensional array:
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data)
arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
arr2.shape
(2, 4)
arr2.ndim # equal to len(arr2.shape)
2
arr2.size
8
Other functions to create arrays#
There are several other convenience NumPy functions to create arrays.
np.zeros
#
Creates an array containing any number of zeros.
np.zeros(5)
array([0., 0., 0., 0., 0.])
It’s just as easy to create a 2-D array (i.e., a matrix) by providing a tuple with the desired number of rows and columns. For example, here’s a 2x3 matrix:
np.zeros((2, 3)) # notice the double parentheses
array([[0., 0., 0.],
[0., 0., 0.]])
np.ones
#
Produces an array of all ones.
np.ones((2, 3))
array([[1., 1., 1.],
[1., 1., 1.]])
To create an array filled with a single constant value, you can scale an array of ones:
(np.pi * np.ones((3,4))).round(2)
array([[3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14]])
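NumPy also has a dedicated function for constant arrays, np.full. The sketch below compares it with the ones-based approach; the variable names are chosen only for illustration.

```python
import numpy as np

# np.full(shape, fill_value) builds the constant array directly.
pi_full = np.full((3, 4), np.pi)

# The same result obtained by scaling an array of ones.
pi_ones = np.pi * np.ones((3, 4))

print(pi_full.round(2))
print(np.array_equal(pi_full, pi_ones))
```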
np.arange
#
This is similar to Python’s built-in range function, but it returns a NumPy array instead of a range object.
np.arange(5)
array([0, 1, 2, 3, 4])
np.arange(1, 5)
array([1, 2, 3, 4])
It also works with floats:
np.arange(1.0, 5.0)
array([1., 2., 3., 4.])
Of course, you can provide a step parameter:
np.arange(1, 5, step = 0.5)
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
np.linspace
#
This is similar to seq()
in R. Its inputs are (start, stop, number of elements) and it returns evenly-spaced numbers over a specified interval. By default, the stop value is included.
np.linspace(0, 10, 6)
array([ 0., 2., 4., 6., 8., 10.])
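Two optional arguments of np.linspace are worth knowing: endpoint controls whether the stop value is included, and retstep also returns the spacing. A minimal sketch, with variable names invented for the example:

```python
import numpy as np

# endpoint=False leaves the stop value out, much like np.arange does.
half_open = np.linspace(0, 10, 5, endpoint=False)
print(half_open)

# retstep=True returns the array together with the step between values.
values, step = np.linspace(0, 10, 6, retstep=True)
print(values, step)
```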
np.quantile
#
Computes the q-th quantile of its input. It plays nicely with np.linspace
.
a = np.arange(1, 21)
print('a =', a)
quartiles = np.linspace(0, 1, 5)
print('quartiles =', quartiles)
a = [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
quartiles = [0. 0.25 0.5 0.75 1. ]
np.quantile(a, 0.5) # how to compute the median
np.float64(10.5)
np.quantile(a, quartiles)
array([ 1. , 5.75, 10.5 , 15.25, 20. ])
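For reference, the same numbers can be obtained with np.median and np.percentile, which takes percentages between 0 and 100 instead of fractions. The sketch below reuses the array from above and assumes the default (linear) interpolation.

```python
import numpy as np

a = np.arange(1, 21)

# Three equivalent ways to get the median.
median_q = np.quantile(a, 0.5)
median_direct = np.median(a)
median_pct = np.percentile(a, 50)
print(median_q, median_direct, median_pct)

# Quantiles use fractions, percentiles use percentages.
q_quart = np.quantile(a, [0.25, 0.75])
p_quart = np.percentile(a, [25, 75])
print(q_quart, p_quart)
```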
np.random.rand
and np.random.randn
#
A number of functions are available in NumPy’s random
module to create arrays initialized with random values.
For example, here is a matrix initialized with random floats between 0 and 1 (uniform distribution):
np.random.rand(2,3).round(3)
array([[0.867, 0.295, 0.074],
[0.419, 0.619, 0.623]])
Here’s a matrix containing random floats sampled from a univariate normal distribution (Gaussian distribution) with mean 0 and variance 1:
np.random.randn(2,3).round(3)
array([[-0.102, -1.267, 0.367],
[ 0.564, 1.524, 0.681]])
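Random output changes on every run unless a seed is fixed. One way to get reproducible numbers is NumPy's Generator interface, created with np.random.default_rng. The sketch below is an alternative to the functions above, not a replacement the tutorial prescribes; the seed 42 is arbitrary.

```python
import numpy as np

# A generator seeded with a fixed value produces the same stream each run.
rng = np.random.default_rng(42)
uniform = rng.random((2, 3))           # uniform floats in [0, 1)
normal = rng.standard_normal((2, 3))   # mean 0, variance 1
print(uniform.round(3))
print(normal.round(3))

# A second generator with the same seed reproduces the first draw.
rng_again = np.random.default_rng(42)
uniform_again = rng_again.random((2, 3))
print(np.array_equal(uniform, uniform_again))
```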
Data types for arrays#
Type | Description |
---|---|
int16 | 16-bit integer |
int32 | 32-bit integer |
int64 | 64-bit integer |
float16 | Half-precision floating point |
float32 | Standard single-precision floating point |
float64 | Standard double-precision floating point |
bool | Boolean (True or False) |
str_ | Unicode string |
object | A value can be any Python object |
np.array.dtype
#
NumPy’s arrays are also efficient in part because all their elements must have the same type (usually numbers).
You can check what the data type is by looking at the dtype
attribute.
arr1 = np.array([1, 2, 3], dtype = np.float64)
print("Data type name:", arr1.dtype.name)
Data type name: float64
arr2 = np.array([1, 2, 3], dtype = np.int32)
print(arr2.dtype, arr2)
int32 [1 2 3]
np.array.astype
#
You can explicitly convert or cast an array from one dtype to another using the astype method.
arr2.dtype
dtype('int32')
arr2 = arr2.astype(np.float64)
arr2.dtype # integers are now cast to floating point
dtype('float64')
If you have an array of strings representing numbers, you can use a list comprehension to convert them to numeric form.
arr3 = np.array(['1.25', '-9.6', '42'])
numeric_strings = np.array([float(x) for x in arr3])
numeric_strings
array([ 1.25, -9.6 , 42. ])
numeric_strings.astype(float) # astype returns a new array; assign the result to a variable to keep it
array([ 1.25, -9.6 , 42. ])
numeric_strings.dtype
dtype('float64')
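As an aside, astype can also parse numeric strings directly, so the list comprehension is optional here. A minimal sketch using the same input (as_float and via_comprehension are names made up for the example):

```python
import numpy as np

arr3 = np.array(['1.25', '-9.6', '42'])

# astype parses each string and returns a new float array.
as_float = arr3.astype(np.float64)
print(as_float, as_float.dtype)

# The original array is left untouched and still holds strings.
print(arr3.dtype.kind)  # 'U' marks a Unicode string dtype

# Same values as the list-comprehension approach.
via_comprehension = np.array([float(x) for x in arr3])
print(np.allclose(as_float, via_comprehension))
```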
Arithmetic operations on arrays#
All the usual arithmetic operators (+
, -
, *
, /
, //
, **
, etc.) can be used with arrays. They apply element-wise.
a = np.array([14, 23, 32, 41])
b = np.array([5, 4, 3, 2])
print("a + b =", a + b)
print("a - b =", a - b)
print("a * b =", a * b)
print("a / b =", a / b)
print("a // b =", a // b)
print("a % b =", a % b)
print("a ** b =", a ** b)
a + b = [19 27 35 43]
a - b = [ 9 19 29 39]
a * b = [70 92 96 82]
a / b = [ 2.8 5.75 10.66666667 20.5 ]
a // b = [ 2 5 10 20]
a % b = [4 3 2 1]
a ** b = [537824 279841 32768 1681]
Note that the multiplication is not a matrix multiplication.
The arrays must have the same shape. If they do not, NumPy will apply the broadcasting rules, which are discussed further below.
Reshaping arrays#
In many cases, you can convert an array from one shape to another without copying any data.
np.array.shape
#
Changing the shape of an array is as simple as setting its shape
attribute. However, the array’s size must remain the same.
g = np.arange(12)
print(g)
print("Rank:", g.ndim)
[ 0 1 2 3 4 5 6 7 8 9 10 11]
Rank: 1
g.shape = (6, 2)
print(g)
print("Rank:", g.ndim)
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]]
Rank: 2
np.array.reshape
#
Another way to change an array’s shape is to use the reshape()
method, which returns a new array object.
g2 = g.reshape(4,3) # you need to set this to a new variable to take effect!
print(g2)
print("Rank:", g2.ndim)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
Rank: 2
How about we get lazy and let NumPy figure out the details? Passing -1 for one dimension tells NumPy to infer its length from the array’s size.
g2 = g.reshape(4, -1)
print(g2)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
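A few more uses of -1 are sketched below, including what happens when the requested shape cannot hold all elements. The variable names are illustrative only.

```python
import numpy as np

g = np.arange(12)

# -1 can stand in for the number of rows as well.
two_cols = g.reshape(-1, 2)
print(two_cols.shape)  # (6, 2)

# reshape(-1) on its own collapses any array to one dimension.
flat = two_cols.reshape(-1)
print(flat.shape)  # (12,)

# 12 elements cannot be split into 5 equal rows, so NumPy raises an error.
error_message = None
try:
    g.reshape(5, -1)
except ValueError as err:
    error_message = str(err)
print(error_message)
```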
To convert a multi-dimensional array back to a 1-dimensional one (a.k.a. array flattening), you can use the flatten method.
f = np.arange(6).reshape(3,2)
print(f)
f = f.flatten() # you need to set this to a new variable to take effect!
print(f)
print(f.shape)
[[0 1]
[2 3]
[4 5]]
[0 1 2 3 4 5]
(6,)
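A related method is ravel. The difference, sketched below, is that flatten always returns a copy while ravel returns a view when the memory layout allows it (as it does for this small contiguous array).

```python
import numpy as np

f = np.arange(6).reshape(3, 2)

flat_copy = f.flatten()  # always an independent copy
flat_view = f.ravel()    # a view here, because f is contiguous

print(np.shares_memory(f, flat_copy))  # False
print(np.shares_memory(f, flat_view))  # True

# Writing through the view changes f; writing to the copy does not.
flat_view[0] = 100
flat_copy[1] = -1
print(f)
```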
Adding and removing elements#
np.append
and np.insert
#
a = np.arange(6)
print('original array:\n', a)
b = np.append(a, 111)
print('appending an element to the end:\n', b)
c = np.insert(a, 0, 111)
print('inserting an element at a specific position:\n', c)
# watch out: these will NOT work: a.append(111), a.insert(0, 111)
original array:
[0 1 2 3 4 5]
appending an element to the end:
[ 0 1 2 3 4 5 111]
inserting an element at a specific position:
[111 0 1 2 3 4 5]
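With 2-D arrays, np.append behaves differently depending on whether axis is given. The sketch below uses a small made-up matrix m to show both cases.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)

# Without axis, both inputs are flattened before appending.
flat_append = np.append(m, [6, 7, 8])
print(flat_append)

# With axis=0, the new values must be 2-D and are added as a row.
row_append = np.append(m, [[6, 7, 8]], axis=0)
print(row_append)

# m itself is never modified; each call returns a new array.
print(m.shape)
```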
np.delete
#
a = np.arange(6)
a
c = np.delete(a, [0,1])
print('deleting the first two elements:\n', c)
a.resize(2,3)
print('a after resize():\n', a)
e = np.delete(a, 0, axis=1) # you can delete an entire column by specifying axis=1
print('first column deleted:\n', e)
f = np.delete(a, 0, axis=0) # or you can delete an entire row by specifying axis=0
print('first row deleted:\n', f)
deleting the first two elements:
[2 3 4 5]
a after resize():
[[0 1 2]
[3 4 5]]
first column deleted:
[[1 2]
[4 5]]
first row deleted:
[[3 4 5]]
Copying arrays#
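For efficiency, NumPy avoids copying data where it can. Assigning an array to another name does not copy anything (both names refer to the same array), and slices return views onto the original data. If you want a copy, you need to say so.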
You can use either np.array.copy
or np.copy
.
b = a = np.arange(6)
a_copy = a.copy()
# alternatively,
a_copy = np.copy(a)
a
b
a_copy
print(a == a_copy) # element-wise comparison
print(a is a_copy) # this is False
print(a is b) # this is True
a[0] = -111 # changing a has no effect on a_copy
a
a_copy
[ True True True True True True]
False
True
array([0, 1, 2, 3, 4, 5])
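The same caution applies to slices, which are views onto the original data. A minimal sketch of the difference between a slice and an explicit copy of it (the names part_view and part_copy are invented for the example):

```python
import numpy as np

a = np.arange(6)

part_view = a[1:4]         # a view: shares memory with a
part_copy = a[1:4].copy()  # an independent copy

# Writing through the view also changes a.
part_view[0] = 99
print(a)          # a[1] is now 99
print(part_copy)  # the copy still holds the old values

print(np.shares_memory(a, part_view))  # True
print(np.shares_memory(a, part_copy))  # False
```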
Broadcasting#
Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Broadcasting can get complicated, so we recommend you avoid it altogether if you can and stick to one of the two cases below:
Broadcast only a scalar with an array
Broadcast arrays of the same shape
A = np.arange(6).reshape(3,2)
B = np.arange(6, 12).reshape(3,2)
A
B
array([[ 6, 7],
[ 8, 9],
[10, 11]])
A + B
array([[ 6, 8],
[10, 12],
[14, 16]])
3 * A
array([[ 0, 3],
[ 6, 9],
[12, 15]])
(A / 3).round(2) # float division
array([[0. , 0.33],
[0.67, 1. ],
[1.33, 1.67]])
A // 3 # integer division
array([[0, 0],
[0, 1],
[1, 1]])
11 + A
array([[11, 12],
[13, 14],
[15, 16]])
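For completeness, here is a sketch of what happens when the shapes do differ. NumPy compares shapes from the last axis backwards, and two axis lengths are compatible when they are equal or when one of them is 1. The arrays row and col are made up for the illustration.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)

# Shape (2,) matches the last axis of A, so it is added to every row.
row = np.array([10, 20])
row_sum = A + row
print(row_sum)

# Shape (3, 1) has length 1 on the last axis, so it is stretched across columns.
col = np.array([[100], [200], [300]])
col_sum = A + col
print(col_sum)

# Shape (3,) does not match the last axis (length 2), so this fails.
broadcast_failed = False
try:
    A + np.array([1, 2, 3])
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast:", err)
```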
Element-wise matrix multiplication is done by *
.
A * B
array([[ 0, 7],
[16, 27],
[40, 55]])
For the usual matrix multiplication, you can use np.dot (for 2-D arrays, the @ operator gives the same result).
B_new = B.reshape(2,-1)
B_new
np.dot(A, B_new)
array([[ 9, 10, 11],
[39, 44, 49],
[69, 78, 87]])
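The @ operator and np.matmul compute the same product as np.dot for 2-D arrays. The sketch below rebuilds the two matrices from above so that it runs on its own.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)          # shape (3, 2)
B_new = np.arange(6, 12).reshape(2, 3)  # shape (2, 3)

# Three equivalent ways to multiply two 2-D arrays.
via_dot = np.dot(A, B_new)
via_operator = A @ B_new
via_matmul = np.matmul(A, B_new)

print(via_operator)
print(np.array_equal(via_dot, via_operator))
```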
Conditional expressions with arrays#
x = np.array([10,20,30,40,50])
x >= 30
array([False, False, True, True, True])
x[x >= 30]
array([30, 40, 50])
np.where
#
Returns the indices of elements in an input array where the given condition is satisfied.
y = np.arange(10)
print(y)
np.where(y < 5)
[0 1 2 3 4 5 6 7 8 9]
(array([0, 1, 2, 3, 4]),)
Extremely useful: You can use where
for vectorised if-else statements.
compared_to_5 = list(np.where(y < 5, 'smaller', 'bigger'))
print(compared_to_5)
[np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('smaller'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger'), np.str_('bigger')]
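The two value arguments of np.where can be scalars or arrays, so it is equally handy for numeric replacements. The sketch below also uses tolist(), which converts the result to plain Python values and avoids the np.str_ wrappers in the printed list. The names capped and labels are invented for the example.

```python
import numpy as np

y = np.arange(10)

# Keep values below 5 and replace everything else with 5.
capped = np.where(y < 5, y, 5)
print(capped)

# tolist() yields ordinary Python strings rather than np.str_ objects.
labels = np.where(y % 2 == 0, "even", "odd").tolist()
print(labels)
```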
Mathematical and statistical functions#
A set of mathematical functions that compute statistics about an entire array, or about the data along an axis, are available both as NumPy functions (e.g., np.max(a)) and as methods of the array class (e.g., a.max()).
a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
print(a)
[[-2.5 3.1 7. ]
[10. 11. 12. ]]
np.max(a)
np.float64(12.0)
np.min(a)
np.float64(-2.5)
np.mean(a).round(3)
np.float64(6.767)
np.prod(a)
np.float64(-71610.0)
np.std(a).round(3)
np.float64(5.085)
np.var(a).round(3)
np.float64(25.856)
np.sum(a)
np.float64(40.6)
These functions accept an optional argument axis
which lets you ask for the operation to be performed on elements along the given axis. For example:
b = np.arange(12).reshape(2,-1)
b
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]])
b.sum(axis=0) # collapse the rows: one sum per column
array([ 6, 8, 10, 12, 14, 16])
b.sum(axis=1) # collapse the columns: one sum per row
array([15, 51])
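A reduction along an axis drops that axis from the result. When the result needs to line up with the original array again, the optional keepdims argument keeps the reduced axis with length 1. A small sketch, with names chosen for illustration:

```python
import numpy as np

b = np.arange(12).reshape(2, -1)  # shape (2, 6)

# keepdims=True keeps the reduced axis, giving shape (2, 1) instead of (2,).
row_totals = b.sum(axis=1, keepdims=True)
print(row_totals.shape)

# The (2, 1) totals divide each row of b, so every row then sums to 1.
row_share = b / row_totals
print(row_share.sum(axis=1))

# Other statistics accept axis in the same way.
col_means = b.mean(axis=0)
print(col_means)
```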
Universal functions#
A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple element-wise transformations, like sqrt or exp. These are referred to as unary ufuncs.
z = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
np.square
#
Element-wise square of the input.
np.square(z)
array([[ 6.25, 9.61, 49. ],
[100. , 121. , 144. ]])
np.exp
#
Calculate the exponential of all elements in the input array.
np.exp(z)
array([[8.20849986e-02, 2.21979513e+01, 1.09663316e+03],
[2.20264658e+04, 5.98741417e+04, 1.62754791e+05]])
Binary universal functions#
Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:
x = np.array([3, 6, 1])
y = np.array([4, 2, 9])
print(x)
print(y)
[3 6 1]
[4 2 9]
np.maximum
#
Element-wise maximum of array elements - do not confuse with np.max
which finds the max element in the array.
np.maximum(x,y)
array([4, 6, 9])
np.minimum
#
Element-wise minimum of array elements - do not confuse with np.min
which finds the min element in the array.
np.minimum(x,y)
array([3, 2, 1])
np.power
#
First array elements raised to powers from second array, element-wise.
np.power(x,y)
array([81, 36, 1])
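The arithmetic operators from earlier are shorthand for binary ufuncs, and a scalar can take the place of the second array. A brief sketch using the same x and y:

```python
import numpy as np

x = np.array([3, 6, 1])
y = np.array([4, 2, 9])

# Operators and their ufunc counterparts give identical results.
print(np.array_equal(np.add(x, y), x + y))
print(np.array_equal(np.power(x, y), x ** y))

# np.maximum with a scalar acts as a floor: values below 4 become 4.
floored = np.maximum(x, 4)
print(floored)

# np.max, by contrast, reduces the whole array to a single value.
largest = np.max(x)
print(largest)
```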
Array indexing and slicing#
One-dimensional arrays#
One-dimensional NumPy arrays can be accessed more or less like regular Python lists:
a = np.array([1, 5, 3, 19, 13, 7, 3])
a[3]
np.int64(19)
a[2:5]
array([ 3, 19, 13])
a[2:-1]
array([ 3, 19, 13, 7])
a[:2]
array([1, 5])
a[2::2]
array([ 3, 13, 3])
a[::-1]
array([ 3, 7, 13, 19, 3, 5, 1])
Of course, you can modify elements:
a[3]=999
a
array([ 1, 5, 3, 999, 13, 7, 3])
You can also modify an array slice:
a[2:5] = [997, 998, 999]
a
array([ 1, 5, 997, 998, 999, 7, 3])
Multi-dimensional arrays#
Multi-dimensional arrays can be accessed in a similar way by providing an index or slice for each axis, separated by commas:
b = np.arange(12).reshape(4, 3)
b
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
b[1, 1] # row 2, col 2 (recall that Python indexing starts at 0)
np.int64(4)
b[0, :] # row 1, all columns
array([0, 1, 2])
b[:, 0] # all rows, column 1
array([0, 3, 6, 9])
Caution: Note the subtle difference between these two expressions:
c = b[0, :]
print(c)
print(c.shape)
[0 1 2]
(3,)
d = b[0:1, :]
print(d)
print(d.shape)
[[0 1 2]]
(1, 3)
The first expression returns row 1 as a 1D array of shape (3,)
, while the second returns that same row as a 2D array of shape (1, 3)
.
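If a 2D result is what you need, there are several equivalent ways to get it. The sketch below lists a few; which one reads best is a matter of taste.

```python
import numpy as np

b = np.arange(12).reshape(4, 3)

row_slice = b[0:1, :]                  # a slice keeps the axis
row_list = b[[0], :]                   # a list index keeps the axis
row_newaxis = b[0, :][np.newaxis, :]   # add the axis back explicitly
row_reshape = b[0, :].reshape(1, -1)   # or reshape the 1D row

for row in (row_slice, row_list, row_newaxis, row_reshape):
    print(row.shape, row)
```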
Transposing arrays#
An array’s transpose()
method transposes the array.
a = np.arange(10).reshape(5,-1)
a
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
a = a.transpose() # transpose() returns a new array object, so assign the result to keep it
a
array([[0, 2, 4, 6, 8],
[1, 3, 5, 7, 9]])
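The T attribute is a common shorthand for transpose(). Note that the transposed array is a view of the same data rather than a copy, as this sketch shows.

```python
import numpy as np

a = np.arange(10).reshape(5, -1)  # shape (5, 2)

t = a.T  # same result as a.transpose()
print(t.shape)
print(np.array_equal(t, a.transpose()))

# The original array keeps its shape.
print(a.shape)

# t is a view, so writing to it also changes a.
print(np.shares_memory(a, t))
t[0, 0] = -1
print(a[0, 0])
```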
Combining arrays#
np.vstack
: stack arrays vertically#
a = 1 + np.arange(3)
b = -1 * a
c = 10 + a
print(a)
print(b)
print(c)
d = np.vstack((a, b, c)) # notice the double parentheses
print('stack vertically:\n', d)
[1 2 3]
[-1 -2 -3]
[11 12 13]
stack vertically:
[[ 1 2 3]
[-1 -2 -3]
[11 12 13]]
np.hstack
: stack arrays horizontally#
d = np.hstack((a, b, c)) # notice the double parentheses
print('stack horizontally:\n', d)
stack horizontally:
[ 1 2 3 -1 -2 -3 11 12 13]
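With 2D inputs, vstack adds rows and hstack adds columns, and both are special cases of np.concatenate along a chosen axis. The arrays top, bottom, and side below are made up for the illustration.

```python
import numpy as np

top = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
bottom = np.array([[7, 8, 9]])          # shape (1, 3)
side = np.array([[10], [20]])           # shape (2, 1)

# vstack needs matching column counts; hstack needs matching row counts.
v = np.vstack((top, bottom))
h = np.hstack((top, side))
print(v)
print(h)

# The same results via np.concatenate.
print(np.array_equal(v, np.concatenate((top, bottom), axis=0)))
print(np.array_equal(h, np.concatenate((top, side), axis=1)))
```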
Sorting arrays#
You can use an array’s sort
method, but pay attention as sorting is done in-place!
a = np.array([3, 5, -1, 0, 11])
print(a)
sort_output = a.sort()
print('a has been sorted in place:\n', a)
print(sort_output) # tricky: this will print None!
[ 3 5 -1 0 11]
a has been sorted in place:
[-1 0 3 5 11]
None
If you do not want to sort in place, you need to use np.sort
.
a = np.array([3, 5, -1, 0, 11])
print(a)
b = np.sort(a)
print(b)
print('Notice a is not changed:\n', a)
[ 3 5 -1 0 11]
[-1 0 3 5 11]
Notice a is not changed:
[ 3 5 -1 0 11]
If you want a reverse sort, you need to do it indirectly, as neither sort method has a direct option for it.
a_reverse_sorted = np.sort(a)[::-1]
print(a_reverse_sorted)
[11 5 3 0 -1]
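When a second array has to be reordered the same way, np.argsort is useful: it returns the indices that would sort the array. In the sketch below, the names array is a hypothetical set of labels paired with the values in a.

```python
import numpy as np

a = np.array([3, 5, -1, 0, 11])
names = np.array(["c", "d", "a", "b", "e"])  # hypothetical labels for a

# Indices that sort a in ascending order, then reversed for descending order.
order = np.argsort(a)
desc_order = order[::-1]
print(order)

# Apply the same ordering to both arrays.
print(a[desc_order])
print(names[desc_order])
```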
Exercises#
1- Initialize a 5 \(\times\) 3 2D array with all numbers divisible by 3 between 3 and 48. HINT: np.arange
’s argument step
. For example, you can create an array of 0, 2, 4, 6, 8 by calling np.arange(0, 10, step = 2)
. Then slice the last column of the array.
2- Create an array say a = np.random.uniform(1, 10, 10)
. Find the location or index of the maximum value in a
. How about the location of the minimum value? HINT: use argmax
and argmin
methods
3- Create the following array and find the maximum values in each row. How about column-wise maximum values? HINT: use np.amax.
[[1, 3, 4], [2, 7, -1]]
4- Missing values such as NA and nan are not uncommon in data science (technically, nan is not a missing value; it stands for not-a-number). Create the following matrix B, which contains one nan, using np.nan.
[[1, 3, nan], [2, 7, -1]]
5- Find the column-wise and the row-wise maximum values in B
created in the previous question. Does np.amax
return any value? HINT: Try np.nanmax
method.
Possible solutions#
1- Initializing and slicing arrays
import numpy as np
# Create and reshape the array
myarray = np.arange(3, 48, step = 3)
myarray.shape = (5, 3)
# Slice the last column
myarray[:,2]
2- Indexing the maximum and minimum
import numpy as np
a = np.random.uniform(1, 10, 10)
a
a.argmax() # Index of the maximum value
a.argmin() # Index of the minimum value
3- Column-wise and row-wise maximum and minimum values.
import numpy as np
A = np.array([[1, 3, 4],[2, 7, -1]])
np.amax(A, axis = 0) # Column-wise
np.amax(A, axis = 1) # Row-wise
4- Creating nan
with numpy
.
import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])
5- Column-wise and row-wise maximum and minimum values in the presence of nan
values.
import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])
np.nanmax(B, axis = 0) # Column-wise
np.nanmax(B, axis = 1) # Row-wise