pandas

Scientific Computing

Prof. Calvin

Why pandas?

What is pandas?

In its own words:

a fast, powerful, flexible and easy to use open source data analysis and manipulation tool

In my words:

“The basis of the modern ‘boom’ in data and data analysis”

pandas

Stands for “panel data”
Original built on NumPy, now on supercomputing technologies.
Core innovation: The DataFrame

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.

“like a spreadsheet”

With pandas, we can do all the things we can do in a spreadsheet, but:
- Automatically
- Repeatedly
- Consistently
Basically, we never have to find a button again, but instead we have to learn functions from code documentation.

Spreadsheet Usage

I mostly (used to) use spreadsheets to make charts.
pandas works very well with Matplotlib (better than NumPy does, I think)
In the NumPy tutorial, pandas and Matplotlib are the two other Python packages mentioned, to read spreadsheets and plot them, respectively.

Why not pandas?

After years undisputed prominence as the best data library, pandas has recently received a challenger called “Polars” which is situationally faster and rapidly gaining popularity.
I expect Polars to take over in cloud computing but not in scientific computing.
Polars uses neither NumPy not Matplotlib, but plots with Altair, which I found insuitable for scientific applications.

Relevance

We have only calculated and plotted data we have typed in ourselves!
Yuck!

Install

pip again

Just like NumPy and Matplotlib, pandas is a Python package which we install via pip

python3 -m pip install pandas

That might take a moment, when it does we can check it worked!

Two First Steps

There are two great ways to get a pandas DataFrame.
We quickly show both.

import numpy as np
import pandas as pd

From NumPy

taxes = np.array([
    [9275, .1],
    [37650, .15],
    [91150, .25],
    [190150, .28],
    [413350, .33],
    [415051, .35]
])
df = pd.DataFrame(taxes) # use "df" for dataframes by convention
df # You'll notice this may look a lot nicer

	0	1
0	9275.0	0.10
1	37650.0	0.15
2	91150.0	0.25
3	190150.0	0.28
4	413350.0	0.33
5	415051.0	0.35

Get a File

I will use some nuclear data.
Usually data will come from your research group or experiments.
Often stored as a CSV.
- A “.csv” file, for “comma separated value”
We’ll use “Periodic Table of Elements.csv”
- Spaces in maes are annoying; we’ll manage.
It’s from here

From a URL

You can use files on your computer, or…
From a url.
But the url must be the address of the file
- Not a page that talks about the file.
- Not a page with the same data but presented in a pretty table.

“Raw” Files

On GitHub (common place to keep files) these are called “raw” files.
- https://gist.githubusercontent.com/GoodmanSciences/c2dd862cd38f21b0ad36b8f96b4bf1ee/raw/1d92663004489a5b6926e944c1b3d9ec5c40900e/Periodic%2520Table%2520of%2520Elements.csv
Note - that ends in .csv

From File

df = pd.read_csv("https://gist.githubusercontent.com/GoodmanSciences/c2dd862cd38f21b0ad36b8f96b4bf1ee/raw/1d92663004489a5b6926e944c1b3d9ec5c40900e/Periodic%2520Table%2520of%2520Elements.csv")
df

	AtomicNumber	Element	Symbol	AtomicMass	NumberofNeutrons	NumberofProtons	NumberofElectrons	Period	Group	Phase	...	FirstIonization	Density	MeltingPoint	BoilingPoint	NumberOfIsotopes	Discoverer	Year	SpecificHeat	NumberofShells	NumberofValence
0	1	Hydrogen	H	1.007	0	1	1	1	1.0	gas	...	13.5984	0.000090	14.175	20.28	3.0	Cavendish	1766.0	14.304	1	1.0
1	2	Helium	He	4.002	2	2	2	1	18.0	gas	...	24.5874	0.000179	NaN	4.22	5.0	Janssen	1868.0	5.193	1	NaN
2	3	Lithium	Li	6.941	4	3	3	2	1.0	solid	...	5.3917	0.534000	453.850	1615.00	5.0	Arfvedson	1817.0	3.582	2	1.0
3	4	Beryllium	Be	9.012	5	4	4	2	2.0	solid	...	9.3227	1.850000	1560.150	2742.00	6.0	Vaulquelin	1798.0	1.825	2	2.0
4	5	Boron	B	10.811	6	5	5	2	13.0	solid	...	8.2980	2.340000	2573.150	4200.00	6.0	Gay-Lussac	1808.0	1.026	2	3.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
113	114	Flerovium	Fl	289.000	175	114	114	7	14.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1999.0	NaN	7	4.0
114	115	Moscovium	Mc	288.000	173	115	115	7	15.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	5.0
115	116	Livermorium	Lv	292.000	176	116	116	7	16.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2000.0	NaN	7	6.0
116	117	Tennessine	Ts	295.000	178	117	117	7	17.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	7.0
117	118	Oganesson	Og	294.000	176	118	118	7	18.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2006.0	NaN	7	8.0

118 rows × 28 columns

DataFrames

DataFrames and Arrays

DataFrames:
- Have row and column labels
- Can store multiple types of data (strings/words, integers, and floating point numbers)
Arrays:
- Don’t
- Can’t

Subsetting

One of the most common uses for pandas is to subset, to look at part of DataFrame but not all.
Usually by looking at a specific column or columns - by name.
- Vs. arrays and lists, DataFrames still have indices, but they are now usually strings.

df["Element"]

0         Hydrogen
1           Helium
2          Lithium
3        Beryllium
4            Boron
          ...     
113      Flerovium
114      Moscovium
115    Livermorium
116     Tennessine
117      Oganesson
Name: Element, Length: 118, dtype: object

Multiple Indices

We can also subset multiple indices using a list of indices (column names).

df[["Element","Symbol"]]

	Element	Symbol
0	Hydrogen	H
1	Helium	He
2	Lithium	Li
3	Beryllium	Be
4	Boron	B
...	...	...
113	Flerovium	Fl
114	Moscovium	Mc
115	Livermorium	Lv
116	Tennessine	Ts
117	Oganesson	Og

118 rows × 2 columns

Filtering

We can subset rows by filtering
Basically we write what would be an if statement, but use it as an index.
The “transuranic” elements are elements higher than number 92.

df[df["AtomicNumber"] > 92]

	AtomicNumber	Element	Symbol	AtomicMass	NumberofNeutrons	NumberofProtons	NumberofElectrons	Period	Group	Phase	...	FirstIonization	Density	MeltingPoint	BoilingPoint	NumberOfIsotopes	Discoverer	Year	SpecificHeat	NumberofShells	NumberofValence
92	93	Neptunium	Np	237.0	144	93	93	7	NaN	artificial	...	6.2657	20.5	913.15	4273.0	153.0	McMillan and Abelson	1940.0	NaN	7	NaN
93	94	Plutonium	Pu	244.0	150	94	94	7	NaN	artificial	...	6.0262	19.8	913.15	3501.0	163.0	Seaborg et al.	1940.0	NaN	7	NaN
94	95	Americium	Am	243.0	148	95	95	7	NaN	artificial	...	5.9738	13.7	1267.15	2880.0	133.0	Seaborg et al.	1944.0	NaN	7	NaN
95	96	Curium	Cm	247.0	151	96	96	7	NaN	artificial	...	5.9915	13.5	1340.15	3383.0	133.0	Seaborg et al.	1944.0	NaN	7	NaN
96	97	Berkelium	Bk	247.0	150	97	97	7	NaN	artificial	...	6.1979	14.8	1259.15	983.0	83.0	Seaborg et al.	1949.0	NaN	7	NaN
97	98	Californium	Cf	251.0	153	98	98	7	NaN	artificial	...	6.2817	15.1	1925.15	1173.0	123.0	Seaborg et al.	1950.0	NaN	7	NaN
98	99	Einsteinium	Es	252.0	153	99	99	7	NaN	artificial	...	6.4200	13.5	1133.15	NaN	123.0	Ghiorso et al.	1952.0	NaN	7	NaN
99	100	Fermium	Fm	257.0	157	100	100	7	NaN	artificial	...	6.5000	NaN	NaN	NaN	103.0	Ghiorso et al.	1953.0	NaN	7	NaN
100	101	Mendelevium	Md	258.0	157	101	101	7	NaN	artificial	...	6.5800	NaN	NaN	NaN	33.0	Ghiorso et al.	1955.0	NaN	7	NaN
101	102	Nobelium	No	259.0	157	102	102	7	NaN	artificial	...	6.6500	NaN	NaN	NaN	73.0	Ghiorso et al.	1958.0	NaN	7	NaN
102	103	Lawrencium	Lr	262.0	159	103	103	7	NaN	artificial	...	NaN	NaN	NaN	NaN	203.0	Ghiorso et al.	1961.0	NaN	7	NaN
103	104	Rutherfordium	Rf	261.0	157	104	104	7	4.0	artificial	...	NaN	18.1	NaN	NaN	NaN	Ghiorso et al.	1969.0	NaN	7	NaN
104	105	Dubnium	Db	262.0	157	105	105	7	5.0	artificial	...	NaN	39.0	NaN	NaN	NaN	Ghiorso et al.	1970.0	NaN	7	NaN
105	106	Seaborgium	Sg	266.0	160	106	106	7	6.0	artificial	...	NaN	35.0	NaN	NaN	NaN	Ghiorso et al.	1974.0	NaN	7	NaN
106	107	Bohrium	Bh	264.0	157	107	107	7	7.0	artificial	...	NaN	37.0	NaN	NaN	NaN	Armbruster and M�nzenberg	1981.0	NaN	7	NaN
107	108	Hassium	Hs	267.0	159	108	108	7	8.0	artificial	...	NaN	41.0	NaN	NaN	NaN	Armbruster and M�nzenberg	1983.0	NaN	7	NaN
108	109	Meitnerium	Mt	268.0	159	109	109	7	9.0	artificial	...	NaN	35.0	NaN	NaN	NaN	GSI, Darmstadt, West Germany	1982.0	NaN	7	NaN
109	110	Darmstadtium	Ds	271.0	161	110	110	7	10.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1994.0	NaN	7	NaN
110	111	Roentgenium	Rg	272.0	161	111	111	7	11.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1994.0	NaN	7	NaN
111	112	Copernicium	Cn	285.0	173	112	112	7	12.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1996.0	NaN	7	NaN
112	113	Nihonium	Nh	284.0	171	113	113	7	13.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2004.0	NaN	7	3.0
113	114	Flerovium	Fl	289.0	175	114	114	7	14.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1999.0	NaN	7	4.0
114	115	Moscovium	Mc	288.0	173	115	115	7	15.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	5.0
115	116	Livermorium	Lv	292.0	176	116	116	7	16.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2000.0	NaN	7	6.0
116	117	Tennessine	Ts	295.0	178	117	117	7	17.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	7.0
117	118	Oganesson	Og	294.0	176	118	118	7	18.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2006.0	NaN	7	8.0

26 rows × 28 columns

Double `df`

We use df twice in that expression:
- Once to check the df if some value is greater than 92
- Once to look up all elements for which the value is greater.

df["AtomicNumber"] > 92

0      False
1      False
2      False
3      False
4      False
       ...  
113     True
114     True
115     True
116     True
117     True
Name: AtomicNumber, Length: 118, dtype: bool

.columns

By the way, what all do we know?

df.columns

Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'NumberofProtons', 'NumberofElectrons', 'Period', 'Group', 'Phase',
       'Radioactive', 'Natural', 'Metal', 'Nonmetal', 'Metalloid', 'Type',
       'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
       'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
       'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence'],
      dtype='object')

Oh that’s a lot of information!

How big?

How big is df?

df.shape

(118, 28)

And by the way, np.array’s have shape too.

taxes.shape

(6, 2)

Aside: Multiple values within a () are asically lists but called tuples.

type(taxes.shape)

tuple

Locating Entries

Sometimes you want to use an integer index.

df.iloc[10] # "Integer Location"

AtomicNumber                   11
Element                    Sodium
Symbol                         Na
AtomicMass                  22.99
NumberofNeutrons               12
NumberofProtons                11
NumberofElectrons              11
Period                          3
Group                         1.0
Phase                       solid
Radioactive                   NaN
Natural                       yes
Metal                         yes
Nonmetal                      NaN
Metalloid                     NaN
Type                 Alkali Metal
AtomicRadius                  2.2
Electronegativity            0.93
FirstIonization            5.1391
Density                     0.971
MeltingPoint               371.15
BoilingPoint               1156.0
NumberOfIsotopes              7.0
Discoverer                   Davy
Year                       1807.0
SpecificHeat                1.228
NumberofShells                  3
NumberofValence               1.0
Name: 10, dtype: object

Zero/One index

DataFrames are 0-indexed

df["AtomicNumber"].iloc[0]

np.int64(1)

Elements are 1-indexed¹ with good reason.

df[["AtomicNumber","NumberofProtons"]].head(1)

	AtomicNumber	NumberofProtons
0	1	1

Data Manipulation

Check out these columns

df.iloc[::20][["NumberofNeutrons", "NumberofProtons"]]

	NumberofNeutrons	NumberofProtons
0	0	1
20	24	21
40	52	41
60	84	61
80	123	81
100	157	101

Pattern

It looks like atoms with fewer protons have the same number of protons and neutrons, and atoms with many protons have many more neutrons.
Let’s calculate the difference.

df["NumberofProtons"] - df["NumberofNeutrons"]

0       1
1       0
2      -1
3      -1
4      -1
       ..
113   -61
114   -58
115   -60
116   -61
117   -58
Length: 118, dtype: int64

Plots

Visualize

Easier to visualize.

# import Matplotlib if you don't have it yet!
import matplotlib.pyplot as plt

Scatterplot

plt.scatter(df["NumberofProtons"], df["NumberofNeutrons"])

Insights

A lot of science, as far as I can tell is finding novel insights.
I noticed that boiling/melting point seemed to go up and down at some intervals.
I recall that the number of outermost electrons did that too…
Let’s make num_es again, from last time.

`num_es`

es = np.array([2, 8, 8, 18, 18, 32, 32])
num_es = np.array([]) # The first zero elements
for e in es:
    num_es = np.append(num_es, np.arange(e))
num_es += 1
num_es

array([ 1.,  2.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  1.,  2.,  3.,
        4.,  5.,  6.,  7.,  8.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,
        9., 10., 11., 12., 13., 14., 15., 16., 17., 18.,  1.,  2.,  3.,
        4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16.,
       17., 18.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.,
       12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24.,
       25., 26., 27., 28., 29., 30., 31., 32.,  1.,  2.,  3.,  4.,  5.,
        6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18.,
       19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31.,
       32.])

Expanding DataFrames

We can create a new column and add new data to it from a np.array
At least if the dimensions work out…

df.shape, num_es.shape

((118, 28), (118,))

We have 118 elements, so it is good we have 118 possible outermost electron counts!

Update list

Like updating the element of a list by index…

colors = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
colors[-1] = "purple"
colors

['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'purple']

Update DataFrame

We can update or add a column to a DataFrame by index:

df["OutermostElectrons"] = num_es
df.iloc[::20][["Element","OutermostElectrons"]]

	Element	OutermostElectrons
0	Hydrogen	1.0
20	Scandium	3.0
40	Niobium	5.0
60	Promethium	7.0
80	Thallium	27.0
100	Mendelevium	15.0

Patterns

For a scatter, that is two (2) DataFrames (or subsets), not one DataFrame with two columns

plt.scatter(df["OutermostElectrons"], df["MeltingPoint"])

3D

Third dimension via s= for size

plt.scatter(df["OutermostElectrons"], df["MeltingPoint"], s=df["AtomicNumber"])

Columns to Columns

Using vector operations, we can make novel columns from existing columns.

df["NeutronsLessProtons"] = df["NumberofNeutrons"] - df["NumberofProtons"]
plt.plot(df["NeutronsLessProtons"])

4D

plt.scatter(x=df["OutermostElectrons"], # we can optional specify x and y 
            y=df["MeltingPoint"], 
            s=df["AtomicNumber"], 
            c=df["NeutronsLessProtons"]) # color
plt.colorbar() # Like a legend for colors.

“kwargs”

When we call a function and use something like x= in the arguments, x is a keyword so we call these “kwargs” for “keyword arguments”.
You may see this when looking up functions.
Just follow examples.

def triple(x):
    return 3 * x

[triple(x=7), triple(7)]

[21, 21]

Dropping

Some columns may be useless or redundant.

df.iloc[::20][["AtomicNumber","NumberofProtons","NumberofElectrons"]]

	AtomicNumber	NumberofProtons	NumberofElectrons
0	1	1	1
20	21	21	21
40	41	41	41
60	61	61	61
80	81	81	81
100	101	101	101

Drop it

We can remove such columns.

df.shape

(118, 30)

# Have to specify columns (we can also drop rows)
df = df.drop(columns=["NumberofProtons","NumberofElectrons"])
df.shape

(118, 28)

Summaries

Summary Statistics

There’s a variety of ways to describe a DataFrame other than just seeing all of it’s data.

df.describe()

	AtomicNumber	AtomicMass	NumberofNeutrons	Period	Group	AtomicRadius	Electronegativity	FirstIonization	Density	MeltingPoint	BoilingPoint	NumberOfIsotopes	Year	SpecificHeat	NumberofShells	NumberofValence	OutermostElectrons	NeutronsLessProtons
count	118.000000	118.000000	118.000000	118.000000	90.000000	86.000000	96.000000	102.000000	105.000000	98.000000	98.000000	103.000000	107.000000	85.000000	118.000000	49.000000	118.000000	118.000000
mean	59.500000	145.988297	86.483051	5.254237	9.944444	1.825814	1.695000	7.988505	9.232161	1281.475184	2513.143163	28.116505	1865.280374	0.635976	5.254237	4.428571	12.483051	26.983051
std	34.207699	88.954899	54.785320	1.618200	5.597674	0.611058	0.621174	3.334571	8.630406	903.685175	1601.901036	35.864205	97.951740	1.653965	1.618200	2.345208	8.830535	20.782547
min	1.000000	1.007000	0.000000	1.000000	1.000000	0.490000	0.700000	3.893900	0.000090	14.175000	4.220000	3.000000	1250.000000	0.094000	1.000000	1.000000	1.000000	-1.000000
25%	30.250000	66.465750	36.000000	4.000000	5.000000	1.425000	1.237500	6.004850	2.700000	510.695000	1069.000000	11.000000	1803.500000	0.168000	4.000000	2.000000	5.000000	6.500000
50%	59.500000	142.575000	83.000000	6.000000	10.500000	1.800000	1.585000	6.960250	7.290000	1204.150000	2767.000000	19.000000	1878.000000	0.244000	6.000000	4.000000	11.000000	24.500000
75%	88.750000	226.750000	138.000000	7.000000	15.000000	2.200000	2.062500	8.964925	12.000000	1811.150000	3596.750000	24.000000	1940.000000	0.489000	7.000000	6.000000	18.000000	49.750000
max	118.000000	295.000000	178.000000	7.000000	18.000000	3.300000	3.980000	24.587400	41.000000	3948.150000	5869.000000	203.000000	2010.000000	14.304000	7.000000	8.000000	32.000000	61.000000

Info

Info is often helpful

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   AtomicNumber         118 non-null    int64  
 1   Element              118 non-null    object 
 2   Symbol               118 non-null    object 
 3   AtomicMass           118 non-null    float64
 4   NumberofNeutrons     118 non-null    int64  
 5   Period               118 non-null    int64  
 6   Group                90 non-null     float64
 7   Phase                118 non-null    object 
 8   Radioactive          37 non-null     object 
 9   Natural              90 non-null     object 
 10  Metal                92 non-null     object 
 11  Nonmetal             19 non-null     object 
 12  Metalloid            7 non-null      object 
 13  Type                 115 non-null    object 
 14  AtomicRadius         86 non-null     float64
 15  Electronegativity    96 non-null     float64
 16  FirstIonization      102 non-null    float64
 17  Density              105 non-null    float64
 18  MeltingPoint         98 non-null     float64
 19  BoilingPoint         98 non-null     float64
 20  NumberOfIsotopes     103 non-null    float64
 21  Discoverer           109 non-null    object 
 22  Year                 107 non-null    float64
 23  SpecificHeat         85 non-null     float64
 24  NumberofShells       118 non-null    int64  
 25  NumberofValence      49 non-null     float64
 26  OutermostElectrons   118 non-null    float64
 27  NeutronsLessProtons  118 non-null    int64  
dtypes: float64(13), int64(5), object(10)
memory usage: 25.9+ KB

Columns

And remember we can see all the column names!
No ()!

df.columns

Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'Period', 'Group', 'Phase', 'Radioactive', 'Natural', 'Metal',
       'Nonmetal', 'Metalloid', 'Type', 'AtomicRadius', 'Electronegativity',
       'FirstIonization', 'Density', 'MeltingPoint', 'BoilingPoint',
       'NumberOfIsotopes', 'Discoverer', 'Year', 'SpecificHeat',
       'NumberofShells', 'NumberofValence', 'OutermostElectrons',
       'NeutronsLessProtons'],
      dtype='object')

Shape

And of course shape!

df.shape

(118, 28)

Column-wise

We can do all kinds of operations over a single column.

mp = df["MeltingPoint"] # Copy column to save some typing.
[mp.min(), mp.max(), mp.mean(), mp.median(), mp.mode(), mp.std()]

[np.float64(14.175),
 np.float64(3948.15),
 np.float64(1281.4751836734692),
 np.float64(1204.15),
 0     913.15
 1    1204.15
 Name: MeltingPoint, dtype: float64,
 np.float64(903.6851752147392)]

Using `loc`

Metal or not

We helpfully can already see what metals are.

df.iloc[15:25:2][["Element","Metal"]]

	Element	Metal
15	Sulfur	NaN
17	Argon	NaN
19	Calcium	yes
21	Titanium	yes
23	Chromium	yes

NaN

NaN means “not a number” and is used by pandas (not generally by Python) to fill in missing data.

np.sqrt(-1)

C:\Users\cd-desk\AppData\Local\Temp\ipykernel_12296\3438155168.py:1: RuntimeWarning:

invalid value encountered in sqrt

np.float64(nan)

I find this idea a bit odd (and e.g. Polars does not do this) but we can work with it.

(Non)metal(loid)

Elements are “metal”, “nonmetal”, or “metalloid”.
I regard this as how metallic an element is.

df.iloc[12:16][["Element","Metal","Nonmetal","Metalloid"]]

	Element	Metal	Nonmetal	Metalloid
12	Aluminum	yes	NaN	NaN
13	Silicon	NaN	NaN	yes
14	Phosphorus	NaN	yes	NaN
15	Sulfur	NaN	yes	NaN

Metallic

We introduce and use np.where, which takes
- A condition
- A value if condition
- A value else

df["Metallic"] = np.where(df["Metal"] == "yes", "Metal", "Nonmetal")
df.iloc[12:16][["Metal","Nonmetal","Metalloid","Metallic"]]

	Metal	Nonmetal	Metalloid	Metallic
12	yes	NaN	NaN	Metal
13	NaN	NaN	yes	Nonmetal
14	NaN	yes	NaN	Nonmetal
15	NaN	yes	NaN	Nonmetal

Metalloids

Additionally deal with metalloids.

df[df["Metalloid"] == "yes"]["Metallic"] = "Metalloid"
df.iloc[12:16][["Metal","Nonmetal","Metalloid","Metallic"]]

C:\Users\cd-desk\AppData\Local\Temp\ipykernel_12296\1792942652.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	Metal	Nonmetal	Metalloid	Metallic
12	yes	NaN	NaN	Metal
13	NaN	NaN	yes	Nonmetal
14	NaN	yes	NaN	Nonmetal
15	NaN	yes	NaN	Nonmetal

Whoops!

`loc`

df[df["Metalloid"] == "yes"] is not the same dataframe as df
Changes to it will not change df.
We use .loc, the cousin of .iloc to update df.
We must also very carefully put all indices, comma separated, in single brackets.
- Row and column, together, within []

Recall: NumPy

I recommend using the comma notation.
Otherwise I get unexpected behavior.

# reverse in both dimensions
arr[::-1][::-1]

array([[9.27500e+03, 1.00000e-01],
       [3.76500e+04, 1.50000e-01],
       [9.11500e+04, 2.50000e-01],
       [1.90150e+05, 2.80000e-01],
       [4.13350e+05, 3.30000e-01],
       [4.15051e+05, 3.50000e-01]])

# reverse in both dimensions
arr[::-1,::-1]

array([[3.50000e-01, 4.15051e+05],
       [3.30000e-01, 4.13350e+05],
       [2.80000e-01, 1.90150e+05],
       [2.50000e-01, 9.11500e+04],
       [1.50000e-01, 3.76500e+04],
       [1.00000e-01, 9.27500e+03]])

Takeaway: Always use [x,y] instead of [x][y]
This is why we use ~~NumPy~~pandas!

Update

df.loc[df["Metalloid"] == "yes", "Metallic"] = "Metalloid"
df.iloc[12:16][["Metal","Nonmetal","Metalloid","Metallic"]]

	Metal	Nonmetal	Metalloid	Metallic
12	yes	NaN	NaN	Metal
13	NaN	NaN	yes	Metalloid
14	NaN	yes	NaN	Nonmetal
15	NaN	yes	NaN	Nonmetal

Much nicer.

Another drop

Let’s just keep “Metallic” now that we have it.

# Have to specify columns (we can also drop rows)
df = df.drop(columns=["Metal","Nonmetal","Metalloid"])
df.columns

Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'Period', 'Group', 'Phase', 'Radioactive', 'Natural', 'Type',
       'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
       'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
       'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence',
       'OutermostElectrons', 'NeutronsLessProtons', 'Metallic'],
      dtype='object')

Groups

On groups

In chemistry:

a group (also known as a family) is a column of elements in the periodic table of the chemical elements.

In computing

a set together with a binary operation satisfying certain algebraic conditions.

On groupby

In pandas

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Element Groups

df.iloc[:60:10][["Element","Group"]]

	Element	Group
0	Hydrogen	1.0
10	Sodium	1.0
20	Scandium	3.0
30	Gallium	13.0
40	Niobium	5.0
50	Antimony	15.0

Counting

Let’s see how many elements are in each group!

df.groupby("Group").count()

	AtomicNumber	Element	Symbol	AtomicMass	NumberofNeutrons	Period	Phase	Radioactive	Natural	Type	...	BoilingPoint	NumberOfIsotopes	Discoverer	Year	SpecificHeat	NumberofShells	NumberofValence	OutermostElectrons	NeutronsLessProtons	Metallic
Group
1.0	7	7	7	7	7	7	7	1	7	7	...	7	7	7	7	6	7	7	7	7	7
2.0	6	6	6	6	6	6	6	1	6	6	...	6	6	6	6	5	6	6	6	6	6
3.0	4	4	4	4	4	4	4	1	4	4	...	4	4	4	4	4	4	0	4	4	4
4.0	4	4	4	4	4	4	4	1	3	4	...	3	3	4	4	3	4	0	4	4	4
5.0	4	4	4	4	4	4	4	1	3	4	...	3	3	4	4	3	4	0	4	4	4
6.0	4	4	4	4	4	4	4	1	3	4	...	3	3	4	4	3	4	0	4	4	4
7.0	4	4	4	4	4	4	4	2	2	4	...	3	3	4	4	2	4	0	4	4	4
8.0	4	4	4	4	4	4	4	1	3	4	...	3	3	4	3	3	4	0	4	4	4
9.0	4	4	4	4	4	4	4	1	3	4	...	3	3	4	4	3	4	0	4	4	4
10.0	4	4	4	4	4	4	4	1	3	4	...	3	3	3	4	3	4	0	4	4	4
11.0	4	4	4	4	4	4	4	1	3	4	...	3	3	3	1	3	4	0	4	4	4
12.0	4	4	4	4	4	4	4	1	3	4	...	3	3	3	2	3	4	0	4	4	4
13.0	6	6	6	6	6	6	6	1	5	5	...	5	5	5	6	5	6	6	6	6	6
14.0	6	6	6	6	6	6	6	1	5	6	...	5	5	5	3	5	6	6	6	6	6
15.0	6	6	6	6	6	6	6	1	5	5	...	5	5	5	5	5	6	6	6	6	6
16.0	6	6	6	6	6	6	6	2	5	6	...	5	5	5	5	4	6	6	6	6	6
17.0	6	6	6	6	6	6	6	2	5	5	...	5	5	5	6	4	6	6	6	6	6
18.0	7	7	7	7	7	7	7	2	6	7	...	6	6	6	7	6	7	6	7	7	7

18 rows × 25 columns

Goofy

Oh it took the count of all columns.
But… that isn’t useless?
Let’s perhaps look a melting point by group.

df.groupby("Group")["MeltingPoint"].mean()

Group
1.0      298.616429
2.0     1102.150000
3.0     1531.900000
4.0     2186.150000
5.0     2728.483333
6.0     2900.150000
7.0     2481.816667
8.0     2543.816667
9.0     2241.150000
10.0    1865.483333
11.0    1309.876667
12.0     507.213333
13.0     963.304000
14.0    1589.742000
15.0     583.882000
16.0     436.622000
17.0     290.758000
18.0     117.638600
Name: MeltingPoint, dtype: float64

Plot it

That was too hard to see.

df.groupby("Group")["MeltingPoint"].mean().plot()

`groupby`

Groupby is a cool and powerful tool.
I tend to use it with summary statistics, and you can usually work around it with filters, but with a bit of practice it’s a real joy to work with.

Merge

Names

An astute learning will note a few oddities with names and symbols in the table.
Some symbols and elements start with different letters.

Strings

Like lists, arrays, and DataFrames, we can use indices on strings of characters.

ele = df.iloc[1]
symbol = ele["Symbol"]
element = ele["Element"]
symbol[0] == element[0], symbol, element

(True, 'He', 'Helium')

Differing

Sometimes the first letter differs - but not often.

ele = df.iloc[18]
symbol = ele["Symbol"]
element = ele["Element"]
symbol[0] == element[0], symbol, element

(False, 'K', 'Potassium')

Write a function

We make a function that checks if two things start with the same letter.

def differ(x,y):
    # != means "does not equal"
    return x[0] != y[0]

Vectorize

To apply to DataFrames, we must vectorize it.
- x[0] of a DataFrame is a row
- x[0] of a string is a letter
- We wanted letters!

diff_vector = np.vectorize(differ)

The Elements

df[diff_vector(df["Symbol"], df["Element"])]["Element"]

10       Sodium
18    Potassium
25         Iron
46       Silver
49          Tin
50     Antimony
78         Gold
79      Mercury
81         Lead
Name: Element, dtype: object

Basically, these elements have archaic Latin names that were in use when symbols were set before developing their modern names, or something. Read more

Adding Names

Let’s make another DataFrame
- We’ll call if ad for “artus datus”, which might (I have no idea) be Latin for data frame.

names = [
    ["Sodium", "Natrium"],
    ["Potassium", "Kalium"],
    ["Iron", "Ferrum"],
    ["Silver", "Argentum"],
    ["Tin", "Stannum"],
    ["Antimony", "Stibium"],
    ["Tungsten", "Wolfram"],
    ["Gold", "Aurum"],
    ["Mercury", "Hydrargyrum"],
    ["Lead", "Plumbum"]
]

To DataFrame

ad for “artus datus”, which may be Latin for data frame.

ad = pd.DataFrame(names)
ad

	0	1
0	Sodium	Natrium
1	Potassium	Kalium
2	Iron	Ferrum
3	Silver	Argentum
4	Tin	Stannum
5	Antimony	Stibium
6	Tungsten	Wolfram
7	Gold	Aurum
8	Mercury	Hydrargyrum
9	Lead	Plumbum

Column Names

Remember .columns?

df.columns

Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'Period', 'Group', 'Phase', 'Radioactive', 'Natural', 'Type',
       'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
       'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
       'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence',
       'OutermostElectrons', 'NeutronsLessProtons', 'Metallic'],
      dtype='object')

We can actually use = with .columns

ad.columns = ["Element", "Latin"] 
ad.iloc[0]

Element     Sodium
Latin      Natrium
Name: 0, dtype: object

Merge

df.merge(ad)

	AtomicNumber	Element	Symbol	AtomicMass	NumberofNeutrons	Period	Group	Phase	Radioactive	Natural	...	NumberOfIsotopes	Discoverer	Year	SpecificHeat	NumberofShells	NumberofValence	OutermostElectrons	NeutronsLessProtons	Metallic	Latin
0	11	Sodium	Na	22.990	12	3	1.0	solid	NaN	yes	...	7.0	Davy	1807.0	1.228	3	1.0	1.0	1	Metal	Natrium
1	19	Potassium	K	39.098	20	4	1.0	solid	NaN	yes	...	10.0	Davy	1807.0	0.757	4	1.0	1.0	1	Metal	Kalium
2	26	Iron	Fe	55.845	30	4	8.0	solid	NaN	yes	...	10.0	Prehistoric	NaN	0.449	4	NaN	8.0	4	Metal	Ferrum
3	47	Silver	Ag	107.868	61	5	11.0	solid	NaN	yes	...	27.0	Prehistoric	NaN	0.235	5	NaN	11.0	14	Metal	Argentum
4	50	Tin	Sn	118.710	69	5	14.0	solid	NaN	yes	...	28.0	Prehistoric	NaN	0.228	5	4.0	14.0	19	Metal	Stannum
5	51	Antimony	Sb	121.760	71	5	15.0	solid	NaN	yes	...	29.0	Early historic times	NaN	0.207	5	5.0	15.0	20	Metalloid	Stibium
6	79	Gold	Au	196.967	118	6	11.0	solid	NaN	yes	...	21.0	Prehistoric	NaN	0.129	6	NaN	25.0	39	Metal	Aurum
7	80	Mercury	Hg	200.590	121	6	12.0	liq	NaN	yes	...	26.0	Prehistoric	NaN	0.140	6	NaN	26.0	41	Metal	Hydrargyrum
8	82	Lead	Pb	207.200	125	6	14.0	solid	NaN	yes	...	29.0	Prehistoric	NaN	0.129	6	4.0	28.0	43	Metal	Plumbum

9 rows × 27 columns

Oh no!

So we successfully added Latin names, but…
We lost all elements without a Latin name.
The merge only kept elements in both df and ad

Join

Don’t merge

In my experience, merges never work, but…
They explain how joins, the good thing, work.

Background

To make learning joins easier, we’ll work with two DataFrames.

metalloids = df[df["Metallic"] == "Metalloid"]
latin_name = df.merge(ad)

Index

Versus merge, which just figures it out, join likes to look at indices.
So we have to make sure we have the same index on both data frames.
We’ll use "Symbol"

metalloids.index = metalloids["Symbol"]
latin_name.index = latin_name["Symbol"]

Simplify

To make things easier on us, let’s just only look at a few columns.
We will avoid having two columns of the same name.
- This is manageable, but let’s learn joins first.

metalloids = metalloids[["AtomicNumber","Metallic"]]
latin_name = latin_name[["Element","Latin"]]

The df’s

metalloids

	AtomicNumber	Metallic
Symbol
B	5	Metalloid
Si	14	Metalloid
Ge	32	Metalloid
As	33	Metalloid
Sb	51	Metalloid
Te	52	Metalloid
Po	84	Metalloid

latin_name

	Element	Latin
Symbol
Na	Sodium	Natrium
K	Potassium	Kalium
Fe	Iron	Ferrum
Ag	Silver	Argentum
Sn	Tin	Stannum
Sb	Antimony	Stibium
Au	Gold	Aurum
Hg	Mercury	Hydrargyrum
Pb	Lead	Plumbum

Visuals

I will also visualize this using venn diagrams.
To my knowledge, there is no graceful way to visualize venn diagrams, but matplotlib-venn is okay.
I’m spoiler-marking code that shows how I make the venn diagrams.

Code

from matplotlib_venn import venn2, venn2_circles

def show_join(how, left, right, mid):
    v = venn2((2, 2, 1), ('metalloids', 'latin_name'))
    v.get_patch_by_id('100').set_color('darkblue' if left else 'white')
    v.get_patch_by_id('010').set_color('darkblue' if right else 'white')
    v.get_patch_by_id('110').set_color('darkblue' if mid else 'white')
    v.get_patch_by_id('100').set_alpha(1.0)
    v.get_patch_by_id('010').set_alpha(1.0)
    v.get_patch_by_id('110').set_alpha(1.0)
    venn2_circles((2, 2, 1))
    for idx, subset in enumerate(v.subset_labels):
        v.subset_labels[idx].set_visible(False)
        plt.title("metalloids.join(latin_name, how=\"" + how + "\")")
    plt.show()

Quoth pandas

how: {‘left’,‘right’,‘outer’,‘inner’,‘cross’}

How to handle operation of the two objects.
- 'left': use calling frame’s index
- 'right': use other’s index.
- 'outer': form union of calling frame’s index with other’s index.
- 'inner': form intersection of calling frame’s index with other’s index.
- 'cross': creates the cartesian product from both frames.

Join

Join uses all columns, filling with “NaN”.

metalloids.join(latin_name)

	AtomicNumber	Metallic	Element	Latin
Symbol
B	5	Metalloid	NaN	NaN
Si	14	Metalloid	NaN	NaN
Ge	32	Metalloid	NaN	NaN
As	33	Metalloid	NaN	NaN
Sb	51	Metalloid	Antimony	Stibium
Te	52	Metalloid	NaN	NaN
Po	84	Metalloid	NaN	NaN

Left

metalloids.join(latin_name, how='left')["Element"]

Symbol
B          NaN
Si         NaN
Ge         NaN
As         NaN
Sb    Antimony
Te         NaN
Po         NaN
Name: Element, dtype: object

show_join('left',1,0,1)

Right

metalloids.join(latin_name, how='right')["Element"]

Symbol
Na       Sodium
K     Potassium
Fe         Iron
Ag       Silver
Sn          Tin
Sb     Antimony
Au         Gold
Hg      Mercury
Pb         Lead
Name: Element, dtype: object

show_join('right',0,1,1)

Outer

metalloids.join(latin_name, how='outer')["Element"]

Symbol
Ag       Silver
As          NaN
Au         Gold
B           NaN
Fe         Iron
Ge          NaN
Hg      Mercury
K     Potassium
Na       Sodium
Pb         Lead
Po          NaN
Sb     Antimony
Si          NaN
Sn          Tin
Te          NaN
Name: Element, dtype: object

show_join('outer',1,1,1)

Inner

metalloids.join(latin_name, how='inner')["Element"]

Symbol
Sb    Antimony
Name: Element, dtype: object

show_join('inner',0,0,1)

Cross

Cross makes pairs.
I don’t see it used often.

metalloids.join(latin_name, how='cross')

	AtomicNumber	Metallic	Element	Latin
0	5	Metalloid	Sodium	Natrium
1	5	Metalloid	Potassium	Kalium
2	5	Metalloid	Iron	Ferrum
3	5	Metalloid	Silver	Argentum
4	5	Metalloid	Tin	Stannum
...	...	...	...	...
58	84	Metalloid	Tin	Stannum
59	84	Metalloid	Antimony	Stibium
60	84	Metalloid	Gold	Aurum
61	84	Metalloid	Mercury	Hydrargyrum
62	84	Metalloid	Lead	Plumbum

63 rows × 4 columns

Exercise

The Table

The periodic table was developed by laying out elements by “Groups” and “Periods”
Basically, number of outer electrons (that can bond) and number of layers of inner “shells” of electrons that can’t bond.
- Or something. I’m not a real scientist.
We can plot these against each other.

Electronegativity

Electronegativity is a measure of the tendency of an atom to attract a bonding pair of electrons. The Pauling scale is the most commonly used. Fluorine (the most electronegative element) is assigned a value of 4.0, and values range down to cesium and francium which are the least electronegative at 0.7.

Exercise

Plot the table, by plotting “Groups” vs “Periods”
Plot electronegativity using color.
Confirm the claims from the chemistry text about electronegativity trends.
Annotate the location of fluorine (F).

Aside: Inversion

There’s a few ways to invert an axis that you may want to use.

taxes = pd.DataFrame(taxes)
plt.scatter(taxes[0],taxes[1])

Supply a `-`

You can negate, but it impacts labels.

# Not there's a negative on taxes[1]
plt.scatter(taxes[0],-taxes[1])

Get axes

You can use .gca() (get current axes) and set one negative.

plt.scatter(taxes[0],taxes[1])
plt.gca().invert_yaxis()

Solution

Code

# Plot
plt.scatter(
    x=df["Group"],
    y=df["Period"],
    c=df["Electronegativity"]
)
plt.gca().invert_yaxis()
# Label
plt.title("Electronegativity")
plt.xlabel("Group")
plt.ylabel("Period")
plt.colorbar()
# Annotate
_ = plt.annotate("F",(17,2))

pandas

Why pandas?

What is pandas?

pandas

“like a spreadsheet”

Spreadsheet Usage

Why not pandas?

Relevance

Install

pip again

Two First Steps

From NumPy

Get a File

From a URL

“Raw” Files

From File

DataFrames

DataFrames and Arrays

Subsetting

Multiple Indices

Filtering

Double df

.columns

How big?

Locating Entries

Zero/One index

Data Manipulation

Pattern

Plots

Visualize

Scatterplot

Insights

num_es

Expanding DataFrames

Update list

Update DataFrame

Patterns

3D

Columns to Columns

4D

“kwargs”

Dropping

Drop it

Summaries

Summary Statistics

Info

Columns

Shape

Column-wise

Using loc

Metal or not

NaN

(Non)metal(loid)

Metallic

Metalloids

loc

Recall: NumPy

Update

Another drop

Groups

On groups

On groupby

Element Groups

Counting

Goofy

Plot it

groupby

Merge

Names

Strings

Differing

Write a function

Vectorize

The Elements

Adding Names

To DataFrame

Column Names

Merge

Oh no!

Join

Double `df`

`num_es`

Using `loc`

`loc`

`groupby`

Supply a `-`