SD321 (Fall 2024)

Data Types in R and Python

A good command of the R language and statistical principles and methods is a must for any data scientist. Learning R is not the purpose of this course, but we will continue to review a few R basics in this lecture, before putting the basics into practice. Part of the material for these lecture notes is based on Methods in Biostatistics with R and R tutorial.

R Discussion continued from the previous lecture

Remember that variables in R are not declared: they are created the first time they are assigned a value, just as in Python. To assign a value to a variable, either = or <- can be used. The type of a variable is determined by the type of the data assigned to it, and can change during the execution of the script. Use the class() or typeof() functions to find the type of a variable. paste() can be used to join multiple elements into one string, and print() writes a value to the output.

test = 5
print(type(test), "value: ", test)
test = "some string"
print(type(test), "value: ", test)

<class 'int'> value:  5
<class 'str'> value:  some string

test = 5
print(paste(class(test), "value: ", test))
test = "some string"
print(paste(class(test), "value: ", test))

[1] "numeric value:  5"
[1] "character value:  some string"

Types and type conversions: The basic data types in R are

numeric, used for doubles and floats (e.g. 5 or 5.5),
integer, used for integers (e.g. 5L),
complex, used for complex numbers with an imaginary part (e.g. 5 + 4i),
character, used for strings, which can be specified in single or double quotes,
logical, used for the booleans TRUE and FALSE.

Values can be "coerced" between types using functions such as as.numeric() and as.character().
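Since these notes compare R with Python, here is a minimal sketch of the Python counterparts to R's as.*() functions. The variable names are invented for this illustration, and the R equivalents named in the comments are given only for orientation.

```python
# Explicit type conversion in Python, roughly analogous to R's as.*() functions.
text_value = "3.5"             # a string that looks like a number

as_float = float(text_value)   # compare: as.numeric("3.5") in R
as_text = str(42)              # compare: as.character(42) in R
as_int = int(as_float)         # drops the fractional part, compare: as.integer(3.5)
as_bool = bool(0)              # zero becomes False, compare: as.logical(0)

print(type(as_float).__name__, as_float)
print(type(as_text).__name__, as_text)
print(type(as_int).__name__, as_int)
print(type(as_bool).__name__, as_bool)
```

As in R, a conversion that cannot succeed does not pass silently: float("abc") raises a ValueError in Python, while as.numeric("abc") returns NA with a warning in R.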

Let's continue our comparison and type discussion

For today's notes, we will continue to show the comparisons between Python and R. Python examples and their output come first, followed by the R versions. Please try these examples by cutting and pasting them into either the python3 or R interpreter.

Vectors are basically lists with all elements of the same type. There are many ways of creating a vector in R:

x = [1,4,6]
print(x)
print(x[0]) # Access the first element
len(x)      # Find the length of x

y = list(range(1,6)) # [1, 2, 3, 4, 5]
import numpy # we need numpy for arange (for floats)
z = numpy.arange(1, 7.5, 0.5).tolist() # .tolist() returns plain Python floats
print(z)

[1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]

x = c(1,4,6)
print(x)
print(x[1]) # Access the first element
length(x)

y = 1:5 # [1] 1 2 3 4 5

z = seq(1, 7, by = 0.5)
print(z)

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

Data frames in R, created with the function data.frame(), are similar to Python pandas data frames. The data in a data frame is organized in rows and columns, usually with named columns. Different columns can hold data of different types, but all values within a column have the same type.

You can think of a data frame as the rows and columns you would see in a spreadsheet or, later in this class, in a database table.

import pandas as pd
df = pd.DataFrame({
  'age':    [25, 30, 32, 42],
  'handed': ["left", "right", "ambidextrous", "left"]
})
print(df)

   age        handed
0   25          left
1   30         right
2   32  ambidextrous
3   42          left


df = data.frame(
  age    = c(25, 30, 32, 42),
  handed = c("left", "right", "ambidextrous", "left")
)
print(df)

  age       handed
1  25         left
2  30        right
3  32 ambidextrous
4  42         left

Remember that the index begins with 0 in Python but with 1 in R.

Other data structures: R also has lists, created with the function list(), which can contain elements of different data types; matrices, created with the function matrix(), which hold a 2-dimensional set of data; arrays, created with the function array(), which can have more than 2 dimensions; and factors, created with the function factor(), which are used to store categorical values.
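Factors have no built-in Python equivalent, but pandas offers a comparable structure. The sketch below assumes pandas.Categorical as the closest analog to factor(); the handedness values are reused from the data frame example above, and the variable name is chosen only for illustration.

```python
import pandas as pd

# pandas.Categorical plays a role similar to factor() in R: each distinct
# value is stored once as a category, and the data itself is kept as
# integer codes that point into the list of categories.
handed = pd.Categorical(["left", "right", "ambidextrous", "left"])

print(handed.categories.tolist())  # distinct categories, sorted alphabetically
print(handed.codes.tolist())       # one integer code per element (0-based)
```

Note that the codes are 0-based in pandas, whereas the integer codes behind an R factor are 1-based.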

import pandas as pd
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
my_matrix = pd.DataFrame([
    data[0:3],  # each inner list is one row, matching byrow = TRUE in R
    data[3:6],
    data[6:9]
])
print(my_matrix)


data = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
my_matrix = matrix(data, nrow = 3, ncol = 3, byrow = TRUE)




print(my_matrix)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Let's look at an example of the array function in R. Hopefully at this point you can start to see why R is a nice language for working with data.

import pandas as pd
import numpy as np
data = np.arange(1, 13)  # Equivalent to R's 1:12
my_array = data.reshape((3, 2, 2), order="F")  # column-major fill, as R's array() does
for i in range(my_array.shape[2]):
    df = pd.DataFrame(my_array[:, :, i])
    print(f"Layer {i+1}:\n{df}\n")



data = c(1:12)
my_array <- array(data, dim = c(3, 2, 2)) # a 3x2x2 array


print(my_array)

, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12

Subsetting - Extracting small relevant parts of large data.

Subsetting, or selecting only particular rows and columns from an R data structure, can be done using square brackets with criteria inside (a particular row, one or more columns, or conditions on which rows and columns to return) or, for data frames, using $ to select one of the columns. Here are some examples, with the output also shown:

import pandas as pd
# Create a pandas DataFrame
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 55000, 60000]
})

# Subset by column name
subset1 = df["Age"]  # Extracts the 'Age' column
print(subset1)

# Subset by row and column position
subset2 = df.iloc[0, 2]  # Extracts the value from the 1st row, 3rd column
print(subset2)

# Subset rows based on a condition
subset3 = df[df["Age"] > 28]  # Extracts rows where Age is greater than 28
print(subset3)

0    25
1    30
2    35
Name: Age, dtype: int64
50000
      Name  Age  Salary
1      Bob   30   55000
2  Charlie   35   60000


# Create a data frame
df = data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Salary = c(50000, 55000, 60000)
)

# Subset by column name
subset1 = df$Age  # Extracts the 'Age' column
print(subset1)

# Subset by row and column position
subset2 = df[1, 3]  # Extracts the value from the 1st row, 3rd column
print(subset2)

# Subset rows based on a condition
subset3 = df[df$Age > 28, ]  # Extracts rows where Age is greater than 28
print(subset3)

[1] 25 30 35



[1] 50000
     Name Age Salary
2     Bob  30  55000
3 Charlie  35  60000

Our next example looks at vectors.

import pandas as pd
numbers = pd.Series([10, 20, 30, 40, 50])
subset1 = numbers.iloc[1]
print(subset1)           
subset2 = numbers.iloc[[0, 2, 4]]
print(subset2)
subset3 = numbers[numbers > 25]
print(subset3)


numbers = c(10, 20, 30, 40, 50) # Create a vector
subset1 = numbers[2]
print(subset1)
subset2 = numbers[c(1, 3, 5)]
print(subset2)
subset3 = numbers[numbers > 25]
print(subset3)

[1] 20

[1] 10 30 50



[1] 30 40 50

Now let's look at matrices.

import numpy as np
matrix_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix_data)
subset1 = matrix_data[1, 2]      # Extracts the element at the 2nd row, 3rd column
print(subset1)                   
subset2 = matrix_data[0:2, 1:3]  # Extracts a submatrix
print(subset2)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


6

[[2 3]
 [5 6]]


matrix_data = matrix(1:9, nrow = 3, byrow = TRUE)
print(matrix_data)
subset1 = matrix_data[2, 3]      # Extracts the element at the 2nd row, 3rd column
print(subset1)
subset2 = matrix_data[1:2, 2:3]  # Extracts a submatrix
print(subset2)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

[1] 6

     [,1] [,2]
[1,]    2    3
[2,]    5    6

Our last example will focus on lists in R and dictionaries in Python.

my_dict = {"Name": "Alice", "Age": 25, "Scores": [90, 95, 85]}
subset1 = my_dict["Scores"]  # Extracts the 'Scores' element
print(subset1)
subset2 = my_dict["Age"]  # Extracts the 'Age' element
print(subset2)

[90, 95, 85]
25

my_list = list(Name = "Alice", Age = 25, Scores = c(90, 95, 85))
subset1 = my_list$Scores # Extracts the 'Scores' element
print(subset1)
subset2 = my_list[[2]]   # Extracts the 2nd element (Age)
print(subset2)

[1] 90 95 85
[1] 25

Subsetting - Step by Step

In the previous section we looked at the various data types and the basics of carving out specific data. Here we focus on data frames and step through the subsetting methods one at a time.

Subsetting Rows

# Using Row index:
df.iloc[0, :]         # First row
df.iloc[0:3, :]       # First three rows
df.iloc[-3:, :]       # Last three rows

# Using Logical Conditions:
df[df['column_name'] > 5]        # Rows where 'column_name' is greater than 5
df[df['column_name'] == 'value'] # Rows where 'column_name' equals "value"
df[df['column_name'].isin(['value1', 'value2'])]  # Rows where 'column_name' is in a list of values

# Multiple Conditions:
df[(df['column_name'] == 'value') & (df['column2'] == 'value')]

# Using query() Function:
df.query('column_name > 5')           # Rows where 'column_name' is greater than 5
df.query('column_name == "value"')    # Rows where 'column_name' equals "value"

# Using Row index:
df[1, ]         # First row
df[1:3, ]       # First three rows
df[-c(1, 2), ]  # Exclude first and second rows

# Using Logical Conditions:
df[df$column_name > 5, ]           # Rows where 'column_name' is greater than 5
df[df$column_name == "value", ]    # Rows where 'column_name' equals "value"


# Multiple Conditions:
df[df$column_name == "value" & df$column2 == "value", ]    # Rows where both columns equal "value"

# Using subset() Function:
subset(df, column_name > 5)        # Rows where 'column_name' is greater than 5
subset(df, column_name == "value") # Rows where 'column_name' equals "value"
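The row-subsetting templates above use placeholder column names, so they cannot be run as written. As a minimal runnable sketch, the example below applies the same pandas patterns to a small data frame whose contents (name, score, group) are hypothetical and invented for this illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal", "Dee", "Eve"],
    "score": [3, 8, 6, 9, 2],
    "group": ["a", "b", "a", "b", "a"],
})

first_three = df.iloc[0:3, :]                          # rows at positions 0, 1, 2
high = df[df["score"] > 5]                             # one logical condition
high_a = df[(df["score"] > 5) & (df["group"] == "a")]  # both conditions must hold
same = df.query("score > 5 and group == 'a'")          # same filter written with query()

print(first_three["name"].tolist())
print(high["name"].tolist())
print(high_a["name"].tolist())
print(high_a.equals(same))
```

The parentheses around each condition in the combined filter are required in pandas, because & binds more tightly than > and ==.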

Subsetting Columns

# Using Column Index:
df.iloc[:, 0]          # First column
df.iloc[:, 0:3]        # First three columns
df.iloc[:, -3:]        # Last three columns

# Using Column Names:
df['column_name']            # Single column by name
df[['col1', 'col2']]         # Multiple columns by name

# Using .loc[] for Label-Based Indexing:
df.loc[:, 'column_name']     # Single column by name
df.loc[:, ['col1', 'col2']]  # Multiple columns by name

# Using Column Index:
df[, 1]      # First column
df[, 1:3]    # First three columns
df[, -c(1, 2)]  # Exclude first and second columns

# Using Column Names:
df[, "column_name"]           # Single column by name
df[, c("col1", "col2")]       # Multiple columns by name

# Using $ for Single Columns
df$column_name                # Select single column (returns as a vector)

Subsetting Rows and Columns Together

# Using Indexing with .iloc[]:
df.iloc[0:3, 1:4]   # First three rows, second to fourth columns
df.iloc[:, [0, 2]]  # All rows, first and third columns

# Using Indexing with .loc[]:
df.loc[df['column_name'] > 5, ['col1', 'col2']]  # Rows where condition is true, and specific columns

# Using Indexing:
df[1:3, 2:4]                 # First three rows, second to fourth columns
df[df$column_name > 5, c("col1", "col2")]  # Rows where condition is true, and specific columns

# Using subset()
subset(df, column_name > 5, select = c("col1", "col2"))
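One detail that is easy to miss when combining row and column selection in pandas is that .iloc[] slices exclude the end position, while .loc[] slices include the end label. The sketch below illustrates this on a small data frame; the column contents are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "col1": [1, 7, 9, 4],
    "col2": ["w", "x", "y", "z"],
    "col3": [0.5, 1.5, 2.5, 3.5],
})

by_position = df.iloc[0:2, 0:2]            # positions 0 and 1 only (end excluded)
by_label = df.loc[0:2, ["col1", "col2"]]   # labels 0, 1 and 2 (end included)
filtered = df.loc[df["col1"] > 5, ["col1", "col2"]]  # condition on rows, names for columns

print(by_position.shape)
print(by_label.shape)
print(filtered)
```

R's 1:3 always includes both ends, so the .loc[] behavior is the one closer to what R users expect.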

Removing Missing Values (NA)

# Using .dropna()
df_clean = df.dropna()  # Remove rows with any NA values
df_clean = df.dropna(subset=['col1', 'col2'])  # Remove rows with NA in specific columns

# Using .notnull() or .isnull()
df_clean = df[df['column_name'].notnull()]  # Keep rows where 'column_name' is not NA
df_clean = df[df['column_name'].isnull()]   # Keep rows where 'column_name' is NA

# Using na.omit()
df_clean <- na.omit(df)        # Remove rows with any NA values


# Using complete.cases()
df_clean <- df[complete.cases(df), ]  # Keep rows without NA values
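To make the missing-value patterns concrete, here is a minimal runnable sketch in pandas. The data frame and its column names are hypothetical; the R functions mentioned in the comments are the ones listed above or in the statistics table later in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 4.0],
    "col2": ["a", "b", None, "d"],
})

any_na_dropped = df.dropna()                # compare: na.omit(df) in R
col1_complete = df.dropna(subset=["col1"])  # only col1 is required to be present
missing_per_column = df.isna().sum()        # compare: colSums(is.na(df)) in R

print(len(any_na_dropped))
print(len(col1_complete))
print(missing_per_column)
```

Counting the missing values per column before dropping anything is a cheap way to see how much data a call to dropna() would discard.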

Subsetting with specific criteria

# First or Last N Rows:
df.head(5)          # First 5 rows
df.tail(5)          # Last 5 rows

# Random Sampling
df.sample(10)       # Randomly select 10 rows
df.sample(frac=0.1) # Randomly select 10% of rows

# First or Last N Rows:
head(df, n = 5)          # First 5 rows
tail(df, n = 5)          # Last 5 rows

# Random Sampling (requires the dplyr package)
library(dplyr)
sample_n(df, 10)         # Randomly select 10 rows
sample_frac(df, 0.1)     # Randomly select 10% of rows
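Random samples differ from run to run unless a seed is fixed. As a minimal sketch, the pandas example below uses the random_state argument of sample() for this purpose; the data frame and the seed value 321 are arbitrary choices made for the illustration. In R, calling set.seed() before sampling serves the same purpose.

```python
import pandas as pd

# Hypothetical data: 100 rows with a single column
df = pd.DataFrame({"value": range(100)})

# random_state fixes the seed, so the same rows are drawn on every run
sample_a = df.sample(n=10, random_state=321)
sample_b = df.sample(n=10, random_state=321)
tenth = df.sample(frac=0.1, random_state=321)

print(sample_a.equals(sample_b))  # identical draws because the seed is the same
print(len(tenth))                 # 10% of 100 rows
```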

Filtering with pandas (Python) or dplyr (R)

These examples may be a bit more complicated, just keep the ideas in the back of your mind for now.

# Using .loc[] and Conditional Filtering:

df.loc[df['column_name'] > 5]              # Rows where 'column_name' is greater than 5
df.loc[(df['col1'] > 5) & (df['col2'] < 10)]  # Rows matching multiple conditions

# Selecting Columns with .loc[]:
df.loc[:, ['col1', 'col2']]   # Select specific columns
df.loc[:, df.columns != 'col1']  # Exclude a specific column

# Combining Row and Column Subsetting:
df.loc[df['column_name'] > 5, ['col1', 'col2']]  # Filter rows and select specific columns

# Using .groupby() and .filter():
df.groupby('column_name').filter(lambda x: x['col2'].mean() > 50)  # Groups where the mean of 'col2' > 50

# Using .groupby() and .get_group():
df.groupby('column_name').get_group('value')  # Get group where 'column_name' equals "value"

# Using filter()
library(dplyr)
df %>% filter(column_name > 5)
df %>% filter(column_name == "value")

# Selecting Columns with select()
df %>% select(col1, col2)
df %>% select(-col1)    # Exclude a column

# Combining filter() and select()
df %>% filter(column_name > 5) %>% select(col1, col2)
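The groupby() patterns are the least obvious ones in this section, so here is a small runnable sketch of them. The team and points columns are hypothetical and invented for this illustration.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "green"],
    "points": [40, 70, 20, 30, 90],
})

# Keep every row that belongs to a team whose mean points exceed 50.
# The lambda receives one sub-data-frame per team and returns True or False.
strong_teams = df.groupby("team").filter(lambda g: g["points"].mean() > 50)

# Pull out the rows of one particular group
blue_rows = df.groupby("team").get_group("blue")

print(strong_teams)
print(blue_rows)
```

Unlike an ordinary row filter, filter() on a grouped data frame keeps or drops whole groups: both red rows survive here even though one of them has fewer than 50 points, because the red mean is 55.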

Subsetting - Another set of examples focusing on R

Now let's focus on R and walk through an example. The R output is shown after the code.

# Create a Vector
x = c(1, 3, 4, 5)

# Retrieve the 4th Element (remember indexing starts at 1 in R)
x[4]

# Use a boolean expression to subset the data
x[ x > 2 ]

# Create a Matrix, no ncol is specified so it will be computed
mat = matrix(1:6, nrow = 2)
mat  # Print the matrix



mat[1:2, 1]

mat[1, 1:2]

mat[1, c(FALSE, TRUE, FALSE)]

# If nothing specified for one dimension,
# all values will be in the result
mat[1:2, ]


# Create a data frame
df = data.frame(
  age = c(25, 30, 32, 42),
  handed = c("left", "right", "ambidextrous", "left")
)
df  # Print the data frame

# Show the first row, all columns
df[1,]

# All rows, column age
df[, "age"]


# Using $ to select a column
df$age

# Select all columns for rows with age > 27
df[df$age > 27, ]



# We can "attach" a data frame, which makes each of its columns available as its own variable
attach(df)
age

handed





[1] 5


[1] 3 4 5


     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

[1] 1 2

[1] 1 3

[1] 3



     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

  age       handed
1  25         left
2  30        right
3  32 ambidextrous
4  42         left

  age handed
1  25   left



[1] 25 30 32 42



[1] 25 30 32 42


  age       handed
2  30        right
3  32 ambidextrous
4  42         left


[1] 25 30 32 42

[1] "left"         "right"        "ambidextrous" "left"

A preview of a possible future topic: Statistics Functions

Below is a quick table of basic statistical functions in Python (pandas) and R that may be useful as a simple reference.

Statistical Method	Python (pandas)	R
Mean	`df['column'].mean()` `df['column'].mean(skipna=True)`	`mean(x)` `mean(x, na.rm = TRUE)`
Median	`df['column'].median()`	`median(x)`
Mode	`df['column'].mode()`	`mode_function <- function(x) { unique_x <- unique(x); unique_x[which.max(tabulate(match(x, unique_x)))] }`
Standard Deviation	`df['column'].std()` `df['column'].std(skipna=True)`	`sd(x)` `sd(x, na.rm = TRUE)`
Variance	`df['column'].var()` `df['column'].var(skipna=True)`	`var(x)` `var(x, na.rm = TRUE)`
Range (Minimum and Maximum)	`df['column'].min()` `df['column'].max()`	`min(x)` `max(x)`
Quantiles	`df['column'].quantile([0.25, 0.5, 0.75])` `df['column'].quantile(0.1)`	`quantile(x)` `quantile(x, probs = c(0.1, 0.9))`
Summary Statistics	`df['column'].describe()` `df.describe()`	`summary(x)` `summary(df)`
Correlation (Pearson by default)	`df['column1'].corr(df['column2'])` `df.corr()` `df.corr(method='spearman')`	`cor(x, y)` `cor(df)` `cor(x, y, method = "spearman")`
Covariance	`df['column1'].cov(df['column2'])` `df.cov()`	`cov(x, y)` `cov(df)`
Frequency Table	`df['column'].value_counts()`	`table(x)`
Proportions	`df['column'].value_counts(normalize=True)`	`prop.table(table(x))`
Cross Tabulation	`pd.crosstab(df['col1'], df['col2'])`	`table(df$col1, df$col2)`
Summary of Data Frame	`df.info()`	`str(df)`
Count Missing Values	`df['column'].isna().sum()` `df.isna().sum()`	`sum(is.na(x))` `colSums(is.na(df))`
Random Sample	`df.sample(n=10)` `df.sample(frac=0.1)`	`sample(x, size = 10)` `sample(x, size = 10, replace = TRUE)`
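As a quick way to try a few entries from the table, the sketch below runs them on a tiny data frame. The columns x and y and their values are hypothetical, chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data, small enough to verify by hand
df = pd.DataFrame({"x": [2, 4, 4, 6, 9], "y": [1, 3, 2, 5, 8]})

print(df["x"].mean())            # 5.0
print(df["x"].median())          # 4.0
print(df["x"].mode().tolist())   # [4]
print(df["x"].var())             # sample variance: sum of squared deviations / (n - 1)
print(df["x"].std())             # square root of the sample variance
print(df["x"].corr(df["y"]))     # Pearson correlation between x and y
```

pandas divides by n - 1 in var() and std(), which matches var() and sd() in R. NumPy's own np.var() and np.std() divide by n by default, so the two libraries can give different answers for the same data.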

Practice Problems

Consider the vector: num = c("1.5", "2", "8.4")
- Convert num into a numeric vector
- Convert num into a factor using factor, calling it num_fac.
- Convert num_fac into a numeric vector using as.numeric. Call it n2. What is the result?
- Convert num_fac into a string vector. Call it n3.
- Convert num_fac into a numeric vector using as.numeric(as.character()). Call it n4

Practice Problems

At the bottom of most class notes are practice problems that you should try. See if you can complete the following:

In both Python and R, pull out the column MyColumn from the dataframe MyDataframe.
In both Python and R, retrieve the $5^{th}$ row of the dataframe MyDataframe.
In both Python and R, retrieve all rows of dataframe MyDataframe where column MyColumn has value MyValue.