Data Types and Basic Operations:
1. Objects:
In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory; instead it provides a number of specialized data structures that we will refer to as objects. These objects are referred to through symbols or variables.
R has five basic or atomic classes of objects:
- character
- numeric (real numbers)
- integer
- complex
- logical (TRUE or FALSE)
2.1 Vectors:
The most basic object in R is the vector. A vector can only contain objects of the same class. The exception is a list, which is represented as a vector but can contain objects of different classes. Empty vectors can be created with the vector() function.
2.2 Lists:
Lists are another kind of data storage. Lists have elements, each of which can contain any type of R object, i.e. the elements of a list do not have to be of the same type.
2.3 Numbers:
Numbers in R are generally treated as numeric objects (i.e. double-precision real numbers). If you explicitly want an integer, you need to add the L suffix; e.g. 1L. There is also a special number Inf, which represents infinity; e.g. 1 / 0 evaluates to Inf. Inf can be used in ordinary calculations; e.g. 1 / Inf is 0. The value NaN (not a number) represents an undefined value; e.g. 0 / 0. NaN can also be thought of as a missing value.
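The behaviour described above can be checked directly at the prompt. The following is a minimal sketch; the variable names are illustrative only and are not part of the original text.

```r
# Whole-number literals are doubles unless the L suffix is used
num_class <- class(1)    # "numeric"
int_class <- class(1L)   # "integer"

# Inf takes part in ordinary arithmetic
pos_inf <- 1 / 0         # Inf
zero    <- 1 / Inf       # 0

# NaN marks an undefined result
undefined <- 0 / 0       # NaN
is.nan(undefined)        # TRUE
```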
2.4 Attributes
All objects except NULL can have one or more attributes attached to them. Attributes are stored as a pairlist where all elements are named, but should be thought of as a set of name=value pairs. Attributes are used to implement the class structure used in R.
R objects can have attributes such as:
- names, dimnames
- dimensions (e.g. matrices, arrays)
- class
- length
- other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function.
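As a small illustration of attributes(), the sketch below builds a matrix, attaches one user-defined attribute, and lists everything attached to the object. The attribute name "note" and its value are hypothetical and chosen only for this example.

```r
m <- matrix(1:6, nrow = 2)          # a matrix carries a "dim" attribute
attr(m, "note") <- "toy example"    # hypothetical user-defined attribute

a <- attributes(m)                  # a named list of all attributes
names(a)                            # "dim" "note"
a$dim                               # 2 3
```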
Do It Yourself:
Entering inputs in R: At the R prompt we type expressions. The <- symbol is the assignment operator. To assign a value to x and then print it, type the following commands at the R prompt:
> x <- 1
> print(x)
[1] 1
> x
[1] 1
The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored. For example:
> msg <- "hello world" ## hello world
> msg
[1] "hello world"
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.
> x <- 5 ## nothing printed
> x ## auto printing occurs
[1] 5
> print(x) ## print explicitly
[1] 5
The [1] indicates that x is a vector and 5 is the first element.
The : operator is used to create integer sequences. Typing 1:100 at the prompt prints the integers from 1 to 100:
> 1:100 ## numbers 1 to 100
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
The numbers in brackets, [1] to [91], give the position of the first element in each row of output.
Creating vectors:
The c() function can be used to create vectors of objects; c is short for concatenate.
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. For example:
> y <- c(1.7, "a") ## character
> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character
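To see what implicit coercion actually produces, one can inspect the class and the values of each result. This is a minimal sketch; the names y1, y2, and y3 are illustrative only.

```r
y1 <- c(1.7, "a")    # number is converted to the string "1.7"
y2 <- c(TRUE, 2)     # TRUE is converted to the number 1
y3 <- c("a", TRUE)   # TRUE is converted to the string "TRUE"

class(y1)   # "character"
class(y2)   # "numeric"
class(y3)   # "character"
```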
The vector() function can also be used to create vectors of a given type and length. For example:
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0
Objects can be explicitly coerced from one class to another using the as.* functions, if available. For example, x below has class integer, but it can be coerced to numeric with as.numeric(x):
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
> as.complex(x)
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i
But nonsensical coercion results in NAs.
> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA
Matrix:
Matrices are a special type of vector with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol). Matrices are constructed column-wise, so entries can be thought of as starting in the “upper left” corner and running down the columns.
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
Matrices can also be created directly from a vector by adding a dimension attribute.
> m <- 1:6
> m
[1] 1 2 3 4 5 6
> dim(m) <- c(2,3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Matrices can also be created by column-binding or row-binding with cbind() and rbind().
> x <- 1:3
> y <- 10:12
> cbind(x, y) ## column-binding
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y) ## row-binding
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
Lists:
Lists are a very important data type in R; they can contain elements of different classes.
> x <- list(1, "a", FALSE, 1 + 2i) ## elements of different classes
> x ## auto printing
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] FALSE
[[4]]
[1] 1+2i
Factors:
Factors are a special type of vector used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label. Factors are used to describe items that can take a finite number of values (gender, social class, etc.).
- Factors are treated specially by modelling functions like lm() and glm()
- Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is more descriptive than a variable that has values 1 and 2.
> x <- factor(c("yes", "yes", "no", "yes", "no")) ## factor built from a character vector
> x
[1] yes yes no yes no
Levels: no yes
> table(x) ## counts of each level
x
 no yes
  2   3
> unclass(x) ## underlying integer codes: no = 1, yes = 2
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
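The text mentions that factors can also be ordered. A minimal sketch of an ordered factor follows, assuming hypothetical levels "low", "medium", and "high" that are not part of the original example; the levels are passed explicitly so that their order is meaningful.

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

is_ord <- is.ordered(sizes)     # TRUE
below  <- sizes < "high"        # comparisons respect the level order
codes  <- as.integer(sizes)     # underlying codes: 1 3 2 1
```

With an unordered factor the comparison sizes < "high" would not be meaningful; ordering is what makes it valid.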
The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.
> x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
> x
[1] yes yes no yes no
Levels: yes no
In the above example, yes is the first level and no is the second level.
NA and NaN:
Missing values are denoted by NA, and NaN denotes the result of an undefined mathematical operation.
- is.na() is used to test objects if they are NA
- is.nan() is used to test for NaN
NA values have a class also, so there are integer NA, character NA, etc. A NaN value is also NA, but the converse is not true.
For example:
> x <- c( 1, NA, NA, 4, 5)
> is.na(x)
[1] FALSE TRUE TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c( 1, NA, NaN, 4, 5)
> is.na(x)
[1] FALSE TRUE TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE TRUE FALSE FALSE
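The statement that NA values have a class can be illustrated with the typed NA constants that base R provides. This is a small sketch; the variable names are illustrative only.

```r
cls_plain <- class(NA)              # "logical" (the default NA)
cls_int   <- class(NA_integer_)     # "integer"
cls_real  <- class(NA_real_)        # "numeric"
cls_chr   <- class(NA_character_)   # "character"

# NaN counts as NA, but NA is not NaN
nan_is_na <- is.na(NaN)             # TRUE
na_is_nan <- is.nan(NA)             # FALSE
```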
Removing NA Values:
A common task is to remove missing values (NAs).
> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> x[!bad]
[1] 1 2 4 5
What if there are multiple objects and you want the subset where none of them has a missing value?
> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y)
> good
[1] TRUE TRUE FALSE TRUE FALSE TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"
General Example for removing NA values:
> airquality[1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> good <- complete.cases(airquality)
> airquality[good, ][1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8
Data frames:
Data frames are used to store tabular data. They are represented as a special type of list where every element of the list has to have the same length. Each element of the list can be thought of as a column, and the length of each element is the number of rows. Unlike matrices, data frames can store different classes of objects in each column (just like lists). Data frames also have a special attribute called row.names. Data frames are usually created by calling read.table() or read.csv(), and they can be converted to a matrix by calling data.matrix().
> x <- data.frame(foo = 1:4, bar = c(T, T, F, T))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4  TRUE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
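The row.names attribute and the conversion with data.matrix() mentioned above can be tried on the same small data frame. This is a sketch only; the names df and mat are illustrative, and the data frame is rebuilt here so the block is self-contained.

```r
df <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, TRUE))

rn   <- row.names(df)       # "1" "2" "3" "4" (automatic row names)
cols <- sapply(df, class)   # each column keeps its own class

# data.matrix() forces every column to a numeric type,
# so the logical column becomes 1/0
mat <- data.matrix(df)
is.matrix(mat)              # TRUE
mat[, "bar"]                # 1 1 0 1
```

Because a matrix can hold only one type, the per-column classes of the data frame are lost in the conversion.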
Names:
R objects can have names, which is very useful for writing readable code and self-describing objects.
> x <- 1:3
> names(x)
NULL
> names(x) <- c("first", "second", "third")
> x
 first second  third
     1      2      3
> names(x)
[1] "first" "second" "third"
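One reason names make code more readable is that elements can then be selected by name instead of by position. A minimal sketch, reusing the same vector (the variable names second and picked are illustrative):

```r
x <- 1:3
names(x) <- c("first", "second", "third")

second <- x["second"]              # select one element by name
picked <- x[c("third", "first")]   # select several, in the order requested
unname(second)                     # 2
```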
Lists can also have names.
> x <- list(a = 1, b = 2, c = 3) ## a,b,c are the names
> x
$a
[1] 1
$b
[1] 2
$c
[1] 3
Matrices can have names too.
> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4
In the above example, the sequence 1 to 4 fills the matrix column by column. The function dimnames() assigns names to the rows and columns: a and b are the row names, and c and d are the column names.
Subsetting:
There are a number of operators that can be used to extract subsets of R objects.
- [ always returns an object of the same class as the original; it can be used to select more than one element (there is one exception).
- [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element, and the class of the returned object will not necessarily be a list or data frame.
- $ is used to extract elements of a list or data frame by name; its semantics are similar to that of [[.
> x <- c("a", "b", "c", "c", "d", "a")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:4]
[1] "a" "b" "c" "c"
> x[x > "a"]
[1] "b" "c" "c" "d"
> u <- x > "a"
> u
[1] FALSE TRUE TRUE TRUE TRUE FALSE
> x[u]
[1] "b" "c" "c" "d"
Subsetting a Matrix:
Matrices can be subsetted in the usual way with (i, j) type indices.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2
Indices can also be missing.
> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4
By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 × 1 matrix. This behavior can be turned off by setting drop = FALSE.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[1, 2, drop = FALSE]
     [,1]
[1,]    3
Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).
> x <- matrix(1:6, 2, 3)
> x[1, ]
[1] 1 3 5
> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5
List subsetting:
> x <- list(jas = 1:4, kp = 0.6)
> x[1]
$jas
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x$kp
[1] 0.6
> x[["kp"]]
[1] 0.6
> x["kp"]
$kp
[1] 0.6
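The difference between the operators shows up clearly when the class of each result is inspected. The sketch below reuses the same list; the variable names on the left are illustrative only.

```r
x <- list(jas = 1:4, kp = 0.6)

single_idx  <- class(x[1])      # "list": [ keeps the container
double_idx  <- class(x[[1]])    # "integer": [[ returns the element itself
single_name <- class(x["kp"])   # "list"
dollar_name <- class(x$kp)      # "numeric"
```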
Extracting multiple elements of a list.
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"
The [[ operator can be used with computed indices; $ can only be used with literal names.
> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> name <- "foo"
> x[[name]] ## computed index for ‘foo’
[1] 1 2 3 4
> x$name ## element ‘name’ doesn’t exist!
NULL
> x$foo ## element ‘foo’ does exist
[1] 1 2 3 4
The [[ operator can take an integer sequence, which indexes recursively into nested elements.
> x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
> x[[c(1, 3)]]
[1] 14
> x[[1]][[3]]
[1] 14
> x[[c(2, 1)]]
[1] 3.14
Partial Matching of Names:
Partial matching of names is allowed with [[ and $. With [[, it must be requested explicitly by setting exact = FALSE.
> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 1 2 3 4 5
Vectorized operations:
Here we first define the vectors x and y, and then look at how to add, subtract, multiply, and divide them element by element, and how to apply logical comparisons to their elements.
> x <- 1:4; y <- 6:9
> x + y
[1] 7 9 11 13
> x - y
[1] -5 -5 -5 -5
> x > 2
[1] FALSE FALSE TRUE TRUE
> x >= 2
[1] FALSE TRUE TRUE TRUE
> y == 8
[1] FALSE FALSE TRUE FALSE
> x * y
[1] 6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
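To make the term "vectorized" concrete, the sketch below compares x + y with an explicit element-by-element loop. The loop is written only for comparison and the names z_loop and z_vec are illustrative; both approaches give the same values.

```r
x <- 1:4; y <- 6:9

# Explicit loop: add the vectors one element at a time
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized form: a single expression over whole vectors
z_vec <- x + y

all(z_loop == z_vec)   # TRUE
```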
Vectorized Matrix Operations:
> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> x * y ## element-wise multiplication
     [,1] [,2]
[1,]   10   30
[2,]   20   40
> x / y
     [,1] [,2]
[1,]  0.1  0.3
[2,]  0.2  0.4
> x %*% y ## true matrix multiplication
     [,1] [,2]
[1,]   40   40
[2,]   60   60