Sunday, September 30, 2012

Part2: R-Language Basics

 Data Types and Basic Operations:

1. Objects:
   In every computer language, variables provide a means of accessing the data stored in memory. R does not provide direct access to the computer’s memory; instead it provides a number of specialized data structures, which we will refer to as objects. These objects are referred to through symbols or variables.

R has five basic or atomic classes of objects:


  1. character
  2. numeric (real numbers)
  3. integer
  4. complex
  5. logical (TRUE or FALSE)
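
As a quick check of the five classes listed above, here is a small sketch that creates one value of each and asks R for its class. The variable names (chr, num, and so on) are made up for this illustration.

```r
# One value of each atomic class (variable names are illustrative only).
chr <- "hello"   # character
num <- 3.14      # numeric (double precision)
int <- 42L       # integer (the L suffix requests an integer)
cpx <- 1 + 2i    # complex
lgl <- TRUE      # logical

# class() reports the class name of each object.
classes <- c(class(chr), class(num), class(int), class(cpx), class(lgl))
print(classes)
```
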
2. Basic types:

2.1 Vectors:
The most basic object in R is the vector. A vector can only contain objects of the same class. The one exception is a 'list', which is represented as a vector but can contain objects of different classes. Empty vectors can be created with the vector() function.

 2.2 Lists:
Lists are another kind of data storage. Lists have elements, each of which can contain any type of R object, i.e. the elements of a list do not have to be of the same type.


 2.3 Numbers:
Numbers in R are generally treated as numeric objects (i.e. double precision real numbers). If you explicitly want an integer, you need to specify the L suffix; e.g. 1L. There is also a special number Inf, which represents infinity; e.g. 1 / 0 is Inf. Inf can be used in ordinary calculations; e.g. 1 / Inf is 0. The value NaN (not a number) represents an undefined value; e.g. 0 / 0. NaN can also be thought of as a missing value.
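
The points above can be tried out directly. This is only a small sketch; the variable names are chosen for the illustration.

```r
x <- 1     # numeric (double) by default
y <- 1L    # integer, because of the L suffix
print(class(x))
print(class(y))

pos_inf   <- 1 / 0     # division by zero gives Inf
zero      <- 1 / Inf   # Inf can be used in ordinary calculations
undefined <- 0 / 0     # an undefined result gives NaN

print(is.infinite(pos_inf))
print(zero)
print(is.nan(undefined))
```
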

2.4 Attributes
All objects except NULL can have one or more attributes attached to them. Attributes are stored as a pairlist where all elements are named, but should be thought of as a set of name=value pairs. Attributes are used to implement the class structure used in R.
R objects can have attributes such as:
  • names, dimnames
  • dimensions (e.g. matrices, arrays)
  • class
  • length
  • other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function.
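
Here is a small sketch of attributes() in action. The "source" attribute and its value are hypothetical, added only to show what a user-defined attribute looks like.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

# Attach a user-defined attribute (the name "source" is made up for this example).
attr(m, "source") <- "example data"

# attributes() returns all attributes as a named list.
a <- attributes(m)
print(names(a))
print(a$dim)
```

The dim attribute is what makes m a matrix; the extra attribute is simply carried along as metadata.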

Do It Yourself:


Entering inputs in R: At the R prompt we type expressions. The <- symbol is the assignment operator. To assign a value to x and then print it, type the commands below at your R prompt:

> x <- 1
> print(x)
[1] 1
> x
[1] 1



The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored. For example:

> msg <- "hello world" ## hello world
> msg
[1] "hello world"


When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.

> x <- 5  ## nothing printed
> x       ## auto printing occurs
[1] 5
> print(x)   ## print explicitly
[1] 5

The [1] indicates that x is a vector and 5 is the first element.

The : operator is used to create integer sequences. Typing 1:100 at the prompt prints the integers from 1 to 100. Check it out:

> 1:100   ## numbers 1 to 100
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

The bracketed numbers [1] to [91] at the start of each row give the position of the first element printed in that row.

Creating vectors:
The c() function can be used to create vectors of objects; c is short for concatenate.



> x <- c(0.5, 0.6)      ## numeric
> x <- c(TRUE, FALSE)   ## logical
> x <- c(T, F)          ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29             ## integer
> x <- c(1+0i, 2+4i)    ## complex


When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. For example:

> y <- c(1.7, "a")   ## character
> y <- c(TRUE, 2)    ## numeric
> y <- c("a", TRUE)  ## character
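
To see what the coercion actually does to the values, the sketch below repeats the three cases and inspects the results. The names y1, y2 and y3 are used here only so that all three vectors can be kept side by side.

```r
y1 <- c(1.7, "a")    # the number becomes the string "1.7"
y2 <- c(TRUE, 2)     # TRUE becomes the number 1
y3 <- c("a", TRUE)   # TRUE becomes the string "TRUE"

print(class(y1)); print(y1)
print(class(y2)); print(y2)
print(class(y3)); print(y3)
```
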



Using the vector() function we can also create vectors of a certain type and length. For example:


> x <- vector("numeric", length = 10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

Objects can be explicitly coerced from one class to another using the as.* functions, if available. For example, the class of x below is integer, but we can explicitly coerce it to numeric using as.numeric(x). Examples:


> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
> as.complex(x)
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i
Nonsensical coercion, however, results in NAs.

> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA

 Matrix:

Matrices are a special type of vector with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol). Matrices are constructed column-wise, so entries can be thought of as starting in the “upper left” corner and running down the columns.



> m <- matrix(1:6, nrow = 2, ncol =3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
Matrices can also be created directly from vectors by adding a dimension attribute.


> m <- 1:6
> m
[1] 1 2 3 4 5 6
> dim(m) <- c(2,3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Matrices can be created by column-binding or row-binding with cbind() and rbind().

> x <- 1:3
> y <- 10:12
> cbind(x, y) ## column-binding
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

 Lists:

Lists are a very important data type in R; a list can contain elements of different classes.


> x <- list(1, "a", FALSE, 1 + 2i) ## elements of different classes
> x ## auto printing
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] FALSE

[[4]]
[1] 1+2i

Factors:

Factors are a special type of vector used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label. Factors are used to describe items that can have a finite number of values (gender, social class, etc.).
  • Factors are treated specially by modelling functions like lm() and glm()
  • Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is more descriptive  than a variable that has values 1 and 2.
Factors can be created using the factor() function:


> x <- factor(c("yes", "yes", "no", "yes", "no")) ## char vectors
> x
[1] yes yes no  yes no
Levels: no yes
> table(x)    ## counts how many times each level occurs
x
 no yes
  2   3
> unclass(x) ## underlying integer codes: no = 1, yes = 2
[1] 2 2 1 2 1
attr(,"levels")
[1] "no"  "yes"
The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.

> x <- factor(c("yes", "yes", "no", "yes", "no"), levels =c("yes", "no"))
> x
[1] yes yes no  yes no
Levels: yes no
In the above example, yes is the 1st level and no is the 2nd level.
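
The factors so far are unordered. As a sketch of the ordered kind mentioned earlier, factor() also accepts ordered = TRUE; the size labels below are invented for the example.

```r
# An ordered factor: the order of 'levels' defines small < medium < large.
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

print(sizes)
print(is.ordered(sizes))

# Comparisons respect the level order.
smaller <- sizes < "large"
print(smaller)

# The underlying integer codes follow the level order too.
codes <- as.integer(sizes)
print(codes)
```
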

NA and NaN: 

Missing values are denoted by NA; NaN denotes the result of undefined mathematical operations.
  • is.na() is used to test objects if they are NA
  • is.nan() is used to test for NaN
NA values have a class also, so there are integer NA, character NA, etc. A NaN value is also NA, but the converse is not true.
For example:

> x <- c( 1, NA, NA, 4, 5)
> is.na(x)
[1] FALSE  TRUE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c( 1, NA, NaN, 4, 5)
> is.na(x)
[1] FALSE  TRUE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
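
The claim that NA values have a class can be checked with R's typed NA constants. This is a small sketch; the variable names are chosen just for this illustration.

```r
# A plain NA is logical; the typed constants carry other classes.
na_classes <- c(class(NA),
                class(NA_integer_),
                class(NA_real_),
                class(NA_character_))
print(na_classes)

# NaN counts as NA, but NA does not count as NaN.
nan_is_na <- is.na(NaN)
na_is_nan <- is.nan(NA)
print(nan_is_na)
print(na_is_nan)
```
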

  Removing NA Values

A common task is to remove missing values (NAs).

> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> x[!bad]
[1] 1 2 4 5
What if there are multiple objects and you want to take the subset with no missing values?

> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y)
> good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"
A more general example of removing NA values, this time from a data frame:

> airquality[1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> good <- complete.cases(airquality)
> airquality[good, ][1:6, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8

 

Data frames:

Data frames are used to store tabular data. They are represented as a special type of list where every element of the list has to have the same length. Each element of the list can be thought of as a column, and the length of each element of the list is the number of rows. Unlike matrices, data frames can store different classes of objects in each column (just like lists). Data frames also have a special attribute called row.names. Data frames are usually created by calling read.table() or read.csv(), and can be converted to a matrix by calling data.matrix().

> x <- data.frame(foo = 1:4, bar = c(T, T, F, T))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4  TRUE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
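
To connect this with the description above, the sketch below checks that a data frame really is a list of columns and then converts it with data.matrix(). It reuses the same foo/bar data frame.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, TRUE))

# A data frame is a list whose elements are the columns.
print(is.list(x))
col_classes <- sapply(x, class)
print(col_classes)

# data.matrix() turns it into a matrix; the logical column is converted to integers.
m <- data.matrix(x)
print(m)
print(dim(m))
```

Because a matrix can only hold one class, the TRUE/FALSE values in bar show up as 1 and 0 after the conversion.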


Names:

R objects can have names, which is very useful for writing readable code and self-describing objects.

> x <- 1:3
> names(x)
NULL
> names(x) <- c("first", "second", "third")
> x
 first second  third
     1      2      3
> names(x)
[1] "first"  "second" "third"
  Lists can also have names.

> x <- list(a = 1, b = 2, c = 3) ## a,b,c are the names
> x
$a
[1] 1
$b
[1] 2
$c
[1] 3
And matrices.

> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4
In the above example the values 1 to 4 fill the matrix column by column. The function dimnames() assigns names to the rows and columns; here a and b are the row names and c and d are the column names.


Subsetting:

There are a number of operators that can be used to extract subsets of R objects.
  1. [ always returns an object of the same class as the original; it can be used to select more than one element (there is one exception).
  2. [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element, and the class of the returned object will not necessarily be a list or data frame.
  3. $ is used to extract elements of a list or data frame by name; its semantics are similar to that of [[.
Examples for subsetting:

> x <- c("a", "b", "c", "c", "d", "a")
> x[1]
[1] "a"
> x[2]
[1] "b"
> x[1:4]
[1] "a" "b" "c" "c"
> x[x > "a"]
[1] "b" "c" "c" "d"
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
> x[u]
[1] "b" "c" "c" "d"

Subsetting a Matrix:

Matrices can be subsetted in the usual way with (i, j) type indices.
> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2

Indices can also be missing.

> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4
By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 × 1 matrix. This behavior can be turned off by setting drop = FALSE.

> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[1, 2, drop = FALSE]
     [,1]
[1,]    3
Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).

> x <- matrix(1:6, 2, 3)
> x[1, ]
[1] 1 3 5
> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5


List subsetting:


> x <- list(jas = 1:4, kp = 0.6)
> x[1]
$jas
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x$kp
[1] 0.6
> x[["kp"]]
[1] 0.6
> x["kp"]
$kp
[1] 0.6
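
The difference between [ and [[ is easiest to see by checking the class of what comes back. A small sketch using the same list:

```r
x <- list(jas = 1:4, kp = 0.6)

single  <- class(x[1])     # [ keeps the container, so the result is still a list
double  <- class(x[[1]])   # [[ returns the element itself, here an integer vector
dollar  <- class(x$kp)     # $ also returns the element itself

print(single)
print(double)
print(dollar)
```
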

Extracting multiple elements of a list.

> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"
The [[ operator can be used with computed indices; $ can only be used with literal names.

> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> name <- "foo"

> x[[name]]  ## computed index for ‘foo’
[1] 1 2 3 4
> x$name     ## element ‘name’ doesn’t exist!
NULL
> x$foo      ## element ‘foo’ does exist
[1] 1 2 3 4

The [[ operator can take an integer sequence, which extracts elements from nested lists recursively.

> x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
> x[[c(1, 3)]]
[1] 14
> x[[1]][[3]]
[1] 14
> x[[c(2, 1)]]
[1] 3.14


Partial Matching of Name:

Partial matching of names is allowed with [[ and $.
> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]]
[1] 1 2 3 4 5


Vectorized operations:

 Many operations in R are vectorized, making code more efficient, concise, and easier to read. Once you have a vector of numbers in memory, most basic operations are available. They act on the whole vector at once, so a large number of calculations can be performed with a single command.
 Here we first define the vectors "x" and "y", then look at how to add, subtract, multiply, and divide them, and how to apply logical comparisons to their elements.


> x <- 1:4; y <- 6:9
> x + y
[1]  7  9 11 13
> x - y
[1] -5 -5 -5 -5
> x > 2
[1] FALSE FALSE  TRUE  TRUE
> x >= 2
[1] FALSE  TRUE  TRUE  TRUE
> y == 8
[1] FALSE FALSE  TRUE FALSE
> x * y
[1]  6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
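
For comparison, here is roughly what the addition would look like without vectorization. This is only a sketch to show what the single expression x + y saves you from writing; z_loop and z_vec are names made up for the example.

```r
x <- 1:4; y <- 6:9

# Element-by-element version with an explicit loop.
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, same values.
z_vec <- x + y

print(z_loop)
print(z_vec)
print(all(z_loop == z_vec))
```
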

Vectorized Matrix Operations


> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
> x * y ## element-wise multiplication
     [,1] [,2]
[1,]   10   30
[2,]   20   40
> x / y
     [,1] [,2]
[1,]  0.1  0.3
[2,]  0.2  0.4
> x %*% y ## true matrix multiplication
     [,1] [,2]
[1,]   40   40
[2,]   60   60



 
 
