R Data Structures for Machine Learning


Every programming language has its own data structures. R also has a set of predefined data structures, each useful for specific purposes. For machine learning in R, we normally use vectors, lists, data frames, factors, arrays and matrices. In this post I will explain some of them briefly.

Vector – c()

A vector stores an ordered set of values, all of the same data type. A vector can be of type integer (numbers without decimals), double (numbers with decimals), character (text data), or logical (TRUE or FALSE values).


We use the c() function to define a vector. Note that R is case-sensitive, so the function name is a lowercase c. Here, a vector named Subject_name stores people's names, so it contains character values.
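A minimal sketch of such a definition is shown here. The three names are hypothetical values chosen only for illustration.

```r
# Define a character vector with c().
# The names are hypothetical example values.
Subject_name <- c("John", "Jane", "Steve")

# Printing the vector shows its elements in the order they were stored.
print(Subject_name)
```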

We can use the typeof() function to determine the type of a vector. For Subject_name, the output is "character".
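As a small sketch, assuming the same hypothetical names as example data, the type check looks like this:

```r
# Hypothetical example data
Subject_name <- c("John", "Jane", "Steve")

# typeof() reports the storage type of the vector
name_type <- typeof(Subject_name)
print(name_type)  # "character"
```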

Next, we create another vector that stores people's ages.

The Age vector stores whole-number values. Note that R stores a number typed as 23 as a double; writing it with the L suffix (23L) makes it an integer. We then create a third vector to store logical (Boolean) information about whether each person is married or single.

Calling typeof() on that vector returns "logical".
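The following sketch illustrates both vectors. The ages, the marital statuses, and the vector name Married are assumptions made for illustration.

```r
# Hypothetical ages; the L suffix makes R store them as integers
Age <- c(23L, 41L, 32L)

# Hypothetical marital status (TRUE = married, FALSE = single)
Married <- c(TRUE, FALSE, TRUE)

typeof(Age)           # "integer"
typeof(c(23, 41, 32)) # "double": no L suffix, so R stores doubles
typeof(Married)       # "logical"
```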

We can select specific elements of a vector by position. For example, to extract the second name in the Subject_name vector, we write Subject_name[2], which returns the second name stored in the vector.

We can also fetch a range of values from a vector with the colon operator. For example, to fetch the ages of the second and third people stored in the Age vector, we write Age[2:3], which returns those two ages.
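Both kinds of selection can be sketched as follows, again with hypothetical example values. Keep in mind that R counts positions from 1, not 0.

```r
# Hypothetical example data
Subject_name <- c("John", "Jane", "Steve")
Age <- c(23L, 41L, 32L)

# Single element: the second name
second_name <- Subject_name[2]
print(second_name)  # "Jane"

# Range of elements: ages of the second and third person
middle_ages <- Age[2:3]
print(middle_ages)  # 41 32
```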

Factor – factor()

A factor is a special type of vector that stores categorical or ordinal variables. For instance, instead of storing the strings Female and Male for every element, R stores the integer codes 1 and 2 together with one copy of the labels, which takes less space. To define a factor for gender, we first need a vector of gender values, such as

c("Female", "Male")

and then we pass it to the factor() function.

When we print the resulting factor, for example one named gender, R shows the gender values we stored plus a line called Levels. The levels are the possible values the factor can take.
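Here is a minimal sketch of a gender factor. The four values are hypothetical and serve only to show the levels and the underlying integer codes.

```r
# Hypothetical gender values
gender <- factor(c("Female", "Male", "Male", "Female"))

# Printing shows the values followed by a "Levels:" line
print(gender)

# The distinct categories, sorted alphabetically by default
levels(gender)      # "Female" "Male"

# The integer codes R stores internally
as.integer(gender)  # 1 2 2 1
```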

We can also declare levels that do not yet appear in the data. For instance, we currently have only BA and Master students, but in the future we may have PhD or Diploma students. So we create a factor that supports those future categories by specifying the levels argument: levels = c("BA", "Master", "PhD", "Diploma").
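A short sketch of a factor with explicitly declared levels follows. The three student values are hypothetical; the levels are the ones named in the text.

```r
# Hypothetical current data: only BA and Master students
Student_Level <- factor(
  c("BA", "Master", "BA"),
  levels = c("BA", "Master", "PhD", "Diploma")
)

# All four levels are available, even though two are unused so far
levels(Student_Level)  # "BA" "Master" "PhD" "Diploma"

# table() counts every level, reporting 0 for the unused ones
level_counts <- table(Student_Level)
print(level_counts)
```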

Lists – list()

A list is similar to a vector. However, a list can hold a combination of data types, whereas a vector can hold only one data type.


For instance, we can use a list to store different kinds of information about a student in a single object. Calling the list prints each element in turn, together with its value. A list therefore lets us combine data types.
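The sketch below builds such a list. The element names (name, age, married) and their values are assumptions made for illustration.

```r
# A list mixing character, integer and logical values.
# Element names and values are hypothetical.
students <- list(name = "John", age = 23L, married = TRUE)

# Printing shows each element with its name and value
print(students)

# Access elements by name
students$age        # 23
students[["name"]]  # "John"

# Each element keeps its own type
element_types <- sapply(students, typeof)
print(element_types)
```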

Data frames – data.frame()

Data frames are the most important data structure in the machine learning process. A data frame is similar to a table, as it has both columns and rows.


We define a data frame with the data.frame() function. In this example, studentData is a data frame built from vectors such as Subject_name, Age, Gender and Student_Level.

In versions of R before 4.0.0, data.frame() automatically converts every character vector to a factor. To avoid that, we normally pass stringsAsFactors = FALSE, which specifies that character data should not be treated as factors. Since R 4.0.0 this is the default behavior.

Calling studentData prints the data frame as a table, with one row per student and one column per vector.
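A minimal sketch of building studentData is given here. The vector names follow the text, while all of the values are hypothetical.

```r
# Hypothetical example vectors, one value per student
Subject_name  <- c("John", "Jane", "Steve")
Age           <- c(23L, 41L, 32L)
Gender        <- factor(c("Male", "Female", "Male"))
Student_Level <- factor(c("BA", "Master", "BA"),
                        levels = c("BA", "Master", "PhD", "Diploma"))

# Combine the vectors into a data frame; keep names as plain text
studentData <- data.frame(Subject_name, Age, Gender, Student_Level,
                          stringsAsFactors = FALSE)

print(studentData)  # tabular view: one row per student
str(studentData)    # compact summary of each column's type
```

With stringsAsFactors = FALSE, Subject_name stays a character column, while Gender and Student_Level remain factors because they were created as factors explicitly.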

As a data frame is like a table, we can access its cells, rows and columns separately.

For instance, to fetch a single column such as Age, we can refer to it by name, for example with the $ operator (studentData$Age). This returns only the Age column, as a vector.

To see only the age and gender of the students, we can pass a vector of column names inside square brackets, for example studentData[c("Age", "Gender")].
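The two column selections can be sketched as follows, assuming a small data frame with hypothetical values.

```r
# Hypothetical example data frame
studentData <- data.frame(
  Subject_name = c("John", "Jane", "Steve"),
  Age          = c(23L, 41L, 32L),
  Gender       = factor(c("Male", "Female", "Male")),
  stringsAsFactors = FALSE
)

# One column by name: the result is a plain vector
ages <- studentData$Age
print(ages)  # 23 41 32

# Several columns by name: the result is a smaller data frame
age_gender <- studentData[c("Age", "Gender")]
print(age_gender)
```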

We can extract all the rows of the first column by leaving the row index empty, as in studentData[, 1].

We can also extract all the columns for specific students by leaving the column index empty, for example studentData[c(1, 3), ] for the first and third students.
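Indexing by position uses the form [rows, columns]. Here is a short sketch, assuming a data frame with hypothetical values; the choice of rows 1 and 3 is arbitrary.

```r
# Hypothetical example data frame
studentData <- data.frame(
  Subject_name = c("John", "Jane", "Steve"),
  Age          = c(23L, 41L, 32L),
  Gender       = factor(c("Male", "Female", "Male")),
  stringsAsFactors = FALSE
)

# All rows of the first column (returned as a vector)
first_col <- studentData[, 1]
print(first_col)  # "John" "Jane" "Steve"

# All columns for the first and third students
some_rows <- studentData[c(1, 3), ]
print(some_rows)

# A single cell: second row, second column
one_cell <- studentData[2, 2]
print(one_cell)  # 41
```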

In the next post I will show how we can get data from different sources and how to visualize the data inside R.

Reference: B. Lantz. Machine Learning with R, Packt Publishing, 2015

Leila Etaati
Trainer, Consultant, Mentor
Leila is the first Microsoft AI MVP in New Zealand and Australia. She has a Ph.D. in Information Systems from the University of Auckland. She is the co-director and a data scientist at RADACAD, a company with more than 100 clients around the world. She is a co-organizer of the Microsoft Business Intelligence and Power BI User Group (meetup) in Auckland, which has more than 1200 members, and of three main conferences in Auckland: SQL Saturday Auckland (2015 till now) with more than 400 registrations, Difinity (2017 till now) with more than 200 registrations, and Global AI Bootcamp 2018. She is a data scientist, BI consultant, trainer, and speaker. She is a well-known international speaker at many conferences, such as Microsoft Ignite, SQL PASS, Data Platform Summit, SQL Saturday, and Power BI World Tour, in Europe, the USA, Asia, Australia, and New Zealand. She has over ten years' experience working with databases and software systems and has been involved in many large-scale projects for large companies. She is also an AI and Data Platform Microsoft MVP. Leila is an active technical Microsoft AI blogger for RADACAD.
