Data wrangling is the process of converting original data into a form that is ready for analysis. In this post we will see how to download and wrangle the Austin and Sacramento animal shelter impoundment data into standardized forms that can be manipulated easily with software.
We learned in the previous post that Austin and Sacramento have data portals hosting their municipal data sets. Each data portal runs a web application that supports browsing, searching, lightweight analysis, and visualization. You can learn how to use these web applications by reading the respective wikis:
- Austin Open Data Portal documentation, maintained by Open Austin.
- Sacramento Open Data Portal generic documentation, maintained by Junar.
However, in this series of posts we will sidestep the web applications and, instead, look at writing software to perform more complex analysis of the animal shelter data. We will program in the high-level language R, which is an interpreted language for data analysis and
These posts will be technical how-to’s, appealing primarily to those interested in software development and statistics. But at the same time, they will be narrative, telling the data-driven story of the Austin and Sacramento shelter animals.
Fetching Austin Data
In order to work with a data set in R, we must first load the data over the network. Since the Austin and Sacramento data portals are hosted by different companies, the method of fetching the data varies slightly.
The Austin data portal is hosted by Socrata, which provides a network API for accessing
Socrata data sets are identified by a URL. Since anonymous access is subject to throttling, applications typically identify themselves when accessing data. Client applications are identified by an application token. To acquire an application token, you’ll first have to open a Socrata developer account, and then create a new application.
Each data set URL can specify the way in which the data is provided, including the interchange format and whether the data is broken into pages. For our purposes, we will download a complete data set in CSV format. The parameterized URL to do that is the following:
https://data.austintexas.gov/resource/{data set id}.csv/?$$app_token={app token}&limit=50000
File download from a URL is easily done in R:
download.file(fromUrl, localFilePath, method = "curl") atxRawData <- read.csv(localFilePath, header = TRUE)
The downloaded CSV file is spooled to a local file, then read into an R data frame. Our toolkit hides these details, and implements downloading of the raw Austin intake or outcome data sets in a single call:
atxRawIntakeData <- atxLoadRawIntake() atxRawOutcomeData <- atxLoadRawOutcome()
As a starting point for data wrangling, the toolkit loads raw open data into R data sets in which each column is a vector of strings.
Fetching Sacramento Data
The Sacramento data portal is hosted by Junar, which provides a network API for accessing
The parameterized URL for a whole Sacramento data set in CSV format is the following:
http://api.data.cityofsacramento.org/api/v2/datastreams/{data set id}/data.csv/?auth_key={app token}&limit=50000
The Sacramento impoundment data comes in a single data set containing both intake and outcome events. The {data set id}
is "ANIMA-INTAK-AND-OUTCO
".
The basic code for downloading a Sacramento CSV file and loading it into an initial data frame is the same as for Austin.
Wrangling Austin Data
The raw Austin impoundment data is broken into two data sets, one for animal intake events and another for animal outcome events. The toolkit provides functions to fetch and wrangle each data set individually, and to create a merged data set of impoundment events (i.e., paired intake and outcome events):
atxWrangledIntakeData <- atxLoadIntake() atxWrangledOutcomeData <- atxLoadOutcome() atxWrangledImpoundData <- atxLoadOpenData()
The sections that follow describe the wrangling done for us by the toolkit, in terms of the raw data input and the transformed data output.
Austin Intake Data
Here are the first few items of the raw data frame for Austin animal intake events:
> atxRawIntakeData <- atxLoadRawIntake() > glimpse(atxRawIntakeData) Observations: 49,653 Variables: 12 $ age_upon_intake (chr) "8 years", "11 months", "4 weeks", "4 years", "2 years", "5 months", "2 years", "14 years",... $ animal_id (chr) "A706918", "A724273", "A665644", "A682524", "A708452", "A707375", "A696408", "A657188", "A7... $ animal_type (chr) "Dog", "Dog", "Cat", "Dog", "Dog", "Dog", "Dog", "Cat", "Cat", "Dog", "Dog", "Cat", "Dog", ... $ breed (chr) "English Springer Spaniel", "Basenji Mix", "Domestic Shorthair Mix", "Doberman Pinsch/Austr... $ color (chr) "White/Liver", "Sable/White", "Calico", "Tan/Gray", "Black/White", "Brown/White", "Tricolor... $ datetime (chr) "2015-07-05T12:59:00.000", "2016-04-14T18:43:00.000", "2013-10-21T07:59:00.000", "2014-06-2... $ datetime2 (chr) "2015-07-05T12:59:00.000", "2016-04-14T18:43:00.000", "2013-10-21T07:59:00.000", "2014-06-2... $ found_location (chr) "9409 Bluegrass Dr in Austin (TX)", "2818 Palomino Trail in Austin (TX)", "Austin (TX)", "8... $ intake_condition (chr) "Normal", "Normal", "Sick", "Normal", "Normal", "Normal", "Normal", "Normal", "Normal", "No... $ intake_type (chr) "Stray", "Stray", "Stray", "Stray", "Public Assist", "Stray", "Stray", "Owner Surrender", "... $ name (chr) "Belle", "Runster", "", "Rio", "Mumble", "*Candy Cane", "*Pearl", "Tommy", "*Mint", "", "St... $ sex_upon_intake (chr) "Spayed Female", "Intact Male", "Intact Female", "Neutered Male", "Intact Male", "Intact Fe...
The toolkit transforms each raw column into one or more new columns, and then to binds all the new columns into a final data frame.
Age
The raw age_upon_intake
data is a string that either comprises count and units of measure (e.g., "1 year" or "14 days") or is the literal value "NULL". The raw data is transformed into three new columns:
intake_age_count
is an integer number.intake_age_units
is the factor (dy, wk, mo, yr), which is the unit of measurement of intake age count.intake_age
is a lubridate::duration that represents the age in common units(seconds).[6] This representation enables sorting and binning of animals by age.
The "NULL" raw value maps to NA
in all three new columns.
Sex
The raw sex_upon_intake
data is a string that comprises spay/neuter status and gender (e.g., "Spayed Female" or "Neutered Male" or "Intact Male") or is the literal value "Unknown". The raw data is transformed into two new columns:
gender
is the factor (Male, Female).intake_spay_neuter
is the factor (Intact, Altered).
The "Unknown" raw value maps to NA
in both new columns.
Breed
The raw breed
data is a string that comprises one or two breed categories (e.g., "Labrador Retriever" or "Labrador Retriever/German Shepherd"), or describes an uncategorized mix of two or more breeds (e.g., "Labrador Retriever Mix"), or describes a species (e.g., "Squirrel"). The raw data is transformed into two new columns:
breed_1
is a factor describing the primary breed or the species.breed_2
is a factor describing the secondary breed, or takes the value "Mix" when the animal is a mixed-breed and the secondary breed is not assigned. The value isNA
for animals that are categorized as single-breed.
Color
The raw color
data is a string that comprises one or two color/coat categories. (e.g., "Sable/White" or "Orange Tabby/White" or "Seal Point"). The raw data is transformed into two new columns:
color_1
is a factor describing the primary color/coat.color_2
is a factor describing the secondary color/coat. The value isNA
for animals that are categorized by a single color/coat.
Intake Date
The raw datetime
data is a string that represents the intake date and time in ISO 8601 format (e.g., "2015-07-05T12:59:00.000"). The time is local Austin time, not UTC. The raw data is transformed into one new column:
intake_date
is a POSIXct timestamp, which represents time as the signed number of seconds relative to a well-knownorigin.[7]
The raw datetime2
data is ignored, because it merely duplicates the raw datetime
data.
Clean Up
The remaining raw data is transformed into the following new columns:
animal_id
is the animal impound ID (animal_id
) as a factor.kind
is the animal type (animal_type
) as a factor.intake_type
is the animal intake category (intake_type
) as a factor.intake_condition
is the animal intake condition category (intake_condition
) as a factor.intake_location
is the animal pick-up location (found_location
) as a string. The raw data values are suitable for geo-coding.name
is the animal name (name
) as a factor. Some raw names are flagged with an asterisk, no doubt signifying a special (but unknown to us) status. The asterisks are removed from the names.
For all conversions to factor, the empty-string raw data value is mapped to NA
.
The wrangled data frame of Austin animal intake events looks like this:
> atxIntakeData <- atxLoadIntake() > glimpse(atxIntakeData) Observations: 49,653 Variables: 16 $ animal_id (fctr) A706918, A724273, A665644, A682524, A708452, A707375, A696408, A657188, A707658, A697950,... $ kind (fctr) Dog, Dog, Cat, Dog, Dog, Dog, Dog, Cat, Cat, Dog, Dog, Cat, Dog, Other, Cat, Dog, Cat, Ca... $ name (fctr) Belle, Runster, NA, Rio, Mumble, Candy Cane, Pearl, Tommy, Mint, NA, Stumpy, Cleopatra, N... $ gender (fctr) Female, Male, Female, Male, Male, Female, Female, Male, Female, Female, Female, Female, F... $ color_1 (fctr) White, Sable, Calico, Tan, Black, Brown, Tricolor, Brown Tabby, Black, Tan, Black, Blue, ... $ color_2 (fctr) Liver, White, NA, Gray, White, White, NA, White, White, White, White, NA, NA, NA, NA, Whi... $ breed_1 (fctr) English Springer Spaniel, Basenji, Domestic Shorthair, Doberman Pinsch, Labrador Retrieve... $ breed_2 (fctr) NA, Mix, Mix, Australian Cattle Dog, Mix, NA, NA, Mix, Mix, Labrador Retriever, Mix, Mix,... $ intake_date (time) 2015-07-05, 2016-04-14, 2013-10-21, 2014-06-29, 2015-07-30, 2015-07-11, 2015-02-04, 2013-... $ intake_type (fctr) Stray, Stray, Stray, Stray, Public Assist, Stray, Stray, Owner Surrender, Stray, Stray, S... $ intake_condition (fctr) Normal, Normal, Sick, Normal, Normal, Normal, Normal, Normal, Normal, Normal, Normal, Nor... $ intake_location (chr) "9409 Bluegrass Dr in Austin (TX)", "2818 Palomino Trail in Austin (TX)", "Austin (TX)", "... $ intake_age_count (int) 21, 4, 17, 17, 13, 18, 13, 7, 17, 13, 11, 13, 18, 2, 13, 2, 17, 13, 16, 17, 13, 13, 2, 13,... $ intake_age_units (fctr) yr, mo, wk, yr, yr, mo, yr, yr, wk, mo, yr, yr, mo, yr, yr, yr, dy, wk, yr, yr, yr, yr, y... $ intake_age (dbl) 662256000, 10518984, 10281600, 536112000, 409968000, 47335428, 409968000, 220752000, 10281... $ intake_spay_neuter (fctr) Altered, Intact, Intact, Altered, Intact, Intact, Intact, Altered, Intact, Intact, Altere...
Austin Outcome Data
Here are the first few items of the raw data frame for Austin animal outcome events:
> atxRawOutcomeData <- atxLoadRawOutcome() > glimpse(atxRawOutcomeData) Observations: 49,255 Variables: 12 $ age_upon_outcome (chr) "4 months", "2 weeks", "5 months", "6 years", "3 months", "2 months", "1 month", "1 year", ... $ animal_id (chr) "A720371", "A725795", "A710054", "A722378", "A707743", "A705205", "A702919", "A709257", "A6... $ animal_type (chr) "Dog", "Other", "Dog", "Dog", "Cat", "Cat", "Cat", "Other", "Cat", "Dog", "Dog", "Cat", "Ca... $ breed (chr) "Anatol Shepherd/Labrador Retriever", "Opossum", "Miniature Schnauzer Mix", "German Shepher... $ color (chr) "Buff", "Gray", "Black/Tan", "Brown/Black", "Black/White", "White/Blue Tabby", "Blue", "Bla... $ date_of_birth (chr) "2015-10-08T00:00:00.000", "2016-04-15T00:00:00.000", "2015-03-16T00:00:00.000", "2010-03-1... $ datetime (chr) "2016-02-13T17:59:00.000", "2016-05-03T12:39:00.000", "2015-08-18T17:46:00.000", "2016-03-3... $ monthyear (chr) "2016-02-13T17:59:00.000", "2016-05-03T12:39:00.000", "2015-08-18T17:46:00.000", "2016-03-3... $ name (chr) "Moose", "", "", "Canine", "", "*Dell", "", "", "Patches", "Copper", "", "Boris", "*Marigol... $ outcome_subtype (chr) "", "Underage", "", "", "SCRP", "Offsite", "Partner", "Rabies Risk", "", "Foster", "Foster"... $ outcome_type (chr) "Adoption", "Euthanasia", "Adoption", "Return to Owner", "Transfer", "Adoption", "Transfer"... $ sex_upon_outcome (chr) "Neutered Male", "Unknown", "Spayed Female", "Neutered Male", "Intact Male", "Neutered Male...
The raw age_upon_outcome
, date_of_birth
, and monthyear
columns are dropped, since the same information is carried in the age
column of the wrangled intake data set. The remaining raw columns are transformed in the same manner as their counterparts in the intake data set.
The wrangled data frame of Austin animal outcome events looks like this:
> atxOutcomeData <- atxLoadOutcome() > glimpse(atxOutcomeData) Observations: 49,255 Variables: 12 $ animal_id (fctr) A720371, A725795, A710054, A722378, A707743, A705205, A702919, A709257, A643424, A67909... $ kind (fctr) Dog, Other, Dog, Dog, Cat, Cat, Cat, Other, Cat, Dog, Dog, Cat, Cat, Dog, Dog, Cat, Dog... $ name (fctr) Moose, NA, NA, Canine, NA, Dell, NA, NA, Patches, Copper, NA, Boris, Marigold, Goowa, N... $ gender (fctr) Male, NA, Female, Male, Male, Male, Male, NA, Male, Male, Female, Male, Female, Male, M... $ color_1 (fctr) Buff, Gray, Black, Brown, Black, White, Blue, Black, Black, Red, Red, Orange Tabby, Bro... $ color_2 (fctr) NA, NA, Tan, Black, White, Blue Tabby, NA, NA, White, NA, White, White, White, Brown, W... $ breed_1 (fctr) Anatol Shepherd, Opossum, Miniature Schnauzer, German Shepherd, Domestic Shorthair, Dom... $ breed_2 (fctr) Labrador Retriever, NA, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix... $ outcome_date (time) 2016-02-13, 2016-05-03, 2015-08-18, 2016-03-31, 2015-07-18, 2015-06-20, 2015-05-21, 201... $ outcome_type (fctr) Adoption, Euthanasia, Adoption, Return to Owner, Transfer, Adoption, Transfer, Euthanas... $ outcome_subtype (fctr) NA, Underage, NA, NA, SCRP, Offsite, Partner, Rabies Risk, NA, Foster, Foster, NA, NA, ... $ outcome_spay_neuter (fctr) Altered, NA, Altered, Altered, Intact, Altered, Intact, NA, Altered, Altered, Altered, ...
Austin Merged Data
The toolkit also supports fetching and wrangling the Austin raw intake and outcome data into a merged data set of impoundment events, which looks like this:
> atxOpenData <- atxLoadOpenData() > glimpse(atxOpenData) Observations: 50,576 Variables: 20 $ animal_id (fctr) A006100, A006100, A047759, A134067, A141142, A163459, A165752, A178569, A189592, A19135... $ kind (fctr) Dog, Dog, Dog, Dog, Dog, Dog, Dog, Dog, Dog, Cat, Cat, Dog, Dog, Dog, Cat, Dog, Dog, Do... $ name (fctr) Scamp, Scamp, Oreo, Bandit, Bettie, Sasha, Pep, Boti, Ophelia, Bri-Bri, Sassafrass, Car... $ gender (fctr) Male, Male, Male, Male, Female, Female, Male, Male, Female, Female, Female, Male, Male,... $ color_1 (fctr) Yellow, Yellow, Tricolor, Brown, Black, Black, Brown, White, Brown, Black, Brown Tabby,... $ color_2 (fctr) White, White, NA, White, White, Gray, White, Black, White, White, White, Tan, White, Bl... $ breed_1 (fctr) Spinone Italiano, Spinone Italiano, Dachshund, Shetland Sheepdog, Labrador Retriever, M... $ breed_2 (fctr) Mix, Mix, NA, NA, Pit Bull, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Labrador Retriever,... $ intake_date (time) 2014-03-07, 2014-12-19, 2014-04-02, 2013-11-16, 2013-11-16, 2014-11-14, 2014-09-15, 201... $ intake_type (fctr) Public Assist, Public Assist, Owner Surrender, Public Assist, Stray, Stray, Stray, Publ... $ intake_condition (fctr) Normal, Normal, Normal, Injured, Aged, Normal, Normal, Normal, Normal, Normal, Normal, ... $ intake_location (fctr) 8700 Research in Austin (TX), 8700 Research Blvd in Austin (TX), Austin (TX), 12034 Res... $ intake_age_count (int) 19, 20, 3, 9, 8, 8, 8, 8, 11, 9, 7, 9, 7, 6, 12, 6, 8, 8, 9, 5, 8, 5, 6, 6, 5, 5, 8, 6, ... $ intake_age_units (fctr) yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr,... $ intake_age (int) 599184000, 630720000, 94608000, 283824000, 252288000, 252288000, 252288000, 252288000, 3... $ intake_spay_neuter (fctr) Altered, Altered, Altered, Altered, Altered, Intact, Altered, Altered, Altered, Intact,... $ outcome_date (time) 2014-03-08, 2014-12-20, 2014-04-07, 2013-11-16, 2013-11-17, 2014-11-14, 2014-09-15, 201... $ outcome_type (fctr) Return to Owner, Return to Owner, Transfer, Return to Owner, Return to Owner, Return to... $ outcome_subtype (fctr) NA, NA, Partner, NA, NA, NA, NA, NA, NA, NA, Partner, Foster, Suffering, NA, Suffering,... $ outcome_spay_neuter (fctr) Altered, Altered, Altered, Altered, Altered, Intact, Altered, Altered, Altered, Intact,...
Wrangling Sacramento Data
The Sacramento impoundment data is much less rich than the Austin data, containing only the barest minimum of information about each animal taken in by the shelter. Here are the first few items of the raw data frame for Sacramento animal impoundment events:
> sacRawOpenData <- sacLoadRawOpenData() > glimpse(sacRawOpenData) Observations: 31,157 Variables: 9 $ Animal_Id (chr) "A038219", "A083800", "A089599", "A106453", "A107900", "A113137", "A126087", "A128417", "... $ Animal_Name (chr) "BLANCA", "KAYLA", "CHUNKY", "CHEYANNE", "BETOVEN", "BEAR", "", "LILLY", "JEZABELL", "LUC... $ Animal_Type (chr) "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "CAT", "DOG", "DOG", "DOG", "DOG", "DOG", "CAT"... $ Activity_Number (chr) "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "... $ Intake_Type (chr) "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH... $ Picked_up_Location (chr) "FAIRFIELD AND ARCADE", "ADA", "ADA/EUTH REQ", "EITH REQ/ SIGNED ADA", "", "", "OWNED", "... $ Intake_Date (chr) "2013-06-12 00:00:00", "2014-12-07 00:00:00", "2013-08-11 00:00:00", "2013-06-19 00:00:00... $ Outcome_Type (chr) "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "RTO", "EUTH", "E... $ Outcome_Date (chr) "2013-06-12 00:00:00", "2014-12-07 00:00:00", "2013-08-11 00:00:00", "2013-06-19 00:00:00...
The raw data columns are transformed one-for-one to the following new columns:
animal_id
is the animal impound ID (Animal_Id
) as a factor.kind
is the animal type (Animal_Type
) as a factor.name
is the animal name (Animal_Name
) as a factor.intake_type
is the animal intake category (Intake_Type
) as a factor.intake_date
is the animal intake date (Intake_Date
) as a POSIXct timestamp. The time part of the raw date-time string (e.g., "2014-12-07 00:00:00") is always zero in the Sacramento open data.intake_location
is the animal pick-up location (Picked_up_Location
) as a string.outcome_type
is the animal outcome category (Outcome_Type
) as a factor.outcome_date
is the animal outcome date (Outcome_Date
) as a POSIXct timestamp.
The raw Activity_Number
column, which is the identifying number for an Animal Control event for which we have no other data, is dropped.
The wrangled data frame of Sacramento animal impoundment events looks like this:
> sacOpenData <- sacLoadOpenData() > glimpse(sacOpenData) Observations: 31,157 Variables: 8 $ animal_id (fctr) A038219, A083800, A089599, A106453, A107900, A113137, A126087, A128417, A132304, A146689, A... $ kind (fctr) DOG, DOG, DOG, DOG, DOG, DOG, CAT, DOG, DOG, DOG, DOG, DOG, CAT, DOG, DOG, DOG, DOG, DOG, D... $ name (fctr) BLANCA, KAYLA, CHUNKY, CHEYANNE, BETOVEN, BEAR, , LILLY, JEZABELL, LUCY, SMOKEY, ELVIS, WYT... $ intake_date (time) 2013-06-12, 2014-12-07, 2013-08-11, 2013-06-19, 2014-10-01, 2013-09-10, 2015-08-21, 2014-05... $ intake_type (fctr) EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, S... $ intake_location (fctr) FAIRFIELD AND ARCADE, ADA, ADA/EUTH REQ, EITH REQ/ SIGNED ADA, NA, NA, OWNED, ADA BACKGROUN... $ outcome_date (time) 2013-06-12, 2014-12-07, 2013-08-11, 2013-06-19, 2014-10-01, 2013-09-10, 2015-08-21, 2014-05... $ outcome_type (fctr) EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, RTO, EUTH, EUTH, EUTH, RTO, EUTH, RTO...
To Be Continued
These data sets are a good starting point for further work, but you can be sure more wrangling will be needed later, for example, to create derived values or remove outliers. In the upcoming posts in this series we will explore our initial wrangled data sets, investigate what can be learned from animal shelter open data, and ask how the quality of the currently available open data might be improved.
References
-
Crawley, Michael J. 2013. The R Book. John Wiley & Sons. (Amazon)
-
Clark, Michael. 2015. Introduction to R.
-
Socrata. 2016. Getting Started with the SODA Consumer API.
-
City of Chicago. 2014. RSocrata. Chicago Digital. GitHub project.
-
Junar. 2013. Sacramento Open Data API.
-
van Dunné, Frans. 2016. Access Open Data in R through the Junar API.
-
Grolemund, Garrett and Wickham, Hadley. 2011. Dates and Times Made Easy with lubridate. Journal of Statistical Software. vol. 40 no. 3.
-
inside-R. 2013. R Date-Time Classes.
- Crawley, Michael J. 2013. The R Book. John Wiley & Sons. (Amazon)
- Clark, Michael. 2015. Introduction to R.
- Socrata. 2016. Getting Started with the SODA Consumer API.
- City of Chicago. 2014. RSocrata. Chicago Digital. GitHub project.
- Junar. 2013. Sacramento Open Data API.
- van Dunné, Frans. 2016. Access Open Data in R through the Junar API.
- Grolemund, Garrett and Wickham, Hadley. 2011. Dates and Times Made Easy with lubridate. Journal of Statistical Software. vol. 40 no. 3.
- inside-R. 2013. R Date-Time Classes.