Animal Shelter Data Analysis Software Technical

Austin & Sacramento Open Data #2: Data Wrangling

Data wrangling is the process of converting original data into a form that is ready for analysis. In this post we will see how to download and wrangle the Austin and Sacramento animal shelter impoundment data into standardized forms that can be manipulated easily with software.

We learned in the previous post that Austin and Sacramento have data portals hosting their municipal data sets. Each data portal runs a web application that supports browsing, searching, lightweight analysis, and visualization. You can learn how to use these web applications by reading the respective wikis:

However, in this series of posts we will sidestep the web applications and, instead, look at writing software to perform more complex analysis of the animal shelter data. We will program in the high-level language R, which is an interpreted language for data analysis and visualization.[1,2] R is open-source, free to download, and can be used alone or together with the R Studio integrated development environment. Our analysis code will be built on top of a companion toolkit of R functions, which takes care of the low-level details of fetching and cleaning up the raw open data. The toolkit is available on GitHub to view or download.

These posts will be technical how-to’s, appealing primarily to those interested in software development and statistics. But at the same time, they will be narrative, telling the data-driven story of the Austin and Sacramento shelter animals.

Fetching Austin Data

In order to work with a data set in R, we must first load the data over the network. Since the Austin and Sacramento data portals are hosted by different companies, the method of fetching the data varies slightly.

The Austin data portal is hosted by Socrata, which provides a network API for accessing data.[3,4] The API supports downloading, querying (filtering), and paging. We will use the API only to download data sets.

Socrata data sets are identified by a URL. Since anonymous access is subject to throttling, applications typically identify themselves when accessing data. Client applications are identified by an application token. To acquire an application token, you’ll first have to open a Socrata developer account, and then create a new application.

Each data set URL can specify the way in which the data is provided, including the interchange format and whether the data is broken into pages. For our purposes, we will download a complete data set in CSV format. The parameterized URL to do that is the following:

https://data.austintexas.gov/resource/{data set id}.csv/?$$app_token={app token}&limit=50000

File download from a URL is easily done in R:

download.file(fromUrl, localFilePath, method = "curl")
atxRawData <- read.csv(localFilePath, header = TRUE)

The downloaded CSV file is spooled to a local file, then read into an R data frame. Our toolkit hides these details, and implements downloading of the raw Austin intake or outcome data sets in a single call:

atxRawIntakeData <- atxLoadRawIntake()
atxRawOutcomeData <- atxLoadRawOutcome()

As a starting point for data wrangling, the toolkit loads raw open data into R data sets in which each column is a vector of strings.

Fetching Sacramento Data

The Sacramento data portal is hosted by Junar, which provides a network API for accessing data.[5,6] The Junar API has features similar to the Socrata API, including using an application token. You can create an application token by going to the Sacramento Open Data developer page.

The parameterized URL for a whole Sacramento data set in CSV format is the following:

http://api.data.cityofsacramento.org/api/v2/datastreams/{data set id}/data.csv/?auth_key={app token}&limit=50000

The Sacramento impoundment data comes in a single data set containing both intake and outcome events. The {data set id} is "ANIMA-INTAK-AND-OUTCO".

The basic code for downloading a Sacramento CSV file and loading it into an initial data frame is the same as for Austin.

Wrangling Austin Data

The raw Austin impoundment data is broken into two data sets, one for animal intake events and another for animal outcome events. The toolkit provides functions to fetch and wrangle each data set individually, and to create a merged data set of impoundment events (i.e., paired intake and outcome events):

atxWrangledIntakeData <- atxLoadIntake()
atxWrangledOutcomeData <- atxLoadOutcome()
atxWrangledImpoundData <- atxLoadOpenData()

The sections that follow describe the wrangling done for us by the toolkit, in terms of the raw data input and the transformed data output.

Austin Intake Data

Here are the first few items of the raw data frame for Austin animal intake events:

> atxRawIntakeData <- atxLoadRawIntake()
> glimpse(atxRawIntakeData)
Observations: 49,653
Variables: 12
$ age_upon_intake  (chr) "8 years", "11 months", "4 weeks", "4 years", "2 years", "5 months", "2 years", "14 years",...
$ animal_id        (chr) "A706918", "A724273", "A665644", "A682524", "A708452", "A707375", "A696408", "A657188", "A7...
$ animal_type      (chr) "Dog", "Dog", "Cat", "Dog", "Dog", "Dog", "Dog", "Cat", "Cat", "Dog", "Dog", "Cat", "Dog", ...
$ breed            (chr) "English Springer Spaniel", "Basenji Mix", "Domestic Shorthair Mix", "Doberman Pinsch/Austr...
$ color            (chr) "White/Liver", "Sable/White", "Calico", "Tan/Gray", "Black/White", "Brown/White", "Tricolor...
$ datetime         (chr) "2015-07-05T12:59:00.000", "2016-04-14T18:43:00.000", "2013-10-21T07:59:00.000", "2014-06-2...
$ datetime2        (chr) "2015-07-05T12:59:00.000", "2016-04-14T18:43:00.000", "2013-10-21T07:59:00.000", "2014-06-2...
$ found_location   (chr) "9409 Bluegrass Dr in Austin (TX)", "2818 Palomino Trail in Austin (TX)", "Austin (TX)", "8...
$ intake_condition (chr) "Normal", "Normal", "Sick", "Normal", "Normal", "Normal", "Normal", "Normal", "Normal", "No...
$ intake_type      (chr) "Stray", "Stray", "Stray", "Stray", "Public Assist", "Stray", "Stray", "Owner Surrender", "...
$ name             (chr) "Belle", "Runster", "", "Rio", "Mumble", "*Candy Cane", "*Pearl", "Tommy", "*Mint", "", "St...
$ sex_upon_intake  (chr) "Spayed Female", "Intact Male", "Intact Female", "Neutered Male", "Intact Male", "Intact Fe...

The toolkit transforms each raw column into one or more new columns, and then to binds all the new columns into a final data frame.

Age

The raw age_upon_intake data is a string that either comprises count and units of measure (e.g., "1 year" or "14 days") or is the literal value "NULL". The raw data is transformed into three new columns:

  • intake_age_count is an integer number.
  • intake_age_units is the factor (dy, wk, mo, yr), which is the unit of measurement of intake age count.
  • intake_age is a lubridate::duration that represents the age in common units (seconds).[6] This representation enables sorting and binning of animals by age.

The "NULL" raw value maps to NA in all three new columns.

Sex

The raw sex_upon_intake data is a string that comprises spay/neuter status and gender (e.g., "Spayed Female" or "Neutered Male" or "Intact Male") or is the literal value "Unknown". The raw data is transformed into two new columns:

  • gender is the factor (Male, Female).
  • intake_spay_neuter is the factor (Intact, Altered).

The "Unknown" raw value maps to NA in both new columns.

Breed

The raw breed data is a string that comprises one or two breed categories (e.g., "Labrador Retriever" or "Labrador Retriever/German Shepherd"), or describes an uncategorized mix of two or more breeds (e.g., "Labrador Retriever Mix"), or describes a species (e.g., "Squirrel"). The raw data is transformed into two new columns:

  • breed_1 is a factor describing the primary breed or the species.
  • breed_2 is a factor describing the secondary breed, or takes the value "Mix" when the animal is a mixed-breed and the secondary breed is not assigned. The value is NA for animals that are categorized as single-breed.

Color

The raw color data is a string that comprises one or two color/coat categories. (e.g., "Sable/White" or "Orange Tabby/White" or "Seal Point"). The raw data is transformed into two new columns:

  • color_1 is a factor describing the primary color/coat.
  • color_2 is a factor describing the secondary color/coat. The value is NA for animals that are categorized by a single color/coat.

Intake Date

The raw datetime data is a string that represents the intake date and time in ISO 8601 format (e.g., "2015-07-05T12:59:00.000"). The time is local Austin time, not UTC. The raw data is transformed into one new column:

  • intake_date is a POSIXct timestamp, which represents time as the signed number of seconds relative to a well-known origin.[7]

The raw datetime2 data is ignored, because it merely duplicates the raw datetime data.

Clean Up

The remaining raw data is transformed into the following new columns:

  • animal_id is the animal impound ID (animal_id) as a factor.
  • kind is the animal type (animal_type) as a factor.
  • intake_type is the animal intake category (intake_type) as a factor.
  • intake_condition is the animal intake condition category (intake_condition) as a factor.
  • intake_location is the animal pick-up location (found_location) as a string. The raw data values are suitable for geo-coding.
  • name is the animal name (name) as a factor. Some raw names are flagged with an asterisk, no doubt signifying a special (but unknown to us) status. The asterisks are removed from the names.

For all conversions to factor, the empty-string raw data value is mapped to NA.

The wrangled data frame of Austin animal intake events looks like this:

> atxIntakeData <- atxLoadIntake()
> glimpse(atxIntakeData)
Observations: 49,653
Variables: 16
$ animal_id          (fctr) A706918, A724273, A665644, A682524, A708452, A707375, A696408, A657188, A707658, A697950,...
$ kind               (fctr) Dog, Dog, Cat, Dog, Dog, Dog, Dog, Cat, Cat, Dog, Dog, Cat, Dog, Other, Cat, Dog, Cat, Ca...
$ name               (fctr) Belle, Runster, NA, Rio, Mumble, Candy Cane, Pearl, Tommy, Mint, NA, Stumpy, Cleopatra, N...
$ gender             (fctr) Female, Male, Female, Male, Male, Female, Female, Male, Female, Female, Female, Female, F...
$ color_1            (fctr) White, Sable, Calico, Tan, Black, Brown, Tricolor, Brown Tabby, Black, Tan, Black, Blue, ...
$ color_2            (fctr) Liver, White, NA, Gray, White, White, NA, White, White, White, White, NA, NA, NA, NA, Whi...
$ breed_1            (fctr) English Springer Spaniel, Basenji, Domestic Shorthair, Doberman Pinsch, Labrador Retrieve...
$ breed_2            (fctr) NA, Mix, Mix, Australian Cattle Dog, Mix, NA, NA, Mix, Mix, Labrador Retriever, Mix, Mix,...
$ intake_date        (time) 2015-07-05, 2016-04-14, 2013-10-21, 2014-06-29, 2015-07-30, 2015-07-11, 2015-02-04, 2013-...
$ intake_type        (fctr) Stray, Stray, Stray, Stray, Public Assist, Stray, Stray, Owner Surrender, Stray, Stray, S...
$ intake_condition   (fctr) Normal, Normal, Sick, Normal, Normal, Normal, Normal, Normal, Normal, Normal, Normal, Nor...
$ intake_location    (chr) "9409 Bluegrass Dr in Austin (TX)", "2818 Palomino Trail in Austin (TX)", "Austin (TX)", "...
$ intake_age_count   (int) 21, 4, 17, 17, 13, 18, 13, 7, 17, 13, 11, 13, 18, 2, 13, 2, 17, 13, 16, 17, 13, 13, 2, 13,...
$ intake_age_units   (fctr) yr, mo, wk, yr, yr, mo, yr, yr, wk, mo, yr, yr, mo, yr, yr, yr, dy, wk, yr, yr, yr, yr, y...
$ intake_age         (dbl) 662256000, 10518984, 10281600, 536112000, 409968000, 47335428, 409968000, 220752000, 10281...
$ intake_spay_neuter (fctr) Altered, Intact, Intact, Altered, Intact, Intact, Intact, Altered, Intact, Intact, Altere...

Austin Outcome Data

Here are the first few items of the raw data frame for Austin animal outcome events:

> atxRawOutcomeData <- atxLoadRawOutcome()
> glimpse(atxRawOutcomeData)
Observations: 49,255
Variables: 12
$ age_upon_outcome (chr) "4 months", "2 weeks", "5 months", "6 years", "3 months", "2 months", "1 month", "1 year", ...
$ animal_id        (chr) "A720371", "A725795", "A710054", "A722378", "A707743", "A705205", "A702919", "A709257", "A6...
$ animal_type      (chr) "Dog", "Other", "Dog", "Dog", "Cat", "Cat", "Cat", "Other", "Cat", "Dog", "Dog", "Cat", "Ca...
$ breed            (chr) "Anatol Shepherd/Labrador Retriever", "Opossum", "Miniature Schnauzer Mix", "German Shepher...
$ color            (chr) "Buff", "Gray", "Black/Tan", "Brown/Black", "Black/White", "White/Blue Tabby", "Blue", "Bla...
$ date_of_birth    (chr) "2015-10-08T00:00:00.000", "2016-04-15T00:00:00.000", "2015-03-16T00:00:00.000", "2010-03-1...
$ datetime         (chr) "2016-02-13T17:59:00.000", "2016-05-03T12:39:00.000", "2015-08-18T17:46:00.000", "2016-03-3...
$ monthyear        (chr) "2016-02-13T17:59:00.000", "2016-05-03T12:39:00.000", "2015-08-18T17:46:00.000", "2016-03-3...
$ name             (chr) "Moose", "", "", "Canine", "", "*Dell", "", "", "Patches", "Copper", "", "Boris", "*Marigol...
$ outcome_subtype  (chr) "", "Underage", "", "", "SCRP", "Offsite", "Partner", "Rabies Risk", "", "Foster", "Foster"...
$ outcome_type     (chr) "Adoption", "Euthanasia", "Adoption", "Return to Owner", "Transfer", "Adoption", "Transfer"...
$ sex_upon_outcome (chr) "Neutered Male", "Unknown", "Spayed Female", "Neutered Male", "Intact Male", "Neutered Male...

The raw age_upon_outcome, date_of_birth, and monthyear columns are dropped, since the same information is carried in the age column of the wrangled intake data set. The remaining raw columns are transformed in the same manner as their counterparts in the intake data set.

The wrangled data frame of Austin animal outcome events looks like this:

> atxOutcomeData <- atxLoadOutcome()
> glimpse(atxOutcomeData)
Observations: 49,255
Variables: 12
$ animal_id           (fctr) A720371, A725795, A710054, A722378, A707743, A705205, A702919, A709257, A643424, A67909...
$ kind                (fctr) Dog, Other, Dog, Dog, Cat, Cat, Cat, Other, Cat, Dog, Dog, Cat, Cat, Dog, Dog, Cat, Dog...
$ name                (fctr) Moose, NA, NA, Canine, NA, Dell, NA, NA, Patches, Copper, NA, Boris, Marigold, Goowa, N...
$ gender              (fctr) Male, NA, Female, Male, Male, Male, Male, NA, Male, Male, Female, Male, Female, Male, M...
$ color_1             (fctr) Buff, Gray, Black, Brown, Black, White, Blue, Black, Black, Red, Red, Orange Tabby, Bro...
$ color_2             (fctr) NA, NA, Tan, Black, White, Blue Tabby, NA, NA, White, NA, White, White, White, Brown, W...
$ breed_1             (fctr) Anatol Shepherd, Opossum, Miniature Schnauzer, German Shepherd, Domestic Shorthair, Dom...
$ breed_2             (fctr) Labrador Retriever, NA, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix...
$ outcome_date        (time) 2016-02-13, 2016-05-03, 2015-08-18, 2016-03-31, 2015-07-18, 2015-06-20, 2015-05-21, 201...
$ outcome_type        (fctr) Adoption, Euthanasia, Adoption, Return to Owner, Transfer, Adoption, Transfer, Euthanas...
$ outcome_subtype     (fctr) NA, Underage, NA, NA, SCRP, Offsite, Partner, Rabies Risk, NA, Foster, Foster, NA, NA, ...
$ outcome_spay_neuter (fctr) Altered, NA, Altered, Altered, Intact, Altered, Intact, NA, Altered, Altered, Altered, ...

Austin Merged Data

The toolkit also supports fetching and wrangling the Austin raw intake and outcome data into a merged data set of impoundment events, which looks like this:

> atxOpenData <- atxLoadOpenData()
> glimpse(atxOpenData)
Observations: 50,576
Variables: 20
$ animal_id           (fctr) A006100, A006100, A047759, A134067, A141142, A163459, A165752, A178569, A189592, A19135...
$ kind                (fctr) Dog, Dog, Dog, Dog, Dog, Dog, Dog, Dog, Dog, Cat, Cat, Dog, Dog, Dog, Cat, Dog, Dog, Do...
$ name                (fctr) Scamp, Scamp, Oreo, Bandit, Bettie, Sasha, Pep, Boti, Ophelia, Bri-Bri, Sassafrass, Car...
$ gender              (fctr) Male, Male, Male, Male, Female, Female, Male, Male, Female, Female, Female, Male, Male,...
$ color_1             (fctr) Yellow, Yellow, Tricolor, Brown, Black, Black, Brown, White, Brown, Black, Brown Tabby,...
$ color_2             (fctr) White, White, NA, White, White, Gray, White, Black, White, White, White, Tan, White, Bl...
$ breed_1             (fctr) Spinone Italiano, Spinone Italiano, Dachshund, Shetland Sheepdog, Labrador Retriever, M...
$ breed_2             (fctr) Mix, Mix, NA, NA, Pit Bull, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Mix, Labrador Retriever,...
$ intake_date         (time) 2014-03-07, 2014-12-19, 2014-04-02, 2013-11-16, 2013-11-16, 2014-11-14, 2014-09-15, 201...
$ intake_type         (fctr) Public Assist, Public Assist, Owner Surrender, Public Assist, Stray, Stray, Stray, Publ...
$ intake_condition    (fctr) Normal, Normal, Normal, Injured, Aged, Normal, Normal, Normal, Normal, Normal, Normal, ...
$ intake_location     (fctr) 8700 Research in Austin (TX), 8700 Research Blvd in Austin (TX), Austin (TX), 12034 Res...
$ intake_age_count    (int) 19, 20, 3, 9, 8, 8, 8, 8, 11, 9, 7, 9, 7, 6, 12, 6, 8, 8, 9, 5, 8, 5, 6, 6, 5, 5, 8, 6, ...
$ intake_age_units    (fctr) yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr, yr,...
$ intake_age          (int) 599184000, 630720000, 94608000, 283824000, 252288000, 252288000, 252288000, 252288000, 3...
$ intake_spay_neuter  (fctr) Altered, Altered, Altered, Altered, Altered, Intact, Altered, Altered, Altered, Intact,...
$ outcome_date        (time) 2014-03-08, 2014-12-20, 2014-04-07, 2013-11-16, 2013-11-17, 2014-11-14, 2014-09-15, 201...
$ outcome_type        (fctr) Return to Owner, Return to Owner, Transfer, Return to Owner, Return to Owner, Return to...
$ outcome_subtype     (fctr) NA, NA, Partner, NA, NA, NA, NA, NA, NA, NA, Partner, Foster, Suffering, NA, Suffering,...
$ outcome_spay_neuter (fctr) Altered, Altered, Altered, Altered, Altered, Intact, Altered, Altered, Altered, Intact,...

Wrangling Sacramento Data

The Sacramento impoundment data is much less rich than the Austin data, containing only the barest minimum of information about each animal taken in by the shelter. Here are the first few items of the raw data frame for Sacramento animal impoundment events:

> sacRawOpenData <- sacLoadRawOpenData()
> glimpse(sacRawOpenData)
Observations: 31,157
Variables: 9
$ Animal_Id          (chr) "A038219", "A083800", "A089599", "A106453", "A107900", "A113137", "A126087", "A128417", "...
$ Animal_Name        (chr) "BLANCA", "KAYLA", "CHUNKY", "CHEYANNE", "BETOVEN", "BEAR", "", "LILLY", "JEZABELL", "LUC...
$ Animal_Type        (chr) "DOG", "DOG", "DOG", "DOG", "DOG", "DOG", "CAT", "DOG", "DOG", "DOG", "DOG", "DOG", "CAT"...
$ Activity_Number    (chr) "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...
$ Intake_Type        (chr) "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH REQ", "EUTH...
$ Picked_up_Location (chr) "FAIRFIELD AND ARCADE", "ADA", "ADA/EUTH REQ", "EITH REQ/ SIGNED ADA", "", "", "OWNED", "...
$ Intake_Date        (chr) "2013-06-12 00:00:00", "2014-12-07 00:00:00", "2013-08-11 00:00:00", "2013-06-19 00:00:00...
$ Outcome_Type       (chr) "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "EUTH", "RTO", "EUTH", "E...
$ Outcome_Date       (chr) "2013-06-12 00:00:00", "2014-12-07 00:00:00", "2013-08-11 00:00:00", "2013-06-19 00:00:00...

The raw data columns are transformed one-for-one to the following new columns:

  • animal_id is the animal impound ID (Animal_Id) as a factor.
  • kind is the animal type (Animal_Type) as a factor.
  • name is the animal name (Animal_Name) as a factor.
  • intake_type is the animal intake category (Intake_Type) as a factor.
  • intake_date is the animal intake date (Intake_Date) as a POSIXct timestamp. The time part of the raw date-time string (e.g., "2014-12-07 00:00:00") is always zero in the Sacramento open data.
  • intake_location is the animal pick-up location (Picked_up_Location) as a string.
  • outcome_type is the animal outcome category (Outcome_Type) as a factor.
  • outcome_date is the animal outcome date (Outcome_Date) as a POSIXct timestamp.

The raw Activity_Number column, which is the identifying number for an Animal Control event for which we have no other data, is dropped.

The wrangled data frame of Sacramento animal impoundment events looks like this:

> sacOpenData <- sacLoadOpenData()
> glimpse(sacOpenData)
Observations: 31,157
Variables: 8
$ animal_id       (fctr) A038219, A083800, A089599, A106453, A107900, A113137, A126087, A128417, A132304, A146689, A...
$ kind            (fctr) DOG, DOG, DOG, DOG, DOG, DOG, CAT, DOG, DOG, DOG, DOG, DOG, CAT, DOG, DOG, DOG, DOG, DOG, D...
$ name            (fctr) BLANCA, KAYLA, CHUNKY, CHEYANNE, BETOVEN, BEAR, , LILLY, JEZABELL, LUCY, SMOKEY, ELVIS, WYT...
$ intake_date     (time) 2013-06-12, 2014-12-07, 2013-08-11, 2013-06-19, 2014-10-01, 2013-09-10, 2015-08-21, 2014-05...
$ intake_type     (fctr) EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, EUTH REQ, S...
$ intake_location (fctr) FAIRFIELD AND ARCADE, ADA, ADA/EUTH REQ, EITH REQ/ SIGNED ADA, NA, NA, OWNED, ADA BACKGROUN...
$ outcome_date    (time) 2013-06-12, 2014-12-07, 2013-08-11, 2013-06-19, 2014-10-01, 2013-09-10, 2015-08-21, 2014-05...
$ outcome_type    (fctr) EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, EUTH, RTO, EUTH, EUTH, EUTH, RTO, EUTH, RTO...

To Be Continued

These data sets are a good starting point for further work, but you can be sure more wrangling will be needed later, for example, to create derived values or remove outliers. In the upcoming posts in this series we will explore our initial wrangled data sets, investigate what can be learned from animal shelter open data, and ask how the quality of the currently available open data might be improved.

Photo "Still adoptable!" by Eileen McFall is licensed under CC BY-NC 2.0, cropped from original.

References

  1. Crawley, Michael J. 2013. The R Book. John Wiley & Sons. (Amazon)
  2. Clark, Michael. 2015. Introduction to R.
  3. Socrata. 2016. Getting Started with the SODA Consumer API.
  4. City of Chicago. 2014. RSocrata. Chicago Digital. GitHub project.
  5. Junar. 2013. Sacramento Open Data API.
  6. van Dunné, Frans. 2016. Access Open Data in R through the Junar API.
  7. Grolemund, Garrett and Wickham, Hadley. 2011. Dates and Times Made Easy with lubridate. Journal of Statistical Software. vol. 40 no. 3.
  8. inside-R. 2013. R Date-Time Classes.

Leave a Reply

Your email address will not be published. Required fields are marked *