Restaurant Inspection

Point your browser at the New York City datasets page. Click on the magnifying glass with the plus along the left edge. In the New York City Restaurant Inspection Results window that pops up, click here to view the dataset. In the upper right corner, click on the blue Export button and select Download as CSV (comma-separated values). Move the resulting file DOHMH_New_York_City_Restaurant_Inspection_Results.csv from your Downloads folder to your Desktop.

Two restaurants might have the same name (DBA, “Doing Business As”), but each restaurant has its own CAMIS number (“Commercial Activities Management Information System”). Each restaurant has one record for each of its inspections. The building column is text because some values have characters that are not digits.

The .import command is confused by the commas in the cuisine_description and violation_description fields, even though these fields have "double quotes" around them when they contain commas. To avoid the problem, perl changes each of these commas into a |, and sqlite3 changes them back to commas.

The || operator is concatenation.

I ran the following commands in the Terminal application on my Mac OS X 10.10.4. The downloaded CSV file (minus lowercase L for “long”):

ls -l DOHMH_New_York_City_Restaurant_Inspection_Results.csv
-rw-r--r--  1 myname  mygroup  188445195 Aug 15 21:27 DOHMH_New_York_City_Restaurant_Inspection_Results.csv

Number of lines in the file (minus lowercase L for “lines”):

wc -l DOHMH_New_York_City_Restaurant_Inspection_Results.csv
  493373 DOHMH_New_York_City_Restaurant_Inspection_Results.csv

First three lines in the file:

head -3 DOHMH_New_York_City_Restaurant_Inspection_Results.csv
CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE
30075445,MORRIS PARK BAKE SHOP,BRONX,1007,MORRIS PARK AVE,10462,7188924968,Bakery,02/09/2015,Violations were cited in the following area(s).,06C,"Food not protected from potential source of contamination during storage, preparation, transportation, display or service.",Critical,6,A,02/09/2015,06/23/2015,Cycle Inspection / Initial Inspection
30075445,MORRIS PARK BAKE SHOP,BRONX,1007,MORRIS PARK AVE,10462,7188924968,Bakery,03/03/2014,Violations were cited in the following area(s).,10F,"Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.",Not Critical,2,A,03/03/2014,06/23/2015,Cycle Inspection / Initial Inspection

The file temp.csv that will be created by perl has no title line, has an extra column on the left, and has a | in place of each comma that is not a field delimiter.

ls -l temp.csv
-rw-r--r--  1 myname  mygroup  192100190 Jun 24 14:56 temp.csv

wc -l temp.csv
  492560 temp.csv

head -3 temp.csv
1,30075445,MORRIS PARK BAKE SHOP,BRONX,1007,MORRIS PARK AVE,10462,7188924968,Bakery,02/09/2015,Violations were cited in the following area(s).,06C,"Food not protected from potential source of contamination during storage| preparation| transportation| display or service.",Critical,6,A,02/09/2015,06/23/2015,Cycle Inspection / Initial Inspection
2,30075445,MORRIS PARK BAKE SHOP,BRONX,1007,MORRIS PARK AVE,10462,7188924968,Bakery,03/03/2014,Violations were cited in the following area(s).,10F,"Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed| raised| spaced or movable to allow accessibility for cleaning on all sides| above and underneath the unit.",Not Critical,2,A,03/03/2014,06/23/2015,Cycle Inspection / Initial Inspection
3,30075445,MORRIS PARK BAKE SHOP,BRONX,1007,MORRIS PARK AVE,10462,7188924968,Bakery,10/10/2013,No violations were recorded at the time of this inspection.,,,Not Applicable,,,,06/23/2015,Trans Fat / Second Compliance Inspection

We installed a version of sqlite3 when we installed Android Studio.

which sqlite3
/Users/myname/Library/Android/sdk/platform-tools/sqlite3

sqlite3 -version
3.8.6 2014-08-15 11:46:33 9491ba7d738528f168657adb43a198238abde19e

whereis sqlite3
/usr/bin/sqlite3

I ran the following bash (Bourne Again) shellscript, named makedb, to create a .db file containing a table named latest_table that contains only the most recent inspection for each restaurant. It creates the files temp.csv and restaurants.db. Then it removes temp.csv.

#!/bin/bash
cd /Users/myname/restaurant

curl \
'https://data.cityofnewyork.us/api/views/xx67-kt59/rows.csv?accessType=DOWNLOAD' \
> DOHMH_New_York_City_Restaurant_Inspection_Results.csv

#List the file.  -l stands for "long".
ls -l DOHMH_New_York_City_Restaurant_Inspection_Results.csv
#Print the number of lines in the file. -l stands for "lines".
wc -l DOHMH_New_York_City_Restaurant_Inspection_Results.csv

#Create the file temp.csv, which is a copy of
#DOHMH_New_York_City_Restaurant_Inspection_Results.csv
#with three changes:
# 1. Remove the first line.
# 2. Add a new column of numbers on the left: 1, 2, 3, etc.
# 3. In any field surrounded by "double quotes", change each comma to |
#    because the .import statement would be confused by the commas.

perl -ne '
        next if $. == 1;
        print $. - 1, ",";
        chomp;
        while ($_ ne "") {
		#If the line begins with a double-quoted string
		#that contains no double quotes, chop it off and store it in $&.
		if (s/^"[^"]*"//) {
			$field = $&;
                        $field =~ tr/,/|/;
                } else {
                        s/^[^,]*//;
			$field = $&;
                }
		print $field;

		#If the line begins with a comma,
		#chop it off and store it in $&.
                if (s/^,//) {
                        print $&;
                }
        }
        print "\n";
' DOHMH_New_York_City_Restaurant_Inspection_Results.csv > temp.csv

ls -l temp.csv
wc -l temp.csv

#Remove the old restaurants.db file (if it exists) and create a new one.
rm -f restaurants.db
sqlite3 restaurants.db << EOF

create table android_metadata (
        "locale" text default "en_US"
);

insert into android_metadata (locale) values ("en_US");

create table restaurant_table (
        _id integer primary key autoincrement,
        camis integer,
        dba text,
        boro text,
        building text,
        street text,
        zipcode integer,
        phone text,
        cuisine_description text,
        inspection_date text,
        action text,
        violation_code text,
        violation_description text,
        critical_flag text,
        score integer,
        grade text,
        grade_date text,
        record_date text,
        inspection_type text
);

.mode csv
.import temp.csv restaurant_table

select "Table restaurant_table contains " || count(*) || " records."
	from restaurant_table;

--Change the vertical bars back to commas now that the .import has been done.

update restaurant_table
	set violation_description = replace(violation_description, "|", ",");

--Change the format of each date from 12/31/2015 to 2015-12-31.
--Now an alphabetical comparison of two dates will tell us which one is earlier.

update restaurant_table set inspection_date =
        substr(inspection_date, 7, 4) || "-" ||
        substr(inspection_date, 1, 2) || "-" ||
        substr(inspection_date, 4, 2) where length(inspection_date) = 10;

update restaurant_table set grade_date =
        substr(grade_date, 7, 4) || "-" ||
        substr(grade_date, 1, 2) || "-" ||
        substr(grade_date, 4, 2) where length(grade_date) = 10;

update restaurant_table set record_date =
        substr(record_date, 7, 4) || "-" ||
        substr(record_date, 1, 2) || "-" ||
        substr(record_date, 4, 2) where length(record_date) = 10;

--Create a new table containing only the most recent inspection record
--for each restaurant.

create table latest_table (
        _id integer primary key autoincrement,
        camis integer,
        dba text,
        boro text,
        building text,
        street text,
        zipcode integer,
        phone text,
        cuisine_description text,
        latest_inspection_date text,
        action text,
        violation_code text,
        violation_description text,
        critical_flag text,
        score integer,
        grade text,
        grade_date text,
        record_date text,
        inspection_type text
);

insert into latest_table (
        _id,
        camis,
        dba,
        boro,
        building,
        street,
        zipcode,
        phone,
        cuisine_description,
        latest_inspection_date,
        action,
        violation_code,
        violation_description,
        critical_flag,
        score,
        grade,
        grade_date,
        record_date,
        inspection_type
) select * from (select
	_id,
	camis,
	dba,
	boro,
	building,
	street,
	zipcode,
	phone,
	cuisine_description,
	max(inspection_date) as latest_inspection_date,
	action,
	violation_code,
	violation_description,
	critical_flag,
	score,
	grade,
	grade_date,
	record_date,
	inspection_type
from restaurant_table group by camis);

select "Table latest_table contains " || count(*) || " records."
	from latest_table;

drop table restaurant_table;
vacuum;

--The following select statement prints too many fields to fit on one line,
--so we'll print each field on a separate line.
.mode line

--dba = "WO HOP 17", 17 Mott Street, Chinatown
select * from latest_table where camis = 41320866;

.quit
EOF

rm temp.csv
ls -l restaurants.db
exit 0

The output of the shellscript is

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  179M  100  179M    0     0   603k      0  0:05:04  0:05:04 --:--:-- 2068k
-rw-r--r--  1 myname  mygroup  188218168 Aug 14 12:58 DOHMH_New_York_City_Restaurant_Inspection_Results.csv
  492715 DOHMH_New_York_City_Restaurant_Inspection_Results.csv
-rw-r--r--  1 myname  mygroup  191555871 Aug 14 12:58 temp.csv
  492714 temp.csv
"Table restaurant_table contains 492714 records."
"Table latest_table contains 25356 records."
                   _id = 192839
                 camis = 41320866
                   dba = WO HOP 17
                  boro = MANHATTAN
              building = 17
                street = MOTT STREET
               zipcode = 10013
                 phone = 2122672536
   cuisine_description = Chinese
latest_inspection_date = 2014-09-04
                action = Violations were cited in the following area(s).
        violation_code = 04L
 violation_description = Evidence of mice or live mice present in facility's food and/or non-food areas.
         critical_flag = Critical
                 score = 10
                 grade = A
            grade_date = 2014-09-04
           record_date = 2015-08-13
       inspection_type = Cycle Inspection / Initial Inspection
-rw-------  1 myname  mygroup  11620352 Aug 14 12:59 restaurants.db