Building a Crime Database

4 minute read

The objective of this project is to create a database using postgreSQL.

  • Create a database and table
  • observe and optimize values based on observations
  • create users, groups, and privileges so it can be decided who can access the dataframe

My objective is to create a SQL database and table from the file ‘boston.csv’. The file contains information on crimes that occurred in Boston including the type of crime and the date that it occurred. The end goal is to have a functional database where I am capable of managing and assigning user accessibility.

I begin by creating the database. I initiate a connection with an existing database ‘dq’ and name my new database ‘crime_db’. New databases can’t be apart of a transaction box so I set my code to auto commit right after I execute the command. I then disconnect from database dq, connect to my newly created database, and create a schema called crimes.

import psycopg2
dqconn = psycopg2.connect(dbname = 'dq', user = 'dq')
dqcur = dqconn.cursor()
dqconn.autocommit = True
dqcur.execute('CREATE DATABASE crime_db')
dqconn.close()
conn = psycopg2.connect(dbname = 'crime_db', user = 'dq')
cur = conn.cursor()
cur.execute('CREATE SCHEMA crimes;')
conn.commit()
# Opening and reading the csv file
import csv
with open('boston.csv', 'r') as file:
    reader = csv.reader(file)
    rows = [row for row in reader]
    col_headers = rows[0]
    first_row = rows[1]
    print(rows[0], rows[1])
['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long'] ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']
# Identifying the types of values that exist in each column
import numpy as np
def get_col_set(csv_file,col_index):
    with open(csv_file, 'r') as file:
        next(file)
        reader = csv.reader(file)
        column_values = set([row[col_index] for row in reader])
    return column_values
for number in range(0,7):
    statement="Unique Values in Column {} are {}".format(number, len(get_col_set('boston.csv',number)))
    print(statement)
Unique Values in Column 0 are 298329
Unique Values in Column 1 are 219
Unique Values in Column 2 are 239
Unique Values in Column 3 are 1177
Unique Values in Column 4 are 7
Unique Values in Column 5 are 18177
Unique Values in Column 6 are 18177
# Used this to find the total characters in columns 5 and 6
lengths = [len(obj) for obj in get_col_set('boston.csv', 5)]
np.max(lengths)
11
  • Column 0 has a max of 6 integers smallint
  • Column 1 has a max of 4 integers smallint
  • Column 2 has a max of 58 Characters varchar(100)
  • Column 3 is a date with 10 characters date
  • Column 4 contains the 7 days of the week weekday_enum
  • column 5 and 6 contain latitudes and longitudes respectively

Having identified how each column is formatted I can create a table and format each column based on the information that I have. By formatting each column based on the type of data it will contain, the system will use less memory and be able to process the data quicker.

# Create a data type specifically for column 4
cur.execute('''CREATE TYPE weekday_enum AS
ENUM ('Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday')''')
# created table and assigned data types
cur.execute('''CREATE TABLE crimes.boston_crimes (incident_number integer PRIMARY KEY, offense_code smallint, description VARCHAR(100),
occurence_date date, day_of_the_week weekday_enum, lat DECIMAL(10, 8), long DECIMAL(10, 8));
''')
cur.execute('ALTER TABLE crimes.boston_crimes ALTER COLUMN incident_number TYPE integer;')
# added the rows from the csv to the table
with open("boston.csv") as f:
    cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f)

Lastly I create two groups and assign them different privileges. This way I can assign different users to each group to prevent unauthorized users from tampering with the database. Lastly I create two users and assign them to their respective privileges.

# Revoke all priviliges from public users
cur.execute('REVOKE ALL ON crimes.boston_crimes FROM public')
cur.execute('REVOKE ALL ON DATABASE crime_db FROM public')
# Create readonly and readwrite groups and assign corresponding priviliges.
cur.execute('CREATE GROUP readonly WITH NOLOGIN')
cur.execute('CREATE GROUP readwrite WITH NOLOGIN')
cur.execute('GRANT CONNECT ON DATABASE crime_db TO readonly, readwrite')
cur.execute('GRANT USAGE ON SCHEMA crimes TO  readonly, readwrite')
cur.execute('GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly')
cur.execute('GRANT SELECT, INSERT, DELETE, UPDATE ON ALL TABLES IN SCHEMA crimes TO readwrite')
# Create users and grant them group privileges.
cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1'")
cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2'")
cur.execute("GRANT readonly TO data_analyst")
cur.execute("GRANT readwrite TO data_scientist")
conn.commit()
# Checking priviliges on readwrite group
cur.execute('''SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee = 'readwrite';
    ''')
cur.fetchall()
[('readwrite', 'INSERT'),
 ('readwrite', 'SELECT'),
 ('readwrite', 'UPDATE'),
 ('readwrite', 'DELETE')]
# Print the first 10 columns to ensure the table is behaving properly
cur.execute('SELECT * FROM crimes.boston_crimes LIMIT 10')
first_10 = cur.fetchall()
first_10
[(1,
  619,
  'LARCENY ALL OTHERS',
  datetime.date(2018, 9, 2),
  'Sunday',
  Decimal('42.35779134'),
  Decimal('-71.13937053')),
 (2,
  1402,
  'VANDALISM',
  datetime.date(2018, 8, 21),
  'Tuesday',
  Decimal('42.30682138'),
  Decimal('-71.06030035')),
 (3,
  3410,
  'TOWED MOTOR VEHICLE',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.34658879'),
  Decimal('-71.07242943')),
 (4,
  3114,
  'INVESTIGATE PROPERTY',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.33418175'),
  Decimal('-71.07866441')),
 (5,
  3114,
  'INVESTIGATE PROPERTY',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.27536542'),
  Decimal('-71.09036101')),
 (6,
  3820,
  'M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.29019621'),
  Decimal('-71.07159012')),
 (7,
  724,
  'AUTO THEFT',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.30607218'),
  Decimal('-71.08273260')),
 (8,
  3301,
  'VERBAL DISPUTE',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.32701648'),
  Decimal('-71.10555088')),
 (9,
  301,
  'ROBBERY - STREET',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.33152148'),
  Decimal('-71.07085307')),
 (10,
  3301,
  'VERBAL DISPUTE',
  datetime.date(2018, 9, 3),
  'Monday',
  Decimal('42.29514664'),
  Decimal('-71.05860832'))]
# Closing the connection
conn.close()
# This is here incase I made a mistake
cur.execute('rollback;')