Calculating Zip Code Distances Table


Calculating Zip Code Distances Table== Overview == I use MySQL to store my zipcode information in. The raw data is taken from a commercial package sold at http://www.zip-codes.com/, or you can usually find a very old and outdated list of US Zip Codes on the web from the 1990's. At the bottom of this page, there is a quick "How To" for loading zip-codes.com Business package, but the main documentation assumes you have a table in MySQL already that contains the fields zipcode, latitude and longitude. My database structure contains the following three field definitions, which are all that are needed for this to work:

zipcode    char(5),
latitude   decimal(12, 8),
longitude  decimal(12, 8)

Note that latitude and longitude are the decimal degrees. If your information is in degrees, minutes and seconds, you will have to convert it with the formula

latitude = degrees + (minutes + (seconds/60)) / 60

Limitations: These calculations are not high precision. They calculate the distance between two zip codes from the latitude and longitude of the geographical center of the zip code, not the border. Thus, it is theoretically possible that a single city block could separate addresses on the border of two zip codes, but these calculations would show them to be many miles apart.

Calculating the Distances

The Script

drop table if exists zip_codes_temp;
create table zip_codes_temp (
zip_code          char(5) not null unique,
latitude          double precision,
longitude         double precision,
sinlat            double precision,
coslat            double precision,
primary key       (zip_code)
) comment="List of all zip codes in US. Lat & Long are in radians";
ALTER TABLE zip_codes_temp DISABLE KEYS;
insert into zip_codes_temp
select zipcode, latitude/57.29577951, longitude/57.29577951, null, null from zip_codes
where primaryrecord = 'P'
update zip_codes_temp set sinlat = sin( latitude );
update zip_codes_temp set coslat = cos( latitude );
ALTER TABLE zip_codes_temp ENABLE KEYS;
drop table if exists zip_code_distances;
create table zip_code_distances (
zip_code_1        char(5) not null,
zip_code_2        char(5) not null,
distance          int not null,
key               ( zip_code_1 ),
key               ( zip_code_2 )
) comment="contains distance between various zip codes";
ALTER TABLE zip_code_distances DISABLE KEYS;
insert into zip_code_distances
select
a.zip_code, b.zip_code,3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) )
from
zip_codes_temp a,
zip_codes_temp b
where
a.zip_code < b.zip_code and
3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) <= 50;
ALTER TABLE zip_code_distances ENABLE KEYS;
drop table zip_codes_temp;

Discussion of Script

This script create a table named zip_code_distances that stores the distances between any two zip codes in one row. The fields zip_code_1 and zip_code_2 are arbitrary; the target and source may be in either, but no pair will be duplicated. Thus, for the distance between US Zip Codes 75214 and 75201, either 75214 will show up in zip_code_1 or zip_code_2, and 75201 will be in the other one. I made the decision to do this to save space at the minor expense of complexity and processing time when running queries. Instead of being able to give the simple query:

select distance from zip_code_distances where zip_code_1 = '75214' and zip_code_2 = '75201' Wrong: Will not work

You must use the query:

select distance from zip_code_distances where zip_code_1 in ('75214','75201) and zip_code_2 in ('75214','75201') Correct

The reason behind this is that we would require twice the disk space to store every possible permutation of zip codes. By only storing every combination, we are halving our storage size (which is still significant). Note: I could be wrong about this, as I find the index file associated with this table is half again as large as the actual data file

Additional Note: I used the disable keys/enable keys but saw no significant speed difference when I ran a test without them. Not sure how much that helps, though the MySQL manual states it is supposed to help a lot, so I left it in.

Now, about the script. The table zip_codes_temp is created simply to hold some intermediary values. Since we are using spherical geometry, the formulae are fairly resource intensive, and we will be calculating information over and over using the same values. We will use new values of latitude and longitude, storing the values as radians (dividing by 57.29577951 in the insert statement right after the table creation), then calculating the sine of both latitude and longitude prior to doing the distance calculation (the update statements).

The next step is to create a table to hold the actual distances. For this, I have two zip code fields (zip_code_1 and zip_code_2) and the distance between them (distance). Since we are searching on both zip_code_1 and zip_code_2, In index both (it would be faster to add the indexes after the table was populated, I'm sure, but I'm lazy the the computer isn't doing anything right now anyway).

Now, the calculation. The distance, in miles, between two points each having a latitude and longitude measured in radians is:

3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) miles

if sinlat and coslat are sine(latitude) and cos(latitude) respectively. To calculate in kilometers, simply multiply 3963.1 by 1.60934 (the number of kilometers in a mile) and change your formula to be:

6377.991*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) )  kilometers

The above script is designed to only store zip codes that are 50 miles or less distance from each other (the constant 50 at the end of the statement), though it still has to calculate each point in the database and the time difference in increasing this number is rather trivial, consisting only the time taken to store the result and its index.

By the way, I'm not going into the actual formulae for calculating this: it is all over the web (see http://en.wikipedia.org/wiki/Great-circle_distance for an interesting discussion).

Useful SQL

Calculate distance between to zip codes
select distance from zip_code_distances where zip_code_1 in ('75214','75201) and zip_code_2 in ('75214','75201')
Find all zipcodes within a 25 mile radius of the US zip code 75214
select zip_code_1 from zip_code_distances where zip_code_2 = '75214' and distance <= 25
union
select zip_code_2 from zip_code_distances where zip_code_1 = '75214' and distance <= 25

Run Time for Script

As far as time to run the script, I originally created this script back in the late 90's using a file I found on the 'net, apparently one of the last published for free by the USPS. I vaguely remember it taking five or six hours. One of my "cohorts in crime" thought it took longer than that. We were using what I seem to remember was a 500Mhz AMD processor with probably around 256M of RAM, but it could have been faster or slower, with more or less memory.

When I ran the run that led to this article in May of 2008, it was on a Debian Etch virtual running under Xen on an AMD Athlon 64 3800+. I allocated a gig of ram to the virtual, and the other virtuals were quiescent during the run. I expected it to be faster, but it only took 5 minutes. Called the cohort mentioned above and his first question was "Are you sure it worked?" I had the same question and had already run a few tests, but ran more with him on the phone and, so far, it looks like it worked.

So, I ran some tests at various radii, and recorded the results in the following table.

{| class="wikitable" border="1" |- ! Radius !! Calc Time !! Resulting Rows !! Data Size !! Index Size |- | align="right" | 50 mi | align="right" | 4:33 | align="right" | 1,367,581 | align="right" | 20,217,795 | align="right" | 30,493,696 |- | align="right" | 500 mi | align="right" | 29:56 | align="right" | 52,174,805 | align="right" | 781,507,050 | align="right" | 1,178,516,480 |- | align="right" | Unbounded | align="right" | 19:23:21 | align="right" | 224,222,076 | align="right" | 3,363,331,140 | align="right" | 6,605,434,880 |}

Note: The index size above is only the keys on zipcode_1 and zipcode_2. To get any reasonable response time, you should create an index on distance:

alter table zip_code_distances add index (distance);

In tests run, on the unbounded data set, the SQL statement select count(*) from zip_code_distances where distance <= 500; returned in 90.52 seconds without the additional index. With the index, the query took only xxxx seconds, or 10.1% of the time, which is critical in real time applications. However, the index file size increased to xxxxxxxxxx bytes (x gigabyte size increase).

Loading CSV Zip Code Data

Deluxe Business CSV from http://www.zip-codes.com/

This is way overkill if all you want to do is calculate distances. They have a much cheaper package that includes the zip, latitude and longitude. I got this as my clients like to look at the stats on each area. So, the following script is will need to be cut down quite a bit.

  • First, open the file containing the Zip Code data; zip-codes-database-DELUXE-BUSINESS.csv on my download. Delete the header line and save it somewhere with a different (easier) name, like zips.csv.
  • Open the file containing the multi county data, if you want it loaded also. Again, on my download it was zip-codes-database-MULTI-COUNTY.csv. Again, strip the top line (which contains the field names) and save it at the same location as zips.csv, only I named mine counties.csv.
  • Create a database. I called mine zipcode, since I am very imaginative and want people to have trouble figuring out what databases are for.
  • Create a file named create_zip_code.sql and paste the following script into it. Edit the bottom two lines of the script to reflect the fully qualified path and file name to the two files above; the one I called zips.csv and counties.csv.
  drop table if exists zip_codes;
create table zip_codes (
zipcode                     char(5) not null,
primaryrecord               char(1),
population                  integer,
householdsperzipcode        integer,
whitepopulation             integer,
blackpopulation             integer,
hispanicpopulation          integer,
asianpopulation             integer,
hawaiianpopulation          integer,
indianpopulation            integer,
otherpopulation             integer,
malepopulation             integer,
femalepopulation            integer,
personsperhousehold         decimal(4, 2),
averagehousevalue           integer,
incomeperhousehold          integer,
medianage                   decimal(3, 1),
medianagemale               decimal(3, 1),
medianagefemale             decimal(3, 1),
latitude                    decimal(12, 8),
longitude                   decimal(12, 8),
elevation                   integer,
state                       char(2),
statefullname               varchar(35),
citytype                    char(1),
cityaliasabbreviation       varchar(13),
areacode                    varchar(55),
city                        varchar(35),
cityaliasname               varchar(35),
countyname                  varchar(45),
countyfips                  char(5),
statefips                   char(2),
timezone                    char(2),
daylightsaving              char(1),
msa                         varchar(35),
pmsa                        char(4),
csa                         char(3),
cbsa                        char(5),
cbsa_div                    char(5),
cbsa_type                   char(5),
cbsa_name                   varchar(150),
msa_name                    varchar(150),
pmsa_name                   varchar(150),
region                      varchar(10),
division                    varchar(20),
mailingname                 char(1),
establishments2003          integer,
employment2003              integer,
firstquarterpayroll2003     integer,
annualpayroll2003           integer,
emplymentflag2003           char(1),
growthrank                  integer,
growthhousingunits2003      integer,
growthhousingunits2004      integer,
growthincreasenumber        integer,
growthincreasepercentage    decimal(3, 1),
cbsapop2003                 integer,
cbsadivpop2003              integer,
congressionaldistrict       varchar(150),
congressionallandarea       varchar(150),
deliveryresidential         integer,
deliverybusiness            integer,
deliverytotal               integer,
preferredlastlinekey        varchar(10),
classificationcode          char(1),
multicounty                 char(1),
csaname                     varchar(255),
cbsa_div_name               varchar(255),
citystatekey                char(6),
populationestimate          integer,
landarea                    decimal(12,6),
waterarea                   decimal(12,6)
);
drop table if exists zip_code_multi_county;
create table zip_code_multi_county (
zipcode                     char(5),
statefips                   char(2),
state                       char(2),
countyfips                  char(3),
county                      varchar(255)
);
load data infile 'FullPathTo/zips.csv' into table zip_codes FIELDS TERMINATED BY ',' ENCLOSED BY '"'
load data infile 'FullPathTo/county.csv' into table zip_code_multi_county FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  • From the directory containing create_zip_code.sql, execute the following command
mysql zipcode < create_zip_code.sql

You now have the entire zip code table, with all the demographic and multi-county information loaded. Note: This is not even close to being normalized. At the very least, the state and county names could be removed and accessed via the FIPS. See the documentation for this product for greater information on what the various columns mean.

Using the old USPS Data File

I won't go into much detail here. This file was available on the 'net around the turn of the millennium. It is also available from http://www.census.gov/geo/www/tiger/zip1999.html last time I checked. It contains an old dBase file, and a Word document that describes it. I no longer have the scrips I used to convert it and load it into mySQL, but if I can find an old mySQL dump, I'll post it here also. I make no guarantees of anything on this as I no longer maintain it, and I did not clean anything up in the dump; simply did a mysqldump. However, it is free.

Last update:
2012-01-01 17:06
Author:
Rod
Revision:
1.2
Average rating:0 (0 Votes)

You cannot comment on this entry

Chuck Norris has counted to infinity. Twice.