Calculating Zip Code Distances Table
I use MySQL to store my zipcode information in. The raw data is taken from a commercial package sold at http://www.zip-codes.com/, or you can usually find a very old and outdated list of US Zip Codes on the web from the 1990's. At the bottom of this page, there is a quick "How To" for loading zip-codes.com Business package, but the main documentation assumes you have a table in MySQL already that contains the fields zipcode, latitude and longitude. My database structure contains the following three field definitions, which are all that are needed for this to work:
zipcode char(5), latitude decimal(12, 8), longitude decimal(12, 8)
Note that latitude and longitude are the decimal degrees. If your information is in degrees, minutes and seconds, you will have to convert it with the formula
latitude = degrees + (minutes + (seconds/60)) / 60
Limitations: These calculations are not high precision. They calculate the distance between two zip codes from the latitude and longitude of the geographical center of the zip code, not the border. Thus, it is theoretically possible that a single city block could separate addresses on the border of two zip codes, but these calculations would show them to be many miles apart.
Calculating the Distances
drop table if exists zip_codes_temp; create table zip_codes_temp ( zip_code char(5) not null unique, latitude double precision, longitude double precision, sinlat double precision, coslat double precision, primary key (zip_code) ) comment="List of all zip codes in US. Lat & Long are in radians"; ALTER TABLE zip_codes_temp DISABLE KEYS; insert into zip_codes_temp select zipcode, latitude/57.29577951, longitude/57.29577951, null, null from zip_codes where primaryrecord = 'P' update zip_codes_temp set sinlat = sin( latitude ); update zip_codes_temp set coslat = cos( latitude ); ALTER TABLE zip_codes_temp ENABLE KEYS; drop table if exists zip_code_distances; create table zip_code_distances ( zip_code_1 char(5) not null, zip_code_2 char(5) not null, distance int not null, key ( zip_code_1 ), key ( zip_code_2 ) ) comment="contains distance between various zip codes"; ALTER TABLE zip_code_distances DISABLE KEYS; insert into zip_code_distances select a.zip_code, b.zip_code,3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) from zip_codes_temp a, zip_codes_temp b where a.zip_code < b.zip_code and 3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) <= 50; ALTER TABLE zip_code_distances ENABLE KEYS; drop table zip_codes_temp;
Discussion of Script
This script create a table named zip_code_distances that stores the distances between any two zip codes in one row. The fields zip_code_1 and zip_code_2 are arbitrary; the target and source may be in either, but no pair will be duplicated. Thus, for the distance between US Zip Codes 75214 and 75201, either 75214 will show up in zip_code_1 or zip_code_2, and 75201 will be in the other one. I made the decision to do this to save space at the minor expense of complexity and processing time when running queries. Instead of being able to give the simple query:
select distance from zip_code_distances where zip_code_1 = '75214' and zip_code_2 = '75201' Wrong: Will not work
You must use the query:
select distance from zip_code_distances where zip_code_1 in ('75214','75201) and zip_code_2 in ('75214','75201') Correct
The reason behind this is that we would require twice the disk space to store every possible permutation of zip codes. By only storing every combination, we are halving our storage size (which is still significant). Note: I could be wrong about this, as I find the index file associated with this table is half again as large as the actual data file
Additional Note: I used the disable keys/enable keys but saw no significant speed difference when I ran a test without them. Not sure how much that helps, though the MySQL manual states it is supposed to help a lot, so I left it in.
Now, about the script. The table zip_codes_temp is created simply to hold some intermediary values. Since we are using spherical geometry, the formulae are fairly resource intensive, and we will be calculating information over and over using the same values. We will use new values of latitude and longitude, storing the values as radians (dividing by 57.29577951 in the insert statement right after the table creation), then calculating the sine of both latitude and longitude prior to doing the distance calculation (the update statements).
The next step is to create a table to hold the actual distances. For this, I have two zip code fields (zip_code_1 and zip_code_2) and the distance between them (distance). Since we are searching on both zip_code_1 and zip_code_2, In index both (it would be faster to add the indexes after the table was populated, I'm sure, but I'm lazy and the the computer isn't doing anything right now anyway).
Now, the calculation. The distance, in miles, between two points each having a latitude and longitude measured in radians is:
3963.1*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) miles
if sinlat and coslat are sine(latitude) and cos(latitude) respectively. To calculate in kilometers, simply multiply 3963.1 by 1.60934 (the number of kilometers in a mile) and change your formula to be:
6377.991*acos( a.sinlat*b.sinlat+a.coslat*b.coslat*cos(a.longitude-b.longitude) ) kilometers
The above script is designed to only store zip codes that are 50 miles or less distance from each other (the constant 50 at the end of the statement), though it still has to calculate each point in the database and the time difference in increasing this number is rather trivial, consisting only the time taken to store the result and its index.
By the way, I'm not going into the actual formulae for calculating this: it is all over the web (see http://en.wikipedia.org/wiki/Great-circle_distance for an interesting discussion).
Calculate distance between to zip codes select distance from zip_code_distances where zip_code_1 in ('75214','75201) and zip_code_2 in ('75214','75201')
Find all zipcodes within a 25 mile radius of the US zip code 75214 select zip_code_1 from zip_code_distances where zip_code_2 = '75214' and distance <= 25 union select zip_code_2 from zip_code_distances where zip_code_1 = '75214' and distance <= 25
Run Time for Script
As far as time to run the script, I originally created this script back in the late 90's using a file I found on the 'net, apparently one of the last published for free by the USPS. I vaguely remember it taking five or six hours. One of my "cohorts in crime" thought it took longer than that. We were using what I seem to remember was a 500Mhz AMD processor with probably around 256M of RAM, but it could have been faster or slower, with more or less memory.
When I ran the run that led to this article in May of 2008, it was on a Debian Etch virtual running under Xen on an AMD Athlon 64 3800+. I allocated a gig of ram to the virtual, and the other virtuals were quiescent during the run. I expected it to be faster, but it only took 5 minutes. Called the cohort mentioned above and his first question was "Are you sure it worked?" I had the same question and had already run a few tests, but ran more with him on the phone and, so far, it looks like it worked.
So, I ran some tests at various radii, and recorded the results in the following table.
|Radius||Calc Time||Resulting Rows||Data Size||Index Size|
Note: The index size above is only the keys on zipcode_1 and zipcode_2. To get any reasonable response time, you should create an index on distance:
alter table zip_code_distances add index (distance);
In tests run, on the unbounded data set, the SQL statement select count(*) from zip_code_distances where distance <= 500; returned in 90.52 seconds without the additional index. With the index, the query took only 9 seconds, or 10.1% of the time, which is critical in real time applications. However, the index file size increased to xxxxxxxxxx bytes (x gigabyte size increase).
Loading CSV Zip Code Data
Deluxe Business CSV from http://www.zip-codes.com/
This is way overkill if all you want to do is calculate distances. They have a much cheaper package that includes the zip, latitude and longitude. I got this as my clients like to look at the stats on each area. So, the following script is will need to be cut down quite a bit.
- First, open the file containing the Zip Code data; zip-codes-database-DELUXE-BUSINESS.csv on my download. Delete the header line and save it somewhere with a different (easier) name, like zips.csv.
- Open the file containing the multi county data, if you want it loaded also. Again, on my download it was zip-codes-database-MULTI-COUNTY.csv. Again, strip the top line (which contains the field names) and save it at the same location as zips.csv, only I named mine counties.csv.
- Create a database. I called mine zipcode, since I am very imaginative and want people to have trouble figuring out what databases are for.
- Create a file named create_zip_code.sql and paste the following script into it. Edit the bottom two lines of the script to reflect the fully qualified path and file name to the two files above; the one I called zips.csv and counties.csv.
drop table if exists zip_codes; create table zip_codes ( zipcode char(5) not null, primaryrecord char(1), population integer, householdsperzipcode integer, whitepopulation integer, blackpopulation integer, hispanicpopulation integer, asianpopulation integer, hawaiianpopulation integer, indianpopulation integer, otherpopulation integer, malepopulation integer, femalepopulation integer, personsperhousehold decimal(4, 2), averagehousevalue integer, incomeperhousehold integer, medianage decimal(3, 1), medianagemale decimal(3, 1), medianagefemale decimal(3, 1), latitude decimal(12, 8), longitude decimal(12, 8), elevation integer, state char(2), statefullname varchar(35), citytype char(1), cityaliasabbreviation varchar(13), areacode varchar(55), city varchar(35), cityaliasname varchar(35), countyname varchar(45), countyfips char(5), statefips char(2), timezone char(2), daylightsaving char(1), msa varchar(35), pmsa char(4), csa char(3), cbsa char(5), cbsa_div char(5), cbsa_type char(5), cbsa_name varchar(150), msa_name varchar(150), pmsa_name varchar(150), region varchar(10), division varchar(20), mailingname char(1), establishments2003 integer, employment2003 integer, firstquarterpayroll2003 integer, annualpayroll2003 integer, emplymentflag2003 char(1), growthrank integer, growthhousingunits2003 integer, growthhousingunits2004 integer, growthincreasenumber integer, growthincreasepercentage decimal(3, 1), cbsapop2003 integer, cbsadivpop2003 integer, congressionaldistrict varchar(150), congressionallandarea varchar(150), deliveryresidential integer, deliverybusiness integer, deliverytotal integer, preferredlastlinekey varchar(10), classificationcode char(1), multicounty char(1), csaname varchar(255), cbsa_div_name varchar(255), citystatekey char(6), populationestimate integer, landarea decimal(12,6), waterarea decimal(12,6) ); drop table if exists zip_code_multi_county; create table zip_code_multi_county ( zipcode char(5), statefips char(2), state char(2), countyfips char(3), county varchar(255) ); load data infile 'FullPathTo/zips.csv' into table zip_codes FIELDS TERMINATED BY ',' ENCLOSED BY '"' load data infile 'FullPathTo/county.csv' into table zip_code_multi_county FIELDS TERMINATED BY ',' ENCLOSED BY '"'
- From the directory containing create_zip_code.sql, execute the following command
mysql zipcode < create_zip_code.sql
You now have the entire zip code table, with all the demographic and multi-county information loaded. Note: This is not even close to being normalized. At the very least, the state and county names could be removed and accessed via the FIPS. See the documentation for this product for greater information on what the various columns mean.
Using the old USPS Data File
I won't go into much detail here. This file was available on the 'net around the turn of the millennium. It is also available from http://www.census.gov/geo/www/tiger/zip1999.html last time I checked. It contains an old dBase file, and a Word document that describes it. I no longer have the scripts I used to convert it and load it into mySQL, but if I can find an old mySQL dump, I'll post it here also. I make no guarantees of anything on this as I no longer maintain it, and I did not clean anything up in the dump; simply did a mysqldump. However, it is free.