What is Folium in Python?

Datasets with geographical data such as latitudes, longitudes, and FIPS codes lend themselves really well to visualization through mapping packages like Folium. While state codes and FIPS county codes are widely used in mapping packages, I wanted to map out ZIP code level data while working with GeoJSON.

We look at the LA county restaurant and market inspection dataset for this purpose. There are two separate csv files available: one for inspection records and one for violation records.

What’s this process like from a high level?

  1. Clean the data
  2. Transform the violation records and merge with inspection records
  3. Find appropriate GeoJSON
  4. Visualize some data

1 Clean the data

Let’s start by loading the csv files into data frames and seeing what variables we have.

Data frame for inspections

Data frame for violations

Datetime object

The ‘activity_date’ column is stored as strings, so we run an apply with a conversion function to turn the values into datetime objects in both tables.
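As a rough illustration of the conversion, here is a minimal sketch on a toy table. It uses pandas’ vectorized pd.to_datetime rather than an apply, which is a swapped-in alternative and not necessarily what the original notebook did. The sample rows and the date format are assumptions.

```python
import pandas as pd

# Toy stand-in for the inspections table; rows and date format are assumed.
inspections = pd.DataFrame({
    "facility_id": ["FA0001", "FA0002"],
    "activity_date": ["2017-12-29", "2018-01-03"],
})

# Parse the whole string column at once instead of applying a function row by row.
inspections["activity_date"] = pd.to_datetime(
    inspections["activity_date"], format="%Y-%m-%d"
)

print(inspections.dtypes)
```

The same line can be repeated for the violations table if it carries the same column.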

ZIP codes

A look at the unique ZIP codes found many with an additional 4 digits appended to the usual 5 digits. These extra digits are mainly used for USPS mail sorting. For the purposes of this analysis, we only keep the first 5 digits.
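Trimming to the first five digits can be sketched with a string slice. The column name facility_zip and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy ZIP column mixing 5-digit and ZIP+4 values (column name is assumed).
inspections = pd.DataFrame(
    {"facility_zip": ["90004", "90005-1234", "91101-2045"]}
)

# Keep only the first five characters of each ZIP code.
inspections["facility_zip"] = inspections["facility_zip"].astype(str).str[:5]

print(inspections["facility_zip"].tolist())
```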

Outliers in Violations

A look at the violation codes shows that most begin with an ‘F’. A few start with ‘W’, and these appear only once or twice each. When matched with the violation description, they were the only descriptions that did not have a violation number in front of them. Furthermore, some didn’t even result in point deductions. As they make up only 17 of the 272,801 violations, we can safely drop them.
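Dropping the rare ‘W’ codes amounts to a boolean filter. In this sketch the column names violation_code and points, and the specific codes in the toy rows, are made up for illustration.

```python
import pandas as pd

# Toy violations table; column names and code values are illustrative only.
violations = pd.DataFrame({
    "violation_code": ["F044", "F035", "W011", "F033"],
    "points": [1, 1, 0, 1],
})

# Keep only rows whose code starts with 'F', discarding the rare 'W' codes.
is_f_code = violations["violation_code"].str.startswith("F")
violations = violations[is_f_code].reset_index(drop=True)

print(violations)
```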

Creating new features with regex


When looking at the column ‘pe_description’, records looked something like this: ‘RESTAURANT (0–30) SEATS MODERATE RISK’.

It seems to describe 3 different things: what type of establishment it is, how many people it can host, and the risk level.

To better represent the data, we write three helper functions with regex and string split to create new feature variables.

A quick description of the two regex statements used here:

For extracting the type of establishment, we want everything before the first opening parenthesis. The regex was thus in the form .+(?= \()

Let’s break this down:
.+ → This matches any character, one or more times. The ‘+’ means it has to match at least once.
(?= \() → This is a lookahead which requires the match to be followed by ‘ (’. The space and the open parenthesis are not returned.

For extracting the size of the establishment, I used the regex (?<=\().+(?=\))

Let’s break this down too:
(?<=\() → This is a lookbehind which requires the match to be preceded by an open parenthesis, which is not returned.
.+ → Like above, matches any character, one or more times.
(?=\)) → Like above, a lookahead which requires the match to be followed by a close parenthesis, which is not returned.
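The two patterns can be tried out with Python’s re module. The helper names below are hypothetical, and the third helper (taking the last two words as the risk level) is an assumption about how the string split was done.

```python
import re

SAMPLE = "RESTAURANT (0-30) SEATS MODERATE RISK"


def extract_type(description):
    """Return the text before ' (' or None if there is no parenthesis."""
    match = re.search(r".+(?= \()", description)
    return match.group(0) if match else None


def extract_size(description):
    """Return the text between the parentheses, or None if absent."""
    match = re.search(r"(?<=\().+(?=\))", description)
    return match.group(0) if match else None


def extract_risk(description):
    """Return the last two words, assumed to be the risk level."""
    return " ".join(description.split(" ")[-2:])


print(extract_type(SAMPLE))  # RESTAURANT
print(extract_size(SAMPLE))  # 0-30
print(extract_risk(SAMPLE))  # MODERATE RISK
```

Because .+ is greedy, a description with more than one parenthesised group would match up to the last one. For records with a single group, like the sample, that makes no difference.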

The end result

2 Transform the violation records and merge with inspection records

Individual violations do not seem to tell us much about a particular location. Let’s create a new dataframe from the violation dataframe that represents all the different violations and the total number of each violation each facility has.

We begin this by grouping the violation dataframe by the facility id and the violation code. We then aggregate by the count to find the total times each facility has violated a particular rule.

We take this new data frame and unstack it. We then transpose the resultant data frame so that the violation codes are now the columns. The index is reset so facility_id is a separate column too.

We then merge this new matrix with the inspection data frame so we now have a record of total instances across all violations for each facility id.
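The reshaping can be sketched on toy data as follows. Column names other than facility_id are assumptions. This version unstacks the violation code level directly, so the codes become columns without a separate transpose step; it is one way to reach the same shape, not necessarily the original code.

```python
import pandas as pd

# Toy tables; 'violation_code' and 'score' are assumed column names.
violations = pd.DataFrame({
    "facility_id": ["FA1", "FA1", "FA1", "FA2"],
    "violation_code": ["F044", "F044", "F035", "F035"],
})
inspections = pd.DataFrame({
    "facility_id": ["FA1", "FA2"],
    "score": [92, 97],
})

# Count how many times each facility broke each rule.
counts = violations.groupby(["facility_id", "violation_code"]).size()

# Pivot the violation codes into columns; missing combinations become 0.
violation_matrix = counts.unstack(fill_value=0).reset_index()
violation_matrix.columns.name = None

# Attach the per-code totals to the inspection records.
merged = inspections.merge(violation_matrix, on="facility_id", how="left")

print(merged)
```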

3 Find appropriate GeoJSON

To map out the data by ZIP code in Folium, we’ll need a GeoJSON to represent the boundaries of each ZIP code. Luckily, there was one at LA Times.

A look at the ZIP codes represented by this GeoJSON shows over 800 ZIP codes, most of which will not be useful and will only clutter the resulting map. We thus filter out the irrelevant ZIP codes to make the map clearer.
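One way to do this filtering is to keep only the features whose ZIP code appears in the inspection data. The function name and the property key "zipcode" are hypothetical; the actual key depends on the GeoJSON file being used.

```python
def filter_geojson(geojson, keep_zips, zip_property="zipcode"):
    """Return a copy of the FeatureCollection limited to the given ZIP codes.

    zip_property is an assumed key name; check the file's feature properties.
    """
    keep = {str(z) for z in keep_zips}
    features = [
        feature
        for feature in geojson["features"]
        if str(feature["properties"].get(zip_property)) in keep
    ]
    return {**geojson, "features": features}


# Toy FeatureCollection with geometry omitted for brevity.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"zipcode": "90004"}, "geometry": None},
        {"type": "Feature", "properties": {"zipcode": "93535"}, "geometry": None},
    ],
}

filtered = filter_geojson(toy_geojson, ["90004", "90005"])
print(len(filtered["features"]))  # 1
```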

With this updated JSON, we can now look at some geographical distributions.

4 Visualize some data

We create two helper functions here to help us create data frames. The first count_distribution function returns the total counts of a subgroup in each location (ZIP code).
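A minimal sketch of what such a counting helper might look like is shown here. The signature and the column names facility_zip and facility_type are assumptions, not the original implementation.

```python
import pandas as pd


def count_distribution(df, location, subgroup):
    """Count rows per subgroup within each location (hypothetical signature)."""
    counts = df.groupby([location, subgroup]).size()
    # One row per location, one column per subgroup, 0 where a subgroup is absent.
    return counts.unstack(fill_value=0).reset_index()


# Toy data; column names are assumed.
facilities = pd.DataFrame({
    "facility_zip": ["90004", "90004", "90004", "90005"],
    "facility_type": ["RESTAURANT", "RESTAURANT", "FOOD MKT RETAIL", "RESTAURANT"],
})

dist = count_distribution(facilities, "facility_zip", "facility_type")
print(dist)
```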

Lots of restaurants in 90004 and 90005

The subgroup_distribution function returns the percentage representation of each subgroup in each location (ZIP code).
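A percentage version could be built on pd.crosstab with row normalization. As before, the signature and the column names are assumptions made for this sketch.

```python
import pandas as pd


def subgroup_distribution(df, location, subgroup):
    """Percentage share of each subgroup within each location (hypothetical signature)."""
    # normalize="index" makes each row (location) sum to 1 before scaling to percent.
    shares = pd.crosstab(df[location], df[subgroup], normalize="index") * 100
    return shares.reset_index()


# Toy data; column names are assumed.
facilities = pd.DataFrame({
    "facility_zip": ["90004", "90004", "90004", "90004", "90005", "90005"],
    "risk": ["HIGH RISK", "HIGH RISK", "HIGH RISK", "LOW RISK", "HIGH RISK", "LOW RISK"],
})

shares = subgroup_distribution(facilities, "facility_zip", "risk")
print(shares)
```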

Almost 70% of facilities in 90004 are high risk

We then create the helper function to create folium maps.

With the helper functions complete, let’s start visualizing the data.

Average score of facilities in each ZIP code

Central LA doesn’t do as well on these inspections.

Total facilities in each ZIP code

Total facilities for 2000+ occupancy and representation of 2000+ occupancy in each ZIP code

It seems that larger facilities such as stadiums and function halls are situated quite far out from central LA. As seen from the concentration map, such large facilities are the only facilities in that general area.

Average violations with plumbing violations and food contact surface cleanliness violations

Let’s take a look at two of the more common violations: plumbing and food contact surface cleanliness. The average number of violations for a particular violation code by a facility in each zip code was calculated.
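The per-ZIP average for a single violation code comes down to a groupby and a mean over the merged table. The column names below are assumptions, and F044 is only a placeholder code, not a claim about which code covers plumbing.

```python
import pandas as pd

# Toy merged table: one row per facility with its count for one violation code.
merged = pd.DataFrame({
    "facility_id": ["FA1", "FA2", "FA3"],
    "facility_zip": ["90004", "90004", "90005"],
    "F044": [2, 0, 3],
})

# Average number of this violation per facility in each ZIP code.
avg_by_zip = (
    merged.groupby("facility_zip")["F044"]
    .mean()
    .reset_index(name="avg_F044")
)

print(avg_by_zip)
```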

It seems from the map on the right that the area around Marina Del Rey has the highest incidence per facility of plumbing violations.

Meanwhile, the ZIP code around West Los Angeles College seems to have a higher incidence of both types of violations.

Improvements and next steps

  1. I definitely would have liked to create a “cohort analysis” of these inspections to see which areas improved or deteriorated over time.
  2. It would have been interesting to have more information about each individual facility, as the nature of a facility could be strongly correlated with particular violations. For example, a small family-run street food stall could have different violations than the Staples Center. The Foursquare API could have been used for this.

As always, the code can be found on my Github. I also looked at some correlational data that was outside the scope of this particular EDA. And as always, feel free to connect via LinkedIn!