What is Folium in Python?
Datasets with geographical data such as latitudes, longitudes, and FIPS codes lend themselves well to visualization through mapping packages like Folium. While state codes and FIPS county codes are widely used in mapping packages, I wanted to map ZIP code level data using GeoJSON.

We look at the LA County restaurant and market inspection dataset for this purpose. There are two separate CSV files available: one for inspection records and one for violation records.

What’s this process like from a high level?
1 Clean the data

Let’s start by loading the CSV files into data frames and seeing what variables we have.

(Figure: data frame for inspections)
(Figure: data frame for violations)

Datetime object

The ‘activity_date’ column is a string, so we apply a function that converts it to datetime objects in both tables.

ZIP codes

A look at the unique ZIP codes found many with an additional 4 digits appended to the usual 5 digits. These extra digits are mainly used for USPS mail sorting. For the purposes of this analysis, we only keep the first 5 digits.

Outliers in Violations

A look at the violation codes shows that most begin with an ‘F’. A few begin with a ‘W’ and appear only once or twice. When matched with the violation description, these were the only descriptions without a violation number in front of them. Furthermore, some didn’t even result in point deductions. As they make up only 17 of the 272,801 violations, we can safely drop them.

Creating new features with regex

Records in the ‘pe_description’ column look something like this: ‘RESTAURANT (0–30) SEATS MODERATE RISK’. Each one describes 3 different things: what type of establishment it is, how many people it can host, and the risk level. To better represent the data, we write three helper functions with regex and string split to create new feature variables.

A quick description of the two regex statements used here:

For extracting the type of establishment, we want everything before the first opening parenthesis. The regex thus takes the form .+(?= \()

Let’s break this down: .+ matches one or more characters, and the lookahead (?= \() requires the match to be followed by a space and an opening parenthesis, without including them in the result.

For extracting the size of the establishment, I used the regex (?<=\().+(?=\))

Let’s break this down too: the lookbehind (?<=\() requires an opening parenthesis just before the match, .+ matches one or more characters, and the lookahead (?=\)) requires a closing parenthesis just after it, so only the text between the parentheses is returned.

2 Transform the violation records and merge with inspection records

Individual violations do not seem to tell us much about a particular location.
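Looking back at the cleaning step for a moment, here is a minimal sketch of the ZIP code truncation and the three ‘pe_description’ helpers. The function names (clean_zip, extract_type, extract_size, extract_risk) are hypothetical, the two patterns are the ones described above with the parentheses escaped, and treating the last two words as the risk level is my assumption about the column's format rather than something checked against every record.

```python
import re

# Patterns from the text, with the literal parentheses escaped.
TYPE_RE = re.compile(r".+(?= \()")        # everything before " ("
SIZE_RE = re.compile(r"(?<=\().+(?=\))")  # everything between "(" and ")"


def clean_zip(zip_code):
    """Keep only the first 5 characters of a ZIP or ZIP+4 string."""
    return str(zip_code).strip()[:5]


def extract_type(pe_description):
    """Return the establishment type, or None if there is no ' ('."""
    match = TYPE_RE.search(pe_description)
    return match.group(0) if match else None


def extract_size(pe_description):
    """Return the seating range inside the parentheses, or None."""
    match = SIZE_RE.search(pe_description)
    return match.group(0) if match else None


def extract_risk(pe_description):
    """Return the last two words, e.g. 'MODERATE RISK'.

    Assumption: the risk level is always the final two words.
    """
    return " ".join(pe_description.split()[-2:])


example = "RESTAURANT (0-30) SEATS MODERATE RISK"
print(extract_type(example), "|", extract_size(example), "|", extract_risk(example))
print(clean_zip("90004-1234"))
```

With data frames, each helper would typically be passed to apply on the relevant column to build a new feature column.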
Let’s create a new data frame from the violation data frame that holds, for each facility, the total number of times it committed each violation.

We begin by grouping the violation data frame by facility id and violation code, then aggregate by count to find how many times each facility has violated a particular rule. We take this new data frame and unstack it, then transpose the result so that the violation codes are now the columns. The index is reset so facility_id is a separate column too.

We then merge this new matrix with the inspection data frame, so we now have a record of the total instances of every violation for each facility id.

3 Find appropriate GeoJSON

To map the data by ZIP code in Folium, we need a GeoJSON file that represents the boundaries of each ZIP code. Luckily, the LA Times had one.

A look at the ZIP codes represented by this GeoJSON shows over 800 of them, most of which will not be useful and will only clutter the resulting map. We thus filter out the irrelevant ZIP codes to make the map clearer. With this updated JSON, we can now look at some geographical distributions.

4 Visualize some data

We create two helper functions here to help us create data frames. The first, count_distribution, returns the total counts of a subgroup in each location (ZIP code).

(Figure: lots of restaurants in 90004 and 90005)

The second, subgroup_distribution, returns the percentage representation of each subgroup in each location (ZIP code).

(Figure: almost 70% of facilities in 90004 are high risk)

We then create the helper function to create Folium maps. With the helper functions complete, let’s start visualizing the data.

Average score of facilities in each ZIP code

Central LA doesn’t do as well on these inspections.
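To make the two distribution helpers from this section more concrete, here is a rough plain-Python stand-in that works on a list of dicts instead of a data frame. The signatures, the "zip" and "risk" keys, and the toy records are all made up for illustration. The versions in my notebook operate on data frames, so treat this only as a sketch of the logic.

```python
from collections import Counter, defaultdict


def count_distribution(records, location_key, subgroup_key, subgroup):
    """Count how many records of one subgroup fall in each location."""
    counts = Counter()
    for rec in records:
        if rec[subgroup_key] == subgroup:
            counts[rec[location_key]] += 1
    return dict(counts)


def subgroup_distribution(records, location_key, subgroup_key):
    """Percentage share of every subgroup within each location."""
    per_location = defaultdict(Counter)
    for rec in records:
        per_location[rec[location_key]][rec[subgroup_key]] += 1

    shares = {}
    for location, counter in per_location.items():
        total = sum(counter.values())
        shares[location] = {
            group: 100.0 * n / total for group, n in counter.items()
        }
    return shares


# Toy records (hypothetical, not taken from the real dataset).
toy_records = [
    {"zip": "90004", "risk": "HIGH RISK"},
    {"zip": "90004", "risk": "HIGH RISK"},
    {"zip": "90004", "risk": "LOW RISK"},
    {"zip": "90005", "risk": "HIGH RISK"},
]

print(count_distribution(toy_records, "zip", "risk", "HIGH RISK"))
print(subgroup_distribution(toy_records, "zip", "risk"))
```

Either result maps a ZIP code to a number (or to a set of shares), which is the shape a choropleth needs once it is matched against the ZIP codes in the GeoJSON.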
Total facilities in each ZIP code

Total facilities for 2000+ occupancy and representation of 2000+ occupancy in each ZIP code

It seems that larger facilities such as stadiums and function halls are situated quite far out from central LA. As seen from the concentration map, such large facilities are the only facilities in that general area.

Average violations with plumbing violations and food contact surface cleanliness violations

Let’s take a look at two of the more common violations: plumbing and food contact surface cleanliness. For each ZIP code, we calculated the average number of violations per facility for a particular violation code. It seems from the map on the right that the area around Marina Del Rey has the highest incidence per facility of plumbing violations. Meanwhile, the ZIP code around West Los Angeles College seems to have a higher incidence of both types of violations.

Improvements and next steps
As always, the code can be found on my GitHub. I also looked at some correlational data that was outside the scope of this particular EDA. And as always, feel free to connect via LinkedIn!