Introduction

This tutorial is Yu Wang's final project of CMSC320, UMD, the purpose is to show users some basic processes of data science with a detailed application in analyzing the earthquakes in US from 1970 to 2014. In this tutorial, I am going to show you how to format the data into a dataframe, manipulate the data with dataframe operations, visualize the data into plots and maps, implement regressions on the data, and some methods of Machine Learning. Visualizing data may give us a direct feeling of the data, but when we analyze a group of data, only the "view" of data is not sufficient, we need more accurate estimators to evaluate.

Tutorial content

In this tutorial, we need folium, pandas, numpy, scipy, matplotlib and sklearn packages to achieve the purposes I mentioned in the induction. And I will show how to use them to reach the goals.

We will use the data collected by BuzzFeed News on github, https://raw.githubusercontent.com/BuzzFeedNews/2015-03-earthquake-maps/master/data/earthquake_states.csv. It is interesting that such a media company collects quite a large amount of earthquake records.

I will cover the following topics in this tutorial:

Installing the libraries

Before we start, we need to make sure that all needed libraries have been installed properly on the computer. As usual, we can install the libraries with "pip" or "conda".

$ pip install (the name of a library)

or

$ conda install (the name of a library)

For more information about these libraries, go to Summary and references to visit the official websites of the libraries.

In [70]:
import pandas as pd
import numpy as np
from scipy import stats
import folium
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation
from sklearn import tree

Acquiring and formatting data

After we installed and imported these libraries, we are ready to start our analysis. First of all, we need to load the data. Pandas has its function called read_csv to generate a dataframe from reading a csv file (local or online), with this function, we can get a well-formatted dataframe and use the dataframe instead.

To drop the rows with NaN values, use dropna().

To drop other rows, columns, use drop() with specific requirement.

To drop duplicated rows, use drop_duplicates().

To sort the rows, use sort_values() with the sorting column(s), order, and other options.

To check the first 5 rows, use head(). (5 is a default value, you can set your own number instead of 5.)

In [71]:
Earthquake = pd.read_csv("https://raw.githubusercontent.com/BuzzFeedNews/2015-03-earthquake-maps/master/data/earthquake_states.csv")
Earthquake = Earthquake.dropna()
#drop rows with NaN values
Earthquake=Earthquake.drop(['id','updated','type','geom'],axis=1)
#drop unnecessary columns
Earthquake.sort_values(['mag'],axis = 0, ascending=True,inplace=True)
#sort the dataframe in the ascending order of magnitude
Earthquake.drop_duplicates(inplace=True)
#drop duplicates
print(len(Earthquake.index))
#output the length of the dataframe
Earthquake.head()
#have a glance
34707
Out[71]:
time latitude longitude mag place state
105552 1978-06-14 19:44:33.4-04 33.500 -116.782 3.0 Southern California California
61125 1998-01-26 06:00:53.63-05 37.037 -121.489 3.0 Northern California California
61127 1998-01-26 04:32:31.87-05 62.220 -150.274 3.0 Central Alaska Alaska
96756 1984-10-03 19:17:23.08-04 62.742 -143.714 3.0 Central Alaska Alaska
23664 2008-11-08 11:31:54-05 33.498 -116.513 3.0 Southern California California

We can see that after we removed some unnecessary rows and columns, there are 34707 valid records for us to use. Earthquake is the name of the dataframe that we have generated for storing the data from the csv file.

Grouping and plotting data

In the previous output, we can see there are six columns in this dataframe Earthquake, grouping is to gather the data with certain features together or measure some features of certain groups. When we need to split the data according to certain features, we can select the feature we are interested and store the rows with the feature in a new dataframe or the current one. I will explain more precisely when we have an example. To do some measurements of certain groups, we can use groupby() with some optional functions like size() and mean(). When the result is in one column, the type of it is Series, and when there are more than one column, the type is DataFrame. So remember to use to_frame() to convert it back to a dataframe (put the column name within the bracelets).

In [72]:
#count the number of earthquakes in each state, sort it in a descending order
Statecount = Earthquake.groupby(['state']).size().to_frame('count').reset_index()
Statecount.sort_values(['count'],axis=0,ascending=False,inplace=True)
#calculate the average magnitude in each state, name the columns, sort it in a descending order
stateaver = Earthquake[['state','mag']].groupby(['state']).mean().reset_index()
stateaver.columns=['state','avermag']
stateaver.sort_values(['avermag'],ascending=False,inplace=True)
#combine them together
state=pd.merge(Statecount, stateaver, on='state', how='inner')
state.sort_values(['count'],ascending=False, inplace=True)

Now we have calculated the average earthquake magnitudes and the numbers of earthquakes in the states with earthquakes. We are ready to plot them out. The easiest way to plot a graph with matplotlib is using plot(). You need to pass in three parameters, X series, Y series and the marker. X and Y series need to be numeric, but we can use x/yticks when one variable is not numeric, just like state. The following code can show you how to use x/yticks(). Remember to use show() when you want to output a plot.

In [73]:
plt.title('State earthquake count')
plt.barh(range(0,len(state.index)), state['count'], align='center')
#Generate an array with the same length of state. Use it as Y series temporarily.
plt.yticks(range(0,len(state.index)),state['state'], size=5)
#Use yticks to build a relationship between the array and state, set the size of the characters.
plt.ylabel('State')
plt.xlabel('Earthquake count')
plt.xscale('log')
#change the scale to log
plt.show()
state.sort_values(['avermag'],ascending=False, inplace=True)
plt.title('State average magnitude')
plt.barh(range(0,len(state.index)), state['avermag'], align='center')
plt.yticks(range(0,len(state.index)),state['state'], size=5)
plt.ylabel('State')
plt.xlabel('Earthquake average magnitude')
plt.xlim([3, 4])
plt.show()
#plot the data of state average magnitude and state earthquake count with '+'
plt.plot(state['avermag'],state['count'],'+')
plt.yscale('log')
plt.title('State average magnitude VS. count')
plt.xlabel('State average magnitude')
plt.ylabel('State earthquake count')
plt.show()

With these plots, we can say that there are no obvious relationships between a state's average magnitude and its earthquake count. California has the most earthquakes in US, and Hawaii has the highest state average magnitude in US. It is not very practical to compare these two factors and figure out which one is more important for people to watch out, so I invent a new variable called total magnitude, which is calculated by summing up all the magnitudes of earthquakes together, and it can also be calculated by State average magnitude times state earthquake count.

In [74]:
#build a new column for the new variable, sort the dataframe with the new variable
state['totalmag']=state['avermag']*state['count']
state.sort_values(['totalmag'],ascending=False, inplace=True)
plt.title('State total magnitude')
plt.barh( range(0,len(state)), state['totalmag'], align='center')
plt.yticks(range(0,len(state)),state['state'], size=5)
plt.ylabel('State')
plt.xlabel('Earthquake total magnitude')
plt.xscale('log')
plt.show()

We can see that the result of this new variable is pretty close to the result of state earthquake count. And this is reasonable because the average magnitudes of all these states are all in the range of 3 to 4, so the influence of state average magnitude on this total magnitude is really limited.

Expressing data on a map

This part is pretty simple. The most important thing is to find the coordinates, then you can show the information you want at corresponding locations on the map. You can design the shape, the size, and the color of your markers on the map. The following code is to mark those high-magnitude earthquakes on the map.

In [75]:
#generate a map of USA
map = folium.Map(location=[34, -130], zoom_start=3)
#select big earthquakes
BigEarthquake = Earthquake[Earthquake['mag']>=5]
#mark them on the map
for i, row in BigEarthquake.iterrows():
    if row['mag'] < 6:
        folium.CircleMarker([row['latitude'], row['longitude']],radius=row['mag']/2,popup='Magnitude: '+str(row['mag'])+' Place: '+row['place']+' Time: '+str(row['time']),color='yellow',fill_color='yellow').add_to(map)
    elif row['mag'] < 7:
        folium.CircleMarker([row['latitude'], row['longitude']],radius=row['mag']/2,popup='Magnitude: '+str(row['mag'])+' Place: '+row['place']+' Time: '+str(row['time']),color='red',fill_color='red').add_to(map)
    elif row['mag'] < 8:
        folium.CircleMarker([row['latitude'], row['longitude']],radius=row['mag']/2,popup='Magnitude: '+str(row['mag'])+' Place: '+row['place']+' Time: '+str(row['time']),color='black',fill_color='black').add_to(map)
map
Out[75]:

Implementing and evaluating regressions

There are many different ways to implement regression in Python, especially when you are trying to do simple linear regressions. Among different functions in different packages, I would like to recommend stats.linregress, it is a function that generates the regression model and testify the model at the same time. After we provide the data series, it returns slope, intercept, r_value, p_value and std_err, so we don't need to calculate those statistical values. With r_value and p_value, we can clearly know whether the regression model is reasonable.

In [93]:
#use apply to call a function on each of the elements
Earthquake['cate']=Earthquake['mag'].apply(int)
catecount=Earthquake[['cate']].groupby(['cate']).size().to_frame('count').reset_index()
print(catecount)
plt.title('The count of earthquakes in different categories')
plt.plot(catecount['cate'],catecount['count'],'.')
plt.xlabel('Earthquake categories')
plt.ylabel('Count')
plt.yscale('log')
plt.show()
   cate  count
0     3  29298
1     4   4841
2     5    517
3     6     43
4     7      8

It seems that there is a linear relation between earthquake category and the denary logarithm of its count. So let's see how well it works.

In [87]:
#use stats.linregress to do linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(catecount['cate'], catecount['count'].apply(np.log10))
plt.plot(catecount['cate'],catecount['count'],'.')
plt.plot(catecount['cate'],np.power(10,slope*catecount['cate']+intercept),'-')
plt.title('The count of earthquakes in different categories')
plt.xlabel('Earthquake categories')
plt.ylabel('Count')
plt.yscale('log')
plt.show()
print('p-value: '+str(p_value))
print('r-value: '+str(r_value))
p-value: 9.479462258e-05
r-value: -0.998159030729

Since the p-value is much smaller than 0.001, and the absolute value of r-value is very close to 1, we can figure out that there is a clear linear relation between the earthquake categories and the denary logarithm of their counts.

2-D plotting

In the following codes, I plot several plots to show the data directly. I also do some data manipulations here to reveal the real pattern behind those numbers. To get more accurate data, sometimes we need partial data from a column. In this case, we need to split a column to several columns. I would show this with an example that I split year, month, day and dtime from the column time.

In [88]:
#split one column into two
Earthquake[['date','dtime']] = Earthquake['time'].str.split(' ',1,expand=True)
Earthquake[['year','month','day']] = Earthquake['date'].str.split('-',2,expand=True)
#calculate year average and month average
yearaver = Earthquake[['year','mag']].groupby(['year']).mean().reset_index()
yearaver.sort_values(['mag'],ascending=True,inplace=True)
monthaver = Earthquake[['month','mag']].groupby(['month']).mean().reset_index()
monthaver.sort_values(['mag'],ascending=True,inplace=True)
Earthquake.sort_values(['mag'],axis = 0, ascending=False,inplace=True)
#plot them
plt.title('Annual average earthquake magnitude')
plt.plot(yearaver['year'].apply(int),yearaver['mag'],'.')
plt.xlabel('year')
plt.ylabel('average magnitude')
plt.show()
plt.title('Monthly average earthquake magnitude')
plt.plot(monthaver['month'],monthaver['mag'],'.')
plt.xlabel('month')
plt.ylabel('average magnitude')
plt.show()

We can see that there are no relationships between magnitudes and months, so we can say that earthquake is not seasonal. And there is a slight decreasing trend of the annual average magnitude in this period with some fluctuations, which may suggest the active seismic period of Circum-Pacific seismic zone has passed.

In [79]:
#just try to find the highest one year record in a state
stateyearcount = Earthquake[['year','state']].groupby(['year','state']).size().to_frame('count').reset_index()
stateyearcount.sort_values(['count'],axis = 0, ascending=False,inplace=True)
moststateyear=stateyearcount.iloc[0]
print('The highest count of Earthquake of one state in a year: ')
print(str(moststateyear['count'])+' times, in '+str(moststateyear['state'])+', in '+str(moststateyear['year']))
The highest count of Earthquake of one state in a year: 
1213 times, in California, in 1992

A record is found.

In [80]:
#count the earthquakes of different magnitudes, and plot them
magcount = Earthquake[['mag']].groupby(['mag']).size().to_frame('count').reset_index()
plt.title('Earthquake magnitude count')
plt.plot(magcount['mag'],magcount['count'],'.')
plt.plot(magcount['mag'],magcount['count'],'-')
plt.xlabel('Earthquake magnitude')
plt.ylabel('count')
plt.show()
#count the earthquakes of different categories, and plot them
catecount = Earthquake[['cate']].groupby(['cate']).size().to_frame('count').reset_index()
plt.title('Earthquake categorized magnitude count')
plt.plot(catecount['cate'],catecount['count'],'.')
plt.plot(catecount['cate'],catecount['count'],'-')
plt.xlabel('Earthquake categorized magnitude')
plt.ylabel('count')
plt.show()

Looking at the raw data, we can easily figure out the raw data is really abnormal. When the magnitude is less than 5, the occurrence of earthquakes with a magnitude of a multiple of 0.1 is significantly higher than the occurrence of earthquakes with other magnitudes. So, the data cannot show a real distribution of earthquake magnitudes in US. Then what should I do to reveal the true distribution? There are two possible ideas, one is to split some of the sharp peaks to their neighboring ranges, so that we may lower those peaks and get a more smooth line. I can split those peaks with normal distribution into its neighboring range with a length of 1, or I can split it evenly in that range. But this may bring more uncertainties into the dataset, and uncertainties are not good for data analysis. Another way is to round the magnitudes to 0.1 level, so that we can ignore those special cases, and look at the big picture of the data. Compared with the previous idea, this one is much easier, all I need to do is round up the magnitudes.

In [81]:
#round up the magnitude to smooth the data, plot it in normal and log coordinate system
magcount['roundmag']=round(magcount['mag'],1)
roundcount = magcount[['roundmag','count']].groupby(['roundmag']).sum().reset_index()
plt.title('Earthquake rounded magnitude count')
plt.plot(roundcount['roundmag'],roundcount['count'],'.')
plt.xlabel('Earthquake rounded magnitude')
plt.ylabel('count')
plt.show()
plt.title('Earthquake rounded magnitude count')
plt.plot(roundcount['roundmag'],roundcount['count'],'.')
plt.xlabel('Earthquake rounded magnitude')
plt.ylabel('count')
plt.yscale('log')
plt.show()

We can see that the Earthquake rounded magnitude count plot shows the same idea that the logarithm of count may have a linear relation with Earthquake magnitude (rounded), just as the relation between count and earthquake categories. So we may get a linear relation on this one as well.

In [91]:
#do the linear regression again
slope, intercept, r_value, p_value, std_err = stats.linregress(roundcount['roundmag'], roundcount['count'].apply(np.log10))
plt.title('The count of earthquakes VS. rounded magnitudes')
plt.plot(roundcount['roundmag'],roundcount['count'],'.')
plt.plot(roundcount['roundmag'],np.power(10,slope*roundcount['roundmag']+intercept),'-')
plt.xlabel('Earthquake rounded magnitude')
plt.ylabel('Count')
plt.yscale('log')
plt.show()
print('p-value: '+str(p_value))
print('r-value: '+str(r_value))
p-value: 4.10911352961e-36
r-value: -0.987461323212

This linear model also shows pretty good fitness from its p-value and r-value, so we can say that this is a good fit. Because the count of earthquake cannot be negative, so we should check whether this model is still valid when it reaches to some very very rare earthquake catastrophe. Because this model is on the logarithm of count, when the rounded magnitude keeps increasing, the regression line can only get closer and closer to count=0, but never reach it. So this linear model is a qualified model. And we can also see the real shape of this regression model with a uniform coordinate system.

In [92]:
#plot the real shape of the regression line in normal coordinate system
plt.title('The count of earthquakes VS. rounded magnitudes')
plt.plot(roundcount['roundmag'],roundcount['count'],'.')
plt.plot(roundcount['roundmag'],np.power(10,slope*roundcount['roundmag']+intercept),'-')
plt.xlabel('Earthquake rounded magnitude')
plt.ylabel('Count')
plt.show()

For earthquake researchers, there is another interesting topic that they desire to figure out, which is the period of earthquakes. A more professional way to call it is "the seismic cycle", which includes three parts, inter-seismic slip, co-seismic slip, and post-seismic slip. In the first part, the strain accumulates steadily between the earthquakes; then, the second part refers to the slip that occurs at the instant of the earthquake; after that, the third part refers to slip that occurs after an earthquake. With enough data, we can group the data into certain areas and show the change of the strain.

In [84]:
#build a dictionary for earthquake hotzones' data
Hotzone={}
Areacount=pd.DataFrame()
#count the number of earthquakes per place
Areacount = Earthquake[['place']].groupby(['place']).size().to_frame('count').reset_index()
Areacount.sort_values(['count'], ascending=False, inplace=True)
#get the top five's data
Toparea=Areacount.head()
for place in Toparea['place']:
    Hotzone[place]=Earthquake[Earthquake['place']==place]
    temp=Hotzone[place]
    plt.title('Earthquake magnitude VS. time plot at '+place)
    time = temp['year'].apply(int)+(temp['month'].apply(int)-1)/12
    plt.plot(time,temp['mag'],'.')
    plt.xlabel('time')
    plt.ylabel('Earthquake magnitude')
    plt.show()

With so much data here, the pattern of earthquakes is not very obvious. And the spatial features of different earthquakes cannot be presented in these 2d plots, so let's make 3d plots instead.

3-D plotting

To plot a graph in 3-D, we need to import Axes3D from matplotlib(mpl_toolkits.mplot3d). The following example is 3-D scatter plots. The code is pretty close to the one for generating 2-D plots, but we need to claim a figure first, and the plot is a subplot of this figure.

In [85]:
#build a relationship between colors and categories
color={3:'y',4:'g',5:'c',6:'b',7:'k'}
#a function to convert number to color
def getcolor (x):
    return color[x]
#plot them per category per place
for place in Toparea['place']:
    OneHotzone=Hotzone[place]
    for i in range(3,8):
        temp=OneHotzone[OneHotzone['cate']==i]
        if temp.empty==False:
            #initialize the plot
            fig = plt.figure()
            ax = fig.add_subplot(111, projection='3d')
            plt.title('Cate '+str(i)+' Earthquake distribution at '+place)
            #three variables
            xs = temp['latitude']
            ys = temp['longitude']
            zs = temp['year'].apply(int)+(temp['month'].apply(int)-1)/12
            ax.set_xlabel('latitude')
            ax.set_ylabel('longitude')
            ax.set_zlim3d(bottom=1970,top=2015)
            #data, color, marker
            ax.scatter(xs, ys, zs, c=getcolor(i), marker='o')
            plt.show()

We can see that the distribution of Category 3 earthquakes are pretty continuous and dense around the same locations, and the distribution of low-magnitude earthquakes doesn't quite show the possible locations for high-magnitude earthquakes. This may give us a chance to use KNN(K nearest neighbors) and decision trees to distinguish different categories of earthquake from its location and time. To pick up a proper set, I choose the data of California, Alaska and Hawaii. And we can compare the accuracy of this two ML methods.

Machine learning

Finally, we get to the last step: machine learning. Machine learning is not to let the machine "learn" something new, but using enough data to train the machine to find out certain patterns behind the data. Here I am going to use KNN and decision trees on the data of California, Alaska and Hawaii. To split the train data and test data, I would build two subsets of data, one is for training, and the other one is for testing. The ratio of training and testing is 4:1. I will do this for ten times, and we can compare the performance of KNN and decision trees by comparing SSE (sum of squared errors), error count and correctness percentage.

In [86]:
#get the data of California, Alaska and Hawaii
data=Earthquake[Earthquake['state'] == 'California']
data.append(Earthquake[Earthquake['state'] == 'Alaska'])
data.append(Earthquake[Earthquake['state'] == 'Hawaii'])
data.reset_index(drop=True, inplace=True)
#store classifiers
knns={}
trees={}
#10-fold cross-validation
for i in range(10):
    #a way to generate random training and testing sets
    mask=np.random.rand(len(data.index)) < 0.8
    traindata = data[mask]
    testdata = data[~mask]
    #drop unnecessary columns
    trainX = traindata.drop(['time','date','cate','dtime','state','place'],axis=1)
    trainY = traindata['cate']
    testX = testdata.drop(['time','date','cate','dtime','state','place'],axis=1)
    testY = testdata['cate']
    #initialize clf
    knnclf = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')
    treeclf = tree.DecisionTreeClassifier(max_depth=5)
    knns[i]=knnclf
    trees[i]=treeclf
    #training
    knnclf.fit(trainX, trainY)
    treeclf.fit(trainX, trainY)
    #predict
    s1 = pd.Series(knnclf.predict(testX)).to_frame('knn_result')
    s2 = pd.Series(treeclf.predict(testX)).to_frame('tree_result')
    #put them together
    predictY=pd.concat([s1, s2],axis=1,join_axes=[s1.index])
    testY.reset_index(drop=True, inplace=True)
    compare = pd.concat([predictY, testY], axis=1)
    #compute SSE, error count and Correctness percentage
    compare['knn_diff']=compare['knn_result']-compare['cate']
    compare['tree_diff']=compare['tree_result']-compare['cate']
    knn_diff = compare['knn_diff']
    knn_diff = knn_diff[knn_diff!=0]
    tree_diff = compare['tree_diff']
    tree_diff = tree_diff[tree_diff!=0]
    knn_sse=np.sum(knn_diff.apply(np.square))
    tree_sse=np.sum(tree_diff.apply(np.square))
    print('SSE of KNN: '+str(knn_sse))
    print('Error count of KNN:'+str(knn_diff.size))
    print('Correctness percentage of KNN:'+str((1-knn_diff.size/len(compare.index))*100)+'%')
    print('SSE of decision tree: '+str(tree_sse))
    print('Error count of decision tree:'+str(tree_diff.size))
    print('Correctness percentage of decision tree:'+str((1-tree_diff.size/len(compare.index))*100)+'%')
SSE of KNN: 415
Error count of KNN:323
Correctness percentage of KNN:89.24050632911393%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 412
Error count of KNN:324
Correctness percentage of KNN:89.37356510331256%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 384
Error count of KNN:295
Correctness percentage of KNN:90.03041568097329%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 432
Error count of KNN:332
Correctness percentage of KNN:88.97009966777408%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 347
Error count of KNN:271
Correctness percentage of KNN:90.97268487674883%
SSE of decision tree: 1
Error count of decision tree:1
Correctness percentage of decision tree:99.96668887408394%
SSE of KNN: 396
Error count of KNN:307
Correctness percentage of KNN:89.75300400534046%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 411
Error count of KNN:306
Correctness percentage of KNN:89.97050147492625%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 456
Error count of KNN:331
Correctness percentage of KNN:88.99966766367564%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 415
Error count of KNN:310
Correctness percentage of KNN:89.49508641138597%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%
SSE of KNN: 425
Error count of KNN:312
Correctness percentage of KNN:89.4878706199461%
SSE of decision tree: 0
Error count of decision tree:0
Correctness percentage of decision tree:100.0%

We can see that for both of SSE and error count, decision trees have a much better performance compared with KNN. With a limitation of max_depth=5, the decision tree can already have a superb accuracy that almost no error. For this 10 times of training and testing, I generate random training and testing data, this idea is called K-fold cross-validation, here, K=10, which is the most common one. In knns and trees, I save all the classifiers, so that we can use them to handle similar data in the future by combining the results from all 10 classifiers. For example, we can pick the majority of results as the final result.

Summary and references

This tutorial highlighted just a few elements of what is the common way to analyze data in Python with Jupyter notebook. More information about the libraries and descriptions of earthquakes are available from the following links:

Packages:

  1. Folium: https://folium.readthedocs.io/en/latest/
  2. Pandas: http://pandas.pydata.org/
  3. NumPy: http://www.numpy.org/
  4. Scipy: https://www.scipy.org/
  5. Matplotlib: https://matplotlib.org/
  6. Scikit-learn: http://scikit-learn.org/stable/

Earthquake:

  1. US Geological Survey: https://earthquake.usgs.gov/
  2. Foreshocks, Mainshocks, and Aftershocks, Caltech: http://scedc.caltech.edu/Module/shockmod.html
  3. Earthquakes and the Seismic Cycle, University of Wisconsin-Madison: http://geoscience.wisc.edu/~chuck/Classes/Mtn_and_Plates/eq_cycle.html
  4. Preparations for earthquakes: https://www.ready.gov/earthquakes