Scatter plots are used to display values for typically two variables for a set of data.
It is probably one of the best ways to show visually the strength of the relationship between the variables, the direction of that relationship (rather than the comparison shown by histograms) and whether outliers exist.
Let's see an example, using some wine data.
As a matter of fact, price and quality differ widely between production years, even though the wine comes from the same area and is produced in a similar way. Traditionally, expert tasters predict the quality by tasting the wine once it is on the market.
In March 1990 Orley Ashenfelter, an economics professor, claimed he could predict wine quality without tasting the wine and in advance, before it is on the market or even produced.
The data he used are the prices of a dozen Bordeaux wine bottles in US$ (the value in the table is the logarithm of the price) for the years 1952-1978, together with each year's average temperature, rainfall, the age of the wine and the population of France.
A copy of the data is available on my repository.
Here are the first table rows:
    import pandas as pd

    wine = pd.read_csv('wine.csv')
    wine.head()  # show the first rows of the table

    Year, Price, WinterRain, AGST, HarvestRain, Age, FrancePop
    1952, 7.495, 600, 17.1167, 160, 31, 43183.569
    1953, 8.0393, 690, 16.7333, 80, 30, 43495.03
    1955, 7.6858, 502, 17.15, 130, 28, 44217.857
The matplotlib function to draw a scatter plot is – surprise – called “scatter” and requires at least the x and y variables:
    import matplotlib.pyplot as plt

    plt.scatter(wine['AGST'], wine['Price'])  # x values first, then y values
    plt.title("Price vs. Average Growing Season Temp of Bordeaux wine bottles")
    plt.xlabel("AGST (Celsius)")
    plt.ylabel("Log of Price")
    plt.grid(True)
    plt.show()
The X variable is the “AGST” column (the average growing season temperature) and the Y variable is the Price.
The chart clearly shows a relation between these two variables: when the average temperature is higher, the price also tends to be higher. Such a relation is called a direct relation, because both variables move in the same direction. If the price instead fell as the temperature rose, it would be called an inverse relation.
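The direction and strength that you read off a scatter plot can also be expressed as a number. Below is a minimal sketch that uses Pearson's correlation coefficient from NumPy; the helper name `relation_direction` and the sample values are hypothetical, made up for illustration, and are not part of the wine data.

```python
import numpy as np


def relation_direction(x, y):
    """Return Pearson's r and a label for the direction of the relation."""
    r = float(np.corrcoef(x, y)[0, 1])
    if r > 0:
        label = "direct"
    elif r < 0:
        label = "inverse"
    else:
        label = "none"
    return r, label


# Made-up illustrative values, NOT taken from the wine data
temperature = [15.0, 16.0, 17.0, 18.0]
rising = [6.5, 7.0, 7.4, 8.1]    # grows together with temperature
falling = [8.1, 7.4, 7.0, 6.5]   # shrinks as temperature grows

print(relation_direction(temperature, rising))
print(relation_direction(temperature, falling))
```

With the DataFrame loaded earlier, the equivalent call would be `relation_direction(wine['AGST'], wine['Price'])`. A value of r close to 1 or -1 suggests a strong relation, while a value close to 0 suggests a weak one.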
We can also combine three variables in the same chart by plotting two of them against the x axis and seeing how each one relates to the variable on the y axis.
We do this by calling scatter() twice with different colours:
    plt.scatter(wine['WinterRain'], wine['Price'], color='black', label='Winter')
    plt.scatter(wine['HarvestRain'], wine['Price'], color='red', label='Harvest')
    plt.title("Price vs. Rain of Bordeaux wine bottles")
    plt.xlabel("Rain (mm)")
    plt.ylabel("Log of Price")
    plt.legend(loc='upper center')  # add a legend for clarity
    plt.grid(True)
    plt.show()
The “label” parameter in the scatter() function is used for the legend.
Well, it seems that there is no strong relation between wine price and rain…
You can plot two X variables on the same chart when they share the same unit and scale (in this case both are millimeters of rain).
If that is not the case, you need to plot two different charts, but matplotlib allows you to place them in a grid by using the function subplot().
For example, you can put two charts one below the other in a grid with two rows and one column:
    # facecolor replaces the axisbg argument, which recent matplotlib versions no longer accept
    plt.subplot(2, 1, 1, facecolor="yellow")  # (nrows, ncols, plot_number)
    plt.scatter(wine['Age'], wine['Price'])
    plt.title("Price vs. Age of Bordeaux wine bottles")
    plt.xlabel("Age in years")
    plt.ylabel("Log of Price")

    plt.subplot(212, facecolor="cyan")  # equivalent to: plt.subplot(2, 1, 2)
    plt.scatter(wine['FrancePop'], wine['Price'])
    plt.title("Price of Bordeaux wine bottles vs. Population")
    plt.xlabel("France population (Thousands)")
    plt.ylabel("Log of Price")

    plt.tight_layout()  # keep titles and axis labels from overlapping
    plt.show()          # call show() once so both charts land in the same figure
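The same two-row grid can also be built with `plt.subplots()`, which returns the figure and one Axes object per cell, so each chart is configured through its own object. This is only a sketch of that alternative: the function name `plot_price_panels` is hypothetical, and the small `sample` frame holds just the three rows shown at the top of this post so that the snippet runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_panels(df):
    """Draw Price vs. Age and Price vs. FrancePop as two stacked charts."""
    fig, (ax_age, ax_pop) = plt.subplots(nrows=2, ncols=1, figsize=(6, 8))

    ax_age.scatter(df['Age'], df['Price'])
    ax_age.set_title("Price vs. Age of Bordeaux wine bottles")
    ax_age.set_xlabel("Age in years")
    ax_age.set_ylabel("Log of Price")

    ax_pop.scatter(df['FrancePop'], df['Price'])
    ax_pop.set_title("Price of Bordeaux wine bottles vs. Population")
    ax_pop.set_xlabel("France population (Thousands)")
    ax_pop.set_ylabel("Log of Price")

    fig.tight_layout()  # avoid overlap between the two charts
    return fig, (ax_age, ax_pop)


# Only the three rows shown earlier in this post, not the full dataset
sample = pd.DataFrame({
    'Price': [7.495, 8.0393, 7.6858],
    'Age': [31, 30, 28],
    'FrancePop': [43183.569, 43495.03, 44217.857],
})

fig, axes = plot_price_panels(sample)
```

Call `plt.show()` afterwards to display the figure, or pass the full `wine` DataFrame instead of `sample` to reproduce the charts above.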
Unsurprisingly, there is again a direct relation between Price and Age, but not really one between Price and France population…