Data formats#
Data come in many forms and formats, depending on source and function.
“Flat” text files#
.txt, .dat, .csv, .fasta, …
Often with a fixed format
Comma separated columns
Tabulator separated columns
Fixed width columns
Headers, subheaders
Sections with keywords
import pandas as pd
planets_DF = pd.read_csv('../../data/planets.csv')
planets_DF.head()
planet | distance | diameter | |
---|---|---|---|
0 | Mercury | 0.387 AU | 4878 km |
1 | Venus | 0.723 AU | 12104 km |
2 | Earth | 1.000 AU | 12756 km |
3 | Mars | 1.524 AU | 6787 km |
4 | Jupiter | 5.203 AU | 142796 km |
# Using planets_DF as input, remove " AU" from the "distance" column and convert it to a float
planets_DF['distance'] = planets_DF['distance'].str.replace(' AU', '').astype(float)
print(planets_DF.dtypes)
planets_DF.head()
planet object
distance float64
diameter object
dtype: object
planet | distance | diameter | |
---|---|---|---|
0 | Mercury | 0.387 | 4878 km |
1 | Venus | 0.723 | 12104 km |
2 | Earth | 1.000 | 12756 km |
3 | Mars | 1.524 | 6787 km |
4 | Jupiter | 5.203 | 142796 km |
Plotting directly with Pandas#
Instead of converting data to NumPy arrays and plotting, it is possible to plot directly from Pandas with special plot commands and .plot.xxx().
# Bar plot
# pd.options.plotting.backend = "matplotlib"
import matplotlib.pyplot as plt
planets_DF.plot.bar(x='planet', y='distance', rot=0)
plt.show()
# Try: color, rot, add more with plt.xxx()
Exercise#
Plot the planets as circles instead of bars.
Use the diameter of the planet to adjust the circle size.
# Plot concentric ellipses for each planet where the size of the ellipse is proportional to
# the planet's distance. Add planet names to the plot. Let the xlim and ylim extend from -31 to 31.
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(10,10))
ax.set_xlim(-31, 31)
ax.set_ylim(-31, 31)
ax.set_aspect('equal')
ax.set_facecolor('black')
for i in range(len(planets_DF)):
ax.add_artist(plt.Circle((0,0), planets_DF['distance'][i], color='white', fill=False))
ax.text(planets_DF['distance'][i], 0, planets_DF['planet'][i], color='white', ha='center', va='center')
plt.show()
Excel files#
Look at the Budget.xlsx file in the data folder to justify the choice of import below.
Should this be adjusted?
Budget = pd.read_excel('../../data/Budget.xlsx', sheet_name='2019', skiprows=2, index_col=0)
Budget
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Unnamed: 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SalespersonID | |||||||||||||
102 | NaN | NaN | 54.728491 | 8.053198e+02 | 1010.005526 | NaN | 1572.864176 | 196.077202 | 1896.777586 | 1430.463445 | 347.092894 | 723.918675 | 8.037248e+03 |
125 | 9016.031000 | 12635.710568 | 3975.959267 | 3.541869e+04 | 67254.905958 | 108270.212920 | 129885.185350 | 51801.187615 | 80833.313215 | 14351.065997 | 99034.421805 | 5946.076427 | 6.184228e+05 |
144 | 9870.923721 | 8537.674024 | 34467.719310 | 1.314835e+05 | 63478.807310 | 80332.088219 | 8950.845553 | 51091.397573 | 28463.939431 | 136104.694978 | 29076.741297 | 35033.416700 | 6.168917e+05 |
183 | 1394.216707 | 6598.171435 | 15397.847623 | 2.877323e+04 | 6530.927321 | 158828.572729 | 66426.139653 | 29403.035538 | 139983.518128 | 231592.365267 | 20132.676098 | 17248.364886 | 7.223091e+05 |
194 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
196 | 2270.020511 | 12633.225198 | 7889.580041 | 2.465794e+03 | 199521.467124 | 6375.290655 | 24857.201539 | 193086.944649 | 66161.225155 | 225389.414595 | 53963.213693 | 37316.866234 | 8.319302e+05 |
215 | 5347.775816 | 98045.092946 | 14818.451817 | 5.493729e+05 | 323788.746583 | 122600.920552 | 79347.415036 | 353588.777730 | 37104.061797 | 41614.097652 | 233643.059453 | 583835.349542 | 2.443107e+06 |
265 | 1995.803045 | 2903.113376 | 16911.287170 | 2.858913e+05 | 143236.464661 | 14227.073545 | 403624.400847 | 76606.898096 | 11670.773263 | 153295.074262 | 202455.468961 | 41727.065582 | 1.354545e+06 |
285 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
326 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
660 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
669 | NaN | NaN | NaN | NaN | NaN | 4962.708683 | 4752.446948 | 11980.303433 | 36185.827317 | 28779.246362 | 52581.712055 | 29698.142511 | 1.689404e+05 |
Grand Total | 29894.770800 | 141352.987547 | 93515.573718 | 1.034211e+06 | 804821.324483 | 495596.867303 | 719416.499101 | 767754.621836 | 402299.435893 | 832556.422558 | 691234.386255 | 751529.200557 | 6.764183e+06 |
# Remove redundant row and column (sums)
Budget = Budget.iloc[:12, :12]
# Exchange NaN values with 0
Budget.replace(np.NAN, 0, inplace=True)
Budget
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SalespersonID | ||||||||||||
102 | 0.000000 | 0.000000 | 54.728491 | 805.319842 | 1010.005526 | 0.000000 | 1572.864176 | 196.077202 | 1896.777586 | 1430.463445 | 347.092894 | 723.918675 |
125 | 9016.031000 | 12635.710568 | 3975.959267 | 35418.687607 | 67254.905958 | 108270.212920 | 129885.185350 | 51801.187615 | 80833.313215 | 14351.065997 | 99034.421805 | 5946.076427 |
144 | 9870.923721 | 8537.674024 | 34467.719310 | 131483.460122 | 63478.807310 | 80332.088219 | 8950.845553 | 51091.397573 | 28463.939431 | 136104.694978 | 29076.741297 | 35033.416700 |
183 | 1394.216707 | 6598.171435 | 15397.847623 | 28773.228192 | 6530.927321 | 158828.572729 | 66426.139653 | 29403.035538 | 139983.518128 | 231592.365267 | 20132.676098 | 17248.364886 |
194 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
196 | 2270.020511 | 12633.225198 | 7889.580041 | 2465.793773 | 199521.467124 | 6375.290655 | 24857.201539 | 193086.944649 | 66161.225155 | 225389.414595 | 53963.213693 | 37316.866234 |
215 | 5347.775816 | 98045.092946 | 14818.451817 | 549372.866029 | 323788.746583 | 122600.920552 | 79347.415036 | 353588.777730 | 37104.061797 | 41614.097652 | 233643.059453 | 583835.349542 |
265 | 1995.803045 | 2903.113376 | 16911.287170 | 285891.263738 | 143236.464661 | 14227.073545 | 403624.400847 | 76606.898096 | 11670.773263 | 153295.074262 | 202455.468961 | 41727.065582 |
285 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
326 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
660 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
669 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4962.708683 | 4752.446948 | 11980.303433 | 36185.827317 | 28779.246362 | 52581.712055 | 29698.142511 |
# Plot monthly budget as boxplot
Budget.boxplot()
plt.show()
Pandas plotting with different backend#
It is possible to exchange the matplotlib backend in Pandas with something else, e.g., the more interactive Plotly. This may not be a full replacement.
pd.options.plotting.backend = "plotly"
fig = Budget.boxplot()
fig.show() # Exchanged plt.show() with fig.show() to avoid mixing matplotlib and plotly