Data formats#

Data come in many forms and formats, depending on source and function.

“Flat” text files#

  • .txt, .dat, .csv, .fasta, …

  • Often with a fixed format

    • Comma separated columns

    • Tabulator separated columns

    • Fixed width columns

    • Headers, subheaders

    • Sections with keywords

import pandas as pd
planets_DF = pd.read_csv('../../data/planets.csv')
planets_DF.head()
planet distance diameter
0 Mercury 0.387 AU 4878 km
1 Venus 0.723 AU 12104 km
2 Earth 1.000 AU 12756 km
3 Mars 1.524 AU 6787 km
4 Jupiter 5.203 AU 142796 km
# Using planets_DF as input, remove " AU" from the "distance" column and convert it to a float
planets_DF['distance'] = planets_DF['distance'].str.replace(' AU', '').astype(float)
print(planets_DF.dtypes)
planets_DF.head()
planet       object
distance    float64
diameter     object
dtype: object
planet distance diameter
0 Mercury 0.387 4878 km
1 Venus 0.723 12104 km
2 Earth 1.000 12756 km
3 Mars 1.524 6787 km
4 Jupiter 5.203 142796 km

Plotting directly with Pandas#

Instead of converting data to NumPy arrays and plotting, it is possible to plot directly from Pandas with special plot commands and .plot.xxx().

# Bar plot
# pd.options.plotting.backend = "matplotlib"
import matplotlib.pyplot as plt
planets_DF.plot.bar(x='planet', y='distance', rot=0)
plt.show()
# Try: color, rot, add more with plt.xxx()
../../_images/67146935ac365e8d046fd997a6a664e525549598fcd440e1794b59bf15548ac7.png

Exercise#

  • Plot the planets as circles instead of bars.

  • Use the diameter of the planet to adjust the circle size.

# Plot concentric ellipses for each planet where the size of the ellipse is proportional to 
# the planet's distance. Add planet names to the plot. Let the xlim and ylim extend from -31 to 31.
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(10,10))
ax.set_xlim(-31, 31)
ax.set_ylim(-31, 31)
ax.set_aspect('equal')
ax.set_facecolor('black')
for i in range(len(planets_DF)):
    ax.add_artist(plt.Circle((0,0), planets_DF['distance'][i], color='white', fill=False))
    ax.text(planets_DF['distance'][i], 0, planets_DF['planet'][i], color='white', ha='center', va='center')
plt.show()
../../_images/5c34c0563bcd35c44898e8fda19b7a13f5e1314e2a484cf85127e585a30929f5.png

Excel files#

Look at the Budget.xlsx file in the data folder to justify the choice of import below.
Should this be adjusted?

pd.read_excel()

Budget = pd.read_excel('../../data/Budget.xlsx', sheet_name='2019', skiprows=2, index_col=0)
Budget
1 2 3 4 5 6 7 8 9 10 11 12 Unnamed: 13
SalespersonID
102 NaN NaN 54.728491 8.053198e+02 1010.005526 NaN 1572.864176 196.077202 1896.777586 1430.463445 347.092894 723.918675 8.037248e+03
125 9016.031000 12635.710568 3975.959267 3.541869e+04 67254.905958 108270.212920 129885.185350 51801.187615 80833.313215 14351.065997 99034.421805 5946.076427 6.184228e+05
144 9870.923721 8537.674024 34467.719310 1.314835e+05 63478.807310 80332.088219 8950.845553 51091.397573 28463.939431 136104.694978 29076.741297 35033.416700 6.168917e+05
183 1394.216707 6598.171435 15397.847623 2.877323e+04 6530.927321 158828.572729 66426.139653 29403.035538 139983.518128 231592.365267 20132.676098 17248.364886 7.223091e+05
194 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
196 2270.020511 12633.225198 7889.580041 2.465794e+03 199521.467124 6375.290655 24857.201539 193086.944649 66161.225155 225389.414595 53963.213693 37316.866234 8.319302e+05
215 5347.775816 98045.092946 14818.451817 5.493729e+05 323788.746583 122600.920552 79347.415036 353588.777730 37104.061797 41614.097652 233643.059453 583835.349542 2.443107e+06
265 1995.803045 2903.113376 16911.287170 2.858913e+05 143236.464661 14227.073545 403624.400847 76606.898096 11670.773263 153295.074262 202455.468961 41727.065582 1.354545e+06
285 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
326 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
660 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
669 NaN NaN NaN NaN NaN 4962.708683 4752.446948 11980.303433 36185.827317 28779.246362 52581.712055 29698.142511 1.689404e+05
Grand Total 29894.770800 141352.987547 93515.573718 1.034211e+06 804821.324483 495596.867303 719416.499101 767754.621836 402299.435893 832556.422558 691234.386255 751529.200557 6.764183e+06
# Remove redundant row and column (sums)
Budget = Budget.iloc[:12, :12]
# Exchange NaN values with 0
Budget.replace(np.NAN, 0, inplace=True)
Budget
1 2 3 4 5 6 7 8 9 10 11 12
SalespersonID
102 0.000000 0.000000 54.728491 805.319842 1010.005526 0.000000 1572.864176 196.077202 1896.777586 1430.463445 347.092894 723.918675
125 9016.031000 12635.710568 3975.959267 35418.687607 67254.905958 108270.212920 129885.185350 51801.187615 80833.313215 14351.065997 99034.421805 5946.076427
144 9870.923721 8537.674024 34467.719310 131483.460122 63478.807310 80332.088219 8950.845553 51091.397573 28463.939431 136104.694978 29076.741297 35033.416700
183 1394.216707 6598.171435 15397.847623 28773.228192 6530.927321 158828.572729 66426.139653 29403.035538 139983.518128 231592.365267 20132.676098 17248.364886
194 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
196 2270.020511 12633.225198 7889.580041 2465.793773 199521.467124 6375.290655 24857.201539 193086.944649 66161.225155 225389.414595 53963.213693 37316.866234
215 5347.775816 98045.092946 14818.451817 549372.866029 323788.746583 122600.920552 79347.415036 353588.777730 37104.061797 41614.097652 233643.059453 583835.349542
265 1995.803045 2903.113376 16911.287170 285891.263738 143236.464661 14227.073545 403624.400847 76606.898096 11670.773263 153295.074262 202455.468961 41727.065582
285 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
326 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
660 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
669 0.000000 0.000000 0.000000 0.000000 0.000000 4962.708683 4752.446948 11980.303433 36185.827317 28779.246362 52581.712055 29698.142511
# Plot monthly budget as boxplot
Budget.boxplot()
plt.show()
../../_images/e19ea2998de51fb773e0d9720a8f3212a254dea375760a1f24bcbe0067560981.png

Pandas plotting with different backend#

It is possible to exchange the matplotlib backend in Pandas with something else, e.g., the more interactive Plotly. This may not be a full replacement.

pd.options.plotting.backend = "plotly"
fig = Budget.boxplot()
fig.show() # Exchanged plt.show() with fig.show() to avoid mixing matplotlib and plotly