# Import libraries used below
import requests
import urllib.request
import urllib.parse
import time
import io
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import os
Jeffrey Post
April 15, 2020
As the summary explains, this blog post will quickly show how to automatically download French government data on hospitalization and testing pertaining to COVID-19.
The various datasets concerning hospitalization data are found here.
If you follow the link you will find 4 csv datasets concerning hospitalization data along with 5 other csv files with metadata and documentation.
The various datasets concerning testing data are found here.
If you follow the link you will find 2 csv datasets concerning testing data along with 2 other csv files with metadata and documentation.
In both cases we want to download the first of the links since they contain the pertinent daily updated data (do have a look manually at the metadata and documentation files to make sure this is what you want).
Let’s first have a look at the main landing page that I provided above.
The response here should be 200 (see the list of status codes here).
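As a quick reference, Python’s standard library enumerates these HTTP status codes, so you can check a response without memorizing numbers (a minimal sketch; the commented lines show how the same check reads with requests):

```python
from http import HTTPStatus

# 200 means the request succeeded; 404 means the page was not found
print(int(HTTPStatus.OK))            # 200
print(HTTPStatus.NOT_FOUND.phrase)   # Not Found

# With requests, the equivalent check would be:
#   resp = requests.get(url)
#   assert resp.status_code == HTTPStatus.OK
```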
We see that the pertinent file in each case (testing or hospitalization data) is the first link on its page, so we save only that one as shown below:
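That step can be sketched with BeautifulSoup. The HTML below is a made-up stand-in for the real dataset page (on data.gouv.fr the markup and link URLs differ), but the idea is the same: collect all links ending in .csv and keep the first one.

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).text on the dataset landing page
html = """
<div class="resources">
  <a href="https://example.org/data/daily-data.csv">daily data</a>
  <a href="https://example.org/data/metadata.csv">metadata</a>
  <a href="https://example.org/docs">documentation</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Keep only links that point at CSV files, in page order
csv_links = [a['href'] for a in soup.find_all('a', href=True)
             if a['href'].endswith('.csv')]
csvurl = csv_links[0]  # the first CSV link is the daily updated dataset
print(csvurl)  # https://example.org/data/daily-data.csv
```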
We now have the URLs for the CSV files we want, so we’ll follow similar steps as above to download those files.
Now that you have the data, what to do with it?
It depends on your purpose, I guess:
* First write the data to a CSV file, which you then read
* Directly read the data
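Both options can be sketched as follows, with a tiny inline CSV standing in for the downloaded bytes so no network call is needed:

```python
import io
import os
import pandas as pd

# Stand-in for requests.get(csvurl).content
content = b"dep;jour;hosp\n01;2020-04-15;123\n02;2020-04-15;45\n"

# Option 1: write the raw bytes to disk, then read the file back
with open('sample.csv', 'wb') as f:
    f.write(content)
df_file = pd.read_csv('sample.csv', sep=';')
os.remove('sample.csv')

# Option 2: read directly from memory, no intermediate file
df_mem = pd.read_csv(io.StringIO(content.decode('utf-8')), sep=';')

print(df_file.equals(df_mem))  # True
```

Option 1 leaves a local copy you can archive or reprocess later; option 2 is handy for scripts that only need the latest snapshot in memory.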
# You can then read that csv file to use in your data analysis:
tests = pd.read_csv('tests.csv', sep=';', dtype={'dep': str, 'jour': str, 'clage_covid': str, 'nb_test': int, 'nb_pos': int, 'nb_test_h': int, 'nb_pos_h': int, 'nb_test_f': int, 'nb_pos_f': int}, parse_dates = ['jour'])
cases = pd.read_csv('cases.csv', sep=';', dtype={'dep': str, 'jour': str, 'hosp': int, 'rea': int, 'rad': int, 'dc': int}, parse_dates = ['jour'])
Note that in the code above I had previously looked through the raw CSV data to understand how to parse it.
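The explicit dtypes matter. For example, département codes must be read as strings: otherwise '01' becomes the integer 1, and the Corsican codes '2A'/'2B' would break an integer parse entirely. A small illustration (made-up rows, same column layout as the cases file):

```python
import io
import pandas as pd

csv = "dep;jour;hosp;rea;rad;dc\n01;2020-04-15;10;2;5;1\n2A;2020-04-15;3;0;1;0\n"

df = pd.read_csv(
    io.StringIO(csv),
    sep=';',
    dtype={'dep': str, 'hosp': int, 'rea': int, 'rad': int, 'dc': int},
    parse_dates=['jour'],  # turn the date column into real datetimes
)
print(df.dep.tolist())  # ['01', '2A'] -- leading zero preserved
print(df.jour.dtype)    # datetime64[ns]
```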
cases = pd.read_csv(io.StringIO(requests.get(casescsvurl).content.decode('utf-8')), sep=';', dtype={'dep': str, 'jour': str, 'hosp': int, 'rea': int, 'rad': int, 'dc': int}, parse_dates = ['jour'])
tests = pd.read_csv(io.StringIO(requests.get(testscsvurl).content.decode('utf-8')), sep=';', dtype={'dep': str, 'jour': str, 'clage_covid': str, 'nb_test': int, 'nb_pos': int, 'nb_test_h': int, 'nb_pos_h': int, 'nb_test_f': int, 'nb_pos_f': int}, parse_dates = ['jour'])
It sometimes happens that links are provided percent-encoded (special characters escaped as % sequences in the URL).
You generally need to convert those back to readable URLs, example below:
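A minimal example with urllib.parse (the URL below is made up for illustration):

```python
from urllib.parse import unquote

# Percent-encoded URL: %C3%A9 is 'é', %20 is a space, %C3%A8 is 'è'
encoded = 'https://example.org/donn%C3%A9es%20hospitali%C3%A8res.csv'
decoded = unquote(encoded)
print(decoded)  # https://example.org/données hospitalières.csv
```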
Let’s quickly see how, from scratch, we can use the code above to scrape testing data and plot it.
Note the data only includes city testing centers and does not include hospital testing.
pip install plotly==4.6.0
Successfully installed plotly-4.6.0
#collapse_hide
import plotly.express as px
import plotly.graph_objects as go
# We want overall testing for France, so we group by day and sum
# (filtering for clage_covid = 0 keeps the rows not differentiated by age group)
df = tests[tests.clage_covid=='0'].groupby(['jour']).sum()
fig = go.Figure(data=[
go.Bar(name='Positive tests', x=df.index, y=df.nb_pos, marker_color='red'),
go.Bar(name='Total tests', x=df.index, y=df.nb_test, marker_color='blue')
])
fig.update_layout(
title= 'Daily positive and total testing data in France',
xaxis_title = 'Date',
yaxis_title = 'Number of tests (total and positive)',
barmode='group'
)
fig.show()
It is easy to incorporate this into a Python script to automate the whole process.
This covers only the very basics of scraping; a lot more could be done, maybe in another blog post.