Conversation
| ### PyCharm ### | ||
| # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm | ||
| # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 |
There was a problem hiding this comment.
I'm totally fine with you updating the .gitignore with required files, but adding your personal IDE files is considered a bad practice, you can see how to move this contents into a global .gitignore file for your local installation here.
There was a problem hiding this comment.
.gitignore has been updated.
also, audit.md and datapackage.json will be added too.
also, didnt notice the PULL_REQUEST_TEMPLATE.md, so ill take a look into it! thanks.
There was a problem hiding this comment.
This is awesome, I didn't know you can set a global gitignore thank you @ManuelAlvarezC !
There was a problem hiding this comment.
@ManuelAlvarezC Could you elaborate on the audit.md and the datapackage.json? just this two left.
There was a problem hiding this comment.
Although is not as complete as it should be, thedocumentation will help you get a good grab on it.
There was a problem hiding this comment.
@ManuelAlvarezC Thank you, ill take a look at it, finalize the edit and make the push.
| @@ -0,0 +1,3 @@ | |||
| from task_geo.data_sources.covid.south_korea.south_korea_patients import south_korea_patients | |||
|
|
|||
| __all__ = ['south_korea_patients'] | |||
There was a problem hiding this comment.
Rename your data source to south_korea, or even better, the ISO code, something like kr_covid
| 'infected_by', 'contact_number' | ||
| ] | ||
| df = df.reindex(columns=cols_ordered) | ||
| df['confirmed_date'] = pd.to_datetime(df.confirmed_date) |
There was a problem hiding this comment.
This cast can be done in two lines with:
date_columns = [...]
df[date_columns] = df[date_columns].apply(pd.to_datetime)
# This was written originally as pd.to_datetime(df[columns]) which crashes.There was a problem hiding this comment.
@ManuelAlvarezC Could you elaborate on this?
There was a problem hiding this comment.
Sure thing.
On the lines between 37-41(I only selected the line 37 to comment, a blunder of mine) you are casting columns to datetime and reassigning them multiple times. This approach has two drawbacks:
- More lines of code to read and write, making it easier to miss details and introduce errors.
- It's in fact much faster making the casting and assigning of all the columns at once. This is what is called vectorization and pandas ( and numpy too) are designed to have vectorized operations run much faster than regular iteration in python.
There was a problem hiding this comment.
Also, passing the date format to to_datetime will further improve its performance.
| import requests | ||
|
|
||
|
|
||
| def south_korea_patients_connector(*args, **kwargs): |
There was a problem hiding this comment.
I'm not sure why the signature of the functions is with *args, and **kwargs if you are only expecting one argument.
Also, this argument is a url, does this data source work with any url? I think this connector should take no arguments. Can you please update the signature for both connector and data source?
There was a problem hiding this comment.
Done, takes no argument now and the url has been set
|
Update: audit.md is done, some changes are made to comply with the lint for flake8 (checks during commit). Just have the datapackage.json left to do. once this is done, will be making another commit |
Extraction and formatting of the dataset of confirmed cases, deaths and recovered per patient in South Korea #20