Data Science 2016
3/25/2016

PROJECT PROPOSAL

Who is on the team?

Mackenzie and Brenna

In a couple of paragraphs, describe the key ideas of your proposed project? What is your MVP? What are your stretch goals?

The key idea of this project is to work with an external collaborator to understand a data set. We also want to work with time course data: we need to find the best way to examine data along a temporal dimension without misrepresenting it by collapsing multi-week spans into a single misleading value (a non-meaningful mean, for example). We’ve both done a fair bit of work with data visualization and want to build on that experience by finding effective techniques for representing data across time. We hope to look at existing journal articles and other peer-reviewed documents for inspiration on appropriate visualization techniques. We haven’t yet used much of what we have discussed from Thinkstats2 in projects, so we want to apply those statistical techniques to comparing time course datasets in this data-heavy project (we do not yet know what the conventions for such comparisons are).

The MVP for this project is well-documented code with visualizations that make relationships in the data clearer and support further analysis. This may take the form of a Jupyter notebook, or of a .py file (well commented, to tell the story of the data) with accompanying .png files. Our stretch goal is a more formal document about our findings throughout the project. This could take the form of a scientific journal article with many compelling visualizations and insights.
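The concern about collapsing multi-week data into one misleading value can be illustrated with a small sketch. The readings below are synthetic and made up for illustration (they are not from the Kwadella data), and `hourly_profile` is a hypothetical helper name. The point is only that a single overall mean can sit far from every value that actually occurs, while a per-hour average keeps the daily pattern visible.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_profile(readings):
    """Average (timestamp, value) readings by hour of day: {hour: mean}."""
    by_hour = defaultdict(list)
    for timestamp, value in readings:
        by_hour[timestamp.hour].append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Synthetic readings (an assumption, not real measurements): two weeks of
# hourly values that spike in the morning and evening.
PEAK_HOURS = {6, 7, 8, 17, 18, 19, 20}
start = datetime(2016, 1, 1)
readings = []
for i in range(14 * 24):
    ts = start + timedelta(hours=i)
    readings.append((ts, 80.0 if ts.hour in PEAK_HOURS else 10.0))

# One number for the whole two weeks hides the daily cycle entirely.
overall_mean = mean(value for _, value in readings)

# Grouping by hour of day keeps the morning and evening peaks visible.
profile = hourly_profile(readings)

print(f"overall mean: {overall_mean:.1f}")
for hour, value in profile.items():
    print(f"{hour:02d}:00  {value:.1f}")
```

Here no single reading equals the overall mean, which is what makes such a summary misleading; whether hour of day is the right grouping for the real data is something the codebook and exploration would have to settle.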

To the best of your current knowledge, what datasets will you use for your project? Are there any obstacles you foresee in terms of getting access to the data?

We would like to investigate the data set gathered by Scott Hersey, which records pollutant concentrations, particulate matter concentrations, and indoor and outdoor temperatures in the township of Kwadella. There are no obstacles to obtaining these four data sets, as Scott has them on his computer. We are also planning to meet with him to create a “codebook” for the set of data. (UPDATE: This meeting happened and was very helpful. We now know which columns in the dataset we will use.)

What are the most important new skills / techniques you will have to learn to be successful in this project? If you think some of these skills would be useful for us to cover in class, please indicate which ones.
  • Data visualization skills and statistical analysis. It would be useful for us to spend some class time going over how to draw conclusions from data: how and when to infer causes from observed correlations.
Outline a rough timeline for the major milestones of your project. This will mainly be useful to refer back to as we move through the project.
  • Talk to Scott Hersey and make our own complete “codebook”, and look through the analysis he has already conducted
  • Completely clean the data and have it in a usable format
  • Create visualizations of relationships seen in the data (will hopefully discuss correlation and causation)
  • Use statistical methods outlined in ch 7 of Thinkstats2 to look at how temperature and pollutant/particulate matter concentrations correlate, and look for possible causation
  • Analyze visualizations and how they connect to the statistical analysis
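As a rough sketch of the correlation step in the timeline above, the snippet below computes Pearson's correlation and Spearman's rank correlation using only the standard library. These two measures are our assumption about which methods would apply; the temperature and particulate values are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example values (not real measurements): outdoor temperature
# and a particulate concentration that falls steeply as it gets warmer.
temperature = [2, 5, 8, 12, 15, 20]
particulates = [400, 150, 90, 60, 45, 40]

print("Pearson: ", round(pearson(temperature, particulates), 3))
print("Spearman:", round(spearman(temperature, particulates), 3))
```

With a monotonic but nonlinear relationship like this one, the rank-based measure reports a perfect negative association while Pearson's value is weaker, which is one reason to look at both alongside a scatter plot. Neither number says anything about causation on its own.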
What do you view as the biggest risks to you being successful on this project?
  • Biggest risk: Getting lost in statistical methods, not knowing which ones are useful.
  • Other potential risks: overscoping, trying to focus on too many interesting aspects of the dataset
Given each of your YOGAs, in what ways is this project well-aligned with these goals, and in what ways is it misaligned? If there are ways in which it is not well-aligned, please provide a potential strategy for bringing the project and your learning goals into better alignment.
Mackenzie - Fit between YOGA and Project Topic

In my first YOGA, I focused on learning to work with more complete data sets, finding and working with a specific research question for my project, and drawing more complete conclusions instead of just finding and showing trends without further analysis. This project is very well aligned with my goals for three reasons: it uses a complete research data set that Scott Hersey collected; we (Brenna and I) are requiring ourselves to identify a research question before we start working with the data; and we are using statistical methods to investigate relationships in the data and draw concrete conclusions from them. The major risk to our project, getting lost in the data, is also the biggest risk of misalignment with my learning goals.

Brenna - Fit between YOGA and Project Topic

The goals I originally chose to write about for my YOGA assignment were:
1.) Creating clear, compelling visualizations that communicate the story of the data to the viewer. This goal fits very well with this project topic. I will have the opportunity to experiment with different sorts of visualizations, go beyond Jupyter notebooks, and get experience creating visualizations that are clean, polished, easy to interpret, and convey an important message.
2.) Knowing when and how to apply machine learning to a data science project. This is the goal that fits least well with this project. However, a large part of this goal was understanding when machine learning is the right tool to use, and I believe that for this project it mostly will not be. I will update my own personal learning goals to reflect this.
3.) Gaining better intuition about patterns and correlations within a dataset (enhanced data exploration skills of sorts). This goal fits well with the project. We will have the opportunity to explore an interesting dataset, and I’ll gain experience exploring and identifying relationships in a dataset.