Applied Philosophy of Science and Data Ethics
2021 - 2024: Last compiled: 2024-12-09
Preface
This class book gathers the contents addressed in the course Applied Philosophy of Science and Data Ethics from the Master of Data Science at the University of Luxembourg (UL). This course will introduce basic philosophical and scientific concepts supported by examples and discussion. The course expects pro-active participation from the students in the form of presentations and essays as well as open debates.
Course presentation
As much as data science involves automating tasks, we should avoid falling for the automatization of the mental processes we undergo to solve new problems. We data scientists are used to dealing with data as much as we do with the methods used to process such data. Still, rarely, if ever, we stop to question the methodologies that justify employing one method or another. Surprisingly for many, data science is as much about science as about data. We have refined the data acquisition tools and technology stacks used for accomplishing all tasks covered under the data science umbrella. However, below the layers of tools and frameworks that facilitate our daily work, there lies a foundational ground of assumptions, principles, and constraints that shape the way we do data science.
During the last few years, I have taught the students of the Master of Data Science to question such foundations and the reasons behind many tools’ weaknesses and strengths. In this course, we delve into philosophical issues, such as Hume’s problem of induction, and link them with the recent challenges of artificial intelligence (AI) in the domains of my working experience, like healthcare, systems biology, and computer networks.
We tackle scientific explanation, causation, and association, learning the strengths of Hempel’s covering-law model and how it fails to generalize once causality enters into play. We discuss the issues of empiricism with causality and showcase the power of causal inference in epidemiology. This leads us to the role of biases, confounders, and surrogate attributes in AI performance. The core of the course is the scientific method, we jump from Popper’s attempt to make science purely deductive to the risks of confirmation, and how diversity plays a crucial role in shaping the quality of datasets. The students get acquainted with the power of randomized controlled trials and learn to relate the strengths of randomization and stratification to current AI techniques such as cross-validation. This is accompanied by historical cases from scientists and doctors like Semmelweis, Fisher, Yersin, et al. They showed us how to properly set scientific hypotheses and how experimental control can play both in favor and against us. This leads to the final part of the course, data ethics, where we learn how cases like the Tuskegee tragedy shaped current regulations for informed consent, and how we can turn ethical values into actions to prevent our solutions from discriminating or wrongly modeling the phenomena at issue.
I believe this course should be taught in every computer science faculty. It requires constant updates to follow the trending topics. One year is COVID-19 and the misleading statistics of vaccine effectiveness put by anti-vax supporters; the next year is the issues of social media applications among teenagers; and today the consequences of generative AI. However changing the present may be, past theories provide a strong base material for this course that students learn to appreciate. The topics learned in this course will remain useful independently of technological advancements. They are essential knowledge for future data scientists to face the upcoming challenges and build the solutions to overcome them.
Course at UL
The course related to this class book aims to provide the students with guidelines and methodologies to identify epistemic and ethical issues present in data science. We expect the students to develop a critical eye that helps them mitigate such problems in their daily work as data scientists.
During this course, students will learn by example different layers of the scientific method and how they relate to data science and data ethics. In particular, they will learn how the mechanisms behind the data affect the data analysis, and how the different types of scientific inference condition method choice and affect the conclusions drawn from the analysis. In this sense, examples of statistical abuse, misconduct and bad visualization will be shown together with their, sometimes catastrophic, collateral consequences.
Disclaimer
Although the impact and extension of the topics addressed in this class book (and the course) are broad and diverse, its length is limited. Hence the scope and depth of the contents are restricted. Consequently, several topics on Philosophy of Science are tackled superficially while some others are completely ignored. Such philosophical questions are handled from a practical data science point of view. Similarly, Data Ethics is a relatively new matter in continuous evolution. Therefore we will try to cope with the main issues in the most practical way.
Learning outcomes
In line with the European Qualitity Framework, Bachelor degrees require a critical understanding of theories and principles, while Master degrees involve higher specialised knowledge and critical awareness of knowledge issues in a field. In this case, the field at issue is data science and the contents will tackle philosophical and ethical issues concerning data science. Therefore, the aim is to provide students with a better understanding of method justification, to increase their knowledge about such methods, their scope, purpose and relation to other practices.
- Get familiar with the scientific goals and methods.
- Learn the most common data science misconduct problems.
- Critically evaluate ethical issues and method choice.
About this course
Some sections from the first chapters of this course are inspired by the book from Prof. Dr Lars-Göran Johansson (Johansson et al. 2016) and the educational works of Prof. Dr Till Grüne-Yanoff (Grüne-Yanoff 2014), such as his great course at EDX on “Philosophy of Science for Engineers and Scientists”. Regarding the second part of the course, which covers data ethics, I would like to thank the Univerisity of Michigan, for their online courses (from which I was already a fan during my PhD) and especially Prof. Dr H. V. Jagadish for his course on Data Science Ethics which inspired me on the contents and examples of this course. Moreover, this last part of the course would not have been possible without many relevant books on the topics tackled in this course (see References). Including The Book of Why (Pearl and Mackenzie 2018); Ethics and Data Science (Loukides, Mason, and Patil 2018); Philosophy of Natural Science (Hempel 1966); How charts lie (Cairo 2019); Automating inequality (Eubanks 2018) and many others. I hope any resemblance or imitation is seen as an act of flattery.
About this class book
This class book was made thanks to the great tutorial available on the book “Open tools for writing open interactive textbooks (and more)”.
License
Licensed under CC BY-NC-ND 4.0
The book is released under CC BY-NC-ND 4.0 license. This means that you are free to:
- Share: copy and redistribute the material in any medium or format.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- Non-Commercial: You may not use the material for commercial purposes.
- No Derivatives: If you remix, transform, or build upon the material, you may not distribute the modified material.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.