What Predicts Pulmonary Disease?

DS4200 — Information Presentation and Data Visualization

Introduction

Topic Statement

Our project examines how health indicators, lifestyle-related factors, and symptoms relate to pulmonary disease risk. Using a synthetic lung cancer risk dataset from Kaggle, we aim to visualize which factors stand out most clearly and how they help describe patterns of risk.

Why It Matters

This topic is important because lung cancer and related pulmonary conditions are high-impact health concerns, and understanding risk-related patterns can help make a complicated and sensitive topic more approachable. Visualization is especially useful here because it allows us to compare multiple possible indicators at once and show which ones appear most strongly associated with the disease outcome within the dataset.

About the Dataset

The dataset is well suited for this project because it contains 5,000 records and 18 features related to lung cancer risk and prediction, providing enough observations and enough variety among variables to build a meaningful narrative. Since the dataset includes demographic information and risk-related health indicators, it supports our goal of comparing which factors appear most associated with pulmonary disease outcomes.

Central Question

Our main question is which indicators in this dataset appear most strongly related to pulmonary disease, and how they compare when viewed together rather than one at a time. More specifically, we want to identify which factors are most important overall, whether certain symptoms or lifestyle indicators stand out relative to demographic factors, and what a higher-risk profile looks like across multiple variables.

Visual Narrative

The webpage is designed to move from broad comparisons to more detailed views of how risk-related variables behave together. We begin by highlighting which indicators appear most important overall, then use additional visualizations to examine relationships, compare outcomes, and show what higher-risk cases look like in an interactive way.

Limitations

Because this is a synthetic, prediction-oriented dataset rather than direct clinical evidence, our findings describe patterns within the data and should not be read as medical proof. The goal of the project is to communicate these patterns clearly and thoughtfully, while recognizing the limits of what this type of dataset can support.

Data Overview

Source and Scale

We used the "Lung Cancer Prediction Dataset" from Kaggle, which contains 5,000 patient-level records and 18 features. We cleaned and prepared the dataset for analysis, and the final version contains 5,000 usable observations with no missing values or duplicate rows.
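
The missing-value and duplicate checks described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not our actual cleaning script: the column names (AGE, SMOKING, PULMONARY_DISEASE) are assumed for the example, and the four toy rows are made up rather than taken from the dataset.

```python
# Toy rows standing in for the real file. Column names and values are
# assumptions for illustration only, not records from the Kaggle dataset.
rows = [
    {"AGE": "68", "SMOKING": "1", "PULMONARY_DISEASE": "YES"},
    {"AGE": "45", "SMOKING": "0", "PULMONARY_DISEASE": "NO"},
    {"AGE": "45", "SMOKING": "0", "PULMONARY_DISEASE": "NO"},   # exact duplicate
    {"AGE": "", "SMOKING": "1", "PULMONARY_DISEASE": "YES"},    # missing age
]


def has_missing(row):
    """Return True if any field in the row is None or blank."""
    return any(v is None or str(v).strip() == "" for v in row.values())


def clean(rows):
    """Drop rows with missing fields, then drop exact repeats of earlier rows."""
    seen = set()
    kept = []
    for row in rows:
        if has_missing(row):
            continue
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


missing_count = sum(1 for row in rows if has_missing(row))
cleaned = clean(rows)
print(f"rows with missing values: {missing_count}")
print(f"rows kept after cleaning: {len(cleaned)} of {len(rows)}")
```

With the real file, the same functions could be applied to the rows returned by csv.DictReader. A result of zero missing rows and zero dropped duplicates would match the 5,000 usable observations reported above.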

Types of Variables

The dataset contains a mix of continuous and binary variables, which makes it well suited for visualization. Continuous variables include age, energy level, and oxygen saturation, while binary features include smoking, exposure to pollution, breathing issues, chest tightness, family history, and other symptom- or lifestyle-related indicators.
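
One simple way to separate the two variable types is to look at the distinct values in each column. The sketch below assumes binary features are stored as 0/1 and uses hypothetical column names and made-up values; it shows the idea rather than the exact procedure we followed.

```python
# Toy rows with assumed column names; values are illustrative only.
rows = [
    {"AGE": 68, "OXYGEN_SATURATION": 94.1, "SMOKING": 1, "FAMILY_HISTORY": 0},
    {"AGE": 45, "OXYGEN_SATURATION": 97.3, "SMOKING": 0, "FAMILY_HISTORY": 0},
    {"AGE": 59, "OXYGEN_SATURATION": 95.8, "SMOKING": 1, "FAMILY_HISTORY": 1},
]


def classify_columns(rows):
    """Label each column 'binary' if it only holds 0/1, else 'continuous'."""
    kinds = {}
    for column in rows[0]:
        distinct = {row[column] for row in rows}
        kinds[column] = "binary" if distinct <= {0, 1} else "continuous"
    return kinds


kinds = classify_columns(rows)
for column, kind in kinds.items():
    print(f"{column}: {kind}")
```

Knowing the type of each column helps when choosing a chart: continuous variables suit histograms or box plots, while binary indicators suit grouped bars or proportions.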

Outcome / Target

The main outcome in this dataset is pulmonary disease, which is recorded as a yes/no variable and was converted into a binary numeric field for analysis. This allows us to compare which risk-related features appear most associated with disease outcomes and visualize those patterns across the dataset.
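
As a rough sketch of this step, the snippet below converts a yes/no outcome to 1/0 and then compares the share of positive cases between the two groups of a binary indicator. The column names, the "YES"/"NO" spelling, and the eight toy rows are assumptions for illustration; the printed rates are not findings from the dataset.

```python
# Toy rows with assumed column names and labels; not real dataset values.
rows = [
    {"SMOKING": 1, "PULMONARY_DISEASE": "YES"},
    {"SMOKING": 1, "PULMONARY_DISEASE": "YES"},
    {"SMOKING": 1, "PULMONARY_DISEASE": "NO"},
    {"SMOKING": 1, "PULMONARY_DISEASE": "YES"},
    {"SMOKING": 0, "PULMONARY_DISEASE": "NO"},
    {"SMOKING": 0, "PULMONARY_DISEASE": "NO"},
    {"SMOKING": 0, "PULMONARY_DISEASE": "NO"},
    {"SMOKING": 0, "PULMONARY_DISEASE": "YES"},
]


def encode_outcome(value):
    """Map a yes/no label to 1/0, ignoring case and surrounding spaces."""
    normalized = value.strip().lower()
    if normalized == "yes":
        return 1
    if normalized == "no":
        return 0
    raise ValueError(f"unexpected outcome label: {value!r}")


def disease_rate_by(rows, indicator, outcome="PULMONARY_DISEASE"):
    """Return the share of positive outcomes for each value of the indicator."""
    totals = {}  # indicator value -> (row count, positive count)
    for row in rows:
        count, positives = totals.get(row[indicator], (0, 0))
        totals[row[indicator]] = (count + 1, positives + encode_outcome(row[outcome]))
    return {group: positives / count for group, (count, positives) in totals.items()}


rates = disease_rate_by(rows, "SMOKING")
for group, rate in sorted(rates.items()):
    print(f"SMOKING={group}: disease rate {rate:.2f}")
```

Repeating this comparison for each binary indicator gives one simple way to rank features by how much the disease rate differs between the two groups.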

Summary

The dataset fits our project well because it includes enough observations to support meaningful comparisons and enough variables to study multiple dimensions of risk at once. It allows us to compare demographic, lifestyle, and symptom-related indicators within a single narrative. This is central to our goal of identifying which factors stand out most in relation to pulmonary disease risk.

References

Markaki, M., Tsamardinos, I., Langhammer, A., Lagani, V., Hveem, K., & Røe, O. D. (2018). A Validated Clinical Risk Prediction Model for Lung Cancer in Smokers of All Ages and Exposure Types: A HUNT Study. EBioMedicine, 31, 36–46. https://doi.org/10.1016/j.ebiom.2018.03.027
Cao, W., You, Z., Wang, Z., Liang, Z., Li, H., Chang, Z., Chen, Y., Dong, G., Cheng, Z. J., & Sun, B. (2025). Risk factors behind the global lung cancer burden: a pan-database exploration. Translational Lung Cancer Research, 14(7), 2452–2469. https://doi.org/10.21037/tlcr-2025-131

These two sources ground our project in broader research on lung cancer prediction. The HUNT study focuses on clinical risk prediction in smokers and supports the idea that multiple individual-level factors can be combined to better understand disease risk, which connects directly to our project’s goal of comparing symptoms and lifestyle indicators. The pan-database exploration offers a broader public-health perspective by examining major contributors to the global lung cancer burden, which frames why our analysis focuses on the risk-related features that stand out most clearly in the dataset. Together, these sources connect our visualizations to both patient-level risk modeling and the population-level context of lung cancer risk factors.