1. The Problem
My `autocleanse.py` script ran instantly on the small Titanic dataset. This gave me a false sense of security. To understand its true performance and find real bottlenecks, I needed to test it against a large, challenging dataset.
2. The Solution
Since I didn’t have a large, messy dataset available, I created one. The strategy was to generate a synthetic CSV file with millions of rows to simulate a real-world workload. This would force the script to work hard, making performance issues visible.
3. The Lab Notebook
Part A: The Data Generator
I wrote a Python script (`generate_data.py`) using `pandas` and `numpy` to create a challenging dataset.
Key Steps:
- Size: Generated 5 million rows.
- Columns: Created four columns: two numerical, one categorical, and one with dates.
- Artificial Messiness: Intentionally introduced a high percentage of `NaN` (missing) values into one of the numerical columns to test my `handle_missing_data` function.
- Output: Saved the final DataFrame to `large_dataset.csv`.
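The steps above can be sketched as a short generator script. This is a minimal reconstruction, not my exact `generate_data.py`: the column names, value distributions, and the ~30% missing-value rate are illustrative assumptions; only the row count, column types, and output filename come from the steps listed.

```python
import numpy as np
import pandas as pd

N_ROWS = 5_000_000  # 5 million rows, as in the experiment
rng = np.random.default_rng(42)  # seeded for reproducibility

df = pd.DataFrame({
    "value_a": rng.normal(100, 15, N_ROWS),                   # numerical
    "value_b": rng.integers(0, 1000, N_ROWS).astype(float),   # numerical
    "category": rng.choice(["A", "B", "C", "D"], N_ROWS),     # categorical
    "date": pd.to_datetime("2020-01-01")
            + pd.to_timedelta(rng.integers(0, 365, N_ROWS), unit="D"),
})

# Artificial messiness: blank out ~30% (an assumed fraction) of one numerical column
mask = rng.random(N_ROWS) < 0.30
df.loc[mask, "value_a"] = np.nan

df.to_csv("large_dataset.csv", index=False)
```

Generating the random arrays is nearly instant; most of this script's own runtime goes into the `to_csv` write, which is itself a useful preview of how expensive I/O on 5 million rows can be.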
Part B: The Experiment & The Result
I then used the `time` command in my Linux terminal to measure the execution of my cleaning script on this new large file.
Command:
time python autocleanse.py data/large_dataset.csv
Result:
10.51s user 3.32s system 83% cpu 16.564 total
Part C: Initial Analysis
The total wall-clock (real) time was 16.56 seconds. The total time the CPU was actively working was 10.51s (user) + 3.32s (sys) = 13.83s. This immediately tells me that my script is primarily CPU-bound: most of the time is spent on the actual data calculations within pandas. However, the ~2.7-second gap between real and total CPU time shows that the initial disk read (I/O) is still a noticeable, non-trivial part of the process.
4. Academic Bridge
- OS Unit 2 (Process Management): This test created a long-running process, which is the fundamental unit of execution managed by the OS.
- OS Unit 3 (CPU Scheduling): The `user` vs. `sys` time is a direct result of how the Linux scheduler allocated CPU cycles to my program's code versus kernel tasks (like file access).
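On Linux, that user/sys split can also be read for the current process via the standard library's `resource` module: `ru_utime` is time spent executing in user mode, `ru_stime` is time spent in the kernel on the process's behalf (system calls such as file writes). A small sketch under the assumption of a Unix system:

```python
import resource

# User-mode work: pure computation, no system calls
_ = sum(i * i for i in range(2_000_000))

# Kernel-mode work: each write() is a system call, accumulating sys time
with open("/dev/null", "wb") as f:
    for _ in range(20_000):
        f.write(b"x" * 1024)

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"user {usage.ru_utime:.3f}s  sys {usage.ru_stime:.3f}s")
```

This is exactly the accounting the shell's `time` command reports as `user` and `system`.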
5. Why It Matters
A script that works on 1,000 rows might crash or take hours on 10 million. This process of stress-testing proves that I think about scalability, which is essential for any professional freelance project. It’s the difference between a toy project and an engineering proof-of-concept.