PyTexas 2013

Filtering and Deduplicating Data in IPython Notebook

Walker Hale

This talk provides a practical demonstration of how I used IPython Notebook to filter and then de-duplicate a set of 2877 records from a tab-separated values (TSV) text file. Using a file of 117 cases and a fairly complex set of business criteria that combined separate concepts of "most recent" and "most modern", I reduced the data from 20 columns x 2877 records to 13 columns x 615 records. The end result was both the output of this computation and a script that could be reused on other data sets. (There were 9 more data sets.)
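
As a rough illustration of this kind of workflow (not the actual script from the talk), the following minimal Python sketch filters TSV records against a set of wanted cases, keeps one record per case, and drops unneeded columns. The column names `case_id`, `date`, and `version`, the sample data, the choice of `case_id` as the de-duplication key, and the ranking rule (latest date first, then highest version as a stand-in for "most modern") are all hypothetical assumptions.

```python
import csv
import io


def filter_and_dedupe(rows, wanted_cases, keep_columns):
    """Keep rows for wanted cases, then keep the best-ranked row per case."""
    best = {}
    for row in rows:
        if row["case_id"] not in wanted_cases:
            continue  # filter step: drop records for cases we do not care about
        # Hypothetical ranking: later ISO date wins, then higher version number.
        rank = (row["date"], int(row["version"]))
        current = best.get(row["case_id"])
        if current is None or rank > current[0]:
            best[row["case_id"]] = (rank, row)
    # Projection step: keep only the requested columns.
    return [{col: row[col] for col in keep_columns} for _, row in best.values()]


# Made-up sample data standing in for the real tab-separated file.
SAMPLE_TSV = (
    "case_id\tdate\tversion\tnote\n"
    "A\t2013-01-05\t1\told\n"
    "A\t2013-03-01\t1\tnewer\n"
    "A\t2013-03-01\t2\tnewer and more modern\n"
    "B\t2013-02-10\t1\tonly one\n"
    "C\t2013-02-11\t1\tnot a wanted case\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
result = filter_and_dedupe(rows, {"A", "B"}, ["case_id", "date", "version"])
for record in result:
    print(record)
```

Keeping the ranking as a single tuple means additional tie-breaking criteria can be added in one place when the business rules change.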

Status: Accepted