Thursday, April 14, 2016

Break Down Data Into a Tree of Categories - Easy Category Tree Breakdown

When you are looking at data it can be helpful to place it into categories based using what I call a category tree breakdown.  This can give us additional information that may not be apparent without looking at the breakdown.

In this simple example, say we have gathered a simple dataset on various political candidates, including: First Name, Last Name, Party Affiliation, Gender, State Affiliation, and Job.  Maybe we would want to see which genders were associated with which jobs or what the similarities are across party or state affiliations. 

Each of these could be done individually in SQL, but that would be inefficient and wouldn't expose the power of being able to perform a breakdown on the fly.

The Dataset in PHP

$rows = array(
    array('fname' => 'Ted', 'lname' => 'Cruz', 'party' => 'Republican', 'gender' => 'male', 'state' => 'Texas', 'job' => 'Senator'),
    array('fname' => 'John', 'lname' => 'Kasich', 'party' => 'Republican', 'gender' => 'male', 'state' => 'Ohio', 'job' => 'Governor'),
    array('fname' => 'Donald', 'lname' => 'Trump', 'party' => 'Republican', 'gender' => 'male', 'state' => 'New York', 'job' => 'Chairman'),
    array('fname' => 'Hillary', 'lname' => 'Clinton', 'party' => 'Democrat', 'gender' => 'female', 'state' => 'New York', 'job' => 'Secretary of State'),
    array('fname' => 'Bernie', 'lname' => 'Sanders', 'party' => 'Democrat', 'gender' => 'male', 'state' => 'Vermont', 'job' => 'Senator'),

);
All I need to do to perform a breakdown is to supply the function with the full dataset and specify which keys to look at to start building categories.  The order of the breakdown is important and determines the structure of the tree that is generated as a result.

For example, let's say I want to look at party affiliation.  Then for each party I want to know what are the positions held by those candidates.  Then for each of those jobs I want to see which states have an affiliation.
That can be done by specifying the breakdown array as follows:

$breakdown_arr = array('party', 'job', 'state');
show_breakdown($breakdown_arr, $rows);
I like to use named keys in my array because that makes it easier to read than integer indexes.
When this breakdown is applied to the dataset, the following result displays:

        Republican
            Senator
                Texas
            Governor
                Ohio
            Chairman
                New York
        Democrat
            Secretary of State
                New York
            Senator
                Vermont


Specify Any Type of Breakdown


The breakdown can be specified in any order we want.  Say we want to look again at the party affiliation, but this time by gender and first name:

$breakdown_arr = array('party', 'gender', 'fname');
show_breakdown($breakdown_arr, $rows);
This shows me a completely different take on the data.  We see here that while the Republican party is dominated by males, the Democratic party is split evenly by gender.

        Republican
            male
                Ted
                John
                Donald
        Democrat
            female
                Hillary
            male
                Bernie



The Algorithm For Performing the Breakdown


function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
    if (count($breakdown_indexes) < 1) {
        return array();
    }
    if (!$data) {
        $data = new datum();
    }
    $metric_index = array_shift($breakdown_indexes);
    $to_add = array();
    foreach ($rows as $row) {
        $val = $row[$metric_index];
        $to_add[$val][] = $row;
    }
    //Now loop through the children to continue the breakdown
    foreach ($to_add as $key => $to_add_val) {
        $datum = new datum();
        $datum->name = $key;
        $datum->data = $to_add_val;
        $data->addChild($datum);
        get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
    }
    return $data;
}

The way this code works is that it continually slices off a category index to look for in the dataset.  Then it loops through the passed rows and puts them into groups based on their values for the index.  These newly categorized values (which are stored in the $to_add array) need to be again gone through for the next breakdown (if there is one) and so are passed along to the next iteration so the next breakdown can be checked.  The remaining breakdown list, the unchecked values that were just added, and the $data object containing the result are all passed back into the method as a recursive call.

When there are no more rows to be added and subsequently checked, the recursive call stops being made and the resulting $data object is returned to the original calling function.

Unfortunately this call can be expensive when many rows and categories are being parsed.  If you have 'n' rows and 'm' columns to review so the big O(n) notation and where k is the number of distinct entries for each category is something like O(n) = n+k^m.

Try this algorithm to find interesting facts on your own datasets.

Complete Example For Performing Category Tree Breakdowns in PHP


<?php
//first name, last name, position
$rows = array(
    array('fname' => 'Ted', 'lname' => 'Cruz', 'party' => 'Republican', 'gender' => 'male', 'state' => 'Texas', 'job' => 'Senator'),
    array('fname' => 'John', 'lname' => 'Kasich', 'party' => 'Republican', 'gender' => 'male', 'state' => 'Ohio', 'job' => 'Governor'),
    array('fname' => 'Donald', 'lname' => 'Trump', 'party' => 'Republican', 'gender' => 'male', 'state' => 'New York', 'job' => 'Chairman'),
    array('fname' => 'Hillary', 'lname' => 'Clinton', 'party' => 'Democrat', 'gender' => 'female', 'state' => 'New York', 'job' => 'Secretary of State'),
    array('fname' => 'Bernie', 'lname' => 'Sanders', 'party' => 'Democrat', 'gender' => 'male', 'state' => 'Vermont', 'job' => 'Senator'),
);

$breakdown_indexes = array('party', 'job', 'state');
show_breakdown($breakdown_indexes, $rows);
/*
Output:
    For breakdown in form of: ["party","job","state"]
    Breakdown result is:

        Republican
            Senator
                Texas
            Governor
                Ohio
            Chairman
                New York
        Democrat
            Secretary of State
                New York
            Senator
                Vermont
*/
$breakdown_indexes = array('party', 'gender', 'fname');
show_breakdown($breakdown_indexes, $rows);
/*
Output:
    For breakdown in form of: ["party","gender","fname"]
    Breakdown result is:

        Republican
            male
                Ted
                John
                Donald
        Democrat
            female
                Hillary
            male
                Bernie
*/

function show_breakdown($breakdown_indexes, $rows) {
    $new_data = get_array_breakdown($breakdown_indexes, $rows);
    echo 'For breakdown in form of: '. json_encode($breakdown_indexes).PHP_EOL;
    echo 'Breakdown result is: '.PHP_EOL;
    echo $new_data;
    echo PHP_EOL;

    return $new_data;
}

function get_array_breakdown($breakdown_indexes, $rows, &$data = null) {
    if (count($breakdown_indexes) < 1) {
        return array();
    }
    if (!$data) {
        $data = new datum();
    }
    $metric_index = array_shift($breakdown_indexes);
    $to_add = array();
    foreach ($rows as $row) {
        $val = $row[$metric_index];
        $to_add[$val][] = $row;
    }
    //Now loop through the children to continue the breakdown
    foreach ($to_add as $key => $to_add_val) {
        $datum = new datum();
        $datum->name = $key;
        $datum->data = $to_add_val;
        $data->addChild($datum);
        get_arrays_for_breakdown($breakdown_indexes, $to_add[$key], $datum);
    }

    return $data;
}

class datum {
    public $name = '';
    public $data;
    public $children = array();

    private $depth = 0;

    public function addChild($child) {
        $child->depth = $this->depth + 1;
        $this->children[] = $child;
    }

    public function __toString() {
        $str = $this->toStringHelper();
        foreach ($this->children as $child) {
            $str .= $child;
        }

        return $str;
    }

    private function toStringHelper() {
        $str = str_pad(' ', $this->depth * 4);
        $str .= $this->name;
        $str .= PHP_EOL;

        return $str;
    }
}

1 comment: