; $LastChangedDate: 2008-08-16 12:27:49 -0500 (Sat, 16 Aug 2008) $
; $LastChangedRevision: 2718 $
; $HeadURL: svn://osiris/scheme-grant/trunk/slow-loop.scm $
;
; In the comp.lang.lisp post [Slow Loop (alternatives in lisp?)]
; Francogrex asked how to implement pivot table functionality without
; it taking forever using inner loops. Folks posted clearly faster
; solutions along with good advice. By the way, taking forever means
; 12 hours for inputs of 1 million data points.
;
; I wondered how you might lead someone down the path to understand how to approach
; solving a problem (this is pre-HTDP). Here is what I came up with, primarily for
; myself, mostly as a thought exercise.
; ---------------------------------------------------
; Now you have the solution. Here is something else to think about,
; how that solution was reached.
;
; What was your problem statement? You described it as wanting an Excel style pivot
; table that adds up similar items. Make what you already know a little more exactly:
;
; You got a set C, of collections, of integers: C [a, b, c ,d, e, …] .
;
; Your goal is to provide the sum of each collection and present it to the user.
; The problem is that rather than defining the collections individually like
; this (Using PLT Scheme notation):
;
#lang scheme
(require (planet schematics/schemeunit:3)
(prefix-in srfi/1: srfi/1))
(define collection-A '(1 4 3 6 1 3 5))
(define collection-B '(8 5 2 7 4 8 9))
; They are defined in two separate lists like this:
(define collection-names'("a" "b" "c" "d" "e" "a" "b"))
(define collection-values '(1 4 2 7 5 2 1))
; The implication being that collection-names is a list of symbols (or some other
; object) that represent the collection, and collection-values is a list of integers,
; and that both lists are the same size and when iterated over in order, the
; relationship between collection-names[n] and collection-values[n] is that the latter
; is a member of the former.
; Your goal is to do something like (apply + collection-A), (apply + collection-B)
; and so on, but you can’t because of the given constraints. Ask yourself what is the
; next best thing?
; The simplest thing is to start by getting the value, and the collection
; to which it belongs (after doing that you can figure out what to do with it).
; At this point, there are a few ways to get closer to your goal, and you should
; pursue the easiest one right now. The reason is that the way you are solving the
; problem is most important at this point. You already did that, and it was too slow.
; Since you are using 1 million data points, now you have to think about how to
; optimize your solution.
; Your solution does too much work. For every element that is in the list, it will
; iterate over the entire list that many times (N*N). That would imply that when you
; do the work, you actually need to go back to the beginning because you care about
; all of the values in there. In the case of a pivot table, though, you don't. Here is
; why:
; Suppose you've got the data a:1, b:4, c:2, d:7, e:5, a:2, b:1. When you, I mean
; *you* read it, you go along saying "1 is member of a", and since you know you don't
; want to save it, all you want to do is to sum up the members of a, you can apply 1
; to the sum of the members of the collection a. So straight away you would say
; "the current sum of members of a is 0, I'll add 1 to it, so now the sum is 1", you
; make a mental note of this. Now you look at the rest of the list: b:4, c:2, d:7,
; e:5, a:2, b:1 and do the same thing, and you keep repeating that until you are
; finished. This is our "counting algorithm". What we'll do in the code is replace
; our "memory" with a hashtable that stores sums for a given collection.
; There are two ways to get the data into pair values like this. Here is the
; first way:
(map cons collection-names collection-values)
; > (("a" . 1) ("b" . 4) ("c" . 2) ("d" . 7) ("e" . 5) ("a" . 2) ("b" . 1))
; Now we can implement the counting algorithm (which expects pairs), and uses
; an immutable hash table:
(define (counting-algorithm1 pairs sums)
(if (null? pairs) sums
(let* ([pair (car pairs)]
[name (car pair)]
[value (cdr pair)]
[old-value (hash-ref sums name 0)])
(counting-algorithm1 (cdr pairs) (hash-set sums name (+ old-value value))))))
(define c1-pairs (map cons collection-names collection-values))
(define c1-results (counting-algorithm1 c1-pairs (make-immutable-hash '())))
c1-results
; And run it to get the result.
; > #hash(("c" . 2) ("d" . 7) ("a" . 3) ("b" . 5) ("e" . 5))
; We'll test it to be sure this is what we expect:
(define (check-sums sums)
(check-equal? (hash-ref sums "a") 3 "Collection A")
(check-equal? (hash-ref sums "b") 5 "Collection B")
(check-equal? (hash-ref sums "c") 2 "Collection C")
(check-equal? (hash-ref sums "d") 7 "Collection D")
(check-equal? (hash-ref sums "e") 5 "Collection E"))
(check-sums c1-results)
; This seems to work correctly. The problem with it, though, is that it
; loops over the entire data set twice: the first time is to generate the pairs
; and the second time is to execute the counting algorithm. Ideally we could
; pair the values *and* execute the counting algorithm in the same step reducing
; execution time to N. We can start by looking at another way to build pairs,
; but this time doing nothing with the resulting pair.
(define (associate-values names values)
(unless (null? names)
(let ([name (car names)] [value (car values)])
(printf "~a:~a~n" name value))
(associate-values (cdr names) (cdr values))))
(associate-values collection-names collection-values)
; For which you get the result:
; > a:1
; > b:4
; > c:2
; > d:7
; > e:5
; > a:2
; > b:1
; By looking at the relationship, you can see that the collection a has 1 and 2 in
; it, and b has 4 in it, and so on. This is the same thing that we did before. The
; difference with this approach, though, is that we can integrate our
; counting algorithm right in to it.
(define (counting-algorithm2 names values sums)
(if (null? names) sums
(let* ([name (car names)]
[value (car values)]
[old-value (hash-ref sums name 0)])
(counting-algorithm2
(cdr names) (cdr values) (hash-set sums name (+ old-value value))))))
; We'll test the results here, too:
(define c2-results
(counting-algorithm2
collection-names
collection-values
(make-immutable-hash '())))
c2-results
(check-sums c2-results)
; Both approaches seem to yield the same result. The difference is in execution times
; in large data sets. Have a look:
(define data-set-size 1000000)
(define collection-names-size 5) ; Only ever use ("a" "b" "c" "d" "e")
(define collection-names-chunk-size (/ data-set-size collection-names-size))
(define large-collection-names #f)
(set!
large-collection-names
(append
(srfi/1:make-list collection-names-chunk-size "a")
(srfi/1:make-list collection-names-chunk-size "b")
(srfi/1:make-list collection-names-chunk-size "c")
(srfi/1:make-list collection-names-chunk-size "d")
(srfi/1:make-list collection-names-chunk-size "e")))
(define large-collection-values (srfi/1:make-list data-set-size 1))
(time (counting-algorithm1
(map cons large-collection-names large-collection-values)
(make-immutable-hash '())))
(time (counting-algorithm2
large-collection-names
large-collection-values
(make-immutable-hash '())))
; I ran each a few times to get the results:
; Algorithm 1: 1484, 1391, 1438, 1313, 1406
; Algorithm 2: 984, 968, 953, 969, 938
; You can process 1 million values in under a second using algorithm 2, versus 3 with
; algorithm one. I was expecting N versus 2N, not 1.5N. I wonder why there is a
; difference. Additionally, once in a while when I run algorithm 1, it takes more
; than twice as long, so for now I threw those values out.
; More research is required here.