Saturday, November 14, 2020

Rare intersecting with rare is pretty much empty

Having good intuition about rare events is actually pretty difficult.

Thinking in terms of the plots below helps me with the subject.

This is all in this Jupyter notebook.

(Note: formatting below is iffy, as the original is in a Jupyter notebook.)


#!/usr/bin/env python3

%matplotlib inline

import numpy as np

import matplotlib.pyplot as plt


An illustration of how some things are rare, and others common

Our "test" population is a thousand

In [2]:

n=1000

Rare numbers are numbers that are special in one way or another. Here, rarity means "a uniformly random number between 0 and 1 that is close to one". We can find such numbers by raising uniform random numbers to a high power (e.g. 100): anything not close enough to one gets "crushed" towards 0.
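For example (a quick check added here, not part of the original notebook): only values very close to one survive a power of 100; everything else collapses towards zero.

print(np.power([0.999, 0.99, 0.9, 0.5], 100))  # approximately [0.905, 0.366, 2.7e-05, 7.9e-31]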

In [3]:

def mk_rare():

    return np.power(np.random.uniform(0.0, 1.0, n), 100)

rare = mk_rare()

 

The plot of the rare numbers:

In [4]:

plt.plot(rare)

plt.show()

The common numbers are all random numbers that are not close to zero. To be complementary to how we found the rare numbers, common numbers can be found by taking a high-order "root": the closer the result is to one, the more common the number.

In [5]:

def mk_common():

    return np.power(np.random.uniform(0.0, 1.0, n), 1.0/100.0)


common = mk_common()

 

Plotting the common numbers

In [6]:

plt.plot(common)

plt.show()

 



Let's count how many rare numbers are above a given threshold (between 0 and 1).

In [7]:

def count_above(s, skew_selection_for_better_plot):
    # thresholds between 0 and 1, skewed so the interesting part of the curve shows up in the plot
    x = np.power(np.arange(0.0, 1.0, 0.01), skew_selection_for_better_plot)
    # for each threshold, count how many values of s lie above it (via broadcasting)
    y = (s > x[:, np.newaxis]).sum(axis=1)
    plt.plot(x, y)
    plt.show()

count_above(rare, 10)

 


We can do the same for our common numbers:

In [8]:

count_above(common, 1/10)


 

The plots above are what you would expect: there are very few rare numbers above most thresholds, while most common numbers are above most thresholds (between 0 and 1).
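To make this concrete, here is a quick check (an addition, not in the original notebook); with n = 1000 you should typically see only a handful of rare numbers above 0.5, versus essentially all of the common ones.

print((rare > 0.5).sum(), (common > 0.5).sum())  # e.g. something like 7 vs. 1000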

Let's do one more thing: multiply rare and common sets of numbers together. First, let's generate a few rare and common sets.

In [9]:

a_rare = mk_rare()

b_rare = mk_rare()

c_common = mk_common()

d_common = mk_common()

 

We multiply a few

In [10]:

ab = a_rare*b_rare

ac = a_rare*c_common

cd = c_common*d_common

 

and plot the results

In [11]:

count_above(ab, 10)



In [12]:

count_above(ac, 10)

 


In [13]:

count_above(cd, 1/10)

 


The plots above confirm the following:

  • rare and rare is pretty much empty (when uncorrelated!)

  • rare and common is still rare

  • common and common is common
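A quick numeric check of these three cases (again an addition, not in the original notebook): counting the products above 0.5 typically gives zero for rare times rare, at most a few for rare times common, and nearly all for common times common.

print((ab > 0.5).sum(), (ac > 0.5).sum(), (cd > 0.5).sum())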


All original content copyright James Litsios, 2020.

Sunday, October 18, 2020

Healthy, noisy, productive Python

Asked about my feelings about Python, “speckle noise” came to mind. The bigger subject here is Python's limited types and its inherent noisiness in capturing deeper invariant properties (hence the speckle remark).


Speckle noise is the noise that arises from the effect of environmental conditions on an imaging sensor during image acquisition. The analogy is that there is inherent noise in how one captures ideas in Python, and that this noise has "good Pythonic character". By noise, I mean something that cannot be precisely mastered. Note that "real" speckle noise is not "good" or "bad", it just is.


All programming languages are "noisy". Yet the way that noise affects you as a developer varies greatly. The "messiness" of computer languages may hurt you, but it may also help you (e.g. by improving your team's productivity). Said differently, sometimes "cleanliness" is unproductive. The main idea is the following:


People naturally care about uncertainty. Therefore, sometimes, we naturally focus on things that are not certain. As a bonus, we are naturally social around uncertain topics (think of the weather!), in part because we are happy to share when no one has an absolute truth, but also because sharing helps us deal with these uncertainties. Finally, there are many situations where an "external" nudge is needed to move out of a local minimum. (I mention here the suggestion that financial markets need a bit of uncertainty to be healthy, and here how I was once stuck in a bad local C++ design.)


People naturally build on certainties. And when we do so, we partly lock ourselves in, because it would cost us to change what was certain and rebuild things.


This game of "certainty", "uncertainty", "building and locking ourselves in", and "not finding a solid base to build on" is what happens when we program. Our choice of software language strongly affects how this happens. I have programmed and managed teams using Python, Haskell, F#, C++, Pascal, C, and Fortran (plus exotic languages like OPS5). Each of these languages is "robust" at a different level, some with more impact than others.


Python, for example, is a language where expressions and functions come first, and types (e.g. lists, objects, classes, ...) are mostly a thin way to group functions and data together. Contrast this with Haskell, where types are more important than expressions. The result is that new concepts are quickly captured in Python, and are considerably harder to capture in Haskell. However, it is quite difficult to capture deeper invariant properties of new concepts in Python, something that is easy to do in Haskell, with its strong types.


We might summarize by stating that Python has noisy types. At least that is often the way I feel when "dragging" concepts from one expression to another using "glue" lists, dictionaries, objects, or tuple structures, just to make things work. Add to that Python's limited dispatch logic, which forces yet more ad hoc constructions into your expressions. Yet the magic of the real world is that such noise-creating properties are not necessarily bad!
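A toy illustration of that "glue" feeling (a small hypothetical example, added for illustration): the same concept of a trade can be dragged through code as a tuple, a dict, or an ad hoc object, and nothing in the language pins down which of these shapes is the concept.

trade_as_tuple = ("AAPL", 100, 123.5)
trade_as_dict = {"symbol": "AAPL", "qty": 100, "px": 123.5}

def notional(trade):
    # only works for the dict shape; the tuple shape needs different glue
    return trade["qty"] * trade["px"]

print(notional(trade_as_dict))                # 12350.0
print(trade_as_tuple[1] * trade_as_tuple[2])  # same value, different implied "type"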


A few years ago, I hired Haskell developers to build a system with "crystalline design properties". This had been one of my goals since being responsible for a "pretty messy design" in the late 90's. Therefore I co-founded a company with, in part, the goal of building a "single coherent distributed system". It is not easy to create a system where all complementary concerns fit precisely together and everything exists within coherent contexts. In fact, it only makes sense if you need it, for example to ensure trust and security. Now imagine the developers in the team working on such a "single coherent design". In such a development, no engineer can make independent decisions. In such a design, no code can be written that does not fit exactly with the rest of the code. How then to create common goals that map into personal team member tasks so as to avoid a design deadlock? The simple answer might be: make sure you have failed before in many ways, so as to avoid repeating those failures. Yet still, that does not avoid design deadlock. The hint of the approach is this: for every dimension of freedom that you need to remove to guarantee the strength of your design, make sure to add an additional free, non-critical dimension to help individuals still have a form of personal independence. In addition, I will add that it took me a lot of micro-management of vision and team dynamics to make that development a great success.


With “speckle noise”, especially at the type level, no such problems! There is no single crystalline unified software design. There is no coherency that is assured across your system. Python naturally accumulates imperfections which are just too expensive to keep precisely tamed with each addition of new code. This means that developers can agree on similar and yet different Python designs. In some sense, one agrees to compare naturally fuzzy design views. And by doing so, one naturally protects one’s ego, as there is always a bit of room to express individual choices.


This may sound like Python bashing. It is not. This expensive-to-design-right property is common to most programming languages. This post is in fact a “praise Python” post. If Python only had “a certain fuzziness”, it would be not much better than Visual Basic (to be slightly nasty). Python is not “just another language”; it is a language where design logic is cheap to change because its messy types are in fact naturally "spread apart". That is, the "noisy" Python type property results in "not too dense type relations", allowing changes to be made in one (implied) type without affecting the core of the other (implied) types.


PS: I mentioned Fortran above really only because I like numpy's stride_tricks.as_strided and it reminds me of the large equivalence array structures I used in Fortran when I was a teenager.


All original content copyright James Litsios, 2020.





Sunday, August 16, 2020

25'000 views; Some topics I am passionate about

This week I took a day off to walk in the Alps with my son, and while doing so I noticed that my blog view count was 24999. So I quickly asked my son to be the next view!

My blog has been around for more than ten years, and a tenth of a second is all a site like Google would need to achieve the same hit count. Still, I write this infrequent blog because it helps me find what is relevant to me and to others, and helps me communicate better. It also encourages me to do my best, because I do care about my audience.

I write infrequently mostly because I do not have the time. Also, my topics tend to be complex, and sometimes even sensitive, and therefore take time. Still, I often jot down a title, sometimes with a page or two of content. These then become public if I happen to have a free evening or weekend. For example, these are my unpublished blog titles going back to early 2019:

  • Balancing productivity, tempo, and success
  • Program = complementary
  • Some retrospectives in business of innovation
  • Type centric vs ...
  • The instantaneous view of the real world has no math
  • Control and flow in innovation
  • Lessons learned: writing "coherent software"
  • Careful Software Architecture (in Python)
  • Absent minded at work
  • Choose your inner truth: run your organization like a hedge fund
  • Dealing with time
  • Computer languages as data for machine learning
  • My reality? Or your reality?
  • The Peter principle in software development
  • Innovate, but don't waste your time
  • ...
Out of these, you might notice topics that I am passionate about:
  • Innovation
  • Productivity
  • Formal software and architecture
  • Teamwork and organization processes
  • Modelling the real (and unreal!) world
The thing is: You cannot innovate if you are not productive. You cannot be productive if your teams and organizations do not work well together. You cannot work well together if you do not understand how to express the real world in your work, and to be precise and correct when needed. These are topics that I care about, and what this blog has mostly been about.

Cheers to all, I am looking forward to sharing some more writing!


Saturday, August 01, 2020

Finding constrained boundary of z-ordered / Morton code regions (pre-2000 coding)

Many years ago, while doing my thesis, I played a bit with z-order / Morton coding (see https://en.wikipedia.org/wiki/Z-order_curve). So I share here a little snippet of code from ~1995 that I happened to fall upon while cleaning up an old laptop. I have not seen the equivalent logic shared anywhere, so I took an hour to write it up in Python (Jupyter notebook style, uploaded here). I would be curious to know if it can be found elsewhere.

First, a bit of historical context. My first job (before my PhD) had me writing a Design Rule Checker (DRC) for the world's first analog CAD tool, developed at the CSEM https://www.csem.ch/Home (see ILAC https://ieeexplore.ieee.org/document/18603). Among many things, a DRC does boolean operations on layers of integrated circuit masks, and when these operations produce unexpected results, you know that your CAD tool has made a mistake. For this blog, the relevant point is that a DRC must efficiently "join" geometric datasets, and the question is then how we should structure the data of these geometries for optimal performance. At the CSEM I was using K-d trees (https://en.wikipedia.org/wiki/K-d_tree), yet one of the data structures that came out during this period was the skip list (https://en.wikipedia.org/wiki/Skip_list), and because the skip list is only a one-dimensional structure, I thought it would be fun to use z-ordering to join two- or more-dimensional structures with skip lists.

The difficulty is that keys in the skip list space have only one dimension, and comparing these keys is not good enough for high-performance geometry operations. We need a way to take these one-dimensional keys and optimally project them into the different dimensions of the geometry space. This is what the function "next_above" does below. You give it a boundary of z-space defined by its maximum z-order value, as well as a direction and a reference value, and it tells you the next z-order value at the edge of the z-space that is aligned with the reference value along the given direction.

This is a 2d z-order example, but the concept works in any dimension.

We need some basic reference values:


log_bit_count = 5

bit_count = (1<<log_bit_count)

mask =  0xFFFFFFFFFFFFFFFF

mask0 = 0x5555555555555555

mask1 = mask0>>1  # not really but avoids python issues


This function finds the next constrained z-order value past a z-order region. Copyright 2020 James Litsios.

In [3]:

def next_above(x, m, z):
    a = (x&~m)|(z&m)
    b = x^a  # different bits
    v = b
    for kp in range(0, log_bit_count+1):
        k = 1<<kp
        v |= (v>>k)  # smear the differing bits down
    w = ((v+1)>>1)&x  # keep the top differing bit if it is set in x
    if v==0:
        w = 1  # case where already on border
    nx = (((x|m)+w)&~m)&~v
    y = nx | (z&m)
    return y


From the web, some utility code to demonstrate things:

In [4]:

def to_morton2(x, y):
    # taken from https://stackoverflow.com/questions/30539347/2d-morton-code-encode-decode-64bits
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0F
    x = (x | (x << 2)) & 0x3333333333333333
    x = (x | (x << 1)) & 0x5555555555555555

    y = (y | (y << 16)) & 0x0000FFFF0000FFFF
    y = (y | (y << 8)) & 0x00FF00FF00FF00FF
    y = (y | (y << 4)) & 0x0F0F0F0F0F0F0F0F
    y = (y | (y << 2)) & 0x3333333333333333
    y = (y | (y << 1)) & 0x5555555555555555

    return x | (y << 1)


def from_morton1(x):
    x = x & 0x5555555555555555
    x = (x | (x >> 1)) & 0x3333333333333333
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFF
    x = (x | (x >> 16)) & 0x00000000FFFFFFFF
    return x


def from_morton2(d):
    x = from_morton1(d)
    y = from_morton1(d >> 1)
    return x, y
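A quick sanity check of the masks defined earlier (an added aside, using the helpers above): mask0 keeps the bits of the first interleaved coordinate and mask1 keeps the bits of the second (the unused top bit does not matter at this grid size).

print(from_morton2(to_morton2(21, 10) & mask0))  # (21, 0)
print(from_morton2(to_morton2(21, 10) & mask1))  # (0, 10)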

In [5]:

import numpy as np


Choose a size for the example:

In [6]:

size = 64


Let's color a grid with two regions of lower ranges of z-ordered values:

In [7]:

z0, z1, z2 = (0, size*size//10, 2*size*size//3)

grid1 = np.ones((size+1,size+1,3))

for z in range(z0, z1):

    x, y = from_morton2(z)

    grid1[x,y,:] = (0,0.5,0)


for z in range(z1, z2):

    x, y = from_morton2(z)

    grid1[x,y,:] = (0.2,0.7,0.2)


x, y = from_morton2(z1)

grid1[x,y,:] = (1.0,0,0)


x, y = from_morton2(z2)

grid1[x,y,:] = (1.0,0,0)


What we have done is to mark in green the (x, y) values corresponding to all z-order values up to z2, which corresponds to the coordinate (0, 63). For clarity there are two zones, and the last value (0, 63) is shown in red, as is the coordinate (21, 10), which corresponds to z1.

In [8]:

from_morton2(z1)

Out[8]:

(21, 10)

In [9]:

from_morton2(z2)

Out[9]:

(0, 63)

In [10]:

import matplotlib.pyplot as plt

plt.figure(figsize = (5,5))

plt.imshow(grid1)

plt.show()


 


Now we choose twenty "lookup keys" to demonstrate what "next_above" does:

In [11]:

keys = np.random.randint(0,size,(20,))

In [12]:

keys

Out[12]:

array([34, 49, 10, 12, 45, 61, 10, 59, 16, 31, 37, 23, 12, 36, 27, 28, 26,

      


First we find the next z-order value outside the two z-order regions that aligns "horizontally" with the given keys:

In [13]:

grid2 = grid1.copy()

for key in keys:

    grid2[key, :, :] *= 0.9

    grid2[key, 0, :] = (0.7,0.7,0.7)

    zabove1 = next_above(z1, mask0, to_morton2(key, 0))

    xa1, ya1 = from_morton2(zabove1)

    grid2[xa1,ya1,:] = (1,0,1)

    zabove2 = next_above(z2, mask0, to_morton2(key, 0))

    xa2, ya2 = from_morton2(zabove2)

    grid2[xa2,ya2,:] = (1,0,1)

In [14]:

plt.figure(figsize = (5,5))

plt.imshow(grid2)

plt.show()


 


Then, for demonstration purposes, we find the next z-order value outside the two z-order regions that aligns "vertically" with the given keys:

In [15]:

grid3 = grid1.copy()

for key in keys:

    grid3[:, key, :] *= 0.9

    grid3[0, key, :] = (0.7,0.7,0.7)

    zabove1 = next_above(z1, mask1, to_morton2(0, key))

    xa1, ya1 = from_morton2(zabove1)

    grid3[xa1,ya1,:] = (1,0,1)

    zabove2 = next_above(z2, mask1, to_morton2(0, key))

    xa2, ya2 = from_morton2(zabove2)

    grid3[xa2,ya2,:]= (1,0,1)

In [16]:

plt.figure(figsize = (5,5))

plt.imshow(grid3)

plt.show()



The original ~1995 Prolog version of the algorithm:

xor X Y = (X or Y) and not (X and Y);


shiftDown V = shiftDown2 1 V;

shiftDown2 32 V = V;

shiftDown2 K V = shiftDown2 (2*K) (V or (shr V K));


nextAbove X M Z = Y

where

    A = (X and not M) or (Z and M),

    B = X xor A,

    V = shiftDown B,

    Y = nextAbove2 X M Z V (ifelse (V=0) 1 ((shr (V+1) 1) and X));


nextAbove2 X M Z V W = Y

where

    NX = (((X or M) + W) and not M) and not V,

    Y = NX or (Z and M);

C++ version from ~2003:

int

NextAbove(

        int x,

        int m,

        int z){


        int a = (x&~m)|(z&m);

        int b = x^a; // different bits

        int v= b;

        {

                for(int k=1; k<32; k<<=1){ //smeared bits down

                        v |= (v>>k);

                }

        }

        int w = ((v+1)>>1)&x; // keep top bit if it is in x

        if(v==0){

                w = 1; // case where it is already on "border"

        }

        int nx = (((x|m)+w)&~m)&~v;

        int y = nx | (z&m);

        return y;

}