import pymbolic.primitives as p
u_var = p.Variable("u")
We want to define an image filter. To that end, let's define a new expression node: a neighbor average.
class NeighborAverage(p.Expression):
    def __init__(self, child):
        self.child = child

    def __getinitargs__(self):
        return (self.child,)

    mapper_method = "map_neighbor_average"
img_filter = NeighborAverage(u_var)
#img_filter = u_var + u_var - NeighborAverage(u_var)
img_filter
Let's define some indexing variables:
from pymbolic.mapper import IdentityMapper
i = p.Variable("i")
j = p.Variable("j")
ii = i+1
jj = j+1
class IndexMapper(IdentityMapper):
    def map_variable(self, expr):
        return expr[ii, jj]

    def map_neighbor_average(self, expr):
        var = expr.child
        return (2*var[ii, jj] + var[ii+1, jj] + var[ii-1, jj]
                + var[ii, jj+1] + var[ii, jj-1])/6
Now apply this to our filter:
idx_mapper = IndexMapper()
print(idx_mapper(img_filter))
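The dispatch that makes this work is simple: the mapper looks up the `mapper_method` attribute stored on each expression node's class and calls the method of that name on itself. Here is a minimal pure-Python sketch of that pattern, independent of pymbolic (all class and method names here are illustrative, not pymbolic's actual API):

```python
class Node:
    mapper_method = "map_node"


class NeighborAvg(Node):
    mapper_method = "map_neighbor_avg"

    def __init__(self, child):
        self.child = child


class Mapper:
    def __call__(self, expr):
        # dispatch on the attribute stored on the expression class
        return getattr(self, expr.mapper_method)(expr)

    def map_node(self, expr):
        return "plain"

    def map_neighbor_avg(self, expr):
        # recurse into the child, then wrap the result
        return "avg(" + self(expr.child) + ")"


m = Mapper()
print(m(NeighborAvg(Node())))  # avg(plain)
```

Subclassing a mapper and overriding one `map_...` method, as `IndexMapper` does above, is exactly this pattern.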
Now let's generate some code for this, using loopy:
import loopy as lp
result_var = p.Variable("result")
Observe the two parts of the loopy kernel description: the loop domain and the list of instructions.
knl = lp.make_kernel(
    "{[i,j]: 0<=i,j<n}",
    [lp.ExpressionInstruction(result_var[ii, jj], idx_mapper(img_filter))],
    [lp.GlobalArg("result", shape="n+2, n+2"), ...]
    )
Kernels can always be inspected; simply print them:
print(knl)
Let's move towards running this code. To do so, we need pyopencl:
import numpy as np
import pyopencl as cl
import pyopencl.array
import pyopencl.clrandom
ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx)
And some data to work with:
n = 200
import scipy.misc
u = scipy.misc.imread("cat.jpeg").astype(np.float32).sum(axis=2)/(3*256)
u = cl.array.to_device(queue, u)
import matplotlib.pyplot as pt
pt.imshow(u.get(), cmap="gray")
Now run the code, and tell loopy to print what it generates:
knl = lp.set_options(knl, write_cl=True)
result = cl.array.zeros_like(u)
_ = knl(queue, u=u, result=result, n=n)
u = result
pt.imshow(u.get(), cmap="gray", vmin=0, vmax=1)
That's obviously not very parallel. Introduce parallelism:
tknl = knl
tknl = lp.tag_inames(tknl, {"i": "g.0", "j": "g.1"})
evt, (result,) = tknl(queue, u=u)
But OpenCL/CUDA require blocking to be efficient!
sknl = knl
sknl = lp.split_iname(sknl,
"i", 16, outer_tag="g.1", inner_tag="l.1")
sknl = lp.split_iname(sknl,
"j", 16, outer_tag="g.0", inner_tag="l.0")
evt, (result,) = sknl(queue, u=u)
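`split_iname` replaces the single loop over `i` with two nested loops satisfying `i == 16*i_outer + i_inner` with `0 <= i_inner < 16`; the tags then map the outer index to the OpenCL work-group number and the inner one to the work-item number. The index decomposition itself can be sketched in plain Python (a sketch for illustration, not loopy's internals):

```python
def split_index(i, size=16):
    # decompose a flat index into (outer, inner) so that
    # i == size*outer + inner and 0 <= inner < size
    return divmod(i, size)


outer, inner = split_index(37)
print(outer, inner)  # 2 5
assert 16*outer + inner == 37
```

With both `i` and `j` split this way, each 16x16 tile of the image becomes one OpenCL work-group.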
How about some data reuse?
sknl = knl
sknl = lp.split_iname(sknl,
"i", 16, outer_tag="g.1", inner_tag="l.1")
sknl = lp.split_iname(sknl,
"j", 16, outer_tag="g.0", inner_tag="l.0")
sknl = lp.add_prefetch(sknl, "u",
["i_inner", "j_inner"],
fetch_bounding_box=True)
evt, (result,) = sknl(queue, u=u, n=n)
Here's a variant of the transformation targeting CPU execution:
cpuknl = knl
cpuknl = lp.split_iname(cpuknl, "i", 16, outer_tag="g.0")
cpuknl = lp.split_iname(cpuknl, "j", 8, inner_tag="l.0", slabs=(0,1))
cpuknl = lp.set_loop_priority(cpuknl, "i_inner, j_outer")
evt, (result,) = cpuknl(queue, u=u, n=n)
Let's time the execution of our "blur" operation on a genuinely big image. (We'll make do with random data for this timing.)
from time import time
rounds = 4
n = 5000
u = cl.clrandom.rand(queue, (n+2, n+2), dtype=np.float32)
uh = u.get()
First, numpy:
t_start = time()
for i in range(rounds):
    blurred_u = np.zeros_like(uh)
    blurred_u[1:-1, 1:-1] = (
        2*uh[1:-1, 1:-1]
        + uh[2:, 1:-1]
        + uh[:-2, 1:-1]
        + uh[1:-1, 2:]
        + uh[1:-1, :-2]
        )/6
print((time()-t_start)/rounds)
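As a sanity check on the stencil itself, here is a pure-Python reference of the same neighbor average on a tiny grid (no numpy; purely illustrative):

```python
def blur(u):
    # 5-point neighbor average (2*center + N + S + E + W)/6,
    # applied to interior points only
    n = len(u)
    out = [[0.0]*n for _ in range(n)]
    for i in range(1, n-1):
        for j in range(1, n-1):
            out[i][j] = (2*u[i][j] + u[i+1][j] + u[i-1][j]
                         + u[i][j+1] + u[i][j-1])/6
    return out


u = [[1.0]*4 for _ in range(4)]
print(blur(u)[1][1])  # 1.0: averaging a constant image leaves it unchanged
```

This matches the expression `IndexMapper` produced earlier, with the slice arithmetic replaced by explicit loops.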
Next, our generated code:
queue.finish()
t_start = time()
for i in range(rounds):
    evt, (result,) = cpuknl(queue, u=u)
queue.finish()
codegen_time = (time()-t_start)/rounds
print(codegen_time)
Now estimate performance:
pixels_accessed = 6*(n+2)**2
bytes_accessed = pixels_accessed * 4
gbytes_per_second = bytes_accessed / codegen_time / 1e9
print(gbytes_per_second)
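The estimate counts six array accesses per pixel (five reads of `u` plus one write of `result`) at four bytes each, since the data is `float32`. As a worked example with a made-up timing value (the measured `codegen_time` from your run goes here):

```python
n = 5000
codegen_time = 0.05  # seconds; hypothetical measurement for illustration

pixels_accessed = 6*(n+2)**2          # 5 reads + 1 write per pixel
bytes_accessed = pixels_accessed * 4  # float32 is 4 bytes
gbytes_per_second = bytes_accessed / codegen_time / 1e9
print(round(gbytes_per_second, 2))  # 12.01
```

Comparing this figure against the memory bandwidth of your machine tells you how close the generated code is to being memory-bound.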