In this enlightening video, I delve into the deep and fascinating world of image processing. Forget everything you thought you knew about pixels – they’re not squares or rectangles and they definitely aren’t discs. All pixels are, my friends, point samples, each capturing brightness or color at a particular position. Curious to know how to manipulate them? I also unravel this mysterious tapestry, familiarizing you with the technicalities of grayscale, RGB, HSB, YUV, and a fleeting mention of CMYK. And if you think that's all, hold tight! Did you know we could apply Fourier transforms, akin to graphic equalisers used in audio, to our good old two-dimensional images? Strap in as I guide you through this potentially overwhelming realm with a pinch of humor, lots of simplicity, and oodles of practical examples.
[0:00] So, adventures in image processing. I gave this talk at iOS Dev UK but I think it should
[0:05] be interesting to most people and there’s only a few things that are specific to iPhones in this
[0:09] talk. Image processing is a massive subject. Just look at all these books. I’m going to try to give
[0:15] a taster of what’s possible and then I’ll do a couple of worked examples to show what can be done.
[0:19] So, let’s start off with the fundamentals. In the context of image processing, we’re generally
[0:24] talking about a regular 2D grid of pixels. But what you may ask is a pixel, so let’s do that
[0:30] first. Part 1, what is a pixel? Pixel is a contraction of the words picture element.
[0:36] We have pix, which is derived from pics and was first used in the 1930s, and we have el from element.
[0:42] The word pixel appears in the 1960s. Now you might also occasionally see them referred to as pels,
[0:48] but that’s quite old fashioned now. So, in image processing, we’re generally talking about a
[0:53] regular 2D grid of pixels and you’re probably picturing something like what is currently on
[0:57] the screen. It’s just like in the olden days when we used to play computer games and yes,
[1:02] this was what was called entertainment in those days. So, the logical conclusion is that a pixel
[1:08] is a square or at worst, it’s a rectangle. Well, this is kind of wrong. A pixel is not a square
[1:13] and it’s not a rectangle either. Just look at how it ends up being displayed. And don’t get me
[1:19] started on how images are captured. Even if we have a sensor that is a square shape, we’ll be
[1:23] taking the average value over that square and probably applying all sorts of processing before
[1:28] we get the actual data. So, what is a pixel? Well, our grid of pixels is actually a grid of point
[1:34] samples. Each pixel is a measurement taken at a particular position. It’s the colour or brightness
[1:40] at that point and the point is infinitely small. So, a question that will spring into mind is,
[1:45] what about the space between the samples? What value do they have? And the simple answer is,
[1:50] we just don’t know. When we made the image, we sampled the values at these positions.
[1:54] We didn’t take any measurements in between the samples. How you decide to reconstruct the image
[1:59] from these samples is entirely up to you and you may just decide to use squares. But we all know
[2:05] the world is not made up of little squares, unless of course we’re in the matrix, in which case we’ve
[2:10] got bigger problems than pixels not being squares. The problem is, we’re trying to represent
[2:15] something that is a continuous function and has values at every position using a finite set of
[2:20] measurements. Now of course, this does have some implications. If you’ve done any signal or audio
[2:25] processing, you’ll know all about the Nyquist theorem. This says that we need to sample the
[2:29] continuous function at at least twice the highest frequency present in the function. And if you’re
[2:34] not careful, you can get all sorts of weird effects. Just look at these wibbly-wobbly bricks.
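A minimal sketch of that aliasing effect in Python with OpenCV; the file name and the down-sampling factor here are placeholders, not from the talk:

```python
import cv2

# Placeholder input: any image with fine repeating detail (brickwork works well).
img = cv2.imread("bricks.png", cv2.IMREAD_GRAYSCALE)

# Naive down-sampling: keep every 8th sample. Detail finer than the new
# sampling rate can represent folds back into the image as wavy moiré patterns.
aliased = img[::8, ::8]

# Band-limit first with a Gaussian blur (removing the highest frequencies),
# then take the same samples; the wobble largely disappears.
filtered = cv2.GaussianBlur(img, (0, 0), 4)[::8, ::8]

cv2.imwrite("aliased.png", aliased)
cv2.imwrite("filtered.png", filtered)
```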
[2:40] So a pixel is not a box, it’s not a disc, it’s not a teeny tiny light. A pixel is a sample,
[2:47] it has no dimensions, it occupies no area. You can’t see it because it’s infinitely small.
[2:52] It’s got a position, the location on the grid, and it’s got a value, the brightness or colour.
[2:57] Now, why do we even care about this? Well, when you scale an image up, you’re effectively creating
[3:03] new pixels. And as we’ve discussed, we didn’t capture any information for these new pixels,
[3:08] so we need to interpolate between the pixels. And this just doesn’t make sense if the pixels
[3:12] are squares. Filters like Gaussian blur, sharpening or edge detection work by altering a pixel’s value
[3:18] based on the values of the surrounding pixels. Again, this doesn’t make any sense if we’re
[3:23] treating pixels as squares. The same is true of image rotations and transformations. We inevitably
[3:29] end up trying to work out what is in between the pixels. We also want to apply techniques like
[3:34] Fourier transforms that treat images as a collection of sinusoidal components with each
[3:38] pixel contributing to the overall frequencies. This only works if our pixels are samples.
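To make the “interpolating between samples” point concrete, here is a minimal NumPy sketch of bilinear interpolation; the tiny 2x2 input and the scale factor are purely illustrative:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Estimate the value at a fractional position (x, y) by blending
    the four surrounding point samples, weighted by distance."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, img.shape[1] - 1)
    y1 = min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom

# Upscale a tiny 2x2 "image" by a factor of 4 using the sampler above.
small = np.array([[0.0, 1.0], [1.0, 0.0]])
big = np.array([[bilinear_sample(small, x / 4, y / 4)
                 for x in range(5)] for y in range(5)])
print(big.round(2))
```

Nothing in this says squares: the new values are estimates between point samples, and you could just as well use nearest-neighbour, bicubic or something fancier.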
[3:43] So, I’ve talked a lot about the shape of a pixel. What about its value?
[3:47] Well, the value you’re most familiar with will be actually three values,
[3:51] red, green and blue or RGB. This is an additive colour model where we can create many colours
[3:57] by adding different proportions of red, green and blue together. Now, typically in image processing
[4:02] we’ll have one unsigned byte for each colour, which gives us over 16 million colours to pick from.
[4:06] But what’s really interesting is that it’s actually impossible to represent the full range
[4:10] of colours that we can see using RGB. Next up are Hue/Saturation/Value and Hue/Saturation/
[4:16] Brightness. Confusingly, these are exactly the same thing, so I’ll just call it HSB from now on.
[4:21] This can be a really useful colour space, particularly if you’re interested in the colour
[4:25] of each pixel. HSB has a cylindrical geometry. H is the hue of the pixel and is the angle around
[4:32] the cylinder. Zero is red, green is at 120, blue is at 240 and it wraps back around to red at 360.
[4:41] S is the saturation of the pixel. Zero means there is no colour and one means the colour
[4:46] is fully saturated. And V is the brightness of the pixel. Again, these values would typically
[4:52] be stored in 8-bit values. And one slight trap for the unwary is that various bits of software
[4:58] will encode the range of hue values in different ways. OpenCV uses 0-179. Other software may map
[5:05] the 0-360 degree values onto 0-255, so just be careful and check the documentation of your software.
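A minimal sketch of that hue-range trap using OpenCV’s Python bindings (the pixel values here are just illustrative):

```python
import cv2
import numpy as np

# A single pure-red pixel as an 8-bit BGR image (OpenCV's default channel order).
red = np.uint8([[[0, 0, 255]]])
print(cv2.cvtColor(red, cv2.COLOR_BGR2HSV))    # [[[  0 255 255]]] -- hue 0, fully saturated, full brightness

# OpenCV squeezes the 0-360 degree hue range into 0-179 so it fits in a byte.
# Pure green (120 degrees) therefore comes out as 60, and blue (240) as 120.
green = np.uint8([[[0, 255, 0]]])
print(cv2.cvtColor(green, cv2.COLOR_BGR2HSV))  # [[[ 60 255 255]]]
```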
[5:12] The next thing that you’ll be familiar with is black and white or more accurately, grayscale.
[5:18] This is simply the amount of light at each pixel. Again, we’d usually have one unsigned byte for
[5:23] this value, giving us 256 different shades of grey. You’ve got some options when converting from RGB
[5:30] to grayscale. You can just take the average of the three RGB components, but that is not perceptually
[5:35] correct. For humans, green is much more visible than red and red is much more visible than blue.
[5:41] You can see this here. The average version is just not bright enough. What we can do is use
[5:46] this weighted formula, and this gives us a perceptually accurate grayscale version.
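As a sketch, the commonly used weighted conversion looks like this in NumPy. The weights shown are the standard ITU-R BT.601 luma weights; the talk’s slide may use slightly different ones:

```python
import numpy as np

def to_grayscale(rgb, weighted=True):
    """Convert an image with RGB channel order (H x W x 3, uint8) to grayscale.

    weighted=True uses the ITU-R BT.601 luma weights, which match human
    sensitivity (green counts most, blue least); weighted=False is the
    naive average, which tends to come out too dark.
    """
    rgb = rgb.astype(np.float32)
    if weighted:
        gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    else:
        gray = rgb.mean(axis=2)
    return np.clip(gray, 0, 255).astype(np.uint8)
```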
[5:51] In a lot of image processing, you may not care what colour a pixel is. It’s often a lot easier
[5:56] and more efficient just to convert to grayscale right at the start of processing and ignore colour
[6:00] altogether. YUV is another colour space that you might have come across. Originally developed for
[6:06] television broadcasting, it describes a pixel using three components. Y for the luminance or
[6:12] brightness, U for the difference between blue and the luminance, and V for the difference between
[6:17] red and the luminance. This is a really useful colour space for storing images and videos. It’s used by
[6:22] JPEG and it’s used by almost all video codecs. The reason for this is that humans are much more
[6:28] sensitive to variations in brightness than to variations in colour. This means that we can
[6:32] encode the Y component at a high resolution and the U and V components at a lower resolution.
[6:38] Generally, you’ll find YUV 4:2:0 is used. This means in each 4x2 block of pixels, you have a
[6:45] luminance sample for every pixel, but only two U and two V samples, taken on the first row. Those chroma
[6:51] samples are then reused for the second row. It’s easy enough to show how well this works. I’ve taken an image and
[6:56] down-sampled the U and V components. This is the original image and here’s a set of images with
[7:02] increasing amounts of down-sampling. For the final image, I down-sampled the U and V by a factor of 32.
[7:07] You can see some artefacts but it’s still pretty good. Here’s the 8x8 down-sampling in more detail.
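An experiment like this can be sketched with OpenCV along these lines; the file name and factor are placeholders, and OpenCV’s YCrCb conversion stands in for YUV:

```python
import cv2

img = cv2.imread("photo.png")  # placeholder file name

# Convert to a luminance + chrominance space (YCrCb is OpenCV's close cousin of YUV).
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
y, cr, cb = cv2.split(ycrcb)

factor = 8  # try 2, 4, 8, 16, 32 to watch the artefacts grow

# Shrink the two colour channels, then blow them back up to full size.
small_cr = cv2.resize(cr, None, fx=1 / factor, fy=1 / factor, interpolation=cv2.INTER_AREA)
small_cb = cv2.resize(cb, None, fx=1 / factor, fy=1 / factor, interpolation=cv2.INTER_AREA)
cr_up = cv2.resize(small_cr, (cr.shape[1], cr.shape[0]), interpolation=cv2.INTER_LINEAR)
cb_up = cv2.resize(small_cb, (cb.shape[1], cb.shape[0]), interpolation=cv2.INTER_LINEAR)

# Recombine full-resolution luminance with low-resolution colour.
result = cv2.cvtColor(cv2.merge([y, cr_up, cb_up]), cv2.COLOR_YCrCb2BGR)
cv2.imwrite("downsampled_colour.png", result)
```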
[7:14] You can see the luminance is at full resolution and the two colour channels, U and V, at much lower
[7:19] resolution but the final reconstructed image is still really good. It’s this human perception of
[7:25] being more sensitive to light and dark compared to colours that allowed old films to be painted over
[7:30] by hand with splashes of colour to give the appearance of colour film. The last colour space
[7:35] I want to mention is CMYK. Now it’s highly unlikely you’ll ever need to worry about this
[7:40] unless you have to print something. This is a subtractive colour scheme and has four components
[7:46] C, M, Y and K: cyan, magenta, yellow and key, or black. Now getting good print quality would be a
[7:53] talk in and of itself so I’m not going to cover it here. There are a whole bunch more colour spaces
[7:58] way too many to cover in this short talk and some of them are very specialised. So let’s get back to
[8:04] the fact that we are sampling a continuous function. This leads us neatly onto our first set
[8:09] of image processing techniques. If you lived through the 80s you’ll be very familiar with
[8:14] the graphic equalisers that you had on old hi-fis. An audio signal is just a one-dimensional
[8:20] continuous function so we can convert it from the time domain to the frequency domain. This lets you
[8:26] play around with the frequencies in the audio and then convert it back to the time domain and
[8:30] reconstruct your audio. You can do all sorts of interesting processing. Well we can do the same
[8:35] with our two-dimensional images. We can do a 2D Fourier transform and get the frequency domain
[8:41] representation of our image. We can do a lot of things in the frequency domain. The most obvious
[8:46] is low pass and high pass filtering. We can simply remove the high frequencies, for example by
[8:51] multiplying the frequencies by a 2D Gaussian. This results in a blurred image. We’ve low pass filtered
[8:58] our image. Or we can go the other way and knock out the low frequencies instead. This will
[9:03] highlight the edges and small features. We’ve high pass filtered our images. But there are more
[9:09] interesting things we can do in the frequency domain. Suppose we’ve got an image that has been
[9:13] contaminated with noise at a particular frequency. We can see the noise very clearly in the frequency
[9:19] domain if we compare the original image’s FFT with the noisy image’s FFT. If we knock out these
[9:25] areas of the FFT and do an inverse FFT, the noise can be cleaned up. It’s magic.
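A minimal NumPy sketch of this kind of frequency-domain filtering; the file name is a placeholder and the sigma is illustrative:

```python
import numpy as np
import cv2

img = cv2.imread("photo.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # placeholder

# Forward 2D FFT, shifted so the low frequencies sit in the centre.
spectrum = np.fft.fftshift(np.fft.fft2(img))

# Build a centred 2D Gaussian the same size as the image.
rows, cols = img.shape
y, x = np.ogrid[:rows, :cols]
dist2 = (y - rows / 2) ** 2 + (x - cols / 2) ** 2
gaussian = np.exp(-dist2 / (2 * 30.0 ** 2))  # sigma of 30 frequency bins

# Low pass: keep the centre (low frequencies) -> a blurred image.
low = np.fft.ifft2(np.fft.ifftshift(spectrum * gaussian)).real

# High pass: keep everything except the centre -> edges and fine detail.
high = np.fft.ifft2(np.fft.ifftshift(spectrum * (1 - gaussian))).real

# For periodic noise you would instead zero out the offending peaks in
# `spectrum` (a notch filter) before doing the inverse transform.
cv2.imwrite("lowpass.png", np.clip(low, 0, 255).astype(np.uint8))
cv2.imwrite("highpass.png", np.clip(np.abs(high), 0, 255).astype(np.uint8))
```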
[9:30] If you’ve done any signal processing courses, you’ll probably have been taught the convolution
[9:36] theorem. You really don’t need to understand the proof or the maths, you just need to know that
[9:40] multiplying two signals together in the frequency domain is the same as doing a convolution in the
[9:45] time or spatial domain. So what we’ve been doing in the previous couple of examples was actually
[9:51] convolution. So what is convolution? It’s a mathematical operation that combines two
[9:56] functions to produce a third function. In the context of image processing, one of these
[10:01] functions is the input image and the other is a filter or kernel. Now kernels are just small
[10:06] two-dimensional matrices that slide over the image looking at each pixel and its surrounding pixels.
[10:12] We can see that in action here. We apply the kernel to each pixel in turn, multiplying and
[10:17] adding up the total. So we can show the convolution theorem in action pretty easily. I’ve got a 3x3
[10:23] kernel that will do edge detection in the horizontal direction. You can see that when
[10:27] we apply this to our image, it emphasises horizontal edges. I can take the FFT of this
[10:33] kernel and multiply it with the FFT of our original image and we end up with almost exactly the same
[10:38] results. That’s pretty amazing. Now you may be asking, why would you do this? Well, convolution
[10:45] can be an expensive operation. You need to apply your kernel to every pixel and its surrounding
[10:50] pixels. If you have a large kernel and a large image, this can be very slow. With the FFT,
[10:56] we have the cost of the FFTs, but the cost of applying the filter is just a per-pixel multiplication,
[11:01] so it depends only on the image size, not the kernel size. Even for quite small images, the point where
[11:06] it becomes faster to do the convolution in the frequency domain comes at quite a low kernel size. With an
[11:12] image 1,024 pixels across, it’s around 8, so that’s pretty small. So what are some common kernels? Well, we’ve got the
[11:19] identity kernel, which does nothing so it’s not very interesting, but it is very good for testing.
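Here is a minimal NumPy sketch of that equivalence, using a Sobel-style horizontal-edge kernel on a random stand-in image; the actual kernel from the talk isn’t specified here, so this one is illustrative:

```python
import numpy as np

img = np.random.rand(256, 256)  # stand-in for a real grayscale image

# A 3x3 kernel that responds to horizontal edges (Sobel-style weights).
kernel = np.array([[-1, -2, -1],
                   [ 0,  0,  0],
                   [ 1,  2,  1]], dtype=np.float64)

# Spatial (circular) convolution: for each kernel entry, shift the image
# and accumulate the weighted copies -- "slide, multiply and add up".
spatial = np.zeros_like(img)
for i in range(kernel.shape[0]):
    for j in range(kernel.shape[1]):
        spatial += kernel[i, j] * np.roll(img, (i, j), axis=(0, 1))

# Frequency domain: multiply the image's FFT by the (zero-padded) kernel's FFT.
freq = np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel, s=img.shape)).real

# The convolution theorem says these should be numerically identical.
print(np.allclose(spatial, freq))  # True
```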
[11:24] We have the Sobel edge detection kernels: there’s one for vertical edges and one for horizontal
[11:29] edges. We can combine these together to get the magnitude of the edge and its direction,
[11:34] which is pretty handy for some algorithms. We have the classic Gaussian blur that we saw earlier,
[11:40] and there’s also Sharpen. There’s many more well-known kernels and basically we can create
[11:45] new ones to detect or highlight particular features in our images. Which leads us nicely
[11:50] into the modern world and a slight digression into convolutional neural networks. These neural
[11:56] networks learn convolution kernels during training and they can be used for things like image
[12:00] recognition. They can consist of multiple layers of convolution, with pooling layers for reducing
[12:05] the amount of data, typically followed by a deep fully connected network that takes the results
[12:10] of the convolutions. It can be pretty interesting to have a look at what the neural network is
[12:15] calculating, and there’s various ways of visualising what’s happening inside the
[12:19] layers. Apple have a whole bunch of models you can use, ranging from simple number
[12:24] recognition through to object detection, segmentation, depth reconstruction and even
[12:29] pose detection. Well worth having a play and they’re really easy to get up and running.
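As a rough sketch (assuming PyTorch, which isn’t mentioned in the talk), a toy network of that shape for 28x28 digit images might look like this:

```python
import torch.nn as nn

# A toy convolutional network for 28x28 grayscale digits: the convolution
# kernels here are learned during training rather than hand-designed.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 16 learned 3x3 kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 32 more learned kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),                   # fully connected layers take
    nn.ReLU(),                                    # the convolution results...
    nn.Linear(128, 10),                           # ...and produce class scores
)
```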
[12:33] So that was a nice digression into cutting edge techniques but I want to take us back to the
[12:38] 60s and 70s and do some classic image processing. One of the biggest things you’ll be doing in image
[12:43] processing is separating or segmenting the image into things that are interesting, your features,
[12:48] and things that are not important. A very basic thing you might do is thresholding. In its
[12:54] simplest form we pick a value and say that anything brighter than this is foreground and anything
[12:59] darker is background. Now you can also do the opposite, it really depends on what you think
[13:03] is foreground and what you think is background. How do we pick the value for our threshold? If we
[13:09] go too low everything’s going to be background and if we go too high everything’s going to be
[13:14] foreground. One approach to finding the right level is to look at the histogram and use the
[13:18] shape of this to determine what the threshold should be. If you know the approximate percentages
[13:23] of foreground and background pixels then you can pick a value that matches this but it is quite
[13:27] difficult to know up front. There are some nice algorithms though. One that is well known is Otsu’s
[13:33] method. This works on the assumption that you have a bimodal distribution of pixels and tries to find
[13:39] an appropriate threshold level. It does work pretty well. But there is a big problem with
[13:44] picking one number. It doesn’t work in variable lighting conditions. I have simulated a particularly
[13:50] bad example here and tried to apply Otsu’s method and it just can’t find a good value that will
[13:55] capture the data we want. So there is a really nice solution to this and that’s to use adaptive
[14:01] thresholding. This is a really powerful technique. For each pixel we look at the average brightness
[14:06] of its local neighbourhood and then use that as the threshold. One of the easiest ways to do this is to compare the image
[14:11] with a blurred version of the image. We can get really nice results from this technique and it’s
[14:16] one of my favourites. Now we’ve generally been looking at grayscale images but there is a very
[14:21] powerful way to do this involving colour. If we know the colour of the thing we’re looking for
[14:26] then we can first transform the image into the HSB colour space. We can then look for pixels that
[14:31] have the correct hue and are sufficiently saturated and bright. And this is a really
[14:35] clever technique that works amazingly well. Now, segmentation is a huge subject and it’s
[14:41] being actively researched. Simple thresholding is only the beginning. There’s a whole bunch
[14:46] of techniques and there’s also lots of deep learning approaches now as well. Another interesting
[14:51] thing that we might want to extract from an image is edges. These can be very useful for feature
[14:55] extraction and detection and we briefly touched on this when we looked at convolution. There are
[15:00] a lot of edge detection algorithms around. From the very simple convolution based ones such as
[15:05] Sobel to the more sophisticated ones like Canny edge detection. And there’s a lot of newer modern
[15:10] techniques. Some of the newer algorithms such as the HED are pretty impressive and give really nice
[15:15] results. Edge detection does lead us nicely onto one of my favourite feature extraction algorithms,
[15:21] the Hough transform. This will usually be used for detecting lines but it can also detect circles
[15:26] and ellipses. This is a very clever algorithm and it’s based on the idea of representing lines in
[15:32] polar coordinates. In polar coordinates, a line is represented by two parameters, rho, the distance
[15:38] from the origin, and theta, the angle of the normal to the line. When we do a Hough transform,
[15:43] we take each pixel in the source image and plot every possible line in polar coordinates. We do
[15:49] this for each pixel in the image and where there are multiple intersections in the Hough space,
[15:53] we have a potential line. Before applying the Hough transform, the image is usually processed using an
[15:58] edge detection algorithm such as the Canny edge detector. This results in a binary image where
[16:03] the edge pixels are set to 1 or 255 and the non-edge pixels are set to 0. We then create
[16:09] an accumulator array, often called the Hough space, with dimensions rho and theta, and we initialise
[16:15] all the cells in this accumulator to 0. We then go through each edge pixel in the image and increment all
[16:20] the bins in the Hough space for the possible values of rho and theta that the point satisfies.
[16:24] Once we process the image, we end up with peaks where there are prominent lines in the image.
[16:29] We can use the rho and theta to draw the lines back into the image. And once you’ve got these
[16:34] lines, you can find intersections which then lets you find rectangles and squares in the image.
[16:39] As I say, it’s one of my favourite algorithms but it’s largely been superseded by more modern
[16:43] techniques and finding rectangles has actually been built into iOS for ages.
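A minimal sketch of line detection with OpenCV’s built-in Hough transform; the file name is a placeholder and the thresholds are illustrative:

```python
import cv2
import numpy as np

img = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Edge detect first so that only edge pixels vote in the accumulator.
edges = cv2.Canny(img, 50, 150)

# Each detected line comes back as (rho, theta): the distance from the
# origin and the angle of the normal, exactly as described above.
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)

colour = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    for rho, theta in lines[:, 0]:
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho  # closest point on the line to the origin
        # Draw a long segment along the line's direction for visualisation.
        p1 = (int(x0 - 2000 * b), int(y0 + 2000 * a))
        p2 = (int(x0 + 2000 * b), int(y0 - 2000 * a))
        cv2.line(colour, p1, p2, (0, 0, 255), 2)
cv2.imwrite("hough_lines.png", colour)
```

From the detected lines you can then look for intersections, which is one route to finding rectangles.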
[16:48] We’re almost at the end of our tour and there’s just a few more things I want to cover.
[16:52] The first is morphological operations and the closely related field of Connected Component
[16:56] Analysis. These primarily apply to binary images and use small structuring elements to
[17:01] modify them. There are a couple of simple operations that demonstrate this. Dilation
[17:07] adds pixels where there’s already pixels, which is great for filling in holes in objects.
[17:12] And erosion removes pixels, which is great for removing noise.
[17:15] Closely related is opening, which is good for disconnecting regions that should not be connected.
[17:21] We can see here that some of the letters have got joined to other letters. We apply opening
[17:25] and they’re separated nicely. And of course we have the opposite, closing. This is good for
[17:30] connecting objects that should be connected. We’ve got some broken letters here and after
[17:35] applying closing, they’re all joined up. Connected Component Analysis follows
[17:40] closely after these operations and is great for extracting and analysing features after thresholding.
[17:46] These are basically flood fill algorithms. You start at a seed pixel and then explore
[17:50] until you’ve found all the connected pixels. You can have four way or eight way connectivity.
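Pulling the last few ideas together, here is a minimal OpenCV sketch of adaptive thresholding, a morphological clean-up and connected component analysis; the file name and parameters are placeholders:

```python
import cv2
import numpy as np

gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Adaptive threshold: each pixel is compared to the mean of its local
# 31x31 neighbourhood (minus a small offset), so uneven lighting is handled.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY_INV, 31, 10)

# Morphological opening (erosion then dilation) removes small specks of noise.
kernel = np.ones((3, 3), np.uint8)
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Connected component analysis: label each blob and get its bounding box,
# area and centroid in one call (8-way connectivity).
count, labels, stats, centroids = cv2.connectedComponentsWithStats(cleaned, connectivity=8)
for i in range(1, count):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if area > 50:  # throw away tiny blobs
        print(f"blob {i}: {w}x{h} at ({x},{y}), area {area}")
```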
[17:56] This is a really handy algorithm. You can count things in a scene, extract the pixels and use
[18:00] object recognition, clean up segmentation results by removing small blobs and you can work out the
[18:05] location of blobs. It’s really very powerful and it’s used an awful lot. The final thing we have to
[18:11] cover is transformations. Some of these will be familiar to you from school. You’ve got scaling,
[18:17] which is making things bigger and smaller. You’ve got rotation, spinning things around.
[18:21] And shearing. Now you may wonder what the application of shearing is. Well, it’s often
[18:27] used when processing handwriting, prior to trying to extract individual characters or words. You can
[18:32] see here on the left we’ve got very slanty handwriting and on the right we have a corrected
[18:36] version of it. The last really cool transforms I want to talk about are called
[18:41] homographies. These have some great applications and the most interesting for me is perspective
[18:46] correction, but you can use it for image stitching, augmented reality, changing the camera viewpoint
[18:52] and 3D reconstruction. It’s really powerful. To compute a homography you just need four points
[18:58] from the two image planes you’re trying to map between, which is kind of handy as most of the
[19:03] time we’re using this for processing documents which have four corners. Once you’ve got these
[19:07] four matching points you can calculate this matrix which then lets you transform points between the
[19:12] two image planes. Now that concludes our whistle-stop tour of image processing and I’ve
[19:17] barely touched the surface. And I’m sure you’re thinking, this is all very nice but it’s probably
[19:22] pretty difficult. How do I do any of this? Well, you’re pretty lucky. The frameworks come with a
[19:28] lot of stuff right out of the box. vImage is a high-performance image processing framework.
[19:33] It’s got convolutions, transformations, histogram operations, morphological operations and a bunch
[19:38] of other stuff. There’s also Core Image which has a bunch of built-in filters along with the ability
[19:43] to add your own. On the deep learning side of things we’ve got the Vision framework. Face
[19:47] detection, pose detection, text detection, reading barcodes, image registration, feature tracking and
[19:54] you can even run your own ML models. And talking about your own ML models there’s even CreateML
[19:59] that will help you create and design your own models. And if you can’t find something built
[20:03] into the frameworks then you can also use OpenCV, the open source computer vision library. The only
[20:09] slightly annoying thing here is that OpenCV is a C++ library so to use it from Swift you’ll have
[20:14] to wrap it in Objective-C++. If you Google you’ll find quite a few examples of this. There’s also
[20:20] various libraries for doing low-level pixel manipulation. This will let you implement pretty
[20:24] much any image processing algorithm you want. So I get approached fairly often by people trying
[20:29] to do things. It’s not that much different from being an app developer. It’s the usual “I’ve got
[20:34] this great idea, I just need some image processing magic to make it work”. So as always the first
[20:39] question to ask is “What are you actually trying to do?” It’s quite likely that what they’re trying
[20:44] to do isn’t really anything to do with image processing, or can be solved by using something
[20:49] simple like a QR code. The next question to ask is “Do you actually have any sample images?” Now this
[20:56] may seem like a strange question. Surely you can just go and take some photographs but often these
[21:00] projects are based on products that don’t even exist yet. How constrained is the problem? Is it
[21:06] going to be an industrial inspection application with fixed cameras and good lighting or is it
[21:10] going to be people taking random pictures on their phones? Which leads on to the next question
[21:15] “How robust does it need to be?” And the next question is “What are the real-time requirements?
[21:21] How long can the user wait for results? Does it need to be a real-time camera feed?”
[21:25] Which kind of feeds into the final question “Does this need to be done on device? Does it need to
[21:30] work offline or can you offload the work to a beefy machine in the cloud?” Following on from
[21:36] that there’s a bit of introspection required. Does the project seem possible or feasible?
[21:42] There’s lots of things that sound amazing but fall into the “very hard” category. Can you see the
[21:47] shape of the problem? This is a tricky one but can you see how the problem could possibly be solved?
[21:52] Does it fit into what you already know about image processing? Is it a solved problem? If you Google
[21:58] it will you find a bunch of people talking about similar problems? Is it a deep learning problem?
[22:03] If it is you really need to think about where the training data is going to come from. Can your
[22:07] client generate lots of labelled data? And if it’s not deep learning it must be a classic image
[22:12] processing problem. You should already be thinking about the pipeline of image processing techniques
[22:16] that will solve the problem. Or maybe it’s a combination of both. Do you need to do some
[22:20] feature extraction and then hand it off to something clever? Finally you really have to
[22:25] understand if it’s a research project. Is this a PhD project or even a project for a research group?
[22:31] A lot of this is connected to the answers from all the other questions but you need to think how much
[22:35] time and money the client has and will they be able to accept failure? So the general approach
[22:40] if you really want to continue is get some sample images. Do a visual inspection of the images. What
[22:46] are the features you need to extract? Is there any pre-processing required? Try out some out of the
[22:52] box algorithms. Do a bit of proper research and see if there’s any similar problems. Get a rough
[22:58] end-to-end processing pipeline working and then check it actually works on your target system.
[23:03] There’s some amazing things you can do but if you can’t run them on the device you need to run them
[23:07] on it’s pretty pointless. So let’s do some worked examples. I’m going to use some personal projects
[23:19] because I can’t talk about most commercial things. So back in the day, one of my early iPhone projects
[23:19] was a Sudoku solver that worked using the camera. What do we want to do? We want to take a picture
[23:24] of a Sudoku puzzle and then superimpose the solved puzzle back onto the image. So let’s think about
[23:30] what we need to do. What’s our processing pipeline going to look like? Well an obvious first problem
[23:35] to solve is locating and extracting the puzzle. A first logical step would be to threshold the image
[23:41] and we know we should probably use adaptive thresholding. With the image thresholded we need
[23:46] to locate the puzzle. Well what can we say about the puzzle? It’s a square shape, maybe that’s useful
[23:51] or maybe there’s a shortcut. The puzzle should be the largest thing in the image. We can analyse
[23:56] each blob and find the one that is largest. Now we’ve located the puzzle we need to extract it.
[24:02] We need a homography to take the image from the plane that’s been captured by the camera
[24:06] to a square puzzle. We need to find the corners of the puzzle. And knowing what you know now you’re
[24:12] probably thinking “aha, Hough transform, that’s the answer”. But there’s another cheat. We’ll talk
[24:18] about that in a minute. With the puzzle extracted we need to get the digits. Since we know where the
[24:24] boxes are and we have the thresholded version of the image we can use connected component analysis
[24:28] again and just find the bounding box of each digit. We can then do OCR on these digits which
[24:34] is a great use of deep learning. And then we can solve the puzzle and for this we can just use
[24:39] Google to find a suitable algorithm. It’s already a solved problem and it’s not really image
[24:43] processing. So here’s our processing pipeline. Threshold, locate, extract, OCR, solve and then
[24:51] draw the results back. Let’s run through each step. We don’t care about colour so we can
[24:57] immediately turn our image into grayscale. And for our initial processing we just want
[25:01] foreground and background pixels so we threshold the image using adaptive thresholding. Now we can
[25:07] look at all the blobs in our image and we can throw away anything that is too small and we just
[25:11] want to keep the biggest object. This should be the puzzle. With the puzzle located we now just
[25:16] need to find the corners and here’s where we can cheat a bit. We can use the Manhattan distance
[25:21] from each corner of the image to each pixel of the puzzle to find that corner. If we run through this little
[25:26] animation we can see that the pixel with the lowest Manhattan distance is the nearest corner and
[25:32] I’ve superimposed this on the puzzle. And that gives us a way to calculate the homography to
[25:37] correct the perspective. We can now transform the image so that our puzzle is extracted and with the
[25:42] puzzle extracted we just use connected component analysis to find the bounding box of each digit
[25:47] and extract each one. And then we just feed each digit into a neural network to do the OCR. Now
[25:52] this is just an artist’s impression of a neural network, the actual one is a deep convolutional
[25:57] neural network. The rest of our project is not really image processing. There’s plenty of
[26:02] algorithms for solving sudoku puzzles so we just use one of them and then project the solved puzzle
[26:06] back onto the original image using the homography that we found earlier. You can give this a go if
[26:11] you want, using the link up on the screen right now. One of the things to note about this is
[26:16] we don’t need to be robust. We can take a stream of frames from the camera and our algorithm only
[26:21] needs to work a few times to give a good impression. We can also generally easily detect if we’ve
[26:26] succeeded. We can check that what we’ve extracted looks like a sudoku puzzle. Do we have enough numbers? Is it
[26:30] solvable? There’s some really nice sanity checks that we can employ. Another fun thing I’ve made
[26:36] is this Wordle-solving bot. There are some nice things about this. It’s more constrained than the
[26:41] sudoku project. The camera is now in a fixed position looking straight down at the phone.
[26:46] This is much more like an industrial inspection scenario. But we do have the added complication
[26:51] of having to calibrate and control a robot. In this case I’m using a 3D printer.
[26:56] So what are our challenges? Well, the first thing we need to do is calibrate our robot.
[27:01] We have an image coming from the camera looking down on the robot. We need to be able to map from
[27:05] coordinates in the image to physical coordinates on the printer bed. We need to locate the phone
[27:10] screen. This is the only really unconstrained thing in our system. The phone could be anywhere
[27:14] on the printer bed. Once we know where the screen is, we need to read the result box colours. And
[27:19] then we need some kind of algorithm to predict the next best guess. So here’s our processing pipeline.
[27:25] We have the two location problems. Finding the phone screen and finding the printer bed.
[27:30] We need to calculate a couple of transforms to map from the screen coordinates to image coordinates
[27:34] and then to the printer bed coordinates. And we have something that will drive the printer to
[27:38] enter the guesses. And we then have something that needs to read the colour from each box.
[27:42] We can split these up into multiple pipelines. For locating the screen, we know that we’re going to
[27:48] have something quite bright on a dark background. We also know that we’ve got a fixed camera and
[27:52] controlled lighting conditions, so we can just use Otsu’s thresholding. We know we’re looking
[27:57] for a rectangular shape and we know the aspect ratio and approximate size we’re looking for,
[28:01] so we can just run connected component analysis. If that finds a suitable rectangle, we can
[28:06] calculate an affine transform to map that rectangle onto a straightened image.
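A rough sketch of that screen-location step; the file name, size and aspect-ratio limits and output dimensions are illustrative, and a real version would handle a rotated screen rather than using the axis-aligned bounding box:

```python
import cv2
import numpy as np

frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder

# Fixed camera and controlled lighting, so a global Otsu threshold is fine.
_, binary = cv2.threshold(frame, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Look for one bright blob with roughly a phone screen's aspect ratio.
count, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
screen = None
for i in range(1, count):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if area > 10000 and 1.8 < h / w < 2.4:  # illustrative size/aspect limits
        screen = (x, y, w, h)

if screen is not None:
    x, y, w, h = screen
    # Map three corners of the found rectangle onto a straightened image.
    src = np.float32([[x, y], [x + w, y], [x, y + h]])
    dst = np.float32([[0, 0], [300, 0], [0, 600]])
    warp = cv2.getAffineTransform(src, dst)
    straight = cv2.warpAffine(frame, warp, (300, 600))
    cv2.imwrite("screen.png", straight)
```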
[28:10] For the printer bed calibration, I cheated and stuck some coloured dots on the printer bed.
[28:15] After all, this is a controlled environment, so we can add helpful things. We can use HSV segmentation
[28:20] to find the dots. And then we can use connected component analysis to find circles and with the
[28:25] circles located, we can create a transform from the camera coordinates to the printer bed coordinates
[28:29] as we know where our coloured dots are. For the letter colours, we can just use the extracted
[28:33] screen image and apply HSV segmentation. So, let’s run through the steps. To locate the screen,
[28:39] we just apply Otsu’s threshold. And then we run some connected component analysis to find rectangles
[28:50] in the thresholded image. With this rectangle found, we can easily work out a transform to
[28:50] give us a nice straight image of the screen. To detect the colours of the boxes, we do a simple
[28:55] HSV thresholding to find the green and the yellow boxes, and we can also detect when there’s no
[29:00] colour at all. The printer bed location is another application of HSV thresholding, but this time
[29:06] we’re looking for magenta. We also do some connected component analysis to throw away
[29:10] any shapes that are not small circles. Combining all of these together gives us a complete working
[29:15] system. It’s pretty amazing what some simple algorithms can do. So that’s it for the talk.
[29:20] I hope this gives you a flavour of what’s possible with image processing and the libraries available.
[29:25] It’s a pretty magical world.