An overview of serialization formats. Numbers and anecdotes.

There are lots of format specifications to serialize your data against. These days, I have been looking for potential alternatives to YAML, which has been my go-to tool for a while, basically because since Rails decided to use YAML from its very beginnings, Ruby developers started to follow the leader and it’s pretty widely used. Funnily enough, YAML was born as a markup language, and developed into a data-oriented format. The main reason why I’m writing this blog post is so that I when I have to choose what serialization format to use, I can analyze the problem and see which format fits my problem best.

If you are not familiarized with serialization formats, just for the sake of making the article a little more engaging, here are some potential uses of serialized data:

  • Caches
  • Inter process communication.
  • Dummy objects for testing
  • Message brokers

Keeping this points in mind, let’s go on and analyze a few of the most promising serialization formats these days. I’d argue these list contains, (not limited to these) HDF5, BSON , MessagePack, YAML and Protocol Buffers. I would love to write about Thrift and Avro but I have no experience nor I know them very well, I might update the post with information about them in the future. Shoot me an email if you want to do it yourself! I don’t want to get into things like separating statically and dynamically typed formats, mainly because these formats are different enough to provide more significant differences in performance, space, and other metrics than just something that is (sort of) a preference. Now, the list:

HDF5

HDF5 is a hierarchical data format born more than 20 years ago in the NCSA (National Center for Supercomputing Applications). Born and bred in scientific environments, it’s not surprising that its user base are largely scientific laboratories and research institutes. Root, CERN’s data analysis framework, uses a variation of HDF5 for its own files, and it’s largely compatible with it. Not a human readable format, to me it’s more interesting feature is how it looks like it was designed for parallelizing IO operations. It does this by separating the data into chunks of related information, in a sort of nxn table. Then, any HDF5 reader can easily pick a chunk (a rectangle or a set of points) of this virtual table and start processing it. Another worker can be doing the same in other part of the table and so on. Neat, but datasets are large and they cannot be easily compressed because of this.

BSON

BSON is an attempt to binary serialize JSON documents. It shares some parts of the JSON spec but for instance it has added embeddable features like data types that are not part of JSON (Date, BinData). Funnily enough, BSON files is not always smaller than their JSON equivalents, but there is a good reason for this, traversability. The way BSON introduces overhead to improve access times is minimal and actually pretty easy to explain.

Take a JSON document like this:

Serializing this to BSON it will leave these numbers before the strings (it doesn’t do only this, but I want to focus on the overhead):

These markers are a very simplistic way of telling the BSON parser “hey, if you want to skip this segment, just advance your pointer X bytes to find something else”.
Therefore the efficiency of BSON lies on having a smart parser that can understand the code properly. As an example of the parser can optimize, think of the number 1 in javascript. Of course this needs to be stored as a number in BSON, but for certain numbers, the ASCII representation is best and you don’t need to waste 8 bytes per number. Your parser can figure things like this out, and probably people at 10gen and other places where mongodb is used, constantly find ways to improve the parser for a certain language.

MessagePack

MessagePack is not too different from BSON in the sense that both try to serialize JSON. Unlike BSON, MessagePack tries to keep a one to one correspondence between its spec and JSON’s spec, so there is no loss of data on the conversion and the format is more transparent. This affects space heavily, so for instance a trivial document like “{“a”:1, “b”:2}” is 7 bytes in MessagePack (19 in BSON). Such a simple thing as having extra metadata on the binary object (BSON) can help in some situations where the data is constantly changing, as it lets you change the values in place. MessagePack has to reserialize the whole object if a change needs to be made. This is just my personal opinion, but these differences are probably what makes MessagePack very suitable for networking and BSON is instead a format more suited to storage scenarios, like Mongodb.

YAML

YAML appeared as a human-readable alternative to XML. It is barely usable as a language for object serialization, but it’s worth mentioning why, as these same reason left out other possible candidates.
In the event of a network failure, YAML files might be transmitted but there is no way to tell whether whatever arrived to the other peer is correct or not. Most serialization formats simply break if you slice the file.
There is still no support for YAML schemas so two peers can exchange YAML files agreeing to a data exchange format. That renders it unuseful for RPC, message brokers and the like.

Protocol Buffers

A very smart way of define messages (but this works for any structured data) designed by Google. Instead of having a human readable format like some of the formats I mentioned previously, Protocol Buffers has source code files “.proto”s that have to be compiled into a binary object. It is mainly geared towards C++ programming but there are implementations in many languages. From my experience with Clojure’s library and other LISPs libraries are pretty much abandoned, while Ruby’s implementation is actively developed. I wouldn’t recommend anything but the official ones (Java, C, C++ and Python).

An example of a .proto file:

The way Protocol Buffers encodes the data has two aims. Consumers should be able to read the data produced by newer producers, by simply skipping unexpected fields. Consumers have to be able to find the end of a field without needing any metadata about the field. The whole encoding revolves about fixing this problem (varints, ZigZag coding). It’s called the binary wire format and in short uses varints to encode the data, varints are simply integers grouped in 7 bit sets, where the high bit (MSB) is the stop bit. The way negative values are handled are by ZigZag coding the 7 bit groups with the following function.

This basically renders -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so forth.

Strings are UTF8 encoded. Advantages of Protocol Buffers for RPC (as opposed to XML, which apparently was Google’s itch for PB) are way faster  encoding, smaller files, and more easy to use programatically (aka ‘we got rid of the infamous XMLController’).

Protocol Buffers encoding guide by Google

 

—————–

Other resources I found useful:

Binary Serialization Tourguide

Browser Physics – BSON, binary JSON, now for the web

 

67 Kudos

 

Code Tuning, a programming pearl in Ruby

After a few weeks at CERN of not much blogging, I have been mostly flat hunting at one of the most expensive cities in the world. Of course the real estate market is accordingly crazy. This means I spend a big chunk of my spare time on the trolley, which is okay because I borrowed one of the best programming books ever at CERN’s library, Programming Pearls. I like to read whichever column I feel like reading instead of reading it from cover to cover, and I stumbled upon “Code Tuning” today.

Being the practical guy I am, I thought that would make a good reading choice on my way back to help me focus on details. And hell it did. I would like to go over a very well known algorithm, Sequential Search, and introduce some cool twists that Jon Bentley shows on the book.

Given an array, a sequential search simply looks for a certain element of the array iterating over it. Easy right? A simple sequential search in ruby would be something like:

Can we do better? I thought not, but there are ways to improve this dramatically. Let’s benchmark for 1M, 10M and 50M elements.

It seems that it takes around ~103 (ns) * n on the worst case, I calculated this by dividing the benchmark for 1M by the number of elements, I ran all benchmarks for the last item of the array. Fairly normal, this will be the baseline for further improvements.

How can we slim down this loop a little bit? Easy, stop performing more tests than necessary. On an ideal version of a sequential search you should only test for the elements to be equal to the desired element. In reality every time this block executes itself, array.each is checking that we haven’t gotten to the end of the array.

Benchmarks for 1M, 10M and 50M.

Bentley dropped the run for a speedup of ~5%. It looks like tests are more computationally expensive in Ruby so now it runs in ~45 (ns) * n. This is a crazy good improvement of about 55% and still it looks pretty readable.

It looks like our bottom line (tests) cannot be improved, as you will always have to check against every element in the array until you get to it. Now what. Incrementing the index in every loop is an addition that shouldn’t cost much, but we can do better. Loop_unwinding/unrolling as I prefer to call it is the answer. I am not very positive that Ruby’s 1.9.2 default interpreter unrolls loops. And we’re going to find it right away, because if it doesn’t the reduction will be very sharp.

Benchmarks for 1M, 10M and 50M.

It went from ~45 (ns) * n to ~41 (ns) * n.  A slight improvement of 9%. I’m a bit disappointed after checking the book and seeing that reducing the overhead of incrementing the iterator led to a reduction of 56%. What puzzles me even more is that his argument is that loop unrolling can help to avoid pipeline stalls, reduce branches, and increase instruction-level parallelism.

Why is this clearly happening on Bentley’s benchmarks in C and if it’s happening, it doesn’t yield the same results for my Ruby implementation? No idea, so instead of keeping this private I’m going to share it so you guys can wander about it. I’ll just leave a quote of Don Knuth:

Premature optimization is the root of much programming evil; it can compromise the correctness, functionality and maintainability of programs. Save concern for efficiency for when it matters.

 

 

93 Kudos

 

hackNY Lessons: Week 8

We met Anthony Volodkin from Hype machine and Thatcher Bell, Greg Pass (former Twitter CTO) and Daniel Huttenlocher (Cornell’s Information and Computer Science dean). I had no idea they were trying to open a Cornell branch in NYC (CornellNYC Tech, in partnership with Technion, aka Israel’s MIT), but if someone had to run it, they chose the right person. Honestly after meeting several CS/Engineering deans this is by far the only one I’ve met who really thinks out of the.. university? Usually they are so centered in academia that the only things they see outside of it it’s consulting companies like McKinsey or Accenture. In my view that doesn’t necessarily mean they can’t run a CS school, but the problem is they fail to show all the options to their students body. I’m not into academia nor I think I’ll ever be but heck as soon as this becomes something real I will be on the first line to check it out.

Definitely one of the best talks -I’d say QA rather- so far at Art.sy headquarters. I also got the chance to talk to Daniel Doubrovkine, Art.sy’s head of engineering and a very cool Swiss individual who gave me good advice about CERN, Geneva, universities in Switzerland and such. I should stop and start off with the learnings because I’m getting too emotional at this point.

 

Tech learnings:

  • Project Euler from 6 to 8 in Clojure
  • Git cherrypicking
  • Lifebooker’s API
  • Handle JSON
  • When scaling something that cannot be stopped measure everything.
  • Chef vs Puppet
  • Learn something that will make you think different when it comes to programming. No, learning C# when you know Java doesn’t count.

Startup learnings:

  • Step away from some friends. I should add, learn what not to do from some friends. Sometimes people forget how important it is to know what not to do and not only what to do.
  • In order to build a really good ecosystem, new startups should have some sort of role model (HP/Google/Microsoft/etc..)
  • It looks like for some people it’s easier to find a problem/solve it/make profit than innovating. I personally
  • think releasing something truly innovative is several orders of magnitude harder than the former.
  • Big problems (self driving car) have such big barriers to entry that most people with an entrepreneurial drive decide to go for the easy profitable thing (Instagram). No shame on that. I heard a story this week about two guys who are almost 99% certain they can make nuclear energy with Thorium way more efficiently than the way its done right now. Cost of that prototype? 100M. YES A PROTOTYPE. A big chunk of that is lawyer/bureaucracy fees.

Life learnings:

  • Please prove me wrong but there’s no way to avoid losing a holyfuckton of money when buying foreign currencies.
  • Again, bureaucracy sucks on so many levels. Probably the barriers to entry that market are super high
  • but there is definitely a big problem on that ‘market’ whatsoever.
  • Also sending money to foreign banks is a pain in the ass. I might open an account on CapitalOne.. 0% on ATM’s overseas, no fees at all.
 

10 Kudos

 

Project Euler 4 solution

I am doing Project Euler problems as a way to learn better idiomatic Clojure.

Today’s problem really made me realize how cool is clojure. Wording is the following

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 *99. Find the largest palindrome made from the product of two 3-digit numbers.

Easier than expected huh?

In short,  is-palindrome just checks for equality between a number and its reversed pair, if they are equal it returns the number, otherwise it returns nil.

The main function uses ‘for’ to multiply two ranges of numbers, i.e: 100*999, 100*998,… 101*999… This returns a lazy sequence that is evaluated against is-palindrome. That will return the same collection but with the value nil on all the numbers that are not palindromes. So we just filter numbers out, get the maximum value of the list of numbers with reduce, and we are all set. Such a great example of how lazy evaluation makes things a lot faster. Any thoughts on how to improve this solution?

 

11 Kudos

 

hackNY Lessons: Week 7

Chris Dixon and Josh Knowles were the talks for the week, being the former mostly QA and the second one probably the best presentation on Agile management concepts that I’ve ever seen. It was so much stuff in less than 1 hour and a half, I already know most of it but hell now that’s really ingrained in my brain. Chris Dixon put an emphasis on personal branding, having really clear goals for your startup.. oh wait maybe I should get started.

Tech learnings:

  • Chapter 1 of the Joy of Clojure. I read slow, mostly because I only read on the subway or when I’m waiting to go to work in the morning.
  • How to solve this with Clojure (will post the solution right after this post)
  • Codata, corecursion, f-algebras.. looks like I learnt more math than practical tech this week.
  • Apache routing to keep different Rails projects separate.

Startup learnings:

  • Ndas.. most ideas seem stupid to a person who is listening to pitches 24/7. It’s so rare that you tell someone an idea and that person copies it to the point it makes your idea a serious competitor against you.
  • Jesus christ getting people to use your stuff is hard. And I don’t even know where to look at to get better at this. I wrote this post on HN asking for help but nobody seemed to care much about it. I’m sure its a lot easier to look for a niche then build something people will use than trying to do innovative stuff.
  • You should of course focus on something and do it really well. You’ll get to a point where you can be ahead 99% of the people on something. Then take something new on a totally different topic and become good at it. All credits to Chris Dixon on this one.
  • Make really, really clear to your cofounders what do you want to do with your little baby that your startup is. It will be a major source of problems if you want to sell it to some Indian company in 5 years while your cofounder wants to keep running it.
  • If possible, avoid having a team who cannot communicate. That will kill your company. There are ways to build a startup from scratch where all employees are friends with each other and still they are ‘A’ players as they like to call themselves..

Life learnings:

  • I get frustrated when I have no side projects. Project Euler was a good mental outing this weekend, I guess I will continue until some idea pops up.
  • Too much bleach is never a good thing even if you can’t fit anything else on your washing machine.
  • Another headache on Saturday night for the same reasons. I should do something.
 

6 Kudos

 

Corecursion, codata. Learning by teaching

Four weeks ago I started learning Clojure on my own. I took the bold move of building something awesome without even knowing the syntax, which I hope will eventually learn by doing an Amazon web spider and Project Euler problems.

So far it’s being both good and hard at the same time, I kind of want to push off and be done with some problems at my work at Lifebooker. This and all hackNY events, hackathons, and heck just trying to make the most of my time at NYC (at which I’m sort of failing) don’t really leave much room for experiments. Tonight I feel like I should rest and work in the morning rather than trying to do and accomplish nothing on my Amazon pet project. So instead I’ll be writing a blogpost about corecursion and learn from explaining it. Let’s get started by defining what recursion is.

My dearest friend Wikipedia says recursion is the process of repeating items in a self-similar way. Pretty accurate, but for our purposes and for anyone not familiar with the mathematical concept of self-similarity, let’s understand recursion as whenever a function, be it a mathematical function f(x) or a function in a programming language function(x), are expressed in terms of themselves.

In maths we understand a function defined in terms of itself as, for instance, f(x) = f(x-1) + 2. A simple function defined in terms of itself in java-like pseudocode would be something like,

Recursive functions usually have at least a base case, so they stop at a certain point and stop calling to themselves. The simplest example I can make up off the top of my head is this, given an integer argument,

This would just return a number after some calls, provided that you call it with an integer argument less than 10. Now that we have brushed up on our CS101, let’s bring up the real thing. Corecursion is essentially the same idea but applied to data structures instead of functions. If I had to put a wording for the problem, it would be something like:

Given a data structure, define a computation on that structure. Execute it, then repeat on result of this computation, and so on. It is similar to recursion, but it produces a sequence of data, with calculations taking place on demand when this data is dereferenced. It’s oriented on data, not the operations.

Codata

In order to understand corecursion, I found vital to understand codata. If you really wanna get into it, I would recommend reading this paper which teaches the basics of F-algebrasF-coalgebras, and duality (F meaning functor in this context). For the sake of simplicity, let us use two really basic data types, based on natural numbers, Nat and CoNat.

Nat (the natural numbers) is defined as:

At the end of the day they are created out of S and 0. Think of S as 1 and you could create all the numbers with this data type.

CoNat is almost identical to Nat but with an extra element.

An interesting corollary to this is that the latest element of the CoNat set doesn’t contain c0. Plus, it’s truly infinite, as opposed to the “latest” Nat which is just applying S and 0 over and over. If we were to define Nat, we could say something like

But when it comes to infinity, we cannot do infinity = Successor(infinity). Thus, infinity (ω) is NOT a member of this set. If it was, now consider the set N that has all the same elements as Nat except it does not have ω. Clearly Zero is in N, and if y is in N, Succ(y) is in N, but N is smaller than Nat which is a contradiction. Therefore ω is not in Nat. Data is always finite. Codata can be infinite. 

Taken word by word from reddit:

All types in functional programming have two “tools” for using them

  • an introduction principle, and
  • an elimination principle

The introduction principle lets us turn objects of some other type into objects of this type. The elimination principle lets us turn objects of this type into objects of some other type.

This should be your personal mantra. Examples of this are:

  • Function types – Lambda abstraction & Function application
  • Sum types – inl and inr & case expressions
  • Product types – the function (,) & fst and snd
  • Unit – the unit constructor () & case () of () ->
  • Bool - True and False & if-then-else statements
  • Empty – (no introduction rule) & reductio ad absurdum, the ability to promote _|_ to any other type.

A requirement for an introduction principle to be such, is that the introduction principle has to be able to turn some type into Nat. We do have a introduction principle for Nat. It is just our two old friends 0 and S (the constructors).

CoNat does NOT have an introduction principle as simple as its constructors. It’s introduction principle is

So esentially, coNatIntro f includes a -> CoNat . And that’s enough to ‘turn some type into CoNat’, so it’s an introduction principle.

Elimination principles work just the opposite way, they turn elements of a type into elements of some other type, so in our example they would have the form Nat -> something, CoNat -> something else. 

The elimination principle for our simple Nat type is:

Wait. Isn’t Nat -> a the opposite of a -> CoNat? Yup, so it is even clearer than they are elimination and introduction principles respectively. So it is proven that CoNat is codata, hence infinite but we can only see a finite amount of CoNat. This has some consequences, that can help understand corecursion and codata.

  • You can’t define total subtraction. (infinity – infinity will always loop).
  • You can’t write total equality. (it loops when comparing infinity == infinity).
  • You can’t write a total elem function that works on a CoList (it loops when the list is infinity long and the element isn’t in the list).

Corecursive data structures are defined in terms of itself, corecursive algorithms are defined in terms of its output rather than on its input.

This and lazy evaluation allows “infinitely” large objects without allocating this memory and obviously running out of it prematurely.

Corecursive binary trees

A complex yet easy to understand corecursive structure would be a binary tree. I won’t go in the details, but you can see how this would be defined in terms of a corecursive structure. Here is an example on how they take advantage of corecursion to traverse the tree.


							
 

16 Kudos

 

hackNY Lessons: Week 6

Last week we attended two awesome talks. First Peter Bell taught us how to choose your tech stack wisely and it was.. very insightful, especially after me failing to choose the right technologies for my projects. Also he gave lots of tips about hiring people, etc, which are reflected on this list. On Wednesday we met Joel Spolsky and gave us a talk about how StackExchange was built, mostly focusing on how good they are doing attracting knowledgeable people to their sites. Lots of learnings from that talk will show up in the list for this week too!

Tech learnings:

  • I learnt essential bits of Puppet and Chef, a soft introduction to what is expecting me in just two months at CERN.
  • Code reviews are vital. Even if the rest of your team is not familiar with what you have been working on, there will probably be a lot of stuff that you didn’t realize how dumb it looks until someone points it out or your read it several times. That or I’m retarded. (which is very plausible)
  • Git stash back, pull, then pop, will avoid A LOT of conflicts. Thanks Adam!
  • I just started the Joy of Clojure, I hope this will be enough to get myself to a level where I could just have to google for libraries and not for simple syntax questions. And its only 360 pages!

Startup learnings:

  • The coolness factor of a language is important when you want to hire the right people. It will be much harder to find a REALLY top notch PHP developer than to find a really good Racket developer.
  • How many Haskell developers COULD also write PHP? You should know the answer and yet be surprised about it.
  • Attracting people who care about your site and who care so much about what you do that they will come every day is much more important than getting a peak of audience on HN/reddit/whatever.
  • I’d take a close look at how new StackExchange boards are created to do that (if I ever had to do this)

Life learnings:

  • Maybe I should learn these posts every time that I feel like procrastinating, I haven’t accomplished much this weekend just because I wanted to get my laptop fan to be less noisy.
  • Sometimes I have no idea about how to get myself on to do stuff. And that led me to a headache on Saturday night.

 

 

5 Kudos

 

hackNY Lessons: Week 5

We met with Chris Wiggins and Evan Korth and had some insights on what’s the way hackNY works and what do they want to accomplish with it. Also we met Chris Poole (m00t) founder of 4chan and probably one of the individuals in this city that has very opposite views to mine on almost every topic (lol).

I worked mostly fixing bugs and old broken code, got stuck at my project with Clojure, and I only managed to get to Project Euler 4. Super pointless week overall (aside from the talks), although I managed to do some stuff on the weekend and visit around some cool places around Downtown Manhattan. Props to Jesse and Cheryl who built mentor.im for NYC Angelhack and got into the top 20 best teams, they’ll be pitching in SF soon. I myself ditched Clojure for the Amazon Spider thing and used Ruby which is much more appropriate. Lesson learned. I guess I’ll learn Clojure by means of books/Project Euler and what not. Advice on this will be much appreciated.

Tech learnings:

  • SSH replaces telnet, SCP replaces R* (rexec, rcp, etc..)
  • If you let other people work on your stuff without supervision and they don’t do TDD, your tests will break badly
  • As a corollary to that, try to always get everything in small chunks so one person can work on one thing without interfering the rest.
  • Do not squash your commits without checking everything works properly… it’s not easy to get back to the previous state, especially when your squashed commits come from several different branches.
  • Corecursion (again). I’m writing a cool blogpost on it that is helping me to learn it better.
  • Codata, F-algebras and F-coalgebras (thanks ELIMF)
  • Don’t go with anything than the best tool that you best know for a certain work. Hitting a nail with a screwdriver feels so dumb and that’s metaphorically what I’ve been doing for a month. I.e: Rapid prototyping with Clojure not even knowing the syntax/idioms/etc…
  • For loops on the terminal are cool and useful.
  • Know concurrency like the back of your hand. It WILL hurt you if you don’t know it and even worse you won’t even know what your problems are nor how to solve them.

Startup learnings:

  • Go for bleeding edge technology. If you are not familiar with it and you cannot afford learning it (time or money wise), go for the second best option.
  • I do have conflicting views on almost everything with some people in this world. And so do other people. This is not a learning actually but myself trying to apply some common sense to my networking.

Life learnings:

  • Learning by doing doesn’t work for me, I need to read a book, do some tutorials and whatnot first.
  • GO OUT and explore. It made my week to go out on the weekend around NYC after the most pointless week ever.
  • Again, don’t be a hermit…
  • Let others have a second chance.
  • Don’t lose your passport if you’re working in the US. Getting an I-94 under the regular time span is really, really hard.
  • 1 week with no exercise made me go dizzy after some extenuating work.
  • Be kind to your neighbors, I met cool people at Palladium’s hall elevators (lol!)
  • Make the most of your 24h day. I wrote this whole list while waiting for some dumplings here (Vanessa’s dumplings)
 

5 Kudos

 

hackNY Lessons: Week 4

Awesome fellow Terence Nip  had a great idea and he is going to post his weekly learnings on his blog.

We met with Joanne Wilson (a.k.a. the gotham gal), Dennis and Harris Foursquare founders, and chilled out on Wednesday with the Fog Creek crew at their offices. Overall a terrific week. Plus we got some stuff done and I went to hack n jill and got to do some cool shit with Cheryl and Jesse, he wrote about it on his blog.

From http://blog.terryatjn.com/post/24749339216/hackny-lessons-week-3 ,

Tech learnings:

  • Mongomapper
  • I learnt to love HAML over ERB.. and SLIM over HAML.
  • I mostly did front-end stuff this week.
  • Enlive and how to scrape websites with Clojure. Still not quite there though but I’m getting the hang of it. 6 weeks to go.
  • Lean planning and dividing your work into chunks to put in your to do list makes everything easier. i.e: “Crawl 1 site and 1 link”. I try not to plan any further than 2 or 3 of this chunks at a time.
  • I need some help on learning more common functional programming idioms than map/reduce/filter …

Startup learnings:

  • Whenever you get denied for funding, always ask why. Otherwise you’re going to get in a never ending circumference path where you get rejected and you cannot do anything to solve the situation.
  • Get your legal stuff sorted out early on. If you don’t know how to start, Doccracy might help here 
  • Why would you try to raise funding if you don’t need it? Just bootstrap it and enjoy your cake while people in suits look at you with envy. Definitely the way to go if you or the problem you want to solve don’t really need the money and especially if you can’t give back 100x that amount.
  • Almost every successful entrepreneur has been very, very tight in money at some point of their lifes.
  • Overnight success doesn’t last.
  • People at foursquare were working for 10 years on the same sort of problem until it clicked. They even got acquired by Google and they ditched their product. That should tell you something.
  • However, in Harris words.. don’t take all advice you get. No matter if it comes from Larry Page and Sergey Brin. Take risks if you think your idea will change the world even though everyone else tells you it will not.

Life learnings:

  • Startup learnings should really go here but whatever.
  • Work hard surrounded by people who work hard. Then party hard together. Rinse and repeat. Motivation can be a bitch when you work on your stuff all alone.
  • Having no money forces you to be creative if you really want to get out of that situation.
  • I need to think for 5 minutes on the solution of a problem before actually starting to work on it.. not sure if this is a good thing or not.
  • For some reason places like Fog Creek attract lots of hardcore nerds. Even though they make really top notch cool stuff, diversity >>>>> being in a plastic bubble. Please prove me wrong on this.
 

4 Kudos

 

Compare images in Ruby

So last week I have to go over a problem that really got me for a couple of hours. I am using RMagick at Lifebooker to generate some images with stickers on them. In short we need to put a sticker on our deals like “Last chance” and similar stuff so people when they receive an email with their discounts they will notice which ones are about to finish.

We aim for 100% test coverage on Lifebooker so we try to embrace TDD as much as possible. In this case I had to test that I was always generating the same Image with the sticker where I wanted it. Given the two images, I want to come up with the exact thing that is below it.

 

lifebooker original

 

 

 

 

 

 

 

 

 

 

 

 

 

Easy enough, I came up with this function who did this job very well.

So you might be wondering, what’s the big deal? Testing this was a big pain in the ass. The normal approach would be to compute its digest, using Digest::MD5.hexdigest(file). That was messed up and after many tries, I gave up and decided to compare them manually with an awesome hex diff tool called dhex. And I found something weird. Check out the image, you’ll find out that 8 bytes are different. And they differ in the same way, 0c 13 becomes 1c 20.. twice. There are many nuances that could be said about why did this happen, but in short it all came down to the fact that I passed the file path to RMagick and I let it do its ‘magick’ opening the file, writing the new content on it, and saving it.

 

 

It turns out that probably the best approach is to open the file in “rb” (read binary) mode, change everything using RMagick on memory, and then saving it back using standard File class from Ruby. That would yield a correct MD5 hash which was what I was using. I did not realize this and ended up looking for a way to compare they directly on RMagick. And lucky me, there are many methods to compare images in RMagick!

It turns out that the best way to do this, instead of calculating the digest of the binary data, which will be different (really, no matter how bad you try it always generates an image with those 8 different bytes who will mess up the digest. But what about taking a different approach and comparing the pixels RGB values instead? At the end of the day, they should be the same for two identical images. And it works. Glenn Randers-Pehrson (libpng maintainer/imagemagick contributor) wrote about this on the Imagemagick mail list,  So I just use RMagick’s signature.

 

Lesson learned, if you have to compare an image, compare its pixels. Guys at libpng and Imagemagick probably know better than me.

 

9 Kudos