High Availability and Configuration Management

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

Disclaimer: This is only meant to be a list of experiences and solutions. Ultimately, high availability depends a lot of the particular setup of your application, servers, architecture, etc… If you have had different experiences than those outlined here with this tools, or you feel that we are missing something, please comment, send me an email and I will update it.

I’ve noticed a lot of noise lately on how to succesfully deploy large Puppet installations. Incidentally, the kind of places that need this always rely on other tools for inventory and provisioning at the very least. In this collaborative guide, I will try to explore as many paths as possible for a successful cloud deployment where we can put Puppet at the core of our configuration management. Some of these problems are central to configuration management itself, so if you are managing this in a data center by other means, Chef, by hand (no!), Ansible, Salt… I bet you can get some take aways from here. I assume the reader is more or less familiar with the basics of Puppet (server-client, modules and classes…), other tools will be explained in detail.

Puppet has not been around for so long that a single individual could have had experiences in dozens of deployments, so if you feel something is wrong or you want to add anything, you can contact me (my email is on the left bar) and I will fix it.

Puppet

First off, there is a basic architectural consideration when puppetizing your servers. Once upon a time… when Puppet just came out, everyone assumed we were meant to have a master client setup. Fundamentally, this means the master stores their configurations in “`/etc/puppet/environments/“`, and it figures out how to compile a catalog containing the configuration the client needs. Clients request this compiled catalog and run it. Obviously this solution is taken straight out from the fabulous world of cotton candy and lollypops. It’s great because:

Everyone else is doing this! We can use everyone else’s manifests to setup our own puppet masters.
Scalable? Well, we can put a bunch of puppet masters with a balancer in front. Apache mod_proxy or HA proxy are proven solutions for this. We will do this when our lonely Puppet master starts to blow up and drop requests because there’s just too much traffic or load.
We can use LVS/keepalived to setup permanent connections between our puppet clients in a data center and its closest master.
If we ever get to Google’s scale, we can provide DNS tiers that redirect clients to its closest (by request trip time, or geographically).

Everything here sounds like a bliss doesn’t it? We do have a plan to overcome the issues running a single Puppet master instance. We definitely do.

Here’s the thing: Scaling a single instance to tens or hundreds will fix your pains. But it will only fix those related with having a single instance.
It’s quite complicated to plan in advance the issues having many puppet masters will bring. However, we can sort of use this post as a way to document our experiences and hopefully avoid some headaches to the folks deploying these installations now.
There are some metrics that can more or less help to figure out when you need new puppet masters.
catalog compilation time (one catalog per thread, the shorter the better, as the thread will be busy compiling and not taking in new catalogs)

  • request/s
  • last call to master from each node (will let you know when there are some waves of requests)
  • new signups (puppetca)

Rate monotonic scheduling

There is space for a quick project that measures the catalog compilation time, and the time of last call from master to node. It can be possible to prevent, or mitigate at least, the effects of a wave of puppet client requests to the master by using this two points of data. You can plan this using MCollective, possibly scaling up and down your puppet master infrastructure depending on the time of the day…
Funnily enough, real time operating systems have a good answer to this.

Rate monotonic scheduling allows you plan your client request waves and avoid DDOSing your puppet masters. Given a number of nodes (puppet clients), and a maximum capacity for all puppet masters to handle x number of requests at the same time (say 500), using these formulas you can come with several periods, so you can plan a massive puppet run for each of them.
Explaining how you can schedule puppet runs using harmonic periods is a little out of scope for this post, but you can contact me privately if you didn’t understand any RMS scheduling guides online.

An example of the problem that RMS fixes:

Puppet interval 1: 25 min
Puppet interval 2: 60 min

Node set 1: 00:00 -> 00:25 -> 00:50 -> 01:15
Node set 2: 00:15 ……………….. 01:15

At 01:15 you better have your pager on.

PuppetCA

PuppetCA is another feature of Puppet that you might struggle to scale out. Even though you can share the CA files across all masters (using NFS, Netapp, or any other mechanism), it’s probably not a great idea. It’s a hack that will make your PuppetCA highly available.
Another hack is to autosign every new host that requests it, having a separate CA per puppet master. Anyway no configuration will be applied to the host if the hostname is unknown right? ;)
Puppetlabs take on this is to use a central CA. It’s not HA, but in case of failure, master/node connections remain working since CA certificates and CRL are cached. However if the central CA fails, sign up of new machines will not be possible until manual restore of the CA master.

DNS

Puppetlabs has sorted this out for us in 3+ versions of Puppet.
General guidelines:

  • 2.7.x:
    • Point a node to any working master using generic DNS, ‘puppet.redhat.com’
    • For multisite deployments, make all nodes within a site point to a local DNS for puppet masters, ‘ranana.puppet.redhat.com’… ugly and requires work.
  • 3.x:
    • SRV records! Nodes will pick up closest working master.
    • Algorithm prefers masters in your network.
    • DNS setup as many tiers as needed (global -> region -> data center -> puppet masters)

PuppetDB

PuppetDB essentially contains the real data from nodes in your installation, be it facts, reports, or catalogs. It can be useful when you want to know where has catalog X been applied, etc…
The DB differs from (part of) Foreman’s DB in that Foreman stores the expected data to be at the nodes, and then it tells the Puppet master to make the nodes look like what you expect, instead of only consuming the data.
As far as I could tell from a presentation by Deepak Giridharagopal (thanks for the puppetdb_foreman mention!) Puppetlabs has some tools in the oven to replicate data from one PuppetDB daemon to another, so mirroring will allow other DBs to take on the master in case of failure, and other strategies explained in that presentation. This is the most blurry component when it comes to scaling in my (very little) experience with it so any contributions in this area will be greatly appreciated.

Masterless – Distributed

Let’s try to list pros and cons of each approach over here

Masterless pros

  • Computation is massively parallelized
  • Easy to work with when number of modules is small
  • No SPOF (using autosign)
  • Distribute Puppet modules via RPM to nodes using Pulp

Masterless cons

  • Hard to monitor and spot failures
  • Large puppet module code bases will be stored on each node?
  • Forces you to resort to Mco/Capistrano for management

Distributed pros 

  • Only choice when module repositories are big and being written to very often
  • Failure easier to manage because problems will be at known locations (instead of all nodes)

Distributed cons

  • Keeping Puppet masters modules in sync is quite hard
  • Git + rsync? NFS? GlusterFS? NetApp? Any success story?

Foreman

Foreman can and should be one of the central pieces if you want to save time managing your infrastructure. If you have several devops guys, developers.. people that want to automatically get a provisioned virtual machine, you will need it. Thankfully, scaling it is considerably easier than scaling Puppet.

First off, you don’t want your Foreman UI or API to break down because your Puppet report broke Foreman. This is easy to solve using a service oriented architecture, which in Foreman would look like:

  • ENC: Critical service, should not better be cached to avoid flipping changes back and forth.
  • Reports: YAML processing will be slow with very large reports
  • UI/API: User facing services, will get the least load

This has the great benefit of being able of allowing failure to happen in one service at a time. It is more or less easy to setup a balancer (HAproxy, apache mod_proxy_balancer). Passenger will also allow you to run your Foreman (or Puppet) multithreaded. I recommend at least two backends per service.
The architecture needed for smart-proxies (DHCP, DNS, etc…) depends very much on where the services are located. Usually, you will want at least one (preferably two) smart-proxy in each of your services, in each of your data centers.
Foreman is able to multi-site scale for most capabilities without new instances, provisioning, inventory, compute resources, etc… do not need to scale ‘geographically’.

Thanks to Red Hat, CERN, and everyone else who has contributed to this post in some way.

Please contact (me @ daniellobato dot me), or comment below if you want me to update any part of this blogpost. I’ll be very happy to get some feedback.

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

45 Kudos

 

Becoming a better software developer is like being in a maze

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

Nearly every time the word ‘metric’ comes up in software development, is to drop another diss about it. This post is not meant to rant about the latest fad on how to measure code quality (wtfs/minute), monitor your developers (ROI), or count your sushi.

Instead, I would like to share how complicated it is to measure software developers quality, and how I struggle to get better at some areas.

The wall

I struggle because either me or this profession lack a standard ladder of merits to climb. It sort of looks like a maze in some ways. Surely other developers trying to improve their skills can relate to the feeling of studying up some exotic topic until you hit the wall and think: “How is this really making me improve as a software developer?”.

At that point, you either walk back your steps and rethink what do you want to learn next, or you break the wall and connect other paths to it. The decision is up to you, but there are a number of factors that can help you lean towards smashing the wall. I can’t really help with this, but here are some of the questions I pose to myself before continuing:

  • Are there plenty of positive reviews of the topic?
  • What do my friends say about it?
  • Is this something fundamental to Computer Science or some applied technology?
  • Will this broaden my knowledge in unexpected ways?

Again, these questions are just personal and more often than not their answer is blurry when you have just hit the wall. However, I do think it’s a good exercise to outline what are the benefits of devoting a big chunk of your free time to learning something.

These questions help me decide whether the wall is worth breaking or not. Moreover, how do I know how thick is it, and when will I start making out paths cross?

Skill acquisition measurement

There are not so many models that try to measure how far ahead are we with a certain skill. Dreyfus model of skill acquisition, four stages of competence, are two possible ways of merely figuring it out. I see at least two cons with the way these models can be applied to software development.

First off, software development is a vast field. It covers stuff from splay trees, to linear bounded automata or device drivers programming. The point is, it’s quite difficult to measure ‘raw software developer quality’. It’s not so difficult to measure skills.

Nonetheless an expert on some skills is needed to measure these skills. I would venture to say most people exploring a field, do not have access to experts who can let them know how much have they learned. Students get hopeless, and walk back without proper knowledge of the topic, and mostly having wasted a lot of time. I acknowledge and I’m grateful universities and moocs help fixing this issue, but still I’d love to find a way self-learners can mimic the experience themselves at home.

According to Dreyfus model of skill acquisition, experts in a topic make 1-5 % of the population, they write the books, push the research boundaries, and have developed great intuition. I believe this intuition solving problems is what makes the rest of the world believe they are geniuses.

In the end, no one is neither an expert nor a novice at all things software. Systematizing human knowledge into an array with defined boxes is hard. Hell, even knowing what ‘quantity’ to put in each of the boxes is really hard. Is it even worth it worrying about this?

To put things in context, I am spending a lot of free time digging these walls:

  • SICP
  • Contributions to MRI
  • Conway’s game of life
  • Clojure

Before you head on to more interesting places, two questions:

  • How do you know whether a topic is worth exploring or not?
  • How do you know when should you stop digging and move on?

Some useful links:

Pragmatic thinking and learning

Programmer competency matrix

5 stages of programmer incompetence

The four stages of programming competence

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

26 Kudos

 

Why is curry not popular among Rubyists?

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

I’ve been wondering this one for a while. In fact, as much as I like functional programming, most of the time my Ruby functions are not curried or partially applied in any way. I guess this is because I have always thought of Ruby as a very paradigmatic Object Oriented language, where absolutely everything is an object, Smalltalk is at its core and even I do find myself writing code in lambdas and blocks all the time like most Ruby programmers do, I’ve stayed away from curry and partial application in production code.

If I was asked why do I do this, I would say it’s just a matter of following the best Rubyists style, and keeping my code easy to understand for newbies. The former argument is weak at best -fallacious argument from authority- while the second one is true, but I consistently use other advanced features that newbies won’t understand at a first glance. Moreover, lazy evaluation is about to come in Ruby 2.0 and I am very, very positive its use will become very widespread.

If you care about this and treat it as a problem, it’s probably just a matter of being afraid of criticism for not adhering to the rules. It’s so easy to follow that route and shun from challenging the state of things. I’ll make a conscious effort from today on using it unless someone raises a valid point against using these two functional approaches in Ruby code. And to start off I am going to give a brief overview of it in this post.

In any case, Ruby core thought it was a good idea and allows us to make curried Procs from 1.9 on. So here is what I understand as currying and partial application in Ruby, how to do it, you can draw your own conclusions.

Currying is a concept that allows a function that takes N parameters to be a composition of N functions, each of them take 1 parameter.

If you are looking at this and your eyebrow is raising,  bear with me for one minute.

Functions have arguments. When a function is called using all its arguments it means we are applying all of the arguments to the function we are calling to. In the non-curried function above, we are applying ‘x, y, z’ to the function.

Curried functions allows us to define new functions in terms of partially applied functions. A few examples will clarify how this is relevant and very useful feature in a programming language.

The original way (ML family of languages):

Many functional languages will let you write  “f x y z".  If you call “f x y" then you get a partially-applied function—the return value is a closure of lambda(z){z(x(y))} with passed-in the values of x and y to f(x,y).

The Ruby way:

Partial application is not as natural to write in Ruby as it is in Haskell or SML. Functions are not curried by default and in fact the way to do this is to curry the function ourselves, then define new partially applied functions upon the curried function. This is a simple function that only adds up numbers from a to b, applying f to them.

The power that partial application is giving us here is that we can very easily define new functions that build up on sum by using partial application. 

For instance,  we can partially apply the function f, and get a function ‘sum_of’_squares’ that only requires the start and end of the interval.

Or we can even partially apply the function f and the start of the interval a, and provide a more specific function:

Of course we can pass functions that remove prime numbers from the sum, start the interval or end it wherever we want[1]. These are all useful things when building a set of abstractions for your domain. Will you start to use them?

 

[1]: Actually no, we have only explored leftmost currying in this article. Rightmost currying is not currently implemented in Ruby, I’ll do my best to have it ready for 2.0.0

[2]: Last but not least, here’s an example of an interesting use of partial application in a context, by PragDave aka The Pragmatic Programmer.

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

114 Kudos

 

An overview of serialization formats. Numbers and anecdotes.

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

There are lots of format specifications to serialize your data against. These days, I have been looking for potential alternatives to YAML, which has been my go-to tool for a while, basically because since Rails decided to use YAML from its very beginnings, Ruby developers started to follow the leader and it’s pretty widely used. Funnily enough, YAML was born as a markup language, and developed into a data-oriented format. The main reason why I’m writing this blog post is so that I when I have to choose what serialization format to use, I can analyze the problem and see which format fits my problem best.

If you are not familiarized with serialization formats, just for the sake of making the article a little more engaging, here are some potential uses of serialized data:

  • Caches
  • Inter process communication.
  • Dummy objects for testing
  • Message brokers

Keeping this points in mind, let’s go on and analyze a few of the most promising serialization formats these days. I’d argue these list contains, (not limited to these) HDF5, BSON , MessagePack, YAML and Protocol Buffers. I would love to write about Thrift and Avro but I have no experience nor I know them very well, I might update the post with information about them in the future. Shoot me an email if you want to do it yourself! I don’t want to get into things like separating statically and dynamically typed formats, mainly because these formats are different enough to provide more significant differences in performance, space, and other metrics than just something that is (sort of) a preference. Now, the list:

HDF5

HDF5 is a hierarchical data format born more than 20 years ago in the NCSA (National Center for Supercomputing Applications). Born and bred in scientific environments, it’s not surprising that its user base are largely scientific laboratories and research institutes. Root, CERN’s data analysis framework, uses a variation of HDF5 for its own files, and it’s largely compatible with it. Not a human readable format, to me it’s more interesting feature is how it looks like it was designed for parallelizing IO operations. It does this by separating the data into chunks of related information, in a sort of nxn table. Then, any HDF5 reader can easily pick a chunk (a rectangle or a set of points) of this virtual table and start processing it. Another worker can be doing the same in other part of the table and so on. Neat, but datasets are large and they cannot be easily compressed because of this.

BSON

BSON is an attempt to binary serialize JSON documents. It shares some parts of the JSON spec but for instance it has added embeddable features like data types that are not part of JSON (Date, BinData). Funnily enough, BSON files is not always smaller than their JSON equivalents, but there is a good reason for this, traversability. The way BSON introduces overhead to improve access times is minimal and actually pretty easy to explain.

Take a JSON document like this:

Serializing this to BSON it will leave these numbers before the strings (it doesn’t do only this, but I want to focus on the overhead):

These markers are a very simplistic way of telling the BSON parser “hey, if you want to skip this segment, just advance your pointer X bytes to find something else”.
Therefore the efficiency of BSON lies on having a smart parser that can understand the code properly. As an example of the parser can optimize, think of the number 1 in javascript. Of course this needs to be stored as a number in BSON, but for certain numbers, the ASCII representation is best and you don’t need to waste 8 bytes per number. Your parser can figure things like this out, and probably people at 10gen and other places where mongodb is used, constantly find ways to improve the parser for a certain language.

MessagePack

MessagePack is not too different from BSON in the sense that both try to serialize JSON. Unlike BSON, MessagePack tries to keep a one to one correspondence between its spec and JSON’s spec, so there is no loss of data on the conversion and the format is more transparent. This affects space heavily, so for instance a trivial document like “{“a”:1, “b”:2}” is 7 bytes in MessagePack (19 in BSON). Such a simple thing as having extra metadata on the binary object (BSON) can help in some situations where the data is constantly changing, as it lets you change the values in place. MessagePack has to reserialize the whole object if a change needs to be made. This is just my personal opinion, but these differences are probably what makes MessagePack very suitable for networking and BSON is instead a format more suited to storage scenarios, like Mongodb.

YAML

YAML appeared as a human-readable alternative to XML. It is barely usable as a language for object serialization, but it’s worth mentioning why, as these same reason left out other possible candidates.
In the event of a network failure, YAML files might be transmitted but there is no way to tell whether whatever arrived to the other peer is correct or not. Most serialization formats simply break if you slice the file.
There is still no support for YAML schemas so two peers can exchange YAML files agreeing to a data exchange format. That renders it unuseful for RPC, message brokers and the like.

Protocol Buffers

A very smart way of define messages (but this works for any structured data) designed by Google. Instead of having a human readable format like some of the formats I mentioned previously, Protocol Buffers has source code files “.proto”s that have to be compiled into a binary object. It is mainly geared towards C++ programming but there are implementations in many languages. From my experience with Clojure’s library and other LISPs libraries are pretty much abandoned, while Ruby’s implementation is actively developed. I wouldn’t recommend anything but the official ones (Java, C, C++ and Python).

An example of a .proto file:

The way Protocol Buffers encodes the data has two aims. Consumers should be able to read the data produced by newer producers, by simply skipping unexpected fields. Consumers have to be able to find the end of a field without needing any metadata about the field. The whole encoding revolves about fixing this problem (varints, ZigZag coding). It’s called the binary wire format and in short uses varints to encode the data, varints are simply integers grouped in 7 bit sets, where the high bit (MSB) is the stop bit. The way negative values are handled are by ZigZag coding the 7 bit groups with the following function.

This basically renders -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so forth.

Strings are UTF8 encoded. Advantages of Protocol Buffers for RPC (as opposed to XML, which apparently was Google’s itch for PB) are way faster  encoding, smaller files, and more easy to use programatically (aka ‘we got rid of the infamous XMLController’).

Protocol Buffers encoding guide by Google

 

—————–

Other resources I found useful:

Binary Serialization Tourguide

Browser Physics – BSON, binary JSON, now for the web

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

163 Kudos

 

Code Tuning, a programming pearl in Ruby

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

After a few weeks at CERN of not much blogging, I have been mostly flat hunting at one of the most expensive cities in the world. Of course the real estate market is accordingly crazy. This means I spend a big chunk of my spare time on the trolley, which is okay because I borrowed one of the best programming books ever at CERN’s library, Programming Pearls. I like to read whichever column I feel like reading instead of reading it from cover to cover, and I stumbled upon “Code Tuning” today.

Being the practical guy I am, I thought that would make a good reading choice on my way back to help me focus on details. And hell it did. I would like to go over a very well known algorithm, Sequential Search, and introduce some cool twists that Jon Bentley shows on the book.

Given an array, a sequential search simply looks for a certain element of the array iterating over it. Easy right? A simple sequential search in ruby would be something like:

Can we do better? I thought not, but there are ways to improve this dramatically. Let’s benchmark for 1M, 10M and 50M elements.

It seems that it takes around ~103 (ns) * n on the worst case, I calculated this by dividing the benchmark for 1M by the number of elements, I ran all benchmarks for the last item of the array. Fairly normal, this will be the baseline for further improvements.

How can we slim down this loop a little bit? Easy, stop performing more tests than necessary. On an ideal version of a sequential search you should only test for the elements to be equal to the desired element. In reality every time this block executes itself, array.each is checking that we haven’t gotten to the end of the array.

Benchmarks for 1M, 10M and 50M.

Bentley dropped the run for a speedup of ~5%. It looks like tests are more computationally expensive in Ruby so now it runs in ~45 (ns) * n. This is a crazy good improvement of about 55% and still it looks pretty readable.

It looks like our bottom line (tests) cannot be improved, as you will always have to check against every element in the array until you get to it. Now what. Incrementing the index in every loop is an addition that shouldn’t cost much, but we can do better. Loop_unwinding/unrolling as I prefer to call it is the answer. I am not very positive that Ruby’s 1.9.2 default interpreter unrolls loops. And we’re going to find it right away, because if it doesn’t the reduction will be very sharp.

Benchmarks for 1M, 10M and 50M.

It went from ~45 (ns) * n to ~41 (ns) * n.  A slight improvement of 9%. I’m a bit disappointed after checking the book and seeing that reducing the overhead of incrementing the iterator led to a reduction of 56%. What puzzles me even more is that his argument is that loop unrolling can help to avoid pipeline stalls, reduce branches, and increase instruction-level parallelism.

Why is this clearly happening on Bentley’s benchmarks in C and if it’s happening, it doesn’t yield the same results for my Ruby implementation? No idea, so instead of keeping this private I’m going to share it so you guys can wander about it. I’ll just leave a quote of Don Knuth:

Premature optimization is the root of much programming evil; it can compromise the correctness, functionality and maintainability of programs. Save concern for efficiency for when it matters.

 

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

99 Kudos

 

hackNY Lessons: Week 8

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

We met Anthony Volodkin from Hype machine and Thatcher Bell, Greg Pass (former Twitter CTO) and Daniel Huttenlocher (Cornell’s Information and Computer Science dean). I had no idea they were trying to open a Cornell branch in NYC (CornellNYC Tech, in partnership with Technion, aka Israel’s MIT), but if someone had to run it, they chose the right person. Honestly after meeting several CS/Engineering deans this is by far the only one I’ve met who really thinks out of the.. university? Usually they are so centered in academia that the only things they see outside of it it’s consulting companies like McKinsey or Accenture. In my view that doesn’t necessarily mean they can’t run a CS school, but the problem is they fail to show all the options to their students body. I’m not into academia nor I think I’ll ever be but heck as soon as this becomes something real I will be on the first line to check it out.

Definitely one of the best talks -I’d say QA rather- so far at Art.sy headquarters. I also got the chance to talk to Daniel Doubrovkine, Art.sy’s head of engineering and a very cool Swiss individual who gave me good advice about CERN, Geneva, universities in Switzerland and such. I should stop and start off with the learnings because I’m getting too emotional at this point.

 

Tech learnings:

  • Project Euler from 6 to 8 in Clojure
  • Git cherrypicking
  • Lifebooker’s API
  • Handle JSON
  • When scaling something that cannot be stopped measure everything.
  • Chef vs Puppet
  • Learn something that will make you think different when it comes to programming. No, learning C# when you know Java doesn’t count.

Startup learnings:

  • Step away from some friends. I should add, learn what not to do from some friends. Sometimes people forget how important it is to know what not to do and not only what to do.
  • In order to build a really good ecosystem, new startups should have some sort of role model (HP/Google/Microsoft/etc..)
  • It looks like for some people it’s easier to find a problem/solve it/make profit than innovating. I personally
  • think releasing something truly innovative is several orders of magnitude harder than the former.
  • Big problems (self driving car) have such big barriers to entry that most people with an entrepreneurial drive decide to go for the easy profitable thing (Instagram). No shame on that. I heard a story this week about two guys who are almost 99% certain they can make nuclear energy with Thorium way more efficiently than the way its done right now. Cost of that prototype? 100M. YES A PROTOTYPE. A big chunk of that is lawyer/bureaucracy fees.

Life learnings:

  • Please prove me wrong but there’s no way to avoid losing a holyfuckton of money when buying foreign currencies.
  • Again, bureaucracy sucks on so many levels. Probably the barriers to entry that market are super high
  • but there is definitely a big problem on that ‘market’ whatsoever.
  • Also sending money to foreign banks is a pain in the ass. I might open an account on CapitalOne.. 0% on ATM’s overseas, no fees at all.
Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

11 Kudos

 

Project Euler 4 solution

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

I am doing Project Euler problems as a way to learn better idiomatic Clojure.

Today’s problem really made me realize how cool is clojure. Wording is the following

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 *99. Find the largest palindrome made from the product of two 3-digit numbers.

Easier than expected huh?

In short,  is-palindrome just checks for equality between a number and its reversed pair, if they are equal it returns the number, otherwise it returns nil.

The main function uses ‘for’ to multiply two ranges of numbers, i.e: 100*999, 100*998,… 101*999… This returns a lazy sequence that is evaluated against is-palindrome. That will return the same collection but with the value nil on all the numbers that are not palindromes. So we just filter numbers out, get the maximum value of the list of numbers with reduce, and we are all set. Such a great example of how lazy evaluation makes things a lot faster. Any thoughts on how to improve this solution?

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

11 Kudos

 

hackNY Lessons: Week 7

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

Chris Dixon and Josh Knowles were the talks for the week, being the former mostly QA and the second one probably the best presentation on Agile management concepts that I’ve ever seen. It was so much stuff in less than 1 hour and a half, I already know most of it but hell now that’s really ingrained in my brain. Chris Dixon put an emphasis on personal branding, having really clear goals for your startup.. oh wait maybe I should get started.

Tech learnings:

  • Chapter 1 of the Joy of Clojure. I read slow, mostly because I only read on the subway or when I’m waiting to go to work in the morning.
  • How to solve this with Clojure (will post the solution right after this post)
  • Codata, corecursion, f-algebras.. looks like I learnt more math than practical tech this week.
  • Apache routing to keep different Rails projects separate.

Startup learnings:

  • Ndas.. most ideas seem stupid to a person who is listening to pitches 24/7. It’s so rare that you tell someone an idea and that person copies it to the point it makes your idea a serious competitor against you.
  • Jesus christ getting people to use your stuff is hard. And I don’t even know where to look at to get better at this. I wrote this post on HN asking for help but nobody seemed to care much about it. I’m sure its a lot easier to look for a niche then build something people will use than trying to do innovative stuff.
  • You should of course focus on something and do it really well. You’ll get to a point where you can be ahead 99% of the people on something. Then take something new on a totally different topic and become good at it. All credits to Chris Dixon on this one.
  • Make really, really clear to your cofounders what do you want to do with your little baby that your startup is. It will be a major source of problems if you want to sell it to some Indian company in 5 years while your cofounder wants to keep running it.
  • If possible, avoid having a team who cannot communicate. That will kill your company. There are ways to build a startup from scratch where all employees are friends with each other and still they are ‘A’ players as they like to call themselves..

Life learnings:

  • I get frustrated when I have no side projects. Project Euler was a good mental outing this weekend, I guess I will continue until some idea pops up.
  • Too much bleach is never a good thing even if you can’t fit anything else on your washing machine.
  • Another headache on Saturday night for the same reasons. I should do something.
Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

7 Kudos

 

Corecursion, codata. Learning by teaching

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

Four weeks ago I started learning Clojure on my own. I took the bold move of building something awesome without even knowing the syntax, which I hope will eventually learn by doing an Amazon web spider and Project Euler problems.

So far it’s being both good and hard at the same time, I kind of want to push off and be done with some problems at my work at Lifebooker. This and all hackNY events, hackathons, and heck just trying to make the most of my time at NYC (at which I’m sort of failing) don’t really leave much room for experiments. Tonight I feel like I should rest and work in the morning rather than trying to do and accomplish nothing on my Amazon pet project. So instead I’ll be writing a blogpost about corecursion and learn from explaining it. Let’s get started by defining what recursion is.

My dearest friend Wikipedia says recursion is the process of repeating items in a self-similar way. Pretty accurate, but for our purposes and for anyone not familiar with the mathematical concept of self-similarity, let’s understand recursion as whenever a function, be it a mathematical function f(x) or a function in a programming language function(x), are expressed in terms of themselves.

In maths we understand a function defined in terms of itself as, for instance, f(x) = f(x-1) + 2. A simple function defined in terms of itself in java-like pseudocode would be something like,

Recursive functions usually have at least a base case, so they stop at a certain point and stop calling to themselves. The simplest example I can make up off the top of my head is this, given an integer argument,

This would just return a number after some calls, provided that you call it with an integer argument less than 10. Now that we have brushed up on our CS101, let’s bring up the real thing. Corecursion is essentially the same idea but applied to data structures instead of functions. If I had to put a wording for the problem, it would be something like:

Given a data structure, define a computation on that structure. Execute it, then repeat on result of this computation, and so on. It is similar to recursion, but it produces a sequence of data, with calculations taking place on demand when this data is dereferenced. It’s oriented on data, not the operations.

Codata

In order to understand corecursion, I found vital to understand codata. If you really wanna get into it, I would recommend reading this paper which teaches the basics of F-algebrasF-coalgebras, and duality (F meaning functor in this context). For the sake of simplicity, let us use two really basic data types, based on natural numbers, Nat and CoNat.

Nat (the natural numbers) is defined as:

At the end of the day they are created out of S and 0. Think of S as 1 and you could create all the numbers with this data type.

CoNat is almost identical to Nat but with an extra element.

An interesting corollary to this is that the latest element of the CoNat set doesn’t contain c0. Plus, it’s truly infinite, as opposed to the “latest” Nat which is just applying S and 0 over and over. If we were to define Nat, we could say something like

But when it comes to infinity, we cannot do infinity = Successor(infinity). Thus, infinity (ω) is NOT a member of this set. If it was, now consider the set N that has all the same elements as Nat except it does not have ω. Clearly Zero is in N, and if y is in N, Succ(y) is in N, but N is smaller than Nat which is a contradiction. Therefore ω is not in Nat. Data is always finite. Codata can be infinite. 

Taken word by word from reddit:

All types in functional programming have two “tools” for using them

  • an introduction principle, and
  • an elimination principle

The introduction principle lets us turn objects of some other type into objects of this type. The elimination principle lets us turn objects of this type into objects of some other type.

This should be your personal mantra. Examples of this are:

  • Function types – Lambda abstraction & Function application
  • Sum types – inl and inr & case expressions
  • Product types – the function (,) & fst and snd
  • Unit – the unit constructor () & case () of () ->
  • Bool – True and False & if-then-else statements
  • Empty – (no introduction rule) & reductio ad absurdum, the ability to promote _|_ to any other type.

A requirement for an introduction principle to be such, is that the introduction principle has to be able to turn some type into Nat. We do have a introduction principle for Nat. It is just our two old friends 0 and S (the constructors).

CoNat does NOT have an introduction principle as simple as its constructors. It’s introduction principle is

So esentially, coNatIntro f includes a -> CoNat . And that’s enough to ‘turn some type into CoNat’, so it’s an introduction principle.

Elimination principles work just the opposite way, they turn elements of a type into elements of some other type, so in our example they would have the form Nat -> something, CoNat -> something else. 

The elimination principle for our simple Nat type is:

Wait. Isn’t Nat -> a the opposite of a -> CoNat? Yup, so it is even clearer than they are elimination and introduction principles respectively. So it is proven that CoNat is codata, hence infinite but we can only see a finite amount of CoNat. This has some consequences, that can help understand corecursion and codata.

  • You can’t define total subtraction. (infinity – infinity will always loop).
  • You can’t write total equality. (it loops when comparing infinity == infinity).
  • You can’t write a total elem function that works on a CoList (it loops when the list is infinity long and the element isn’t in the list).

Corecursive data structures are defined in terms of itself, corecursive algorithms are defined in terms of its output rather than on its input.

This and lazy evaluation allows “infinitely” large objects without allocating this memory and obviously running out of it prematurely.

Corecursive binary trees

A complex yet easy to understand corecursive structure would be a binary tree. I won’t go in the details, but you can see how this would be defined in terms of a corecursive structure. Here is an example on how they take advantage of corecursion to traverse the tree.


Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

17 Kudos

 

hackNY Lessons: Week 6

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr

Last week we attended two awesome talks. First Peter Bell taught us how to choose your tech stack wisely and it was.. very insightful, especially after me failing to choose the right technologies for my projects. Also he gave lots of tips about hiring people, etc, which are reflected on this list. On Wednesday we met Joel Spolsky and gave us a talk about how StackExchange was built, mostly focusing on how good they are doing attracting knowledgeable people to their sites. Lots of learnings from that talk will show up in the list for this week too!

Tech learnings:

  • I learnt essential bits of Puppet and Chef, a soft introduction to what is expecting me in just two months at CERN.
  • Code reviews are vital. Even if the rest of your team is not familiar with what you have been working on, there will probably be a lot of stuff that you didn’t realize how dumb it looks until someone points it out or your read it several times. That or I’m retarded. (which is very plausible)
  • Git stash back, pull, then pop, will avoid A LOT of conflicts. Thanks Adam!
  • I just started the Joy of Clojure, I hope this will be enough to get myself to a level where I could just have to google for libraries and not for simple syntax questions. And its only 360 pages!

Startup learnings:

  • The coolness factor of a language is important when you want to hire the right people. It will be much harder to find a REALLY top notch PHP developer than to find a really good Racket developer.
  • How many Haskell developers COULD also write PHP? You should know the answer and yet be surprised about it.
  • Attracting people who care about your site and who care so much about what you do that they will come every day is much more important than getting a peak of audience on HN/reddit/whatever.
  • I’d take a close look at how new StackExchange boards are created to do that (if I ever had to do this)

Life learnings:

  • Maybe I should learn these posts every time that I feel like procrastinating, I haven’t accomplished much this weekend just because I wanted to get my laptop fan to be less noisy.
  • Sometimes I have no idea about how to get myself on to do stuff. And that led me to a headache on Saturday night.

 

Tweet about this on TwitterShare on RedditShare on FacebookBuffer this pageGoogle+share on Tumblr
 

6 Kudos