OMSCS: Year 1

Why

Three years after I started working at Red Hat, I realized I was getting very specialized in everything related to systems management. There was another area of this software field that I just didn’t understand: Machine Learning / AI. During my undergraduate years, I took courses on Knowledge Engineering and AI, but they did not take me anywhere near competent in a professional setting. For many years, I have been mesmerized by the things DeepMind, OpenAI, Tesla (autopilot?!) and others have done. Even just “simple” Counter-Strike bots seemed like magic. Over time, the outlook of “this is just magic, I don’t understand” made me itchy, and I had to get rid of it. It was time to dive in and learn about ML, image recognition, and the math and research behind them.

I applied, got in, and here’s the story about my first year at OMSCS.

How

It takes a lot of time. In a nutshell, I had to say goodbye to many weekends, to performing above and beyond at work, and to free evenings. Forget about lying down on the couch for hours. By my estimate, you should be ready to spend 2-3 years committed to this. Side projects while working full time in grad school? Grad school is your “side project”. If you enjoy programming (I do), you won’t even miss side projects after all the coding you will do. Most courses have a strict policy against sharing code for previous assignments, but some allow it.

I started with Advanced Operating Systems. Since I already had some experience at work with different OSs and virtualization, I thought it would be a great idea to learn more deeply about these topics in my first grad-level course. It was a fantastic choice. The pace was fine, and it helped me get familiar with reading research papers and with OS fundamentals. It brought me to a point where I was able to read “Linux Kernel Development” without major problems, and I got to dust off my C skills. As a kind of extracurricular, I recommend doing the Eudyptula challenge if you want to go from “I know basic C” to “I can contribute to the Linux kernel”.

In my second term, I made the choice of taking two courses at the same time: Data and Visual Analytics and Computational Photography. It was a terrible idea. Even if you have good time management skills, I found it too exhausting to do two courses plus a full-time job, which means you slowly grow more mentally tired until the end of the term. A tired mind affects all areas of your life; at home, and at my job, I was just not as “present” as I wanted to be. On the other hand, I learned about feature analysis and transformations, and brushed up on statistics and R. Computational Photography taught me a lot about how to work with pixels and 3D representations, and I learned how to use OpenCV to a level where I could be comfortable using it in a professional setting. Here’s what we did (source code available upon request, it’s forbidden to publicly share it):

 

In my last term so far, I took Reinforcement Learning. Summer terms are shorter, and RL lists Machine Learning, which I have not taken yet, as a suggested prerequisite. Going back to one course a term was a big relief. At the end of the previous term I slowly stepped down from certain lead responsibilities at Red Hat, as I was clearly not fit for them, not with the self-imposed burden of grad school. Not only has RL been my favorite class so far, but I’ve also been able to go out some Sundays and just… live life at a pace I could enjoy. Hell, years after watching A Beautiful Mind, I finally understood what a Nash equilibrium is, and other kinds of equilibria too. The assignments were very academic; in most of them we had to read a paper and replicate the results. We also got a chance to play with OpenAI Gym. It was so satisfying to train this Lunar Lander and watch it land. If you’re curious, I used deep Q-learning with neural networks to train it (we were free to experiment).


Academic papers always looked daunting to me. I had very little exposure to them when I finished my undergrad degree; I remember reading about 4 papers, maybe. After one year of reading a few papers every month, I published summaries of the ones I read and found interesting. Check out my repo paper-notes. This advantage of grad school is not to be taken lightly. Being able to stay on top of your field by following people on Google Scholar/Arxiv is very satisfying to me, and much saner than following Hacker News/Twitter/Reddit/<insert fad of the day>.

Some of the stuff learned in the courses will be applicable at your job right away. I’m not talking about the discipline, the grit, however you want to call it; I’m talking about specific technologies like Protocol Buffers, Keras, ggplot2, and more. The community of students is just great. If you are near other students in your class, you can always meet them in person, but even if that isn’t possible, the Slack chat and forums will help you bond with your peers.

Graduate school, even through distance learning, is turning out to be one of the most rewarding academic experiences I’ve ever had. I hope you are ready to apply now!

Value of re-reading books

If your goal is to learn, reading books is one of the best habits you could add to your life. It’s no wonder that all successful people (by a traditional measure, wealth: Gates, Buffett…) and all interesting people I know are voracious readers. I mostly concur with that opinion; learning by reading builds up like compound interest.

There’s another habit for which I don’t have a similar argument from authority to convince you to pick it up: make a conscious effort to re-read certain books.

Why does this work for me? Lessons from books are just not obvious at every point in your life. There are certain experiences you have to share with the author in order to relate to them better. Maybe it’s also that the books I keep on my ‘re-read list’ are very dense, and it seems impossible to remember everything they try to teach. From a more scientific point of view, spaced repetition has proven to be a great technique for retaining things in your memory.

The way I do it is certainly no-frills. I just keep certain books on my tablet (Kindle Fire HD 7″ with Android; the battery lasts like an ebook reader’s!) and my phone, and I read them using Bookari Premium (previously Mantano Reader). You can configure it to read books with a text-to-speech engine, so it works perfectly for me.
There isn’t much more to it than that. From time to time I get bored on a flight/train trip/beach/whatever, so I start reading, and these books are always with me.

Here are some of the books in my list:

Peopleware: Productive Projects and Teams (2nd edition) – Tom DeMarco, Timothy Lister
Practical Object-Oriented Design in Ruby – Sandi Metz
Team Geek – Brian W. Fitzpatrick, Ben Collins-Sussman
A guide to the good life: The ancient art of stoic joy – William B. Irvine
Apprenticeship Patterns – Dave Hoover, Adewale Oshineye
The algorithm design manual – Steven Skiena
Release it! – Michael T. Nygard
Programming Pearls (2nd ed) – Jon Bentley

Which ones have you found worth reading over and over?


PS: It’s been a long time since the last blog post, mostly because I thought of my blog as a place for ‘hardcore’ technical content. I don’t feel that should be the case any more, and I will not stop myself from writing about less ‘quantitative’ topics. What I aim for with this is to share more, and to learn more through this medium.

CoreOS cluster deployments with Foreman

As Major Hayden mentioned more than a year ago, deploying CoreOS is a bit of a different beast than deploying other operating systems. In this case, we are going to do it by PXE booting the image, then applying a cloud-config script which will set the SSH keys, the core user password, and the CoreOS version, and register the node in etcd.
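The heavy lifting is done by the cloud-config. As a rough sketch of the moving parts (the real template in community-templates is more complete), it could look something like this, using the ssh_authorized_keys and etcd_discovery_url parameters described below and Foreman’s root_pass helper:

#cloud-config
ssh_authorized_keys:
  - <%= @host.params['ssh_authorized_keys'] %>
users:
  - name: core
    passwd: <%= root_pass %>
coreos:
  etcd:
    # every node in the cluster registers against the same discovery URL
    discovery: <%= @host.params['etcd_discovery_url'] %>
  units:
    - name: etcd.service
      command: start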

We are going to pass parameters that will set these options. That way we can define a host group with certain parameters, such as the authorized keys, the etcd discovery URL, and the virtual disk. This will simplify booting hosts in our deployment, so that creating a new CoreOS node in the cluster is reduced to three clicks: New Host > Hostgroup: CoreOS cluster > Submit.

As in my previous tutorial for unattended Atomic deployments, I will assume you have Foreman installed and a PXE Smart Proxy in the network (or networks) where you want to launch your cluster. If not, please go to theforeman.org and get a default installation. In my opinion, Libvirt is the easiest way to get a PXE-enabled network.

The PXE templates are already in community-templates; make sure to add the snippet too. You can create them manually by going to Hosts > Provisioning Templates > New template. However, it is much easier to install the foreman_templates plugin and then run:

foreman-rake templates:sync

And you’ll get all of the templates in the community-templates repository.

Create a new operating system with the following options; in this case it will use CoreOS 647.0.0 from the stable channel.


Time to create the host group. Go to Configure > Host groups > New host group, and create a group with the following parameters. The network must be the one you can PXE boot on, and you should add a parameter ssh_authorized_keys whose value is your public SSH key, usually located in .ssh/id_rsa.pub. I did not add it here, as I have a global parameter ssh_authorized_keys with that value. Get a discovery URL by going to discovery.etcd.io/new, and put the value you got in etcd_discovery_url.


That should be enough for the group. Now create a new host in that cluster, and as soon as it boots, it will connect to etcd. Remember the URL we used for discovery? Go to that URL and you should see all hosts that have registered in the cluster.

Enjoy it, and please point out any mistakes in the comments section, or let me know on Twitter.

Unattended Atomic deployments with Foreman

Project Atomic is a new initiative to have a family of well-known, enterprise-tested operating systems ready for massive container deployments.

Atomic operating systems focus on:

  • Minimal size
  • Immutability
  • Easy updates and rollbacks
  • A container cluster and runtime out of the box (currently via Docker and Kubernetes)

It comes with a set of tools (ostree, fleet, kubectl) to manage your OS updates, network configuration, and cluster health. As I manage my virtual machines using Foreman, and I used to spend some time developing the Docker plugin, the project piqued my interest, as it helps me work with containers more efficiently.

For this deployment, I will assume you have a Foreman host with a Smart Proxy providing at least TFTP. I will be using domains in the examples, but you could use IPs instead. I think it could be possible to skip the TFTP part of PXE too, but I have not gotten that far yet. You need a subnet in which you can PXE boot hosts; an example of such a subnet on Libvirt can be found on Dominic Cleal’s blog.

Step one: download the Fedora 22 Atomic ISO or the RHEL 7 Atomic (installer) ISO to your Foreman host. In the Fedora case, it’ll be possible to fetch the content straight from the repo through ostreeupdate; however, for the moment we will need the image to get vmlinuz, initrd, and a few other files.

# wget https://dl.fedoraproject.org/pub/alt/stage/current/Cloud_Atomic/x86_64/iso/Fedora-Cloud_Atomic-x86_64-22.iso -O fedora-atomic.iso

Mount this image in a public directory so that it can be reached from the virtual machine. To keep any existing vmlinuz available for non-Atomic hosts, we will copy the one from the mounted ISO and rename it to vmlinuz_atomic. By default, /var/www/html/pub/atomic will work:

# mkdir /var/www/html/pub/atomic
# mount -o loop fedora-atomic.iso /var/www/html/pub/atomic/
# cp /var/www/html/pub/atomic/isolinux/vmlinuz /var/lib/tftpboot/vmlinuz_atomic

We now need an installation medium in Foreman pointing to this location. Go to Hosts > Installation media and create a mirror for it.


Create the operating system. We will come back to this operating system to associate it with the appropriate partition table and templates afterwards. For the moment, just make sure you choose the right major version: 7 for RHEL, 22 in the case of Fedora, as these are the only Atomic ones. Go to Hosts > Operating systems and click on New operating system.

Create a new partition table in Foreman to provide an initial /boot and / in the Atomic virtual machine.  Go to Hosts > Partition tables, and click on New partition table.

zerombr
clearpart --all --initlabel
part /boot --size=300 --fstype="ext4"
part pv.01 --grow
volgroup atomicos pv.01
logvol / --size=3000 --fstype="xfs" --name=root --vgname=atomicos

Head to /config_templates (or Hosts > Provisioning templates) and create a new PXE template. Choose a name, then click on the Type tab and select PXELinux. Associate the template with the operating system you created previously, and use this as the content:

DEFAULT pxeboot
TIMEOUT 20
PROMPT 0
LABEL pxeboot
kernel vmlinuz_atomic
append initrd=<%= @host.medium.path %>isolinux/initrd.img repo=<%= @host.medium.path %> ks=<%= foreman_url('provision')%> ks.device=bootif network ks.sendmac
IPAPPEND 2

Stay on provisioning templates and create the kickstart. To do so, click on New Template, choose the type ‘provision’, associate it with the operating system you created previously, and add this code in the editor:

lang <%= @host.params['lang'] || 'en_US.UTF-8' %>
keyboard <%= @host.params['keyboard'] || 'us' %>
timezone --utc <%= @host.params['time-zone'] || 'UTC' %>

# Partition table should create /boot and a volume atomicos
<% if @dynamic -%>
%include /tmp/diskpart.cfg
<% else -%>
<%= @host.diskLayout %>
<% end -%>


bootloader --timeout=3
<% if @host.operatingsystem.name =~ /.*Fedora.*/ -%>
ostreesetup --nogpg --osname=fedora-atomic --remote=fedora-atomic --url=<%= @host.medium.path %>/content/repo/ --ref=fedora-atomic/f<%= @host.os.major %>/<%= @host.architecture %>/docker-host
<% else -%>
ostreesetup --nogpg --osname=rhel-atomic-host --remote=rhel-atomic-host --url=file:///install/ostree --ref=rhel-atomic-host/<%= @host.os.major %>/<%= @host.architecture %>/standard
<% end -%>
services --disabled cloud-init,cloud-config,cloud-final,cloud-init-local
rootpw --iscrypted <%= root_pass %>

reboot

%post
(
# Report success back to Foreman
curl -s -o /dev/null --insecure <%= foreman_url %>
) 2>&1 | tee /mnt/sysimage/root/install.post.log

exit 0

%end

This template will pull the content from the --ref you specify, and the URL will be $FOREMANSERVER/pub/atomic/content/repo/. If you visit this URL, you should be able to find the docker-host file at the end of the hierarchy, specifically heads/fedora-atomic/f22/x86_64/docker-host. If you cannot find docker-host there because you’re using this tutorial for Fedora 23 and it has changed, I would recommend browsing your /pub/atomic folder to find the correct URL. Currently this is the structure for RHEL 7 and Fedora 22.

It’s time to associate these templates with the operating system, and deploy the host. Go back to Hosts > Operating systems, and click on your Atomic operating system to associate the templates:

We’re all set. Time to deploy the host. I’m doing this in Libvirt, as that’s where my PXE network is configured. Go to Hosts > New host and choose a name for your Atomic host. The Puppet options are irrelevant, as Puppet is not able to modify the Atomic ostree for the moment. Select the right domain and subnet to ensure you’re booting in a PXE-enabled network. Something I found is that Anaconda tends to get stuck when I try to provision Atomic systems with less than 1GB of RAM, so I would recommend assigning at least that amount of RAM to your Atomic host.

The operating systems tab should look similar to this. Remember this is network-based provisioning.

If everything went well, your system will PXE boot and start Anaconda right away. If you have VNC access to the machine, it will look similar to this:

After Anaconda finishes the installation, you should be able to SSH into that machine using the root password you provided in Foreman.

We’re done! I suggest you use this host as a Kubernetes master, minion, or as a Docker host. For now, I will investigate how to pass the proper parameters through Foreman to provide Atomic/Kubernetes cluster provisioning.

How I work

After breakfast, hopefully by 7:15am, I make a list of things I want to carry out that day. I write this list by hand on a notepad to always keep it in front of me [1]. It includes some tasks that might not have anything to do with work. I estimate how long each of these will take, so I know at the start of the day what I can realistically accomplish. Anything I haven’t done that is important, I move to the next day or add a reminder for in my calendar. Anything I haven’t done that is not overly important is left behind.

Task estimations allow me to say: “There’s a meeting in 15 minutes. Let’s pick a 10-minute task and do it.” This helps me avoid the paradox of choice. Without estimations, I would have hesitated for 5 minutes, picked one task, and most likely run late to the meeting so I could finish it. They also let me know beforehand how busy my day will be, and how much spare time I will get.

Aside from Japanese study, which I do every single day, I mark three tasks with a black line, meaning these are the 3 priorities for the day. Almost invariably, code reviews are one of them. I start the day off by doing around 30-40 minutes of email, but I don’t count it as part of the 3 priorities. I use offlineimap and mutt for that, and unless necessary, I don’t check email for the rest of the day. After lunch I might refresh to see if anyone needs something urgently, but that’s it.

I work in chunks of about 50 minutes, taking breaks of 5 minutes which I spend cleaning, reviewing kanji, playing with the cat, or just not sitting. Working for more than 1 hour will usually result in a longer break. I use Workrave to keep track of this, and it works like a charm. Even if I’m not working, I still operate in this fashion, as I know I have a tendency to get RSI or tendinitis.

The way I choose which work-related tasks to put on the list is defined by a pyramid I made. Top means most important, bottom means least important.

  • level 1 – Security bugs, CVEs
  • level 2 – Release blockers (upstream or downstream)
  • level 3 – Learning opportunities. Feature team work
  • level 4 – Features that will bring new users. Important bug fixes
  • level 5 – Features that might bring new users. Minor bug fixes

If it isn’t clear from the priorities ‘pyramid’, I’m mostly focused on getting more people to our projects and delivering timely, stable releases. That doesn’t mean I let minor bugs ‘starve’ in the queue, but it’s unlikely they would be solved by me if there is a release soon and other issues are blocking it. Like any human, there are times when I simply don’t have the energy to properly work on or review level 1-3 items, and I resort to simpler issues as a way to clear my mind and get something done.

Since I work on an open source project, reviewing contributions is also part of my work at Red Hat. It is not a minor part of my job; in fact, I probably spend a third of my working hours doing it. I apply the same priorities as listed above. However, I have not yet found a way to deal with IRC support and Redmine emails. For the former, I normally do it when I’m working on something that does not require my full attention, or if I see someone asking questions about topics I know few people are familiar with. I would love to find a way to process Redmine emails more efficiently. I used to check only closed issues and change the release tag if needed, but it was very time consuming, and I would miss support tickets and never learn about feature requests and bugs.

Any suggestions, criticisms or comments would be highly appreciated.

[1] Some people asked me why I would not use some TODO app to keep it in sync with my phone. The reason is simple: I want to always keep that list in mind when I’m in work mode, and I don’t want to look at it when I’m not. I do use Evernote to keep track of the yearly goals I set through a weekly system.

This is roughly a repost of an internal newsletter I run at Red Hat.

Thanks to Edmond Lau, whose book The Effective Engineer helped me to design this system.

High Availability and Configuration Management

Disclaimer: this is only meant to be a list of experiences and solutions. Ultimately, high availability depends a lot on the particular setup of your application, servers, architecture, etc. If you have had different experiences with these tools than those outlined here, or you feel that we are missing something, please comment or send me an email and I will update it.

I’ve noticed a lot of noise lately on how to successfully deploy large Puppet installations. Incidentally, the kind of places that need this always rely on other tools for inventory and provisioning at the very least. In this collaborative guide, I will try to explore as many paths as possible for a successful cloud deployment where we can put Puppet at the core of our configuration management. Some of these problems are central to configuration management itself, so if you are managing this in a data center by other means (Chef, by hand (no!), Ansible, Salt…), I bet you can get some takeaways from here. I assume the reader is more or less familiar with the basics of Puppet (server-client, modules and classes…); other tools will be explained in detail.

Puppet has not been around long enough for a single individual to have had experience with dozens of deployments, so if you feel something is wrong or you want to add anything, you can contact me (my email is on the left bar) and I will fix it.

Puppet

First off, there is a basic architectural consideration when puppetizing your servers. Once upon a time, when Puppet just came out, everyone assumed we were meant to have a master-client setup. Fundamentally, this means the master stores the configurations in /etc/puppet/environments/ and figures out how to compile a catalog containing the configuration each client needs. Clients request this compiled catalog and run it. Obviously this solution is taken straight out of the fabulous world of cotton candy and lollipops. It’s great because:

  • Everyone else is doing this! We can use everyone else’s manifests to set up our own puppet masters.
  • Scalable? Well, we can put a bunch of puppet masters behind a balancer. Apache mod_proxy or HAProxy are proven solutions for this. We will do this when our lonely Puppet master starts to blow up and drop requests because there’s just too much traffic or load.
  • We can use LVS/keepalived to set up permanent connections between the puppet clients in a data center and their closest master.
  • If we ever get to Google’s scale, we can provide DNS tiers that redirect clients to their closest master (by request round-trip time, or geographically).

Everything here sounds like bliss, doesn’t it? We do have a plan to overcome the issues of running a single Puppet master instance. We definitely do.
Here’s the thing: scaling a single instance to tens or hundreds will fix your pains, but it will only fix those related to having a single instance.
It’s quite complicated to plan in advance for the issues that having many puppet masters will bring. However, we can sort of use this post as a way to document our experiences and hopefully avoid some headaches for the folks deploying these installations now.
There are some metrics that can more or less help you figure out when you need new puppet masters:
  • catalog compilation time (one catalog per thread; the shorter the better, as the thread will be busy compiling and not taking in new catalogs)
  • request/s
  • last call to master from each node (will let you know when there are some waves of requests)
  • new signups (puppetca)

Rate monotonic scheduling

There is room for a quick project that measures the catalog compilation time and the time of the last call from each node to the master. It could be possible to prevent, or at least mitigate, the effects of a wave of puppet client requests to the master by using these two data points. You could plan this using MCollective, possibly scaling your puppet master infrastructure up and down depending on the time of day…
Funnily enough, real-time operating systems have a good answer to this.

Rate monotonic scheduling allows you to plan your client request waves and avoid DDoSing your puppet masters. Given a number of nodes (puppet clients) and a maximum number of requests all your puppet masters can handle at the same time (say 500), using these formulas you can come up with several periods and plan a massive puppet run for each of them.
Explaining how to schedule puppet runs using harmonic periods is a little out of scope for this post, but you can contact me privately if the RMS scheduling guides online don’t make sense to you.
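As a toy illustration, here’s a small Ruby sketch (with made-up numbers) that checks whether a set of puppet run periods and estimated catalog compilation times fits under the classic Liu and Layland utilization bound for RMS:

# Each node set runs puppet every `period` minutes, and each run keeps a master
# busy for roughly `cost` minutes (both numbers are made up for illustration).
node_sets = [
  { period: 25, cost: 2 },
  { period: 60, cost: 5 },
]

utilization = node_sets.inject(0.0) { |acc, s| acc + s[:cost].to_f / s[:period] }
n = node_sets.size
rms_bound = n * (2**(1.0 / n) - 1)   # Liu & Layland bound: n(2^(1/n) - 1)

puts "utilization: #{utilization.round(3)}, bound: #{rms_bound.round(3)}"
puts(utilization <= rms_bound ? "schedulable under RMS" : "may miss deadlines")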

An example of the problem that RMS fixes:

Puppet interval 1: 25 min
Puppet interval 2: 60 min

Node set 1: 00:00 -> 00:25 -> 00:50 -> 01:15
Node set 2: 00:15 ……………….. 01:15

At 01:15 you better have your pager on.

PuppetCA

PuppetCA is another part of Puppet that you might struggle to scale out. Even though you can share the CA files across all masters (using NFS, NetApp, or any other mechanism), it’s probably not a great idea; it’s a hack that will make your PuppetCA highly available.
Another hack is to autosign every new host that requests it, having a separate CA per puppet master. Anyway, no configuration will be applied to the host if the hostname is unknown, right? 😉
Puppetlabs’ take on this is to use a central CA. It’s not HA, but in case of failure, master/node connections keep working, since CA certificates and the CRL are cached. However, if the central CA fails, signing up new machines will not be possible until the CA master is manually restored.

DNS

Puppetlabs has sorted this out for us in Puppet 3 and later versions.
General guidelines:

  • 2.7.x:
    • Point a node to any working master using generic DNS, ‘puppet.redhat.com’
    • For multisite deployments, make all nodes within a site point to a local DNS for puppet masters, ‘ranana.puppet.redhat.com’… ugly and requires work.
  • 3.x:
    • SRV records! Nodes will pick the closest working master.
    • The algorithm prefers masters in your network.
    • Set up DNS with as many tiers as needed (global -> region -> data center -> puppet masters).

PuppetDB

PuppetDB essentially contains the real data from the nodes in your installation, be it facts, reports, or catalogs. It can be useful when you want to know where catalog X has been applied, etc.
The DB differs from (part of) Foreman’s DB in that Foreman stores the data expected to be at the nodes and then tells the Puppet master to make the nodes look like what you expect, instead of only consuming the data.
As far as I could tell from a presentation by Deepak Giridharagopal (thanks for the puppetdb_foreman mention!), Puppetlabs has some tools in the oven to replicate data from one PuppetDB daemon to another, so mirroring will allow other DBs to take over from the master in case of failure, among other strategies explained in that presentation. This is the blurriest component when it comes to scaling, in my (very little) experience with it, so any contributions in this area will be greatly appreciated.

Masterless – Distributed

Let’s try to list the pros and cons of each approach here.

Masterless pros

  • Computation is massively parallelized
  • Easy to work with when number of modules is small
  • No SPOF (using autosign)
  • Distribute Puppet modules via RPM to nodes using Pulp

Masterless cons

  • Hard to monitor and spot failures
  • Large puppet module code bases will be stored on each node?
  • Forces you to resort to Mco/Capistrano for management

Distributed pros 

  • Only choice when module repositories are big and being written to very often
  • Failure easier to manage because problems will be at known locations (instead of all nodes)

Distributed cons

  • Keeping Puppet masters modules in sync is quite hard
  • Git + rsync? NFS? GlusterFS? NetApp? Any success story?

Foreman

Foreman can and should be one of the central pieces if you want to save time managing your infrastructure. If you have several devops folks, developers… people who want to automatically get a provisioned virtual machine, you will need it. Thankfully, scaling it is considerably easier than scaling Puppet.

First off, you don’t want your Foreman UI or API to break down because a Puppet report broke Foreman. This is easy to solve using a service-oriented architecture, which in Foreman would look like:

  • ENC: critical service; better not cached, to avoid flipping changes back and forth.
  • Reports: YAML processing will be slow with very large reports.
  • UI/API: user-facing services; will get the least load.

This has the great benefit of allowing failure to happen in one service at a time. It is more or less easy to set up a balancer (HAProxy, Apache mod_proxy_balancer), and Passenger will also allow you to run your Foreman (or Puppet) multithreaded. I recommend at least two backends per service.
The architecture needed for smart proxies (DHCP, DNS, etc.) depends very much on where the services are located. Usually, you will want at least one (preferably two) smart proxy for each of your services, in each of your data centers.
Foreman is able to scale to multiple sites for most capabilities without new instances; provisioning, inventory, compute resources, etc. do not need to scale ‘geographically’.
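Going back to the balancer for the user-facing services, a minimal Apache mod_proxy_balancer sketch could look like this (hostnames are made up, and you still need the relevant mod_proxy modules loaded):

<Proxy balancer://foreman>
  # two Passenger-backed Foreman instances behind a single entry point
  BalancerMember http://foreman01.example.com
  BalancerMember http://foreman02.example.com
</Proxy>
ProxyPass        / balancer://foreman/
ProxyPassReverse / balancer://foreman/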

Thanks to Red Hat, CERN, and everyone else who has contributed to this post in some way.

Please contact (me @ daniellobato dot me), or comment below if you want me to update any part of this blogpost. I’ll be very happy to get some feedback.

Becoming a better software developer is like being in a maze

Nearly every time the word ‘metric’ comes up in software development, it’s to drop another diss about it. This post is not meant to rant about the latest fad on how to measure code quality (WTFs/minute), monitor your developers (ROI), or count your sushi.

Instead, I would like to share how complicated it is to measure software developers’ quality, and how I struggle to get better in some areas.

The wall

I struggle because either I or this profession lacks a standard ladder of merits to climb. It looks like a maze in some ways. Surely other developers trying to improve their skills can relate to the feeling of studying some exotic topic until you hit the wall and think: “How is this really making me improve as a software developer?”

At that point, you either retrace your steps and rethink what you want to learn next, or you break the wall and connect other paths to it. The decision is up to you, but there are a number of factors that can help you lean towards smashing the wall. I can’t really help with this, but here are some of the questions I pose to myself before continuing:

  • Are there plenty of positive reviews of the topic?
  • What do my friends say about it?
  • Is this something fundamental to Computer Science or some applied technology?
  • Will this broaden my knowledge in unexpected ways?

Again, these questions are just personal, and more often than not their answers are blurry when you have just hit the wall. However, I do think it’s a good exercise to outline what the benefits are of devoting a big chunk of your free time to learning something.

These questions help me decide whether the wall is worth breaking or not. Moreover, how do I know how thick it is, and when I will start to see paths cross?

Skill acquisition measurement

There are not many models that try to measure how far along we are with a certain skill. The Dreyfus model of skill acquisition and the four stages of competence are two possible ways of figuring it out. I see at least two cons in the way these models can be applied to software development.

First off, software development is a vast field. It covers everything from splay trees to linear bounded automata to device driver programming. The point is, it’s quite difficult to measure ‘raw software developer quality’. It’s not so difficult to measure skills.

Nonetheless, an expert in a skill is needed to measure that skill. I would venture to say most people exploring a field do not have access to experts who can let them know how much they have learned. Students get hopeless and walk back without proper knowledge of the topic, having mostly wasted a lot of time. I acknowledge, and am grateful, that universities and MOOCs help fix this issue, but I’d still love to find a way self-learners can mimic the experience themselves at home.

According to the Dreyfus model of skill acquisition, experts in a topic make up 1-5% of the population; they write the books, push the research boundaries, and have developed great intuition. I believe this intuition for solving problems is what makes the rest of the world believe they are geniuses.

In the end, no one is either an expert or a novice at all things software. Systematizing human knowledge into an array with defined boxes is hard. Hell, even knowing what ‘quantity’ to put in each of the boxes is really hard. Is it even worth worrying about this?

To put things in context, I am spending a lot of free time digging these walls:

  • SICP
  • Contributions to MRI
  • Conway’s game of life
  • Clojure

Before you head on to more interesting places, two questions:

  • How do you know whether a topic is worth exploring or not?
  • How do you know when should you stop digging and move on?

Some useful links:

Pragmatic thinking and learning

Programmer competency matrix

5 stages of programmer incompetence

The four stages of programming competence

Why is curry not popular among Rubyists?

I’ve been wondering about this one for a while. In fact, as much as I like functional programming, most of the time my Ruby functions are not curried or partially applied in any way. I guess this is because I have always thought of Ruby as a deeply object-oriented language, where absolutely everything is an object and Smalltalk is at its core. Even though I find myself writing lambdas and blocks all the time, like most Ruby programmers do, I’ve stayed away from curry and partial application in production code.

If I were asked why I do this, I would say it’s just a matter of following the best Rubyists’ style and keeping my code easy to understand for newbies. The former argument is weak at best (a fallacious argument from authority), while the second one is true, but I consistently use other advanced features that newbies won’t understand at first glance. Moreover, lazy evaluation is about to arrive in Ruby 2.0, and I am very, very positive its use will become widespread.

If you care about this and treat it as a problem, it’s probably just a matter of being afraid of criticism for not adhering to the rules. It’s so easy to follow that route and shy away from challenging the state of things. I’ll make a conscious effort from today on to use them, unless someone raises a valid point against these two functional approaches in Ruby code. And to start off, I am going to give a brief overview of them in this post.

In any case, Ruby core thought it was a good idea and lets us make curried Procs from 1.9 on. So here is what I understand as currying and partial application in Ruby and how to do it; you can draw your own conclusions.

Currying is a concept that allows a function that takes N parameters to be expressed as a composition of N functions, each of which takes 1 parameter.

If you are looking at this and your eyebrow is rising, bear with me for one minute.
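Here is what that looks like with a plain Ruby lambda (the arithmetic is just filler for the example):

f         = lambda { |x, y, z| x + y + z }   # non-curried: expects all three arguments at once
f_curried = f.curry                          # a chain of one-argument functions

f.call(1, 2, 3)      # => 6
f_curried[1][2][3]   # => 6, one argument at a time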

Functions have arguments. When a function is called with all its arguments, it means we are applying all of the arguments to the function we are calling. In the non-curried function above, we are applying x, y, z to the function.

Curried functions allow us to define new functions in terms of partially applied functions. A few examples will clarify why this is a relevant and very useful feature in a programming language.

The original way (ML family of languages):

Many functional languages will let you write f x y z. If you call f x y, then you get a partially applied function: the return value is a closure, something like lambda(z){ f(x, y, z) }, with the values of x and y already passed in.

The Ruby way:

Partial application is not as natural to write in Ruby as it is in Haskell or SML. Functions are not curried by default, and in fact the way to do this is to curry the function ourselves, then define new partially applied functions on top of the curried function. This is a simple function that only adds up the numbers from a to b, applying f to each of them.
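A sketch of what such a function and its curried version could look like:

sum = lambda do |f, a, b|
  (a..b).inject(0) { |acc, x| acc + f.call(x) }
end

sum_curried = sum.curry
sum_curried[->(x) { x }][1][10]   # => 55, same as sum.call(->(x) { x }, 1, 10)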

The power that partial application gives us here is that we can very easily define new functions that build on sum.

For instance, we can partially apply the function f and get a function ‘sum_of_squares’ that only requires the start and end of the interval.
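Something along these lines, building on the sketch above:

sum_of_squares = sum_curried[->(x) { x * x }]
sum_of_squares[1, 10]   # => 385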

Or we can even partially apply both the function f and the start of the interval a, and provide a more specific function:
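For example (again, just a sketch):

sum_of_squares_from_one = sum_curried[->(x) { x * x }, 1]
sum_of_squares_from_one[10]    # => 385
sum_of_squares_from_one[100]   # => 338350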

Of course we can pass functions that remove prime numbers from the sum, or start and end the interval wherever we want [1]. These are all useful things when building a set of abstractions for your domain. Will you start to use them?

 

[1]: Actually no, we have only explored leftmost currying in this article. Rightmost currying is not currently implemented in Ruby; I’ll do my best to have it ready for 2.0.0.

[2]: Last but not least, here’s an example of an interesting use of partial application in a context, by PragDave aka The Pragmatic Programmer.

An overview of serialization formats. Numbers and anecdotes.

There are lots of format specifications you can serialize your data against. These days, I have been looking for potential alternatives to YAML, which has been my go-to tool for a while, basically because Rails decided to use YAML from its very beginnings, Ruby developers followed the leader, and it’s now pretty widely used. Funnily enough, YAML was born as a markup language and developed into a data-oriented format. The main reason I’m writing this blog post is so that when I have to choose a serialization format, I can analyze the problem and see which format fits it best.

If you are not familiar with serialization formats, just for the sake of making the article a little more engaging, here are some potential uses of serialized data:

  • Caches
  • Inter-process communication
  • Dummy objects for testing
  • Message brokers

Keeping these points in mind, let’s go on and analyze a few of the most promising serialization formats these days. I’d argue this list contains (but is not limited to) HDF5, BSON, MessagePack, YAML, and Protocol Buffers. I would love to write about Thrift and Avro, but I have no experience with them nor do I know them very well; I might update the post with information about them in the future. Shoot me an email if you want to do it yourself! I don’t want to get into things like separating statically and dynamically typed formats, mainly because these formats are different enough to show more significant differences in performance, space, and other metrics than something that is (sort of) a preference. Now, the list:

HDF5

HDF5 is a hierarchical data format born more than 20 years ago at the NCSA (National Center for Supercomputing Applications). Born and bred in scientific environments, it’s not surprising that its user base is largely scientific laboratories and research institutes. ROOT, CERN’s data analysis framework, uses a variation of HDF5 for its own files, and it’s largely compatible with it. It is not a human-readable format; to me its most interesting feature is how it looks like it was designed for parallelizing IO operations. It does this by separating the data into chunks of related information, in a sort of n×n table. Then, any HDF5 reader can easily pick a chunk (a rectangle or a set of points) of this virtual table and start processing it, another worker can be doing the same in another part of the table, and so on. Neat, but datasets are large and they cannot be easily compressed because of this.

BSON

BSON is an attempt to binary-serialize JSON documents. It shares parts of the JSON spec, but it also adds embeddable features like data types that are not part of JSON (Date, BinData). Funnily enough, a BSON file is not always smaller than its JSON equivalent, but there is a good reason for this: traversability. The way BSON introduces overhead to improve access times is minimal and actually pretty easy to explain.

Take a JSON document like this (a made-up example; any small document will do):
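{
  "name": "foreman",
  "desc": "lifecycle tool"
}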

Serializing this to BSON it will leave these numbers before the strings (it doesn’t do only this, but I want to focus on the overhead):
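Roughly, and glossing over the finer points of the spec, the document above comes out as something like this (a hand-assembled sketch, so don’t hold me to every byte):

\x30\x00\x00\x00                      total document length (48 bytes)
  \x02 name\x00                       element type (UTF-8 string) and key
  \x08\x00\x00\x00 foreman\x00        value length (8) and value
  \x02 desc\x00
  \x0f\x00\x00\x00 lifecycle tool\x00
\x00                                  document terminator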

These markers are a very simplistic way of telling the BSON parser: “hey, if you want to skip this segment, just advance your pointer X bytes to find something else”.
Therefore the efficiency of BSON lies in having a smart parser that can understand the format properly. As an example of what the parser can optimize, think of the number 1 in JavaScript. Of course this needs to be stored as a number in BSON, but for certain numbers the ASCII representation is best and you don’t need to waste 8 bytes per number. Your parser can figure things like this out, and people at 10gen and other places where MongoDB is used probably find ways to improve the parser for a certain language all the time.

MessagePack

MessagePack is not too different from BSON in the sense that both try to serialize JSON. Unlike BSON, MessagePack tries to keep a one-to-one correspondence between its spec and JSON’s spec, so there is no loss of data in the conversion and the format is more transparent. This affects space heavily; for instance, a trivial document like {"a":1, "b":2} is 7 bytes in MessagePack (19 in BSON). Something as simple as having extra metadata in the binary object (BSON) can help in situations where the data is constantly changing, as it lets you change values in place; MessagePack has to reserialize the whole object if a change needs to be made. This is just my personal opinion, but these differences are probably what makes MessagePack very suitable for networking, while BSON is a format more suited to storage scenarios, like MongoDB.

YAML

YAML appeared as a human-readable alternative to XML. It is barely usable as a language for object serialization, but it’s worth mentioning why, as the same reasons ruled out other possible candidates.
In the event of a network failure, a YAML file might be transmitted, but there is no way to tell whether whatever arrived at the other peer is correct or not. Most serialization formats simply break if you slice the file.
There is still no support for YAML schemas, so two peers cannot exchange YAML files while agreeing on a data exchange format. That renders it useless for RPC, message brokers, and the like.

Protocol Buffers

A very smart way of defining messages (though it works for any structured data) designed by Google. Instead of a human-readable format like some of the formats mentioned previously, Protocol Buffers uses source files (“.proto”) that have to be compiled into a binary object. It is mainly geared towards C++ programming, but there are implementations in many languages. From my experience, Clojure’s library and other Lisps’ libraries are pretty much abandoned, while Ruby’s implementation is actively developed. I wouldn’t recommend anything but the official ones (Java, C, C++ and Python).

An example of what a .proto file might look like (a made-up message, proto2 syntax):
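// a hypothetical message; the numbers are field tags used on the wire
message Host {
  required string name       = 1;
  optional int32  memory_mb  = 2;
  repeated string interfaces = 3;
}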

The way Protocol Buffers encodes the data has two aims: consumers should be able to read data produced by newer producers, simply skipping unexpected fields, and consumers have to be able to find the end of a field without needing any metadata about it. The whole encoding revolves around solving this problem (varints, ZigZag coding). It’s called the binary wire format and, in short, it uses varints to encode the data; varints are simply integers grouped in 7-bit sets, where the high bit (MSB) of each byte signals whether more bytes follow. Negative values are handled by ZigZag coding the value with the following function before it is split into 7-bit groups.
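In Ruby, the 32-bit flavour of it might look something like this:

# ZigZag-encode a signed 32-bit integer: small magnitudes, positive or negative,
# end up as small unsigned values, so they fit in fewer varint bytes.
def zigzag32(n)
  (n << 1) ^ (n >> 31)
end

zigzag32(-1)  # => 1
zigzag32(1)   # => 2
zigzag32(-2)  # => 3
zigzag32(2)   # => 4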

This basically renders -1 as 1, 1 as 2, -2 as 3, 2 as 4 and so forth.

Strings are UTF-8 encoded. The advantages of Protocol Buffers for RPC (as opposed to XML, which apparently was Google’s itch for PB) are much faster encoding, smaller files, and easier programmatic use (aka ‘we got rid of the infamous XMLController’).

Protocol Buffers encoding guide by Google

 

—————–

Other resources I found useful:

Binary Serialization Tourguide

Browser Physics – BSON, binary JSON, now for the web

Code Tuning, a programming pearl in Ruby

After a few weeks at CERN without much blogging, I have been mostly flat hunting in one of the most expensive cities in the world. Of course, the real estate market is accordingly crazy. This means I spend a big chunk of my spare time on the trolley, which is okay, because I borrowed one of the best programming books ever from CERN’s library: Programming Pearls. I like to read whichever column I feel like reading instead of going cover to cover, and I stumbled upon “Code Tuning” today.

Being the practical guy I am, I thought it would make a good reading choice on my way back to help me focus on details. And hell, it did. I would like to go over a very well-known algorithm, sequential search, and introduce some cool twists that Jon Bentley shows in the book.

Given an array, a sequential search simply looks for a certain element by iterating over the array. Easy, right? A simple sequential search in Ruby would be something like this (my sketch of it, returning the index of the element or nil):
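def sequential_search(array, item)
  array.each_with_index do |element, index|
    return index if element == item
  end
  nil
end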

Can we do better? I thought not, but there are ways to improve this dramatically. Let’s benchmark it for 1M, 10M and 50M elements.

It seems to take around ~103 ns * n in the worst case. I calculated this by dividing the benchmark result for 1M by the number of elements; I ran all benchmarks searching for the last item of the array. Fairly normal, and this will be the baseline for further improvements.

How can we slim this loop down a little bit? Easy: stop performing more tests than necessary. In an ideal version of a sequential search, you would only test whether each element is equal to the desired element. In reality, every time this block executes, array.each is also checking that we haven’t gotten to the end of the array.
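My guess at what that looks like in Ruby, using the sentinel trick from the book (append the element we are looking for, so the loop is guaranteed to stop without an explicit end-of-array test):

def sequential_search_sentinel(array, item)
  array.push(item)             # the sentinel guarantees the loop terminates
  i = 0
  i += 1 until array[i] == item
  array.pop
  i < array.length ? i : nil   # index == length means we only found the sentinel
end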

Benchmarks for 1M, 10M and 50M.

Bentley’s change dropped the run time for a speedup of ~5%. It looks like tests are more computationally expensive in Ruby, so now it runs in ~45 ns * n. This is a crazy good improvement of about 55%, and it still looks pretty readable.

It looks like our bottom line (tests) cannot be improved, as you will always have to check against every element in the array until you get to it. Now what? Incrementing the index in every loop is an addition that shouldn’t cost much, but we can do better. Loop unwinding, or unrolling as I prefer to call it, is the answer. I am not very positive that Ruby’s 1.9.2 default interpreter unrolls loops, and we’re going to find out right away, because if it doesn’t, the reduction should be very sharp.
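My unrolled attempt, four comparisons per pass (the book unrolls more aggressively, if I remember correctly):

def sequential_search_unrolled(array, item)
  array.push(item)   # keep the sentinel from the previous version
  i = 0
  loop do
    if array[i] == item
      break
    elsif array[i + 1] == item
      i += 1
      break
    elsif array[i + 2] == item
      i += 2
      break
    elsif array[i + 3] == item
      i += 3
      break
    else
      i += 4         # only one index update for every four comparisons
    end
  end
  array.pop
  i < array.length ? i : nil
end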

Benchmarks for 1M, 10M and 50M.

It went from ~45 ns * n to ~41 ns * n, a slight improvement of 9%. I’m a bit disappointed after checking the book and seeing that reducing the overhead of incrementing the iterator led to a reduction of 56% there. What puzzles me even more is that his argument is that loop unrolling can help avoid pipeline stalls, reduce branches, and increase instruction-level parallelism.

Why does this clearly happen in Bentley’s C benchmarks, and why doesn’t it yield the same results for my Ruby implementation? No idea, so instead of keeping this private, I’m going to share it so you can ponder it. I’ll just leave a quote from Don Knuth:

Premature optimization is the root of much programming evil; it can compromise the correctness, functionality and maintainability of programs. Save concern for efficiency for when it matters.