Searching Gradients with Huy Nguyen

Why Uncertainty Matters in Deep Learning and How to Estimate It

2020-01-20T00:00:00-08:00

More trustworthy models
Uncertainty estimates, what are they good for?
Sources of uncertainty
How to estimate uncertainty (overview)
How good are these uncertainty estimates?
Moving beyond traditional leaderboard metrics
References

More trustworthy models

For safety-critical systems and infrastructure, you need to know when to trust a model’s prediction and when you should be more cautious about its output.

The problem with conventional deep neural networks is that they only provide point-estimates which are single predictive values given some input data. What you don’t get with these models is a measure of how uncertain they are for any given prediction.

While the softmax “probability” of a neural network classifier is commonly interpreted and used as a heuristic for uncertainty, scientists have shown that this measure often overestimates its confidence, even on examples that are unrecognizable when compared to a model’s training set.

What we really need is an accurate measure of uncertainty. Having one lets your model say “I don’t know” when it encounters a distribution of data that it wasn’t trained with or when it evaluates an example that can be interpreted in multiple ways.

Uncertainty estimates, what are they good for?

Accurate uncertainty estimates give you more flexibility when dealing with your model. You don’t have to always assign the same amount of trust to every model prediction, and likewise, you don’t always have to take the same action based upon those predictions.

For example, when uncertainty is high you can decide to:

Refrain from taking any action on the prediction at all.
Refer the particular piece of data to a human to make a final call.
Collect more examples that cause high uncertainty to retrain your model.
Implement a tiered prediction strategy by sending the data to a slower, but more accurate model, when uncertainty is high.

Blindly assigning the same amount of trust and action for every model prediction can lead to embarrassing, serious, or even fatal mistakes in your autonomous systems.

Sources of uncertainty

We have to understand where uncertainty comes from to know how we can account for it in our models. Let’s look at two major sources of uncertainty in predictive modeling below.

Aleatoric uncertainty is uncertainty arising from noisy data. The noise can come from the observations or from the underlying process itself. High aleatoric uncertainty indicates that small changes in the input data lead to large variances in the target data.

Noisy measurements of an underlying process leading to high Aleatoric uncertainty.

Even with error-less observations, a noisy underlying process can give rise to high Aleatoric uncertainty.

Epistemic uncertainty is uncertainty arising from a noisy model. Given the same input, high epistemic uncertainty indicates that small changes in model parameters give rise to large changes in model predictions. This type of uncertainty commonly occurs when models are evaluated on data whose distribution is different from the training data.

Epistemic uncertainty can arise in regions of the data space where there are few observations for training. This is because when there is little training data, many plausible model parameters may suffice for explaining the underlying ground truth phenomenon.

How to estimate uncertainty (overview)

There is ongoing research into how best to estimate uncertainty, and it is not yet a solved problem. Monte Carlo Dropout and Model Ensembling are methods that have garnered recent attention because they mesh well with existing neural network architectures and are less computationally demanding than other methods.

These methods try to account for both aleatoric and epistemic uncertainty.

Although the authors behind these methods give different motivations for their work, both approach the problem of estimating uncertainty using a similar strategy:

They first propose modifying your neural network to estimate probability distributions rather than point-estimates (we talk more on this for the tasks of regression and classification below). In doing so, we allow our models to capture Aleatoric Uncertainty during training.

Second, they propose that at prediction time, you sample and combine predictions from multiple realizations of your neural network to get a final prediction and its uncertainty. This procedure helps our networks capture Epistemic uncertainty.

Monte Carlo Dropout

The deep learning community often uses Dropout to prevent models from overfitting. The idea is easy to implement: randomly zero out activations in your neural network with rate $p$ at training time, and scale your activations by $1 - p$ (the probability of keeping a unit) at test time.

It turns out Dropout with some modifications is also useful for estimating uncertainty as described by Gal and Ghahramani. As long as your neural networks are trained with a few Dropout layers, you can use this method at prediction-time to obtain an estimate of uncertainty for your model.

The approach works by combining the predictions of several “realizations” of your neural network, which are essentially multiple forward passes of the same data point $\mathbf{x}$ through your network while applying different dropout masks $\mathbf{w}_t$ .

Unlike traditional Dropout networks, Monte Carlo Dropout (MC Dropout) networks apply dropout both at train time and at test time.

Algorithm:

Train a neural network $f_\theta(\mathbf{x})$ containing Dropout layers and a probabilistic loss appropriate for either your regression or classification task (see below) .
At test time, perform $T$ stochastic forward passes through $f_\theta(\mathbf{x})$ to obtain predictions for input $\mathbf{x}$ .
Depending on whether you are doing regression or classification, “combine” predictions as described below to obtain an Expectation-based prediction and uncertainty estimate.

Deep Ensembles

Another way to estimate uncertainty is by using model ensembling as described in the paper Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles.

The approach is quite similar to MC Dropout, and in fact, one way to interpret MC Dropout is to view it as a form of model ensembling.

The major difference with this approach is that rather than using a single trained network to make predictions with several randomly sampled dropout masks, we use $M$ trained models initialized from random starting points to collect our Monte Carlo samples.

Estimating uncertainty for regression

Defining the Probabilistic Loss Function: When training models for the regression task, we usually minimize the error between some target values $y$ and predicted values $\hat{y}$ using the Mean Squared Error loss.

To obtain uncertainty estimates with MC Dropout or Model Ensembling, however, we must take a more probabilistic view. Rather than predicting a single scalar value $\hat{y}$ , we assume our target data is normally distributed and predict a Gaussian distribution $\mathcal{N}$ parameterized with mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ .

\begin{equation} \hat{\mu}, \hat{\sigma}^2 = f_\theta(\mathbf{x}) \end{equation}

\begin{equation} p_{\theta}(y | \mathbf{x}) = \mathcal{N}(\hat{\mu}, \hat{\sigma}^2) \end{equation}

For our loss, instead of minimizing the difference between the predicted and target variable, we minimize the difference between our predictive distribution and the target distribution using the Negative Log Likelihood loss:

\begin{equation} - \log{p_{\theta}(y | \mathbf{x})} = \frac{\log{\hat{\sigma}^2}}{2} + \frac{(y - \hat{\mu})^2}{2\hat{\sigma}^2} + \text{const} \end{equation}

(As an aside, I find this loss quite fascinating. Notice here, we’re never explicitly providing the network an “uncertainty label” or target $\sigma^2$ . The network implicitly learns to capture the variance through the balance of the $\hat{\sigma}^2$ terms in the numerator and denominator.)
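To see that balance numerically, here is a small sketch in plain Python. The function name `gaussian_nll_scalar` and the numbers are illustrative assumptions, not code from the repository. For a fixed squared error, the loss is smallest when the predicted variance equals that squared error: predicting too little variance is punished by the second term, and predicting too much is punished by the first.

```python
import math

def gaussian_nll_scalar(y, mu, var):
    # negative log likelihood of y under N(mu, var), dropping the constant term
    return 0.5 * math.log(var) + (y - mu) ** 2 / (2.0 * var)

# fixed prediction error: (y - mu)^2 = 4
y, mu = 3.0, 1.0
losses = {var: gaussian_nll_scalar(y, mu, var) for var in (1.0, 4.0, 16.0)}

# the candidate variance that matches the squared error gives the lowest loss
best_var = min(losses, key=losses.get)
print(best_var)
```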

Making Predictions and Quantifying Uncertainty: Once we’ve trained our model, if we’re performing Monte Carlo Dropout, we sample dropout masks $\mathbf{w}_t$ and perform forward passes through $f_\theta(\mathbf{x}; \mathbf{w}_t)$ to obtain $T$ samples:

\begin{equation} \hat{\mu}_{t}, \hat{\sigma}_{t}^2 = f_\theta(\mathbf{x}; \mathbf{w}_t) \end{equation}

With these Monte Carlo samples $\hat{\mu}_{t}$ , $\hat{\sigma}_{t}^2$ in hand, we can now compute our final regression prediction $\hat{y}_{\ast}$ and its uncertainty $\hat{\sigma}_{\ast}^{2}$ :

\begin{equation} \hat{y}_{\ast} = \frac{1}{T}\sum_{t=1}^{T}{\hat{\mu}_{t}} \end{equation}

\begin{equation} \hat{\sigma}_{\ast}^{2} = \frac{1}{T}\sum_{t=1}^{T}{(\hat{\sigma}_{t}^2 + \hat{\mu}_{t}^2)} - \hat{y}_{\ast}^2 \end{equation}

If we’re using Model Ensembling, rather than performing $T$ forward passes through a single network with randomly sampled $\mathbf{w}_t$ dropout masks, we instead get our predictions from $M$ trained models whose parameters $\mathbf{\theta}_m$ are initialized to random starting points. Everything else remains the same.
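Because the combination step is the same for both methods, it can be written once. The sketch below uses NumPy; `combine_gaussian_predictions` is a hypothetical helper name, and the inputs stand for the $T$ (or $M$) sampled means and variances. It also shows that the combined variance splits exactly into the average predicted variance plus the spread of the sampled means, which is one way to read off the aleatoric and epistemic contributions.

```python
import numpy as np

def combine_gaussian_predictions(mus, variances):
    """
    mus, variances: arrays of shape [S, N] holding S sampled means and
    variances (S = T dropout passes or M ensemble members) for N inputs.
    """
    mus = np.asarray(mus, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)

    y_mean = mus.mean(axis=0)
    # same formula as in the text: mean(sigma^2 + mu^2) - y_mean^2
    y_var = (variances + mus ** 2).mean(axis=0) - y_mean ** 2

    # algebraically identical split of y_var into two parts
    avg_predicted_var = variances.mean(axis=0)   # noise the model predicts
    var_of_means = mus.var(axis=0)               # disagreement between samples
    return y_mean, y_var, avg_predicted_var, var_of_means

# two samples for a single input
y_mean, y_var, noise, disagreement = combine_gaussian_predictions(
    mus=[[1.0], [3.0]], variances=[[0.5], [0.5]])
```

In this toy case the two samples predict the same noise level but disagree on the mean, so most of the final variance comes from the disagreement term.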

Note: These formulations for regression are described in follow-on papers to the MC Dropout paper from Kendall and Gal; Lakshminarayanan et al. also derive the same formulation using Proper Scoring Rules.

Estimating uncertainty for classification

Defining the Probabilistic Loss Function: The great news is that for classification, we do not need to modify the loss in order to obtain meaningful uncertainty estimates using Monte Carlo Dropout or Model Ensembling. This is because the predictions of a conventional neural network classifier use the Softmax function, which already parameterizes a discrete probability distribution.

Likewise, the Cross Entropy loss used to optimize neural network classifiers is already minimizing the difference between our target and predictive distributions (it’s basically another name for the negative log likelihood). For these reasons, we can keep our classification network’s loss mechanics exactly the same.

Making Predictions and Quantifying Uncertainty: Once we’ve trained a standard network for classification, it is simple to obtain an expectation-based prediction and uncertainty estimate for our model.

If we are performing MC Dropout, to get a final prediction $\mathbf{\hat{y}}_\ast$ , we can average the predicted softmax probabilities over $T$ stochastic forward passes of the data $\mathbf{x}$ through our network $\mathbf{f}_{\theta}(\mathbf{x}; \mathbf{w}_t)$ , sampling a random dropout mask $\mathbf{w}_t$ for each pass:

\begin{equation} \mathbf{\hat{y}}_t = \mathit{Softmax}(\mathbf{f}_{\theta}(\mathbf{x}; \mathbf{w}_t)) \end{equation}

\begin{equation} \mathbf{\hat{y}}_\ast = \frac{1}{T} \sum_{t}{\mathbf{\hat{y}}_t} \end{equation}

We measure the uncertainty of our probabilistic prediction $\mathbf{\hat{y}_\ast}$ by computing its Entropy over its vector elements $\hat{y}_{\ast,c}$ :

\begin{equation} H(\mathbf{\hat{y}_\ast}) = - \sum_{c=1}^{C} \hat{y}_{\ast,c} \log{\hat{y}_{\ast,c}} \end{equation}
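As a rough illustration of these two steps, the NumPy sketch below averages the softmax outputs of several passes and computes the entropy of the result. `mc_classification_summary` is a hypothetical helper name, and the probabilities are made up for the example rather than produced by a real network.

```python
import numpy as np

def mc_classification_summary(probs):
    """
    probs: array of shape [T, C] with the softmax output of each of the
    T stochastic forward passes for one input.
    """
    probs = np.asarray(probs, dtype=np.float64)
    y_star = probs.mean(axis=0)
    # clip only inside the log so that classes with probability 0 contribute 0
    entropy = -np.sum(y_star * np.log(np.clip(y_star, 1e-12, 1.0)))
    return y_star, entropy

# two passes that confidently disagree: maximal entropy for 2 classes
y_star, entropy = mc_classification_summary([[1.0, 0.0], [0.0, 1.0]])
# two passes that confidently agree: entropy of zero
_, entropy_agree = mc_classification_summary([[1.0, 0.0], [1.0, 0.0]])
```

Note how each individual pass here looks fully confident; it is only after averaging that the disagreement shows up as high entropy.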

Sample code

We implement TensorFlow 2.0 code to perform Monte Carlo Dropout and Model Ensembling for both classification and regression in the following repository:

https://github.com/huyng/incertae

Let’s take a quick look at what the code does.

We’ll first define our model below:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import Sequential

model = Sequential([
    Dense(20, activation='relu'),
    Dropout(.5),
    Dense(20, activation='relu'),
    Dropout(.5),
    Dense(2, activation=None),
])

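Notice that rather than outputting a single unit, we’re outputting 2 units at the end of the network for the parameters of our estimated Gaussian distribution, $\hat{\mu}$ and $\hat{\sigma}^2$ . As the loss below notes, the second unit actually predicts $\log \hat{\sigma}^2$ to make training more stable.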

We’ll now define the loss for the regression task based on the equations above:

def gaussian_nll(y_true, y_pred):
    """
    Gaussian negative log likelihood

    Note: to make training more stable, we optimize
    a modified loss by having our model predict log(sigma^2)
    rather than sigma^2.
    """

    y_true = tf.reshape(y_true, [-1])
    mu = y_pred[:, 0]
    si = y_pred[:, 1]
    loss = (si + tf.math.squared_difference(y_true, mu)/tf.math.exp(si)) / 2.0
    return tf.reduce_mean(loss)

model.compile(loss=gaussian_nll, optimizer='sgd')

Finally, we’ll define our prediction function that will provide us with both an uncertainty estimate and an expectation-based prediction from our model.

import numpy as np

def predict(model, x, T=20):
    '''
    Args:
        model: The trained keras model
        x: the input tensor with shape [N, M]
        T: the number of monte carlo trials to sample
    Returns:
        y_mean: The expected value of our prediction
        y_std: The standard deviation of our prediction
    '''
    mu_arr = []
    si_arr = []

    for t in range(T):
        y_pred = model(x, training=True)
        mu = y_pred[:, 0]
        si = y_pred[:, 1]

        mu_arr.append(mu)
        si_arr.append(si)

    mu_arr = np.array(mu_arr)
    si_arr = np.array(si_arr)
    var_arr = np.exp(si_arr)

    y_mean = np.mean(mu_arr, axis=0)
    y_variance = np.mean(var_arr + mu_arr**2, axis=0) - y_mean**2
    y_std = np.sqrt(y_variance)
    return y_mean, y_std

I know it might seem odd that we’re setting training=True when using the model to predict, but this is how the Keras framework determines that it needs to sample a random dropout mask when making a forward pass.

Let’s use our function now:

y_mean, y_std = predict(model, x)

Here, y_mean is the expected value of our estimated distribution, and y_std is its standard deviation, which we can use as our uncertainty estimate.

How good are these uncertainty estimates?

We can review the experiments in A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks to get a sense of how well the proposed uncertainty estimates capture the concept of uncertainty.

In this paper, the authors train an image classifier to predict whether a patient suffers from Diabetic Retinopathy based on pictures of the patient’s retina. They train models using both techniques discussed in this blog post (in addition to a few other techniques used for estimating uncertainty).

To test whether their uncertainty estimates meaningfully capture uncertainty, they propose a simple evaluation protocol:

Refer a fixed percentage of the test dataset to an expert human oracle by sweeping a threshold over the uncertainty estimates provided by their models.
Report their model’s accuracy on the remaining retained data split (i.e. the samples that were not referred).

The idea behind this protocol is that an accurate measure of uncertainty would prioritize referring out the examples with high uncertainty, and as a result, the retained data would theoretically only contain examples that the model can predict with higher accuracy.
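A minimal NumPy sketch of one step of this protocol is shown below. It is not the paper’s evaluation code: `accuracy_after_referral` is a hypothetical helper, and the uncertainty scores and correctness flags are toy values chosen so that the two most uncertain examples are also the two wrong ones.

```python
import numpy as np

def accuracy_after_referral(uncertainty, correct, referral_fraction):
    """
    uncertainty: array of shape [N], higher means less certain
    correct: array of shape [N], 1 if the model's prediction was right, else 0
    referral_fraction: fraction of examples handed to the human oracle
    """
    uncertainty = np.asarray(uncertainty, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    n = len(uncertainty)
    n_referred = int(round(referral_fraction * n))
    if n_referred >= n:
        raise ValueError("at least one example must be retained")

    # keep the examples with the lowest uncertainty, refer the rest
    order = np.argsort(uncertainty)
    retained = order[: n - n_referred]
    return correct[retained].mean()

uncertainty = [0.1, 0.9, 0.2, 0.8]
correct = [1, 0, 1, 0]
acc_all = accuracy_after_referral(uncertainty, correct, 0.0)
acc_half = accuracy_after_referral(uncertainty, correct, 0.5)
```

Sweeping `referral_fraction` from 0 towards 1 and plotting the returned accuracy gives a curve of the kind the protocol describes.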

Here are the results of running this protocol on the Diabetic Retinopathy dataset:

Plot of model accuracy on test data as the model refers fewer samples (based on the model's estimate of uncertainty) to a human oracle. For meaningful measures of uncertainty, we see accuracy increase as we decrease the amount of retained data. Left: test images come from the same machine type used for the training data. Right: test images come from a different machine type than the one used for the training data. [Filos et al.]

In the plot above, the authors evaluate their ability to estimate uncertainty both when the test data comes from the same distribution as the training data and when the test data comes from a different distribution.

As a sanity check we can look at the “Random Referral” baseline. Here, samples are chosen for referral at random regardless of their estimated uncertainty. As expected, randomly choosing samples neither increases nor decreases accuracy as we refer more samples to the human oracle. In contrast, both the MC Dropout and Model Ensembling methods increase accuracy as they refer more examples, meaning their measures of uncertainty are finding examples that they are likely to get wrong when making a prediction.

The “Deterministic” baseline uses a single neural network to make point-estimate predictions and computes its estimate of uncertainty using entropy. In other words, it’s the conventional neural network that everyone is used to working with. It does much worse than MC Dropout and Model Ensembling for both in-distribution and out-of-distribution test sets. Interestingly, for out-of-distribution data, using its uncertainty estimates is no better than random referral.

These experiments make a strong case for implementing either MC Dropout or Model Ensembling to obtain more accurate uncertainty estimates. If you have the computational budget, combining both approaches (i.e. “Ensemble MC Dropout”) could yield the best results.

Moving beyond traditional leaderboard metrics

I wrote this article because in a world driven by leaderboard AUC and AP metrics, it was worth pointing out that there are other measures, specifically the quality of your model’s uncertainty estimates, that matter for production environments.

We have to know when to trust our models as much as we have to find models with the highest possible accuracy. Uncertainty quantification gives us this ability and it gives us more flexibility for deciding what to do with our model predictions.

While the article only scratches the surface of the field, hopefully you now have some basic tools to quantify uncertainty and you understand why uncertainty is so critical for developing trustworthy models.

References

Are your models calibrated?

2019-02-08T00:00:00-08:00

A calibration curve (sometimes called a “reliability diagram”) tells you whether your model’s predicted probabilities accurately reflect the real chance of your model being right.

It’s very common for neural networks to overestimate the confidence in their predictions, and this type of diagram helps us detect when this phenomenon occurs. Here’s an example:

On the x-axis we have our model’s predicted confidence. On the y-axis we plot the model accuracy given its predicted confidence. We can see from this particular diagram that the model is “overconfident” when it makes a prediction in the range between 0.5 and 0.7.

Calibration curves for multiclass classifiers

Scikit-learn provides a function to compute calibration curves for binary classification problems. However, in many cases we want to obtain the calibration curve for a model that makes predictions for more than 2 classes.

We can look to Guo et al. to see how they generate their calibration curve plots.

They propose “binning” all predicted confidences into $M$ equally wide bins, where $B_m$ is the set of indices of samples whose confidence falls into the interval $I_m = (\frac{m-1}{M}, \frac{m}{M}]$ .

For each bin we can compute the bin accuracy (which is the y-axis on our graph) using the following formula:

$acc(B_m) = \frac{1}{\vert B_m \vert} \sum_{i \in B_m}{\unicode{x1D7D9}(\hat{y}_i = y_i)}$

Here, $\unicode{x1D7D9}(\hat{y}_i = y_i)$ is $1$ if the example label $y_i$ belongs to the same class as the prediction $\hat{y}_i$ , and $0$ otherwise.

To sum up, we compute the y-axis of our plot by first segmenting all predicted confidence scores into $M$ bins. Each of these prediction scores is associated with a class $c$ . For each bin, we count the number of examples whose labels match the class $c$ associated with our predicted score and divide by the total count of items in the bin.
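To make the formula concrete, here is one reading of it in NumPy that bins only the confidence of the top predicted class. `top_label_bin_accuracy` is a hypothetical helper name and the probabilities are toy values; the full code further down takes a slightly different route and bins every class probability.

```python
import numpy as np

def top_label_bin_accuracy(probs, labels, n_bins=10):
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)      # confidence of the predicted class
    predictions = probs.argmax(axis=1)   # predicted class
    is_correct = (predictions == labels).astype(np.float64)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    accuracies = np.full(n_bins, np.nan)  # NaN marks an empty bin
    for m in range(n_bins):
        # interval I_m = (edges[m], edges[m + 1]]
        in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
        if in_bin.any():
            accuracies[m] = is_correct[in_bin].mean()
    return accuracies

probs = [[0.90, 0.10], [0.75, 0.25], [0.35, 0.65], [0.45, 0.55]]
labels = [0, 1, 1, 0]
acc = top_label_bin_accuracy(probs, labels, n_bins=5)
```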

Code

Here is the code to compute and plot the calibration curves for your models in matplotlib.

import numpy as np
import matplotlib.pyplot as plt

def multiclass_calibration_curve(probs, labels, bins=10):
    '''
    Args:
        probs (ndarray):
            NxM predicted probabilities for N examples and M classes.
        labels (ndarray):
            Vector of size N where each entry is an integer class label.
        bins (int):
            Number of bins to divide the prediction probabilities into.
    Returns:
        midpoints (ndarray):
            Midpoint value of each bin
        accuracies (ndarray):
            Fraction of examples that are positive in bin
        mean_confidences:
            Average predicted confidences in each bin
    '''
    step_size = 1.0 / bins
    n_classes = probs.shape[1]
    labels_ohe = np.eye(n_classes)[labels.astype(np.int64)]

    midpoints = []
    mean_confidences = []
    accuracies = []

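    for i in range(bins):
        beg = i * step_size
        end = (i + 1) * step_size

        # bins are [beg, end); the last bin is closed so that p == 1.0 is counted
        if i == bins - 1:
            bin_mask = (probs >= beg) & (probs <= 1.0)
        else:
            bin_mask = (probs >= beg) & (probs < end)
        bin_cnt = bin_mask.astype(np.float32).sum()
        bin_confs = probs[bin_mask]

        # an empty bin has no defined accuracy or confidence, so report NaN
        if bin_cnt > 0:
            bin_acc = labels_ohe[bin_mask].sum() / bin_cnt
            bin_conf = np.mean(bin_confs)
        else:
            bin_acc, bin_conf = np.nan, np.nan

        midpoints.append((beg+end)/2.)
        mean_confidences.append(bin_conf)
        accuracies.append(bin_acc)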

    return midpoints, accuracies, mean_confidences

def plot_multiclass_calibration_curve(probs, labels, bins=10, title=None):
    '''
    Plot calibration curve
    '''
    title = 'Reliability Diagram' if title is None else title
    midpoints, accuracies, mean_confidences = multiclass_calibration_curve(probs, labels, bins=bins)
    plt.bar(midpoints, accuracies, width=1.0/float(bins), align='center', lw=1, ec='#000000', fc='#2233aa', alpha=1, label='Model', zorder=0)
    plt.scatter(midpoints, accuracies, lw=2, ec='black', fc="#ffffff", zorder=2)
    plt.plot(np.linspace(0, 1.0, 20), np.linspace(0, 1.0, 20), '--', lw=2, alpha=.7, color='gray', label='Perfectly calibrated', zorder=1)
    plt.xlim(0.0, 1.0)
    plt.ylim(0.0, 1.0)
    plt.xlabel('\nconfidence')
    plt.ylabel('accuracy\n')
    plt.title(title+'\n')
    plt.xticks(midpoints, rotation=-45)
    plt.legend(loc='upper left')
    plt.tight_layout()
    return midpoints, accuracies, mean_confidences

Similarity search 101 - Part 2 (Fast retrieval with vp-trees)

2014-03-26T00:00:00-07:00

This is the 2nd article in a two part series on similarity search. See part 1 for an overview of the subject.

In this second installment of my series on similarity search we’ll figure out how to improve on the speed and efficiency of querying our database for nearest neighbors using a data structure known as a “vantage point tree”.

We previously used a brute force approach by computing pairwise distances between our query and all points in our dataset so that we could find items that were close to it.

Unfortunately, this technique scales in $O( \# examples \times \# features)$ time which is prohibitively expensive on even modestly sized datasets.

Kd-trees, and more recently vantage point trees (a.k.a vp-trees), have gained popularity within the machine learning community for their efficacy in reducing the computational cost of similarity search over large datasets.

For this article, we’ll focus on examining how a vp-tree works.

What is a vantage point tree and how do we construct one?

In a nutshell, a vantage point tree structure allows us to store the elements of our dataset in such a way that during query time, we can quickly exclude from examination large portions of our data without having to perform any distance computations on the elements of those excluded portions.

Let’s take a look at the basic structure of a vp-tree because it will allow us to understand how we can prune data from a search at query time.

By definition, each node in a vp-tree stores at a minimum 5 pieces of information:

A list of elements sampled from our dataset
A vantage point element chosen randomly from the list of elements above
A distance called mu
A “left” child node
A “right” child node

I’ll explain soon how all of these components relate, but in the meantime here’s an illustration of the vp-tree concept:

At the root node of our tree, the list of elements consists of every single item in our data set. From this list of items, we choose one element and designate it as our vantage point.

To choose $mu$ , we compute the median distance between our vantage point $vp$ and all other elements $P$ in the current node .

$mu = median(\ dist(vp, p)\ )\ \ \ \forall p \in P$

We select all points within a distance $mu$ from the vantage point to assign elements to the left child node. And similarly, we can assign all points outside of $mu$ to the right child node.

Since $mu$ is the median distance between the vantage point and all other points, this procedure effectively divides into half the elements we assign to the left and right child nodes.
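Here is a small worked example of a single split in plain Python, using made-up 2-d points and the Euclidean distance. Sending points that lie exactly at distance $mu$ to the right child is an assumption of this sketch.

```python
import math
import statistics

vp = (0.0, 0.0)
others = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]

# distance from the vantage point to every other element
dists = [math.dist(vp, p) for p in others]
mu = statistics.median(dists)

left = [p for p, d in zip(others, dists) if d < mu]    # inside the mu radius
right = [p for p, d in zip(others, dists) if d >= mu]  # on or outside it
```

The distances are 1, 2, 3 and 4, so $mu$ is 2.5 and each child receives two of the four points.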

Finally, to construct the rest of the tree, we recursively follow this same procedure for each child node, until there are no more elements to assign to child nodes.

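Here is a Python sketch of how to build the tree, where distance can be any function that returns the distance between two elements:

import random
import statistics

class VPNode:
    def __init__(self, elements):
        self.elements = elements   # elements assigned to this node
        self.vp = None             # vantage point chosen from elements
        self.mu = None             # median distance from vp to the other elements
        self.left_child = None     # subtree of elements with distance < mu
        self.right_child = None    # subtree of elements with distance >= mu

def build_vp_tree(elements, distance):
    # base case: no elements left to assign to a node
    if not elements:
        return None
    node = VPNode(elements)
    i = random.randrange(len(elements))
    node.vp = elements[i]
    others = elements[:i] + elements[i + 1:]
    if not others:
        return node
    dists = [distance(node.vp, e) for e in others]
    node.mu = statistics.median(dists)
    # elements exactly at distance mu go to the right child so none are dropped
    left_elements = [e for e, d in zip(others, dists) if d < node.mu]
    right_elements = [e for e, d in zip(others, dists) if d >= node.mu]
    node.left_child = build_vp_tree(left_elements, distance)
    node.right_child = build_vp_tree(right_elements, distance)
    return node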

Nearest neighbor search with the vantage point tree

For a dataset encoded as a vantage point tree and a query point $q$ , how can we find the closest $k$ points in our dataset without running distance computations for every single element?

One approach we could take is to say that for every $q$ there is some threshold distance $tau$ where all of its closest $k$ neighbors are contained within this threshold. You can imagine this area as an enclosed circle as depicted below:

There are three scenarios for how this query-tau area can relate to any node within our vantage point tree.

Pruning the left child node

The first scenario is if the area lies completely outside of our vantage-point-mu radius as depicted below. If this is the case, we can safely assume that if we are to find $q$ ’s nearest neighbors we can forego looking in our node’s left child, which contains all elements within the mu radius of this vantage point.

Pruning the right child node

The next scenario is when the query-tau area lies completely inside the bounds of the vantage point’s mu-radius (see below). In this case, we can ignore all points outside of $mu$ which we had conveniently assigned to the right child node.

Worst case, we check both left and right child nodes

What happens when the query-tau area partially intersects with our node’s vantage point’s mu-radius?

In this scenario, we can’t say whether the right or left child contains the nearest neighbors, so we have to search both nodes.

Traversing the tree to find nearest neighbors

To summarize, when the query threshold area is completely outside the bounds of our node’s vantage-point mu boundary, we can exclude or “prune” the left child node from our search space. When the query threshold is completely inside the bounds of the vantage-point mu boundary, we can safely ignore the right child node. And finally, when neither is the case, we must search both left and right child nodes.
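These three scenarios reduce to two comparisons between the distance $d$ from $q$ to the vantage point, the node’s $mu$, and the current $tau$. The plain-Python sketch below spells them out; `children_to_search` is a hypothetical helper name used only for illustration.

```python
def children_to_search(d, mu, tau):
    """
    d:   distance from the query q to the node's vantage point
    mu:  the node's median distance
    tau: current search radius around q
    """
    children = []
    # the query-tau area reaches inside the mu radius
    if d < mu + tau:
        children.append("left")
    # the query-tau area reaches outside the mu radius
    if d >= mu - tau:
        children.append("right")
    return children

outside = children_to_search(d=10.0, mu=3.0, tau=2.0)     # prune left
inside = children_to_search(d=0.5, mu=3.0, tau=1.0)       # prune right
straddling = children_to_search(d=2.5, mu=3.0, tau=1.0)   # search both
```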

Now that we know how to behave when examining a single node, we can use this knowledge to find $q$ ’s nearest neighbors by recursively shrinking $tau$ as we search down the vantage point tree.

More concretely, we initialize $tau$ to be infinity. And as we traverse from the root node to each child node of the vp-tree, we set $tau$ to be equal to the lesser of the distance from $q$ to $vp$ or any previously seen $tau$ .

import heapq
from collections import deque

def find_nearest_neighbors(tree, q, k, distance):
    """
    tree     = root node of the VP tree
    q        = query point
    k        = # of nearest neighbors you want to find
    distance = function that returns the distance between two points
    """

    tau = float('inf')
    nodes_to_visit = deque([tree])

    # max-heap of (-distance, tiebreaker, point) that holds at most
    # the k closest points seen so far
    neighbors = []

    while len(nodes_to_visit) > 0:
        node = nodes_to_visit.popleft()
        if node is None:
            continue
        d = distance(q, node.vp)

        if d < tau:
            # store node.vp as a neighbor, evicting the farthest
            # neighbor if we now hold more than k
            heapq.heappush(neighbors, (-d, id(node), node.vp))
            if len(neighbors) > k:
                heapq.heappop(neighbors)

            # shrink tau, but only once we actually hold k neighbors
            if len(neighbors) == k:
                tau = -neighbors[0][0]

        # a node without mu has no children to search
        if node.mu is None:
            continue

        # check for intersection between q-tau and vp-mu regions
        # and see which branches we absolutely must search;
        # the more promising branch is queued first

        if d < node.mu:
            if d < node.mu + tau:
                nodes_to_visit.append(node.left_child)
            if d >= node.mu - tau:
                nodes_to_visit.append(node.right_child)
        else:
            if d >= node.mu - tau:
                nodes_to_visit.append(node.right_child)
            if d < node.mu + tau:
                nodes_to_visit.append(node.left_child)

    # sorted from closest to farthest neighbor
    return [p for _, _, p in sorted(neighbors, reverse=True)]

Here is the full source code for my python implementation of a vantage point tree.

Similarity search 101 - Part 1 (overview)

2014-03-19T00:00:00-07:00

In my work applying machine learning technologies, finding “similar” items is one of the most common challenges that people come to my team with:

“Can you find me the image that looks like this other image?”
“Which webpage is similar to this webpage?”
“Which song matches this audio clip I’ve recorded?”

To more formally define the similarity task, let’s assume we have a big database of items. Solving this task means that when we are given a new item (let’s call it the query), our algorithm is able to locate within the database the closest or most similar items to the query.

Whether we’re dealing with pictures, audio clips, faces, documents, or DNA sequences, we can approach solving this problem using a common framework which I briefly summarize below.

Feature extraction

First we need to know how to “represent” each item in our data set as a series of numbers. This process is often called “feature extraction” or “feature representation”.

There is a large field of research focused on finding good feature representations for all kinds of data. And while it may not yield the best results in terms of discounting noise in our data, we can take the raw signals themselves and convert them verbatim into our “feature vector” as a first step in building our similarity search engine.

So for example, if our dataset is made up of 16 x 16 grayscale images, we would take the pixel values of each row and store them, one row after another, as an array of 256 numbers.
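In NumPy this conversion is a single reshape. The image below is a synthetic stand-in (an assumption for the example), not real data.

```python
import numpy as np

# stand-in for a 16 x 16 grayscale image with pixel values in [0, 255]
image = np.arange(256, dtype=np.uint8).reshape(16, 16)

# concatenate the rows, one after another, into a single feature vector
feature_vector = image.reshape(-1).astype(np.float32)
```

Converting to a floating point type up front avoids unsigned integer wrap-around when two feature vectors are later subtracted from each other.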

Defining a similarity metric

Once we have a feature vector for every item in our dataset, we need a way of comparing one feature vector to another feature vector. If we view each feature vector as a point in some n-dimensional space, one can use the Euclidean distance between two points $p$ and $q$ as a similarity metric:

$distance = \sqrt{\sum_{i=1}^n (q_i-p_i)^2}$

Brute force similarity search

With the similarity metric and a feature extraction routine in place, we pretty much have a complete working similarity search system, albeit an inefficient one. If given a query $q$ , we can find the closest item to $q$ in dataset $D$ using the following routine:

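import math

def compute_distance(p, q):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def find_nearest(q, D):
    nearest_neighbor = None
    min_distance = float('inf')
    for p in D:
        distance = compute_distance(p, q)
        if min_distance > distance:
            min_distance = distance
            nearest_neighbor = p
    return nearest_neighbor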

The approach above scales in $O(m \times n)$ time where $m$ is the number of items in our dataset $D$ and $n$ is the number of dimensions in our feature vector. So even for a modest database size of 1000 items and a feature vector of 256 dimensions, we could be computing up to 256 thousand operations with every single query!
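The same brute force search can be written with NumPy so that the loop over the dataset runs in compiled code. This is only a sketch with a hypothetical function name, `find_nearest_vectorized`; it still performs $O(m \times n)$ work, it just does so with less Python overhead.

```python
import numpy as np

def find_nearest_vectorized(q, D):
    """
    q: query feature vector of shape [n]
    D: dataset of shape [m, n], one feature vector per row
    Returns the index of the closest row and its distance.
    """
    q = np.asarray(q, dtype=np.float64)
    D = np.asarray(D, dtype=np.float64)
    # Euclidean distance from q to every row of D at once
    distances = np.sqrt(((D - q) ** 2).sum(axis=1))
    idx = int(np.argmin(distances))
    return idx, float(distances[idx])

D = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
idx, dist = find_nearest_vectorized([0.9, 1.2], D)
```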

To be continued …

In the next part of this series, we’ll take a look at using vantage point trees to see how we can more efficiently store our data set so that it’s less computationally intensive to search over.

continue to part 2 →

Faster Numpy Dot Product for Multi-dimensional Arrays

2013-12-14T00:00:00-08:00

When installed and linked correctly against a blas implementation like ATLAS or OpenBlas, numpy’s dot product can run incredibly fast. What I hadn’t known until recently is that for some special cases, you can perform dot products an order of magnitude faster just by calling the underlying blas routines directly.

In order to do so, I’ve written a module called fastdot, whose source code is located in a github gist.

The code, in a nutshell, is a wrapper around BLAS’s general matrix-matrix multiplication routines (a.k.a. “gemm”). Its main job is to ensure that the arrays passed into it are in Fortran-contiguous order before handing off the bulk of the work to the underlying BLAS routines.

Performance comparison between dot product implementations

I ran a few benchmarks and plotted below the speed differences between fastdot.dot and np.dot operating on matrices of varying dimensions and sizes.

A shape            B shape     array dims    function       time (s)
-----------------  ----------  ------------  -----------  ----------
(8, 55, 55, 1024)  (1024, 92)  4d * 2d       np.dot           2.0168
(440, 55, 1024)    (1024, 92)  3d * 2d       np.dot           1.9883
(24200, 1024)      (1024, 92)  2d * 2d       np.dot           0.0944

(8, 55, 55, 1024)  (1024, 92)  4d * 2d       fastdot.dot      0.1187
(440, 55, 1024)    (1024, 92)  3d * 2d       fastdot.dot      0.0940
(24200, 1024)      (1024, 92)  2d * 2d       fastdot.dot      0.1342

As you can see, fastdot.dot runs roughly 20x faster than np.dot when one of your matrices has 3 or more dimensions. But before you start using it everywhere that you need a dot product, do take notice that it runs slower than numpy’s implementation when both matrices have 2 dimensions (0.1342 s versus 0.0944 s in the table above).

As a rule of thumb, use fastdot when you know ahead of time that at least one of your matrices will have 3 or more dimensions. It could increase your program’s performance by as much as 20x. Otherwise stick with numpy’s implementation.
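The benchmark also hints at where the time goes: the 2d * 2d case is already fast in np.dot. The sketch below is not fastdot’s code, just a NumPy-only illustration of a related idea under the assumption that B is 2-dimensional: collapse the leading axes of A into one, do a single 2d * 2d product, then restore the shape. The helper name dot_via_2d is hypothetical, and whether it helps on a given NumPy and BLAS build is something to measure rather than assume.

```python
import numpy as np


def dot_via_2d(A, B):
    """Compute np.dot(A, B) for an N-d A and a 2-d B using one 2-d product."""
    # Collapse every axis of A except the last into a single axis.
    A2 = A.reshape(-1, A.shape[-1])
    # One 2d * 2d product, then restore A's leading axes.
    return np.dot(A2, B).reshape(A.shape[:-1] + (B.shape[-1],))


rng = np.random.RandomState(0)
A = rng.rand(4, 5, 6, 8)
B = rng.rand(8, 3)
result = dot_via_2d(A, B)

print(result.shape)                       # (4, 5, 6, 3)
print(np.allclose(result, np.dot(A, B)))  # True
```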

The JSON Streaming Record (JSRec) data format

2013-10-30T00:00:00-07:00

I’m inventing a new data format (as blasphemous as it sounds). It’s flexible, human readable, easy to produce, and best of all, nearly impossible to screw up parsing.

I’m doing it to replace CSV files because you shouldn’t have to worry about quoting, escaping, or deciding whether your “comma” separated values turn out to really be semicolon or, even worse, tab delimited.

The new data format is called json streaming record or JSRec for short. While I say it’s “new”, I’m sure many of you have either produced or consumed this type of data already at some point in your career.

Here’s how it’s defined:

Files of this format have .jsrec as their file extension
Each line in the file is a json hash map
Empty lines and lines beginning with ‘#’ are considered comments and ignored during parsing

Here’s an example file foobar.jsrec

{"foo":1, "bar": "marry"}
{"foo":11, "bar": "had a"}
{"foo":21, "bar": "little lamb"}

# some comments
{"foo":33, "bar": "more data"}

Here’s the code to parse and encode this data format:

import json

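def load_jsrec(filename):
    """loads a .jsrec file, yielding one record per data line"""
    # the with block closes the file once the generator finishes or is closed
    with open(filename) as fh:
        for line in fh:
            line = line.strip()
            # skip empty lines and comments
            if line == "" or line.startswith("#"):
                continue
            yield json.loads(line)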

def dump_jsrec(filename, records):
    """writes a .jsrec file"""
    with open(filename, "w") as fh:
        for rec in records:
            fh.write(json.dumps(rec) +"\n")

A new use case: building data processing pipelines with JSRec

Because each line in a JSRec file contains all the information necessary to parse a record, you can use it to pipe output from one program to another:

cat foobar.jsrec | progA | progB

As long as the programs you’re using understand JSRec, you can start chaining them together. This is HUGE because it makes building data processing pipelines on the command line a modular and simple task.
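To make this concrete, here is a rough sketch of what a stage like progA could look like in Python. The function name jsrec_filter and the transform callback are assumptions made for illustration; a real command-line tool would pass sys.stdin and sys.stdout instead of the in-memory streams used here.

```python
import io
import json


def jsrec_filter(in_stream, out_stream, transform):
    """Read JSRec lines, apply transform to each record, write JSRec lines.

    transform returns a new record, or None to drop the record.
    """
    for line in in_stream:
        line = line.strip()
        # Skip empty lines and comments, as the format definition requires.
        if line == "" or line.startswith("#"):
            continue
        record = transform(json.loads(line))
        if record is not None:
            out_stream.write(json.dumps(record) + "\n")


def keep_large_foo(record):
    """Hypothetical stage: keep records whose "foo" value exceeds 10."""
    return record if record["foo"] > 10 else None


source = io.StringIO('{"foo": 1, "bar": "a"}\n# a comment\n{"foo": 11, "bar": "b"}\n')
sink = io.StringIO()
jsrec_filter(source, sink, keep_large_foo)
print(sink.getvalue(), end="")  # {"foo": 11, "bar": "b"}
```

Because every stage reads and writes one record per line, stages like this can be chained without any of them needing to buffer the whole stream.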

When to use it

Json Streaming Record is an ideal replacement for CSV files. Use it when you want a data format that can store “streams” of records that are human readable yet easily parsed by a machine.

With each line of the format being a completely self-contained JSON object, JSRec allows you to produce and consume data in an incremental fashion. I encourage you to start using it as a data format to pass around on the commandline for when you’re building those data processing pipelines.

A Guide to Analyzing Python Performance

2013-09-03T00:00:00-07:00

Introduction
Coarse grain timing with time
Fine grain timing with a timing context manager
Line-by-line timing and execution frequency with a profiler
How much memory does it use?
IPython shortcuts for line_profiler and memory_profiler
Where’s the memory leak?
Effort vs precision
References

Introduction

While it’s not always the case that every Python program you write will require a rigorous performance analysis, it is reassuring to know that there are a wide variety of tools in Python’s ecosystem that one can turn to when the time arises.

Analyzing a program’s performance boils down to answering 4 basic questions:

How fast is it running?
Where are the speed bottlenecks?
How much memory is it using?
Where is memory leaking?

Below, we’ll dive into the details of answering these questions using some awesome tools.

Coarse grain timing with time

Let’s begin by using a quick and dirty method of timing our code: the good old unix utility time.

$ time python yourprogram.py

real    0m1.028s
user    0m0.001s
sys     0m0.003s

The meanings of the three output measurements are detailed in this Stack Overflow article, but in short:

real - refers to the actual elapsed time
user - refers to the amount of cpu time spent outside of the kernel
sys - refers to the amount of cpu time spent inside kernel-specific functions

You can get a sense of how many cpu cycles your program used up regardless of other programs running on the system by adding together the sys and user times.

If the sum of sys and user times is much less than real time, then you can guess that most of your program’s performance issues are related to IO waits.
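The same comparison can be made from inside Python. The sketch below assumes Python 3.3 or later, where time.perf_counter() measures elapsed wall-clock time and time.process_time() measures the user plus sys CPU time of the current process. A sleep stands in for an IO wait.

```python
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(0.2)  # stand-in for waiting on IO; uses almost no CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# Wall-clock time is dominated by the wait; CPU time stays close to zero.
print("real:       %.3f s" % wall_elapsed)
print("user + sys: %.3f s" % cpu_elapsed)
```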

Fine grain timing with a timing context manager

Our next technique involves direct instrumentation of the code to get access to finer grain timing information. Here’s a small snippet I’ve found invaluable for making ad-hoc timing measurements:

timer.py

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print('elapsed time: %f ms' % self.msecs)

In order to use it, wrap blocks of code that you want to time with Python’s with keyword and this Timer context manager. It will take care of starting the timer when your code block begins execution and stopping the timer when your code block ends.

Here’s an example use of the snippet:

from timer import Timer
from redis import Redis
rdb = Redis()

with Timer() as t:
    rdb.lpush("foo", "bar")
print("=> elapsed lpush: %s s" % t.secs)

with Timer() as t:
    rdb.lpop("foo")
print("=> elapsed lpop: %s s" % t.secs)

I’ll often log the outputs of these timers to a file in order to see how my program’s performance evolves over time.

Line-by-line timing and execution frequency with a profiler

Robert Kern has a nice project called line_profiler which I often use to see how fast and how often each line of code is running in my scripts.

To use it, you’ll need to install the python package via pip:

$ pip install line_profiler

Once installed you’ll have access to a new module called “line_profiler” as well as an executable script “kernprof.py”.

To use this tool, first modify your source code by decorating the function you want to measure with the @profile decorator. Don’t worry, you don’t have to import anything in order to use this decorator. The kernprof.py script automatically injects it into your script’s runtime during execution.

primes.py

@profile
def primes(n):
    if n == 2:
        return [2]
    elif n < 2:
        return []
    s = list(range(3, n+1, 2))  # a list, so entries can be zeroed out
    mroot = n ** 0.5
    half = (n+1)//2 - 1
    i = 0
    m = 3
    while m <= mroot:
        if s[i]:
            j = (m*m - 3)//2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2*i + 3
    return [2] + [x for x in s if x]

primes(100)

Once you’ve gotten your code setup with the @profile decorator, use kernprof.py to run your script.

$ kernprof.py -l -v primes.py

The -l option tells kernprof to inject the @profile decorator into your script’s builtins, and -v tells kernprof to display timing information once your script finishes. Here’s what the output should look like for the above script:

Wrote profile results to primes.py.lprof
Timer unit: 1e-06 s

File: primes.py
Function: primes at line 2
Total time: 0.00019 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
                                         @profile
                                         def primes(n):
       1            2      2.0      1.1      if n==2:
                                                 return [2]
       1            1      1.0      0.5      elif n<2:
                                                 return []
       1            4      4.0      2.1      s=range(3,n+1,2)
       1           10     10.0      5.3      mroot = n ** 0.5
       1            2      2.0      1.1      half=(n+1)/2-1
       1            1      1.0      0.5      i=0
       1            1      1.0      0.5      m=3
       5            7      1.4      3.7      while m <= mroot:
       4            4      1.0      2.1          if s[i]:
       3            4      1.3      2.1              j=(m*m-3)/2
       3            4      1.3      2.1              s[j]=0
      31           31      1.0     16.3              while j<half:
      28           28      1.0     14.7                  s[j]=0
      28           29      1.0     15.3                  j+=m
       4            4      1.0      2.1          i=i+1
       4            4      1.0      2.1          m=2*i+3
      50           54      1.1     28.4      return [2]+[x for x in s if x]

Look for lines with a high number of hits or a high time interval. These are the areas where optimizations can yield the greatest improvements.
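One side effect of kernprof injecting @profile at run time is that running the decorated script directly with python raises a NameError. A common workaround, sketched here for Python 3, is to fall back to a do-nothing decorator when profile has not been injected. The square function is only a placeholder.

```python
import builtins

# kernprof -l adds `profile` to builtins; define a no-op fallback otherwise.
if not hasattr(builtins, "profile"):
    def profile(func):
        """Do-nothing stand-in used when the script runs without kernprof."""
        return func


@profile
def square(x):
    return x * x


print(square(3))  # 9
```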

How much memory does it use?

Now that we have a good grasp on timing our code, let’s move on to figuring out how much memory our programs are using. Fortunately for us, Fabian Pedregosa has implemented a nice memory profiler modeled after Robert Kern’s line_profiler.

First install it via pip:

$ pip install -U memory_profiler
$ pip install psutil

(Installing the psutil package here is recommended because it greatly improves the performance of the memory_profiler).

Like line_profiler, memory_profiler requires that you decorate your function of interest with an @profile decorator like so:

@profile
def primes(n):
    ...
    ...

To see how much memory your function uses run the following:

$ python -m memory_profiler primes.py

You should see output that looks like this once your program exits:

Filename: primes.py

Line #    Mem usage  Increment   Line Contents
==============================================
                         @profile
  7.9219 MB  0.0000 MB   def primes(n):
  7.9219 MB  0.0000 MB       if n==2:
                                 return [2]
  7.9219 MB  0.0000 MB       elif n<2:
                                 return []
  7.9219 MB  0.0000 MB       s=range(3,n+1,2)
  7.9258 MB  0.0039 MB       mroot = n ** 0.5
  7.9258 MB  0.0000 MB       half=(n+1)/2-1
  7.9258 MB  0.0000 MB       i=0
  7.9258 MB  0.0000 MB       m=3
  7.9297 MB  0.0039 MB       while m <= mroot:
  7.9297 MB  0.0000 MB           if s[i]:
  7.9297 MB  0.0000 MB               j=(m*m-3)/2
  7.9258 MB -0.0039 MB               s[j]=0
  7.9297 MB  0.0039 MB               while j<half:
  7.9297 MB  0.0000 MB                   s[j]=0
  7.9297 MB  0.0000 MB                   j+=m
  7.9297 MB  0.0000 MB           i=i+1
  7.9297 MB  0.0000 MB           m=2*i+3
  7.9297 MB  0.0000 MB       return [2]+[x for x in s if x]

IPython shortcuts for line_profiler and memory_profiler

A little known feature of line_profiler and memory_profiler is that both programs have shortcut commands accessible from within IPython. All you have to do is type the following within an IPython session:

%load_ext memory_profiler
%load_ext line_profiler

Upon doing so you’ll have access to the magic commands %lprun and %mprun which behave similarly to their command-line counterparts. The major difference here is that you won’t need to decorate your to-be-profiled functions with the @profile decorator. Just go ahead and run the profiling directly within your IPython session like so:

In [1]: from primes import primes
In [2]: %mprun -f primes primes(1000)
In [3]: %lprun -f primes primes(1000)

This can save you a lot of time and effort since none of your source code needs to be modified in order to use these profiling commands.

Where’s the memory leak?

The CPython interpreter uses reference counting as its main method of keeping track of memory. This means that every object contains a counter, which is incremented when a reference to the object is stored somewhere, and decremented when a reference to it is deleted. When the counter reaches zero, the CPython interpreter knows that the object is no longer in use so it deletes the object and deallocates the occupied memory.

A memory leak can often occur in your program if references to objects are held even though the object is no longer in use.
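The counter can be observed directly on CPython with sys.getrefcount(). In this small sketch the Payload class and the cache list are placeholders; the absolute numbers include a temporary reference held by the getrefcount() call itself, so only the differences are meaningful.

```python
import sys


class Payload(object):
    """Placeholder object whose reference count we watch."""


obj = Payload()
baseline = sys.getrefcount(obj)

cache = [obj]                      # storing a reference increments the counter
with_cache = sys.getrefcount(obj)

del cache                          # deleting the reference decrements it again
after_delete = sys.getrefcount(obj)

print(with_cache - baseline)    # 1
print(after_delete - baseline)  # 0
```

A forgotten container like cache that is never deleted is exactly the kind of lingering reference that keeps an object alive.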

The quickest way to find these “memory leaks” is to use an awesome tool called objgraph written by Marius Gedminas. This tool allows you to see the number of objects in memory and also locate all the different places in your code that hold references to these objects.

To get started, first install objgraph:

pip install objgraph

Once you have this tool installed, insert into your code a statement to invoke the debugger:

import pdb; pdb.set_trace()

Which objects are the most common?

At run time, you can inspect the most prevalent object types in your program by running:

(pdb) import objgraph
(pdb) objgraph.show_most_common_types()

MyBigFatObject             20000
tuple                      16938
function                   4310
dict                       2790
wrapper_descriptor         1181
builtin_function_or_method 934
weakref                    764
list                       634
method_descriptor          507
getset_descriptor          451
type                       439

Which objects have been added or deleted?

We can also see which objects have been added or deleted between two points in time:

(pdb) import objgraph
(pdb) objgraph.show_growth()
.
.
.
(pdb) objgraph.show_growth()   # this only shows objects that have been added or deleted since the last show_growth() call

traceback                4        +2
KeyboardInterrupt        1        +1
frame                   24        +1
list                   667        +1
tuple                16969        +1

What is referencing this leaky object?

Continuing down this route, we can also see where references to any given object are being held. Let’s take as an example the simple program below:

x = [1]
y = [x, [x], {"a":x}]
import pdb; pdb.set_trace()

To see what is holding a reference to the variable x, run the objgraph.show_backrefs() function:

(pdb) import objgraph
(pdb) objgraph.show_backrefs([x], filename="/tmp/backrefs.png")

The output of that command should be a PNG image stored at /tmp/backrefs.png and it should look something like this:

The box at the bottom with red lettering is our object of interest. We can see that it’s referenced by the symbol x once and by the list y three times. If x is the object causing a memory leak, we can use this method to see why it’s not automatically being deallocated by tracking down all of its references.

So to review, objgraph allows us to:

show the top N objects occupying our python program’s memory
show what objects have been deleted or added over a period of time
show all references to a given object in our script

Effort vs precision

In this post, I’ve shown you how to use several tools to analyze a python program’s performance. Armed with these tools and techniques you should have all the information required to track down most memory leaks as well as identify speed bottlenecks in a Python program.

As with many other topics, running a performance analysis means balancing the tradeoffs between effort and precision. When in doubt, implement the simplest solution that will suit your current needs.

References

My Tmux Setup

2013-08-21T00:00:00-07:00

If you do a lot of context switching between projects, recreating your terminal environment & window layout can easily eat up hours of your day. Here’s a quick tip to help you create and manage persistent terminal workspaces so that with a few keystrokes, you can jump straight back into whatever you were working on as quickly as possible.

How it works

The whole idea behind this setup is to give you 1) the ability to name your terminal workspaces with easily memorable names and 2) the ability to keep persistent terminal workspaces running even if you’ve closed your terminal window.

All of this is achieved with the following shell script which we’ll configure iTerm2 to execute anytime you open a new window.

#!/bin/bash
# bash (not plain sh) is needed for the arrays and select menu used below
export PATH=$PATH:/usr/local/bin

# abort if we're already inside a TMUX session
[ -z "$TMUX" ] || exit 0

# startup a "default" session if none currently exists
tmux has-session -t _default || tmux new-session -s _default -d

# present menu for user to choose which workspace to open
PS3="Please choose your session: "
options=($(tmux list-sessions -F "#S") "NEW SESSION" "BASH")
echo "Available sessions"
echo "------------------"
echo " "
select opt in "${options[@]}"
do
    case $opt in
        "NEW SESSION")
            read -p "Enter new session name: " SESSION_NAME
            tmux new -s "$SESSION_NAME"
            break
            ;;
        "BASH")
            bash --login
            break;;
        *)
            tmux attach-session -t "$opt"
            break
            ;;
    esac
done

Setting it up

First off, place the above script in a location that’s accessible to iTerm2 (I usually place it in ~/.dotfiles/tmux.start.sh).

Then open up iTerm2’s terminal preferences and have it execute this script anytime you open a new window:

Usage

Once you’ve done all the above, every time you open a new window, you’ll be prompted to choose which previous workspace you want to join.

You’ll also have the opportunity to create new workspaces by choosing NEW SESSION. Or if you don’t want to open a full-blown tmux session, you can choose to just open up a BASH prompt. All of these sessions are persistent. So if you decide to close your terminal window, they will remain active in the background and ready for you to rejoin at a later point in time.

GGPlot2 theme for Matplotlib

2011-02-08T00:00:00-08:00

John Hunter, creator of Matplotlib, originally designed its color scheme to be familiar to MATLAB users. As it turns out, the color scheme works well for publication material but doesn't work as well for viewing visualizations on the web.

I find the default styling for graphs produced using ggplot2 aesthetically pleasing for this purpose, so I spent some time over the weekend to refine the default colors and settings for my matplotlib installation. The result of this work is embodied in this .matplotlibrc color theme file. If you want graphs that look like the ones below by default, download it and place the file under ~/.matplotlib/matplotlibrc.

Don't Hash Your Secrets

2010-02-01T00:00:00-08:00

Ben Adida suggests that you don't hash your secrets.

That means that if you know SHA1(secret || message), then you can compute SHA1(secret || message || ANYTHING), which is a valid signature for message || ANYTHING. So to break this system, you just need to see one signature.

Not being a cryptography expert, I was blown away by his article. At the core of his post is the idea that given a hash digest of a message, one could compute the hash of message + appended_message without even knowing the original message.

I had to see this for myself. Was it that easy to extend an MD5 or SHA1 hash? Below, you'll find working python code and an explanation for spoofing signatures signed with the MD5 algorithm.

Implementation

To generate a hash from a message, algorithms like MD5 and SHA1 iterate through the message block by block. For each block, the algorithm runs a transformation function where the input is a seed state and a message block. The output of this transformation is then fed back as the seed state for the transformation of the next message block (see the above diagram).

After the hashing function has digested the entire message, it then appends some padding and runs the transformation function one more time. The final state of this transformation becomes the digest.
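This block-by-block behavior is visible in Python's hashlib API: feeding a message to update() in pieces gives the same digest as hashing it in one go, because the internal state carries over between calls. A small sketch, assuming Python 3 (hence the bytes literals):

```python
from hashlib import md5

# Hash the whole message at once.
one_shot = md5(b"secret" + b"hello world").hexdigest()

# Hash the same message piece by piece; the running state is kept inside `h`.
h = md5()
h.update(b"secret")
h.update(b"hello world")
incremental = h.hexdigest()

print(one_shot == incremental)  # True
```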

from hashlib import md5
signature = md5("secret" + "hello world").digest()

print(repr(signature))
"O'Q\xa8\xb8\x9d\x81%\xd7\x13'\xe0\xfb_2\xde"

In the code above, the signature represents the state output of the final transformation function.

AHA! We now have a strategy to extend the hash. If we can seed the transformation function with the state (a.k.a. the signature) of the original message, we can essentially extend the hash without even knowing the original message.

There is one problem, however. I mentioned before that the MD5 algorithm adds a piece of padding to the original message before it gives us the hash. That means whenever we see a signature, it's really the hash of the message + padding. Fortunately, the padding depends only on the length of the original message. With that in mind, we can easily generate both the new signature and padding. Here's some pseudocode:

state = decode(signature)
padding = calculate_padding(original_message_len)
new_signature = transform(state, "appended message")

# This should be True
new_signature == md5(original_message + padding + "appended message").digest()
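For reference, the calculate_padding step can be sketched on its own. This follows the padding rule from the MD5 specification (a 0x80 byte, zero bytes until the length is 56 modulo 64, then the original length in bits as a 64-bit little-endian integer). It assumes Python 3 and is independent of the pure-python md5 module that the full implementation relies on.

```python
import struct


def calculate_padding(original_message_len):
    """Return the bytes MD5 appends to a message of the given length (in bytes)."""
    # A single 0x80 byte, then zeros until the length is 56 modulo 64.
    zeros = (55 - original_message_len) % 64
    # The original length in bits, as a 64-bit little-endian integer.
    bit_length = struct.pack("<Q", (original_message_len * 8) & 0xFFFFFFFFFFFFFFFF)
    return b"\x80" + b"\x00" * zeros + bit_length


message_len = len(b"secret" + b"hello world")   # 17 bytes
padding = calculate_padding(message_len)

print(len(padding))                       # 47
print((message_len + len(padding)) % 64)  # 0
```

Note that the attacker only needs the length of the secret plus message, not its contents, to compute these bytes.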

Now here is the real code.

def spoof_digest(originalDigest, originalLen, spoofMessage=""):
    # first decode digest back into state tuples
    state = Decode(originalDigest, 16)

    # generate a seed md5 object
    spoof = md5()

    # seed the count variable for calculation of index, padLen, and bits
    spoof.count += originalLen << 3

    # calculate some variables to generate the original padding
    index = int((spoof.count >> 3) & 0x3f)
    padLen = (56 - index) if index < 56 else (120 - index)
    bits = Encode((spoof.count & 0xffffffffL, spoof.count>>32), 8)

    # construct the original padding
    padding = PADDING[:padLen]

    # augment the count with the new padding and trailing bits
    spoof.count += len(padding) << 3
    spoof.count += len(bits) << 3
    spoof.state = state

    # run an update
    spoof.update(spoofMessage)

    # We now have a digest of the original secret + message + some_padding
    return (spoof.digest(), padding + bits)

The code depends on a pure-python implementation of the md5 algorithm that I've packaged together with the source code. If you want to try it out, download the file and run this test function (also included in the file):

def test_spoofing():
    originalMsg = "secret" + "my message"
    appendedMsg = "my message extension"

    # This is the signature that a legitimate user sends
    # over the wire in clear text. 
    originalSignature = md5(originalMsg).digest()

    # This is how an attacker would spoof the signature where,
    # the message ==  originalMsg + padbits + appendedMsg .
    # Notice that this method implies that the attacker
    # knows the original length of the "secret" ... 
    # Most apis such as Flickr assign secrets that are of
    # uniform length for all of their api users.
    spoofSignature, padbits = spoof_digest(originalSignature, len(originalMsg), appendedMsg)

    # This is how a legitimate user would construct
    # a signature when message == originalMsg + padbits + appendedMsg
    testSignature = md5(originalMsg + padbits + appendedMsg).digest()

    # make sure the spoof signature and the test signature match.
    # if this passes, we've successfully constructed a spoofed message
    # of the form: secret + original_message + padding + appended_message
    # without actually knowing the secret.
    assert testSignature == spoofSignature

Information in this blog is meant for educational purposes only!

Bashmarks: Directory Bookmarks in the Shell

2009-09-10T00:00:00-07:00

EDIT 2010-07-01 : This post is left up for context / historical posterity. I've packaged up a shell script to allow you to save and jump to commonly used directories. It's called bashmarks and it has tab completion functionality built-in. Visit the link below to learn more.

Bashmarks

Do not use the stuff below

Before I wrote this script, it felt like I spent half of my time in the terminal cd-ing around to various directories. If you're like me, placing this snippet into your .bashrc file will save you tons of time every single day:

# Bash Directory Bookmarks
alias m1='alias g1="cd `pwd`"'
alias m2='alias g2="cd `pwd`"'
alias m3='alias g3="cd `pwd`"'
alias m4='alias g4="cd `pwd`"'
alias m5='alias g5="cd `pwd`"'
alias m6='alias g6="cd `pwd`"'
alias m7='alias g7="cd `pwd`"'
alias m8='alias g8="cd `pwd`"'
alias m9='alias g9="cd `pwd`"'
alias mdump='alias|grep -e "alias g[0-9]"|grep -v "alias m" > ~/.bookmarks'
alias lma='alias | grep -e "alias g[0-9]"|grep -v "alias m"|sed "s/alias //"'
touch ~/.bookmarks
source ~/.bookmarks

Directory Bookmark Usage

With this in place, your bash shell will have the ability to set and retrieve directory bookmarks. Let's say you're in a folder that you visit hundreds of times per day. Run one of the "m" (a.k.a. mark) commands inside the directory to create a bookmark. Here's an example:

# This will create a bookmark for the /var/www directory
user@host[/var/www/] : m1

Now whenever you want to cd into that directory, you can run the corresponding "g" (a.k.a goto mark) command.

# This will cd into /var/www
user@host[/etc/apache2] : g1

In other words, the m1 command will set the g1 bookmark, the m2 command will set the g2 bookmark, and so on. If you don't want to keep track of these bookmarks in your head, you'll be glad to hear that the "lma" (a.k.a. "list marks") command can show you all of your current bookmarks like so:

user@host[/usr/local/] : lma
g1='cd /var/www/'
g2='cd /etc/'

Persisting the Bookmarks

If you want to preserve your bookmarks for the next time you log in, execute the mdump command which will store the bookmarks into a file called .bookmarks under your HOME directory. Keep in mind that if you do not run this command your bookmarks will be forgotten once you log out of the shell.

Toward an Ergonomic Vim Setup

2009-07-21T00:00:00-07:00

Here’s a brief tip: Rebind your ctrl key to capslock.

Save yourself from the pain. The contorted positions we developers strain our hands into will eventually break them - Emacs & Textmate users, you know what I’m talking about. Do yourself a favor and rebind your control key to capslock. You can do this under MacOSX by going to System Preferences > Keyboard & Mouse > Keyboard > Modifier Keys. Change the settings to look like the following and you’re set.