New York Times interactive for 2013

All of last year’s New York Times interactive pieces are in one place – here.

It’s often hard to search for interactive pieces or data visualizations, because the keywords aren’t what you remember most.  So, it’s nice having them all in one place to find older ones.  Moreover, it’s a great way to discover fantastic pieces that you might have missed, or that you want to be inspired by.

Thanks to the NY Times team for all of their great work!

Posted in Uncategorized

Made a responsive website!

I made a new website!  It gives advice for traveling to New Zealand, based on my trip there a few months ago*.  But the main goal wasn’t the content; rather, it was to get my hands dirty with web development from start to finish.  I’d previously made an interactive analysis tool using javascript/D3, which I launched on Google’s App Engine.  In that project, the focus was primarily on the interactive data visualization.  However, it was clear that a stronger understanding of how to use html/css could put a good visualization in the context of the larger page and make the tool more usable.

Therefore, my goal was to build a “normal” website in which the html/css is front and center.  In this case, I also wanted it focused more on a user reading for content than on something that an analyst might come back to day after day.  And, while creating the site would be nice, this was mostly meant to be a learning exercise.  So, I often took time to dig into why something worked rather than just using something that seemed to work.

One of my favorite things about the web is that you can right-click on a webpage, choose “view source,” and see the hundreds (or thousands) of lines of code directly.  In some ways this seems really powerful!  You can see everything that makes the website work!  But, what does it all do?  And, how?  How do the building blocks fit together into something comprehensible?  Something beautiful?  Something useful?

And, even seeing it all, it still doesn’t tell you how the programmer created it.  Did they write every line from scratch?  For most websites this seems unlikely, given the repetition required by such a task and the difficulty of keeping track as it gets increasingly complex.  But, what else could it be?

Typically I like to create things from the basics.  I like the idea of understanding a language from the ground up, rather than borrowing a solution that I don’t understand.

However, for this project, I ended up choosing to use a template based on Bootstrap here.  Why?  I wanted to see how a full site fit together.  What gave it the sense of a whole?  Of the grid?  Of coherency?  Of responsiveness?  What does the code look like all together?

My intent was to adapt the template in noticeable ways: not just the content, but also the layout, colors, and other visual elements.  Figuring out how to change the color of a certain section, or to remove parts of the code that weren’t necessary, helped me to understand what each part of the code did.  Seeing what they were doing under the hood also helped me figure out which questions to ask.  I also spent a lot of time investigating parts that I didn’t recognize.

Through this project I learned:

  • Bootstrap may be one reason that so many websites have a similar “look” these days
    • also – something that looks fancy might not be that hard to implement – don’t be intimidated 🙂
  • Sass is awesome in making CSS manageable – I didn’t actually use it on this project because there was already so much CSS, but I will definitely use it on the next project when I build more from scratch or when I’m using CSS w/ D3
  • responsive is important, and doable
    • also – the Firefox web developer tools have a great responsive setting for testing what a site will look like under different conditions
    • @media makes this possible
  • new html5 “section” element is useful for organizing multiple parts of a site if it’s mostly one long page
  • a 12-column grid layout is great – very helpful for creating a sense of coherent, consistent structure as well as enabling responsive design
  • GitHub Pages are perfect for my use case of wanting to launch the site to the world, but still mostly be in a learning/playing mode – thanks to Brandon for suggesting it!
  • using very large photographs can make the website super sluggish – resize to something appropriate

Overall, one of the things I’m most excited about is actually taking the project through to the point of having it out on the web for others to see*.  Often those questions about how different pieces connect (aka – now that I’ve got the code, where do I put it so others can see the site?) are the hardest, because they’re not neatly in a sandbox.  The second thing I’m excited about is realizing that just because something looks pretty doesn’t mean that it is somehow magical (or super difficult).  I’m also very glad to start getting a sense for some of the tools that people use to get from the fundamentals of the code to a complete site with hundreds of lines.  Things like Sass make a lot of sense.  But there is no way to tell that it’s being used just by looking at the source code, since Sass processes the CSS before it’s published.

This is certainly just an early step.  But, it makes me excited for what’s to come :).

Lastly, let me know if you have any questions about New Zealand!  It’s fantastic 🙂


* Note that there are still some parts that are WIP (work in progress).

Posted in Uncategorized

A very short leap from a possible association to causation and questioning policy decisions

Today I came across the article: “High Home Ownership Is Strongly Linked To High Unemployment [STUDY]” from the Business Insider – link.

Seems fascinating, right?  Perhaps we should change public policies encouraging home ownership?

Unfortunately, there seems to be quite a leap from the original research saying “We should be cautious before imputing meaning into such patterns” to the claims in the article.  Moreover, the original research article has basic flaws like conflicting data, sources without citation, and inaccurate descriptions of how the data was collected.  Beyond those basic errors, I’m unconvinced by the study’s central point.  And, even if we take the association to be valid, I am unconvinced that there is an argument for causation, or that the hypothesis can explain the data they present.  All in all, this shows how easy it is to make a very quick leap from a potential association in data, to assuming causation, to questioning policy.

Since the BI article doesn’t cite the original paper, I searched for it and believe the original article is this one as it’s on topic and by the mentioned author.

Although the research is focused on statistical associations and does not (and can not) make any claims of causation, the researcher Oswald clearly believes that there is causation as he is reported in the BI article to have said “I have become convinced that by boosting home ownership we have ruined our labor market.”

Next, the report suggests that it’s the homeowners who are disproportionately unemployed, when the abstract for the research says “Our argument is not that owners themselves are disproportionately unemployed.”  The research article later reports “nor does [the conclusions] rely on the idea that home owners are themselves disproportionately unemployed (there is a considerable literature that suggests such a claim is false, or, at best, weak).”  This is arguably confusing in the original research, though, since 2 of the 3 suggested causes explaining how home ownership rates may increase unemployment rates a few years later are about reasons why a homeowner might be less employed (lack of mobility and long commutes).

Looking to the original research, even upon a relatively quick read of the paper I found several things that made me question their results.  The three most basic are about the data itself:

Critical data sources are not cited, or the citations are incorrect.

For example, Tables 1 and 2 contain the raw data used in the study.  Yet, the citations for these are not included in the text and the source for the tables says only: “Source: US Census Bureau” and “Source(here and in the next table): Current Population Survey.” None of the cited references include the US Census Bureau.

Searching for any links to the US Census Bureau, I found one in footnote #5.  Unfortunately, the posted link is broken and leads to a page saying “We are really sorry but the page you requested cannot be found.”

Data in two different tables, Table 1 and Table 2a, is contradictory. 

For example, both tables include data on the 2000 and 2010 home ownership rates in the United States.  In one table these are given as 66.2% and 65.1% while in the other they are 67.4% and 66.9%.

There is no discussion of why these tables disagree.  Nor do they clearly say which data source was used in their analysis, why the other was included if it was not used (or how they merged the two sources), or why one source was more appropriate for their study than the other.

Sampling description is inaccurate, and sample size seems misleading

The introduction claims that “Using data on two million randomly sampled Americans, we also estimate equations for the number of weeks worked, the extent of labor mobility, the length of commuting times, and the number of businesses.”  Later, they explain in more detail that “Table 7 … estimates a weeks-worked equation using data from the March Current Population Surveys between 1992 and 2011. The sample size is approximately 2 million individuals.”

However, according to the US Census Bureau, “the CPS [Current Population Survey] is administered by the Census Bureau using a probability selected sample of about 60,000 occupied households.”  Each household in the survey is sampled 8 times: “in the survey for 4 consecutive months, out for 8, and then return for another 4 months before leaving the sample permanently.”

In short, the survey:

* samples households, not Americans as claimed

* samples only 60,000 households per monthly data point, not 2 million

* includes households that are probability selected, not random as described.  The method, described here, seems much better than simple random sampling.  But the point is that it is different from what Oswald’s paper reported.

Granted, if there are 60,000 households sampled monthly for 20 years and each household is sampled 8 times, then there are 60,000 households * 12 months * 20 years / 8 repetitions = 1.8M households sampled in total.  1.8M is in the ballpark of the 2 million reported, so perhaps this is how they came up with the 2 million value.  But I would imagine that many people reading the statement “using data on two million randomly sampled Americans” would think that these 2 million were tracked throughout the study, and not just interviewed for 8 months within a 20-year study.
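The back-of-the-envelope arithmetic above can be written out directly (the inputs come from the survey description; this is only a sanity check of where the 2 million figure could come from):

```python
# 60,000 households per monthly sample, 12 months a year for roughly
# 20 years (1992-2011), with each household appearing in 8 monthly samples.
households_per_month = 60_000
monthly_samples = 12 * 20
appearances_per_household = 8

distinct_households = (households_per_month * monthly_samples
                       // appearances_per_household)
print(distinct_households)  # 1800000
```

So ~1.8 million distinct households, which rounds neatly to the “2 million” in the paper, but they are household-level records, not individuals followed for 20 years.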

Citing sources, reporting internally consistent data (or discussing why it’s not consistent and why a certain data set was used), and representing how the data was collected accurately are all absolutely fundamental to research.  Yet, unless I am missing something, this research fails to do these.

More generally, there are a lot of other things that might be correlated with both home ownership and unemployment rates that were not considered.  What are the demographics of the state?  How much does housing cost, and how does this compare to salaries?  Are 20-somethings living with their parents or starting households?  When were most of the houses purchased, and what were the unemployment rates at that time?  In general, sometimes the economy is doing better and other times worse.  If houses are purchased when the economy is stronger, so more people are able to afford a home, then at some point later the economy (and unemployment) are likely to be worse.  Could that explain the lag?  And many more…

Similarly, data on housing and data on unemployment are counting different groups of people.  Unemployment data includes only those people who are part of the labor force, excluding retirees for example.  At the same time, home ownership is counted for heads of households over the age of 25.  And, in 2010, people over the age of 60 had higher rates of home ownership than those under the age of 60 – link.  Furthermore, home ownership is measured by household while unemployment is measured per person in the labor force.  So, when comparing these two metrics, we’re not comparing the same populations.  This, coupled with the fact that there is no evidence that the home-owners themselves are less employed, means that an association between home ownership rates and unemployment rates isn’t about home-owners being less employed, but about some larger systemic relationship between some people being more likely to own homes and others being more likely to be unemployed.

I remain unconvinced by both the original research and in the translation from the research to the article written in Business Insider. And, even if the fundamental association suggested between home ownership and, a few years later, unemployment rate is robust, there is no reason to conclude that higher rates of home ownership cause increased unemployment. Yet, someone reading the article reported in the Business Insider could easily come away wondering if (or believing that) governments should change policies around encouraging home ownership in order to protect against increased unemployment.


Posted in Uncategorized

Examples illustrating axis, gridline, title, and legend formatting in ggplot2 and R

I love the ggplot visualization package in R, but often found myself forgetting the exact syntax for formatting titles, axes, gridlines, etc.  As a reminder to myself and others, I put together a quick set of examples.  These are based on the excellent ggplot2 documentation here.  Rather than being comprehensive, the examples below cover some of the basic formatting I find myself using over and over again in most charts.

The goal in these examples is to be really clear with what text controls what part of the image, rather than be beautiful :).  So, to show what changes between charts, I’ll use the color red to make it obvious.

Note that these examples require the “reshape” and “ggplot2” R packages.

Getting the data ready
Sample data for the demo comes from this famous study.  It gives the number of applicants and the admit rate for males and females to Berkeley’s six largest graduate departments (designated by the letters A-F) in 1973.

CSV of raw data
Department,Male applicants,Male,Female applicants,Female
R code to set up data
# setup required packages
library(reshape)
library(ggplot2)
# read in csv w/ data from Documents folder (which you've copied and saved as berk.csv)
berk <- read.csv('Documents/berk.csv', header = TRUE)
# select only the columns that we'll use for this tutorial
a <- subset(berk, select = c(Department, Male, Female))
# change the 'shape' of the data from 'wide' to 'long' using the melt function
d <- melt(a, id.vars = 'Department')
# rename the columns
names(d) <- c('Department', 'gender', 'admit.rate')
# check the structure of the dataframe
str(d)
Basic plot

# use the ggplot function on the dataframe d, set x to Department, y to admit rate, fill color to gender
# add on the geom_bar to make it a bar chart, and position ‘dodge’ so that the gender columns are side-by-side instead of stacked
# use scale_fill_manual to set the colors to a dark and light blue
ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'))


Add a Title to the chart
Set the title of the plot: + labs(title = "Best Chart Ever")
Color the title red: + theme(plot.title = element_text(colour = 'red'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + labs(title = 'Best Chart Ever') + theme(plot.title = element_text(colour = 'red'))

Gridlines
Set the size and color of the major gridlines: + theme(panel.grid.major = element_line(size = 2, color = 'red'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme(panel.grid.major = element_line(size = 2, color = 'red'))


Axis Lines
Set the size and color of the axis lines: + theme(axis.line = element_line(size = 2, color = 'red'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + theme_bw(base_family = 'Helvetica') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme(axis.line = element_line(size = 2, color = 'red'))



Themes
Themes are actually a big topic, which I’ll address only briefly.  In short, if you have a set of parameters that you want to reuse together, you can define them as a theme and then simply add that theme.

Adding black/white theme: One of my favorites is the very simple theme_bw() which comes with the package, and which sets a black/white theme and removes the grey background. To use the theme, simply add: + theme_bw()

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme_bw()

Note that with ggplot2, the order of the terms matters.  theme_bw() defines the axis lines to be black.  But, as we saw above, we can use theme(axis.line = element_line(size = 2, color = 'red')) to explicitly define our axis lines to be red and size 2.  If we add the theme_bw() term followed by the theme(axis.line = element_line(size = 2, color = 'red')) term, then the red axis lines overwrite the black ones, and we see this:

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme_bw() + theme(axis.line = element_line(size = 2, color = 'red'))


Alternatively, if we add: + theme(axis.line = element_line(size = 2, color = 'red')) + theme_bw(), then the black lines that come with theme_bw() will overwrite the red ones, to give us the following chart.

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme(axis.line = element_line(size = 2, color = 'red')) + theme_bw()


Or, if you want to use theme_bw and change the base font to Helvetica, you can add: + theme_bw(base_family = 'Helvetica')

Curious what’s included in theme_bw()?  If you just type “theme_bw()” into your R console and press enter, you’ll see the full specification.

Using themes to create a “look”: Or, in this blog post, Rosemary Hartman introduces themes in more detail, including defining a “Science theme” to create the look of charts required for publishing in Science.

science_theme = theme(panel.grid.major = element_line(size = 0.5, color = 'grey'),
    axis.line = element_line(size = 0.7, color = 'black'),
    legend.position = c(0.85, 0.7), text = element_text(size = 14))

Once you define a theme, you can add it to your plot:

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + science_theme



Define a Legend: + scale_xxxx_manual(name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + theme_bw(base_family = 'Helvetica') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women'))

“name” sets the name of the legend, “breaks” determines which columns we’ll use, and “labels” determines the visible name of these segments in the legend in the chart.  Note that you can use this with a range of scales. In this case, we’re using it with scale_fill_manual, a command we’ve already been using to determine the color values.  Now we add in the name, breaks, and labels as well.


Building on this, we can further format the text in the legend.  For example…

Setting the format for the title of the legend: + theme(legend.title = element_text(colour = 'red', size = 16, face = 'italic'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women')) + theme(legend.title = element_text(colour = 'red', size = 16, face = 'italic'))

Drawing a red box around the legend with a dotted line and filling the box with a pink background: + theme(legend.background = element_rect(color = 'red', fill = 'pink', size = .5, linetype = 'dotted'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women')) + theme(legend.background = element_rect(color = 'red', fill = 'pink', size = .5, linetype = 'dotted'))

Move the legend box to the top [or bottom, or right]: + theme(legend.position = 'top')

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women')) + theme(legend.background = element_rect(color = 'red', fill = 'pink', size = .5, linetype = 'dotted')) + theme(legend.position = 'top')
ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women')) + theme(legend.background = element_rect(color = 'red', fill = 'pink', size = .5, linetype = 'dotted')) + theme(legend.position = 'bottom')

Move the legend box inside the chart: + theme(legend.position = c(.75, .75))

0.75, 0.75 refers to the % of the chart from the left and from the bottom.

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594'), name = 'Applicants\nBy Gender', breaks = c('Male', 'Female'), labels = c('Men', 'Women')) + theme(legend.background = element_rect(color = 'red', fill = 'pink', size = .5, linetype = 'dotted')) + theme(legend.position = c(.75, .75))

Order of elements on X-axis

Reorder the elements on the x-axis based on their name: + scale_x_discrete(limits = c('B', 'A', 'C', 'D', 'E', 'F'))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + scale_x_discrete(limits = c('B', 'A', 'C', 'D', 'E', 'F'))

Reverse the order: + scale_x_discrete(limits = rev(levels(factor(d$Department))))

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + scale_x_discrete(limits = rev(levels(factor(d$Department))))


Change the names of the x-axis categories: + scale_x_discrete(labels = c('basket weaving', 'knitting', 'cooking', 'cartooning', 'satire', 'love'))

Note that you may combine this with breaks = c() in order to fully specify the mapping from categories to labels and ensure correct labeling.  Example: + scale_x_discrete(labels = c('basket weaving', 'knitting', 'cooking', 'cartooning', 'satire', 'love'), breaks = c('A', 'B', 'C', 'D', 'E', 'F')).  As before, theme(axis.text.x = …) can be combined with this to color the text red and change the size to highlight it.

ggplot(d, aes(x = Department, y = admit.rate, fill = gender)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_manual(values = c('#6baed6', '#084594')) + theme(axis.text.x = element_text(colour = 'red', size = 16)) + scale_x_discrete(labels = c('basket weaving', 'knitting', 'cooking', 'cartooning', 'satire', 'love'))
A bit more to come… but that’s all for now!
Posted in Uncategorized

Good news: InfoVis and Simpson’s Paradox!

I’m on vacation in the Stockholm archipelago, so I’ve mostly been thinking about swimming, sailing, running, and blueberries this week, and I will keep this post fairly short.  But I do have some exciting news to report!

Visualizing Statistical Mix Effects and Simpson‘s Paradox, the paper Martin Wattenberg and I wrote based on my work with Google’s Big Picture data viz team, will be published at InfoVis this November!  As a teaser, here is the abstract:


We discuss how “mix effects” can surprise users of visualizations and potentially lead them to incorrect conclusions. This statistical issue (also known as omitted variable bias, or in extreme cases as Simpson’s Paradox) is widespread and can affect any visualization in which the quantity of interest is an aggregated value such as a weighted sum or average. Our first contribution is to document how mix effects can be a serious issue for visualizations, and we analyze how mix effects can cause problems in a variety of popular visualization techniques, from bar charts to treemaps. Our second contribution is a new technique, the “comet chart,” that is meant to ameliorate some of these issues.

I attacked this problem because, as an analyst at Google, I found that understanding mix effects was really, really important.  Understanding mix was the crux of answering key questions, and I needed a better way to identify what was going on under the surface when there was a change in some important top-line metric.  So, during the 6 months I spent in Cambridge, MA mentored by Google’s (amazing!) Big Picture data visualization research team, I designed and implemented an approach and tool to visualize these effects.

As if to reinforce the importance of understanding this, Simpson’s Paradox was mentioned in this week’s Q2 Google earnings call.  An investor asked for “a clarification on the cost per click breakout from the report, why aggregated cost per click year-over-year growth is actually better than individually the Google sites and the network sites?” and Patrick Pichette, the CFO, answered saying that essentially it’s Simpson’s Paradox.  The full question and answer can be heard here, starting at 55:40.
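To make the cost-per-click example concrete, here is a tiny sketch with made-up numbers (nothing from the actual earnings report): each segment’s CPC falls year-over-year, yet the aggregate CPC rises, because the click mix shifts toward the higher-CPC segment.

```python
# Illustrative (invented) numbers for two ad segments: (clicks, cost per click).
year1 = {"sites": (100, 1.00), "network": (100, 0.50)}
year2 = {"sites": (300, 0.95), "network": (100, 0.45)}

def aggregate_cpc(segments):
    """Weighted-average CPC across segments: total cost / total clicks."""
    total_cost = sum(clicks * cpc for clicks, cpc in segments.values())
    total_clicks = sum(clicks for clicks, _ in segments.values())
    return total_cost / total_clicks

# Each segment's CPC fell year-over-year...
assert all(year2[s][1] < year1[s][1] for s in year1)
# ...yet the aggregate CPC rose, because clicks shifted toward the
# higher-CPC "sites" segment -- a mix effect.
print(round(aggregate_cpc(year1), 3))  # 0.75
print(round(aggregate_cpc(year2), 3))  # 0.825
```

The same arithmetic underlies the paper’s examples: the aggregate is a weighted average, so a change in the weights can move it in the opposite direction from every component.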

More to come on this topic in the coming months :).

Posted in Uncategorized

Algorithms: heap data structure and intro to Greedy Algorithms

I’m currently taking the Algorithms 1/Algorithms 2 courses on Coursera.  This is an aside from pure data viz, but it’s good for building this part of the core CS foundation.  And, it’s fun!

Today’s lectures & main take-away messages

Heaps as Data Structures: (1) if you find yourself doing repeated minimum (or maximum) computations, consider a heap and (2) choosing the right data structure can decrease an algorithm’s running time
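As a quick illustration of the repeated-minimum pattern (my own example, not code from the course), Python’s built-in heapq module builds a min-heap in O(n) and then supports each extraction in O(log n), versus an O(n) scan per extraction with a plain list:

```python
import heapq

nums = [5, 1, 9, 3, 7]
heapq.heapify(nums)        # O(n): rearrange the list into a min-heap in place

smallest_first = []
while nums:
    # each heappop returns the current minimum in O(log n)
    smallest_first.append(heapq.heappop(nums))

print(smallest_first)      # [1, 3, 5, 7, 9]
```

This is exactly the structure behind heap-based algorithms like heapsort and Dijkstra’s shortest paths: many repeated minimum extractions, each cheap.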

Intro to Greedy Algorithms: (1) Greedy algorithms are one of the major algorithm design paradigms, along with divide & conquer, randomized, and dynamic programming.  (2) Compared with divide & conquer, greedy algorithms are generally easier to apply (whereas D & C requires the right insight into how to decompose the problem) and easier to classify in Big O terms (since often one aspect of the algorithm dominates the running time), but typically non-trivial to prove correct.

Optimal Caching as an example application: (1) the theoretical Belady algorithm is an example of greedy algorithm used to determine which elements to remove from a cache. (2) Even theoretical algorithms that are impossible to implement can be useful as a guideline for practical algorithms or by providing an idealized benchmark.
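Although Belady’s algorithm can’t be implemented in practice (it needs to see the future request sequence), it is easy to sketch offline against a known trace.  This is my own minimal version, not code from the course:

```python
def belady_misses(requests, cache_size):
    """Count cache misses under Belady's clairvoyant policy: on a miss with
    a full cache, evict the item whose next use is furthest in the future
    (or that is never requested again)."""
    cache, misses = set(), 0
    for i, item in enumerate(requests):
        if item in cache:
            continue                      # hit: nothing to do
        misses += 1
        if len(cache) >= cache_size:
            def next_use(c):
                # index of c's next request after position i, or infinity
                for j in range(i + 1, len(requests)):
                    if requests[j] == c:
                        return j
                return float("inf")
            # greedily evict the item needed furthest in the future
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses

print(belady_misses([1, 2, 3, 1, 2], cache_size=2))  # 4
```

Running it on a trace gives the idealized miss count that practical policies like LRU can be benchmarked against.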

Scheduling Application – intro: (1) scheduling can also be addressed with a greedy algorithm.  (2) In order to make trade-offs between different desired metrics, you can define an objective function.

Scheduling Application – algorithm: the sequential structure of the problem suggests it might be a good candidate for a greedy algorithm, and a good way to get started figuring out what that algorithm might be is to look at special cases.
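As a concrete toy sketch of this kind of scheduling problem (my own example: minimizing the weighted sum of completion times, with the objective function written out explicitly), the greedy rule is to run jobs in decreasing order of their weight-to-length ratio:

```python
def greedy_schedule(jobs):
    """Order (weight, length) jobs by decreasing weight/length ratio."""
    return sorted(jobs, key=lambda job: job[0] / job[1], reverse=True)

def weighted_completion_cost(jobs):
    """Objective function: sum of weight * completion_time when the jobs
    are run back-to-back in the given order."""
    cost, elapsed = 0, 0
    for weight, length in jobs:
        elapsed += length          # this job finishes at time `elapsed`
        cost += weight * elapsed
    return cost

jobs = [(1, 2), (3, 5)]            # (weight, length) pairs
print(weighted_completion_cost(jobs))                   # 23: given order
print(weighted_completion_cost(greedy_schedule(jobs)))  # 22: greedy order
```

The two special cases mentioned in the lecture fall out of the ratio: if all lengths are equal, higher weight goes first; if all weights are equal, shorter jobs go first.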

Posted in algorithms

Some troubleshooting suggestions for adapting “let’s make a map” to other countries

Following up on the previous Let’s Make a Map post, here is some additional info for others adapting the original demo to other countries or parts of the world.

Some things that you might need to take into account for other countries:

Centering & scale: parameters for the projection are obviously country dependent – so you’ll need to find the right parameters for the chosen country or countries.

Small territories: some countries include additional territories.  For example, Denmark, Sweden, and Norway all own island territories both near and far.  I ended up needing to filter these out.  If I remember right, they were causing problems primarily because they skewed the projected centroid, which broke the country-naming part of the code.  They were also getting labels, which wasn’t necessary.

Two different country code naming schemes: the ogr2ogr command to get the geographic boundaries uses ISO 3166-1 alpha-3 country codes, while the one for city names uses ISO 3166-1 alpha-2.  Not a big deal, but you will need to reference both to get the right set of codes to filter.

Non-English City Names & Character Encoding: for countries with non-English names, the character encoding might be off for non-English characters.  You can see that in my example for Sweden, where “Luleå” is spelled “Lulen.”  This happens in the ogr2ogr step when translating the data from Natural Earth’s download.  I’m still troubleshooting this, and will post the solution later. *

* Edit (July 11th): The issue is actually with the original dbf file, before translation. More info on the Natural Earth Forums.

Posted in d3