How to Data Analytics (in a Start-up)

3 Lessons I Learned as the “Chief Data Analyst” of a Silicon Valley–funded Start-up

From 2015 to 2017 I helped grow HoloBuilder Inc., a start-up providing virtual reality solutions for the construction industry, as their VP of Customer & Data Analytics & Optimization, which roughly translates to “Chief Data Analyst”. The company is headquartered in San Francisco, while I was part of their R&D lab in Aachen, Germany. I was responsible for the whole data analytics* pipeline — from collecting data on the web platform with Google Analytics and our own trackers, to processing the data in Google BigQuery, to visualizing it with tools like Power BI and Klipfolio. During my time in Aachen I learned a lot of valuable lessons. Here, I want to share the three most important ones that are directly concerned with data analytics (please scroll down for a TL;DR).

How to Data Analytics

1. Data Analytics ≈ UX Design

Data analytics is a lot like UX design. You have specific target audiences that expect to experience what you provide them with in the most optimal way — concerning content, presentation, and possibly interactivity. For instance, providing data for C-level management and for potential investors are two completely different stories. While management requires low-level insights concerning the software itself, among other things, for VCs we usually prepared more high-level business metrics, including projections and forecasts. Moreover, internal data would usually be provided through dynamic dashboards that could be adjusted and customized, while data for investors would rather be delivered in the form of PowerPoint slides that matched the layout of the pitch deck. Therefore, it is crucial to define a target audience (potentially even personas) and elicit requirements from that audience at the beginning of every data science process. At HoloBuilder Inc., this lesson became especially clear because of the split between San Francisco and Germany and the fact that most of the (potential) VCs were residing in Silicon Valley.

I am convinced that a data analyst without some proper UX skills — and, of course, adequate requirements and input — cannot be successful.

2. Ask the Right Questions — and Do So Early

This one goes hand in hand with requirements elicitation. Don’t provide analyses just for the sake of it!

This whole “let’s just analyze everything we can get” approach doesn’t work! It’s extremely important to define the questions you intend to answer beforehand. Tracking is cheap, so you can (and should!) track more than you need at the moment. But processing and visualizing data that nobody ever looks at eats up a whole lot of resources that would be better spent on the meaningful analysis and presentation of the few nuggets buried in your giant pile of big data. Also, having concrete questions in mind greatly helps with tailoring data structures more precisely to your specific needs. Of course, this doesn’t mean your infrastructure shouldn’t be flexible enough to react quickly to new and changing questions. In an optimal world, the data for answering new questions is already there and you “just” have to do the processing and visualization. In general: expect surprise on-demand questions at any time, so anticipate and be prepared!

(While the questions that need to be answered can be seen as part of the requirements elicitation, I treat them separately here, because I give requirements a more technical connotation — e.g., “possibility to toggle between line/bar charts” or “include difference to previous period in %” — compared to key questions such as “Why do we lose users?”.)

3. Data is Meaningless …

… unless you give it meaning by interpreting it. For this, it’s inherently important to not think in silos. A data analytics team has to closely cooperate with the UX team and (almost) all other teams in the company in order to find meaningful interpretations or reasons for the collected data. Yet, this is still not the norm in industry. For instance, there is still the widely believed misconception that A/B testing = usability testing.

To ensure meaningful data analytics, at HoloBuilder Inc., marketing manager Harry Handorf and I developed a boilerplate for a weekly KPI report that posed three crucial questions:

  1. Which data did we collect?
  2. Why does the data look the way it does?
  3. What actions must/should be taken based on the above?

That is, the first part delivered the hard facts; the second part explained these numbers (e.g., fewer sign-ups due to a change in the UI); and the third part presented concrete calls to action (e.g., undo the UI change). The report looked at these questions from the platform perspective as well as the marketing perspective. Therefore, we had to collaborate extensively with software engineers, designers, UX people, marketing, and sales to find meaningful answers. In line with the second lesson above, the basis of the report was always a set of higher-level questions defined beforehand, such as: “Does the new tutorial work?”, “How can we gain more customers?”, and “Have we reached our target growth?”. In general, the interpretation of data is based on the processed data and the questions to be answered, rather than on technical requirements.

Again, because this is really important: Your data is worth nothing without proper interpretation and input from outside the data analytics department.

Ultimately, to conclude this article, I don’t want to withhold from you Harry’s take on the topic:

“You might have heard of the metaphor for life feeling like a tornado. It perfectly applies to working with data of a young business — it spins you around with all of its metrics, data points and things you COULD measure. It’s noisy and wild. A good data scientist figures out how to step out of it. But that does not mean getting out of the tornado completely, letting it do its thing and becoming a passive spectator. It means getting inward, to the eye. Where silence and clarity allow for a better picture of what’s going on around you, defining appropriate KPIs and asking the right, well-thought-out questions.”
—Harry Handorf (tornado tamer)

TL;DR

  1. Data analytics is a lot like UX design! As a data analyst, you have to define target audiences and elicit requirements. Tailor content & presentation of your analyses to those.
  2. Define the questions to be answered beforehand, then process and interpret the data necessary to answer those questions. Don’t analyze everything you can just for the sake of it.
  3. Data is meaningless without interpretation. Extensively collaborate with other departments — especially UX — to ensure meaningful data analytics.

(This article has also been published in Startups.co on Medium.)

Footnotes

* What we did at HoloBuilder Inc. was clearly a mix of data analytics and data science. But since it was closer to the analytics part, I refer to it as data analytics in this article. In case you are interested in the specific differences between the two (and how difficult it is to tell them apart), I recommend reading the Wikipedia articles about data science and data analytics, as well as “Data Analytics vs Data Science: Two Separate, but Interconnected Disciplines” by Jerry A. Smith.

Acknowledgments

Special thanks go to Harry for proofreading the article & his valuable input.


Happiness—How Does It Work?

There’s no point in being sad,
the sun will keep on shining anyway.
— Farin Urlaub: “Sonne” (2005)

I know, I know. Not yet another one of those “How to find happiness” articles 🙄. So why am I writing this anyway? First of all, I think that writing is probably the best way of self-reflection—sadly, an art that is practiced way too little by those who would need it the most. Second, because I moved to the United States recently—which can be considered a pretty significant life event, I guess—I thought it might be a neat idea to read through the various diaries and notebooks I’ve kept over the past ten years. So, while I was going through my diary entries—dating back as far as January 2007 (three months before I started my Bachelor studies at the University of Koblenz)—I noticed there might be some interesting patterns. And since I’m a guy who likes to play around with data from time to time, I prepared an Excel sheet and did a little number crunching. I’m not going to give you the raw data (because privacy), but it looks something like this:

met friends often? | had a goal? | worked out regularly? | read a lot? | happy?
yes                | yes         | yes                   | yes         | 😊
somewhat           | yes         | somewhat              | somewhat    | 😐
no                 | yes         | no                    | yes         | 😔

Here are the higher-level results I found (spoiler alert: the following is kind of obvious). It seems like I was happier at times when I (in no particular order)

  • didn’t worry about money
  • often met with good friends
  • worked out a lot
  • did competitive sports on a regular basis
  • pursued my hobbies

compared to times when I did none or only a few of the above.

You’re still reading the article? That’s nice! Of course, from a statistical perspective it must be noted that the above are mere correlations (threshold = 0.7). That is, it could be that I just work out more when I’m happy for other reasons, and working out doesn’t actually contribute to my happiness, but that’s beyond the scope of this article. Since all of this is about myself, I can assure you that a certain degree of causality is given.
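If you want to try the same kind of number crunching on your own diary, here is a minimal sketch in R rather than Excel; the column names and values below are made up for illustration, with entries coded as yes = 1, somewhat = 0.5, and no = 0:

# Toy diary data (values are made up), coded as yes = 1, somewhat = 0.5, no = 0.
diary <- data.frame(
  met_friends = c(1, 0.5, 0, 1, 0),
  had_goal    = c(1, 1, 1, 0.5, 0),
  worked_out  = c(1, 0.5, 0, 1, 0.5),
  read_a_lot  = c(1, 0.5, 1, 0, 0.5),
  happy       = c(1, 0.5, 0, 1, 0)
)

# Correlation of each habit with happiness; keep those above the 0.7 threshold.
r <- cor(diary[, names(diary) != "happy"], diary$happy)[, 1]
r[abs(r) > 0.7]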

However, the point of all this is not primarily to tell you what makes me personally happy. Rather, I’ve observed that quite a few people are not really aware of what actually contributes to their happiness. As I noticed, in this case it can be very helpful to keep a diary and do the same little analysis that I’ve done. It’s really insightful to simply note down what you did and didn’t do at times when you were happy and at times when you were not, and then search for underlying patterns. Because every now and then, we all easily miss the obvious.

What Do Highly Successful Start-ups Have in Common?

Abstract: In the first year, a clear majority of the investigated companies secured more than $0.395m of seed or angel funding; in the second year, a clear majority secured more than $4.366m of Series A or venture funding; and in the third year, a clear majority secured more than $11.131m of Series B funding.


At bitstars GmbH / HoloBuilder Inc. I’ve been a part of “start-up grad school” for almost two years now. Most of the time, the early years of a start-up are a constant fight for a good valuation and big investments. The first steps towards a real, innovative product, the hunt for customers who (are going to) pay actual money to use it, and pitching nice figures to potential investors are at the core of this process. Recently, I’ve repeatedly asked myself what the early rounds of funding of the most successful start-ups looked like and whether they might all have something in common. So I’ve done some number crunching on the topic.

Method

First off, I needed a list of companies that are commonly recognized as highly successful start-ups. The most prominent features of these are high valuations and huge amounts of raised money. I went with Inc. magazine’s “15 Most Valuable Startups in the World”, the Telegraph’s “most valuable start-ups in the world” and Verge HQ’s “Top 20 Startups of All Time”. The 44 companies found in these three articles (Uber, Airbnb and Dropbox, among others) are referred to as “Group A” in the following. To complement the list with some very successful start-ups that have not (yet) reached the status of the above big players, I’ve created an additional “Group B”. It comprises 16 companies found in “10 Wildly Successful Startups and Lessons to Learn From Them” by Inc. magazine and the top 7 exits in the portfolio of “leading global venture capital seed fund and startup accelerator” 500 Startups.

In the next step I consulted Crunchbase and, for every company on my list—as far as the data was available—looked up the money raised in the first three major rounds of funding and the month/year in which it happened.1 Using the Consumer Price Index (CPI), all numbers have been inflation-adjusted, i.e., they are the equivalent amounts of money that would (have to) be raised in October 2016. There was no data available for 5 companies from Group A, which left me with a total of 55 data sets—39 in Group A (71%) and 16 in Group B (29%).
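(As a side note, the inflation adjustment itself is a one-liner. Here is a rough sketch in R; the CPI values below are placeholders, not the actual index numbers I used.)

# Inflation-adjust a raised amount to its October 2016 equivalent.
cpi_oct_2016 <- 241.7            # placeholder CPI for October 2016
cpi_at_raise <- 218.0            # placeholder CPI for the month of the raise
nominal_musd <- 5.0              # amount raised, in million USD
adjusted_musd <- nominal_musd * cpi_oct_2016 / cpi_at_raise
adjusted_musd                    # equivalent amount in October 2016 dollars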

In the following I report on my findings for groups A and B as well as both groups combined. They particularly focus on how much money was raised in the different rounds of funding and how many months after the founding of a company it happened.2 Before the analysis, outliers were removed from both data series (the amounts of money raised and months after founding) separately using Tukey’s test for outliers based on groups A and B combined.
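For readers who want to reproduce this step from the spreadsheet linked below, here is a minimal R sketch of the outlier removal and of the summary statistics reported in the following sections; the data frame and column names are assumptions, not the ones I actually used:

# Keep only values inside Tukey's fences: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
tukey_keep <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# 'rounds' is an assumed data frame for one round of funding (groups A and B
# combined) with columns amount_musd and months_after_founding.
keep  <- tukey_keep(rounds$amount_musd) & tukey_keep(rounds$months_after_founding)
clean <- rounds[keep, ]

mean(clean$amount_musd); sd(clean$amount_musd)              # avg. and σ of money raised
mean(clean$months_after_founding); sd(clean$months_after_founding)
quantile(clean$amount_musd, 0.25)                           # "75% raised more than ..."
quantile(clean$months_after_founding, 0.75)                 # "... within the first X months"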

The raw data I used for my analysis can be publicly accessed at https://docs.google.com/spreadsheets/d/1ge7RaUe6Pc6bpBM6PtbkVlgg35mmTJTMVSx3vbmb2s8/pubhtml.

1 Multiple investments of the same type (e.g., “Series A”) that happened in a relatively short time span were aggregated considering the month/year of the latest investment as the effective date.
2 The date of the first investment was assumed as the founding date of a company if it was earlier than the founding date given by Crunchbase. In case only the founding year of a company was given, I assumed June of that year as the founding date; or January if the first investment already happened in June or earlier.
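(The assumptions from the two footnotes above can be captured in a small helper like the following; this is just an illustrative R sketch with made-up argument names, not the exact code I used.)

# Determine the assumed founding date of a company (see footnote 2).
assumed_founding_date <- function(founded_date, founded_year, first_investment) {
  if (is.na(founded_date)) {
    # Only the founding year is known: assume June of that year, or January if
    # the first investment already happened in June of that year or earlier.
    cutoff <- as.Date(sprintf("%d-07-01", founded_year))
    founded_date <- if (!is.na(first_investment) && first_investment < cutoff) {
      as.Date(sprintf("%d-01-01", founded_year))
    } else {
      as.Date(sprintf("%d-06-01", founded_year))
    }
  }
  # If the first investment predates the (assumed) founding date, use it instead.
  if (!is.na(first_investment) && first_investment < founded_date) {
    first_investment
  } else {
    founded_date
  }
}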

1st Round of Funding

The seed or first angel investment (in case there was no dedicated seed round) of a start-up was considered as the first major round of funding. For this, Crunchbase provided data on 16 companies from Group A and 8 from Group B.

When looking at Group A—the most valuable start-ups—they raised roughly $0.932m on average (σ ≈ $0.692m) and reached this milestone an average of 7 months (σ ≈ 5) after the company was founded. Interestingly, the Group B start-ups on average raised more money in this first round, i.e., $1.374m (σ ≈ $0.976m), which happened an average of 6.5 months after founding (σ ≈ 6.5).

Combining the two groups gives us an average of $1.080m (σ ≈ $0.804m). This money was raised roughly 7 months after founding the start-up (avg. ≈ 6.8, σ ≈ 5.5). Neglecting outlier investments1, 75% of all considered companies raised more than $0.395m and managed to do so within the first 10.5 months after founding. For Group A only, these numbers are $0.395m and 11.25 months.

1 That is, an investment that is an outlier in either the money or the time dimension.

2nd Round of Funding

The Series A or first VC investment (in case there was no dedicated Series A) of a company was considered as the second major round of funding. Crunchbase provided data on 37 companies from Group A and 16 from Group B for this.

Group A secured an average $11.372m (σ ≈ $7.573m) in this round, roughly 16 months after founding (avg. = 16.25, σ ≈ 10.12). Group B falls a little short, with “only” $7.032m raised on average (σ ≈ $5.361m), but in a similar timeframe (avg. = 16.8, σ ≈ 10.73). The average amount raised by Group B is about 62% of that raised by Group A.

Combining the two groups yields an average of roughly $9.925m (σ ≈ $7.160m) raised after approximately 16.5 months (avg. ≈ 16.43, σ ≈ 10.20). Not considering outlier investments, 75% of the start-ups in both groups achieved a Series A/Venture funding of over $4.366m within the first 23.25 months of their existence. When looking at Group A only, these numbers change to $5.471m and 23 months.

3rd Round of Funding

The Series B investment was considered as the third major round of funding. For this, Crunchbase provided data on 31 companies from Group A and 13 from Group B.

Group A companies raised an average investment of roughly $25.619m (σ ≈ $14.093m) in this round, an average of 27.54 months (σ ≈ 12.35) after having been founded. The relative difference to Group B stays almost constant compared to the second round of funding, with an average investment of roughly $15.614m (σ ≈ $14.093m), i.e., 61% of the funding secured by Group A. Group B companies secured their investments an average of 32.92 months (σ ≈ 17.22) after having been founded.

When looking at both groups combined, the average investment is roughly $22.196m (σ ≈ $12.998m), raised an average of 29.24 months (σ ≈ 14.09) after the founding of the company. 75% of all considered start-ups secured an investment of at least $11.131m within the first 35 months (without outlier investments). For Group A only, these numbers are $13.167m and 33 months.

Conclusions

The question posed in the title of this article is “What Do Highly Successful Start-ups Have in Common?”. So let’s see what we’ve learned. First off, Seed/Angel funding seems to usually be secured within the first year of a company’s existence, Series A/Venture funding within the second, and Series B funding within the third.

[Figure: months after founding at which each round of funding was secured]

When looking at the amounts of money raised, it becomes evident that the difference between the most successful (Group A) and the slightly less famous (Group B) companies is negligible in the first round of funding, but becomes more considerable in the following two rounds. This is most probably due to a mutual effect of “if you raise more money, you become more famous” and “if you are more famous, you can raise more money”. Still, the first huge investment usually comes before the fame. Therefore, the numbers given in this article can be (cautiously) considered a common trait of highly successful start-ups.

Hence, to answer the initial question: Based on the first quartiles determined earlier we can state that in the first year, a clear majority of the investigated companies secured more than $0.395m of seed or angel funding; in the second year, a clear majority secured more than $4.366m of Series A or venture funding; and in the third year, a clear majority secured more than $11.131m of Series B funding.

[Figure: money raised in each round of funding]

Additionally, the following scatter plot maps all investigated investments (without outliers) from all three rounds. As can be seen, the rounds overlap and the variance in both money and time becomes bigger with each round of funding.

[Figure: scatter plot of all investments (without outliers) across the three rounds of funding]
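(If you would like to recreate a plot like this from the published spreadsheet, a minimal ggplot2 sketch could look as follows; the data frame 'all_rounds' and its column names are assumptions.)

# Scatter plot of money raised vs. time, coloured by round of funding.
library(ggplot2)
ggplot(all_rounds, aes(x = months_after_founding, y = amount_musd, colour = round)) +
  geom_point(alpha = 0.8) +
  labs(x = "Months after founding",
       y = "Money raised (million USD, Oct 2016)",
       colour = "Round of funding")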

Finally, it is important to note that my analysis—as originally intended—only investigates the commonalities in the funding of the considered companies. What I haven’t done is look at what explicitly distinguishes these (highly) successful start-ups from start-ups that failed. That being said, although proper funding is a key factor for success most of the time, you can raise just as much money as the companies described above in the same amount of time and still fail if you don’t make proper use of your investments. The other way round, it is of course also possible to fall short of the figures above and still build a highly successful start-up. Always bear in mind that it takes more than money to transform a start-up idea—no matter how awesome it is—into a successful company!

(This article has also been published on Startups.co.)

My 2015 in Blogging

In 2015 I was not as busy as last year when it came to blogging, mainly due to my new job and my PhD thesis. But still, the WordPress.com stats helper monkeys prepared a very nice 2015 annual report for Twenty Oh Eight. I vow that I’m gonna post more regularly again next year!

In case someone actually reads this: I wish you a Happy New Year! 😉

Here’s an excerpt:

A San Francisco cable car holds 60 people. This blog was viewed about 2,700 times in 2015. If it were a cable car, it would take about 45 trips to carry that many people.

Click here to see the complete report.

What ’bout some fancy dashboards for ya? Power BI vs. Geckoboard

In my capacity as the chief data analyst of bitstars, it’s one of my key responsibilities to regularly compile all relevant figures concerning our web platform HoloBuilder. These figures are mostly intended for people who don’t have the time to dive deeply into fancy but complicated statistics. Hence, from a user experience perspective, it’s crucial to provide them in an easy-to-understand and pleasant-to-look-at form. A well-established way of doing so is data visualization in its various forms, provided as dashboards for optimal accessibility. Since we are currently redesigning our internal process for providing figures and statistics, I’ve done some research on two potential software solutions that could be used for this.

Requirements

Since we are talking about a solution for our internal process at bitstars, there is a set of company-specific requirements a candidate software solution has to fulfill. In particular, these requirements are:

  • cumulative charts,
  • Google Analytics integration,
  • AdWords integration,
  • Facebook Ads integration, and
  • MailChimp integration.

Moreover, there are two nice-to-haves:

  • Pipedrive integration and
  • AWS integration.

In the following, I investigate two possible solutions—Power BI and Geckoboard—which are evaluated against the above requirements.

Power BI

Power BI is a cloud-based business analytics service provided by Microsoft. It comes as a part of the Office 365 suite, but can also be used standalone. There is an online as well as a desktop version, with the latter having a significantly larger range of functions. Power BI distinguishes between dashboards (“[…] something you create or something a colleague creates and shares with you. It is a single canvas that contains one or more tiles.”) and reports [“one or more pages of visualizations (charts and graphs)”]. Reports can be saved in the Power BI Desktop file format (.pbix); dashboards can be shared.

Power BI comes with a rather limited range of integrable services, among which are Google Analytics (☑) and MailChimp (☑). AdWords (☑) statistics can be integrated via Google Analytics if the respective accounts are connected. However, integration for Facebook Ads (✖), Pipedrive (✖), and AWS (✖) is still missing. Facebook Ads integration has been requested, but is yet to be realized. Moreover, there is functionality to integrate data from Excel and CSV files (from your computer or OneDrive) or Azure SQL databases, among others, which also enables you to import your own custom data.

The basic version of Power BI can be used for free while Power BI Pro comes for $9.99 per user & month.

How to Create a Cumulative Chart in Power BI

Cumulative charts are not a built-in functionality of Power BI, but can be easily realized using Data Analysis Expressions (DAX, ☑). That is, you have to create a new measure in your dataset. Assume, for instance, you want a cumulative chart of your sales (to be accumulated, Y axis) over time, which are only present in your dataset as the number of sales per date (X axis). The DAX formula for your new measure would be as follows:

Measure = CALCULATE(
  SUM('Your Dataset'[Sales]);              -- sum up the sales ...
  FILTER(
    ALL('Your Dataset'[Date]);             -- ... over all dates ...
    'Your Dataset'[Date] <= MAX('Your Dataset'[Date])  -- ... up to the current date
  )
)

(found at http://www.daxpatterns.com/cumulative-total/). You can then simply add a chart visualizing your new measure (Y axis) per date (X axis) to your Power BI report to obtain your desired cumulative chart.

Geckoboard

Geckoboard is a web platform for creating individual dashboards that show your business’s KPIs (key performance indicators), e.g., unique visits to your website, Facebook likes, or sales per day. The platform has built-in support for integrating a wide range of external data sources, including Google Analytics (☑), AdWords (☑), Facebook Ads (☑), MailChimp (☑), Pipedrive (☑), AWS (☑), and many more (in fact, way more than Power BI). Moreover, Geckoboard supports CSV and Google Sheets integration for your own custom data.

Like in Power BI, there is no built-in support for cumulative charts. However, since it is easily possible to create those in Google Sheets (see, e.g., this link), they can simply be imported and visualized in Geckoboard as well (☑). Of course, this means an additional intermediate step is required.

Geckoboard offers no free plan. Paid plans start from $49 per month for one user and two dashboards.

Conclusion

                             | Power BI | Geckoboard
Cumulative charts            | (☑)1     | (☑)1
Google Analytics integration | ☑        | ☑
AdWords integration          | ☑        | ☑
Facebook Ads integration     | ✖        | ☑
MailChimp integration        | ☑        | ☑
(Pipedrive integration)      | ✖        | ☑
(AWS integration)            | ✖        | ☑
Overall rating               | ⭐⭐⭐      | ⭐⭐⭐⭐

Both tools lack built-in functionality for cumulative charts, but provide means for importing your own custom data. When it comes to the integration of third-party services, Geckoboard supports a significantly larger range of available data sources. Because of this, I give Power BI an overall rating of 3 out of 5 (⭐⭐⭐). Although Geckoboard’s pricing is higher and its cumulative charts require an additional intermediate step, the overall package makes a better impression regarding what we need at bitstars, so it receives a rating of 4 out of 5 (⭐⭐⭐⭐).

To summarize, if you’re fine with Google Analytics stats and some custom data imported via Excel files or an Azure DB, go for Power BI. Yet, if you rely on the seamless integration of a wider range of external services, you’re clearly better off with Geckoboard—unless you wanna implement the integration of the different services’ APIs yourself in a DIY solution.

1 These are given in parentheses because an additional intermediate step is required.

How to do Power Law Regression in R, or What Happened When One of my Posts Made it to the Front Page of Hacker News

Recently, my post about motherfuckingwebsite.com was featured on the front page of Y Combinator’s Hacker News. I was watching football with some friends when my WordPress app told me twice that my blog’s stats “are exploding”. Back at home, I checked my stats and noticed that I had almost 6,000 unique visitors and over 7,000 views that day (i.e., “day 1”), instead of the usual 4–10 visitors. The effect of the publication on Hacker News was noticeable until day 7, with 13 visitors and 16 views, which was still above average. The number of visitors was back to normal (4 visitors/views) only on day 8.

day | visitors | views
1   | 5,893    | 7,246
2   | 793      | 974
3   | 190      | 246
4   | 53       | 78
5   | 32       | 35
6   | 21       | 29
7   | 13       | 16

Fitting a Linear Model

The temporal progression of the numbers strongly reminded me of a power law function, i.e., a function of the form y = a * x^b. So I posed myself the following question: How can I determine the parameters a and b from the given empirical data, and what will they look like? A quick Google search pointed me to a very helpful StackExchange page. The solution given there was to fit a linear regression model based on a logarithmic scaling of both the x and y axes:

data <- read.csv(file="data.csv")                  # read the daily stats shown above
plot(data$days, data$visitors, log="xy", cex=0.8)  # scatter plot on log-log axes
model <- lm(log(data$visitors) ~ log(data$days))   # linear regression in log-log space
points(data$days,
  round(exp(coef(model)[1] + coef(model)[2] * log(data$days))),
  col="red")                                       # back-transformed fitted values
#visitors (black) vs. fitted model (red)
[Figure: actual number of visitors (black) vs. fitted model (red), on a double-logarithmic scale]

Transforming the Regression Function

The determined model yields two parameters, coef(model)[1] =: c and coef(model)[2] =: d. Yet, since the model is a linear regression model, these parameters correspond to the function y = e^(c + d * ln(x)). To obtain the desired form y = a * x^b, we can transform the given function as follows:

y = e^{c + d \cdot \ln(x)} \Leftrightarrow y = e^c \cdot e^{d \cdot \ln(x)} \Leftrightarrow y = e^c \cdot x^d

Results

This means that the parameter a is given by e^c, while the parameter b is given by d. Thus, they can be obtained using the following assignments in R:

a <- exp(coef(model)[1])  # a = e^c
b <- coef(model)[2]       # b = d

For my data, this gave me the following parameters:

data     | c        | a           | b         | regression function
visitors | 8.729153 | 6180.491018 | -3.220374 | y = 6180.491018 * x^-3.220374
views    | 8.948748 | 7698.247619 | -3.203968 | y = 7698.247619 * x^-3.203968

To conclude, simple power law regression is not as difficult as it might seem at first. However, in scientific work, it is necessary to also investigate uncertainty in the regression parameters and test the significance of the fitted model. More information on this topic can be found in this paper and on the accompanying website.
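As a starting point, the standard errors, p-values, and confidence intervals of the log-log fit can be obtained directly from the model object created above; this is only a minimal sketch, not a full significance analysis:

summary(model)             # std. errors, t statistics, and p-values for c and d
confint(model)             # 95% confidence intervals for c and d

# b is simply d, so its interval can be read off directly; for a = e^c,
# exponentiate the interval of the intercept:
exp(confint(model)[1, ])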