Global broadband pricing study: Updated dataset

Vincent Chiu is a Technical Program Manager at Google.

Since 2012, Google has supported the study and publication of broadband pricing for researchers, policymakers and the private sector in order to better understand the affordability landscape and help consumers make smarter choices about broadband access. We released the first dataset in August 2012 and have periodically refreshed the data since (May 2013, March 2014, and February 2015). This data has become an integral part of our understanding of global broadband affordability. Harvard’s Berkman Center, Facebook, and others actively use the data to understand the broadband landscape. Today, we’re releasing the latest dataset.

For the mobile dataset, we increased the number of countries represented from 112 to 157, and the number of carriers from 331 to 402. For the fixed-line dataset, we increased the number of countries from 105 to 159, and the number of carriers from 331 to 424. At the country level, this data covers 99.3% of current Internet users globally.

The collection methodology is designed to capture the cost of data plans. We collected samples from a broad range of light to heavy data usage plans, and recorded numerous individual plan parameters, such as downstream bandwidth and monthly cost. Finally, where possible, we collected plans from multiple carriers in each country to get an accurate picture.

Broadband Data:

  • Price observations for fixed broadband plans can be found here.
  • Mobile broadband prices can be found here.

We provide this information to help people understand the state of Internet access and make data-driven decisions. Alongside this data collection effort, we have analyzed the pricing data and conducted research on affordability and Internet penetration. Our early results point to several topics deserving further discussion within the ICT data community, including metric normalization, income distribution, and broadband value. We believe:

  • Normalization is essential for any meaningful statistical analysis. The diversity of plan types and the complexity of tariff structures surrounding mobile broadband pricing require a careful analytical methodology for normalization (see the sketch after this list).
  • Income distribution needs to be considered when assessing broadband affordability. The commonly used GNI per capita metric is based on the average national income level, which does not account for inequality in income distribution.
  • A high broadband value-to-cost ratio is important to drive Internet adoption: beyond affordability metrics alone, we need to take broadband experience into account to define meaningful value-for-the-money metrics.
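
To make the first two points concrete, here is a minimal sketch in Python of the kind of normalization we have in mind; the column names and numbers are invented for illustration and are not the released datasets’ schema:

    # Minimal sketch of plan-price normalization. Column names are invented for
    # illustration; the released datasets' actual schemas may differ.
    import pandas as pd

    plans = pd.DataFrame({
        "country": ["A", "A", "B"],
        "monthly_cost_usd": [10.0, 25.0, 8.0],
        "data_cap_gb": [1.0, 5.0, 2.0],
        "gni_per_capita_usd": [3600, 3600, 1200],  # annual national average
    })

    # Normalize price to a common unit (USD per GB) so plans are comparable.
    plans["usd_per_gb"] = plans["monthly_cost_usd"] / plans["data_cap_gb"]

    # Express monthly cost as a share of average monthly income. Note that this
    # uses a national average and ignores how income is distributed.
    plans["cost_share_of_income"] = (
        plans["monthly_cost_usd"] / (plans["gni_per_capita_usd"] / 12)
    )

    print(plans[["country", "usd_per_gb", "cost_share_of_income"]])

A distribution-aware variant of the last step would replace GNI per capita with income by decile, so affordability could be assessed for, say, the bottom 40% of earners rather than the national average.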

We look forward to sharing more findings on these topics in 2016. Please stay tuned for updates on our progress. If you have any feedback on the methodology, contact us at broadband-study@google.com.


The reusable holdout: Preserving validity in adaptive data analysis

Moritz Hardt is a Research Scientist at Google. This post was originally published on the Google Research Blog.

Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But as with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular XKCD cartoon illustrates the point: if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a p-value below the infamous 0.05 significance level.

Image credit: XKCD

Unfortunately, the problem of false discovery is even more delicate than the cartoon suggests. Correcting reported p-values for a fixed number of multiple tests is a fairly well understood topic in statistics. A simple approach is to multiply each p-value by the number of tests, but there are more sophisticated tools. However, almost all existing approaches to ensuring the validity of statistical inferences assume that the analyst performs a fixed procedure chosen before the data are examined, for example, “test all 20 flavors of jelly beans.” In practice, however, the analyst is informed by data exploration, as well as by the results of previous analyses. How did the scientist choose to study acne and jelly beans in the first place? Often such choices are influenced by previous interactions with the same data. This adaptive behavior of the analyst leads to an increased risk of spurious discoveries that are neither prevented nor detected by standard approaches. Each adaptive choice the analyst makes multiplies the number of analyses that could follow, and it is often difficult or impossible to describe and analyze the exact experimental setup ahead of time.

In The Reusable Holdout: Preserving Validity in Adaptive Data Analysis, a joint work with Cynthia Dwork (Microsoft Research), Vitaly Feldman (IBM Almaden Research Center), Toniann Pitassi (University of Toronto), Omer Reingold (Samsung Research America) and Aaron Roth (University of Pennsylvania), to appear in Science tomorrow, we present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time.

The curse of adaptivity

A beautiful example of how false discovery arises as a result of adaptivity is Freedman’s paradox. Suppose that we want to build a model that explains “systolic blood pressure” in terms of hundreds of variables quantifying the intake of various kinds of food. In order to reduce the number of variables and simplify our task, we first select some promising-looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard F-test from our favorite statistics textbook and report the resulting p-value.

Inference after selection: We first select a subset of the variables based on a data-dependent criterion and then fit a linear model on the selected variables.

Freedman showed that the reported p-value is highly misleading—even if the data were completely random, with no correlation whatsoever between the response variable and the data points, we’d likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but never accounted for this. There is a huge number of possible subsets of variables we could have selected from, and the mere fact that we chose one test over another by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.
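
The effect is easy to reproduce. The following sketch (our own illustration, not code from the paper) generates a completely random response, selects the 20 predictors most correlated with it out of 200 random candidates, and then fits a linear model on the selected subset; the F-test p-value frequently comes out “significant” even though there is nothing to find:

    # Freedman's paradox on purely random data: select "promising" variables by
    # their correlation with the response, then fit OLS on the selected subset.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, p = 100, 200
    X = rng.standard_normal((n, p))   # random predictors
    y = rng.standard_normal(n)        # response independent of the predictors

    # Adaptive step: keep the variables whose sample correlation with y looks strong.
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected = np.argsort(-np.abs(corr))[:20]

    # Fit a linear model on the selected variables only and report the F-test.
    model = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
    print(model.f_pvalue)   # often below 0.05, despite y being pure noise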

Freedman’s paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false, as argued by Gelman and Loken, who aptly refer to adaptivity as the “garden of forking paths.”

Machine learning competitions and holdout sets

Adaptivity is not just an issue with p-values in the empirical sciences; it affects other domains of data science just as much. Machine learning competitions are a perfect example. Competitions have become an extremely popular format for solving prediction and classification problems of all sorts.

Each team in the competition has full access to a publicly available training set which they use to build a predictive model for a certain task such as image classification. Competitors can repeatedly submit a model and see how the model performs on a fixed holdout data set not available to them. The central component of any competition is the public leaderboard which ranks all teams according to the prediction accuracy of their best model so far on the holdout. Every time a team makes a submission they observe the score of their model on the same holdout data. This methodology is inspired by the classic holdout method for validating the performance of a predictive model.

Ideally, the holdout score gives an accurate estimate of the true performance of the model on the underlying distribution from which the data were drawn. However, this is only the case when the model is independent of the holdout data! In contrast, in a competition the model generally incorporates previously observed feedback from the holdout set. Competitors work adaptively and iteratively with the feedback they receive. An improved score for one submission might convince the team to tweak their current approach, while a lower score might cause them to try out a different strategy. But the moment a team modifies their model based on a previously observed holdout score, they create a dependency between the model and the holdout data that invalidates the assumption of the classic holdout method. As a result, competitors may begin to overfit to the holdout data that supports the leaderboard. This means that their score on the public leaderboard continues to improve, while the true performance of the model does not. In fact, unreliable leaderboards are a widely observed phenomenon in machine learning competitions.
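
A toy simulation (ours, not an experiment from the competition literature) makes the mechanism visible: aggregate random “models” that happen to beat chance on the holdout, and the holdout score climbs while accuracy on fresh data stays at 50%:

    # Leaderboard overfitting in miniature: candidate "models" are random sign
    # vectors with no real signal; we keep those that beat chance on the holdout
    # and combine them by majority vote.
    import numpy as np

    rng = np.random.default_rng(1)
    n_holdout, n_fresh, n_candidates = 2000, 2000, 500
    y_holdout = rng.choice([-1, 1], n_holdout)   # labels are pure coin flips
    y_fresh = rng.choice([-1, 1], n_fresh)

    kept = []
    for _ in range(n_candidates):
        cand_holdout = rng.choice([-1, 1], n_holdout)   # random predictions
        cand_fresh = rng.choice([-1, 1], n_fresh)
        if np.mean(cand_holdout == y_holdout) > 0.5:    # adaptive peek at the holdout
            kept.append((cand_holdout, cand_fresh))

    vote_holdout = np.sign(sum(c for c, _ in kept))
    vote_fresh = np.sign(sum(c for _, c in kept))
    print("holdout accuracy:", np.mean(vote_holdout == y_holdout))   # well above 0.5
    print("fresh-data accuracy:", np.mean(vote_fresh == y_fresh))    # about 0.5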

Reusable holdout sets

A standard proposal for coping with adaptivity is simply to discourage it. In the empirical sciences, this proposal is known as pre-registration and requires the researcher to specify the exact experimental setup ahead of time. While possible in some simple cases, it is in general too restrictive as it runs counter to today’s complex data analysis workflows.

Rather than limiting the analyst, our approach provides means of reliably verifying the results of an arbitrary adaptive data analysis. The key tool for doing so is what we call the reusable holdout method. As with the classic holdout method discussed above, the analyst is given unfettered access to the training data. What changes is that there is a new algorithm in charge of evaluating statistics on the holdout set. This algorithm ensures that the holdout set maintains the essential guarantees of fresh data over the course of many estimation steps.

The limit of the method is determined by the size of the holdout set—the number of times that the holdout set may be used grows roughly as the square of the number of collected data points in the holdout, as our theory shows.

Armed with the reusable holdout, the analyst is free to explore the training data and verify tentative conclusions on the holdout set. It is now entirely safe to use any information provided by the holdout algorithm in the choice of new analyses to carry out, or the tweaking of existing models and parameters.

A general methodology

The reusable holdout is only one instance of a broader methodology that is, perhaps surprisingly, based on differential privacy—a notion of privacy preservation in data analysis. At its core, differential privacy is a notion of stability requiring that any single sample should not influence the outcome of the analysis significantly.

Example of a stable learning algorithm: Deletion of any single data point does not affect the accuracy of the classifier much.

A beautiful line of work in machine learning shows that various notions of stability imply generalization. That is, any sample estimate computed by a stable algorithm (such as the prediction accuracy of a model on a sample) must be close to what we would observe on fresh data.

What sets differential privacy apart from other stability notions is that it is preserved under adaptive composition. Combining multiple algorithms that each preserve differential privacy yields a new algorithm that also satisfies differential privacy, albeit at some quantitative loss in the stability guarantee. This is true even if the output of one algorithm influences the choice of the next. This strong adaptive composition property is what makes differential privacy an excellent stability notion for adaptive data analysis.

In a nutshell, the reusable holdout mechanism is simply this: access the holdout set only through a suitable differentially private algorithm. It is important to note, however, that the user does not need to understand differential privacy to use our method. The user interface of the reusable holdout is the same as that of the widely used classical method.
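
As an illustration only—the parameters and details below are ours, a simplification of the Thresholdout mechanism described in the paper—the interface might look like this:

    # Sketch of a Thresholdout-style reusable holdout: report the holdout value of
    # a statistic only when it differs noticeably from the training value, and add
    # Laplace noise when it does. Threshold and noise scale are illustrative.
    import numpy as np

    class ReusableHoldout:
        def __init__(self, train, holdout, threshold=0.04, sigma=0.01, seed=0):
            self.train, self.holdout = train, holdout
            self.threshold, self.sigma = threshold, sigma
            self.rng = np.random.default_rng(seed)

        def query(self, stat):
            """stat: a function mapping a dataset to a number in [0, 1]."""
            train_val, holdout_val = stat(self.train), stat(self.holdout)
            # Noisy threshold test: does the holdout disagree with training?
            if abs(train_val - holdout_val) > self.threshold + self.rng.laplace(0, self.sigma):
                # Overfitting suspected: return a noise-perturbed holdout estimate.
                return holdout_val + self.rng.laplace(0, self.sigma)
            # Otherwise the training value is already a faithful estimate.
            return train_val

The analyst calls query() exactly as they would evaluate a statistic on a classical holdout; the threshold test and the added noise are what preserve the holdout’s guarantees across many adaptive queries.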

Reliable benchmarks

A closely related work with Avrim Blum dives deeper into the problem of maintaining a reliable leaderboard in machine learning competitions (see this blog post for more background). While the reusable holdout could be used directly for this purpose, it turns out that a variant of it, which we call the Ladder algorithm, provides even better accuracy.

This method is not just useful for machine learning competitions, since there are many problems that are roughly equivalent to that of maintaining an accurate leaderboard in a competition. Consider, for example, a performance benchmark that a company uses to test improvements to a system internally before deploying them in a production system. As the benchmark data set is used repeatedly and adaptively for tasks such as model selection, hyper-parameter search and testing, there is a danger that eventually the benchmark becomes unreliable.
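
Roughly, the Ladder only updates a team’s public score when a new submission improves on the best previously reported score by more than a fixed step size. A simplified sketch (ours; the step size and rounding rule are illustrative) looks like this:

    # Sketch of a Ladder-style leaderboard: a submission changes the public score
    # only if it beats the best reported score by more than a step size eta.
    import numpy as np

    class Ladder:
        def __init__(self, y_holdout, eta=0.01):
            self.y = np.asarray(y_holdout)
            self.eta = eta
            self.best = 0.0   # best reported accuracy so far

        def submit(self, predictions):
            score = np.mean(np.asarray(predictions) == self.y)
            if score > self.best + self.eta:
                self.best = round(score / self.eta) * self.eta   # report on a coarse grid
            return self.best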

Conclusion

Modern data analysis is inherently an adaptive process. Attempts to limit what data scientists will do in practice are ill-fated. Instead, we should create tools that respect the usual workflow of data science while at the same time increasing the reliability of data-driven insights. It is our goal to continue exploring techniques that can help create more reliable validation methods and benchmarks that track true performance more accurately than existing approaches.


Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity

Gaurav Gupta is Dalberg’s Regional Director for Asia.

Did you know that India is expected to see the greatest migration to cities of any country in the world in the next three decades, with over 400 million new inhabitants moving into urban areas? To accommodate this influx of city dwellers, India’s urban infrastructure will have to grow, too.

That growth has already begun. In the last six years alone, India’s road network has already expanded by one-quarter, while the number of total businesses increased by one-third.

To better understand how smart maps—citizen-centric maps that crowdsource, capture, and share a broad range of detailed data—can help India develop smarter and more efficient cities, our team at Dalberg Global Development Advisors worked with the Confederation of Indian Industry on a new study, Smart Maps for Smart Cities: India’s $8 Billion+ Opportunity. We found that even for a select set of use cases, smart maps can help India gain over US$8 billion in savings and value, save 13,000 lives, and reduce one million metric tons of carbon emissions a year in cities alone. Their aggregate impact is likely to be several multiples higher.

Our research shows that simple improvements in basic maps can lead to significant social impact: smart maps can also help businesses attract more consumers, increase foreign tourist spending and even help women feel safer.

In these quickly changing cityscapes, online tools like maps need to be especially dynamic, able to update faster and quickly expand coverage of local businesses in order to serve as highly useful tools for citizens. Yet today, most cities lack sophisticated online tools that make changing information, like road conditions and new businesses, easy to find online. Only 10-20% of India’s businesses, for instance, are listed on online maps.

So what will it take to continue developing smart maps to help power these cities? Our study shows that India will need to embrace a new policy framework that truly encourages scalable solutions and innovation by promoting crowdsourcing and creating a single accessible point of contact between government and the local mapping industry.


Moving beyond the binary of connectivity

Back in April, we shared a post from designer and Internet researcher An Xiao Mina about the “sneakernet.” She has a new post on The Society Pages in which she sets out to define a concept she calls the binary of connectivity.

But what exactly is this binary of connectivity? Attendees at my talk asked me to define it, and I’d like to propose a working definition:

The connectivity binary is the view that there is a single mode of connecting to the internet — one person, one device, one always-on subscription.

The connectivity binary is grounded in a Western, urban, middle class mode of connectivity; this mode of connecting is seen as the penultimate realization of our relationship to the internet and communications technologies. Thinking in a binary way renders other modes of access invisible, both to makers and influencers on the internet and to advertising engines and big data, and it limits our understanding of the internet and its global impact.

I can imagine at least two axes of a connectivity spectrum: single vs. shared usage, and continuous vs. intermittent access. For many readers of Cyborgology, single usage, continuous access to the web is likely the norm. The most extreme example of this might be iconized in the now infamous image of Robert Scoble wearing Google Glass in the shower–we are always connected, always getting feeds of data our way.

Here’s how other sections of those axes might map to practices I’ve observed in different parts of the world. Imagine these at differing degrees away from the center of a matrix:

  • Shared Usage, Continuous Access: I saved up to buy a laptop with a USB stick that my family of four can use. We take turns using it, and our connection is pretty stable.
  • Single, Intermittent: I have a low-cost Chinese feature phone (maybe a Xiaomi), and I pay a few dollars each month for 10 MB of access. I keep my data plan off most of the time.
  • Shared, Intermittent: I walk all day to visit an internet cafe once every few months to check my Facebook account, listen to music on YouTube and practice my typing skills. I don’t own a computer myself.

For the purposes of simplicity, I’m assuming that we’re talking about devices that have one connection. But, of course, some devices have multiple connections (think of a phone with multiple SIMs) and some connections have multiple devices (think of roommates sharing a wifi router).

Read the full post here.

Exploring the world of data-driven innovation

Mike Masnick is founder of the Copia Institute.

In the last few years, there’s obviously been a tremendous explosion in the amount of data floating around. But we’ve also seen an explosion in the efforts to understand and make use of that data in valuable and important ways. The advances, both in terms of the type and amount of data available, combined with advances in computing power to analyze the data, are opening up entirely new fields of innovation that simply weren’t possible before.

We recently launched a new think tank, the Copia Institute, focused on looking at the big challenges and opportunities facing the innovation world today. An area we’re deeply interested in is data-driven innovation. To explore this space more thoroughly, the Copia Institute is putting together an ongoing series of case studies on data-driven innovation, with the first few now available in the Copia library.

Our first set of case studies includes a look at how the Polymerase Chain Reaction (PCR) helped jumpstart the biotechnology field we know today. PCR is, in short, a technique for copying DNA, something that was previously extremely difficult to do (outside of living things copying their own DNA). The discovery was something of an accident: a scientist found that certain microbes survived in the high temperatures of the hot springs of Yellowstone National Park, something previously thought impossible. That finding prompted further study that eventually led to the creation of PCR.

PCR was patented but licensed widely and generously. It basically became the key to biotech and genetic research in a variety of different areas. The Human Genome Project, for example, was possible only thanks to the widespread availability of PCR. Those involved in the early efforts around PCR were actively looking to share the information and concept rather than lock it up entirely, although there were debates about doing just that. By making sure that the process was widely available, it helped to accelerate innovation in the biotech and genetics fields. And with the recent expiration of the original PCR patents, the technology is even more widespread today, expanding its contribution to the field.

Another case study explores the value of the HeLa cells in medical research—cancer research in particular. While the initial discovery of HeLa cells may have come under dubious circumstances, their contribution to medical advancement cannot be overstated. The name of the HeLa cells comes from the patient they were originally taken from, a woman named Henrietta Lacks. Unlike previous human cell samples, HeLa cells continued to grow and thrive after being removed from Henrietta. The cells were made widely available and have contributed to a huge number of medical advancements, including work that has resulted in five Nobel prizes to date.

With both PCR and HeLa cells, we saw an important pattern: an early discovery that was shared widely, enabling much greater innovation to flow from proliferation of data. It was the widespread sharing of information and ideas that contributed to many of these key breakthroughs involving biotechnology and health.

At the same time, both cases raise certain questions about how best to handle similar developments in the future. There are questions about intellectual property, privacy, information sharing, trade secrecy and much more. At the Copia Institute, we plan to dive more deeply into many of these issues with our continuing series of case studies, as well as through research and events.


Disability Confident: How can we measure if government policy is working?

Andy White is Employment and Working Age Manager in the Evidence & Service Impact Section at RNIB.

Current UK government policy to improve employment opportunities for disabled people is based on the government’s Disability Confident campaign. Charities such as RNIB are keeping a close watch on this by measuring its impact on the employment rates of disabled people.

Blind and partially sighted people are significantly less likely to be in paid employment than the general population or other disabled people. For every three registered blind and partially sighted people of working age, only one is in paid employment. Worse, blind and partially sighted people are nearly five times more likely than the general population to have had no paid work for five years.

Measuring the employment rates of people registered as blind (serious sight impaired) or partially sighted (sight impaired) gives us the clearest indication of the employment status of people living with sight loss. But even among those not registered, the Labour Force Survey indicates that just over 4% of people who are described as “long term disabled with a seeing difficulty” are employed, compared with 74% of the general population.

One way to increase the number of blind and partially sighted people in employment is to focus on increasing the supply of blind and partially sighted people to the labour market by building their attributes and capabilities, and on increasing the demand for meaningful work by supporting creative employment opportunities.

Another approach is to support people with sight loss to keep working—27% of non-working registered blind and partially sighted people said that the main reason for leaving their last job was the onset of sight loss or deterioration of their sight. However, 30% who were not working but who had worked in the past said that they maybe or definitely could have continued in their job given the right support.

We can address this by providing blind and partially sighted people with appropriate vocational rehabilitation support, and helping employers understand the business case for job retention. This is a challenge, given that the majority of employers have a negative attitude toward employing a blind or partially sighted person.

Blind and partially sighted people looking for work need specialist support on their journey towards employment. In addition to barriers common to anyone out of work for a long period, blind and partially sighted jobseekers have specific needs related to their sight loss.

Research indicates that those furthest from the labour market require a more resource-intensive model of support than those who are actively seeking work. Many blind and partially sighted jobseekers fall into this category.

The increased pressure on out-of-work blind and partially sighted people to join employment programmes means greater engagement in welfare-to-work programmes, and an increasing responsibility for the welfare-to-work industry to meet the specific needs of blind and partially sighted jobseekers.

Government policies such as the Disability Confident campaign will only be effective when there is a sea change in the proportion of blind and partially sighted people of working age achieving greater independence through paid employment.

Research about the employment status of blind and partially sighted people can be found on the Knowledge Hub section of RNIB’s website. We also publish a series of evidence-based reviews, including one for people of working age, upon which this blog is based.


Data shows what millions knew: the Internet was really slow!

Meredith Whittaker is Open Source Research Lead at Google.

For much of 2013 and 2014, accessing major content and services was nearly impossible for millions of US Internet users. That sounds like a big deal, right? It is. But it’s also hard to document. Users complained, the press reported disputes between Netflix and Comcast, but the scope and extent of the problem weren’t understood until late 2014.

This is thanks in large part to M-Lab, a broad collaboration of academic and industry researchers committed to openly and empirically measuring global Internet performance. Using a massive archive of open data, M-Lab researchers uncovered interconnection problems between Internet service providers (ISPs) that resulted in nationwide performance slowdowns. Their published report, ISP Interconnection and its Impact on Consumer Internet Performance, lays out the data.

To back up a moment—interconnection sounds complicated. It’s not. Interconnection is the means by which different networks connect to each other. This connection allows you to access online content and services hosted anywhere, not just content and services hosted by a single access provider (think AOL in the 1990s vs. today’s Internet). By definition, the Inter-net wouldn’t exist without interconnection.

Interconnection points are the places where Internet traffic crosses from one network to another. Uncongested interconnection points are critical to a healthy, open Internet. Put another way, it doesn’t matter how wide the road is on either side—if the bridge is too narrow, traffic will be slow.

M-Lab data and research exposed just such slowdowns. Let’s take a look…

The chart above shows download throughput data collected by M-Lab in NYC between February 2013 and September 2014. It reflects traffic between customers of Time Warner Cable, Verizon, and Comcast—major ISPs—and an M-Lab server hosted on Cogent’s network. Cogent is a major transit ISP, and much of the content and many of the services people use are hosted on Cogent’s network and on similar transit networks. Traffic between people and the content they want to access has to move through an interconnection point between their ISP (TWC, Comcast, and Verizon, in this case) and Cogent. What we see here, then, is severe degradation of download throughput between these ISPs and Cogent that lasted for about a year. During this time, customers of these three ISPs attempting to access anything hosted on Cogent in NYC were subjected to severely slowed Internet performance.

But maybe things are just slow, no?

Here you see download throughput in NYC during the same time period, for the same three ISPs (plus Cablevision). The difference: here they are accessing an M-Lab server hosted on Internap’s network (another transit ISP). In this case, in the same region, for the same general population of users, during the same time, download throughput was stable. Content and services accessed on Internap’s network performed just fine.

Couldn’t this just be Cogent’s problem? Another good question…

Here we return to Cogent. This graph spans the same time period, in NYC, looking again at download throughput across a Cogent interconnection point. The difference? We’re looking at traffic to customers of the ISP Cablevision.

Comparing these three graphs, we see M-Lab data exposing problems that aren’t specific to any single ISP, but that instead stem from the relationship between pairs of ISPs—in this example, Cogent when paired with Time Warner Cable, Comcast, or Verizon. Technically, this relationship manifests as interconnection.

These graphs focus on NYC, but M-Lab saw similar patterns across the US as researchers examined performance trends across pairs of ISPs nationwide—e.g., whenever Comcast interconnected with Cogent. The research shows that the scope and scale of interconnection-related performance issues were nationwide and continued for over a year. It also shows that these issues were not strictly technical in nature. In many cases, the same patterns of performance degradation existed across the US wherever a given pair of ISPs interconnected. This rules out a regional technical problem and instead points to business disputes as the cause of congestion.
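
For readers who want to reproduce this kind of comparison, the sketch below computes median daily download throughput per access-ISP/server-network pair; the file and column names are hypothetical stand-ins, not the actual M-Lab/NDT schema, though the underlying measurements are openly available from M-Lab:

    # Compare median daily download throughput across (access ISP, server network)
    # pairs. "ndt_nyc_2013_2014.csv" and the column names are hypothetical.
    import pandas as pd

    tests = pd.read_csv("ndt_nyc_2013_2014.csv", parse_dates=["test_date"])

    daily_medians = (
        tests
        .groupby(["access_isp", "server_network", pd.Grouper(key="test_date", freq="D")])
        ["download_mbps"]
        .median()
        .reset_index()
    )

    # e.g. Comcast customers reaching Cogent vs. Internap over the same period
    comcast = daily_medians[daily_medians["access_isp"] == "Comcast"]
    print(comcast.pivot(index="test_date", columns="server_network",
                        values="download_mbps").head())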

M-Lab research shows that when interconnection goes bad, it’s not theoretical: it interferes with real people trying to do critical things. Good data and careful research helped to quantify the real, human impact of what had been relegated to technical discussion lists and sidebars in long policy documents. More focus on open data projects like M-Lab could help quantify the human impact across myriad issues, moving us from a hypothetical to a real and actionable understanding of how to draft better policies.


Mapping the sneakernet

In March, internet researcher and designer An Xiao Mina published a fascinating piece on The New Inquiry about “the sneakernet,” a concept that addresses the nuances of connectivity and the myriad social methods through which people exchange culture, access, and information. In the article, she shares an anecdote from a research trip to Northern Uganda, a region where residents had no access to the electric grid or running water and access to 3G internet was limited by both availability and affordability. She writes:

At night, residents turn on their radios, and those who can afford Chinese feature phones play mp3s. One day, I heard familiar lyrics:

Hey, I just met you
And this is crazy
But here’s my number
So call me maybe

I turned my head. A number of young people gathered around a woman rocking out to Carly Rae Jepsen’s “Call Me Maybe,” a song that owes so much of its success to the viral power of YouTube and Justin Bieber. The phone’s owner wasn’t accessing it via the Internet. Rather, she had an mp3 acquired through a Bluetooth transfer with a friend.

Indeed, the song was just one of many media files I saw on people’s phones: There were Chinese kung fu movies, Nigerian comedies, and Ugandan pop music. They were physically transferred, phone to phone, Bluetooth to Bluetooth, USB stick to USB stick, over hundreds of miles by an informal sneakernet of entertainment media downloaded from the Internet or burned from DVDs, bringing media that’s popular in video halls—basically, small theaters for watching DVDs—to their own villages and huts.

In geographic distribution charts of Carly Rae Jepsen’s virality, you’d be hard pressed to find impressions from this part of the world. Nor is this sneakernet practice unique to the region. On the other end of the continent, in Mali, music researcher Christopher Kirkley has documented a music trade using Bluetooth transfers that is similar to what I saw in northern Uganda. These forms of data transfer and access, though quite common, are invisible to traditional measures of connectivity and Big Data research methods. Like millions around the world with direct internet connections, young people in “unconnected” regions are participating in the great viral products of the Internet, consuming mass media files and generating and transferring their own media.

What does this have to do with public policy? At the end of the piece, An explains how understanding connectivity as a spectrum, rather than a binary, can inform policies and strategies for outreach and access. To illustrate this, she uses a vivid water analogy:

Like water, the Internet is vast, familiar and seemingly ubiquitous but with extremes of unequal access. Some people have clean, unfettered and flowing data from invisible but reliable sources. Many more experience polluted and flaky sources, and they have to combine patience and filters to get the right set of data they need. Others must hike dozens of miles of paved and dirt roads to access the Internet like water from a well, ferrying it back in fits and spurts when the opportunity arises. And yet more get trickles of data here and there from friends and family, in the form of printouts, a song played on a phone’s speaker, an interesting status update from Facebook relayed orally, a radio station that features stories from the Internet.

Like water from a river, data from the Internet can be scooped up and irrigated and splashed around in novel ways. Whether it’s north of the Nile in Uganda or south of Market St. in the Bay Area, policies and strategies for connecting the “unconnected” should take into account the vast spectrum of ways that people find and access data. Packets of information can be distributed via SMS and mobile 3G but also pieces of paper, USB sticks and Bluetooth. Solar-powered computer kiosks in rural areas can have simple capabilities for connecting to mobile phones’ SD cards for upload and download. Technology training courses can start with a more nuanced base level of understanding, rather than assuming zero knowledge of the basics of computing and network transfer. These are broad strokes, of course; the specifics of motivation and methods are complex and need to be studied carefully in any given instance. But the very channels that ferry entertainment media can also ferry health care information, educational material and anything else in compact enough form.

An Xiao Mina is a product owner at Meedan and an internet researcher with The Civic Beat.


How are Internet start-ups affected by liability for user content?

David Jevons is a Partner at Oxera.

Internet intermediaries facilitate the free flow of information online by helping users find, share and access content. However, users may sometimes share copyright-protected or illegal content; ‘internet intermediary liability’ (IIL) laws define the extent to which the intermediaries are liable for this. Holding internet intermediary start-ups accountable for user content will reduce the costs of enforcement but may also weaken the incentive for entrepreneurs to develop new intermediary business models. To help inform this debate, Google asked our team at Oxera to examine the effect of different IIL laws on the success rates and profitability of Internet start-ups, including a detailed examination of four countries: Germany, Chile, Thailand and India.

The effects on start-ups of clear and cost-efficient requirements

Ambiguity in IIL laws can lead to over-enforcement, which can alienate users. SoundCloud, a streaming service based in Berlin, suffered a user backlash over issues with its takedown policy, including petitions and threats to launch a competing platform. Over-compliance is a related issue, and it can be costly for the start-up: MThai, a web portal in Thailand, employs more than 20 people to check content before it is uploaded, and prevents uploading during the night, in order to limit its costs. In extreme cases, ambiguity in legislation can lead to inadvertent violations of the law. The executives of Guruji, an Indian search engine, were arrested in 2010 following claims of copyright infringement, which eventually led to the shutdown of its music search site.

In line with these examples, we find that intermediary start-ups could benefit considerably from a modified IIL regime with legislation that is clearer and sanctions that are focussed on cases where it is socially efficient to hold intermediaries liable. This is reflected in the quantitative results of our study, with the largest effects found in markets (such as India and Thailand) where current legislation is most ambiguous. Our analysis indicates that an improved IIL regime could increase start-up success rates for intermediaries in our focus countries by between 4% (Chile) and 24% (Thailand) and raise their expected profit by between 1% (Chile) and 5% (India).

Estimated impact on start-up success rates (%)
Estimated impact on the expected profits of successful start-ups (%)

Implications for the design of future IIL regimes

The IIL regime is only one of several levers available to policymakers wishing to encourage more start-up activity; however, it may be one of the easier ones to pull for those wanting to stimulate growth in this sector.

Our study highlighted the following implications for the design of future IIL regimes:

  • Find the right balance between the cost effective enforcement of copyright and allowing innovation in intermediary start-ups.
  • Costs matter when designing safe harbours. The costs of compliance are likely to have a considerable impact on intermediaries, particularly on start-ups.
  • Legal uncertainty increases the costs of compliance. Intermediaries will find it difficult to ascertain the required level of compliance and may ‘over-comply’.
  • Start-ups comply with take down requests as they do not have the resources to engage in legal action. Legitimate user content may be removed as a precaution.
  • Start-up vibrancy can be lost, as high risks and compliance costs increase the likelihood that a start-up with a commercially sound, legitimate business model fails.

If you are interested in finding out more about our study and the economic issues surrounding IIL, please read our full study on the Oxera website.
