Global broadband pricing study: Updated dataset

Vincent Chiu is a Technical Program Manager at Google.

Since 2012, Google has supported the study and publication of broadband pricing for researchers, policymakers and the private sector in order to better understand the affordability landscape and help consumers make smarter choices about broadband access. We released the first dataset in August 2012, and periodically refresh the data (May 2013, March 2014, and February 2015). This data has become an integral part of our understanding of global broadband affordability. Harvard’s Berkman Center, Facebook, and others actively use the data to understand the broadband landscape. Today, we’re releasing the latest dataset.

For the mobile dataset, we increased the number of countries represented from 112 to 157, and the number of carriers from 331 to 402. For the fixed-line dataset, we increased the number of countries from 105 to 159, and the number of carriers from 331 to 424. This data covers 99.3% of current Internet users globally on a country level.

The collection methodology is designed to capture the cost of data plans. We collected samples from a broad range of light to heavy data usage plans, and recorded numerous individual plan parameters, such as downstream bandwidth and monthly cost. Finally, where possible, we collected plans from multiple carriers in each country to get an accurate picture.

Broadband Data:

  • Price observations for fixed broadband plans can be found here.
  • Mobile broadband prices can be found here.

We provide this information to help people understand the state of Internet access and make data-driven decisions. Along with this data collection effort, we have analyzed the pricing data and conducted research on affordability and Internet penetration. Our early results indicate several topics deserving further discussion within the ICT data community, including metric normalization, income distribution, and broadband value. We believe:

  • Normalization is essential for any meaningful statistical analysis. The diversity of plan types and the complexity of tariff structures surrounding mobile broadband pricing require a careful analytical methodology for normalization (see the sketch after this list).
  • Income distribution needs to be considered when assessing the broadband affordability situation. The commonly used GNI per capita metric is based on average national income level, which does not consider inequality of income distribution.
  • A high broadband value-to-cost ratio is important for driving Internet adoption: beyond affordability metrics alone, we need to take broadband experience into account to define meaningful value-for-the-money metrics.
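As a rough illustration of the normalization point above, one common approach is to reduce each plan to comparable units, such as price per GB and cost as a share of monthly income. The sketch below is illustrative only; the plan fields and the income figure are hypothetical and are not drawn from the released dataset.

```python
# Minimal sketch: normalize heterogeneous broadband plans into comparable
# affordability metrics. Field names and figures are hypothetical.

plans = [
    {"carrier": "A", "monthly_cost_usd": 20.0, "data_cap_gb": 5,   "downstream_mbps": 10},
    {"carrier": "B", "monthly_cost_usd": 45.0, "data_cap_gb": 50,  "downstream_mbps": 25},
    {"carrier": "C", "monthly_cost_usd": 60.0, "data_cap_gb": 200, "downstream_mbps": 100},
]

monthly_income_usd = 450.0  # hypothetical, e.g. GNI per capita divided by 12

for p in plans:
    price_per_gb = p["monthly_cost_usd"] / p["data_cap_gb"]
    cost_share_of_income = p["monthly_cost_usd"] / monthly_income_usd
    mbps_per_dollar = p["downstream_mbps"] / p["monthly_cost_usd"]  # crude "value" proxy
    print(f'{p["carrier"]}: ${price_per_gb:.2f}/GB, '
          f'{cost_share_of_income:.1%} of monthly income, '
          f'{mbps_per_dollar:.2f} Mbps per dollar')
```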

We look forward to sharing more findings on these topics in 2016. Please stay tuned for updates on our progress. If you have any feedback on the methodology, contact us at [email protected]


It’s Humans, Not Algorithms, That Have a Bias Problem

Joshua New is a policy analyst at the Center for Data Innovation. Reposted from CDI’s blog.

Bias in big data. Automated discrimination. Algorithms that erode civil liberties.

These are some of the fears that the White House, the Federal Trade Commission, and other critics have expressed about an increasingly data-driven world. But these critics tend to forget that the world is already full of bias, and discrimination permeates human decision-making.

The truth is that the shift to a more data-driven world represents an unparalleled opportunity to crack down on unfair consumer discrimination by using data analysis to expose biases and reduce human prejudice. This opportunity is aptly demonstrated by the Consumer Financial Protection Bureau’s (CFPB) December 2013 auto loan discrimination suit against Ally Financial, the largest such suit in history, in which data and algorithms played a critical role in identifying and combating racial bias.

CFPB found that, from April 2011 to December 2013, Ally Financial had unfairly set higher interest rates on auto loans for 235,000 minority borrowers and ordered the company to pay out $80 million in damages. But the investigation also posed an interesting challenge: Since creditors are generally prohibited from collecting data on an applicant’s race, there was no hard evidence showing Ally had engaged in discriminatory practices. To piece together what really happened, CFPB used an algorithm to infer a borrower’s race based on other information in his or her loan application. Its analysis identified widespread overcharging of minority borrowers as a result of discriminatory interest rate markups at car dealerships.
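CFPB has publicly described the proxy it used in analyses of this kind as Bayesian Improved Surname Geocoding (BISG), which combines surname-based and geography-based race probabilities. The sketch below is a simplified illustration of that idea with invented probability tables; it is not the Bureau's actual model.

```python
# Simplified BISG-style illustration: combine P(race | surname) with
# P(race | census tract) via Bayes' rule, dividing out the national base rate.
# All probability tables here are invented for illustration only.

surname_prob = {  # P(race | surname), hypothetical
    "GARCIA": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "asian": 0.02},
    "SMITH":  {"white": 0.73, "black": 0.22, "hispanic": 0.03, "asian": 0.02},
}
tract_prob = {    # P(race | census tract), hypothetical
    "tract_123": {"white": 0.60, "black": 0.20, "hispanic": 0.15, "asian": 0.05},
}
national_prob = {"white": 0.63, "black": 0.13, "hispanic": 0.17, "asian": 0.07}

def race_proxy(surname: str, tract: str) -> dict:
    prior = surname_prob[surname.upper()]
    geo = tract_prob[tract]
    # Assumes surname and geography are independent given race.
    unnormalized = {r: prior[r] * geo[r] / national_prob[r] for r in prior}
    total = sum(unnormalized.values())
    return {r: round(p / total, 3) for r, p in unnormalized.items()}

print(race_proxy("Garcia", "tract_123"))
```

In analyses like CFPB's, probabilities of this kind are used in aggregate to estimate group-level disparities, not to definitively classify any individual borrower.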

Ally Financial buys retail installment contracts from more than 12,000 automobile dealers in the United States, essentially allowing dealers to act as middlemen for auto loans. If a consumer decides to finance his or her new car through a dealership rather than a bank, the dealership submits the consumer’s application to a company like Ally. If approved, the consumer pays back the dealership with interest. The interest rate, of course, matters a great deal. To determine what it will be, Ally calculates a “buy rate”—a minimum interest rate for which it is willing to purchase a retail installment contract, as determined by actuarial models. Ally notifies dealerships of this buy rate, but then also gives them substantial leeway to increase the interest rate to make the contract more profitable. Though consumers are free to negotiate these rates and shop around for the best deal, CFPB’s analysis determined that discretionary dealership pricing had a disparate impact on borrowers who were African American, Hispanic, Asian, or Pacific Islanders. On average, they paid between $200 and $300 more than similarly situated white borrowers.

Since creditors cannot inquire about race or ethnicity, Ally’s algorithmically generated buy rates are objective assessments. But when dealerships increase these rates, their judgments are entirely subjective, relying on humans to make decisions that could very well be influenced by racial bias. If dealerships instead took a similar approach to creditors and automated this decision-making process, there would be no opportunity for human bias to enter the equation. While dealerships could still increase interest rates to capture more profits, they could do so based on algorithmic analysis of predefined criteria about a consumer’s willingness to pay, thereby preventing themselves from offering similar consumers different rates based on their race.

Policymakers should guard against the possibility that automated decision-making could perpetuate bias, but with ever-increasing opportunities to collect and analyze data, the public and private sectors also should follow CFPB’s lead and identify new opportunities where data analytics can help expose and reduce human bias. For example, employers could rely on algorithms to select job applicants for interviews based on their objective qualifications rather than relying on human oversight that can be biased against factors such as whether or not the job applicant has an African American–sounding name. And taxi services could rely on algorithms to match drivers with riders rather than leaving it up to drivers who might be inclined to discriminate against passengers based on their race. If policymakers let fear of computerized decision-making impede wider deployment of fair algorithms, then society will lose a valuable opportunity to build a more just world.


Use Data and Innovation to Match Resources with Need? Sure, We Can Do That

Hannah Walker is Director of Government Relations at the Food Marketing Institute (FMI). Reposted with permission from FMI’s blog.

We have all heard the concerning statistics regarding the amount of food going to waste in the United States while many go without. As a founding member of the Food Waste Reduction Alliance (FWRA), FMI has been addressing the challenge of reducing food waste both in the United States and globally. FMI recently participated in an announcement with the U.S. Department of Agriculture and the Environmental Protection Agency to emphasize our commitment to the issue and highlight the importance of collaboration between government and the private sector.

I wanted to highlight an interesting and innovative case of using data to address this pressing public concern and stewardship issue. Feeding America has partnered with dozens of grocers across the country seeking creative ways to solve both the food waste and hunger problems. For years, grocers had limited options when perishables were reaching their sell-by date; the primary one was sending them to the landfill. Significant changes came in 2006, when many of FMI’s members began teaming up with Feeding America to better identify perishable food and donate it rather than discard it. This improved collaboration has proven incredibly successful; grocers donated over 1.4 billion pounds of food between July 1, 2014, and June 30, 2015, a truly amazing increase from the 140 million pounds first donated when the program started in 2006.

In a recent conversation with Feeding America, I learned that they have found a great willingness from our retail members, from large national chains down to smaller operators, to donate perishables that will stock the shelves of the local food bank rather than adding to their local landfill. In one short decade, the partnership between the grocery industry and Feeding America has made perishables, such as meat, dairy, and produce, much more common items on food bank shelves.

This smart and seemingly simple solution is backed by the use of data, innovation and analytics to measure what and how much food is received and where to send it so that it reaches those in the greatest need. By matching meal gap data with available resources, our local food banks are able to serve those who are in the greatest need while reducing our national food waste at the same time.

While 1.4 billion pounds is an incredible improvement over the 140 million pounds reported just nine years ago, there is always more that can be done. Feeding America and its grocery partners are currently targeting an additional 300 million pounds of food they believe they can recover by further optimizing the data and collection process.

There will never be one solution to the challenges of food waste and hunger in the United States and abroad; however, creative ideas like this partnership, backed by strong data and innovation, are making great strides toward both goals.


Data for social good: suicide prevention

Earlier this week, GOOD Magazine published an interesting piece by Mark Hay on suicide prevention titled “Can Big Data Help Us Fight Rising Suicide Rates?” The part of the article that talks about data-driven prevention starts about halfway through. What follows is an excerpt from that section.

Yet there is one frontier in suicide prevention that seems especially promising, though in a way, it may be a bit removed from the problem’s human element: big data predictions and intervention targeting.

We know that some populations are more likely than others to commit suicide. Men in the United States account for 79 percent of all suicides. People in their 20s are at higher risk than others. And whites and Native Americans tend to have higher suicide rates than other ethnicities. Yet we don’t have the greatest ability to grasp trends and other niche factors to build up actionable, targetable profiles of communities where we should focus our efforts. We’re stuck trying to expand a suicide prevention dragnet, as opposed to getting individuals at risk the precise information they need (even if they don’t tip off major signs to their friends and family).

That’s a big part of why last year, groups like the National Action Alliance for Suicide Prevention’s Research Prioritization Task Force listed better surveillance, data collection, and research on existing data as priorities for work in the field over the next decade. It’s also why multiple organizations are now developing algorithms to sort through diverse datasets, trying to identify behaviors, social media posting trends, language, lifestyle changes, or any other proxy that can help us predict suicidal tendencies. By doing this, the theory goes, we can target and deliver exactly the right information.

One of the greatest proponents of this data-heavy approach to suicide prevention is the United States Army, which suffers from a suicide rate many times higher than the general population. In 2012, they had more suicide deaths than casualties in Afghanistan. Yet with millions of soldiers stationed around the globe and limited suicide prevention resources, it’s been difficult to simply rely on expanding the dragnet. Instead, last December the Army announced that they’d developed an algorithm that distills the details of a soldier’s personal information into a set of 400 characteristics that mix and match to show whether an individual is likely in need of intervention. Their analysis isn’t perfect yet, but they’ve been able to identify a cluster of characteristics within 5 percent of military personnel who accounted for 52 percent of suicides, showing that they’re on the right track to better targeting and allocating prevention resources.
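Stepping outside the excerpt for a moment: the kind of concentration described above, where a small, high-risk slice accounts for a large share of outcomes, takes only a few lines to compute once risk scores exist. The sketch below uses simulated scores and outcomes, not the Army's model or data.

```python
# Sketch: how much of a rare outcome is concentrated in the top 5% by risk score?
# Scores and outcomes below are simulated; this is not the Army's model or data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
risk_score = rng.beta(2, 50, size=n)           # skewed, mostly-low risk scores
outcome = rng.random(n) < risk_score * 0.02    # rare event, more likely at high risk

top_5pct = risk_score >= np.quantile(risk_score, 0.95)
share_captured = outcome[top_5pct].sum() / outcome.sum()
print(f"Top 5% of scores capture {share_captured:.0%} of simulated events")
# With a genuinely predictive score, a small top slice can account for a
# disproportionate share of events, which is what makes targeting feasible.
```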

Yet perhaps the greatest distillation of this data-driven approach (combined with the expansive, barrier-reducing impulse of mainstream efforts) is the Crisis Text Line. Created in 2013 by organizers from DoSomething.org, the text line allows those too scared, embarrassed, or uncomfortable to vocalize their problems to friends, or over a hotline, to simply trace a pattern on a cell phone keypad (741741) and then type their problems in a text message. As of 2015, algorithmic learning allows the Crisis Text Line to search for keywords, based on over 8 million previous texts and data gathered from hundreds of suicide prevention workers, to identify who’s at serious risk and assign counselors to respond. But more than that, the data in texts can trip off time and vocabulary sensors, matching counselors with expertise in certain areas to respond to specific texters, or bringing up precisely tailored resources. For example, the system knows that self-harm peaks at 4 a.m. and that people typing “Mormon” are usually dealing with issues related to LGBTQ identity, discrimination, and isolation. Low-impact and low-cost with high potential for delivering the best information possible to those in need, it’s one of the cleverer young programs out there pushing the suicide prevention gains made over the last century.

It’ll be a few years before we can understand the impact of data analysis and targeting on suicide prevention efforts, especially relative to general attempts to expand existing programs. And given the limited success of a half-century of serious gains in understanding and resource provision, we’d be wise not to get our hopes up too much. But it’s not unreasonable to suspect that a combination of diversifying means of access, lowering barriers of communication, and better identifying those at risk could help us bring programs to populations that have not yet received them (or that we could not support quickly enough before). At the very least, crunching existing data may help us to discover why suicide rates have increased in recent years and to understand the mechanisms of this widespread social issue. We have solid, logical reason to support the development of programs like the Army’s algorithms and the Crisis Text Line, and to push for further similar initiatives. But really we have reason to support any kind of suicide prevention innovation, even if it feels less robust or promising than the recent data-driven efforts. If you’ve ever witnessed the pain that those moving towards suicide feel, or the wide-reaching fallout after someone takes his or her life, you’ll understand the visceral, human need to let a thousand flowers bloom, desperately hoping that one of them sticks. Hopefully, if data mining and targeting works well, that’ll only inspire further innovation, slowly putting a greater and greater dent in the phenomenon of suicide.


The reusable holdout: Preserving validity in adaptive data analysis

Moritz Hardt is a Research Scientist at Google. This post was originally published on the Google Research Blog.

Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular XKCD cartoon illustrates that if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a p-value below the infamous 0.05 significance level.

Image credit: XKCD

Unfortunately, the problem of false discovery is even more delicate than the cartoon suggests. Correcting reported p-values for a fixed number of multiple tests is a fairly well understood topic in statistics. A simple approach is to multiply each p-value by the number of tests, but there are more sophisticated tools. However, almost all existing approaches to ensuring the validity of statistical inferences assume that the analyst performs a fixed procedure chosen before the data are examined. For example, “test all 20 flavors of jelly beans.” In practice, however, the analyst is informed by data exploration, as well as the results of previous analyses. How did the scientist choose to study acne and jelly beans in the first place? Often such choices are influenced by previous interactions with the same data. This adaptive behavior of the analyst leads to an increased risk of spurious discoveries that are neither prevented nor detected by standard approaches. Each adaptive choice the analyst makes multiplies the number of possible analyses that could follow; it is often difficult or impossible to describe and analyze the exact experimental setup ahead of time.
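For concreteness, the simple correction mentioned above (multiply each p-value by the number of tests) is the Bonferroni correction; a minimal sketch:

```python
# Bonferroni correction: multiply each p-value by the number of tests
# (capping at 1.0) before comparing against the significance level.

def bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p, p_adj, p_adj < alpha) for p, p_adj in zip(p_values, adjusted)]

# 20 jelly-bean flavors: one raw p-value sneaks under 0.05, but none survive correction.
raw = [0.74, 0.32, 0.048, 0.51, 0.88, 0.27, 0.66, 0.09, 0.43, 0.71,
       0.15, 0.95, 0.38, 0.62, 0.81, 0.22, 0.57, 0.30, 0.69, 0.12]
for p, p_adj, significant in bonferroni(raw):
    if p < 0.05:
        print(f"raw p={p}: adjusted p={p_adj:.2f}, significant after correction? {significant}")
```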

In The Reusable Holdout: Preserving Validity in Adaptive Data Analysis, a joint work with Cynthia Dwork (Microsoft Research), Vitaly Feldman (IBM Almaden Research Center), Toniann Pitassi (University of Toronto), Omer Reingold (Samsung Research America) and Aaron Roth (University of Pennsylvania), to appear in Science tomorrow, we present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time.

The curse of adaptivity

A beautiful example of how false discovery arises as a result of adaptivity is Freedman’s paradox. Suppose that we want to build a model that explains “systolic blood pressure” in terms of hundreds of variables quantifying the intake of various kinds of food. In order to reduce the number of variables and simplify our task, we first select some promising looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard F-test from our favorite statistics textbook and report the resulting p-value.

Inference after selection: We first select a subset of the variables based on a data-dependent criterion and then fit a linear model on the selected variables.

Freedman showed that the reported p-value is highly misleading—even if the data were completely random with no correlation whatsoever between the response variable and the data points, we’d likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but we never account for this fact. There is a huge number of possible subsets of variables that we selected from. The mere fact that we chose one test over the other by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.
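The effect is easy to reproduce in a few lines of simulation (a sketch using numpy and statsmodels; the exact fraction will vary with the seed and the selection cutoff): generate pure noise, keep only the predictors that happen to correlate with the response, refit, and the F-test comes out “significant” far more often than the nominal 5 percent.

```python
# Simulate Freedman's paradox: select variables on the data, then run an
# F-test on the selected variables as if no selection had happened.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, d = 100, 50
trials = 200
significant = 0
for _ in range(trials):
    X = rng.normal(size=(n, d))          # pure noise predictors
    y = rng.normal(size=n)               # pure noise response
    # "Promising" variables: absolute correlation with y above a loose cutoff.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
    selected = X[:, corr > 0.1]
    model = sm.OLS(y, sm.add_constant(selected)).fit()
    significant += model.f_pvalue < 0.05
print(f"F-test 'significant' in {significant / trials:.0%} of pure-noise trials")
```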

Freedman’s paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false, as was argued by Gelman and Loken, who aptly refer to adaptivity as the “garden of forking paths.”

Machine learning competitions and holdout sets

Adaptivity is not just an issue with p-values in the empirical sciences. It affects other domains of data science as well. Machine learning competitions are a perfect example. Competitions have become an extremely popular format for solving prediction and classification problems of all sorts.

Each team in the competition has full access to a publicly available training set which they use to build a predictive model for a certain task such as image classification. Competitors can repeatedly submit a model and see how the model performs on a fixed holdout data set not available to them. The central component of any competition is the public leaderboard which ranks all teams according to the prediction accuracy of their best model so far on the holdout. Every time a team makes a submission they observe the score of their model on the same holdout data. This methodology is inspired by the classic holdout method for validating the performance of a predictive model.

Ideally, the holdout score gives an accurate estimate of the true performance of the model on the underlying distribution from which the data were drawn. However, this is only the case when the model is independent of the holdout data! In contrast, in a competition the model generally incorporates previously observed feedback from the holdout set. Competitors work adaptively and iteratively with the feedback they receive. An improved score for one submission might convince the team to tweak their current approach, while a lower score might cause them to try out a different strategy. But the moment a team modifies their model based on a previously observed holdout score, they create a dependency between the model and the holdout data that invalidates the assumption of the classic holdout method. As a result, competitors may begin to overfit to the holdout data that supports the leaderboard. This means that their score on the public leaderboard continues to improve, while the true performance of the model does not. In fact, unreliable leaderboards are a widely observed phenomenon in machine learning competitions.
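A toy demonstration of how adaptive feedback alone can inflate a holdout score (this is not an experiment from the paper; it is a simplified version of an adaptive attack sometimes called “wacky boosting”): submit many random guesses, keep only the ones the leaderboard rewards, and aggregate them by majority vote. The apparent holdout score climbs even though the labels are pure noise.

```python
# Toy demonstration of overfitting a fixed holdout through adaptive submissions.
# Holdout labels are pure coin flips, so the true accuracy of any rule is 50%.
import numpy as np

rng = np.random.default_rng(2)
n_holdout = 2000
y_holdout = rng.integers(0, 2, n_holdout)

kept = []
for _ in range(500):                         # 500 adaptive "submissions"
    preds = rng.integers(0, 2, n_holdout)    # a random guess for every holdout point
    if (preds == y_holdout).mean() > 0.5:    # keep it only if the leaderboard rewards it
        kept.append(preds)

# Majority vote over the submissions the leaderboard "liked".
ensemble = (np.mean(kept, axis=0) > 0.5).astype(int)
print("apparent holdout accuracy:", (ensemble == y_holdout).mean())  # well above 0.5
# On genuinely fresh labels this ensemble would score about 0.5, since the labels
# are independent coin flips -- the gain is pure overfitting to the holdout.
```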

Reusable holdout sets

A standard proposal for coping with adaptivity is simply to discourage it. In the empirical sciences, this proposal is known as pre-registration and requires the researcher to specify the exact experimental setup ahead of time. While possible in some simple cases, it is in general too restrictive as it runs counter to today’s complex data analysis workflows.

Rather than limiting the analyst, our approach provides a means of reliably verifying the results of an arbitrary adaptive data analysis. The key tool for doing so is what we call the reusable holdout method. As with the classic holdout method discussed above, the analyst is given unfettered access to the training data. What changes is that there is a new algorithm in charge of evaluating statistics on the holdout set. This algorithm ensures that the holdout set maintains the essential guarantees of fresh data over the course of many estimation steps.

The limit of the method is determined by the size of the holdout set—the number of times that the holdout set may be used grows roughly as the square of the number of collected data points in the holdout, as our theory shows.

Armed with the reusable holdout, the analyst is free to explore the training data and verify tentative conclusions on the holdout set. It is now entirely safe to use any information provided by the holdout algorithm in the choice of new analyses to carry out, or the tweaking of existing models and parameters.
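The paper instantiates this idea in an algorithm called Thresholdout. The sketch below is a loose, simplified rendering of that pattern; the threshold, noise scale, and the absence of a usage budget are simplifications, not the paper's exact algorithm.

```python
# Sketch of a reusable-holdout-style mechanism (loosely modeled on Thresholdout):
# answer queries with the training estimate unless it drifts from the holdout
# estimate, and add noise whenever the holdout is actually consulted.
# Threshold and noise scale are illustrative, not the paper's exact settings.
import numpy as np

class ReusableHoldout:
    def __init__(self, holdout_data, threshold=0.04, sigma=0.01, seed=0):
        self.holdout = holdout_data
        self.threshold = threshold
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def query(self, statistic, training_data):
        train_val = statistic(training_data)
        holdout_val = statistic(self.holdout)
        gap = abs(train_val - holdout_val) + self.rng.laplace(0, self.sigma)
        if gap <= self.threshold:
            return train_val                                   # training answer is safe to reuse
        return holdout_val + self.rng.laplace(0, self.sigma)   # consult the holdout, noisily

# Example query: the fraction of positive values, asked of the mechanism
# instead of the raw holdout set.
rng = np.random.default_rng(1)
train, holdout = rng.normal(size=1000), rng.normal(size=1000)
mechanism = ReusableHoldout(holdout)
print(mechanism.query(lambda data: float(np.mean(data > 0)), train))
```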

A general methodology

The reusable holdout is only one instance of a broader methodology that is, perhaps surprisingly, based on differential privacy—a notion of privacy preservation in data analysis. At its core, differential privacy is a notion of stability requiring that any single sample should not influence the outcome of the analysis significantly.

Example of a stable learning algorithm: Deletion of any single data point does not affect the accuracy of the classifier much.

A beautiful line of work in machine learning shows that various notions of stability imply generalization. That is, any sample estimate computed by a stable algorithm (such as the prediction accuracy of a model on a sample) must be close to what we would observe on fresh data.

What sets differential privacy apart from other stability notions is that it is preserved by adaptive composition. Combining multiple algorithms that each preserve differential privacy yields a new algorithm that also satisfies differential privacy, albeit at some quantitative loss in the stability guarantee. This is true even if the output of one algorithm influences the choice of the next. This strong adaptive composition property is what makes differential privacy an excellent stability notion for adaptive data analysis.

In a nutshell, the reusable holdout mechanism is simply this: access the holdout set only through a suitable differentially private algorithm. It is important to note, however, that the user does not need to understand differential privacy to use our method. The user interface of the reusable holdout is the same as that of the widely used classical method.

Reliable benchmarks

A closely related work with Avrim Blum dives deeper into the problem of maintaining a reliable leaderboard in machine learning competitions (see this blog post for more background). While the reusable holdout could directly be used for this purpose, it turns out that a variant of the reusable holdout, which we call the Ladder algorithm, provides even better accuracy.
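A simplified sketch of the Ladder idea as described in that work: the leaderboard only updates a team's reported score when a new submission improves on the previous best by more than a fixed step size, which limits how much holdout information leaks back to the competitor. The parameter values below are illustrative, not the paper's recommendations.

```python
# Simplified Ladder-style leaderboard: report an improvement only when a
# submission beats the previous best by more than a step size, and round
# the reported score so small holdout fluctuations are not leaked.
class LadderLeaderboard:
    def __init__(self, step: float = 0.01):
        self.step = step
        self.best_reported = 0.0

    def submit(self, holdout_accuracy: float) -> float:
        if holdout_accuracy >= self.best_reported + self.step:
            self.best_reported = round(holdout_accuracy / self.step) * self.step
        return self.best_reported

board = LadderLeaderboard(step=0.01)
for accuracy in [0.612, 0.615, 0.633, 0.629, 0.651]:
    # The reported score advances only on clear improvements; near-ties
    # (0.615, 0.629) leave the leaderboard unchanged.
    print(board.submit(accuracy))
```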

This method is not just useful for machine learning competitions, since there are many problems that are roughly equivalent to that of maintaining an accurate leaderboard in a competition. Consider, for example, a performance benchmark that a company uses to test improvements to a system internally before deploying them in a production system. As the benchmark data set is used repeatedly and adaptively for tasks such as model selection, hyper-parameter search and testing, there is a danger that eventually the benchmark becomes unreliable.

Conclusion

Modern data analysis is inherently an adaptive process. Attempts to limit what data scientists will do in practice are ill-fated. Instead we should create tools that respect the usual workflow of data science while at the same time increasing the reliability of data-driven insights. It is our goal to continue exploring approaches that can help create more reliable validation techniques and benchmarks that track true performance more accurately than existing methods.


Mapping youth well-being worldwide with open data

Ryan Swanstrom is a blogger at Data Science 101. This post originally appeared on DataKind’s blog.

How does mapping child poverty in Washington DC help inform efforts to support child and young adult well-being in the UK and Kentucky?

Back in March 2012, a team of DataKind volunteers in Washington DC worked furiously to finish their final presentation at a weekend DataDive. Little did they know, the impact of their work would extend far beyond DC and far beyond the weekend. Their prototyped visualization ultimately became a polished tool that would impact communities worldwide.

DC Action for Children’s Data Tools 2.0 is an interactive visualization tool to explore the effects of income, healthcare, neighborhoods, and population on child well-being in the Washington DC area. The source code for Data Tools 2.0 and open data sources have since been used by DataKind UK and Code for America volunteers to benefit their local partners. There is now potential for it to reach even more communities through DataLook’s #openimpact Marathon.

See how far a solution can spread when you bring together open data, open code and open hearted volunteers around the world.

What a difference a DataDive makes

DC Action for Children, a Washington DC nonprofit focusing on child well-being, needed help understanding how Washington DC could be one of the most affluent cities in the United States, yet have one of the highest child poverty rates. Could mapping child poverty help uncover patterns and insights to drive action to address it?

A team of DataDive volunteers, led by Data Ambassador Sisi Wei, took on the challenge and, in less than 24 hours, created a prototype that wrangled data in a multitude of forms from government agencies, the Census Bureau, and DC Action for Children’s own databases. That 24-hour effort then evolved into a multi-month DataCorps project involving many DataKind volunteers. The team unveiled a more polished version to a large and influential audience in Washington DC, including the Mayor of DC himself! They then completed the final enhancements to create Data Tools 2.0, which is now live on DC Action for Children’s website.

The project has since released the source code on GitHub, and the team has continued to collaborate and advance the project to where it is today. In fact, if you’re local, check out the August 5th DataKind DC Meetup to join in and continue improving the tool.

This story alone speaks to the incredible commitment of these volunteers and the importance of having a strong partner like DC Action for Children to implement and utilize the work as an integrated part of its mission.

And that’s usually where the story ends. Thanks to DataKind’s global network though, the impact of this work was just starting to spread.

A Visualization Goes Viral

Because the visualization used open data (freely available data for public use) and open source software or code (freely available code that can be viewed, modified, and reused), other volunteers could quickly repurpose the work and apply it to their local community.

DataKind UK London DataDive

The first time the visualization was replicated was in October 2014 for The North East Child Poverty Commission (NECPC). The Commission had a similar challenge of wanting to better understand child poverty in the North East of England. A team at the London DataDive reused the code from Data Tools 2.0 and created a similar visualization for the region. This enabled the team to quickly produce valuable results that “thrilled” NECPC. One of the team’s Data Ambassadors continued to work with the organization and has since migrated the visualization to a different platform, Tableau.

DataKind UK Leeds DataDive

In April 2015, DataKind UK hosted another DataDive in Leeds with three charity partners, Volition, Voluntary Action Leeds and the Young Foundation, to tackle the structural causes of inequality in the city. All three charity teams came together to create a visualization tool that allows people to explore financial inequality, the number of young people not in education, employment, or training (NEETs), and mental health inequality. But they did not reinvent the wheel—they leveraged past work and repurposed code from DC Action for Children. Read more about the event in this recap from DataDive attendee Andy Dickinson.

Beyond the DataKind Network

Now, it’s great to see a solution scale within an organization’s network, but it’s even more impressive to see it scale beyond, in this case, into Kentucky and maybe one day India or Finland.

#HackForChange with Code For America

In June 2015, the city of Louisville, Kentucky teamed with the Civic Data Alliance to host a hackathon in honor of the National Day of Civic Hacking. Kentucky Youth Advocates, a nonprofit organization focused on “making Kentucky the best place in America to be a kid,” wanted to visually explore the factors affecting successful child outcomes across Council Districts. There is a large variance in child resources throughout the city, which is having an effect on child well-being. The volunteers repurposed the original code and used local publicly available data to create the Kentucky Youth Advocates Data Visualization, which is now helping the city of Louisville better distribute resources for children.

#openimpact Marathon

DC Action for Children is also one of the projects selected for the #openimpact Marathon hosted by DataLook. The goal of the marathon is to get people and groups to replicate existing data-driven projects for social good. So far, there is interest in replicating the Data Tools 2.0 visualization for child crimes in India and another potential replication for senior citizens in Finland. There is no telling where this visualization will end up helping next. Get involved!

Ok ok, but what is the impact of all this really?

Aren’t these just visualizations? Yes, and as any good data scientist knows, data visualizations are not an end in and of themselves. They are typically just one part of the overall process of gaining insight from data for some larger end goal. Similarly, open data in and of itself does not automatically mean impact. The data has to be easy to access, in the right formats, and people have to apply it to real-world challenges. Just because you build it (or open it) does not necessarily mean impact will come.

Yet visualizations and open data sources are often a critical first step to bigger outcomes. So what makes the difference between a flashy marketing tool and something that will help improve real people’s lives? The strength of the partner organization that will ultimately use it to create change in the world.

Data visualizations, open data and open source code alone are not going to end child poverty. People are going to end child poverty. The strength of the tool itself is less important than the strength of an organization’s strategy of how to use it to inform decision-making and conversation around a given issue.

Thankfully, DC Action for Children has been a tremendous partner and is using Data Tools 2.0 as a key part of its efforts to improve the lives of children in DC. It’s exciting to see the tool now spreading to equally impressive partners around the world.


Data for Good in Bangalore

Miriam Young is a Communications Specialist at DataKind.

At DataKind, we believe the same algorithms and computational techniques that help companies generate profit can help social change organizations increase their impact. As a global nonprofit, we harness the power of data science in the service of humanity by engaging data scientists and social change organizations on projects designed to address critical social issues.

Our global Chapter Network recently wrapped up a marathon of DataDives, helping local organizations with their data challenges over the course of a weekend. This post highlights two of the projects from DataKind Bangalore’s first DataDive earlier this year, where volunteers used data science to help support rural agriculture and combat urban corruption.

Digital Green

Founded in 2008, Digital Green is an international nonprofit development organization that builds and deploys information and communication technology to amplify the effectiveness of development efforts and effect sustained social change. They have a series of educational videos of agricultural best practices to help farmers in villages succeed.

The Challenge

Help farmers more easily find videos relevant to them by developing a recommendation engine that suggests videos based on open data on local agricultural conditions. The team was working with a collection of videos, each focused on a specific crop and accompanied by a description, but each description was in a different regional language. The challenge, then, was parsing and interpreting this information to use it as a descriptive feature for the video. To add another challenge, they needed geodata with the geographical boundaries of different regions to map the videos to regions with specific soil types and environmental conditions, but that data didn’t exist.

The Solution

The volunteers got to work preparing this dataset and published boundaries of 103,344 Indian villages, and they geocoded 1,062 Digital Green villages in Madhya Pradesh (MP) to 22 soil polygons. They then clustered 22 MP districts based on 179 feature vectors. They also mapped the villages that Digital Green works with into 5 agro-climatic clusters. Finally, the team developed a Hinglish parser that parses the Hindi titles of available videos and translates them to English to help the recommender system understand which crop each video relates to.
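As a rough illustration of the clustering step (simulated feature vectors and scikit-learn's k-means; this is not the volunteers' actual pipeline):

```python
# Sketch: cluster districts into agro-climatic groups from numeric feature
# vectors (soil, rainfall, temperature, etc.). Data here are simulated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_districts, n_features = 22, 179          # mirrors the counts mentioned above
features = rng.normal(size=(n_districts, n_features))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
for district, cluster in enumerate(kmeans.labels_):
    print(f"district {district} -> agro-climatic cluster {cluster}")
```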

I Change My City / Janaagraha

Janaagraha was established in 2001 as a nonprofit that aims to combine the efforts of the government and citizens to ensure a better quality of life in cities by improving urban infrastructure, services and civic engagement. Their civic portal, IChangeMyCity, promotes civic action at a neighborhood level by enabling citizens to report a complaint that then gets upvoted by the community and flagged for government officials to take action.

The Challenge

Deal with duplicate complaints that can clog the system and identify factors that delay open issues from being closed out.

The Solution

To deal with the problem of duplicate complaints, the team used Jaccard similarity and Cosine similarity on vectorized complaints to cluster similar complaints together. Disambiguation was performed by ward and geography. The model they built delivered a precision of more than 90%.
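A minimal sketch of that duplicate-detection idea, using invented complaints, token-set Jaccard similarity, and cosine similarity on TF-IDF vectors from scikit-learn (not Janaagraha's production code):

```python
# Sketch: flag likely duplicate complaints using Jaccard similarity on token
# sets and cosine similarity on TF-IDF vectors. Example complaints are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

complaints = [
    "Streetlight not working on 5th Main Road",
    "Street light broken near 5th Main Road",
    "Garbage not collected in Ward 12 for a week",
]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

tfidf = TfidfVectorizer().fit_transform(complaints)
cosine = cosine_similarity(tfidf)

for i in range(len(complaints)):
    for j in range(i + 1, len(complaints)):
        print(f"({i},{j}) jaccard={jaccard(complaints[i], complaints[j]):.2f} "
              f"cosine={cosine[i, j]:.2f}")
# Pairs above a chosen threshold (and in the same ward) would be clustered as duplicates.
```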

To identify the factors affecting closure of complaints by users and authorities, the team used two approaches. The first involved decision tree analysis, capturing attributes like comments, vote-ups, agency ID, subcategory, and so on. The second used logistic regression to predict closure probability, modeled as a function of complaint subcategory, ward, comment velocity, vote-ups, and similar factors.
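And a compact sketch of the second approach, modeling closure probability with logistic regression on a few simulated features (the feature set and data below are invented; the team's actual model used the complaint attributes listed above):

```python
# Sketch: model the probability that a complaint gets closed as a function of
# simple features (vote-ups, comment velocity, subcategory). Data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
vote_ups = rng.poisson(3, n)
comment_velocity = rng.exponential(1.0, n)
is_garbage_subcategory = rng.integers(0, 2, n)

# Simulated ground truth: more engagement makes closure more likely.
logit = -1.0 + 0.4 * vote_ups + 0.6 * comment_velocity + 0.3 * is_garbage_subcategory
closed = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([vote_ups, comment_velocity, is_garbage_subcategory])
model = LogisticRegression().fit(X, closed)
print("coefficients:", model.coef_[0])
print("predicted closure probability for a new complaint:",
      model.predict_proba([[5, 2.0, 1]])[0, 1])
```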

With these new features, iChangeMyCity will be able to better handle the large volume of incoming requests and Digital Green will be better able to serve farmers.

These initial findings are certainly valuable, but DataDives are actually much bigger than just weekend events. The weeks of preparation that go into them and months of impact that ripple out from them make them a step in an organization’s larger data science journey. This is certainly the case here, as both of these organizations are now exploring long-term projects with DataKind Bangalore to expand on this work.

Stay tuned for updates on these exciting projects to see what happens next!

Interested in getting involved? Find your local chapter and sign up to learn more about our upcoming events.


How do political campaigns use data analysis?

Looking through SSRN this morning, I came across a paper by David Nickerson (Notre Dame) and Todd Rogers (Harvard), “Political Campaigns and Big Data” (February 2014). It’s a nice follow-up to yesterday’s post about the software supporting new approaches to data analysis in Washington, DC.

In the paper, Nickerson and Rogers get into the math behind the statistical methods and supervised machine learning employed by political campaign analysts. They discuss the various types of predictive scores assigned to voters—responsiveness, behavior, and support—and the variety of data that analysts pull together to model and then target supporters and potential voters.

In the following excerpt, the authors explain how predictive scores are applied to maximize the value and efficiency of phone bank fundraising calls:

Campaigns use predictive scores to increase the efficiency of efforts to communicate with citizens. For example, professional fundraising phone banks typically charge $4 per completed call (often defined as reaching someone and getting through the entire script), regardless of how much is donated in the end. Suppose a campaign does not use predictive scores and finds that upon completion of the call 60 percent give nothing, 20 percent give $10, 10 percent give $20, and 10 percent give $60. This works out to an average of $10 per completed call. Now assume the campaign sampled a diverse pool of citizens for a wave of initial calls. It can then look through the voter database that includes all citizens it solicited for donations and all the donations it actually generated, along with other variables in the database such as past donation behavior, past volunteer activity, candidate support score, predicted household wealth, and Census-based neighborhood characteristics (Tam Cho and Gimpel 2007). It can then develop a fundraising behavior score that predicts the expected return for a call to a particular citizen. These scores are probabilistic, and of course it would be impossible to only call citizens who would donate $60, but large gains can quickly be realized. For instance, if a fundraising score eliminated half of the calls to citizens who would donate nothing, the resulting distribution would be 30 percent donate $0, 35 percent donate $10, 17.5 percent donate $20, and 17.5 percent donate $60. The expected revenue from each call would increase from $10 to $17.50. Fundraising scores that increase the proportion of big donor prospects relative to small donor prospects would further improve on these efficiency gains.
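The arithmetic in that excerpt is easy to verify (a quick illustration, not code from the paper):

```python
# Verify the expected revenue per completed call before and after targeting.
def expected_revenue(distribution):
    # distribution: list of (probability, donation amount in dollars)
    return sum(p * amount for p, amount in distribution)

before = [(0.60, 0), (0.20, 10), (0.10, 20), (0.10, 60)]
after  = [(0.30, 0), (0.35, 10), (0.175, 20), (0.175, 60)]

print(f"before targeting: ${expected_revenue(before):.2f} per call")  # $10.00
print(f"after targeting:  ${expected_revenue(after):.2f} per call")   # $17.50
```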

If you’ve ever wanted to know more about how campaigns use data analysis tools and techniques, this paper is a great primer.


Quorum: Is software the new Congressional intern?

Last month, a number of news outlets wrote about a startup called Quorum. Winner of the 2014 Harvard Innovation Challenge’s McKinley Family Grant for Innovation and Entrepreneurial Leadership in Social Enterprise, Quorum has amazing potential to create new ways for legislators to easily use data to understand their constituencies and track legislation—literally data for policymaking. Quorum even pulls data from the American Community Survey, which James Treat of the Census Bureau wrote about for this blog a few years back.

TechCrunch touts Quorum as a replacement for the hordes of summer Hill interns, while the Washington Post likens it to Moneyball for K Street.

Danny Crichton at TechCrunch writes:

The challenges are numerous in this space. “Figuring out who you should talk to is a really tough process,” Jonathan Marks, one co-founder of Quorum, explained. “This is a problem that a lot of our clients have, [since] there are tens of thousands of relationships in DC.” The challenge is magnified since those relationships change so often.

Another challenge is simply following legislation. Marks gave the example of a non-profit firm that wanted to develop a scorecard with grades for each congressman on several key votes (a common strategy these days in Washington advocacy). One firm had “three people spending 1.5 weeks to tabulate all the data.” An opposition research firm went through “6000 votes on abortion” to tabulate every single congressman’s legislative history. This was all done manually (i.e., with an army of interns).

But Quorum is not the first product of its kind. Bloomberg and CQ have long dominated with products targeted at this audience. But this is becoming a competitive space for entrepreneurs. Catherine Ho at the Washington Post explains:

Since 2010, at least four companies, ranging from start-ups to billion-dollar public corporations, have introduced new ways to sell data-based political and competitive intelligence that offers insight into the policymaking process.

[...]

Other companies are emerging in the space, some with success; for others, it’s too soon to tell.

Popvox, founded in 2010, is an online platform that collects correspondence between constituents and their representatives on bills, organizes the data by state, and packages the information in charts and maps so lawmakers can easily spot where voters stand on a proposed bill. An early win was when nearly 12,000 people nationwide used the platform to oppose a proposal to allow robo-calls to cellphones — the bill was withdrawn by its sponsors.

Popvox does not disclose its revenue, but co-founder Marci Harris said the platform has more than 400,000 users across every congressional district and has delivered more than 4 million constituent positions to Congress.

FiscalNote, which uses data-mining software and artificial intelligence to predict the outcome of legislation and regulations, has pulled in $19.4 million in capital since its 2013 start from big-name investors including Dallas Mavericks owner Mark Cuban, Yahoo co-founder Jerry Yang and the Winklevoss twins. The company says it achieves 94 percent accuracy. And Ipsos, the publicly traded market research and polling company, is amping up efforts to sell polling data to lobby firms.

For an academic’s take on the trend toward data in politics and campaigning, UNC assistant professor Daniel Kreiss published a great piece for the Stanford Law Review in 2012 titled “Yes We Can (Profile You),” which lays out the ways in which political campaigns employ sophisticated data analysis techniques to measure and target voters.


The Rural Broadband Digital Divide

Michael Curri is president and founder of Strategic Networks Group.

There is a high degree of awareness of how differences in Internet connectivity contribute to the “digital divide” experienced by many, if not most, rural areas. Less is understood about a very real divide that stems from (a lack of) utilization. That’s right: just as important as “speed” is how much businesses and non-commercial organizations actually utilize the Internet.

Using the data SNG has collected in numerous states between 2012 and February 2015, we can actually quantify this digital divide. Just as significantly, we can identify the types of organizations (industry, size, rural/urban, etc.) that are experiencing the greatest gap in utilization. To quantify utilization, SNG has developed a measure we call the Digital Economy index (DEi), which reflects how many Internet processes or applications an organization uses. We measure use of 17 applications on a ten-point scale (ten being best) to develop the DEi (e.g. an organization using 8 of 17 applications would have a DEi score of 4.7).
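Based on that description, the index itself is a straightforward calculation (a sketch; SNG's actual index may weight applications differently):

```python
# DEi sketch: score = (applications used / 17 tracked applications) * 10.
# Based on the description above; SNG's actual index may be weighted differently.
TRACKED_APPLICATIONS = 17

def dei_score(applications_used: int) -> float:
    return round(applications_used / TRACKED_APPLICATIONS * 10, 1)

print(dei_score(8))    # 4.7, matching the example in the text
print(dei_score(17))   # 10.0
```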

Collecting data in numerous states, each with rural and urban components, SNG has uncovered a digital divide based largely on the size of the community in which businesses are located: the more urban a community, the higher the DEi score. Regardless of the speed available, rural communities are utilizing the Internet and its applications at a lower rate, largely because in rural areas there is less knowledge transfer amongst peers and less of a market for specialized technical services.

Beyond the notable gap in Internet utilization between rural and urban areas, SNG’s research also reveals the sectors and types of organizations that suffer most from this digital divide. This is consistent with our findings that rural communities have far fewer local resources to support businesses looking to better utilize broadband applications.

For small towns and isolated small towns (in essence, the census terms for “rural”), local governments have the largest utilization gap compared to their metropolitan peers, with a DEi of 5.24 versus 7.17. Libraries also show a notable utilization gap (metro = 7.23; rural = 6.12). In contrast, K-12 schools of comparable size have very similar DEi scores regardless of how urban or rural they are.

When examining industry type, it is illuminating to see just how much variance there can be across sectors. Ironically, one of the biggest utilization gaps is in what might be considered the most advanced sector (Professional and Technical Services), which is large, growing, and well paying, but slow to adopt key Internet applications.

Larger businesses in rural areas (100 or more employees) still experience a utilization gap relative to their urban counterparts. Rural businesses with fewer than 100 employees experience a much larger gap.

So while fiber, net neutrality, and FCC decisions dominate the news, the success of broadband in driving impact depends on utilization.

This means that providing our rural businesses with the knowledge and support to leverage the Internet is key to maintaining competitiveness. Furthermore, in today’s landscape it is easier to live rural and work globally, as long as rural businesses have access to networks and support systems that help them thrive in the digital economy. Developing local networks and supports is a direct and significant opportunity (as well as challenge) for local business retention and growth. There are ways to achieve this, including SNG’s Small Business Growth Program. We’d love to share with you how this program can drive economic growth in your region.

See more here.
