Data for Good in Bangalore

Miriam Young is a Communications Specialist at DataKind.

At DataKind, we believe the same algorithms and computational techniques that help companies generate profit can help social change organizations increase their impact. As a global nonprofit, we harness the power of data science in the service of humanity by engaging data scientists and social change organizations on projects designed to address critical social issues.

Our global Chapter Network recently wrapped up a marathon of DataDives, helping local organizations with their data challenges over the course of a weekend. This post highlights two of the projects from DataKind Bangalore’s first DataDive earlier this year, where volunteers used data science to help support rural agriculture and combat urban corruption.

Digital Green

Founded in 2008, Digital Green is an international, nonprofit development organization that builds and deploys information and communication technology to amplify the effectiveness of development efforts and effect sustained social change. They have a series of educational videos on agricultural best practices to help farmers in villages succeed.

The Challenge

Help farmers more easily find videos relevant to them by developing a recommendation engine that suggests videos based on open data about local agricultural conditions. The team was working with a collection of videos, each focused on a specific crop and accompanied by a description, but each description was in a different regional language. The challenge, then, was parsing and interpreting this information to use it as a descriptive feature for the video. To add another challenge, they needed geodata with the geographical boundaries of different regions to map each video to a region with specific soil types and environmental conditions, but that data didn't exist.

The Solution

The volunteers got to work preparing this dataset, publishing boundaries for 103,344 Indian villages and geocoding 1,062 Digital Green villages in Madhya Pradesh (MP) to 22 soil polygons. They then clustered 22 MP districts based on 179 feature vectors and mapped the villages that Digital Green works with into five agro-climatic clusters. Finally, the team developed a Hinglish parser that parses the Hindi titles of available videos and translates them to English, helping the recommender system understand which crop each video relates to.
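To give a sense of the clustering step, here is a minimal sketch of how districts could be grouped into agro-climatic clusters from soil and climate feature vectors. It is illustrative only, not the DataDive team's actual code, and the file and column names are assumptions.

```python
# Illustrative sketch (not the DataDive team's code): grouping districts into
# agro-climatic clusters from per-district feature vectors.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

districts = pd.read_csv("mp_district_features.csv")  # hypothetical: one row per district
features = districts.drop(columns=["district"])      # e.g. soil shares, rainfall, temperature

# Standardize so no single feature dominates the distance metric
X = StandardScaler().fit_transform(features)

# Five clusters, matching the five agro-climatic groups mentioned above
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)
districts["agro_cluster"] = kmeans.fit_predict(X)

print(districts[["district", "agro_cluster"]].head())
```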

I Change My City / Janaagraha

Janaagraha was established in 2001 as a nonprofit that aims to combine the efforts of the government and citizens to ensure a better quality of life in cities by improving urban infrastructure, services, and civic engagement. Their civic portal, IChangeMyCity, promotes civic action at the neighborhood level by enabling citizens to report complaints that are then upvoted by the community and flagged for government officials to act on.

The Challenge

Deal with duplicate complaints that can clog the system and identify factors that delay open issues from being closed out.

The Solution

To deal with the problem of duplicate complaints, the team used Jaccard similarity and cosine similarity on vectorized complaints to cluster similar complaints together, disambiguating by ward and geography. The model they built delivered a precision of more than 90%.
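As a rough illustration of the approach (not Janaagraha's production code), the sketch below vectorizes complaint text with TF-IDF, computes cosine similarity within each ward, and flags pairs above a threshold. The sample data and the threshold value are assumptions.

```python
# Minimal sketch of duplicate detection: TF-IDF vectors + cosine similarity,
# disambiguated by ward. Data and threshold are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

complaints = pd.DataFrame({
    "id":   [1, 2, 3],
    "ward": ["W12", "W12", "W07"],
    "text": ["Pothole near the bus stop on 5th Main",
             "Large pothole at 5th Main bus stop",
             "Streetlight not working on 8th Cross"],
})

for ward, group in complaints.groupby("ward"):
    if len(group) < 2:
        continue
    tfidf = TfidfVectorizer().fit_transform(group["text"])
    sims = cosine_similarity(tfidf)
    ids = group["id"].tolist()
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sims[i, j] > 0.5:  # threshold chosen for illustration
                print(f"Possible duplicates in {ward}: {ids[i]} and {ids[j]}")
```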

To identify the factors affecting whether complaints get closed by users and authorities, the team used two approaches. The first involved decision tree analysis over attributes such as comments, vote-ups, agency ID, and subcategory. The second used logistic regression to predict closure probability, modeled as a function of complaint subcategory, ward, comment velocity, vote-ups, and similar factors.
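A minimal sketch of the second approach is below: a logistic regression over features like those listed above. The data file and column names are hypothetical stand-ins for the team's actual complaint export.

```python
# Hedged sketch of the logistic-regression approach to closure probability.
# The CSV and its columns are assumptions made for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("complaints.csv")  # hypothetical export of complaint records

# One-hot encode categorical features, keep numeric ones as-is
X = pd.get_dummies(df[["subcategory", "ward"]]).join(
    df[["comment_velocity", "vote_ups"]])
y = df["closed"]  # 1 if the complaint was closed, 0 if still open

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Held-out accuracy:", model.score(X_test, y_test))
print("Sample closure probabilities:", model.predict_proba(X_test)[:5, 1])
```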

With these new features, iChangeMyCity will be able to better handle the large volume of incoming requests and Digital Green will be better able to serve farmers.

These initial findings are certainly valuable, but DataDives are actually much bigger than just weekend events. The weeks of preparation that go into them and months of impact that ripple out from them make them a step in an organization’s larger data science journey. This is certainly the case here, as both of these organizations are now exploring long-term projects with DataKind Bangalore to expand on this work.

Stay tuned for updates on these exciting projects to see what happens next!

Interested in getting involved? Find your local chapter and sign up to learn more about our upcoming events.


Housing Data Hub – from Open Data to Information

Joy Bonaguro is Chief Data Officer for the City and County of San Francisco. This is a repost from April at DataSF.org announcing the launch of their Housing Data Hub.

Housing is a complex issue and it affects everyone in the City. However, there is not a lot of broadly shared knowledge about the existing portfolio of programs. The Hub puts all housing data in one place, visualizes it, and provides the program context. This is also the first of what we hope to be a series of strategic open data releases over time. Read more about that below or check out the Hub, which took a village to create!

Evolution of Open Data: Strategic Releases

The Housing Data Hub is also born out of a belief that simply publishing data is no longer sufficient. Open data programs need to take on the role of adding value to open data versus simply posting it and hoping for its use. Moreover, we are learning how important context is to understanding government datasets. While metadata is an essential part of context, it's a starting point, not an endpoint.

For us, a strategic release is one or more key datasets + a data product. A data product can be a report, a website, an analysis, a package of visualizations, an article…you get the idea. The key point: you have done something beyond simply publishing the data. You provide context and information that transforms the data into insights or helps inform a conversation. (P.S. That's also why we are excited about Socrata's new dataset user experience for our open data platform.)

Will we only do strategic releases?

No! First off – it’s a ton of work and requires amazing partnerships. Strategic (or thematic) releases should be a key part of an open data program but not the only part. We will continue to publish datasets per department plans (coming out formally this summer). And we’ll also continue to take data nominations to inform department plans.

We’ll reserve strategic releases to:

  • Address a pressing information gap or need
  • Inform issues of high public interest or concern
  • Tie together disparate data that may otherwise be used in isolation
  • Unpack complex policy areas through the thoughtful dissemination of open data
  • Pair data with the content and domain expertise that we are uniquely positioned to offer (e.g., answer the questions we receive over and over again in a scalable way)
  • Build data products that are unlikely to be built by the private sector
  • Solve cross-department reporting challenges

And leverage the open data program to expose the key datasets and provide context and visualizations via data products.

We also think this is a key part of broadening the value of open data. Open data portals have focused more on a technical audience (what we call our citizen programmers). Strategic releases can help democratize how governments disseminate their data for a local audience that may be focused on issues in addition to the apps and services built on government data. It can also be a means to increase internal buy-in and support for open data.

Next steps

As part of our rolling release, we will continue to work to automate the datasets feeding the hub. You can read more about our rollout process, inspired by the UK Government Digital Service. We'll also follow up with a technical post on the platform, which is available on GitHub, including how we are consuming the data via our open data APIs.
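As an illustration of what consuming a source dataset via the open data APIs might look like, here is a small sketch against Socrata's SODA endpoint; the dataset ID is a placeholder, not one of the hub's actual datasets.

```python
# Minimal sketch of pulling an open dataset over the Socrata SODA API.
# The dataset ID is a placeholder; substitute a real DataSF dataset ID.
import requests

DATASET_ID = "xxxx-xxxx"  # placeholder, hypothetical
url = f"https://data.sfgov.org/resource/{DATASET_ID}.json"

resp = requests.get(url, params={"$limit": 100})  # SODA supports $limit/$offset paging
resp.raise_for_status()
rows = resp.json()
print(f"Fetched {len(rows)} records")
```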


Five ways for states to make the most of open data

Mariko Davidson serves as an Innovation Fellow for the Commonwealth of Massachusetts where she works on all things open data. These opinions are her own. You can follow her @rikohi.

States struggle to define their role in the open data movement. With the exception of some state transportation agencies, states watch their municipalities publish local data, create some neat visualizations and applications, and get credit for being cool and innovative.

States see these successes and want to join the movement. Greater transparency! More efficient government! Innovation! The promise of open data is rich, sexy, and non-partisan. But when a state publishes something like obscure wildlife count data and the community does not engage with it, it can be disappointing.

States should leverage their unique role in government rather than mimic a municipal approach to open data. They must take a different approach to encourage civic engagement, more efficient government, and innovation. Here are a few recommendations based on my time as a fellow:

  1. States are a treasure trove of open data. This is still true. When prioritizing what data to publish, focus on the tangible data that impacts the lives of constituents—think aggregating 311 request data from across the state. Mark Headd, former Chief Data Officer for the City of Philadelphia, calls potholes the “gateway drug to civic engagement.”
  2. States can open up data sharing with their municipalities—which leads to a conversation on data standards. States can use their unique position to federate and facilitate data sharing with municipalities. This has a few immediate benefits: a) it allows citizens a centralized source to find all levels of data within the state; b) it increases communication between the municipalities and the state; and c) it begins to push a collective dialogue on data standards for better data sharing and usability.
  3. States in the US create an open data technology precedent for their towns and municipalities. Intentional or not, the state sets an open data technology standard—so they should leverage this power strategically. When a state selects a technology platform to catalog its data, it incentivizes municipalities and towns within the state to follow its lead. If a state chooses a SaaS solution, it creates a financial barrier to entry for municipalities that want to collaborate. The Federal Government understood this when it moved Data.gov to the open source solution CKAN. Bonus: open source software is free and embodies the free and transparent ethos of the greater open data movement.
  4. States can support municipalities and towns by offering open data as a service. This can be an opportunity to provide support to municipalities and towns that might not have the resources to stand up their own open data site.
  5. Finally, states can help facilitate an “innovation pipeline” by providing the data infrastructure and regularly connecting key civic technology actors with government leadership. Over the past few years, the civic technology movement has experienced a lot of success in cities, with groups like Code for America leading the charge through their local Brigade chapters. After publishing data and providing the open data infrastructure, states must also engage with the super users and data consumers. States should not shy away from these opportunities. More active state engagement is a crucial element, still missing in the civic innovation space, if states are to collectively create sustainable technology solutions for the communities they serve.


Data shows what millions knew: the Internet was really slow!

Meredith Whittaker is Open Source Research Lead at Google.

For much of 2013 and 2014, accessing major content and services was nearly impossible for millions of US Internet users. That sounds like a big deal, right? It is. But it’s also hard to document. Users complained, the press reported disputes between Netflix and Comcast, but the scope and extent of the problem wasn’t understood until late 2014.

This is thanks in large part to M-Lab, a broad collaboration of academic and industry researchers committed to openly and empirically measuring global Internet performance. Using a massive archive of open data, M-Lab researchers uncovered interconnection problems between Internet service providers (ISPs) that resulted in nationwide performance slowdowns. Their published report, ISP Interconnection and its Impact on Consumer Internet Performance, lays out the data.

To back up a moment—interconnection sounds complicated. It’s not. Interconnection is the means by which different networks connect to each other. This connection allows you to access online content and services hosted anywhere, not just content and services hosted by a single access provider (think AOL in the 1990s vs. today’s Internet). By definition, the Inter-net wouldn’t exist without interconnection.

Interconnection points are the places where Internet traffic crosses from one network to another. Uncongested interconnection points are critical to a healthy, open Internet. Put another way, it doesn’t matter how wide the road is on either side—if the bridge is too narrow, traffic will be slow.

M-Lab data and research exposed just such slowdowns. Let’s take a look…

The chart above shows download throughput data collected by M-Lab in NYC between February 2013 and September 2014. It reflects traffic between customers of Time Warner Cable, Verizon, and Comcast—major access ISPs—and an M-Lab server hosted on Cogent’s network. Cogent is a major transit ISP, and much of the content and many of the services people use are hosted on Cogent’s network and on similar transit networks. Traffic between people and the content they want to access has to move through an interconnection point between their ISP (TWC, Comcast, and Verizon, in this case) and Cogent. What we see here, then, is severe degradation of download throughput between these ISPs and Cogent that lasted for about a year. During this time, customers of these three ISPs attempting to access anything hosted on Cogent in NYC were subjected to severely slowed Internet performance.
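For readers who want to reproduce this kind of view, here is an illustrative sketch of the aggregation behind such a chart: median download throughput per month for each access/transit ISP pair. The CSV layout is an assumption; the real measurements live in M-Lab's public data archive.

```python
# Illustrative sketch only: monthly median download throughput per ISP pair.
# The input file and its columns are assumptions, not M-Lab's actual schema.
import pandas as pd

tests = pd.read_csv("mlab_ndt_nyc.csv", parse_dates=["test_time"])
# assumed columns: test_time, access_isp, transit_isp, download_mbps

monthly = (tests
           .assign(month=tests["test_time"].dt.to_period("M"))
           .groupby(["access_isp", "transit_isp", "month"])["download_mbps"]
           .median()
           .reset_index())

print(monthly.head())  # one row per ISP pair per month
```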

But maybe things are just slow, no?

Here you see download throughput in NYC during the same time period, for the same three ISPs (plus Cablevision). The difference: here they are accessing an M-Lab server hosted on Internap’s network (another transit ISP). In this case, in the same region, for the same general population of users, during the same time, download throughput was stable. Content and services accessed on Internap’s network performed just fine.

Couldn’t this just be Cogent’s problem? Another good question…

Here we return to Cogent. This graph spans the same time period, in NYC, looking again at download throughput across a Cogent interconnection point. The difference? We’re looking at traffic to customers of the ISP Cablevision.

Comparing these three graphs, we see M-Lab data exposing problems that aren’t specific to any one ISP, but rather stem from the relationship between pairs of ISPs—in this example, Cogent when paired with Time Warner Cable, Comcast, or Verizon. This relationship manifests, technically, as interconnection.

These graphs focus on NYC, but M-Lab saw similar patterns across the US as researchers examined performance trends across pairs of ISPs nationwide—e.g., wherever Comcast interconnected with Cogent. The research shows that interconnection-related performance issues were nationwide in scope and scale and continued for over a year. It also shows that these issues were not strictly technical in nature. In many cases, the same patterns of performance degradation existed across the US wherever a given pair of ISPs interconnected. This rules out a regional technical problem and instead points to business disputes as the cause of congestion.

M-Lab research shows that when interconnection goes bad, it’s not theoretical: it interferes with real people trying to do critical things. Good data and careful research helped to quantify the real, human impact of what had been relegated to technical discussion lists and sidebars in long policy documents. More focus on open data projects like M-Lab could help quantify the human impact across myriad issues, moving us from a hypothetical to a real and actionable understanding of how to draft better policies.


International Broadband Pricing Study: Updated Dataset

Derek Slater is a Policy Manager at Google.

Last year, we hired a respected consultancy, Communications Chambers, to produce an international dataset of retail broadband Internet connectivity prices. The dataset can be used to make international comparisons and evaluate the efficacy of particular public policies—e.g., direct regulation and oversight of Internet peering and termination charges—on consumer prices.

We received a lot of positive feedback and suggestions—thank you!—and have now made available an updated dataset.

  • A Fusion Table containing the price observations for 1,523 fixed broadband plans can be found here.
  • A Fusion Table containing 2,167 mobile broadband prices can be found here.
  • Explanatory notes here, and ancillary data is here.



Here comes the collaborative economy

When travelling, have you rented somebody’s flat as an alternative to booking a room in a hotel? Or preferred the car-sharing option to taking the train? These new ways of sharing resources are increasingly becoming common practice and are part of an emerging movement often referred to as the “collaborative” or “sharing” economy.

We are proud to support the “OuiShare Fest”, the first major European event dedicated to the collaborative economy, taking place in Paris from May 2 to May 4. During these three days, more than 600 entrepreneurs, designers, economists, investors, politicians and citizens will come together to reflect on how to build a collaborative future.

European Commission Vice President Neelie Kroes supports the project and has even opened up her blog to a guest post from OuiShare’s organizers.

The digital economy has proved a vector of economic growth throughout Europe. It has allowed for the emergence of horizontal and networked organizations that offer new opportunities in traditional sectors from health to transportation, education and finance. Online platforms that offer services such as crowdfunding, taxi-sharing or flat-renting are testimony to the rise of new business models which are based on a culture of openness and transparency.

OuiShare will do much to “connect” the actors of this new movement across Europe and we wish them a successful OuiShare Fest.

Posted by Florian Maganza, Policy Analyst, Paris