Social Data Collection Methods


The Australian Internet Observatory (AIO) involves developing a range of cutting-edge tools, processes and datasets, as well as utilising existing investments in tools, infrastructure and collaborative resources across the research infrastructure ecosystem. This article covers some of the methods and approaches we will use for collecting social data.

The Australian Internet Observatory will assist researchers in accessing and analysing a range of social data. This data may be derived directly from platforms through official or unofficial Application Programming Interfaces (APIs), directly from users in the form of crowdsourced or donated data, or through various other tools and processes. AIO may also import traditional social statistical data hosted on other national and international platforms, such as the HASS and Indigenous Research Data Commons (HASS&I RDC) and the Australian Urban Research Infrastructure Network (AURIN).

Data sourcing within AIO will involve bringing together and developing a range of tools and approaches, as illustrated in the project plan overview below.

[Figure: AIO project plan overview]

Data Donation Browser Extensions and Plugins

Using browser extensions to gather data donations from online platforms is a relatively new approach to collecting social data from volunteers who agree to participate by contributing data on selected online activities. It involves developing browser extensions or plugins that collect certain kinds of data or activity; these are made available through the extension stores of the targeted browsers, together with background information about the project. This approach makes it possible to shed light on otherwise black-box algorithms and to understand how they generate recommendations, evaluations and content-related decisions.

Data donations may be gathered from user trace data, for example when participants use social media platforms, apps, credit scoring services or online shopping sites, or via activities the extension runs in the background, such as simulated search queries. The installation procedure may also collect voluntarily supplied demographic information. These optional responses are valuable for investigating whether different demographic groups encounter different results, but do not allow researchers to identify or track individual users, on the platform being studied or elsewhere online. The extension does not access any personal information from the user's computer or online profile. The data donation approach requires a range of new tools and approaches as well as integration with existing systems. Researchers at ADM+S have developed browser extensions for two major research projects soliciting data donations on search engine results (Bruns 2022) and Facebook ads (Burgess et al. 2022). These approaches draw on and extend international projects such as those developed by AlgorithmWatch in Germany and ProPublica in the US.
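
To illustrate how donated records might be analysed, the sketch below groups the top search result for a single query by a self-reported demographic attribute. The donations.jsonl file and its record fields (query, results, age_group) are hypothetical; they stand in for whatever schema a given extension defines.

    import json
    from collections import defaultdict

    # Hypothetical donated records, one JSON object per line: each pairs an
    # optional demographic profile with the results the extension observed.
    with open("donations.jsonl") as f:
        donations = [json.loads(line) for line in f]

    # Group the top-ranked result for one query by a demographic attribute
    # to see whether different groups are shown different results.
    top_results = defaultdict(list)
    for d in donations:
        if d["query"] == "voting information":
            group = d.get("age_group", "undisclosed")
            top_results[group].append(d["results"][0]["url"])

    for group, urls in sorted(top_results.items()):
        print(group, len(urls), "donations,", len(set(urls)), "distinct top results")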

Application Programming Interfaces (APIs)

AIO will use various APIs for sourcing data and for capturing additional information related to digital platforms.

The Australian Digital Observatory

AIO builds on and extends the work of the Australian Digital Observatory (ADO), an API-based data harvesting platform and set of tools based at QUT and the University of Melbourne and supported by the ARDC. The ADO provides bespoke tools developed to support the collection and analysis of social media platforms, such as Twitter and Reddit, that provide APIs.

The ADO provides access to datasets such as the Australian Twittersphere, a longitudinal, curated collection of tweets from approximately 838,000 Twitter accounts identified as 'Australian'. The Digital Observatory has maintained reliable, ongoing data collection since early 2018, with the collection rate growing from approximately 22 million to approximately 41 million tweets per month. There is also an archive of approximately 2 billion tweets from 2006 to 2016.

ADO also provides a range of tools for collecting and analysing Twitter data, including:

  • ADOReD: an interactive dashboard for high-level social media analysis. It can be used to derive a list of tweet IDs relevant to a research question.

  • twarc hydrate: an open-source command-line tool for extracting data from the Twitter API. It can be used to hydrate full tweet content from tweet IDs (see the sketch below).

  • tidy_tweet: an in-development open-source Python library for processing raw responses from the Twitter API into an SQLite database.

  • tweet_exploR: an in-development R package providing descriptive statistics and visualisation of Twitter data within an SQLite database produced by tidy_tweet.
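
As a concrete illustration of the hydration step, the following is a minimal Python sketch using the twarc library that underpins twarc hydrate. The bearer token and the ids.txt input file are placeholders for this example.

    # pip install twarc
    from twarc import Twarc2

    # A Twitter/X API bearer token is assumed; supply your own credentials.
    client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

    # Read previously collected tweet IDs (e.g. derived via ADOReD) from a
    # text file with one ID per line; "ids.txt" is a placeholder filename.
    with open("ids.txt") as f:
        tweet_ids = [line.strip() for line in f if line.strip()]

    # tweet_lookup() hydrates the IDs into full tweet objects, yielding the
    # API responses in pages of up to 100 tweets.
    for page in client.tweet_lookup(tweet_ids):
        for tweet in page.get("data", []):
            print(tweet["id"], tweet["text"][:80])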

Other API services

Other APIs will also be used where available, for example Meta’s Ad Library API.
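
As a sketch of how such an API might be queried from Python: the endpoint below is Meta's ads_archive Graph API endpoint, but the API version, access token and query parameters shown are illustrative assumptions (Meta requires identity verification before granting Ad Library API access).

    # pip install requests
    import requests

    ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder; obtained via Meta's access process

    # Query the Ad Library for political/issue ads that reached Australia.
    resp = requests.get(
        "https://graph.facebook.com/v19.0/ads_archive",
        params={
            "search_terms": "climate",
            "ad_type": "POLITICAL_AND_ISSUE_ADS",
            "ad_reached_countries": '["AU"]',
            "fields": "id,page_name,ad_creative_bodies,ad_delivery_start_time",
            "access_token": ACCESS_TOKEN,
        },
        timeout=30,
    )
    resp.raise_for_status()
    for ad in resp.json().get("data", []):
        print(ad.get("page_name"), ad.get("id"))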

Data download packages (DDP)

Data Download Packages (DDP) involve users donating their existing digital trace data, e.g. text message history or browsing history. The Open-Source Data Donation Framework (OSD2F), developed by researchers at the Digicomlab, University of Amsterdam, provides a guide to this approach (Araujo et al. 2022). The DDP approach leverages recent developments in data rights, such as the EU General Data Protection Regulation (GDPR), which give people more control over their own data: companies are now required to make digital trace data available on request in a machine-readable format.

OSD2F positions the individual as a data subject who maintains agency over their data, and it aims to enable data access by researchers in a way that respects individual rights. It is built with privacy, transparency and flexibility as key design principles, and it also makes researchers less dependent on the commercial organisations that store this data in proprietary archives.

OSD2F generates a web interface for participants to donate their data and, if authentication is configured, for researchers to download the donated data. Its core is written in Python, with the web server based on Flask. JavaScript is used for the interactive UI elements, including (a) the steps where participants load their DDP locally, extract the allow-listed files from the DDP and upload them to the server, and (b) the interface for data visualisation and selection. Configuration files use the YAML format. The production version of OSD2F can be installed on a traditional server or as a cloud-based web app (e.g. Azure Web App, Amazon Elastic Beanstalk, Google App Engine, Heroku).
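
To make the upload step concrete, here is a minimal Flask sketch of the pattern described above. It is an illustrative simplification, not OSD2F's actual code; the endpoint name, form field and allow-listed filenames are assumptions.

    # pip install flask
    import os
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    os.makedirs("donations", exist_ok=True)

    # Only files named on the allow-list are accepted from a donated package;
    # these filenames are hypothetical examples.
    ALLOWED_FILES = {"search_history.json", "ad_interests.json"}

    @app.post("/upload")
    def upload():
        # The browser-side JavaScript extracts allow-listed files from the
        # participant's data download package and posts them to this endpoint.
        stored = []
        for file in request.files.getlist("files"):
            if file.filename in ALLOWED_FILES:
                # A real deployment would anonymise and filter before storage.
                file.save(os.path.join("donations", file.filename))
                stored.append(file.filename)
        return jsonify({"stored": stored})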

[Figure: Data Download Package workflow. Source: Araujo et al. 2022]

Data crowdsourcing, labelling and annotation

Crowdsourced data collection occurs when researchers enlist the services, or draw on the collective wisdom, of a diverse group of people to research, survey or provide feedback, paid or unpaid, via online databases or crowdsourcing platforms. It is gaining popularity because it is convenient, cheap and relatively fast. Data crowdsourcing has been used by large research firms for years across a variety of markets, and is increasingly used in development and emergency management, political advocacy and local government. A number of commercial and open-source apps and websites crowdsource data from individuals across the world who report issues with their local traffic, weather and urban conditions.

Data labelling and annotation is a different type of data collection: an important stage of data preprocessing that relies on either in-house labelling or crowdsourcing approaches. Commercial platforms that provide access to 'crowd-, platform-, and app-based' workers include Amazon Mechanical Turk, Appen, Scale AI, Clickworker and Isahit. Open-source systems such as Zooniverse, iNaturalist, Wikidata and Pybossa are also popular for citizen science data crowdsourcing.

Crowdsourcing tools will be required for recruitment, surveys, data donations, annotation, labelling and data analysis; AIO will work with existing systems and develop its own platform and extensions as needed.

Web scraping tools

Web scraping is commonly used to gather data from web platforms where no official API or other data download mechanism exists. It is useful for monitoring government websites, news and media content, public forums and commercial platforms, for example to track changes to Terms of Service documents.
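
A minimal sketch of this kind of change-tracking, assuming a hypothetical Terms of Service URL: the page's visible text is fingerprinted so that re-running the script on a schedule reveals when the document has changed.

    # pip install requests beautifulsoup4
    import hashlib
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/terms-of-service"  # hypothetical page to monitor

    resp = requests.get(URL, timeout=30)
    resp.raise_for_status()

    # Strip markup down to visible text, then hash it; comparing hashes across
    # runs detects content changes without storing full page snapshots.
    text = BeautifulSoup(resp.text, "html.parser").get_text(" ", strip=True)
    print(hashlib.sha256(text.encode("utf-8")).hexdigest())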

Synthetic data generation and Generative AI

Synthetic data generation is another relatively new approach, which can be used to develop and test data science algorithms and models without compromising legal rights or research ethics standards (when used appropriately). Tools include open-source systems such as the Synthetic Data Vault, developed by the Data to AI Lab at MIT, which allows synthetic data to be generated in place of, or in addition to, real data. Adapting this technology within AIO will enable researchers in Australia to conduct ground-breaking studies in areas such as public health, behavioural science, algorithmic bias and recommendation systems.
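
As a brief sketch of the workflow, assuming a recent (1.x) release of the sdv Python package (the API has changed across versions) and a small made-up table standing in for sensitive real data:

    # pip install sdv pandas
    import pandas as pd
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    # Made-up data standing in for a sensitive real table.
    real = pd.DataFrame({
        "age": [23, 35, 41, 29, 52, 38],
        "sessions_per_week": [14, 3, 7, 21, 2, 9],
    })

    # Learn the table's structure and statistical properties from the real data...
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real)

    # ...then sample synthetic rows that mimic the distribution without
    # reproducing any individual record.
    print(synthesizer.sample(num_rows=5))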


Date published: 29 June 2023

Last updated: 16 June 2024
