Tools > Web Archiving

Web Archiving and Web Archive Manager

Collecting data directly from the web is a fundamental need for many researchers, but the methods can be challenging. Manually copying and pasting content is slow and prone to error, while automated web scraping often requires advanced coding skills and can be blocked by modern, dynamic websites like social media platforms. The AIO’s Web Archiving toolkit provides an accessible and powerful middle ground. It allows researchers to create high-fidelity, complete, and interactive archives of web pages as they appear in a browser. This method is ideal for preserving and analysing the full context of online content, from social media groups to news websites.

The Web Archive Manager (WAM) is part of the toolkit to aid researchers in working with web archives. It allows you to organise, replay, and annotate your archive files (.wacz or .warc). You can also use WAM to scrape data into structured formats. WAM is developed by QUT Digital Observatory, as part of the Australian Internet Observatory (AIO). AIO received co-investment (doi.org/10.3565/hjrp-b141) from the Australian Research Data Commons (ARDC) through the HASS and Indigenous Research Data Commons. The ARDC is enabled by the National Collaborative Research Infrastructure Strategy (NCRIS).

  1. Simple Tool InstallationThe researcher begins by installing a simple, lightweight extension into their web browser. This tool requires no complex setup.
  2. Start ArchivingTo begin a collection session, the researcher simply navigates to the target website and clicks the "Start Archiving" button in the extension.
  3. Capture Dynamic ContentThe researcher then browses the site as a normal user would—scrolling through feeds, clicking links, and interacting with content. The extension runs in the background, recording everything that is rendered in the browser window, including all text, images, links, and the underlying structure of the page.
  4. Save the Complete ArchiveOnce the desired content has been captured, the researcher stops the recording. The entire session is saved as a single, self-contained archive file. This file can be viewed offline and functions as an interactive snapshot of the website at that exact moment in time.
  5. Data Extraction and AnalysisA raw web archive contains a huge amount of unstructured data. The AIO is developing tools that will allow researchers to easily process these archive files, extracting structured information.

How does web archiving work?

Web archiving plugin and WAM tool installation

To create an interactive archive, researchers need to follow an easy, two-step installation process:

  1. Installing a browser plugin that will do the heavy lifting and capture websites.

  2. Installing a tool that further enables researchers to work with the created archive.

Installing the browser plugin

This plugin will record a copy of the website and create an archive that we can then digest and replay using WAM.

1. In your Chrome Window, click on the following link: https://chromewebstore.google.com/detail/webrecorder-archivewebpag/fpeoodllldobpkbkabpblcfaogecpndd. Alternatively, you can navigate to the homepage of our recommended third-party plugin for Web Archiving - https://archiveweb.page/. Then, go to “Install Extension from Chrome Web Store.”

2. Click the “Add to Chrome” button to install the extension .

3. In the pop-up window, confirm that you are adding the extension.

4. Once the extension has been successfully installed, you will see a notification in the top right corner of the Chrome window (next to the address bar).

5. For the next step, please download and install the WAM tool available at:

Download WAM (Web Archive Manager)-0.0.1a

Follow the instructions on the screen to complete the installation process.

Archiving a webpage

1. To start creating a web archive, navigate to the target page that you want to start your recording from. Once you are there, on the same page, click the extension icon (looks like a puzzle piece) and select the plugin we installed in the previous step - Webrecorder Archive Web - from the dropdown menu.

If you do not have the puzzle piece icon in your Chrome window, click on the three vertical dots icon in the upper right corner and choose “Extensions” -> “Manage Extensions”, and use the toggle switch to turn on the Webrecorder Archive Web extension.

2. Confirm that you want to start web archiving by clicking on the “Start Archiving” button. Wait until the page has been reloaded, or do it manually by clicking on Chrome’s “Refresh” button in the top left corner next to the address bar.

3. Browse the website, interacting with all the links and sections that you need to record. Make sure that all the elements that you need from this website are displayed in the browser window (e.g., post comments are fully unscrolled, interactive tabs are switched, and the page is fully loaded). You might not be able to see some elements that were not displayed or properly rendered.

4. Whenever you want to stop recording your web archive, click on the extension icon next to the address bar in the top right corner and select “Stop Recording”. If needed, you can navigate to another website to continue recording and repeat the previous steps.

5. Once you finish web archiving, click on the “View Archived pages” link in the extension pop-up. It will redirect you to a new tab.

6. Select the archives that you would like to download, or click on the “Download” dropdown menu and select a corresponding option to download all archives. It is preferred that you choose WACZ as the default option that works well with our tool. Store the file in any location on your computer.

Using WAM to replay and edit web archives

In WAM, you manage web archives by projects. A default project already exists when you start the application.

Importing archives

To import archives into your project, select File > Import archives. Only files with .warc and wacz extensions are accepted.

If the archives are imported successfully, you will see them in the right table.

Replaying archives

To replay an imported archive, select the archive and click Replay.

Once in the replay view, you will see a list of recorded pages. If a page is recorded multiple times, you will see all of them in this list. To view only the latest version, tick Deduplicate pages.

To replay a page, simply click on the page. To go back to the page list, click Back to List.

Adding metadata

In addition to a web archive’s standard metadata, you can annotate your archives by adding custom metadata. To view the archive’s existing data, in the archive view, select an archive and click Edit metadata. The standard metadata are file name, the timestamp when the archive is created, and the file size. These cannot be changed or removed.

To add tags to an archive, select the tags row, and click Edit. Once you have added or edited the tags, click Add. You might have to click the tags row again for the updates to show.

Ideally, a tag should contain no spaces or other special characters, except for hyphens (“-”) and underscores (“_“). If you want to add multiple tags, separate each one with a comma and no space.

Adding new custom metadata

To add new custom metadata, click Add metadata. You will need to specify a title for the metadata, and the value.

Naming conventions for metadata titles

Metadata titles are case sensitive, meaning metadataTitle and metadatatitle will be taken as two different things.

No spaces or special characters are allowed. The only special character you can use is @.

Standard metadata such as file, timestamp, size, and tags are reserved, so if you try to add a metadata title using these words, they will be ignored.

Examples of acceptable metadata titles: authorName, authorname, @authorName.

Creating new projects

To create a new project, either click New in the sidebar, or Projects > New in the top menu.

You can also add, edit or remove metadata to projects. Select a project in the project sidebar, then click Edit from the project sidebar or from the top menu. The same rules apply to naming metadata titles.

If you would like to learn more about WAM and Web Archiving, please also check out additional information available at wam.digitalobservatory.net.au and the QUT Digital Observatory, which manages the project. If you have any questions, contact us using the button below or send an email to digitalobservatory@qut.edu.au.

If you are looking to export data from an existing archive, consider using Warcex, our command line tool.

Contact Us

The browser plugin is a versatile tool for ethically accessing data on the desktop web. If you want to learn more about this method or have a project that can potentially benefit from this form of data donation, please contact us.