No boundaries: Exfiltration of personal data by session-replay scripts
This is the first post in our “No Boundaries” series, in which we reveal how third-party scripts on websites have been extracting personal information in increasingly intrusive ways. 
by Steven Englehardt, Gunes Acar, and Arvind Narayanan
Update: we’ve released our data — the list of sites with session-replay scripts, and the sites where we’ve confirmed recording by third parties.
You may know that most websites have third-party analytics scripts that record which pages you visit and the searches you make. But lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder.
The stated purpose of this data collection includes gathering insights into how users interact with websites and discovering broken or confusing pages. However, the extent of data collected by these services far exceeds user expectations; text typed into forms is collected before the user submits the form, and precise mouse movements are saved, all without any visual indication to the user. This data can’t reasonably be expected to be kept anonymous. In fact, some companies allow publishers to explicitly link recordings to a user’s real identity.
For this study we analyzed seven of the top session replay companies (based on their relative popularity in our measurements). The services studied are Yandex, FullStory, Hotjar, UserReplay, Smartlook, Clicktale, and SessionCam. We found these services in use on 482 of the Alexa top 50,000 sites.
This video shows the “co-browse” feature of one company, where the publisher can watch user sessions live.
What can go wrong? In short, a lot.
Collection of page content by third-party replay scripts may cause sensitive information displayed on a page, such as medical conditions, credit card details, and other personal information, to leak to the third party as part of the recording. This may expose users to identity theft, online scams, and other unwanted behavior. The same is true for the collection of user inputs during checkout and registration processes.
The replay services offer a combination of manual and automatic redaction tools that allow publishers to exclude sensitive information from recordings. However, in order for leaks to be avoided, publishers would need to diligently check and scrub all pages which display or accept user information. For dynamically generated sites, this process would involve inspecting the underlying web application’s server-side code. Further, this process would need to be repeated every time a site is updated or the web application that powers the site is changed.
A thorough redaction process is actually a requirement for several of the recording services, which explicitly forbid the collection of user data. This negates the core premise of these session replay scripts, which are marketed as plug and play. For example, Hotjar’s homepage advertises: “Set up Hotjar with one script in a matter of seconds” and Smartlook’s sign-up procedure features their script tag next to a timer with the tagline “every minute you lose is a lot of video”.
To better understand the effectiveness of these redaction practices, we set up test pages and installed replay scripts from six of the seven companies. From the results of these tests, as well as an analysis of a number of live sites, we highlight four types of vulnerabilities below:
1. Passwords are included in session recordings. All of the services studied attempt to prevent password leaks by automatically excluding password input fields from recordings. However, mobile-friendly login boxes that use text inputs to store unmasked passwords are not redacted by this rule, unless the publisher manually adds redaction tags to exclude them. We found at least one website where the password entered into a registration form leaked to SessionCam, even if the form was never submitted.
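To make the gap concrete, here is a minimal sketch of a type-based exclusion rule and the case it misses. It is an illustration only: the `InputField` model, the `record_value` function, and the `[excluded]` placeholder are hypothetical names of ours, and none of this is taken from any vendor's actual code.

```python
# Hypothetical model of a recorder's automatic password rule.
# This is an assumption for illustration, not any vendor's implementation.
from dataclasses import dataclass


@dataclass
class InputField:
    name: str                        # form field name
    input_type: str                  # value of the HTML `type` attribute
    value: str                       # what the user typed
    manually_redacted: bool = False  # publisher added a redaction tag


def record_value(field: InputField) -> str:
    """Return what a purely type-based rule would store in the recording."""
    if field.input_type == "password" or field.manually_redacted:
        return "[excluded]"
    # Any other input type is recorded exactly as typed.
    return field.value


standard = InputField("password", "password", "hunter2")
# A mobile-friendly login box that shows the password in a plain text input.
mobile_friendly = InputField("password", "text", "hunter2")
# The same box after the publisher manually tags it for redaction.
tagged = InputField("password", "text", "hunter2", manually_redacted=True)

for f in (standard, mobile_friendly, tagged):
    print(f.input_type, f.manually_redacted, "->", record_value(f))
```

Under these assumptions, only the untagged text input ends up in the recording, which is why the rule depends on the publisher noticing and tagging every such field.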
2. Sensitive user inputs are redacted in a partial and imperfect way. As users interact with a site they will provide sensitive data during account creation, while making a purchase, or while searching the site. Session recording scripts can use keystroke or input element loggers to collect this data.
All of the companies studied offer some mitigation through automated redaction, but the coverage offered varies greatly by provider. UserReplay and SessionCam replace all user input with masking text of equivalent length, while FullStory, Hotjar, and Smartlook exclude specific input fields by type. We summarize the redaction of other fields in the table below.
Automated redaction is imperfect; fields are redacted by input element type or heuristics, which may not always match the implementation used by publishers. For example, FullStory redacts credit card fields with the `autocomplete` attribute set to `cc-number`, but will collect any credit card numbers included in forms without this attribute.
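The difference between the two strategies can be sketched in a few lines. This is a simplified illustration under our own assumptions: `mask_all` and `exclude_by_attributes` are hypothetical functions, form fields are modeled as plain dictionaries of attributes, and the rules shown are not the services' actual logic.

```python
# Hypothetical sketch of two automated redaction strategies.
# The rules are illustrative assumptions, not the vendors' actual implementations.

def mask_all(value: str) -> str:
    """Blanket masking: replace every character, preserving only the length."""
    return "*" * len(value)


def exclude_by_attributes(attrs: dict, value: str) -> str:
    """Heuristic exclusion: drop the value only when the markup matches a known pattern."""
    if attrs.get("type") == "password":
        return ""
    if attrs.get("autocomplete") == "cc-number":
        return ""
    # Anything the heuristic does not recognize is recorded as typed.
    return value


card = "4111111111111111"  # a commonly used test card number
annotated = {"type": "text", "autocomplete": "cc-number"}
unannotated = {"type": "text", "name": "cardnum"}  # no autocomplete hint

print(mask_all(card))
print(repr(exclude_by_attributes(annotated, card)))
print(repr(exclude_by_attributes(unannotated, card)))
```

In this sketch, blanket masking hides the card number regardless of markup (while still revealing its length), whereas the attribute heuristic protects it only when the publisher's form happens to carry the expected attribute.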