Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

Pages

Posts

Opted Out, Yet Tracked: Are Regulations Enough to Protect Your Privacy?

2 minute read

Overview. GDPR (in the EU) and CCPA (in California) are two of the first "user data rights" policies in existence. The goal of these policies is to give users rights over their data, which has been commodified by advertisers and trackers. These rights enable users to know what is collected from them, why it is collected, and by whom. In addition, users can opt out of their data being shared or sold.

Compliance Measurement. Enforcement of these laws has led to several big tech companies being served massive fines for non-compliance. However, most of these cases have been reactive (triggered by media or consumer reports) rather than proactive, largely because these laws are not yet mature and there is no systematic mechanism to measure compliance with these rights. Keeping this in mind, the authors of this work create a framework that measures compliance with the "opt out" right.

Compliance Proxy. Like most of these user rights, the opt-out right has no systematic way of being measured other than the word of the businesses asked to stop selling/sharing user data. To this end, the authors use advertiser bids as a proxy to measure compliance. Based on insights from previous work, they hypothesize that advertisers bid differently on users they know more about compared to unknown users. Hence, a user who opts out of their data being shared/sold should receive different bid values than users who do not.

Crawling Infrastructure. The measurement infrastructure has three major components: (1) persona training, (2) managing opt-outs, and (3) collecting ads. The first component follows previous work that trains personas online (by creating search/browsing history). To opt out of businesses selling their data, the authors use OpenWPM to automate managing opt-outs on ~20 websites that support opting out via CookieBot and OneTrust (platforms that manage execution of user data rights). After opting out, they use prebid.js (a JavaScript API) to collect ad bids from websites. They repeat this process for 16 categories of personas (derived from Alexa) plus one control profile (no browsing history). Finally, to measure the effect of opting in versus opting out, they repeat the entire process while opting in to businesses selling/sharing their data.
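
The paper's crawler is built on OpenWPM; purely to illustrate the bid-collection step, here is a minimal Selenium sketch that reads bid responses exposed by prebid.js on a page that uses it (the URL and the fixed wait are placeholders, not the authors' pipeline):

```python
# Illustrative only: read prebid.js bid responses via Selenium.
# The real study uses OpenWPM with persona-specific browsing profiles.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://news.example")   # placeholder for a prebid.js-enabled site
time.sleep(10)                       # crude wait for header-bidding auctions to finish

bids = driver.execute_script("""
    if (typeof pbjs === 'undefined' || !pbjs.getBidResponses) { return []; }
    const out = [];
    const responses = pbjs.getBidResponses();
    Object.keys(responses).forEach(adUnit => {
        (responses[adUnit].bids || []).forEach(bid => {
            out.push({adUnit: adUnit, bidder: bid.bidder, cpm: bid.cpm});
        });
    });
    return out;
""")
print(bids)   # e.g. [{'adUnit': 'div-1', 'bidder': 'appnexus', 'cpm': 0.42}, ...]
driver.quit()
```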

Results. Analyzing the bid values, the authors find no significant difference between opting in and opting out, indicating that the opt-out functionality either has no effect or is implemented incorrectly. Furthermore, they show advertisers bid higher for trained personas than for the control profile, indicating prior knowledge about the persona. Since opting out should have restricted this knowledge flow, this points to a lack of compliance with the opt-out right.
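
The exact statistical tests are not reproduced here; as a sketch of the comparison, one can test whether two sets of CPM values (opt-out vs. opt-in runs, collected as above) come from the same distribution, e.g. with a Mann-Whitney U test over made-up numbers:

```python
# Sketch: compare bid (CPM) distributions for opted-out vs. opted-in runs.
from scipy.stats import mannwhitneyu

cpm_opted_out = [0.44, 0.52, 0.40, 0.59, 0.45]   # hypothetical values
cpm_opted_in  = [0.41, 0.55, 0.38, 0.62, 0.47]

stat, p_value = mannwhitneyu(cpm_opted_out, cpm_opted_in, alternative="two-sided")
print(f"U={stat:.1f}, p={p_value:.3f}")   # a large p-value => no detectable difference
```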

Conclusion. In summary, this paper sheds light on how businesses avoid compliance with user data rights. Furthermore, it highlights both the difficulty and the need for a systematic compliance-measurement framework.

Cookie Swap Party: Abusing First-Party Cookies for Web Tracking

5 minute read

Overview. Most of the JavaScript (JS) code on a website is provided by third parties for purposes such as analytics, ads, etc. However, this third-party JS executes in the context of the first party on the website. This means third-party JS has first-party privileges when executing, enabling unanticipated activity that is detrimental to user privacy. In this paper, the authors quantify the prevalence and usage of one of these activities: third-party JS accessing first-party cookies.

Problem Background. A user's cookies are accessed by first/third parties in two ways: (1) HTTP request headers and (2) the document.cookie API. The former has been studied in detail by researchers, and there are several ways to mitigate third parties accessing cookies via HTTP requests, e.g., ad blockers and privacy-preserving browsers. However, since the latter is a browser API that third-party JS calls while running in the context of the first party, it is non-trivial for browsers to mitigate its use by third parties. The authors refer to cookies set by third parties using the document.cookie API as external cookies.

Motivation. To see why external cookies are problematic, consider how a third party can identify a returning user on a website: (1) a user visits a website, which sets a first-party cookie; (2) a third party on that website writes an identifier into the user's cookie for that website; (3) when the user returns, the cookie still contains the identifier, so the third party can recognize the same user. Identifying a returning user is a tracking issue, but not as severe as tracking a user across websites, which external cookies also enable: if the identifier set by the third party is based on the user's fingerprint (a value unique to that user across the web), the third party can identify the same user on every website where it is present. Furthermore, since any party can access external cookies via document.cookie, third party A can read an external cookie set by third party B and sync data with it, gaining more knowledge about the user. This motivates analyzing how external cookies are created and who reads them later on.
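
As a concrete illustration of steps (2) and (3), the sketch below uses Selenium to write an identifier into the first-party cookie jar via document.cookie and read it back on a later visit. Selenium drives the page's own context, so this only approximates a script injected by a third party; the cookie name and value are made up.

```python
# Approximate illustration of an "external cookie": a value written through
# document.cookie lands in the first-party cookie jar and survives revisits.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# Write: stash an identifier in the first-party cookie jar (what an embedded
# third-party script could do, since it runs in the first-party context).
driver.execute_script("document.cookie = '_ext_id=user-12345; path=/; max-age=31536000';")

# Read on a later visit: any script on the page, including other third
# parties, can recover the identifier via document.cookie.
driver.get("https://example.com")
print(driver.execute_script("return document.cookie;"))   # contains _ext_id=user-12345

driver.quit()
```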

Setup. In this work, researchers from NCSU and UIC modify Google Chrome's source code to add functionality that flags usage of external cookies. They do so by leveraging a technique called taint analysis, which marks (taints) a resource (in this case a cookie value) on its creation and logs when it is accessed thereafter. This lets them build a provenance graph for external cookies. Using this instrumented Chrome, they analyze the top 10,000 Alexa-ranked websites to measure the prevalence and usage of external cookies.

Creating Provenance. The provenance graph for external cookies has two major parts: first, marking cookies when they are set by third-party JS via document.cookie; second, flagging when these cookies are read thereafter. For the first part, whenever a script sets or updates a cookie and the domain serving the script does not match the first-party domain, the browser marks the cookie as tainted by that script's domain. For the second part, whenever a network request is generated, the browser (based on heuristics) searches the URL for tainted cookie values and records the destination domain as well. Combining these two, the authors identify the prevalence and usage of external cookies.
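
The actual instrumentation lives inside Chromium's JS engine and network stack; the following is only an offline sketch, over a hypothetical log, of the matching step that links tainted cookie values to the requests that carry them:

```python
# Offline sketch of the read-side check: flag outgoing requests whose URL
# contains a tainted (third-party-set) cookie value. All inputs are hypothetical.
from urllib.parse import urlparse

# cookie value -> domain of the third-party script that set it (the taint label)
tainted_cookies = {
    "abc123userid": "tracker-a.example",
}

outgoing_requests = [
    "https://sync.tracker-b.example/pixel?uid=abc123userid",
    "https://cdn.site.example/app.js",
]

for url in outgoing_requests:
    for value, setter in tainted_cookies.items():
        if value in url:
            reader = urlparse(url).netloc
            # Provenance edge: a cookie set by `setter` is read/exfiltrated by `reader`.
            print(f"{setter} -> {reader} via cookie value {value!r}")
```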

Prevalence Results. Of the 10,000 sites in the crawl, the authors found external cookies on 97%. On these websites, they record 13,323 non-session external cookies, keyed by <JS_domain, cookie_name>. Next, using well-established heuristics from previous research, they show that 31% of these external cookies contained tracking identifiers. Having tracking identifiers in roughly a third of the external cookies across the top 10K websites is an alarming result for the reasons given in the motivation above.

Usage Results. It is clear that external cookies are being created at a high rate, which motivates measuring how these cookies are used (if at all) and by whom. Among the 4,212 cookies that contained tracking IDs, 3,256 were observed in a network request, enabling the creation of the provenance graph discussed earlier. Of these, 2,354 external cookies are read by a third party different from the one that created them, a sign of a high collusion rate between third parties via external cookies. Even more problematic, 3 of these external cookies were based on the user's fingerprint.

Concluding Remarks. Summing up, the authors expose a significant problem in the tracking ecosystem that has been overlooked by the community. They show how easy it is for third parties not only to track and fingerprint users but also to share this information with their partners, without worrying about any mitigation technique (as none currently exists).

Data Portability between Online Services: An Empirical Analysis on the Effectiveness of GDPR Art. 20

4 minute read

Data Portability. As of May 2018, the EU's General Data Protection Regulation (GDPR) is in effect, with the goal of increasing user data transparency and privacy. One major pillar of GDPR is giving users access to the data that services have collected about them. Consequently, GDPR contains a clause (Art. 20) prompting the development of infrastructure that allows moving data from one service to another. This clause aims to solve two problems: (1) giving users access to their data (via export) and (2) leveling the playing field by making it easier for users to move to smaller services. In an era where services thrive on user data, this is highly attractive for smaller services (theoretically, at least).

Services. Due to the absence of a platform for direct data transfer between services, the authors analyze indirect data transfer (exporting from one service and importing into another). They do this by manually analyzing 182 services (including the top 100 Alexa services). Their dataset creation is manual but extremely thorough, as they record all interactions, time windows, formats, correspondence, etc. while importing and exporting data.

Dataset. To produce real-world data, the authors created accounts and manually interacted with each service. It is interesting to note that services acquire four categories of data from users: received (given directly, e.g., email, clicks), observed (collected by sensors), inferred, and predicted. The scope of the following analysis is limited to received data.

Questions. Using regression models, the authors were able to answer the following questions (a minimal regression sketch follows the list):

* Are services with higher rank more compliant?

* Do services with higher rank provide less data to users?

* Do higher rank services use extensive authentication for data porting?

* Do higher rank services provide faster data transfer?

* Do lower rank services provide more and better import opportunities?
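
The paper's exact model specifications are not reproduced here; as a minimal sketch, a logistic regression of a binary compliance outcome on (log) Alexa rank, over hypothetical data, could look like this:

```python
# Minimal sketch (hypothetical data and column names, not the authors' models):
# does Alexa rank predict whether a service's data export was compliant?
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "alexa_rank": [5, 40, 300, 900, 2500, 7000, 20000, 60000, 150000, 400000],
    "compliant":  [1,  1,   0,   1,    1,    0,     1,     1,      0,      1],
})

# Log-transform rank so a handful of top sites does not dominate the fit.
X = sm.add_constant(np.log(df["alexa_rank"]).rename("log_rank"))
model = sm.Logit(df["compliant"], X).fit(disp=False)
print(model.summary())   # the coefficient/p-value on log_rank answers "is rank significant?"
```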

Compliance. Any service that processes a data export request within the legally allowed time (30 days), provides the data in a machine-readable format (JSON, HTML, XML), and provides all of the user's received data is considered compliant. Only 74% of the 182 services facilitated data export at some level, and popularity (Alexa rank) was not a significant predictor of overall compliance.
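
Restating these criteria compactly (field names are hypothetical):

```python
# Compact restatement of the compliance definition above; fields are illustrative.
from dataclasses import dataclass

MACHINE_READABLE = {"JSON", "HTML", "XML"}

@dataclass
class ExportResult:
    days_to_fulfill: int            # time from request to data delivery
    file_format: str                # e.g. "JSON", "CSV", "PDF"
    includes_all_received_data: bool

def is_compliant(result: ExportResult) -> bool:
    return (result.days_to_fulfill <= 30
            and result.file_format.upper() in MACHINE_READABLE
            and result.includes_all_received_data)

print(is_compliant(ExportResult(12, "JSON", True)))    # True
print(is_compliant(ExportResult(45, "JSON", True)))    # False (too slow)
```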

Scope of Data. Intuitively, one would expect higher-ranked services to export less data than lower-ranked ones, on the hunch that popular services value exclusive possession of unique data. Surprisingly, higher rank was significantly related to providing more data on export. In some cases, higher-ranked services even provided observed data.

Authentication. Analyzing data scope remains incomplete without examining authentication during data export, since large amounts of data in malicious hands can cause serious harm. The data shows that higher-ranked services require significantly more authentication than lower-ranked ones. This is a positive result, given that higher-ranked services were shown to provide more data on export.

Transfer Speed and Import Opportunities. Finally, the authors analyzed how quickly a service processed a data transfer and whether it offered import opportunities. They expected higher-ranked services to show slower processing times and lower-ranked services to provide more import opportunities. They found that higher rank had no significant effect on processing times and, adding to the unexpected results, saw no relation between lower rank and more import opportunities.

Discussion. This paper highlights several interesting results and offers possible explanations regarding compliance with data portability. The authors argue that the absence of a direct transfer platform is one major reason higher-ranked services are more compliant: complying builds trust with their audience, while the lack of direct transfer reduces the probability that users will actually move their data (for convenience reasons), making compliance a win-win for higher-ranked services. Furthermore, they argue that a lack of awareness is behind lower-ranked services not seizing this golden opportunity.

Setting the Bar Low: Are Websites Complying With the Minimum Requirements of the CCPA?

4 minute read

CCPA. The California Consumer Privacy Act (CCPA) is the first and, at the moment, only law in the USA that protects a consumer's personal information (PI). It equips Californians with the right to know who is collecting their PI and for what purpose. It also allows users to opt out of their data being collected/sold and to request that websites delete their data.

Application. CCPA applies to any business that collects or sells Californians' PI and (1) has annual gross revenue > $25M USD, or (2) sells the PI of > 50K Californians annually, or (3) derives > 50% of its annual revenue from the sale of Californians' PI. Businesses that fall under CCPA have several requirements to fulfill, one of which is displaying a "Do Not Sell My Personal Information" (DNSMPI) link to users.

Compliance. Due to the complexity of the online ecosystem, analyzing CCPA compliance is a non-trivial task. However, by analyzing a measurable component of CCPA, the problem can be reduced to a measurement task. Keeping this in mind, this work focuses on analyzing CCPA compliance through the lens of DNSMPI links. Since integrating a DNSMPI link is a low-cost task for any business, analyzing its adoption sets a lower bound for CCPA compliance overall.

Analysis. This work investigates several aspects of CCPA through the lens of the DNSMPI link: (1) overall adoption, (2) whether implementations are updated as the policy changes, (3) geo-fencing of CCPA guidelines, and (4) the presentational implementation of the guidelines (font, size, placement).

Exemptions. As a verification step, they filter out businesses exempt from CCPA and hence from the DNSMPI requirement. To build this filter, they quantify unique user counts (from semrush.com) and tracker presence on the websites. They argue that tracker presence combined with the number of unique users indicates data sharing/selling, giving them a good sense of which websites should abide by CCPA.

Instrumentation. To analyze the various aspects of the DNSMPI link, they first created a set of 1M websites based on popularity and tracker presence. Next, using Python and the Chrome DevTools Protocol, they ran three crawls (a crawl consists of visiting the set of websites and gathering data about DNSMPI links). Crawl 1 measured global adoption and applicability of DNSMPI links, crawl 2 measured locational and longitudinal differences, and crawl 3 measured the presentation characteristics of DNSMPI links.
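
The paper's crawler is browser-based (Chrome DevTools Protocol); purely as an illustration of the detection step, a simplified static check for a DNSMPI link could look like this (the phrasing regex and URL are placeholders, and a real crawl must render JavaScript):

```python
# Simplified DNSMPI-link detector over static HTML; the paper's crawler is
# browser-based and far more thorough.
import re
import requests
from bs4 import BeautifulSoup

DNSMPI_PATTERN = re.compile(r"do\s+not\s+sell\s+my\s+personal\s+information", re.I)

def has_dnsmpi_link(url: str) -> bool:
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    # Look for an anchor whose visible text matches the DNSMPI phrasing.
    return any(DNSMPI_PATTERN.search(a.get_text(" ", strip=True))
               for a in soup.find_all("a"))

print(has_dnsmpi_link("https://example.com"))
```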

DNSMPI Adoption. Analyzing the 1M websites from crawl 1 showed trackers and advertisers present on most of them, making them highly likely to fall under CCPA. However, fewer than 2% of the websites integrate a DNSMPI link, and these websites are spread above and (mostly) below the unique-visitor threshold set by CCPA. On the other hand, privacy policy (PP) and terms of service (ToS) links were present on more than 80% of the websites. The authors argue that, since FTC regulation of PP and ToS links since the 90s has made them widely adopted, this should serve as motivation to keep regulating DNSMPI (CCPA) over time as well.

Updated DNSMPI. After the first crawl, the CCPA regulations changed the required text of the DNSMPI link, presenting a perfect opportunity to measure how (if at all) websites keep up with CCPA. After running a second crawl, the authors observed that several sites had removed their DNSMPI links, while the increase in overall adoption was very low. Furthermore, most (~90%) of the websites had not updated their DNSMPI links, showing a one-time, initial-compliance attitude.

Dynamically Hiding DNSMPI. The second crawl was run from two IPs (Virginia and California) to measure whether websites dynamically hide the DNSMPI link from visitors outside California. They find ~2,000 websites that use geo-fencing and hide the link from non-Californians. They argue this number is low but nonetheless problematic.

Semantics of DNSMPI. Using results from crawls 1 and 2, the authors analyze the presentation of DNSMPI links in their third crawl. They observe that the prominence of the link text (the ratio of the link's text size to other text on the page) is in accordance with FTC guidelines, and the placement of the link is similar to that of PP and ToS links.

Concluding Remarks. Overall, this paper highlights several key points regarding CCPA. First, compliance is extremely difficult to measure due to the lack of a proper vantage point; as mitigation, CCPA could, for instance, in addition to requiring a PP link, require websites to state clearly in their PP that they do not sell users' PI, in accordance with CCPA. Second, these results should encourage regulators to be more proactive, improving the quality of current regulations and of those to follow.

TrackerSift: Untangling Mixed Tracking and Functional Web Resources

2 minute read

Tracking Lists. EasyList and EasyPrivacy (EL, EP) are two of the most well-known open-source resources driving privacy-preserving web extensions such as uBlock, AdBlock, and Ghostery. These lists consist of URL patterns that belong to online advertisers/trackers, and the extensions leverage them to block matching network requests. However, these lists have their limitations: (1) slow maintenance due to a handful of contributors, and (2) inability to block mixed resources. The focus of this paper is analyzing and mitigating the latter issue.
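
As a rough illustration of how such URL-pattern rules are applied (real EasyList/EasyPrivacy syntax supports many more anchors, options, and exception rules than this sketch):

```python
# Toy filter-list matcher: two illustrative rules, heavily simplified.
import re

RAW_RULES = ["||adserver.example^", "/pixel/track?"]

def rule_to_regex(rule: str) -> re.Pattern:
    if rule.startswith("||"):               # "||domain^" matches the domain and its subdomains
        body = re.escape(rule[2:].rstrip("^"))
        return re.compile(r"^https?://([^/]*\.)?" + body + r"(/|$)")
    return re.compile(re.escape(rule))      # otherwise treat the rule as a plain substring

COMPILED = [rule_to_regex(r) for r in RAW_RULES]

def should_block(url: str) -> bool:
    return any(p.search(url) for p in COMPILED)

print(should_block("https://cdn.adserver.example/ad.js"))   # True
print(should_block("https://news.example/article.html"))    # False
```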

What are Mixed Resources? To circumvent blocking lists, trackers either change their network location (domain, URL, etc.) or mix functional content with tracking content, e.g., serve both from the same network endpoint or CDN. A resource that uses the latter technique is called a mixed resource. Mixed resources are problematic for blocking lists: blocking the functional content breaks websites, while allowing the resource increases privacy harm. This paper presents TrackerSift, a framework that untangles functional and tracking content served by mixed resources.

TrackerSift. At a high level, the framework performs a hierarchical analysis of request URLs at multiple granularities. At each level, a URL is either classified (so it can be blocked or allowed) or passed down for analysis at a finer granularity. The levels are domain, hostname, script, and method, in increasing order of granularity.

Before analyzing the URLs, TrackerSift uses EL and EP as ground truth to label each incoming request as functional or tracking. Then, at level 1, TrackerSift extracts the domain from each URL. If a domain serves significantly more tracking requests than functional ones (or vice versa), it is categorized as a tracking (or functional) domain; otherwise it is categorized as mixed and sent to level 2. At level 2 the deciding factor is the hostname, at level 3 the script generating the request, and at level 4 the method inside the script responsible for the request. Blocking lists can leverage the untangled mixed resources to create rules based on the level at which they were untangled.
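
A minimal sketch of the level-1 (domain) decision, assuming requests have already been labeled with EL/EP; the 90% threshold is illustrative, not the paper's exact cutoff:

```python
# Level-1 sketch: classify each domain as tracking, functional, or mixed
# based on the share of labeled requests it serves.
from collections import Counter
from urllib.parse import urlparse

def classify_domains(labeled_requests, threshold=0.9):
    """labeled_requests: iterable of (url, label), label in {'tracking', 'functional'}."""
    per_domain = {}
    for url, label in labeled_requests:
        per_domain.setdefault(urlparse(url).netloc, Counter())[label] += 1
    verdicts = {}
    for domain, counts in per_domain.items():
        total = sum(counts.values())
        if counts["tracking"] / total >= threshold:
            verdicts[domain] = "tracking"      # block at the domain level
        elif counts["functional"] / total >= threshold:
            verdicts[domain] = "functional"    # allow at the domain level
        else:
            verdicts[domain] = "mixed"         # escalate to hostname/script/method
    return verdicts

labeled = [
    ("https://cdn.mixed.example/lib.js",  "functional"),
    ("https://cdn.mixed.example/beacon",  "tracking"),
    ("https://ads.tracker.example/pixel", "tracking"),
]
print(classify_domains(labeled))   # {'cdn.mixed.example': 'mixed', 'ads.tracker.example': 'tracking'}
```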

Analysis. To analyze the characteristics of mixed resources and performance of TrackerSift, the authors ran a crawl on 100k sites to gather network requests and stack traces. By processing the requests with TrackerSift they untangled ~25,000 requests.

Applicability. Untangling at levels 1, 2, and 3 is ideal for blocking lists, as they already support creating rules with domain-, hostname-, and script-based options. Method-based untangling is slightly more complex: it requires generating surrogate scripts with the tracking methods removed. TrackerSift can provide both new rules and surrogate scripts based on its analysis.

Concluding Remarks. This paper addresses a crucial gap in the blocking-list research area. TrackerSift increases privacy while retaining functionality, making blocking more practical to deploy, and it encourages improving the state of blocking lists via more granular analysis.


portfolio

publications

talks

Talk 1 on Relevant Topic in Your Field

This is a description of your talk, which is a markdown file that can be markdown-ified like any other post. Yay markdown!

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.