Building a National AI Research Resource

A Blueprint for the National Research Cloud

Stanford Institute for Human-Centered Artificial Intelligence | October 2021

Report Overview

"Building a National AI Research Resource: A Blueprint for the National Research Cloud" is a white paper published by Stanford's Institute for Human-Centered Artificial Intelligence (HAI) in October 2021. The report provides detailed analysis and recommendations for creating a National AI Research Resource (NAIRR), which the report refers to as the National Research Cloud (NRC), to democratize access to AI research infrastructure.

Key Insight: The National Research Cloud represents a critical investment in America's AI innovation ecosystem, addressing the growing imbalance between industry and academic research capabilities by providing access to computational resources and government datasets.

Key Data Points

22: Universities that co-signed the original NRC proposal
$67.9B: Global private investment in AI (2021)
2/3: Share of AI Ph.D.s who now go to industry rather than academia
82%: Share of algorithms that originated from federally funded nonprofits and universities

Key Insights Summary

AI Innovation Ecosystem Challenges

The current AI innovation ecosystem faces serious challenges, including brain drain of researchers from academia to industry, limited access to cutting-edge computational resources for academic researchers, and concentration of AI talent in a few elite institutions and companies.

Complementarity Between Compute and Data

One of the systemic challenges in AI research is the decoupling of compute resources from data infrastructures. High-performance computing can be useless without data, while access to valuable data often requires secure, privacy-protecting computing environments.

Rebalancing AI Research

The NRC aims to rebalance AI research toward long-term, academic, and non-commercial research. Public investment in basic AI infrastructure can support innovation in the public interest and complement private innovation efforts.

Dual Investment Strategy for Compute

The report recommends a dual investment strategy: quickly launching the NRC by subsidizing commercial cloud computing while simultaneously investing in a pilot for public infrastructure to assess long-term viability.

Tiered Data Access Model

A tiered access model is recommended for government data, with researchers gaining default access to public data and applying through streamlined processes for higher-security data on a project-specific basis.

Organizational Design Recommendations

The report recommends establishing the NRC as a Federally Funded Research and Development Center (FFRDC) in the short term, transitioning to a public-private partnership (PPP) model in the long run.

Content Overview

Executive Summary

Artificial intelligence (AI) appears poised to transform the economy across sectors ranging from healthcare and finance to retail and education. This "Fourth Industrial Revolution" is driven by three key trends: greater availability of data, increases in computing power, and improvements to algorithm design.

Yet the AI innovation ecosystem faces serious challenges. Computing power has become critical for AI advancement, but the high cost of compute has made cutting-edge AI research accessible only to key industry players and a handful of elite universities. Access to data, the raw ingredient used to train most AI models, is increasingly limited to the private sector and large platforms.

The National AI Research Resource Task Force Act, enacted in January 2021, created a Task Force to study and plan for the implementation of a "National Artificial Intelligence Research Resource" (NAIRR), also referred to as the National Research Cloud (NRC). As proposed, the NRC would provide affordable access to high-end computational resources, large-scale government datasets in a secure cloud environment, and necessary expertise through partnerships between academia, government, and industry.

The report identifies three primary themes: complementarity between compute and data, rebalancing AI research toward long-term academic and non-commercial research, and coordinating short-term and long-term approaches to creating the NRC.

Introduction

In March 2020, Stanford's Institute for Human-Centered Artificial Intelligence (HAI) published an open letter, co-signed by Presidents and Provosts of 22 top universities, urging adoption of a National Research Cloud (NRC). The NRC proposal aims to close a significant gap in access to computing and data that has distorted the long-term trajectory of AI research.

This White Paper is the culmination of a two-quarter, independent policy practicum at Stanford Law School's Policy Lab program, which brought together law, business, and engineering students to contemplate key design dimensions of the NRC. The team interviewed and convened a wide range of stakeholders, including privacy attorneys, cloud computing technologists, government data experts, cybersecurity professionals, potential users, and public interest groups.

The proposal for an NRC is ambitious, covering eligibility criteria, compute infrastructure, data access models, organizational design, privacy compliance, ethical safeguards, cybersecurity, and intellectual property considerations.

A Theory for a National Research Cloud

This chapter articulates a theory of impact for the NRC, addressing the market failure in AI research infrastructure. While AI innovation appears vibrant in the United States, current commercialization masks systematic underinvestment in basic, non-commercial AI research that could ensure the long-term health of technological innovation.

The case for the NRC is grounded in both efficiency and distributive rationales. First, the NRC may yield positive externalities by supporting investments in basic research that may be commercialized decades later. Second, it may help level the playing field by broadening researcher access to both compute and data.

The chapter identifies shifting trends in AI research, including the migration of talent from academia to industry, disparities between academic institutions in access to resources, and the focus of industry research on private profit rather than public benefit.

Risks of federal inaction include slowing basic AI research that has paved the way for advances in AI and machine learning, and widening significant inequalities in the AI landscape.

Eligibility, Allocation, and Infrastructure for Computing

This chapter discusses eligibility, resource allocation, and computing infrastructure for the NRC. Researcher eligibility should begin with "Principal Investigator" status at U.S. universities, tracking the most common criterion for federal research funding.

The report recommends a hybrid approach to resource allocation: universal default access for the majority of researchers, with a grant process for excess computing beyond the default allocation. This approach keeps administrative costs low while enabling tailoring for high-need users.
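The hybrid allocation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the default allocation figure, the ComputeRequest fields, and the allocate function are hypothetical names and values chosen for the example, not figures or interfaces taken from the report.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical default allocation, for illustration only.
# The report does not specify this number.
DEFAULT_ALLOCATION_HOURS = 1_000


@dataclass
class ComputeRequest:
    researcher: str
    requested_hours: int
    has_approved_grant: bool = False  # outcome of the (assumed) grant process


def allocate(request: ComputeRequest) -> Tuple[int, str]:
    """Return (granted_hours, route) under a default-plus-grant rule."""
    # Most researchers stay within the default and need no review.
    if request.requested_hours <= DEFAULT_ALLOCATION_HOURS:
        return request.requested_hours, "default"
    # High-need users receive the excess only after a grant is approved.
    if request.has_approved_grant:
        return request.requested_hours, "grant"
    # Otherwise the request is capped at the default allocation.
    return DEFAULT_ALLOCATION_HOURS, "default (excess requires grant)"


if __name__ == "__main__":
    print(allocate(ComputeRequest("pi_a", 200)))
    print(allocate(ComputeRequest("pi_b", 5_000, has_approved_grant=True)))
    print(allocate(ComputeRequest("pi_c", 5_000)))
```

In this sketch, only requests above the default ever touch the review path, which is the property the report credits with keeping administrative costs low.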

For computing infrastructure, the report recommends a dual investment strategy: quickly launching the NRC by subsidizing commercial cloud computing while simultaneously investing in a pilot for public infrastructure. Cost comparisons show that building standalone public infrastructure is projected to be significantly less expensive than relying on commercial cloud services over the long term.

Case studies examined include NSF CloudBank, XSEDE, Fugaku (Japan's supercomputer), and Compute Canada, each offering different models for eligibility, allocation, and ownership of compute resources.

Securing Data Access

This chapter addresses how to store and provide access to datasets through the NRC. The report focuses on facilitating access to government data rather than private sector data, due to existing mechanisms for private data sharing and complex intellectual property concerns.

The current patchwork system for accessing federal data presents significant barriers. Agencies typically require data-use agreements (DUAs), but the process for negotiating DUAs is highly fragmented and inconsistent across government agencies.

The report recommends a tiered data access and storage model based on the Federal Risk and Authorization Management Program (FedRAMP), which categorizes systems into low, moderate, or high impact levels. Researchers would have default access to low-risk datasets, with streamlined processes for accessing higher-risk data.
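A minimal sketch of how such a tiered access check might look is given below. The three impact levels mirror the FedRAMP categories named above; everything else (the Dataset class, the can_access function, the dataset names, and the representation of approvals as project and dataset pairs) is a hypothetical simplification for illustration, not a design specified in the report.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet, Tuple


class ImpactLevel(IntEnum):
    """FedRAMP-style impact levels: low, moderate, high."""
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    level: ImpactLevel


# Assumed representation: one (project_id, dataset_name) pair per approval
# granted through the streamlined, project-specific process.
Approvals = FrozenSet[Tuple[str, str]]


def can_access(dataset: Dataset, project_id: str, approvals: Approvals) -> bool:
    """Default access to low-impact data; project-specific approval otherwise."""
    if dataset.level == ImpactLevel.LOW:
        return True
    return (project_id, dataset.name) in approvals


if __name__ == "__main__":
    public = Dataset("public_weather", ImpactLevel.LOW)
    restricted = Dataset("restricted_health", ImpactLevel.HIGH)
    approvals = frozenset({("project-42", "restricted_health")})
    print(can_access(public, "project-7", approvals))       # low tier: default access
    print(can_access(restricted, "project-7", approvals))   # no approval on file
    print(can_access(restricted, "project-42", approvals))  # approved for this project
```

Keying approvals to a project and dataset pair, rather than to a researcher, reflects the project-specific basis for higher-security access described in the report.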

The NRC can help harmonize the fragmented federal data-sharing landscape by promoting inter-agency standardization and adoption of modern data access standards. Case studies include the Coleridge Initiative's Administrative Data Research Facility and Stanford's Center for Population Health Sciences.

Organizational Design

This chapter addresses the institutional form the NRC should take, considering ease of access to data and ease of coordination with compute resources.

The report recommends establishing the NRC as a Federally Funded Research and Development Center (FFRDC) in the short term. FFRDCs are quasi-governmental nonprofit corporations sponsored by a federal agency but operated by contractors. This model facilitates access to data through close agency relationships while maintaining independent administration.

In the longer term, the report recommends transitioning to a public-private partnership (PPP) model. PPPs can increase the quality and quantity of R&D, increase the value and efficiency of sharing public sector data, and reduce long-run maintenance costs.

The Science & Technology Policy Institute (STPI) serves as a case study for the FFRDC model, demonstrating how multiple agency co-sponsors can reduce difficulties in accessing data across agencies.

Data Privacy Compliance

This chapter addresses data privacy compliance challenges, particularly those arising from the Privacy Act of 1974. The Act is intended to put a check on interagency data sharing and disclosure of sensitive data without consent.

To avoid conflicts with the Act's limits on non-consensual interagency data sharing, the report recommends that the NRC not be instituted as its own federal agency, and that federal agency staff not be allowed access to interagency data.

To avoid conflicts with the Act's "no disclosure without consent" requirement, any data released to the NRC must not be individually identifiable. The majority of AI research will likely fall under the Act's statistical research exception.

Given concerns about potential privacy risks, federal agencies may wish to make data sharing contingent on the use of technical privacy measures such as differential privacy.

Technical Privacy and Virtual Data Safe Rooms

This chapter explores technical approaches to privacy protection, including virtual "data safe rooms" that enable researchers to access data in a secure, monitored, cloud-based environment.

Privacy-enhancing technologies (PETs) such as differential privacy, homomorphic encryption, and secure multi-party computation can help protect sensitive data while enabling research. However, these approaches are no panacea and should not substitute for robust data access policies.
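To make one of these techniques concrete, the following is a minimal sketch of differential privacy applied to a counting query using the standard Laplace mechanism. It is not drawn from the report: the function names (laplace_noise, dp_count), the sample records, and the choice of epsilon are assumptions made for illustration, and a production system would need a vetted library and careful privacy-budget accounting.

```python
import random
from typing import Callable, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(
    records: Iterable[dict],
    predicate: Callable[[dict], bool],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a noisy count satisfying epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    query's sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    people = [{"age": a} for a in (23, 35, 41, 58, 62, 67, 70)]
    noisy = dp_count(people, lambda r: r["age"] >= 60, epsilon=0.5)
    print(f"noisy count of people aged 60+: {noisy:.2f}")
```

A smaller epsilon gives stronger privacy but noisier answers, which is one reason such techniques complement, rather than replace, sound data access policies.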

The NRC should explore the design of virtual data safe rooms that implement multiple layers of security and privacy protection. These environments would allow researchers to work with sensitive data without the ability to export raw data, reducing privacy risks.

The California Policy Lab serves as a case study for implementing secure data environments while enabling valuable research on sensitive administrative data.

Safeguards for Ethical Research

This chapter addresses ethical challenges in AI research and recommends safeguards for the NRC. Given the scope of the NRC, it would be infeasible to review every research proposal for potential ethical violations.

The report recommends a twofold approach: For default PI access to base-level data and compute, the NRC should establish an ex-post review process for allegations of ethical research violations. For applications requesting access to restricted datasets or resources beyond default compute, researchers should be required to provide an ethics impact statement.

One advantage of beginning with PIs is that university faculty are accountable under existing Institutional Review Boards (IRBs) for human subjects research, as well as to the tenets of peer review.

The report urges non-NRC parties (e.g., universities) to explore additional measures to address ethical concerns in AI compute, such as ethics review processes or embedding ethicists in projects.

Managing Cybersecurity Risks

This chapter addresses cybersecurity considerations for the NRC. The report recommends that the NRC take the lead in setting security classifications and protocols, counteracting the balkanized security system across federal agencies.

The NRC should use dedicated security staff to work with Affiliated Government Agencies and university representatives to harmonize and modernize agency security standards.

Security measures should address both external threats (e.g., cybercriminals, adversarial foreign governments) and internal risks (e.g., inadvertent data exposure, misuse by authorized users).

The NRC should implement comprehensive security controls based on established frameworks such as NIST Special Publication 800-53, with appropriate adaptations for the specific context of AI research.

Intellectual Property

This chapter addresses intellectual property (IP) considerations for the NRC. While evidence on optimal IP incentives for innovation is mixed, the report recommends that the NRC adopt the same approach to allocating patent rights, copyrights, and data rights to NRC users that applies to federal funding agreements.

The NRC should consider conditions for requiring NRC researchers to disclose or share their research outputs under an open-access license. This approach would maximize the public benefit of NRC-funded research while respecting researchers' legitimate interests in their intellectual contributions.

For data contributed by private sector entities, the NRC would need to develop clear policies regarding data rights and usage restrictions, balancing the interests of data contributors with the goal of enabling broad research access.

The report notes that IP considerations are particularly complex when dealing with AI models trained on multiple datasets with different licensing terms and ownership structures.

Note: The above is only a summary of the report content. The complete document contains extensive analysis, case studies, and detailed recommendations.