UCSF Clinical Data

UCSF electronic health record data: What's available for research?

UCSF has identified and de-identified data available for research.

* Identified data: requires an IRB-approved protocol and typically requires funding to work with centralized data experts who extract data on your behalf. FAQs & Learn more!

* De-identified (DeID) data: does not require IRB approval and is self-serve via SQL server or point-and-click tools. Learn more!

* New! De-identified data from the San Francisco Department of Public Health (SFDPH), including Zuckerberg San Francisco General Hospital (ZSFG), Laguna Honda Hospital (LHH), Population Health Division (PHD), Behavioral Health Services (BHS), and ambulatory care areas, is now available for research! No IRB approval needed; programming skills in a language such as R or Python are required. Learn more!

 

Many access paths, one point of entry

Need help or guidance? Do you need data for your research with identifiers like birth dates or medical record numbers? Want to make sure you are compliant with regulations? Your first hour of consultation is free. Make sure you head in the right direction.

 

About the UCSF electronic health record (EHR) data:

  • APeX data dating back to 2012
  • STOR data dating back to 1988
  • Benioff Children's Hospital (BCH) Oakland data dating from March 2020 (with additional select historical data) 
  • Images
  • Clinical notes

Plus additional data, such as COVID-19 specific data for research.

 

There’s a big difference between "identified" and "de-identified (DeID)" data.

A diagram of UCSF research data assets and their sources

De-Identified Research Data in Information Commons

Information Commons is a comprehensive ecosystem of data, tools, secure computational and analysis environments, and community & support services that enables the full cycle of data-driven research. It includes de-identified data from UCSF Health and San Francisco Department of Public Health (SFDPH) EHR systems, UCSF PACS imaging system, and UCSF cancer genomic testing datasets. The data is de-identified and linked together on patient and clinical event levels. Information Commons welcomes self-service use of the data and lowers barriers to access (No IRB Required!), helping minimize the time scientists spend on preparation for research.  

Information Commons supports a range of computational environments for data access, from the turn-key Windows-based Research Analysis Environment (RAE) to advanced cloud-based specialized clusters on IC AWS Secure and on-premises high-performance computing resources on IC Wynton. The structured data and clinical notes are hosted on the MS SQL Server environment and are accessible via RAE, with data copies in Parquet format available on other IC environments. Radiology imaging data, linked with the rest of the IC data, is available via the IC Wynton environment.
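As a rough illustration of what patient-level and encounter-level linkage makes possible, the sketch below joins two tables on a shared patient key. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the MS SQL Server environment. The table names, column names, and rows are hypothetical and synthetic; they are not the actual DEID CDW schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical, synthetic tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical schema for illustration only.
        CREATE TABLE patients (
            patient_key INTEGER PRIMARY KEY,
            birth_year  INTEGER
        );
        CREATE TABLE encounters (
            encounter_key  INTEGER PRIMARY KEY,
            patient_key    INTEGER REFERENCES patients(patient_key),
            encounter_type TEXT
        );
        INSERT INTO patients VALUES (1, 1980), (2, 1955), (3, 2001);
        INSERT INTO encounters VALUES
            (10, 1, 'Office Visit'),
            (11, 1, 'Hospital Encounter'),
            (12, 2, 'Office Visit');
        """
    )
    return conn


def encounters_per_patient(conn: sqlite3.Connection) -> list:
    """Count encounters per patient, keeping patients with no encounters."""
    query = """
        SELECT p.patient_key, COUNT(e.encounter_key) AS n_encounters
        FROM patients AS p
        LEFT JOIN encounters AS e ON e.patient_key = p.patient_key
        GROUP BY p.patient_key
        ORDER BY p.patient_key
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for patient_key, n_encounters in encounters_per_patient(demo):
        print(patient_key, n_encounters)
```

The LEFT JOIN is a deliberate choice here: it keeps patients who have no recorded encounters in the result, which an inner join would silently drop.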

Information Commons Research Data Assets are listed below. 

Data Asset | Description | User-Friendly Access Tools

UCSF Health DEID CDW 

The most complete set of UCSF Health de-identified clinical and dental care data. UCSF Health, an academic medical center, is renowned for its specialized care and advanced medical research, serving primarily insured populations. UCSF DEID CDW includes data from:

  • UCSF Medical Center 
  • UCSF Benioff Children’s Hospitals 
  • UCSF Dental Center 
  • Langley Porter Psychiatric Hospital and Clinics

 As of November 2024, UCSF DEID CDW includes data from over 161M clinical encounters for over 4.3M UCSF Health patients. 

Some of the data domains included in DEID CDW:

  • Demographics 
  • Encounters 
  • Diagnosis 
  • Medications 
  • Labs 
  • Procedures 
  • Flowsheets 
  • Vital status from CA Death Registry 

 

Learn more on Research Data Assets Wiki 

PatientExploreR is a web-based data exploration tool operating on a subset of UCSF DEID CDW data.  

 

 

UCSF DEID Notes and Notes Extracts 

 

The UCSF De-Identified Clinical Notes dataset is sourced from UCSF Epic CLARITY and integrated into UCSF DEID CDW. De-identified clinical notes and associated metadata are linked with the DEID CDW patient records, encounters, and relevant clinical procedures.

As of November 2024, UCSF DEID Clinical Notes contain de-identified text and metadata from over 175 million notes from over 3.3 million patients. In addition, there are structured data containing clinical concepts extracted from notes text. 

Learn More on Research Data Assets Wiki  

EMERSE is a user-friendly search engine that helps you search for patients based on the language in their clinical notes. UCSF EMERSE is implemented on UCSF DEID Clinical Notes and patient metadata from UCSF DEID CDW.
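For readers who work with the notes directly rather than through EMERSE, the idea of finding patients by note language can be sketched as a plain keyword search. This is a simplified illustration and is not how EMERSE itself works. The record fields (patient_key, note_text) and the sample notes are hypothetical and synthetic.

```python
import re


def patients_mentioning(notes, term):
    """Return sorted patient keys with at least one note containing `term`.

    Matching is case-insensitive and whole-word only, so searching for
    "migraine" does not match "migraines".
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(
        {note["patient_key"] for note in notes if pattern.search(note["note_text"])}
    )


# Synthetic notes for illustration only; not real patient data.
SYNTHETIC_NOTES = [
    {"patient_key": 1, "note_text": "Patient reports intermittent Migraine with aura."},
    {"patient_key": 2, "note_text": "No history of migraines or seizures."},
    {"patient_key": 3, "note_text": "Follow-up for migraine prophylaxis."},
    {"patient_key": 3, "note_text": "Blood pressure well controlled."},
]

if __name__ == "__main__":
    print(patients_mentioning(SYNTHETIC_NOTES, "migraine"))
```

A plain keyword match like this has no notion of negation or context (a note saying "no migraine" would still match), so any cohort built this way would need manual review.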

UCSF DEID Cancer Genomic Testing Data 

A component of UCSF DEID CDW, the UCSF DEID Cancer Genomic Testing data contains de-identified next-generation sequencing results from targeted cancer genomic tests, UCSF500 (Clinical Cancer Genomics Lab) and Foundation Medicine (commercial test). 

Learn More on Research Data Assets Wiki 

  

UCSF cBioPortal – user-friendly gene testing data exploration and visualization tool based on cancer genomic testing data from UCSF DEID CDW (available via UCSF Cancer Center’s Molecular Oncology Initiative). 

 

UCSF Imaging Commons

The Imaging Commons includes de-identified radiology images, which encompass various modalities, such as MRI scans, with details like patient demographics, diagnosis, and procedural information linked through structured EHR data. It also provides metadata associated with the images, including information about imaging sequences, acquisition parameters, and the ability to view DICOM headers and image pixels. 
Learn more 

 

MIX, a user-friendly image explorer, is available via IC Wynton App Server. Use MIX for data exploration and cohort selection. 

Watch the tutorial for more information on what imaging resources are available on the IC Wynton App server, and download the presentation itself. 

UCSF DEID OMOP 

UCSF DEID OMOP is a de-identified clinical dataset sourced from UCSF DEID CDW and transformed into the common data model (OMOP CDM) created by OHDSI. UCSF DEID OMOP contains major types of structured clinical data (visits, diagnoses, labs, measurements, observations, medications, procedures) for all UCSF DEID CDW patients with clinical activity.

Analyses done in DEID OMOP are more readily reproducible and scalable to other institutions that embrace OMOP (e.g., UC-wide de-identified EHR through UCDDP). 
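To illustrate why a common data model helps with reproducibility, here is a minimal sketch of a cohort count written against standard OMOP CDM table and column names (person, condition_occurrence, person_id, condition_concept_id). It runs on an in-memory sqlite3 database with synthetic rows rather than on UCSF DEID OMOP, and the concept IDs are arbitrary placeholders. In principle, the same SQL could be pointed at another OMOP instance with only the connection changed.

```python
import sqlite3

# Hypothetical concept ID used only for this illustration.
PLACEHOLDER_CONCEPT_ID = 1234567


def build_synthetic_omop() -> sqlite3.Connection:
    """Create a tiny in-memory database using OMOP CDM table/column names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            person_id     INTEGER PRIMARY KEY,
            year_of_birth INTEGER
        );
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id               INTEGER REFERENCES person(person_id),
            condition_concept_id    INTEGER,
            condition_start_date    TEXT
        );
        -- Synthetic rows for illustration only.
        INSERT INTO person VALUES (1, 1970), (2, 1988), (3, 1995);
        INSERT INTO condition_occurrence VALUES
            (100, 1, 1234567, '2021-01-05'),
            (101, 1, 1234567, '2022-03-10'),
            (102, 2, 1234567, '2020-07-01'),
            (103, 3, 7654321, '2023-02-02');
        """
    )
    return conn


def count_patients_with_condition(conn: sqlite3.Connection, concept_id: int) -> int:
    """Count distinct persons with at least one occurrence of the concept."""
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT person_id)
        FROM condition_occurrence
        WHERE condition_concept_id = ?
        """,
        (concept_id,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    omop = build_synthetic_omop()
    print(count_patients_with_condition(omop, PLACEHOLDER_CONCEPT_ID))
```

Counting DISTINCT person_id matters because a patient can have many occurrences of the same condition; counting rows would overstate the cohort size.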

 

UCSF OMOP Atlas – the UCSF implementation of the OMOP data exploration and analysis tool, based on UCSF DEID OMOP data

SFDPH DEID CDW 

SFDPH De-identified Clinical Data Warehouse includes San Francisco Department of Public Health (SFDPH) electronic health record data sourced from SFDPH Epic Caboodle instances. SFDPH offers essential health services to all San Francisco residents, emphasizing accessibility and affordability. It includes the following facilities: 

  • Zuckerberg San Francisco General Hospital
  • Laguna Honda Hospital  
  • Clinics including Primary Care  
  • Population Health Division  
  • Behavioral Health Services 

 

As of November 2024, SFDPH DEID CDW contains data on over 33.6M clinical encounters for over 767K patients.

Learn more on Research Data Assets Wiki 

 

This dataset is currently only accessible via data tools on Information Commons environments. Working with this data requires SQL, R, Python, or other statistical programming skills.

UCSF-SFDPH DEID OMOP 

UCSF-SFDPH DEID OMOP is a de-identified OMOP-standard database that incorporates EHR data from both UCSF Health and SFDPH (including ZSFG). It includes clinical data from each health system, transformed into the OMOP common data model and merged at the patient level.

Learn more on Research Data Assets Wiki 

UCSF OMOP Atlas now supports a combined UCSF-SFDPH data source, in addition to the UCSF-only data source.

 

Request identified clinical data: you need a consultation

Identified data is provided by consultation only. The first hour of your consultation is free!

  • Clarity - closest data to APeX; clinical notes available
  • Clinical Data Warehouse (CDW) - concise, pulls common data in Clarity into one field
  • OMOP - uses a national common data model on data derived from PCORnet pSCANNER.
    Data is further from its original state, so some information may be lost.

The consultant will help you define a data specification. The APeX Pick List and/or ZSFG Pick List (Large Excel files via UCSF Box) are helpful tools for this work - see more information below.

Working with clinical data? Preparation is key.

Be ready with adequate computing capabilities and tools for:

Use the APeX Pick List or the ZSFG Pick List (large Excel files via UCSF Box) to identify variables for your research and to define your cohort. The pick lists cover categories such as:

  • Diagnoses
  • Meds
  • Labs
  • Procedures
  • Flowsheet
  • Departments
  • Smart Data Elements