UCSF electronic health record data: What's available for research?
UCSF has identified and de-identified data available for research.
* Identified data: requires an IRB-approved protocol and typically requires funding to work with centralized data experts who extract data on your behalf. FAQs & Learn more!
* De-identified (DeID) data: does not require IRB approval and is self-serve via SQL server or point-and-click tools. Learn more!
* New! De-identified data from the San Francisco Department of Public Health (SFDPH) -- including Zuckerberg San Francisco General Hospital (ZSFG), Laguna Honda Hospital (LHH), Population Health Division (PHD), Behavioral Health Services (BHS), and ambulatory care areas -- is now available for research! No IRB approval needed; programming skills in a language such as R or Python are required. Learn more!
Many access paths, one point of entry
Need help or guidance? Do you need data for your research with identifiers like birth dates or medical record numbers? Want to make sure you are compliant with regulations? Your first hour of consultation is free. Make sure you head in the right direction.
- First-time User? Request Data Access for Research
- Not sure what option is best for your project? Request a free brief consultation for advice.
- Already using the data for research? Join the active User Group!
- If you'd like to learn more about what happens behind the scenes, read about the process.
About the UCSF electronic health record (EHR) data:
- APeX data dating back to 2012
- STOR data dating back to 1988
- Benioff Children's Hospital (BCH) Oakland data dating from March 2020 (with additional select historical data)
- Images
- Clinical notes
Plus additional data, such as:
- Geocoded address data
- CA Death Registry data
- SFDPH / ZSFG and other Department of Public Health data
- UC Health data (EHR data from UC Davis, UC Irvine, UCLA, UCSD, UCSF)
>> COVID-19-specific data for research is also available
There’s a big difference between "identified" and "de-identified (DeID)" data.
De-Identified Research Data in Information Commons
Information Commons is a comprehensive ecosystem of data, tools, secure computational and analysis environments, and community and support services that enables the full cycle of data-driven research. It includes de-identified data from UCSF Health and San Francisco Department of Public Health (SFDPH) EHR systems, the UCSF PACS imaging system, and UCSF cancer genomic testing datasets. The data is de-identified and linked at the patient and clinical event levels. Information Commons welcomes self-service use of the data and lowers barriers to access (no IRB required!), helping minimize the time scientists spend preparing for research.
Information Commons supports a range of computational environments for data access, from the turnkey Windows-based Research Analysis Environment (RAE) to specialized cloud-based clusters on IC AWS Secure and on-premises high-performance computing resources on IC Wynton. The structured data and clinical notes are hosted in an MS SQL Server environment and are accessible via RAE, with copies of the data in Parquet format available in other IC environments. Radiology imaging data, linked with the rest of the IC data, is available via the IC Wynton environment.
Information Commons Research Data Assets are listed below.
| Data Asset | Description | User-Friendly Access Tools |
|---|---|---|
| UCSF DEID CDW | The most complete set of UCSF Health de-identified clinical and dental care data. UCSF Health, an academic medical center, is renowned for its specialized care and advanced medical research, serving primarily insured populations. UCSF DEID CDW includes data from: As of November 2024, UCSF DEID CDW includes data from over 161M clinical encounters for over 4.3M UCSF Health patients. Some of the data domains included in DEID CDW: | PatientExploreR is a web-based data exploration tool operating on a subset of UCSF DEID CDW data. |
| UCSF DEID Notes and Notes Extracts | The UCSF De-Identified Clinical Notes dataset is sourced from UCSF Epic CLARITY and integrated into UCSF DEID CDW. De-identified clinical notes and associated metadata are linked with the DEID CDW patient records, encounters, and relevant clinical procedures. As of November 2024, UCSF DEID Clinical Notes contain de-identified text and metadata from over 175 million notes from over 3.3 million patients. In addition, there are structured data containing clinical concepts extracted from the notes text. | EMERSE is a user-friendly search engine that helps you search for patients based on the language in the clinical notes. UCSF EMERSE is implemented on UCSF DEID Clinical Notes and patient metadata from UCSF DEID CDW. |
| UCSF DEID Cancer Genomic Testing data | A component of UCSF DEID CDW, the UCSF DEID Cancer Genomic Testing data contains de-identified next-generation sequencing results from targeted cancer genomic tests, UCSF500 (Clinical Cancer Genomics Lab) and Foundation Medicine (commercial test). Learn more on the Research Data Assets Wiki. | UCSF cBioPortal: a user-friendly gene testing data exploration and visualization tool based on cancer genomic testing data from UCSF DEID CDW (available via UCSF Cancer Center's Molecular Oncology Initiative). |
| Imaging Commons | The Imaging Commons includes de-identified radiology images, which encompass various modalities, such as MRI scans, with details like patient demographics, diagnosis, and procedural information linked through structured EHR data. It also provides metadata associated with the images, including information about imaging sequences, acquisition parameters, and the ability to view DICOM headers and image pixels. | MIX, a user-friendly image explorer, is available via IC Wynton App Server. Use MIX for data exploration and cohort selection. Watch the tutorial for more information on what imaging resources are available on the IC Wynton App server, and download the presentation itself. |
| UCSF DEID OMOP | UCSF DEID OMOP is a de-identified clinical dataset sourced from UCSF DEID CDW and transformed into the common data model (OMOP CDM) created by OHDSI. UCSF DEID OMOP contains major types of structured clinical data (visits, diagnoses, labs, measurements, observations, medications, procedures) for all UCSF DEID CDW patients with clinical activity. Analyses done in DEID OMOP are more readily reproducible and scalable to other institutions that embrace OMOP (e.g., UC-wide de-identified EHR through UCDDP). | UCSF OMOP Atlas: the UCSF implementation of the OMOP data exploration/analysis tool, based on UCSF DEID OMOP data. |
| SFDPH De-identified Clinical Data Warehouse | SFDPH De-identified Clinical Data Warehouse includes San Francisco Department of Public Health (SFDPH) electronic health record data sourced from SFDPH Epic Caboodle instances. SFDPH offers essential health services to all San Francisco residents, emphasizing accessibility and affordability. It includes the following facilities: As of November 2024, SFHN DEID CDW contains data on over 33.6M clinical encounters for over 767K patients. Learn more on the Research Data Assets Wiki. | This dataset is currently only accessible via data tools on Information Commons environments. Working with this data requires SQL, R, Python, or other statistical programming skills. |
| UCSF-SFDPH DEID OMOP | UCSF-SFDPH DEID OMOP is a de-identified OMOP-standard database that incorporates EHR data from both UCSF Health and SFDPH (including ZSFG). It includes clinical data from each health system transformed into the OMOP common data model and merged at the patient level. | UCSF OMOP Atlas now supports a combined UCSF-SFDPH data source, in addition to the UCSF-only data source. |
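To illustrate why analyses written against an OMOP common data model tend to port between institutions, here is a minimal sketch of a cohort query that uses only standard OMOP CDM table and column names (`person`, `condition_occurrence`, `condition_concept_id`). Everything else is an assumption made for illustration: an in-memory SQLite database stands in for a real OMOP database, the rows are invented, the concept ID `1001` is a hypothetical placeholder rather than a real vocabulary concept, and `find_cohort` is a hypothetical helper. It is not the UCSF schema or access method.

```python
import sqlite3

# In-memory SQLite stands in for a real OMOP database (assumption);
# the tables hold a few invented rows using standard OMOP CDM column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
INSERT INTO person VALUES (1, 1950), (2, 1985), (3, 1972);
INSERT INTO condition_occurrence VALUES
    (10, 1, 1001, '2019-03-02'),
    (11, 1, 1001, '2021-07-15'),
    (12, 2, 2002, '2020-01-09'),
    (13, 3, 1001, '2022-11-30');
""")

# Hypothetical placeholder concept ID, not a real OHDSI vocabulary concept.
TARGET_CONCEPT_ID = 1001

# One row per patient with the condition, plus the earliest recorded start date.
COHORT_SQL = """
SELECT p.person_id, p.year_of_birth, MIN(co.condition_start_date) AS first_dx_date
FROM person AS p
JOIN condition_occurrence AS co ON co.person_id = p.person_id
WHERE co.condition_concept_id = ?
GROUP BY p.person_id, p.year_of_birth
ORDER BY p.person_id
"""


def find_cohort(connection, concept_id):
    """Return (person_id, year_of_birth, first_dx_date) tuples for one concept."""
    return connection.execute(COHORT_SQL, (concept_id,)).fetchall()


cohort = find_cohort(conn, TARGET_CONCEPT_ID)
for row in cohort:
    print(row)
```

Because the query references only table and column names defined by the common data model, the same SQL text could in principle be pointed at another OMOP database with little change beyond the connection and any dialect differences.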
- The self-serve research data assets available through Information Commons do not require IRB approval to access.
- First-time User? Request Data Access for Research
- Not sure what option is best for your project? Request a free brief consultation for advice.
- Already using the De-Identified Research Data Assets on Information Commons? Join the active Online User Group on Slack or attend Research Data Office Hours!
Request identified clinical data: you need a consultation
Identified data is provided by consultation only. The first hour of your consultation is free!
- Clarity - closest data to APeX; clinical notes available
- Clinical Data Warehouse (CDW) - concise, pulls common data in Clarity into one field
- OMOP - uses a national common data model on data derived from PCORnet pSCANNER; the data is further from its original state, and there is potential to lose information
The consultant will help you define a data specification. The APeX Pick List and/or ZSFG Pick List (large Excel files via UCSF Box) are helpful tools for this work; see more information below.