HTAC Database - Data Download Guide

From Pheno Wiki

Latest revision as of 12:05, 17 June 2013

Note: This guide assumes that you are familiar with the CNP dataset and the names of the data subsets (e.g., LA5C), and that you have access to the HTAC Database.

In the HTAC Customized Data Export section, you can request data organized by Subject Type (Step 3) or Subject Status (Step 4).

  • For Step 3, you can choose to download only a certain set of subjects (for example, a particular set of patients). To download the entire dataset, select "ALL SUBJECTS".
  • For Step 4, you have three options:
      • "Master List (N = 1254)": This is most likely the option that users will choose. This includes all subjects with Status = 2 (Complete).
      • "Population Stratified Set": This includes all subjects with Status = 2 (Complete), plus 62 additional subjects with Status = 0 and Genetic Recovery Case = 1. This larger dataset (N = 1316) will be used for primary genetic analyses only. We have included the additional Genetic Recovery Cases in an attempt to increase our total sample size as much as possible, but they don't necessarily meet inclusion criteria for the Master List.
      • "Inactive/Active/Complete (N = 1839)": This includes all subjects recorded in the study. This dataset should only be downloaded for QC purposes and should not be used for analyses.

CNP FinalSamples 030713.png

  • For Step 5, you have two options:
      • "CLEANED": Cleaning rules have been applied to derived variables, such that subjects who fail the cleaning rules for a given task will be excluded. See HTAC Database - Cleaned Data: Cleaning Rules.
      • "UNCLEANED": Cleaning rules have not been applied to the data, and the investigator should apply them before examining or analyzing the data. See HTAC Database - Cleaned Data: Cleaning Rules.

At this point, you have a complete data set with subjects that have been determined to be included in the Master or Population Stratified set. They vary in how complete the data are, but they have all been determined to be usable.
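
The Status-based definitions of the Master List and the Population Stratified Set can be sketched as a filter over an export of all subjects. This is only an illustration: the CSV layout, the column names `SubjectID`, `Status`, and `GeneticRecoveryCase`, and the helper `select_subjects` are assumptions made for the example, not the actual HTAC export format. In practice the database applies this selection for you at Step 4.

```python
import csv
import io


def select_subjects(rows, subset="master"):
    """Return the rows that belong to the requested subset.

    "master": Status = 2 (Complete) only.
    "population_stratified": Status = 2, plus Status = 0 subjects
    with Genetic Recovery Case = 1.
    Column names are hypothetical.
    """
    selected = []
    for row in rows:
        status = int(row["Status"])
        recovery = int(row.get("GeneticRecoveryCase") or 0)
        is_complete = status == 2
        is_recovery_case = status == 0 and recovery == 1
        if is_complete or (subset == "population_stratified" and is_recovery_case):
            selected.append(row)
    return selected


# Made-up rows standing in for an "Inactive/Active/Complete" export.
SAMPLE_EXPORT = """SubjectID,Status,GeneticRecoveryCase
S001,2,0
S002,0,1
S003,1,0
S004,2,0
S005,0,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
master = select_subjects(rows, "master")
pop_strat = select_subjects(rows, "population_stratified")

print([r["SubjectID"] for r in master])
print([r["SubjectID"] for r in pop_strat])
```

With the real data, the same logic would be expected to yield 1254 subjects for the Master List and 1254 + 62 = 1316 for the Population Stratified Set.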


Notes about DropDate, DQ, and SF Fields:
These subjects may have values entered in the DropDate, DQ, SF, or Flag fields, which are variables listed in the Patient Registry form. These do not necessarily make the subject unusable.

  • Positive DropDate: Not grounds for exclusion. Rather, a DropDate indicates a subject who may have stopped and then restarted the study. Subjects who were actually dropped for meeting exclusion criteria were marked as Inactive (Status = 1); since those data should not be downloaded for analysis, there should be no confusion between a "real" DQ and the DropDate/DQ fields here.
  • DQ_Reason: This variable was used throughout the study to record why a subject did not complete the study. It does not make their data unusable. Many of the remaining DQ codes were entered at the scanning stage and are therefore scan specific (e.g., the subject failed to show up for their scan, so this code was entered in the Registry to indicate why this portion of their data is missing). Consistent with the fact that their Status = 2 (Complete), the data are usable, despite a positive DQ code.
  • SF_Reason: This was another field used in the Registry to record information about why complete data are not available from the subject. The presence of an SF flag does not mean that the data are unusable.

Master Set (N = 1254), all patients and controls: There are 6 subjects with DropDates, 17 subjects with a DQ_Reason, and 3 subjects with an SF_Reason. Since many of these overlap, there are 20 total subjects with any of these fields filled in. Whatever data were collected from these subjects have been determined to be usable. A table listing these subjects with either DropDate, DQ or SF is here.
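
Because a subject can have more than one of these fields filled in, the per-field counts do not simply add up to the total. A minimal sketch of counting the overlap follows; it assumes the downloaded rows are dictionaries keyed by `SubjectID`, `DropDate`, `DQ_Reason`, and `SF_Reason`, and that an empty string means the field is not filled in. The helper name and the sample values are hypothetical.

```python
def subjects_with_notes(rows, fields=("DropDate", "DQ_Reason", "SF_Reason")):
    """Collect subject IDs with a non-empty value in each field,
    plus the union of subjects with any of the fields filled in."""
    per_field = {field: set() for field in fields}
    for row in rows:
        for field in fields:
            if (row.get(field) or "").strip():
                per_field[field].add(row["SubjectID"])
    any_field = set().union(*per_field.values())
    return per_field, any_field


# Made-up rows; real values would come from the Patient Registry export.
sample_rows = [
    {"SubjectID": "S001", "DropDate": "2010-05-01", "DQ_Reason": "3", "SF_Reason": ""},
    {"SubjectID": "S002", "DropDate": "", "DQ_Reason": "7", "SF_Reason": ""},
    {"SubjectID": "S003", "DropDate": "", "DQ_Reason": "", "SF_Reason": ""},
    {"SubjectID": "S004", "DropDate": "", "DQ_Reason": "", "SF_Reason": "1"},
]

per_field, any_field = subjects_with_notes(sample_rows)
for field, ids in per_field.items():
    print(field, len(ids))
print("any field:", len(any_field))
```

In this toy example the per-field counts sum to 4, but only 3 distinct subjects are affected, which mirrors how 6, 17, and 3 in the Master Set reduce to 20 distinct subjects.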


Notes about Flags:
These subjects may have values entered in the Flag field, which is a variable listed in the Patient Registry form. These do not necessarily make the subject unusable.
For the most part, Flags record information that an investigator may want to use to further filter out subjects (specific to the question being asked) or simply record additional information about the subject/data.

Master Set (N = 1254), all patients and controls: There are 70 subjects with a Flag. A table listing these subjects with a Flag is here.


Notes about Additional Cleaning based on DropDates, DQ, SF and Flags:
The Master Set is available for download and analysis, and all subjects have been determined to be usable. No subjects *need* to be further excluded based on a DQ or Flag, as explained above. After downloading the Master Set, if an investigator wants to do further filtering based on DQ or Flags, this is the investigator's decision and can be done by sorting based on those fields.
Whether to include or exclude subjects based on Flags is up to the person analyzing the data, just as it is that person's responsibility to make sure that they include subjects with complete data for their measure(s) of interest and that those data have been checked, cleaned, etc.
If the investigator decides to conduct further filtering based on DQ or Flags, this information should be recorded so that it can be communicated to other investigators and the analyses can be replicated.
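
One way to keep such a record is to write out the excluded subjects together with the reason at the moment the filter is applied. The sketch below assumes rows are dictionaries with `SubjectID` and `Flag` keys; the flag values "A" and "B" and the helper `exclude_by_flag` are made up for illustration and do not correspond to actual HTAC flag codes.

```python
import json


def exclude_by_flag(rows, flags_to_exclude):
    """Split rows into those kept and a log of those excluded by Flag."""
    kept, log = [], []
    for row in rows:
        flag = (row.get("Flag") or "").strip()
        if flag in flags_to_exclude:
            # Record who was dropped and why, for later replication.
            log.append({"SubjectID": row["SubjectID"], "Flag": flag})
        else:
            kept.append(row)
    return kept, log


# Made-up Master Set rows with hypothetical flag values.
master_rows = [
    {"SubjectID": "S001", "Flag": ""},
    {"SubjectID": "S002", "Flag": "A"},
    {"SubjectID": "S003", "Flag": "B"},
    {"SubjectID": "S004", "Flag": ""},
]

kept_rows, exclusion_log = exclude_by_flag(master_rows, {"B"})
print([r["SubjectID"] for r in kept_rows])
print(json.dumps(exclusion_log, indent=2))
```

Saving the log alongside the analysis output gives other investigators the exact list of excluded subjects rather than a verbal description of the rule.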

CNP FinalSamples 2 030713.png

As of 3/7/13, no additional changes should be made to subjects' status, since such changes would alter the Master Set and Population Stratified Set. If any additional changes are agreed to by the executive committee, they should be documented here:

Update on 6/11/13: Cleaning rules were applied within the database, so data can now be downloaded as Cleaned Data. See HTAC Database - Cleaned Data: Cleaning Rules.

