Prepare Your Data

Preparing Your Data For Deposit

To ensure that research data involving human participants can be made available for future reuse, it is important that consent for future reuse of the data by other researchers is sought from participants. Where your study contains personal data under General Data Protection Regulation (GDPR), informed consent is required for data processing activities, including data anonymisation and any future data sharing and archiving. Ideally consent for processing activities should be collected separately from other consents such as taking part in the research.

For surveys where no personal data is collected an information sheet should be supplied to participants or the survey introduction should state that taking part in the survey implies consent for the data being used for certain purposes. Any plans for future data sharing should be mentioned. A clause should also be included that individual responses will not be used in any way that would allow identification.  

More information on consent for data sharing, including sample consent forms and information sheets, is available on the UK Data Service consent for data sharing pages and in the Informed Consent section of the CESSDA Data Management Expert Guide. The Childhood Development Initiative (CDI) published a toolkit on sharing research data in Ireland: McGrath, B. and Hanan, R., Sharing Social Research Data in Ireland: A Practical Toolkit (2016) Dublin: Childhood Development Initiative (CDI).

When submitting data to ISSDA involving human participants please include a copy of the blank informed consent form or participant information sheet. Please also include information on research ethics board approval or exemption related to the study. 

If your data contains personal data under GDPR you must ensure that your data is anonymised or pseudonymised prior to deposit. Anonymisation/pseudonymisation is the responsibility of the Data Provider.

Definitions - 

Personal Data as defined under Article 4(1) of GDPR  “Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”.

Personal data can be disclosed through two types of identifiers  - 

Direct identifiers  - name, address, telephone number, IP address, email, student ID, PPS no.

Indirect identifiers - information that, in combination with other information, could identify individuals. Examples: sex, age, region, occupation, income, ethnicity, religious affiliation, education level, nationality, rare diseases.

Anonymisation of data as defined by the Data Protection Commission "means processing it with the aim of irreversibly preventing the identification of the individual to whom it relates. Data can be considered effectively and sufficiently anonymised if it does not relate to an identified or identifiable natural person or where it has been rendered anonymous in such a manner that the data subject is not or no longer identifiable."

Pseudonymisation under Article 4(5) of GDPR ‘means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person’.

Anonymous data is not considered personal data. Pseudonymised data, however, is considered personal data.
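
As a minimal sketch of pseudonymisation as defined above, the following Python example replaces a direct identifier with a random pseudonym and returns the lookup key separately from the data. The field names (`name`, `pid`), the `pseudonymise` helper and the sample records are hypothetical and invented for illustration; this is not a method prescribed by ISSDA. Because the key allows re-identification, the output is pseudonymised rather than anonymised.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymised records and the key (identifier -> pseudonym).
    The key is the 'additional information' that must be kept separately.
    """
    key = {}
    output = []
    for record in records:
        identifier = record[id_field]
        # Reuse the same pseudonym when the same person appears again.
        if identifier not in key:
            key[identifier] = "P-" + secrets.token_hex(4)
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pid"] = key[identifier]
        output.append(cleaned)
    return output, key


# Invented sample records, for illustration only.
records = [
    {"name": "Mary Byrne", "age": 34},
    {"name": "John Walsh", "age": 51},
    {"name": "Mary Byrne", "age": 34},
]
pseudonymised, key = pseudonymise(records)
```

Indirect identifiers such as age remain in the output, so a dataset treated this way may still need further work before it can be considered anonymised.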

Further guidance on anonymisation and pseudonymisation is available from the Data Protection Commission.

Information and guidance on anonymising data is available from:

CESSDA DMEG chapter on anonymisation

UK Data Service anonymising data pages

The UK Anonymisation Network (UKAN) provides an Anonymisation Decision-making Framework, available on its website.

Data cleaning is the process of removing incomplete or inaccurate records from your database. A cleaned database is one that has only valid codes for each variable. This means that each code in the dataset must be described in the data dictionary or questionnaire. Note that there may be codes used in the dataset that are not mentioned on the questionnaire. There should be no numerical measures with out-of-range codes. Missing data codes must be explicitly stated. Data should also have been checked, as far as possible, for internal consistency (for example, a never-smoker should not have a cigarette consumption field completed).
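
The checks described above can be sketched in a few lines of Python. This is an illustrative example only: the variable names (`smoker`, `age`, `cigs_per_day`), the code lists and the ranges are assumptions standing in for whatever your own data dictionary defines.

```python
# Hypothetical data dictionary: valid codes per variable (9 = missing).
VALID_CODES = {
    "smoker": {1, 2, 3, 9},  # 1=never, 2=former, 3=current, 9=missing
}
# Hypothetical valid ranges for numerical measures.
VALID_RANGES = {"age": (18, 110), "cigs_per_day": (0, 100)}


def check_record(record):
    """Return a list of problems found in one record."""
    problems = []
    # Every code must be documented in the data dictionary.
    for var, codes in VALID_CODES.items():
        if record.get(var) not in codes:
            problems.append(f"{var}: undocumented code {record.get(var)!r}")
    # Numerical measures must fall inside their documented range.
    for var, (low, high) in VALID_RANGES.items():
        value = record.get(var)
        if value is not None and not low <= value <= high:
            problems.append(f"{var}: value {value} out of range {low}-{high}")
    # Internal consistency: a never-smoker should have no consumption recorded.
    if record.get("smoker") == 1 and record.get("cigs_per_day") is not None:
        problems.append("cigs_per_day completed for a never-smoker")
    return problems


rows = [
    {"smoker": 3, "age": 45, "cigs_per_day": 10},
    {"smoker": 1, "age": 30, "cigs_per_day": 5},
    {"smoker": 4, "age": 130, "cigs_per_day": None},
]
for i, row in enumerate(rows):
    print(i, check_record(row))
```

Reporting problems rather than silently deleting rows keeps a record of what was wrong, which can then be described in the documentation.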

Ideally the database should include long descriptive labels for each variable and labels for each discrete variable value. In SPSS these would be created using the VALUE LABELS and VARIABLE LABELS commands.

There are tools available that can assist with data cleaning and quality control:

OpenRefine - this data manipulation tool is used for data cleansing and quality control purposes.

QAMyData - this open source tool developed by the UK Data Service can be used to automatically assess and report on elements of quality, such as missingness, labelling, duplication, formats, outliers and direct identifiers.

Ensure your data is in a file format accepted by ISSDA. ISSDA accepts data in both our preferred and acceptable file formats. Preferred file formats are open source formats that support long-term preservation, while acceptable file formats include commonly used packages (e.g. SPSS, Stata, SAS). ISSDA does not encourage submission of datasets in Excel files.

The Dataverse software used by ISSDA for its data repository automatically transforms tabular data files such as SPSS files and other statistical files into TAB files. The metadata which describes the content of these data files is stored separately in an XML file. Together this information can be read into SPSS or Stata as well as into other applications.

Quantitative Datasets (statistical file formats)
  Preferred formats:
  • Tab- or comma-delimited file (.tab, .csv) with setup file (for SPSS, Stata, SAS, etc.)
  Acceptable formats:
  • SPSS (.por, .sav)
  • Stata (.dta)
  • SAS (.7bdat; .sd2; .tpt)
  • Tab- or comma-delimited text documents (e.g., .csv, .tab, .tsv), without import syntax/script.
Documentation (text documents)
  Preferred formats:
  • PDF/A (.pdf)
  Acceptable formats:
  • PDF (.pdf)
  • ODT (.odt)
  • MS Word (.doc, .docx)
  • RTF (.rtf)

Please see the ISSDA File Format Policy for further information.

Gather together your documentation. Include all documentation that describes the research data and context of the study, including the objectives, methodology, fieldwork, variables and summaries of findings, which will assist in the understanding and re-use of the data. The CESSDA DMEG Documentation and metadata chapter outlines information that should be included in your documentation.

Codebook or Data Dictionary

The data dictionary or codebook is a central document that describes the different datasets being deposited, the sample size in each and the storage format (e.g. csv, SPSS, SAS). It outlines each element in the dataset, giving the structure, content, and variable definitions. Data dictionaries and codebooks are critical tools for understanding and using data. While the terms are often used interchangeably, a codebook is generally used to describe survey data.

Introductory or context information for the original data study should be included. There should be a description of how the data were anonymised and a list of the variables (on the questionnaire) not included in the database, or variables which were altered to ensure anonymity (e.g. age groups instead of exact ages).
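
As one illustration of altering a variable to ensure anonymity, the sketch below groups exact ages into bands. The `age_band` helper and the 10-year width are assumptions made for this example; an appropriate grouping depends on the disclosure risk in your own dataset.

```python
def age_band(age, width=10):
    """Map an exact age to a band label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


ages = [23, 30, 39, 67]
bands = [age_band(a) for a in ages]
print(bands)  # ['20-29', '30-39', '30-39', '60-69']
```

Whatever transformation is used, it should be recorded in the codebook so that End Users know the variable was altered.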

Each variable should be listed, usually in the order in which it appears in the dataset, including the following:

  • Variable name - name assigned to each variable in the dataset
  • Variable label - brief description of variable, where possible using the exact wording of the question 
  • Exact wording used to elicit the information - this may be available from the questionnaire but should be repeated in the data dictionary 
  • Variable meaning - exact definition of the variable
  • Level of measurement - method the value was measured with such as nominal, scale, interval, ratio
  • Variable format - number, date
  • Valid codes and meanings - actual coded values in the data for this variable such as 1, 2, 3 and what these mean e.g. Excellent, Good
  • Codes for missing data, with the reason for missing data or for ‘irrelevant’ data (e.g. date of marriage for someone who was never married). Often 9, 99 or 999 are used for missing data. The UK Data Service suggests: 99 = not recorded, 98 = not provided (no answer), 97 = not applicable, 96 = not known, 95 = error
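
The elements listed above can be captured in a simple structured record. The following Python sketch is one possible layout, not a required ISSDA format: the `VariableEntry` class, its field names and the example variable `genhlth` are hypothetical. The missing data codes follow the UK Data Service suggestion quoted in the list.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One data dictionary entry for a single variable."""
    name: str            # variable name in the dataset
    label: str           # brief description
    question_text: str   # exact wording used to elicit the information
    meaning: str         # exact definition of the variable
    level: str           # level of measurement
    fmt: str             # variable format, e.g. number or date
    valid_codes: dict = field(default_factory=dict)
    missing_codes: dict = field(default_factory=dict)

    def is_missing(self, value):
        return value in self.missing_codes

    def describe(self, value):
        """Return the documented meaning of a coded value."""
        if value in self.missing_codes:
            return f"missing ({self.missing_codes[value]})"
        return self.valid_codes.get(value, "UNDOCUMENTED CODE")


# Hypothetical example entry.
health = VariableEntry(
    name="genhlth",
    label="Self-rated general health",
    question_text="In general, how would you rate your health?",
    meaning="Respondent's own rating of their general health",
    level="ordinal",
    fmt="number",
    valid_codes={1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor"},
    missing_codes={
        99: "not recorded",
        98: "not provided (no answer)",
        97: "not applicable",
        96: "not known",
        95: "error",
    },
)
print(health.describe(2))
print(health.describe(97))
```

Keeping valid and missing codes in separate mappings makes it easy to confirm that every value in the dataset is documented.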

For derived variables (e.g. Body Mass Index, an SF-36 domain) the formula or algorithm used should be given or referenced. 
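
For example, Body Mass Index is weight in kilograms divided by the square of height in metres. A derived variable of this kind could be documented alongside the code that produced it, as in the sketch below. The source variable names (`weight_kg`, `height_cm`) and the rounding to one decimal place are assumptions for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index = weight (kg) / height (m) squared."""
    height_m = height_cm / 100
    return round(weight_kg / height_m ** 2, 1)


# Plain-text description of the derivation, suitable for the data dictionary.
DERIVED_DOC = {
    "bmi": "weight_kg / (height_cm / 100) ** 2, rounded to 1 decimal place",
}

print(bmi(70, 175))  # 22.9
```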

Special care should be taken if dates are included in the dataset, and the format should be described.
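
One way to make a date format unambiguous is to state the format as recorded and, where appropriate, convert values to ISO 8601 (YYYY-MM-DD). The sketch below assumes the source data stores dates as day/month/year; the `to_iso` helper and `SOURCE_FORMAT` constant are hypothetical names.

```python
from datetime import datetime

# Assumed format of the dates as recorded: day/month/year.
SOURCE_FORMAT = "%d/%m/%Y"


def to_iso(date_text, source_format=SOURCE_FORMAT):
    """Convert a date string to ISO 8601 (YYYY-MM-DD).

    Raises ValueError if the text does not match the stated format.
    """
    return datetime.strptime(date_text, source_format).date().isoformat()


print(to_iso("03/04/2021"))  # 2021-04-03
```

A value such as 03/04/2021 reads as 3 April under this assumption but as 4 March under a month-first convention, which is why the format needs to be described.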

Questionnaires

If PAPI (Paper and Pencil Interviews) has been used, the questionnaire should be included. For CAPI (Computer Aided Personal Interviewing) and CATI (Computer Aided Telephone Interviewing), the question wording should be supplied with notes for branched questions (i.e. questions that depend on a positive answer to a previous question). For CASI (Computer Assisted Self Interviewing) systems such as SurveyMonkey, HTML file(s) displaying the questions should be provided.

Publications

Include any publications associated with the data study such as journal articles, project, summary or technical reports. These often contain important information such as research context and design, data collection methods and data preparation, plus summaries of findings based on the data.

Include blank consent form

Please include a blank copy of the consent form used with your study as part of the documentation. Blank consent forms will be archived with the data but will not be made available to End Users unless they request to see them.

See more information about data preparation from the UK Data Service.

Irish Social Science Data Archive (ISSDA)

James Joyce Library, University College Dublin, Belfield, Dublin 4, Ireland.
E: issda@ucd.ie