Prepare Your Data

Preparing Your Data For Deposit

To ensure that research data involving human participants can be made available for future reuse, it is important that consent for future reuse of the data by other researchers is sought from participants. Where your study contains personal data under General Data Protection Regulation (GDPR), informed consent is required for data processing activities, including data anonymisation and any future data sharing and archiving. Ideally consent for processing activities should be collected separately from other consents such as taking part in the research.

For surveys where no personal data is collected, an information sheet should be supplied to participants, or the survey introduction should state that taking part in the survey implies consent for the data being used for certain purposes. Any plans for future data sharing should be mentioned. A clause should also be included stating that individual responses will not be used in any way that would allow identification.

More information on consent for data sharing, including sample consent forms and information sheets, is available on the UK Data Service consent for data sharing pages and in the Informed Consent section of the CESSDA Data Management Expert Guide. The Childhood Development Initiative (CDI) published a toolkit on sharing research data in Ireland: McGrath, B. and Hanan, R., Sharing Social Research Data in Ireland: A Practical Toolkit (2016) Dublin: Childhood Development Initiative (CDI).

When submitting data to ISSDA involving human participants please include a copy of the blank informed consent form or participant information sheet. Please also include information on research ethics board approval or exemption related to the study. 

Data must be anonymised or pseudonymised prior to deposit with ISSDA. Anonymisation or pseudonymisation is the responsibility of the Data Depositor. 

Definitions - 

Personal Data as defined under Article 4(1) of GDPR  “Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”.

Personal data can be disclosed through two types of identifiers  - 

Direct identifiers  - name, address, telephone number, IP address, email, student ID, PPS no.

Indirect identifiers - information that, in combination with other information, could identify individuals. Examples include sex, gender, age, region, occupation, work place, status in employment, economic activity, occupation status, income, ethnicity, religious affiliation, socio-economic status, marital status, household composition, education level, nationality, mother tongue, rare diseases, etc.

Anonymisation of data as defined by the Data Protection Commission "means processing it with the aim of irreversibly preventing the identification of the individual to whom it relates. Data can be considered effectively and sufficiently anonymised if it does not relate to an identified or identifiable natural person or where it has been rendered anonymous in such a manner that the data subject is not or no longer identifiable."

Pseudonymisation under Article 4(5) of GDPR ‘means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person’.

Anonymous data is not considered personal data. Pseudonymised data, however, is considered personal data.
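The distinction above can be illustrated with a minimal Python sketch of pseudonymisation. It assumes records held as dictionaries, a hypothetical `student_id` field and an arbitrary pseudonym format; none of these come from ISSDA guidance. The key table it returns is the "additional information" referred to in Article 4(5): while that table exists, the data remains personal data, so it must be stored separately and never deposited.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymised records and a key table mapping each
    pseudonym back to the original identifier. The key table must be
    kept separately from the data.
    """
    key_table = {}  # pseudonym -> original identifier
    assigned = {}   # original identifier -> pseudonym
    output = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            pseudonym = "P" + secrets.token_hex(4)
            while pseudonym in key_table:  # guard against rare collisions
                pseudonym = "P" + secrets.token_hex(4)
            assigned[original] = pseudonym
            key_table[pseudonym] = original
        cleaned = dict(record)  # copy, so the input records are not modified
        cleaned[id_field] = assigned[original]
        output.append(cleaned)
    return output, key_table


# Illustrative records only (hypothetical field names and values)
records = [
    {"student_id": "S1001", "age_band": "25-34"},
    {"student_id": "S1002", "age_band": "35-44"},
    {"student_id": "S1001", "age_band": "25-34"},
]
pseudonymised, key_table = pseudonymise(records, "student_id")
```

The same person receives the same pseudonym across records, so repeated observations can still be linked without exposing the original identifier.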

Further guidance on anonymisation and pseudonymisation is available from the Data Protection Commission.

Prior to depositing data with ISSDA:

  • Remove all direct identifiers
    • name, address or detailed geographic location including postal code, date of birth, telephone number, IP address, email, student ID, PPS no, passport no. etc. 
  • Check all indirect identifiers to ensure there are no outliers and that identifiers cannot be combined to identify an individual. Where there are outliers or low numbers of observations it may be necessary to recode or aggregate variables that could allow re-identification in combination with other variables.  
    • Indirect or quasi identifiers include  - sex, gender, age, region, occupation, work place, status in employment, economic activity, occupation status, income, ethnicity, religious affiliation, socio-economic status, marital status, household composition, education level, nationality, mother tongue, rare diseases, etc. 
  • Remove or check answers to all open-ended questions for direct and indirect identifiers.
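One way to check whether indirect identifiers can be combined to single out an individual is to count how many records share each combination of values. The Python sketch below is illustrative only: the variable names, the sample records and the threshold of 3 are assumptions, not ISSDA requirements, and a suitable threshold depends on the dataset.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, threshold):
    """Return combinations of quasi-identifier values shared by fewer
    than `threshold` records, with the number of records in each."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < threshold}


# Illustrative records only (hypothetical variables and values)
records = [
    {"sex": "F", "age_band": "25-34", "region": "Dublin"},
    {"sex": "F", "age_band": "25-34", "region": "Dublin"},
    {"sex": "F", "age_band": "25-34", "region": "Dublin"},
    {"sex": "M", "age_band": "65+", "region": "Border"},
]

# Threshold of 3 is an assumption chosen for this example
risky = small_cells(records, ["sex", "age_band", "region"], threshold=3)
```

Combinations returned by the function are candidates for recoding or aggregation before deposit.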

Techniques for quantitative data

  • Banding or aggregating  - for continuous variables like age or income to create broader categories
  • Top or bottom coding  - for extremes at the top or bottom of scale for age, household composition, income or financial variables
  • Re-coding or generalisation  - for ethnicity, educational attainment, employment, nationality, religion, geographic location, etc., merge detailed subcategories into broader groups. 
  • Using standard coding frames is recommended where possible to increase interoperability, e.g.
    • NUTS2 (Nomenclature of Territorial Units for Statistics) for geographic variables
    • ISCED 2011 (International Standard Classification of Education) for levels of education
    • ISCO (International Standard Classification of Occupations) for coding occupations.
  • Keep a record of all actions taken
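The techniques listed above can be sketched in a few lines of Python. The band width, the top-coding thresholds and the education groupings below are illustrative assumptions, not ISSDA recommendations or a standard coding frame; in practice a standard frame such as ISCED would be preferable for education.

```python
def band_age(age, width=10, top=80):
    """Band a continuous age into groups; ages at or above `top` are top-coded."""
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_code(value, cap):
    """Replace values above `cap` with `cap` (for example, very high incomes)."""
    return min(value, cap)


# Hypothetical mapping from detailed categories to broader groups
EDUCATION_GROUPS = {
    "Certificate": "Third level",
    "Degree": "Third level",
    "Doctorate": "Third level",
    "Junior cycle": "Second level",
    "Senior cycle": "Second level",
}

actions = []  # record of all actions taken, for the documentation

ages = [23, 37, 85]
age_bands = [band_age(a) for a in ages]
actions.append("age: banded into 10-year groups, top-coded at 80+")

incomes = [32000, 250000]
capped_incomes = [top_code(v, 150000) for v in incomes]
actions.append("income: top-coded at 150000")

education = ["Degree", "Junior cycle"]
education_recoded = [EDUCATION_GROUPS[e] for e in education]
actions.append("education: detailed categories merged into broader groups")
```

The `actions` list can be copied into the codebook so that users of the data know which variables were altered.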

Information and guidance on anonymising data is available from:

CESSDA DMEG chapter on anonymisation

UK Data Service anonymising data pages

UK Anonymisation Network (UKAN) provides an Anonymisation Decision-making Framework available on their website.

Anonymisation and Personal Data Guidance from the Finnish Social Science Data Archive (FSD)

Data Privacy Handbook from the University of Utrecht

Handbook for data containing personal information from the Swedish National Data Service (SND), which offers guidance on managing personal data in research.

Tools for anonymisation 

Amnesia is a tool from OpenAIRE for anonymising data which allows you to aggregate variables and evaluate the re-identification risk. Amnesia is Java-based and can be downloaded and run locally on your computer.

ARX is open source software for anonymising sensitive personal data. ARX is Java-based and can be run locally on your computer via a compatible Java environment.

sdcMicro is an R package for anonymising data which allows you to check disclosure risk by examining combinations of key variables. sdcMicro can be downloaded and run locally on your computer.

Data cleaning is the process of removing incomplete or inaccurate records from your dataset. A cleaned dataset is one that only has valid codes for each variable. This means that each code in the dataset must be described in the data dictionary or questionnaire. Note that there may be codes used in the dataset that are not mentioned on the questionnaire but are included in the codebook/data dictionary. There should be no numerical measures with out-of-range codes. Missing data codes must be explicitly stated. Data should also have been checked, as far as possible, for internal consistency (for example, a never-smoker should not have a cigarette consumption field completed).

Ideally the data should include long descriptive labels for each variable and labels for each discrete variable value. In SPSS these would be created using the VALUE LABELS and VARIABLE LABELS commands.

Prior to depositing data with ISSDA check the following:

  • Check that values are labelled clearly, correctly and consistently.  
  • Check all missing values are accounted for (e.g. Don’t know, refusal, non-response, etc)
  • Check for errors or inconsistencies in data (e.g. a date instead of a number, a non-smoker should not have a cigarette consumption field completed, etc.)
  • Check for spelling and typing errors - Spellcheck variable names, labels, value labels and string variables (e.g. by exporting all labels to Excel and conducting a spell check).
  • Check for information attached to the dataset that you do not want to include (e.g. notes attached to the dataset or preliminary comments).  
  • Scan data for any unlabelled values. 
  • Check data file against the codebook/data dictionary 
    • Ensure all variables in the data file are included in the codebook
    • Ensure consistency in naming across data file and codebook
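Several of the checks above can be automated. The following Python sketch is one possible approach, assuming the codebook is available as a dictionary. The variable names, codes, range limits and missing-value codes are hypothetical and chosen only to mirror the smoking example used on this page.

```python
# Hypothetical codebook: valid codes, allowed ranges and missing-data codes
CODEBOOK = {
    "smoker": {
        "valid": {1: "Never smoked", 2: "Former smoker", 3: "Current smoker"},
        "missing": {99: "Not recorded"},
    },
    "cigs_per_day": {
        "range": (0, 100),
        "missing": {997: "Not applicable", 999: "Not recorded"},
    },
}


def check_record(record, codebook):
    """Return a list of problems found in one record."""
    problems = []
    for name, value in record.items():
        spec = codebook.get(name)
        if spec is None:
            problems.append(f"{name}: not in codebook")
            continue
        if value in spec.get("missing", {}):
            continue  # explicitly documented missing-data code
        if "valid" in spec and value not in spec["valid"]:
            problems.append(f"{name}: unlabelled code {value}")
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                problems.append(f"{name}: out-of-range value {value}")
    # Internal consistency: a never-smoker should have no consumption value
    if record.get("smoker") == 1 and record.get("cigs_per_day") not in (None, 997):
        problems.append("never-smoker has a cigarette consumption value")
    return problems


clean = check_record({"smoker": 3, "cigs_per_day": 15}, CODEBOOK)
inconsistent = check_record({"smoker": 1, "cigs_per_day": 20}, CODEBOOK)
invalid = check_record({"smoker": 4, "cigs_per_day": 500}, CODEBOOK)
```

Running the check over every record gives a list of issues to resolve, or to document, before deposit.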

There are tools available that can assist with data cleaning and quality control:

OpenRefine is a data manipulation tool used for data cleansing and quality control.

QAMyData is an open source tool developed by the UK Data Service that can be used to automatically assess and report on elements of quality, such as missingness, labelling, duplication, formats, outliers and direct identifiers.

Ensure your data is in a file format accepted by ISSDA. ISSDA accepts data in both our preferred and acceptable file formats. Preferred file formats are open source formats that support long-term preservation, while acceptable file formats include commonly used packages (e.g. SPSS, Stata, SAS). ISSDA does not encourage submission of datasets in Excel files.

The Dataverse software used by ISSDA for its data repository automatically transforms tabular data files such as SPSS files and other statistical files into TAB files. The metadata which describes the content of these data files is stored separately in an XML file. Together this information can be read into SPSS or Stata as well as into other applications.

Type of Data: Quantitative Datasets (statistical file formats)

Preferred Formats
  • Tab or comma-delimited file (.tab, .csv) with setup file (for SPSS, Stata, SAS, etc.)

Acceptable Formats
  • SPSS (.por, .sav)
  • Stata (.dta)
  • SAS (.sas7bdat; .sd2; .tpt)
  • Tab- or comma-delimited text documents (e.g., .csv, .tab, .tsv), without import syntax/script.

Type of Data: Documentation (text documents)
  • PDF/A (.pdf)
  • PDF (.pdf)
  • ODT (.odt)
  • MS Word (.doc, .docx)
  • RTF (.rtf)

Please see the ISSDA File Format Policy for further information.

Gather together your documentation. Include all documentation that describes the research data and the context of the study, including the objectives, methodology, fieldwork, variables and summaries of findings, which will assist in the understanding and re-use of the data. The CESSDA DMEG Documentation and metadata chapter outlines information that should be included in your documentation.

This can include:

  • Codebook/Data dictionary
  • Information on derived variables and anonymisation actions
  • Questionnaire(s)/survey instrument
  • Interviewer instructions and showcards, where relevant 
  • Summary guide to dataset
  • Methodological information
  • Reports - Final report, Technical report
  • Informed consent - where relevant
  • Participant Information Leaflet (PIL) - where relevant
  • Information on any incentives for taking part in survey
  • Related publications - citation of related publications in APA format and link to publications
  • Readme File

Codebook or Data Dictionary

The data dictionary or codebook is a central document that describes the different datasets being deposited, the sample size in each and the storage format (e.g. csv, SPSS, SAS). It outlines each element in the dataset, giving the structure, content, and variable definitions. Data dictionaries and codebooks are critical tools for understanding and using data. While the terms are often used interchangeably, a codebook is generally used to describe survey data.

Introductory or context information for the original data study should be included. There should be a description of how the data were anonymised and a list of the variables (on the questionnaire) not included in the database, or variables which were altered to ensure anonymity (e.g. age groups instead of exact ages).

Each variable should be listed, usually in the order in which it appears in the dataset, including the following:

  • Variable name - name assigned to each variable in the dataset
  • Variable label - brief description of variable, where possible using the exact wording of the question 
  • Exact wording used to elicit the information - this may be available from the questionnaire but should be repeated in the data dictionary 
  • Variable meaning - exact definition of the variable
  • Level of measurement - method the value was measured with such as nominal, scale, interval, ratio
  • Variable format - number, date
  • Valid codes and meanings - actual coded values in the data for this variable such as 1, 2, 3 and what these mean e.g. Excellent, Good
  • Codes for missing data - the reason for missing data or for ‘irrelevant’ data (e.g. date of marriage for someone who was never married) should be given. Often 9, 99 or 999 are used for missing data. The UK Data Service suggests ‘99=not recorded’, ‘98=not provided (no answer)’, ‘97=not applicable’, ‘96=not known’, ‘95=error’

For derived variables (e.g. Body Mass Index, an SF-36 domain) the formula or algorithm used should be given or referenced. 

Special care should be taken if dates are included in the dataset, and the format should be described.
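As an illustration of the elements listed above, the Python sketch below represents two data dictionary entries as plain dictionaries: one coded survey variable and one derived variable. The variable names, question wording and field names are hypothetical. The missing-data codes follow the UK Data Service suggestion quoted above, and the Body Mass Index formula is the standard weight in kilograms divided by height in metres squared.

```python
# Missing-data codes as suggested by the UK Data Service
MISSING_CODES = {
    99: "not recorded",
    98: "not provided (no answer)",
    97: "not applicable",
    96: "not known",
    95: "error",
}

# Hypothetical entry for a coded survey variable
general_health = {
    "name": "genhlth",
    "label": "Self-rated general health",
    "question_text": "In general, how would you rate your health?",
    "measurement_level": "ordinal",
    "format": "number",
    "valid_codes": {1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor"},
    "missing_codes": MISSING_CODES,
}

# Hypothetical entry for a derived variable, with its formula documented
bmi_entry = {
    "name": "bmi",
    "label": "Body Mass Index",
    "measurement_level": "ratio",
    "format": "number",
    "derivation": "weight_kg / (height_m ** 2)",
}


def compute_bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / (height_m ** 2)


def describe(entry, code):
    """Look up the meaning of a code, separating missing from valid codes."""
    if code in entry.get("missing_codes", {}):
        return "missing: " + entry["missing_codes"][code]
    return entry.get("valid_codes", {}).get(code, "undocumented code")
```

For example, `describe(general_health, 97)` returns "missing: not applicable", while a code that appears in neither list is reported as undocumented and should be investigated before deposit.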

Questionnaires

If PAPI (Paper and Pencil Interviews) has been used, the questionnaire should be included. For CAPI (Computer Aided Personal Interviewing) and CATI (Computer Aided Telephone Interviewing), the question wording should be supplied with notes for branched questions (i.e. questions that depend on a positive answer to a previous question). For CASI (Computer Assisted Self Interviewing) systems such as Survey Monkey, one or more HTML files displaying the questions should be provided.

Methodological Report or Summary Guide 

A methods report or summary guide to the dataset should include the following information:

  • Project: context, problem, objectives, hypotheses. 
  • Method: population, sampling (size, response rate), data collection method. 
  • Data: cleaning, anonymisation, coding, validation, description of recoded or constructed variables, weighting.

Publications

Include any publications associated with the data study such as journal articles, project, summary or technical reports. These often contain important information such as research context and design, data collection methods and data preparation, plus summaries of findings based on the data.

Include blank consent form or Participant Information Leaflet (PIL)

Please include a blank copy of the consent form or a copy of the Participant Information Leaflet (PIL) used with your study as part of the documentation.

See more information about data preparation from the UK Data Service.

The following guidelines for depositing data have been consulted in the preparation of these pages:

CESSDA DMEG

UK Data Service Deposit Data pages

AUSSDA Data Deposit Guidelines

Irish Social Science Data Archive (ISSDA)

James Joyce Library, University College Dublin, Belfield, Dublin 4, Ireland.
E: issda@ucd.ie