Prepare Your Data

Preparing Your Data For Deposit

If your data contains personal data under GDPR, you must ensure that it is anonymised or pseudonymised prior to deposit. Anonymisation/pseudonymisation is the responsibility of the Data Provider.

Definitions

"Personal Data" is defined under Article 4(1) of GDPR as "any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person".

"Anonymisation" of data as defined by the Data Protection Commission "means processing it with the aim of irreversibly preventing the identification of the individual to whom it relates. Data can be considered effectively and sufficiently anonymised if it does not relate to an identified or identifiable natural person or where it has been rendered anonymous in such a manner that the data subject is not or no longer identifiable."

"Pseudonymisation" under Article 4(5) of GDPR "means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person".

Anonymous data is not considered personal data. Pseudonymised data, however, is considered personal data.
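As an illustration of the pseudonymisation definition above, the following minimal Python sketch replaces a direct identifier with a keyed pseudonym. The record fields, the key value and the `pseudonymise` helper are hypothetical and not part of any ISSDA procedure. The key plays the role of the "additional information" that must be kept separately, and the output is still personal data because anyone holding the key can re-link it.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability (an illustrative choice).
    return digest.hexdigest()[:16]


# Hypothetical records; the names and fields are for illustration only.
records = [
    {"name": "Mary Byrne", "age": 34, "smoker": 0},
    {"name": "Sean Kelly", "age": 51, "smoker": 1},
]

# Hypothetical key: in practice it would be randomly generated and
# stored separately from the dataset, under access controls.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Drop the direct identifier and keep only the pseudonym.
pseudonymised = [
    {"pid": pseudonymise(r["name"], SECRET_KEY), "age": r["age"], "smoker": r["smoker"]}
    for r in records
]
```

Replacing names alone does not address indirect identifiers such as exact ages or locations, so a step like this would normally be only one part of preparing data for deposit.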

Further guidance on anonymisation and pseudonymisation is available from the Data Protection Commission.

Information and guidance on anonymising data is available from:

UK Data Service's anonymising data pages

CESSDA DMEG chapter on anonymisation

UK Anonymisation Network (UKAN), which provides an Anonymisation Decision-making Framework available on their website.

A cleaned database is one that only has valid codes for each variable.  This means that each code in the dataset must be described in the data dictionary or questionnaire.  Note that there may be codes used in the dataset that are not mentioned on the questionnaire.   There should be no numerical measures with out of range codes.  Missing data codes must be explicitly stated.  Data should also have been checked, as far as possible for internal consistency (for example a never-smoker should not have a cigarette consumption field completed).
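The checks described above (documented codes only, no out-of-range values, explicit missing codes, internal consistency) can be sketched in Python as follows. The variable names, code values and the `check_row` helper are assumptions made for illustration; a real check would be driven by the study's own data dictionary.

```python
# Hypothetical data dictionary: each variable has either a set of
# documented discrete codes, or a numeric range plus special codes.
DATA_DICTIONARY = {
    "smoker": {"codes": {0: "Never smoked", 1: "Current smoker", 9: "Missing"}},
    "cigs_per_day": {
        "range": (0, 100),
        "special": {888: "Not applicable", 999: "Missing"},
    },
}


def check_row(row_number, row):
    """Return a list of problems found in one record."""
    problems = []
    for variable, spec in DATA_DICTIONARY.items():
        value = row.get(variable)
        if value is None:
            problems.append(f"row {row_number}: {variable} is absent")
        elif "codes" in spec:
            if value not in spec["codes"]:
                problems.append(f"row {row_number}: {variable}={value} is not a documented code")
        else:
            low, high = spec["range"]
            if value not in spec["special"] and not (low <= value <= high):
                problems.append(f"row {row_number}: {variable}={value} is out of range")
    # Internal consistency: a never-smoker should not report consumption.
    if row.get("smoker") == 0 and row.get("cigs_per_day") != 888:
        problems.append(f"row {row_number}: never-smoker has cigarette consumption recorded")
    return problems


rows = [
    {"smoker": 1, "cigs_per_day": 20},   # clean
    {"smoker": 0, "cigs_per_day": 10},   # inconsistent
    {"smoker": 2, "cigs_per_day": 250},  # undocumented code and out-of-range value
]

for number, row in enumerate(rows, start=1):
    for problem in check_row(number, row):
        print(problem)
```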

Ideally the database should include long descriptive labels for each variable and labels for each discrete variable value.  In SPSS these would be created using the VALUE LABELS and VARIABLE LABELS commands.

Ensure your data is in a file format accepted by ISSDA. Ideally the dataset format should be based on a commonly used package (e.g. SPSS, STATA, SAS). We also recommend depositing data in multiple formats, for example SPSS plus an open standard, so that the largest number of users can access the data. Note that ISSDA does not encourage submission of datasets in Excel files.

Type of Data Preferred Formats Acceptable Formats
Quantitative Datasets (statistical file formats)

● SPSS Portable (.por) 

● Tab-delimited file (.tab) with setup file (for SPSS, Stata, SAS, etc.)

● SPSS (.sav) 

● STATA (.dta) 

● SAS (.7bdat; .sd2; .tpt) 

● CSV (.csv)

Documentation (text documents) 

● PDF/A (.pdf) 

● PDF (.pdf) 

● ODT (.odt) 

● MS Word (.doc, .docx) 

● RTF (.rtf)

Please see the ISSDA File Format Policy for further information.
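For depositors who want to supply an open, delimited copy alongside a statistical package file, the following Python sketch writes the same records as tab-delimited and CSV text using only the standard library. The records, field names and the `to_delimited` helper are hypothetical, and the sketch does not produce the setup file that accompanies a tab-delimited deposit.

```python
import csv
import io


def to_delimited(rows, fieldnames, delimiter):
    """Serialise records as delimited text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=delimiter)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical, already anonymised records.
rows = [
    {"pid": "a01", "age_group": "30-39", "smoker": 1},
    {"pid": "a02", "age_group": "50-59", "smoker": 0},
]
fieldnames = ["pid", "age_group", "smoker"]

tab_text = to_delimited(rows, fieldnames, "\t")  # content for a .tab file
csv_text = to_delimited(rows, fieldnames, ",")   # content for a .csv file
```

When saving either string to disk, opening the file with `newline=""` leaves the line endings produced by the `csv` module unchanged.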

Gather together your documentation. Include all documentation that describes the research data and the context of the study, including the objectives, methodology, fieldwork, variables and summaries of findings, which will assist in the understanding and re-use of the data.

Codebook or Data Dictionary

The data dictionary is a central document that describes the different datasets being deposited, the sample size in each and the storage format (e.g. SPSS, SAS, Excel).  For each database the data dictionary will list each variable, usually in the order in which it appears in the dataset, giving the variable name, the variable label, and a copy of the exact wording used to elicit the information.  This may be available from the questionnaire but should be repeated in the data dictionary. For derived variables (e.g. Body Mass Index, an SF-36 domain) the formula or algorithm used should be given or referenced.

For each variable the data dictionary should list the valid codes and their meaning.  Missing value codes should be identified, as should the codes used for 'irrelevant' data (e.g. date of marriage for someone who was never married).  Often all '9's are used for missing data and all '8's for irrelevant data.  The cleaned database should only contain codes that are identified in the data dictionary.  Note that special care should be taken if dates are included in the dataset, and the date format should be described.
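One possible way to hold a data dictionary entry in a structured form is sketched below in Python. The `VariableEntry` class, its field names, the example variable and its codes are all hypothetical, chosen only to show valid, irrelevant and missing codes being listed together and compared against the values found in a dataset.

```python
from dataclasses import dataclass, field


@dataclass
class VariableEntry:
    """One variable's entry in a data dictionary (illustrative structure)."""
    name: str
    label: str
    question_wording: str
    valid_codes: dict = field(default_factory=dict)
    irrelevant_codes: dict = field(default_factory=dict)
    missing_codes: dict = field(default_factory=dict)


def render_entry(entry):
    """Format an entry as plain text for inclusion in a codebook."""
    lines = [
        f"{entry.name}: {entry.label}",
        f"  Question wording: {entry.question_wording}",
    ]
    for code, meaning in sorted(entry.valid_codes.items()):
        lines.append(f"  {code} = {meaning}")
    for code, meaning in sorted(entry.irrelevant_codes.items()):
        lines.append(f"  {code} = {meaning} [irrelevant]")
    for code, meaning in sorted(entry.missing_codes.items()):
        lines.append(f"  {code} = {meaning} [missing]")
    return "\n".join(lines)


def undocumented_codes(entry, values):
    """Return codes present in the data but absent from the entry."""
    documented = set(entry.valid_codes) | set(entry.irrelevant_codes) | set(entry.missing_codes)
    return sorted(set(values) - documented)


# Hypothetical variable, following the 8 = irrelevant, 9 = missing convention.
entry = VariableEntry(
    name="spouse_employed",
    label="Whether spouse is in paid employment",
    question_wording="Is your spouse currently in paid employment?",
    valid_codes={1: "Yes", 2: "No"},
    irrelevant_codes={8: "Not applicable (never married)"},
    missing_codes={9: "Missing"},
)

print(render_entry(entry))
print(undocumented_codes(entry, [1, 2, 8, 9, 3, 1]))
```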

The data dictionary should also include a description of how the data were anonymised and list the variables (on the questionnaire) not included in the database, or variables which were altered to ensure anonymity (e.g. age groups instead of exact ages).
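The example of replacing exact ages with age groups can be sketched in Python as below. The band width of 10 years and the open-ended top band at 80 are illustrative assumptions, not ISSDA requirements; appropriate bands depend on the study and its disclosure risk.

```python
def age_group(age: int, width: int = 10, top_code: int = 80) -> str:
    """Map an exact age to a band label, with an open-ended top band."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // width) * width
    # Cap the upper bound so the last closed band never overlaps the top band.
    upper = min(lower + width - 1, top_code - 1)
    return f"{lower}-{upper}"


# Hypothetical exact ages, replaced by bands before deposit.
ages = [34, 51, 79, 80, 93]
banded = [age_group(a) for a in ages]
```

Whatever banding is chosen, the data dictionary would record both the original variable and the rule used to derive the grouped one.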

Questionnaires

If PAPI (Paper and Pencil Interviewing) has been used, the questionnaire should be included.  For CAPI (Computer Aided Personal Interviewing) and CATI (Computer Aided Telephone Interviewing) the question wording should be supplied with notes for branched questions (i.e. questions that depend on a positive answer to a previous question).  For CASI (Computer Assisted Self Interviewing) systems such as Survey Monkey, HTML file(s) displaying the questions should be provided.

Publications

Include any publications associated with the data or study, such as journal articles and project, summary or technical reports. These often contain important information such as research context and design, data collection methods and data preparation, plus summaries of findings based on the data.

Include blank consent form

Please include a blank copy of the consent form used with your study as part of the documentation. The blank consent form will be archived with the data but will not be made available to the End User unless they request to see it.

See more information about data preparation from the UK Data Service.

Irish Social Science Data Archive (ISSDA)

James Joyce Library, University College Dublin, Belfield, Dublin 4, Ireland.
E: issda@ucd.ie