Data Service Infrastructure for the Social Sciences and Humanities

Final Report Summary - DASISH (Data Service Infrastructure for the Social Sciences and Humanities)

Executive Summary:
DASISH brings together all five research infrastructure initiatives in the social sciences and humanities (SSH) domain and even goes beyond. All five initiatives made big progress in the preparatory phase addressing their own specific issues technologically as well as organisationally. During the 18 months that DASISH has operated the individual infrastructures have matured even further and legal entities either have been established or are well under way. Through DASISH the infrastructures have profited from cross-fertilization.

Concept and Idea - General View:

In the Social Sciences and Humanities research is increasingly driven by the availability of a variety of digital resources. This not only includes primary data such as audio/video recordings or collections of digitized primary sources in the humanities area and the results of surveys as in the Social Science area, but also the secondary data that is derived from these sources as well as the continuous enrichments resulting from multiple annotations for example. All these data need to be made persistently available to the users mostly via computer environments that implement discipline specific workflows and traditions.

Research in social sciences and arts and humanities has created a large quantity of digital material that represents a significant investment, both in terms of public funding and of intellectual effort. Given the current lack of infrastructure for sustaining this material, these resources are often hosted in their home institutions using a variety of approaches and technologies. This situation incurs a number of risks. Access to legacy resources may be limited to a simple download or by browser access in a website; in neither case does this facilitate advanced research services that will be more commonly used in future digital research such as mash-ups or data/text mining.

The enhanced visibility and re-usability of digital resources and tools/services is at the heart of DASISH. Only in this way can we ensure long-term interest in, and sustainability of, these resources. The quality of data and metadata in SSH often depends on human effort. In CLARIN, DARIAH, CESSDA, SHARE and ESS it is widely agreed that data need to be made available via networks of community-based solutions, organized and sustained by centres of expertise within the communities, which can provide the stability and reliability that researchers require.

Each of the SSH initiatives has defined the kind of services that they need to implement to satisfy the communities and each established requirements for their centres to get the infrastructure and their services operational.

Project Context and Objectives:
b. A description of the work performed since the beginning of the project and the main results achieved so far

The last 18 months have been very productive. All work packages got off to a good start and have performed as planned, with only slight deviations from the original work plan. Of the twelve deliverables due within the period, only a couple have been delayed, and these are being delivered before the end of the periodic reporting.

In some cases it has been relevant to move tasks from one partner to another to make the whole consortium perform better. These changes have also happened very smoothly. WP1 has been responsible for administration, coordination and reporting. Everything is on schedule and the partners are working together well. A central element in the governance of DASISH is the executive board, consisting of all WP leaders. The executive board has met seven times, four times face to face and three times in virtual meetings. The strategic board has met twice face to face, and the advisory board has met once face to face.

In WP2 a major result is the State of the Architectures report which gives an overview of the present situation. This report is the outcome of Task 2.1. Tasks 2.2 and 2.3 have begun their work and are progressing well.

In WP3 all three tasks have begun their work. They are progressing as planned and the outcomes of the work are scheduled for the next term.

WP4 consists of four tasks. During the first year of the project, work in WP4 was concentrated on Tasks 4.1 and 4.2, which resulted in two deliverables: a publishable set of guidelines for data management, data preservation and data curation (D4.1), and an overview of available models and solutions with recommendations based on these guidelines (D4.2). The two remaining tasks have begun their work and are progressing towards deliverables due later on.

WP5 consists of six tasks. Work has begun in five of them. Task 5.5 is scheduled to start in month 18. All deliverables from WP5 are scheduled after month 18. Work has started on some of the deliverables and some are nearly completed.

WP6 consists of three tasks. The work in the first period has been concentrated on Tasks 6.1 and 6.2, and work on the third task has only just begun. Of three deliverables due at month 18, two are ready for submission with one month’s delay, while the third has received a minor correction and will follow soon. The three deliverables are "Report about New IPR Challenges", "Sample Merged Paradata Sets" and "Midterm Report on the Establishment of Virtual Centres".

WP7 consists of two tasks. Both tasks have started their work. One training module has been produced on “Access Policies and Licensing” and a workshop on this topic has been held.

WP8 contains six tasks. Work on five of them has begun and two are actually completed. Three deliverables have been completed (Dissemination strategy, Web-site up and running and Qualitative Data Workshop Report). Work on three more deliverables has begun. Two workshops have been organized on qualitative and quantitative data respectively; another quantitative data workshop will follow.

c. The expected final results and their potential impact and use (including the socio-economic impact and the wider societal implications of the project so far)

DASISH combines all ESFRI projects in the SSH, and hence establishes a direct link to some of the most renowned researchers in SSH in Europe and beyond. The collaboration of these ESFRI projects in DASISH enables the development and co-evolution of a consistent and pivotal eco-system of research infrastructures - employing shared infrastructure systems where feasible and supporting each other in the creation of disciplinary virtual research environments. The collaboration across SSH in DASISH facilitates the structuring of the scientific community and therefore plays a key role in the construction of an efficient research and innovation environment. DASISH contributes to (1) fostering standards for the interoperability and sharing of research data; (2) coordination and pooling of resources in the SSH through interacting digital research infrastructures; as well as (3) supporting researchers in the adequate creation, management, curation and publication of research data.

Project Results:
WP3
D3.7 – Report on keystroke analysis and implications for field work

This deliverable reports on how keystroke data can be analysed to inform fieldwork. Keystroke and time stamp analysis is a new field of analysis which offers lots of opportunities. Time measures are a valuable tool during all phases of the survey lifecycle to inform survey managers - before fieldwork for developing the questionnaire, during fieldwork to check the data quality and post fieldwork for data quality analysis. The data become especially informative when being augmented with respondent characteristics, other data quality indicators such as item nonresponse or when being used for evaluating interviewer performance. We suggest analysing time stamps and keystroke as part of the quality control process of a survey.

Paradata such as keystroke and time stamp data are raw data. Hence, time for preparation, data cleaning and outlier diagnostics is needed. The automatic recording of time measures in ESS-CAPI and SHARE avoids rounding error caused by interviewers. In general, it is important first to evaluate the quality of the original paradata, which will then be used to analyse the data quality of the survey answers.

The analysis of time measures for cross-national flagship surveys like the ESS and SHARE revealed some similarities that seem to go beyond survey-specific peculiarities. In our comparisons on interview length we identified a similar cross-national pattern for both surveys. Therefore, linguistic and country-specific influences need to be taken into account when using time measures for data quality assessment or fieldwork monitoring. Insights from fieldwork analysis can be used to provide guidelines for other surveys which do not yet provide or use keystroke and time stamp data themselves.

Summary

This report discusses keystroke analyses in SHARE and the ESS and the implications for field work. Keystroke and time stamp data as part of paradata are key data for analysing data quality in survey production. Keystroke data and paradata are useful tools for informing survey management in several areas and throughout various steps of the survey process. Information about the time used provides additional insight into survey data to inform survey managers. We report on analyses conducted during the survey lifecycle of SHARE and after the survey fieldwork in ESS and derive suggestions for further potential analysis. Thereby, this report provides an overview of keystroke and time stamp analysis and its use for fieldwork decisions.

The collection of paradata and its data quality is an important but little-discussed topic. Automatic capturing of time provides many more possibilities than manual collection in a PAPI (paper-and-pencil interviewing) questionnaire. Information collected in PAPI can be prone to rounding errors, since most interviewers give estimates of time, which shows up as peaks at 30, 45 or 60 minutes. Automatic capturing of time eliminates the rounding error, but it is still far from error-free. The preparation of the raw data and also the analysis require advanced skills on the part of the data analysts and researchers. Data preparation from the raw keystroke data to the final data involves many decisions for editing and cleaning. Keystroke and time stamp data are available for SHARE and ESS data. The SHARE data are comprehensive and available for project-specific use upon request. The ESS data are less extensive, but publicly available on the ESS website.
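The rounding pattern described above can be checked with a simple heaping measure. The following is a minimal sketch, not the procedure used in the report: the function name, the choice of heap points and both sample series are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def heaping_share(durations_min, heap_points=(30, 45, 60)):
    """Return the share of durations that fall exactly on a heap point.

    The heap points (30, 45, 60 minutes) follow the peaks named in the text;
    everything else here is an illustrative assumption.
    """
    if not durations_min:
        raise ValueError("no durations given")
    counts = Counter(durations_min)
    heaped = sum(counts[point] for point in heap_points)
    return heaped / len(durations_min)


# Hypothetical interviewer-estimated durations (minutes) from a paper questionnaire
papi = [30, 45, 60, 45, 30, 60, 45, 52, 45, 60]
# Hypothetical automatically recorded durations for comparable interviews
capi = [31, 44, 58, 47, 29, 61, 43, 52, 46, 63]

print("PAPI heaping share:", heaping_share(papi))
print("CAPI heaping share:", heaping_share(capi))
```

A markedly higher share for the manually reported series than for the automatically recorded one would be consistent with interviewer rounding.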

The use of keystroke and time stamp analysis can be valuable before, during and after fieldwork. In the pre-test phase, the data can be analysed in manifold ways to inform decision-making on questionnaire changes. The length of the questionnaire imposes a burden on the respondent. Longer surveys also usually imply higher survey costs. Analyses of interview length for the panel and the refreshment samples in SHARE monitor the development of interview length over waves and how changes to the questionnaire might affect the overall interview length. In combination with item non-response, analyses based on pre-test data can support decisions on the inclusion, change or exclusion of items. Time measures, especially on item level, are a useful tool for informing questionnaire development.

Analysing paradata during fieldwork provides regular, valuable insights into interviewer performance, which can be fed back to the survey agencies and interviewers. The focus of ESS and SHARE as cross-national surveys is on the cross-country perspective. Part of the variation in length is due to linguistic differences between countries; other differences might be due to non-standardised interviewing. Different survey management strategies, interviewer training styles and survey climate might cause unwanted variation in international data collection.

Fieldwork monitoring in SHARE comprises a broad range of indicators on different levels. They range from a broad perspective of interview length across all countries to a very detailed level of investigation of item length per interviewer. The analysis of interview length can be a good tool to check for interview abnormalities. Interviews of very short or very long duration deviate from the standardised interview. Further investigation of very short interviews might provide an indication of fraud affecting the whole interview, or of respondent response styles such as satisficing or straight-lining. Further investigation of these interviews is recommended. Looking at interview length, the duration of modules and questions can guide survey managers to cases which need further investigation, e.g. data quality checks on the actual survey data. The use of paradata can help to estimate whether, for example, introductory texts are read in full. This information can be used for interviewer training and for checks on standardized interviewing. It is important to take the interviewer into account and to disentangle the influence of respondent and interviewer on response styles and overall survey quality. Paradata on interview length and item length can help to analyse the role of the interviewer in the interview process and help to investigate data quality.
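As an illustration of how very short or very long interviews might be singled out for follow-up, the sketch below flags interview lengths that lie outside interquartile-range fences. The report does not state which outlier rule was applied, so the rule, the factor k = 1.5, the function name and the sample data are all assumptions.

```python
import statistics


def flag_unusual_lengths(lengths, k=1.5):
    """Flag interviews whose length lies outside the IQR fences.

    lengths: dict mapping an interview id to its duration in minutes.
    The IQR rule and k=1.5 are illustrative choices, not taken from the report.
    """
    values = list(lengths.values())
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {
        "too_short": sorted(i for i, v in lengths.items() if v < low),
        "too_long": sorted(i for i, v in lengths.items() if v > high),
    }


# Hypothetical interview lengths in minutes, keyed by interview id
lengths = {"a": 58, "b": 62, "c": 60, "d": 55, "e": 65, "f": 12, "g": 61, "h": 140}
flags = flag_unusual_lengths(lengths)
print(flags)
```

Flagged cases are only candidates for further checks against the actual survey data; a short interview on its own does not establish fraud or satisficing.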

Results from the post-survey checks can be used for informed questionnaire development in future surveys. They can provide insights into the fieldwork process and can be used for cross-national and cross-survey comparison. Besides cultural and linguistic differences, differences in the length of the survey or of modules can be an indicator of the difficulty of a topic or of the cognitive ability of the respondent. Combining time measures with respondent characteristics, we could show that education and nationality correlate with interview length. Post-survey analysis of suspiciously short interviews combined with measures like item nonresponse, speeding and shortcutting is a valuable tool for data cleaning.

Looking at the time and date of the interview can provide insights into the fieldwork process. For example, the average number of interviews conducted per interviewer, as well as the percentage of interviewers who conducted their first interview within the first weeks of fieldwork, can be used as an indicator of survey management decisions. The number of interviews conducted during weekday daytime, during the evenings or at the weekend also allows us to learn more about the fieldwork process. Augmented with respondent characteristics, such as occupational status, we can learn more about hard-to-reach populations that are usually underrepresented in surveys and how they are best contacted. This information is of course not limited to ESS and SHARE, but can also be applied to other surveys.
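The breakdown by weekday daytime, evening and weekend can be derived directly from interview start time stamps. The sketch below is illustrative only: the 18:00 evening threshold, the slot labels, the function name and the sample time stamps are assumptions rather than definitions used by ESS or SHARE.

```python
from collections import Counter
from datetime import datetime


def contact_slot(start, evening_from=18):
    """Classify an interview start time into a coarse fieldwork slot.

    evening_from is an assumed threshold (hour of day), not a survey standard.
    """
    if start.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return "weekend"
    if start.hour >= evening_from:
        return "weekday evening"
    return "weekday daytime"


# Hypothetical interview start time stamps
starts = [
    datetime(2013, 3, 4, 10, 15),   # Monday morning
    datetime(2013, 3, 6, 19, 30),   # Wednesday evening
    datetime(2013, 3, 9, 11, 0),    # Saturday
    datetime(2013, 3, 11, 20, 5),   # Monday evening
]
slot_counts = Counter(contact_slot(s) for s in starts)
print(slot_counts)
```

Cross-tabulating these slots with respondent characteristics such as occupational status would be one way to explore when hard-to-reach groups are successfully interviewed.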

Lessons learned
Keystroke and time stamp analysis is a new field of analysis which offers lots of opportunities. Information about the time can be used in multiple phases of the survey lifecycle to inform survey managers. Paradata are a valuable tool for data quality analysis before fieldwork for developing the questionnaire, during fieldwork to check the data quality and also for post-survey quality analysis of the interview process. We suggest analysing time stamps and keystroke as part of the quality control process of a survey.

Paradata such as keystroke and time stamp data are raw data. Hence, time for preparation, data cleaning and outlier diagnostics is needed. The automatic recording of time measures in ESS-CAPI and SHARE avoids rounding error caused by interviewers. It is important first to evaluate the quality of the original paradata, which will then be used to analyse the data quality of the survey answers. In general, we recommend using time measures throughout the whole survey lifecycle if possible. Analysing the time recording as part of quality assessment is valuable during pretest, during fieldwork and after the survey. The example of SHARE shows that time stamps offer good indicators for assessing item characteristics at the pretest as well as for monitoring interviewers during fieldwork. ESS analyses showed the added value of using time measures in combination with respondent characteristics. However, due to the vast amount of raw data, the relevant indicators need to be carefully selected. The analysis of time stamps and keystroke data adds interesting new aspects to the quality assessment of surveys.

The analysis of time measures for cross-national flagship surveys like the ESS and SHARE revealed some similarities that seem to go beyond survey-specific peculiarities. In our comparisons on interview length we identified a similar cross-national pattern for both surveys. Therefore, linguistic and country-specific influences need to be taken into account when using time measures for data quality assessment or fieldwork monitoring. Information on fieldwork can be used to guide survey researchers in the planning of surveys. Insights from fieldwork analysis can be used to provide guidelines for other surveys which do not yet provide or use keystroke and time stamp data themselves.

D3.8 – Fieldwork monitoring application for decentralized surveys

This report describes the development of a prototype portable, standardised fieldwork management system (FMS) for the ESS and SHARE based on SHARE’s existing system. The FMS will eventually consist of two linked components – a mobile application for interviewers to use and a central server (comprising a database, admin interface, and communication technology) accessible to survey agencies, national teams and central coordinators. Under the DASISH project it has been possible to create a prototype of the mobile application and the basic structure needed for a central server. The development of both components is described in this report, although there is a greater focus on the development of the prototype mobile application.

Under the DASISH project, we have been able to develop a prototype mobile application, which contains some of the essential features for fieldwork management. The basic structure needed for a central server has been developed, and it is possible to transfer information between this and the mobile application, using API service protocols. The system does not yet contain all the functionality needed for a fully operational fieldwork management system as envisaged in deliverable 3.6. It will therefore need further work before it is ready to be rolled out to participating countries. To access the mobile application, please use Google Chrome to open the following web link: http://cdata21.uvt.nl/slimfms/

The FMS will consist of a central server and an application for smart phones and small tablet computers. The development of the FMS is part of the collaboration between ESS and SHARE to enhance survey instruments for cross-national fieldwork in Europe, to combine acquired knowledge for developments which have relevance beyond their own survey work, and in the end to increase survey quality by having more standardised approaches, based on mutual efforts.

WP4
D4.3 – List of Recommended Deposit Services for SSH

This report was produced in the context of the project Data Service Infrastructure for the Social Sciences and Humanities (DASISH) work package 4.3 Convergence of Data Services. The goal has been to allow the selection and promotion of high-quality deposit services for researchers in the Social Sciences and Humanities (SSH) and to make suggestions for service improvements.

A survey was sent to 89 persons working at existing and developing data archive services (DASs) in Europe. With one exception, all these DASs have a scope related to at least one of the following infrastructures: CESSDA, CLARIN, DARIAH, ESS, and SHARE. The survey had a response rate of 61%. The respondents are from 54 organisations, 42 of which have a fully functioning DAS, 9 a DAS under development and 3 no plans to set up a DAS.

The survey results reveal that CLARIN and DARIAH are relatively often interconnected, whereas ESS and SHARE are infrastructures with a strong basis in the Social Sciences and therefore are more often related to CESSDA. About one third of the DASs have relationships with two or more ESFRIs.

The maturity of the 42 existing DASs is related to the availability of a mission statement, deposit agreement, code of conduct, and a preservation policy. It appears that in southern Europe the maturity is somewhat lower than in other parts of Europe. The survey results indicate that the DASs from North-Western Europe have generally reached a higher trust level than those from Eastern and Southern Europe (share at trust level 1 and above: approximately 60% and 25% respectively). They also indicate that the maturity rate of DASs within CESSDA is somewhat higher than that of DASs within CLARIN and DARIAH.

The availability of a mission statement within a DAS is strongly correlated with trustworthy activities. Only about half of the respondents mention the existence of a preservation policy. This policy is not always accessible online or available in English. Further, it appears that the majority of the DASs have already implemented deposit and user agreements. A majority of the archives have a long-term preservation strategy, in most cases migration. About half of the DASs are involved in (self-)audit or certification activities intended to increase trustworthiness in the services. The Data Seal of Approval (DSA) is the most commonly used instrument for certification. The majority of services are publicly funded and in nearly all cases the cost for deposit or for access is borne by the DASs.

To gain more insight into the policies and views of DASs, in-depth interviews have been conducted with representatives of six different archives, related to the ESFRIs CESSDA, CLARIN, and DARIAH. Based on the information gathered during this task, a list of high-quality and promising DASs has been composed and suggestions for further improvements of existing DASs have been made.

D4.4 – Comprehensive Policy-Rules for Data - Management in SSH

A vital tool to sustain long-term preservation and accessibility of data is to provide a robust, explicit and declarative set of institutional policy-rules and requirements that can build a solid platform of trust between stakeholders involved in the creation, curation and dissemination of research outputs. A key institution in the stakeholder taxonomy of policy formation and long-term preservation is the data centre as a link between the funders and the researcher and a service centre for long-term preservation and endurable access to research data and documentation. As a key stakeholder data centres have to develop a transparent set of policy-rules and procedures that support internal data management procedures and ensure accountability and allow for external quality control. Accountability and transparency are key factors for creating trust in deposit service providers by funders and researchers.

A comprehensive policy framework should take into consideration both the wider strategic policies of the institution and closely connected network institutions. A preservation policy is a vital tool to establish the boundaries within which an institution operates: it supports the shorter-term management of the institutional activities while also taking into account the longer-term vision of operational activities.

The report assesses a selection of state-of-the-art guidelines and recommendations for the formation of preservation policies, and describes, compares and analyses the scope of policy-rules and the requirements they set for the SSH domain. Based on the assessment of the policy rules and procedures implemented by selected data archives and services, the report goes on to recommend a set of policy-rules covering the full scope of a well-defined preservation policy.

WP5

D5.1 – Trust and PID Services Federation Report

Part A – Trust Federation Report

This report analyses Task 5.1 Establishment of Trust Federation.
The goal for this task is to investigate and set up a functional service of Federated Identity Management (FIM) for the Social Sciences and Humanities (SSH). Such a service would allow users to use their home organization credentials to access services or portals on the world-wide web, distributed over different SSH centers, and to do so using single sign-on (SSO), so that they have to sign in only once during a session.

The work in this task was executed by the MPI-PL as the coordinator and main contributor, together with the DASISH partners UKDA and GESIS, who served as representatives of the different partner infrastructures and contributed infrastructure expertise and insights. This group had to (1) explore the current landscape with respect to FIM for the SSH, (2) discuss and collaborate with relevant existing trust-federation solutions, and (3) draw conclusions about the best other options.

Other DASISH partners contributed by brokering contacts with experts and relevant DASISH centers.

Part B – PID Services Report

In DASISH (www.dasish.eu) T5.2 the aim is to promote the usage of Persistent Identifier (PID) services at data centres within the DASISH communities (CESSDA, CLARIN, DARIAH, ESS, and SHARE), all within the Social Sciences and Humanities. The reason to encourage the use of globally unique persistent object identifiers is that too often individuals are confronted with objects that are no longer traceable and accessible on the Internet. In most cases the reason for the disappearance of objects is the absence of a policy for sustainable access within the organisation responsible for the production of these objects. Without such a policy, which should include the use of PIDs for unique and sustainable identification, objects may be deleted or re-located without alerting users. The necessity of using globally unique Persistent Identifiers (PIDs) should therefore be obvious to researchers and research organisations as well as to people responsible for data archives and repositories. These identifiers are crucial for the advancement of science, as they are (or should be) coupled to policies on sustainable access. The need for PIDs has driven the development of PID systems, for example the Handle System.

There is a need for additional services coupled/related to the registration of PIDs for objects, for example services for PID registration and possibilities to store descriptions of the objects in central PID metadata repositories. However, user requirements for PID service providers may differ within scientific or scholarly disciplines. Consequently, it is important to assess these requirements within the different communities to find out if it is possible to arrive at a general, widely accepted list of requirements. To find out what these requirements are a survey was conducted among the data centres and communities of the SSH infrastructures composing DASISH (Questionnaires in Appendix C & D).

This report focuses on the answers from these surveys, and includes a more thorough description and comparison of three of the most commonly used PID services within the communities. The descriptions of the PID services have been verified by the service providers to ensure their correctness. The PID service providers are analysed and compared in the light of the requirements derived from the surveys of the DASISH communities (see Conclusions chapter 8.3). One desirable outcome of DASISH T5.2 would be if these PID service providers would take action to further improve their services based on this analysis and the recommendations derived from it.

The report also discusses PID service functionality beyond the simple resolution of a digital object (DO) URI. Recent developments in thinking about data management for research data have also indicated a need for some tightly coupled metadata linked directly with the object identifier, for instance checksums used for integrity checking of data objects. PID services directly supporting such tightly coupled metadata can be considered advantageous.
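To make the idea of tightly coupled metadata concrete, the sketch below stores a checksum alongside a PID and uses it later to verify a retrieved object. It does not reproduce the API of any actual PID service: the record layout, the function names, the example PID and URL, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_pid_record(pid: str, url: str, data: bytes) -> dict:
    """Build a hypothetical PID record that carries a checksum with the identifier."""
    return {
        "pid": pid,
        "url": url,
        "checksum_algorithm": "sha256",
        "checksum": sha256_hex(data),
    }


def verify(record: dict, data: bytes) -> bool:
    """Check that the retrieved bytes still match the checksum in the record."""
    return sha256_hex(data) == record["checksum"]


# Hypothetical data object and identifier, for illustration only
original = b"respondent_id;duration_min\n1;58\n2;62\n"
record = make_pid_record("example-prefix/0001", "https://repository.example/objects/0001", original)

print(verify(record, original))            # object unchanged
print(verify(record, original + b"3;12"))  # object modified after registration
```

Because the checksum travels with the identifier, a user who resolves the PID can detect whether the object has been altered or corrupted since registration.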

The report will also serve as a status snapshot of the major players in current EU PID service landscape, indicating the availability and

D5.2A & D5.2B

Part A – Metadata Quality Improvement

The aim of this task was to analyse and compare the different metadata strategies of CLARIN, DARIAH and CESSDA, and to identify possibilities for cross-fertilization so that the infrastructures can benefit from each other's solutions where possible. To better understand at which stages of the research lifecycle metadata comes to the fore, we looked at several research data lifecycles and business process models. However, the current research data lifecycle models take the ‘static’ data object as their basis, whereas metadata design, redesign, creation and management can continue to be ‘live’ issues within the research lifecycle. We therefore developed a metadata lifecycle based closely on familiar lifecycle models but extended to support the more dynamic metadata issues.

To describe the metadata management of the different infrastructures we took a twofold approach. On the one hand, we looked at a more general level and outlined the policies and strategies regarding metadata of the three infrastructures, and evaluated these strategies on metadata quality issues using the Bruce and Hillmann criteria. On the other hand, we looked in more detail at how the work on metadata management is done by the individual data repositories.

The infrastructures of CESSDA, CLARIN and DARIAH differ in visions, strategies and initiatives regarding metadata issues; similarly there is a difference in metadata management among the various repositories. Despite these differences, cross fertilisation by coordination on common lists of metadata elements, sharing of knowledge, and linking resources would leverage the overall metadata quality. Evaluation of the prototype of the joint CLARIN, DARIAH and CESSDA metadata portal endorses the opinion that more coordination is needed.

Metadata quality must be discussed in relation to the activities for which they are used. We suggest that the infrastructures DARIAH and CLARIN prioritise future collaboration about standardisation efforts, which have already been initialised in dialogue between the CLARIN Standards Committee and the DARIAH representatives. Similar initiatives could be established with CESSDA.

Part B – Portal Progress report

This document reports on the progress of the DASISH Joint Metadata Domain (JMD), Task 5.4.

Partners in this task group were: DANS, GESIS, MPI-PL, OEAW and UGOT. As a task division, MPI-PL was responsible for the task coordination and the technical infrastructure (catalogue software, metadata harvesting etc.) while the other partners contributed with their expert knowledge especially on the metadata infrastructure used in their respective communities: DANS & OEAW for DARIAH, GESIS and UGOT for CESSDA, MPI-PL for CLARIN.

The Joint Metadata Domain was implemented as an SSH metadata catalogue, filled with metadata harvested from metadata-providing centers of the three participating infrastructures. The harvested metadata is mapped onto a set of community-discussed facets and the facet values are normalized. In addition to extending and configuring the catalogue software and developing the mapping and normalization software, considerable effort went into finding and documenting suitable metadata providers as well as mapping and normalization rules.

The original proposal in the DASISH Description of Work (DoW) also recommended using RDF technology. However, having studied the problem further and finding which technologies were already at our disposal, we decided to use semantic mappings that are implemented as schema-derived XPath specifications.
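A minimal sketch of what XPath-based facet mapping with value normalization can look like is given below. It is not the JMD implementation: the two schema names, the XPath expressions, the facet names, the normalization table and the sample records are all hypothetical, and the example uses only the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-schema mappings from facet name to XPath expression
FACET_XPATHS = {
    "schema_a": {"title": "./description/title", "language": "./description/language"},
    "schema_b": {"title": "./study/citation/name", "language": "./file/lang"},
}

# Hypothetical normalization table for the "language" facet
NORMALIZE_LANGUAGE = {"english": "eng", "en": "eng", "eng": "eng",
                      "german": "deu", "de": "deu", "deu": "deu"}


def to_facets(xml_text, schema):
    """Map one harvested record onto the shared facets and normalize values."""
    root = ET.fromstring(xml_text)
    facets = {}
    for facet, path in FACET_XPATHS[schema].items():
        node = root.find(path)
        facets[facet] = node.text.strip() if node is not None and node.text else None
    language = facets.get("language")
    if language is not None:
        # Unknown values are kept unchanged so that they can be inspected later
        facets["language"] = NORMALIZE_LANGUAGE.get(language.lower(), language)
    return facets


record_a = "<record><description><title>Corpus X</title><language>English</language></description></record>"
record_b = "<record><study><citation><name>Survey Y</name></citation></study><file><lang>de</lang></file></record>"

facets_a = to_facets(record_a, "schema_a")
facets_b = to_facets(record_b, "schema_b")
print(facets_a)
print(facets_b)
```

Keeping the mappings as data rather than code means that a new metadata schema can be supported by adding one more entry of XPath expressions, which is one plausible reason for preferring this approach over a heavier technology stack.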

We decided on an approach using proven technology to look for and investigate the availability of SSH metadata that, according to our information, should exist, rather than experimenting with metadata search technologies. This approach provides a tool that can be used to inspect the metadata of the participating infrastructures. Our reasoning is that reporting on the availability of metadata is an important part of this task.

When initially confronted with the task to obtain information and documentation about existing metadata providers and metadata schemas used in the SSH, it turned out to be a more difficult task than predicted. In the end we managed to get a sufficient overview of the available metadata within the three research infrastructures: CESSDA, CLARIN and DARIAH.

D5.3 – Workflow Requirements and Application Report

One of the aims of the DASISH project was to identify typical cross-disciplinary workflows that are candidates for automatic processing chains, to study their requirements and to implement a number of demonstration cases.

Social Sciences and Humanities (SSH) research has used computers to assist text analysis work since the time of punch cards. The more recent emergence of terms like Digital Humanities, Computational Social Sciences, Culturomics or Big Data Humanities, Arts, and Social Sciences is further evidence of the interest in text analysis tools in fields such as linguistics, literature, psychology, political science, economics, scientometrics and bibliometrics, sociolinguistics, history, management, education and communication. Although there is some variation in how these disciplines refer to what they do ("text analysis", "distant reading", "content analysis", "text mining" or "text analytics"), it is clear on closer analysis that they are all referring to the extraction of information from texts with the assistance of software tools.

We first conducted a thorough survey of research papers and project descriptions, with the objective of identifying the kinds of software tools that are common to the different SSH disciplines. We have proposed the common tools found as a typical automated e-Research workflow for scholars working with texts. On this basis, DASISH can now offer a discipline-neutral typical workflow, deployed as a web service-based web application, for demonstration purposes. Finally, this demonstration has been used to ask researchers about requirements for the future deployment of tools to support their workflows.

Research in SSH very often involves text analysis to find evidence in terms of particular words that appear in texts. For instance, proper nouns identify entities that can be plotted on maps when they are geographical locations, or counted separately if they correspond to males or females for gender-related queries. The occurrence and frequency of other types of words can help to assess public opinion, to trace events through time, etc. When large quantities of text have to be studied, the use of automatic means becomes a necessity. This is the origin of the deployed workflow for Named Entity Recognition (NER).

Research also involves compiling the texts, converting them to a suitable format and character encoding, cleaning out non-linguistic elements, segmenting and tokenizing them to identify words, and applying tools that recognize the particular type of words sought. Therefore, WP5.5 has also deployed a workflow that prepares texts for such processing.

Named-entity recognition (NER) tools are able to identify proper nouns and other expressions in text that have a unique reference and classify them into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. In the raw results, Named Entities found in a text are marked with "//TAG". Recognized entities are tagged as follows: person (tag NP00SP0), place (tag NP00G00), organization (tag NP00O00), and others (tag NP00V00).
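To show how such tagged output could be consumed, the following Python sketch collects the marked entities and translates the tags into the four categories listed above. The exact layout of the raw output is an assumption: we assume here that each entity token is immediately followed by "//" and its tag, and the sample sentence is made up for illustration.

```python
import re

# Tag-to-category table taken from the description of the NER output.
TAG_CATEGORIES = {
    "NP00SP0": "person",
    "NP00G00": "place",
    "NP00O00": "organization",
    "NP00V00": "other",
}

# Assumed layout: an entity token directly followed by "//" and one of the tags.
ENTITY_PATTERN = re.compile(r"(\S+?)//(NP00(?:SP0|G00|O00|V00))\b")


def extract_entities(tagged_text):
    """Return (entity, category) pairs in the order they occur in the text."""
    return [
        (match.group(1), TAG_CATEGORIES[match.group(2)])
        for match in ENTITY_PATTERN.finditer(tagged_text)
    ]


# Made-up example line, not actual tool output.
sample = "Maria//NP00SP0 travelled to Gothenburg//NP00G00 to visit UGOT//NP00O00 ."
for entity, category in extract_entities(sample):
    print(f"{entity}\t{category}")
```

Counting the resulting pairs per category is then enough to support the kinds of queries mentioned earlier, for example listing all places for plotting on a map.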

The input text for the NER tool can be any unformatted UTF-8 text file. If your text is a PDF, HTML or RTF file, first use the "Clean" tool or the "Clean+NER" option. The CLEANING workflow accepts PDF, HTML, RTF and flat text in 17 languages. The output is tokenised (taking account of abbreviations) and segmented into sentences, and is then delivered as flat text or encoded in the TEI P5 (Text Encoding Initiative) standard format.

The workflows are not only run without any human intervention, but are also automatically assembled from a pool of registered tools when a request is received, using the file format and language of the input and output as constraints.
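The idea of assembling a workflow from registered tools under format and language constraints can be sketched as a shortest-path search over formats. The Python example below is only an illustration of that idea, not the deployed system: the Tool record, the registry contents and the format names are hypothetical, and a breadth-first search is our own choice of technique.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A registered tool: what it reads, what it writes, which languages it handles."""
    name: str
    input_format: str
    output_format: str
    languages: frozenset


# Hypothetical registry; names, formats and language codes are made up.
REGISTRY = [
    Tool("pdf-to-text", "pdf", "text", frozenset({"en", "da"})),
    Tool("tokeniser", "text", "tokens", frozenset({"en", "da"})),
    Tool("ner", "tokens", "entities", frozenset({"en"})),
    Tool("tei-writer", "tokens", "tei", frozenset({"en", "da"})),
]


def assemble(input_format, output_format, language, registry=REGISTRY):
    """Breadth-first search for the shortest tool chain between two formats.

    Returns a list of tool names, or None if no chain satisfies the constraints.
    """
    queue = deque([(input_format, [])])
    seen = {input_format}
    while queue:
        current_format, chain = queue.popleft()
        if current_format == output_format:
            return chain
        for tool in registry:
            if (
                tool.input_format == current_format
                and language in tool.languages
                and tool.output_format not in seen
            ):
                seen.add(tool.output_format)
                queue.append((tool.output_format, chain + [tool.name]))
    return None


print(assemble("pdf", "entities", "en"))
print(assemble("pdf", "tei", "da"))
```

In this sketch a request for entities from a Danish PDF yields no workflow, because the hypothetical NER tool is registered for English only; the constraints prune the search rather than being checked after the fact.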

D5.4 – DASISH Web Annotation (DWAN) framework

The availability of digital archives and other research data via the Internet creates new chances for collaboration. Indeed, equipped with special software, researchers from different institutions, countries and fields can work together via the Internet. Such collaboration can take the form of annotating the on-line data and sharing these annotations using an annotation infrastructure. As stated in the task 5.6 description: researchers need to be able to store the results of collaborative intellectual work either as an annotation of a single fragment or in the form of typed relations between a number of fragments.

The aim of this document is to provide a specification of the framework for annotating web documents developed according to the plan for task 5.6. In this context an annotation is a remark on one or more fragments of one or more online documents.

From the technical point of view, the proposed framework consists of one back-end, made up of the server software and the database, and possibly multiple front-ends (clients). The DWAN tools developed within the DASISH project are an instance of the DWAN framework. They consist of the back-end part and the client, which is a significantly adjusted version of the Wired Marker Firefox extension. We chose Wired Marker after a selection process in which we looked for a suitable tool to serve as a general DWAN client for annotating objects on the Web. The selection process is described separately later. The core of the back-end is a database where annotations and information about the corresponding annotated target documents are stored together with the targets’ cached representations. Archiving cached representations in the database is relevant when annotated documents are dynamic pages such as news sites or wiki pages under construction.

A client in the DWAN framework exchanges data with the server by sending REST requests and receiving responses. Client request bodies and server responses take the form of XML. The client is able to accept and send XML structures that obey a pre-defined XML schema. The schema mirrors a data model that has been designed to represent the main data structures involved in constructing annotations.

The work in task 5.6 succeeded in delivering, next to the back-end, one front-end (client) tool for the DWAN framework and, in collaboration with other projects, in integrating two extra client tools. For future work, this task also identified a number of other tools from the Humanities domain that look promising for integration in DWAN. In separate chapters, we present an analysis of the annotation task of DWAN clients in the context of a general overview of Humanities tools and the Humanities research workflow. We also present some user scenarios that could be fulfilled by DWAN, either by the current version or with some future development.
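To make the XML exchange more concrete, the following Python sketch builds an XML body that a client might send and reads the same structure back, as a client would do with a server response. The element and attribute names (annotation, target, href, fragment, body) are hypothetical and do not reproduce the actual DWAN schema; the network call itself is left out.

```python
import xml.etree.ElementTree as ET


def build_annotation_request(target_url, fragment, remark):
    """Serialise one annotation as an XML request body (hypothetical structure)."""
    root = ET.Element("annotation")
    target = ET.SubElement(root, "target", {"href": target_url})
    ET.SubElement(target, "fragment").text = fragment
    ET.SubElement(root, "body").text = remark
    return ET.tostring(root, encoding="unicode")


def parse_annotation(xml_text):
    """Read an annotation in the same hypothetical structure into a dictionary."""
    root = ET.fromstring(xml_text)
    target = root.find("target")
    return {
        "target": target.get("href"),
        "fragment": target.findtext("fragment"),
        "body": root.findtext("body"),
    }


# Example values are made up for illustration.
request_body = build_annotation_request(
    "http://example.org/page.html",
    "second paragraph",
    "Compare with the 2013 survey results.",
)
print(request_body)
print(parse_annotation(request_body))
```

In a real client the request body would additionally be validated against the pre-defined XML schema before being sent.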

WP6

D6.3 – Exemplary analyses of confidential paradata

The deliverable "Exemplary analyses of confidential paradata" of work package 6 "Legal and Ethical Issues" (WP6) of the "Data Service Infrastructure for the Social Sciences and Humanities" (DASISH) project deals with ethical and legal aspects that have to be considered when analysing confidential paradata that is generated in the process of survey production. It builds on and continues the work of deliverable D6.2 ("Sample merged paradata sets") of the DASISH project.

With the increasing use and further development of technological means in the context of survey-based data collection, the amount of information collected about the process of survey production has increased. In particular, in connection with the use of computer-assisted personal interviewing (CAPI) techniques and the implementation of web surveys, much paradata, i.e. micro-level data about the process of survey production, are generated. Furthermore, survey researchers are also increasingly making use of paradata, such as keystroke data or contact protocols.

Paradata are key data for analysing data quality and are used for different purposes, ranging from the evaluation and improvement of survey instruments to a better understanding of respondents and their answers in surveys (cf. Couper and Singer, 2013: 57). Consequently, a strong demand from researchers and, in particular, the survey methodology community to make paradata of surveys available has recently been observed.

In deliverable D6.2 the extent to which this increasing and more structured collection, processing and use of different types of paradata pose legal and ethical challenges for survey researchers has been explored. A central finding of deliverable D6.2 is that legal and ethical issues that are connected to the collection, processing, use and re-use of paradata require a nuanced approach. Since there are different kinds of paradata that can be collected (depending on the survey mode and the technical system in place) and specific kinds of paradata are/can be used for certain analyses only, legal and ethical questions need to be answered on a case-by-case basis (cf. Schmidutz and Bristle, 2013: 16-19).

The current demonstrator takes account of this finding and discusses ethical and legal issues in relation to a few concrete practical examples of paradata usage. In doing so, a special focus of deliverable D6.3 lies on paradata which can be classified as 'confidential data', and on the questions of how these data may be used and made accessible for re-use to researchers of the scientific community. In order to demonstrate how certain ethical and legal aspects in relation to confidential paradata may be considered on a case-by-case basis, three practical examples from the Survey of Health, Ageing and Retirement in Europe (SHARE) are introduced and discussed.

Example 1 focuses on the use of paradata for fieldwork monitoring purposes. It addresses the question of how contact-related paradata about cases in which either no contact with a target person could be established or in which the target person refused to participate in a survey may be used for analyses, although no explicit consent has been obtained from the target persons. Furthermore, the use of item-level time stamp data as an indicator for standardised data collection is addressed. Example 2 concerns the framework conditions under which paradata documenting the outcomes of contacts with target respondents (i.e. information on the cooperation process) may be used as information about interviewers’ work performance as part of research of survey methodological interest. Finally, example 3 discusses the ethical and legal aspects connected to the analysis of keystroke paradata as information about respondents in the context of substantive scientific research, i.e. when being used in order to enhance the data provided by respondents in the course of a survey.

Summary

Only a few researchers have started to address ethical and legal aspects in relation to the collection, use and release of paradata, even though these are still unclear in many cases. Chapter 5 of this demonstrator builds on the theoretical work of deliverable D6.2 (Schmidutz and Bristle, 2013) and tackles ethical and legal questions on a case-by-case basis with respect to specific examples of paradata usage from SHARE (as illustrated in chapter 4).

The outcomes of the ethical and legal considerations with regard to the exemplary analyses of confidential paradata support the finding of Schmidutz and Bristle (2013) that ethical and legal questions cannot be answered in relation to paradata in general but need to be explored on a case-by-case basis. Different ethical and legal aspects have to be considered depending on [1] the specific kind of paradata, [2] the way of their collection, [3] the group of human subjects about which they provide information and [4] the purpose for which they may be used.

With regard to the examples that have been discussed, the following differences within these 4 dimensions could be identified:

[1] Three different types of paradata have been used in the examples: Contact data (contact attempts, outcomes of contacts), item-level time stamp data (based on keystroke data) and interviewer characteristics (interviewer demographics and interviewing experience).

[2] In principle two different ways of paradata collection have been identified: Paradata recorded as a by-product in the course of conducting a survey (i.e. process paradata), and additional paradata obtained separately from external sources or with a specifically targeted effort (i.e. auxiliary paradata). Here, only one external source has been considered, namely the survey agencies, which provided the information on their interviewers.

[3] The different types of paradata have the potential to provide (or reveal) information about different groups of human subjects: Contact information may provide information about sampled target persons, respondents and interviewers. Keystroke data may constitute information about respondents or interviewers, and interviewer characteristics obviously only constitute additional information on interviewers.

[4] Three general fields of paradata usage have been considered: Fieldwork monitoring (which tries to understand and improve, for example, strategies of contacting households, response rates and cooperation rates, and standardised data collection), research of survey methodological interest (concerned, for example, with the cooperation process and interviewers’ work performance) and substantive scientific research (with the purpose of enhancing the data provided by respondents, e.g. on cognitive abilities).

With regard to both use and release of paradata, the two key ethical principles of assuring data subjects’ autonomy and of protecting them from harm have been considered. In addition, some specific legal aspects concerning the examples have been touched upon. For example, with regard to paradata that constitute information on interviewers, it has been highlighted that permissible use and release of certain paradata depends on national data protection and employment legislation as well as on contractual agreements between survey managers and survey agencies or interviewers (cf. chapter 5.2). While making use of paradata is possible in all of the discussed examples, the outcomes with regard to the question of how and under which conditions the different kinds of paradata used in the examples can be released for scientific re-use vary from case to case.

As far as paradata also constitute personal data (besides being data about the process of survey production), when processing these data "particular importance has to be placed on the compliance with European and national/regional data protection law as well as on the safeguarding of sensitive data and confidential information" (Schmidutz and Bristle, 2013: 14-15). In this connection, anonymisation and pseudonymisation are central measures that researchers can take to ensure data confidentiality; with regard to the re-use of sensitive or confidential information, data access and usage restrictions may be considered as additional safeguard measures.

In general, finding out how and under which conditions paradata can/may be released for scientific re-use appears to be the most challenging task in relation to different kinds of paradata, in particular since this issue is closely related to the question of whether paradata may be used in substantive research. Example 3, in which item-level time stamp data are used as information about respondents, clearly shows that all ways in which released paradata could possibly be used should be considered before releasing any paradata. If, with regard to this kind of paradata, only the use as an indicator for standardised data collection (cf. example 1) were considered prior to releasing keystrokes on a micro-level, the fact that sensitive information can be inferred from this kind of paradata may be overlooked.

Especially when paradata include sensitive information about respondents, making them available for re-use only subject to certain special data access and usage restrictions may provide a solution that enables scientific research and at the same time ensures an appropriate level of data protection. This also holds for other kinds of paradata that are classified as confidential for other reasons (such as proprietary considerations, etc.). In general, making confidential paradata available for re-use subject to special data access and usage restrictions appears to be possible in most of those cases in which releasing micro-level paradata to the scientific community or the entire public seems to be problematic. In all cases that have been explored so far, however, aggregated anonymised research results (e.g. from fieldwork monitoring) can be made available publicly.
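As a hedged illustration of what pseudonymisation can mean in practice, the Python sketch below replaces an identifying key with a keyed hash and drops directly identifying fields. It is one possible technique of our own choosing, not a procedure taken from SHARE or from deliverable D6.2; the field names, the sample records and the key are invented, and whether such a measure is legally sufficient has to be judged case by case.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Derive a stable pseudonym from an identifier with a keyed hash (HMAC-SHA256).

    Without the secret key the pseudonym cannot be recomputed from the identifier.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymise_records(records, secret_key, id_field, drop_fields=()):
    """Replace the id field by its pseudonym and remove directly identifying fields."""
    result = []
    for record in records:
        cleaned = {k: v for k, v in record.items() if k not in drop_fields}
        cleaned[id_field] = pseudonymise(str(record[id_field]), secret_key)
        result.append(cleaned)
    return result


# Invented contact paradata, for illustration only.
contact_paradata = [
    {"interviewer_id": "INT-007", "name": "A. Example", "contact_attempts": 3},
    {"interviewer_id": "INT-007", "name": "A. Example", "contact_attempts": 1},
]
key = b"keep-this-key-separate-from-the-data"
released = pseudonymise_records(contact_paradata, key, "interviewer_id", drop_fields=("name",))
print(released)
```

Because the same identifier always yields the same pseudonym, records of one interviewer can still be analysed together, which is what distinguishes pseudonymisation from full anonymisation.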

D6.5 – Handbook on legal and ethical issues for SSH data in Europe, Part II

The use of research data is often restricted by a set of legal regulations and ethical guidelines. Researchers working with data must know the possible legal restrictions that apply to research data, and the potential ethical issues in handling data. The present handbook intends to be a concise introduction to some legal and ethical aspects of working with research data in the Social Sciences and Humanities.

The handbook will present issues related to the legal and ethical regulations relevant for the use of personal data in research. Moreover, it will deal with copyright issues, as well as the ongoing debate connected to the draft of a new general EU data protection regulation. The handbook will provide an overview of these issues in relation to the various process steps in research projects, from preparation and design, to the preservation and the access and reuse of personal data for research purposes.

Currently, personal data in the European Union is protected by domestic implementations of the Data Protection Directive (95/46/EC). The challenges of the Directive from a research perspective are the different ways that each EU country implements or practices the law, and how this has led to an uneven level of data protection and an uneven level of access to individual data for scientific purposes, seriously impeding cross country research.

Consistent with the advisory nature of an EU directive, the member state data laws vary widely. In addition, all member states have established their own unique Data Protection Authorities, compliance structures, notification and approval processes, and other bureaucratic or regulative procedures. In particular, the variations of definitions of personal data, anonymisation and the requirements for legal approval provide different preconditions for the kind of research that is allowed within each country. While some laws offer data subjects at least the Directive’s core protections, some add extra rights with regard to e.g. the use of encrypted or de-identified data, requirements for consent and access to data.

The legal variations, combined with new technological developments and an increase in the amount of personal data that is being collected and processed across Europe, pose new challenges. As a result, the demand for a new coherent general data protection regulation has come to the fore. The need for a consistent legal framework across Europe is one important reason why the EU is currently replacing the data protection directive with a directly applicable regulation. Thus the proposal for a new General Data Protection Regulation (GDPR) and the subsequent practice at the national level is of great interest to the scientific communities and research infrastructures in Europe.

Summary
This handbook has been intended as a concise introduction to some complex legal and ethical issues that researchers in the Social Sciences and Humanities ought to be aware of when using private or copyrighted data.

With respect to copyrighted information, the diversity in law and legal practice across Europe is a barrier for cooperative research and data sharing across borders. As long as there are no Europe-wide exemptions for copying and distributing research data which contains copyrighted information, researchers are therefore advised to check who owns the rights to the data and under which conditions it may be handled and distributed. Unless the data is entirely in the public domain, it may be necessary for researchers to make and observe license agreements with the right holders.

Three main issues have been presented in this handbook regarding the regulation of personal (or private) data: First, there are various definitions of personal data. At one end of the scale, the UK’s legislation refers to data as personal only if the data controller can identify the data subject. At the other end of the scale, in countries like Estonia and Norway, any data concerning an identified person is defined as personal data, regardless of the form or format in which such data exists. Between these two endpoints, Germany and the Netherlands, for example, allow for a more flexible interpretation of what is defined as personal data. For instance, the Dutch Act leaves it open as to who can identify the data subject. Consequently, the various definitions of personal data, which determine whether the processing of data falls within the scope of the privacy regulation or not, have a significant impact on the conditions under which research operates in the various countries. These variations regarding what is considered anonymous data affect the demands for obtaining consent and the possibilities for preservation as well as reuse of the data. As a result, the quality of research data that researchers can access without falling under data protection regulations varies, which in turn makes richer data available for analysis in some countries than in others. This situation represents a severe obstacle for comparative cross-country research in particular.

Second, the patchwork of approval requirements and bodies across Europe constitutes another barrier for research. In some countries, such as the UK, a general notification to the Data Protection Authority is sufficient, whereas in other countries (e.g. Norway and Sweden), it is required to obtain legal approval regardless of whether the processing is based on consent or not. Due to these variations, it is suggested in the proposal for a new EU data protection regulation that obtaining approvals from the Authority should be replaced by a duty to register the processing and, in cases involving specific risks, to conduct a prior consultation of a data protection officer.

Though there is still some uncertainty about when the proposed GDPR will become law, and it is unclear at this point what the final regulation will look like, the intention is to reduce the fragmentation of data protection regimes, including both laws and supervisory systems, across Europe. The concern from the research point of view is that the legal framework conditions for the SSH research domain might become more restrictive. There is a clear tendency in the proposal towards strengthening the right to personal privacy and control of one's own personal data at the expense of researchers' access to personal data.

D6.6 – Report about Preservation Policy-Rules (Preservation Challenges)

In this task we are looking at ethical and legal issues and challenges that confront researchers, data owners and digital repositories in the emerging European data preservation infrastructure environment. Building on Deliverable 6.1 (“Report about new IPR challenges”) and Work Package 4 (“Data Archiving”), the main focus is on data preservation and data sharing, notably issues related to data protection, data ownership and copyright.

Managing privacy and access is an issue of major importance and concern in the current infrastructure landscape, and various procedures to ensure the optimal balance between data protection and data access are being developed and tested among various data producers and research fields across Europe. A vast majority of these models are based on technical solutions (disclosure techniques, remote access and execution models) to protect privacy in the accession process. In this report we identify the problems that may occur in relation to preservation and sharing of sensitive data, and the policy-rules that need to be considered when sensitive data are preserved in a distributed data preservation and curation infrastructure. Further, we will identify and define policies and policy-rule mechanisms that guide the preservation and access rights while maintaining trust. Data protection and copyright issues have a major influence on the possibilities for long-term preservation and data sharing, and we consider the proposed EU General Data Protection Regulation (GDPR) and the ongoing modernisation of the EU copyright framework as key aspects of the emerging infrastructure landscape. Both will have considerable impact on the development of new policies and policy-rules in digital repositories.

The increased focus on legal and ethical issues, constraints and requirements for all data types in the SSH domain occurs as a result of the development of new and powerful tools and methods for data mining, and the integration and linkage of multiple data sources. The challenges imposed by these changes are connected to the growing importance of data in most people’s lives. The collection, storage, and analysis of data is on an upward trajectory, and the amount of data seems to grow exponentially as the cost of collecting, storing and processing it declines. This, combined with more recent sources of data like sensors, cameras, geospatial and other observational technologies (“internet of things”), means that we now “…live in a world of near-ubiquitous data collection”. The term “big data” reflects this growing technological ability to capture, aggregate, and process an ever-greater volume, velocity, and variety of data (“the three V’s”). According to some sources, the World Wide Web now (as of May 2014) contains somewhere between 2.2 and 2.5 billion webpages and almost 700 exabytes of accessible data (that is 700,000,000,000 gigabytes).

Research using 'big data' is of growing importance for several research disciplines in Europe. Increasingly, scholars of arts and humanities, information science, linguistics and social studies do their research based on language material from all kinds of audio-visual content (films, TV series, music, speech, etc.), e-books, magazines, journals, newspapers and different types of user-generated content. New forms of data on human activities are now being recorded in a variety of domains from blogs and social media to new forms of data which arise from digital processes involving registrations, transactions, sensing devices, internet activity, telecommunications, retail sales, utility consumption, etc. These data types are not necessarily designed for research, but they may have significant research value, especially when linked across domains, or to survey data.

The fusion of different kinds of data, e.g. the linking of survey data with administrative data, involves a risk of disclosure of personal data. Similarly, integrating diverse data can lead to what some analysts call the “mosaic effect,” where the new dataset is much richer in detail than the individual datasets. From this new dataset personally identifiable information can be derived or inferred (although the new dataset does not include any direct personal identifiers), bringing into focus a picture of who an individual is and what he or she likes.

This study focuses on the ethical and legal issues that may occur in the long-term preservation of these new digital resources. We interpret long-term preservation in a broad sense, namely as the process(es) that “…refer to policies, strategies and actions that ensure permanent access to digital content over time”. This broader understanding of preservation can include the transfer of data from point A (e.g. a researcher) to an archive or a data deposit service (ingest); the data storage and access arrangements (long-term preservation, or digest); and the dissemination (sharing, reuse) phase. Maintaining access rights (i.e. reuse of data) is a key function of research data repositories and a challenge for long-term data preservation. In this context we highlight these challenges by analysing key articles in the proposed GDPR. We also look at Intellectual Property Rights (IPR) and copyright issues in preservation and sharing by focusing on the ongoing European reform and various data licensing schemes.
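The mosaic effect can be made tangible with a small, entirely invented example. The Python sketch below counts how many records share each combination of indirectly identifying attributes: in each source dataset every combination occurs at least twice, but after linking every record becomes unique. The datasets, the attribute names and the choice of this group-size measure (in the spirit of k-anonymity) are our own assumptions for illustration.

```python
from collections import Counter


def smallest_group(records, attributes):
    """Size of the smallest group of records sharing the same attribute values.

    A result of 1 means at least one record is unique on these attributes.
    """
    groups = Counter(tuple(record[a] for a in attributes) for record in records)
    return min(groups.values())


# Invented survey extract and administrative extract, linked by a case key.
survey = [
    {"case": 1, "birth_year": 1950, "region": "North"},
    {"case": 2, "birth_year": 1950, "region": "North"},
    {"case": 3, "birth_year": 1960, "region": "South"},
    {"case": 4, "birth_year": 1960, "region": "South"},
]
admin = [
    {"case": 1, "occupation": "teacher"},
    {"case": 2, "occupation": "farmer"},
    {"case": 3, "occupation": "teacher"},
    {"case": 4, "occupation": "farmer"},
]

# Link the two sources on the case key.
admin_by_case = {record["case"]: record for record in admin}
linked = [{**s, **admin_by_case[s["case"]]} for s in survey]

print(smallest_group(survey, ["birth_year", "region"]))
print(smallest_group(admin, ["occupation"]))
print(smallest_group(linked, ["birth_year", "region", "occupation"]))
```

Even if the case key is removed before release, the linked dataset in this toy example singles out every person through the combination of birth year, region and occupation.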

The articulation of the legal text in the GDPR and the ongoing IPR reform will have significant impact on the definition of policy-rules for SSH data infrastructures now and in the future, and by exploring these issues we aim to look for professional long-run preservation strategies based on e-Infrastructures for data in the social sciences and humanities.

Summary
Data archives, data centres and data repositories are currently expanding their role as research data infrastructures and support services for several stakeholders in the data life-cycle, e.g. those creating data and those accessing and consuming data. Within the social science community the data archives have been providing a wide range of services for many years, services needed to ensure easy access to high quality data. In addition to the development of various data access tools, data and metadata standards, trust certifications (e.g. the Data Seal of Approval) and providing training in methodological and analytical techniques, they are providing assistance and advice regarding data management, data depositing, and legal and ethical issues.

This contributes to ensuring the long-term preservation, accessibility and quality of research data. In this process of expanding their roles several of these services have built, and keep building, a strong and varied ‘repository’ of expertise through experience and direct contact with data and legal/ethical issues. Through a diversity of projects they encounter more data than most researchers or short-term and temporary service providers are likely to ever encounter. Hence the archives and data repositories can be valuable data facilitators and provide support services in several areas: they can provide information about various legal and procedural access requirements and support services; provide information and guidance about legal and ethical requirements and practices; provide active help with specific queries; and provide proactive help such as training and support materials. Additionally, they can provide the technological capacity to share data that otherwise would be lost or difficult to share, and provide legally binding user licenses and licence agreements that can secure data services for rich but sensitive data.

The best way for data repositories and archives to prepare for the possible practical effects of the GDPR and the current IPR reform is through solid methodologies for enforcing the desired attributes and properties of their collections, i.e. through a policy-based data environment that can build trustworthy collections.

It is important to recognise that a preservation policy is essential for the archive or repository regardless of the choices made for how the services are implemented and delivered. Even if distributed service options are used, a digital preservation policy is necessary to frame the requirements for service level agreements and licensing agreements for all levels and involved partners in the distributed preservation service network.

The curation, processing and preservation of sensitive data, and clauses concerning intellectual property rights and restrictions on use of repository content, should be clearly stated in well-formulated preservation policies. The policies should be regulated and implemented through deposit agreements, contracts, and licenses. This is necessary in order to allow the repository to track, act on, and verify rights and restrictions related to the use of the digital objects within the repository.

The preservation policy should be more than a general management statement. It should contain general policy clauses and a clear description of how these clauses are to be implemented. There should be a close connection between repository purposes and properties on the one hand, and policies and implementable procedures on the other.

Thus, a preservation policy should define and specify the repository’s requirements and processes for managing personal and sensitive data, intellectual property rights, depositor agreement and access and license agreements. In light of the proposed changes to the GDPR and the ongoing European reform on copyright issues and licensing schemes, special considerations should be taken with regard to an explicit and concise specification of the technical and organizational measures that are in place to regulate access to data that are limited to specific purposes (referred to earlier as ‘storage minimisation’).

Standards and processes should be continually updated, and the necessary skills and expertise developed. Financial and organisational planning for the archive or repository should be clearly stated and should include provision for activities like staff training, technical infrastructure, preservation activities, storage and media (formats) routines, and changes due to evolving technology and the legal framework. The preservation policy should also clearly state the legal and regulatory framework(s) under which the repository or archive operates.

Further, the policy should include an assertion of copyright and intellectual property rights, and the rights for preservation and reproduction of the data should be made clear and recorded through explicit agreements with authors and data owners. This should be combined with explanations of access levels and access restrictions, and procedures for how different levels are assigned to different datasets or data collections. A commitment to keep data secure should be stated and any changes to data and/or metadata should be tracked.

As established in DASISH D4.1 there are (at least) five methods or key standards which digital repositories can use to assess themselves and to support public statements about their level of trustworthiness, ranging from OAIS core conformance to initial self-assessment (DRAMBORA, PLATTER, DSA) to formal audit and certification by external auditors (TRAC, DIN 31644, ISO 16363). These tools and standards, combined with an attentive focus on legal reform and progress, should constitute the core element of the development and maintenance of data preservation policies.

The development of a new GDPR and the ongoing IPR reform, together with recent recommendations (e.g. OECD) and developments (e.g. the transferring of assets and activities of the ESFRIs to becoming ERICs) in the European research infrastructure network, seem to imply a professionalization of the SSH data preservation community. The purpose of a professional data deposit and preservation service for scientific use is to comply with requirements for safety, security, longevity and continued access. Services must hence operate in a perspective that spans decades. At the same time, research communities as well as the general community need to be assured that data delivered today can be retrieved and used tomorrow, while the interests of the data subject are maintained through solid data protection measures. Achieving this objective requires facilities that are well embedded institutionally and can demonstrate a high degree of permanence.

WP7
D7.1 – Course modules

The objectives of the work package “Training and Education” (WP7) were to establish a joint domain for training and education for the SSH infrastructures; to inspire researchers and developers to come up with new research methodologies and approaches using research infrastructures and to discuss with and gather feedback from researchers in all SSH domains about the role of data infrastructures in research methodologies.

To pursue these objectives, the DASISH Description of Work (DoW) specified two tasks. Task 7.1 “Training Modules” was dedicated to developing online training modules for topics and target groups relevant to the SSH communities. The findings here also served as a basis for planning and organizing workshops for these topics in the context of Task 7.2 “Workshop Programme”. This report documents the work carried out with regard to Task 7.1 and presents its results. The specific activities and results of Task 7.2 are described in a separate report (D7.2 A compendium of workshop reports). Comprehensive aspects relevant to both Task 7.1 and Task 7.2 are described in the document at hand.

In Task 7.1 several training modules were developed and made accessible to the public via the DASISH website. In section 2, this document first describes the processes of
• assessing the training needs of and available material from ESFRI communities
• organizing the work for creating and publishing online training modules
• gathering feedback from the SSH communities and the other DASISH WPs
• revising the modules based on internal and external feedback
• ensuring sustainability of the material produced

After this, the main structure of the actual training modules is described, with a brief summary of their chapters’ content. The actual training modules can be accessed through the training section on the DASISH website.

The Appendix then provides additional details and documents, and also includes a list of acronyms used in this report.

D7.2 – A compendium of workshop reports

This report describes the work performed in Task 7.2 “Workshop Programme” within DASISH WP7 “Training and Education”. The objectives of the work package were to establish a joint domain for training and education for the SSH infrastructures; to inspire researchers and developers to come up with new research methodologies and approaches using research infrastructures and to discuss with and gather feedback from researchers in all SSH domains about the role of data infrastructures in research methodologies.

Task 7.2 “Workshop Programme” was dedicated to offering and organizing training workshops for topics relevant to the SSH communities, both domain-specific and cross-domain. The topics addressed here were partially based on outcomes of Task 7.1 “Training Modules”, and partially on direct requirements from other DASISH WPs.

In Task 7.2 several training workshops were organized. This document describes and summarizes each of these workshops with respect to
• the motivation, general topic and target groups of the workshops (based on outcomes of Task 7.1);
• scheduling and organization of the workshops including tools, resources and infrastructure used and number of registrations/participants;
• speakers and presentation topics of the workshop;
• publicly available material from the workshop;
• general reception and feedback from the workshop participants.

WP8
D8.5 – Conference list

The Conference list can be found on the external DASISH website, which can be reached at www.dasish.eu.

Throughout the DASISH project, the website has been maintained, extended with new content, and restructured on an ongoing basis in order to reflect the different types of activities taking place and the results achieved by the project.

The Conference list was originally created and maintained on the front page of the website, where it was displayed in the same layout as the News section but as a separate list called Conferences. Over time it became clear that it was more intuitive, both for the users of the website and for the providers of content, to treat the announcement of a conference as a piece of news rather than as a separate item. Consequently, the original list named Conferences was merged with the News section. This restructuring met these needs, and at the end of the project conferences are still announced as news items. Subsequently, two more structural changes were made: one list was created to document conferences and events with DASISH participation, preferably with a link to the presentation given by the DASISH representative, and another list to document events organized by DASISH, also with links to the program and the presentations.

Potential Impact:
The expected final results and their potential impact and use (including the socio-economic impact and the wider societal implications of the project so far)

DASISH combines all ESFRI projects in the SSH, and hence establishes a direct link to some of the most renowned researchers in SSH in Europe and beyond. The collaboration of these ESFRI projects in DASISH enables the development and co-evolution of a consistent and pivotal eco-system of research infrastructures - employing shared infrastructure systems where feasible and supporting each other in the creation of disciplinary virtual research environments. The collaboration across SSH in DASISH facilitates the structuring of the scientific community and therefore plays a key role in the construction of an efficient research and innovation environment. DASISH contributes to (1) fostering standards for the interoperability and sharing of research data; (2) coordination and pooling of resources in the SSH through interacting digital research infrastructures; as well as (3) supporting researchers in the adequate creation, management, curation and publication of research data.

Synergies within ESFRI and related initiatives:

The DASISH project establishes a cross-disciplinary collaboration of the ESFRI projects ESS, SHARE, DARIAH, CESSDA and CLARIN. The cooperation covers both technologies, such as registries, single-sign-on systems and persistent identifiers, and the coordination of efforts in the fields of data quality, data archiving and on-line research. By developing new approaches and tools for the construction and execution of social science surveys, the comparison and aggregation of survey results are substantially enhanced, greatly improving the usability, applicability and quality of social science data. By combining and exchanging insights into data archiving methods and technology, and offering generic procedures and applications to support this, both cost reduction and international alignment of requirements are achieved. DASISH taps into the synergies between infrastructure systems and frameworks across the SSH.
DASISH explores and exploits technical, operational and advocacy synergies such as interoperability, persistent identification and publication of research data; convergence in best practice guidelines and, for example, certification of data management; as well as communication and training to support researchers in employing research infrastructures in and across the SSH. Integrating both diverse disciplines in the SSH and local (e.g. national) efforts will add to the visibility of European research, improve the durability of research results and make them available to a larger group of potential consumers.

Impact Determining Factors and Assumptions:

The close cooperation between the SSH ESFRI projects and within DASISH is of the highest significance for maximizing impact for the SSH in Europe and beyond. DASISH needs to present itself as a coherent group that effectively caters to all SSH communities involved without losing sight of their specific context and requirements.

Another factor is the availability of actual data. Without a sufficient and representative amount of readily available datasets it cannot be ensured that common solutions will be built that can cater to the respective infrastructures. The influx of high quality research data has already been ensured by the individual research infrastructures, and DASISH will ensure that bridges are built between infrastructures in order for this data to circulate. The more data are available, both inside and across the domains, the greater the appeal will be for others to use the data and contribute their results.

The quality and trustworthiness of both the infrastructures and the data they serve will also affect the impact of this effort. Both the individual infrastructures and DASISH need to establish themselves as a haven for high quality research and durable, high availability services. Failure to address this issue will result in the infrastructures playing only a marginal role within the research life-cycles and can seriously hamper their growth. The efforts in WP3 (Data Quality) and WP4 (Data Archiving) as well as the technical coordination in WP2 (Architecture & Quality Assessment) and development in WP5 (Shared Data Access and Enrichment) will address trust and quality throughout their activities.

Legal issues can form an obstacle for international cooperation and exchange of research. These issues have already been identified by the participating infrastructures, and it is foreseen that a lot can be gained by forming an inter-infrastructural consortium. Being able to act as a representative of the majority of European SSH research will give DASISH the mandate to negotiate with national governing bodies and other relevant players to break down existing barriers to data exchange and cooperation (cf. WP6 - Legal and Ethical Issues).

Perspectives on Impact: Roles and their Goals:

DASISH combines the efforts of various researchers and infrastructure initiatives in the SSH. Beyond those contributing to DASISH and the individual ESFRI initiatives, there are various other stakeholders that DASISH connects to.
Researchers in all SSH domains will, through DASISH, have data and tools available that increase their efficiency and productivity. The mechanisms for interaction within their community and across communities may provide them with opportunities for collaboration and new insights. Eventually, their research will gain more visibility and, vice versa, they will have the possibility to build on and re-use existing research activities and hence "stand on the shoulders of giants". Funders and policy makers will, through DASISH, increase the impact of their programmes through better exchange between projects and by eliminating duplication of work. This, as well as the available data in DASISH, will support evidence-based decisions and increase the return on investment of their work.

List of Websites:
E-mail: dasish@dasish.eu

Project manager

Daniel Knezevic
Swedish National Data Service (SND)
Bohusgatan 15
Box 330
SE-405 30 Gothenburg
Sweden
Ph: +46 31 786 1207
Fax: +46 31 786 4913
Web: http://www.snd.gu.se
