A Community Data Sharing Resource: The LDbase Data Repository

Abstract

The purpose of this invited paper is to show the learning disabilities field what LDbase is, why it’s important for the field, what it offers the field, and examples of how you can leverage LDbase in your own work.

Keywords

Data sharing, data reuse, data repository

In 2018 our team, represented in part by the authors of this paper, began work on a data repository to serve as a resource for the learning disabilities field. That data repository, LDbase (publicly available at www.LDbase.org since 2021; Hart et al., 2020), is the result of this ongoing collaboration, which has grown into something bigger than any of us imagined when we started. One of our early supporters, Professor Stephanie Al Otaiba, invited us to write this paper to show the learning disabilities field what LDbase is, why it’s important for the field, what it offers the field, and examples of how you can leverage LDbase in your own work.

What Is LDbase?

LDbase is a domain-specific data repository for the field of learning disabilities (Hart et al., 2020). A domain-specific repository is one that is limited to a specific data type or related to a certain discipline. This contrasts with generalist repositories (e.g., ICPSR, OSF), which accept any data, no matter the data type, format, content, or disciplinary focus. Generally, it is recommended that researchers give priority to domain-specific repositories when choosing where to share their data (National Institutes of Health, 2020). There are many reasons for this recommendation: Domain-specific repositories have more detailed metadata profiles, making it easier to upload and search for datasets that fit very specific parameters in your research. They also build community around the shared data, allowing you to find potential collaborators interested in your field of research. In addition, domain-specific repositories, like LDbase, are built with researchers in a particular field in mind, developing interfaces and features tailored to create an optimized experience.

To make the difference between generalist repositories and domain-specific repositories more concrete, imagine you are shopping for a pair of sunglasses. You could go to a generalist store, like Walmart or Target. There you would have to figure out where the sunglasses are in the store, choose from a limited selection, check each tag to find glasses with the specific features you are looking for, and lose time trying to find a mirror somewhere in the store. A more successful approach would be to go to a store that sells only sunglasses, provides you with a much larger selection, displays labels so that you can quickly find what you are looking for, and has mirrors set up on every wall so you can see if the glasses fit you correctly. Similarly, generalist repositories are available to store or find data from our field, but they are not created to fit the needs of our field, making them a less ideal choice. A domain-specific repository makes data sharing and data reuse easier for a field. Since becoming publicly available, LDbase has been listed as a National Institutes of Health (NIH)-supported domain-specific repository (https://www.nlm.nih.gov/NIHbmic/domain_specific_repositories.html).

Who Is LDbase for?

Simply put, LDbase is for the learning disabilities community broadly defined. We allow for the “LD” in LDbase to stand not only for “learning disabilities” but also for “learning and development.” We do not restrict LDbase to projects that contain only samples identified as learning disabled, but instead have made LDbase well positioned to store data related to individuals in learning contexts or impacting individuals’ learning outcomes. LDbase is freely available to any researcher who has data that fit the scope of LDbase. Investigators from any country can use LDbase. However, LDbase was created through funding from the NIH (2020) by researchers based in the United States, and non-U.S. investigators should check their local laws and customs before sharing data within LDbase. For example, we have attempted to comply with the European Union’s General Data Protection Regulation (GDPR), which governs, in part, data sharing, but we cannot guarantee that we do, given the intricacies of how the law is being enforced.

What Type of Data Does LDbase Store?

LDbase accepts only behavioral data, defined as quantitative data gathered via questionnaire, cognitive testing, and the like. We do not accept image data, neurophysiological data, video data, or similar file types. We limit LDbase to behavioral data for two reasons. First, we have focused on the sustainability of LDbase from the start, and these other file types have extensive, and expensive, data storage needs. Second, other data repositories already exist to host those types of data (e.g., Databrary for video data), and it is not appropriate to duplicate those efforts. Instead, we allow our video and image sharing users to still share their projects on LDbase, complete with metadata, and then provide a URL link to where their data are stored, rather than uploading their data. The benefit of this is that the LD community can still find, reuse, and cite your data.

Datasets may be shared in the format of your choice. Common examples include csv, SAS, Excel, SPSS, and R. Flat files, as opposed to relational databases, should be shared to allow for easy download and reuse of data.

In addition to datasets, LDbase is also designed for researchers to share codebooks, data dictionaries, code in the original format, project documentation, and any other project resources that will help others understand your study. The goal of open science is to provide and share the full picture of your research, so that your data can be properly understood, reused, and cited for years to come, ideally with no need to contact you in the future.

Why Is LDbase Important?

Why Did the Learning Disabilities Field Need LDbase?

We saw that the broad field of learning disabilities, which intersects the fields of education, psychology, and communication science and disorders, among others, was unique in its need, and positionality, for a domain-specific data repository. First, many of the projects in the field are grant funded, which means that the field feels considerable top-down pressure from granting agencies to share data. A mandate passed in 2013 requires data collected using federal funding to be open and accessible to the public (The White House, 2013). In response to this, the major federal agencies funding learning disabilities research now require explicit data sharing plans as part of their grant applications. For example, as of January 2023, NIH requires a formal data management and sharing plan, requiring data sharing at publication or end of grant, whichever comes first. The Institute of Education Sciences (IES) has required data sharing of final research data for most grant types since 2013. For IES-funded projects alone, as of 2020 approximately 350 research projects were subject to this data sharing requirement and had data which were ready to share (Albro, 2020). These investigators need a place to share their data, and the support of a domain-specific repository on how to properly share their data. Second, the learning disabilities community is unique in being well poised to capitalize on the power of shared data. We often use common measures (e.g., Woodcock–Johnson Tests of Achievement), which makes integrating across datasets possible. Many of our research questions require hard-to-reach populations (e.g., children with specific disorders) or methods that require expensive specialized data collection (e.g., intervention studies). Investigators who are limited in how many hard-to-reach participants they can recruit can combine across datasets to create a more powerful single dataset of their population for analysis. Investigators without access to funded resources can use shared data to advance the field by bringing their unique perspectives to their research questions, democratizing access to data.

What Does LDbase Offer?

LDbase is a data repository that has been created specifically for the learning disabilities community. When we started to make LDbase, we knew that our learning disabilities community users wanted features that were not available with other data repositories. The following is a list of some of these features (see also Logan & Hart, 2021).

Flexible layouts. The other currently available data repositories assume a one-to-one-to-one correspondence of project to data collection to manuscript. This means that the data repository interface tends to be flat and, in many cases, does not allow any file nesting at all. Many in the learning disabilities community consider a project to be a broad bucket that might have multiple waves or types of data collection (e.g., longitudinal, teacher surveys, parent questionnaires, and child testing), and publish multiple manuscripts from the same project or even same data source. This meant we needed to build LDbase to allow hierarchical file storage that had built-in flexibility. All new data uploads start first with a project page, which allows investigators to provide metadata on the broader project (e.g., project description, project investigators). After a project page is established, the investigator can add any number of datasets and documents, with any file nesting desired. This type of nested project layout allows others, who are reusing your data, to easily navigate your data and understand what pieces of the project they are looking at, and how it connects to the project as a whole. For example, nesting allows people to see that there are three cohorts stored in three datasets, and that “this” codebook and “this” piece of code goes specifically with “this” dataset, and it does not apply to other datasets on this project.

Controlled access capabilities. Data sharing means providing summary or individual-level data in a data repository. It does not require openly sharing all data, and indeed, many in the learning disabilities community might have various reasons for wanting to control access to their data. LDbase allows for open sharing as well as controlled access sharing. Open sharing fully meets the goals of data sharing, in that the data are available to any user with an easy download. On LDbase, openly shared data are available to any user, whether or not they have a registered account. Controlled access sharing means an investigator has stored their data on LDbase, along with openly available metadata that describe the data, but the data themselves are not available for download by every user. When using this feature an investigator can choose to set a date by which the data will become fully open (e.g., to meet a grant requirement for sharing), or the investigator can choose to not set a date, which effectively makes the data permanently restricted access. Any LDbase registered user can request access to controlled access data through the internal messaging system, and the investigator can decide if they want to accept the request and allow access to that user through a specialized sharing feature. Each individual dataset shared on LDbase can have different access settings. This feature might be useful for investigators who want to (or are required to) participate in data sharing but would rather have some data not fully open to any user, for ethical (e.g., sensitive data) or personal (e.g., not wanting to share until the first major manuscripts are completed) reasons.
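The access rules described above can be sketched as a small decision function. This is an illustration only, not LDbase's implementation: the `Dataset` class, its field names, and `can_download` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class Dataset:
    """Hypothetical stand-in for one shared dataset; not LDbase's data model."""
    title: str
    controlled: bool = False            # False means openly shared
    open_on: Optional[date] = None      # optional date the data become fully open
    approved_users: Set[str] = field(default_factory=set)  # approved by the investigator


def can_download(ds: Dataset, user: Optional[str], today: date) -> bool:
    """Return True if `user` (None = visitor without an account) may download."""
    if not ds.controlled:
        return True                     # open data: anyone, registered or not
    if ds.open_on is not None and today >= ds.open_on:
        return True                     # the release date has passed
    # Otherwise only registered users whose request the investigator accepted.
    return user is not None and user in ds.approved_users


wave1 = Dataset("Wave 1", controlled=True, open_on=date(2026, 1, 1),
                approved_users={"ana"})
print(can_download(wave1, "ana", date(2025, 6, 1)))  # approved user, before release
print(can_download(wave1, None, date(2025, 6, 1)))   # anonymous visitor, before release
```

Leaving `open_on` unset corresponds to the case in which no release date is chosen, so access stays restricted to approved users indefinitely.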

DOI minting. Best practices in data sharing say that each product shared should get a persistent unique identifier, to allow for better citation and data discovery. LDbase uses digital object identifiers (DOIs) as permanent identifiers to assist with the permanent archiving and access of shared data, and we are set up to provide a DOI to anyone, free of charge, at any time.

Citation (for project and each document). Data that are shared are a citable product, which can be added to CVs and grant progress reports. Each data product, and documentation, is fully citable, and the LDbase interface has a feature that allows a user to easily copy a citation. A unique feature is your project can also get a DOI, meaning you will have a citation for your project for the manuscripts you write using the data from that project.

License. Assigning a reuse license to products shared in the public is important as it allows someone who wants to use that product to know how, where, and when they can use that product. Any product that is shared on LDbase (e.g., data, documents) can be assigned any license a user desires. We suggest options that are currently typically used depending on the product (and provide resources to learn more about each), but also allow a user to assign any license they wish. This allows flexibility for users who use bespoke licenses, like those required by some institutions.

Data sharing resources. In the end, we not only created a data repository, but also an informative website. By that we mean we have built out informational resources on open science and data sharing, with many of our resources answering questions that have been asked by the LD community. We have provided templates (e.g., informed consent language for data sharing, samples of Data Management Plans) and white papers sharing best practices particular to our field (e.g., data management basics, working with your institutional review board, de-identifying your data), all available on LDbase.org with a specialized search function of the resources. We also have data management and data sharing training materials and opportunities available on the website.

Management of your project users. We anticipated that the principal investigator (PI) of projects would not necessarily be the individual who does the actual file uploads for their data sharing, but instead they might have students or staff who would do it. We have two levels of project users who have various editing and access capabilities. A Project Administrator is a role that provides the highest level of access to a project, its data, and any related documentation. We imagine that a PI and potentially a senior-level staff member on a project would have this level of access. A Project Administrator can add and edit all metadata, upload files, access embargoed data, allow other users to access embargoed data, edit all other user access roles for the project, and delete pieces of the project or even delete the project entirely. A Project Editor is the second level of administrative user on a project. We imagine this role would be given to trainees and staff working with a PI. A Project Editor can add and edit metadata, upload files, access embargoed data, and allow other users to access embargoed data, but they cannot delete any data. Any registered user of LDbase can be given these project administrator roles. Access is assigned per project, can be different from project to project, and is managed by you. You can add/revoke access rights at any time, allowing you to remain in control of your data.
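The two roles described above can be summarized as a permission table. The sketch below is illustrative: the action names are labels we made up for this example, and the code is not taken from LDbase itself.

```python
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "Project Administrator"
    EDITOR = "Project Editor"


# Hypothetical action labels paraphrasing the capabilities listed in the text.
_EDITOR_ACTIONS = frozenset({
    "edit_metadata",
    "upload_files",
    "access_embargoed_data",
    "grant_embargoed_access",
})

PERMISSIONS = {
    Role.EDITOR: _EDITOR_ACTIONS,
    # Administrators can do everything an editor can, plus manage roles and delete.
    Role.ADMINISTRATOR: _EDITOR_ACTIONS | {
        "edit_user_roles",
        "delete_items",
        "delete_project",
    },
}


def is_allowed(role: Role, action: str) -> bool:
    """Check whether a project role includes the given action."""
    return action in PERMISSIONS[role]


print(is_allowed(Role.EDITOR, "upload_files"))
print(is_allowed(Role.EDITOR, "delete_project"))
```

Because roles are assigned per project, a real system would store one such role for each pairing of user and project; that bookkeeping is omitted here.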

Advanced search interface. The purpose of data sharing is to allow data to be accessible to others, for reproducibility or reuse. This means that others must be able to find your data. Our LDbase team includes librarians, who are specially trained in information finding. We have advanced search capabilities, including the ability to use Boolean search terms or use preselected facets to limit search results. You can also directly search on specific terms. Do you want to find a dataset with dual language learners, collected via a longitudinal study, in elementary schools? You can do that on LDbase. But remember, one can only search on study attributes that have been provided. Therefore, we require or request metadata about the research study to be entered when projects are created. Metadata not only makes your data more findable but also makes our search functionality better than that currently available in all other data repositories.
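To illustrate why faceted search depends on the metadata that depositors provide, here is a minimal sketch of facet filtering. The records, facet names, and the `search` function are all invented for this example and do not reflect LDbase's actual search engine or metadata vocabulary.

```python
# Made-up project records; facet names and values are illustrative only.
RECORDS = [
    {"title": "Study A", "facets": {
        "population": {"dual language learners"},
        "design": {"longitudinal"},
        "setting": {"elementary school"}}},
    {"title": "Study B", "facets": {
        "population": {"twins"},
        "design": {"longitudinal"},
        "setting": {"elementary school"}}},
    {"title": "Study C", "facets": {
        "population": {"dual language learners"},
        "design": {"cross-sectional"},
        "setting": {"preschool"}}},
    # Study D never recorded its design, so a design filter cannot find it.
    {"title": "Study D", "facets": {
        "population": {"dual language learners"},
        "setting": {"elementary school"}}},
]


def search(records, **wanted):
    """Return titles matching every requested facet (Boolean AND).

    Within one facet, matching any of the listed values is enough (Boolean OR).
    """
    hits = []
    for rec in records:
        facets = rec["facets"]
        if all(facets.get(name, set()) & set(values)
               for name, values in wanted.items()):
            hits.append(rec["title"])
    return hits


print(search(RECORDS,
             population=["dual language learners"],
             design=["longitudinal"],
             setting=["elementary school"]))  # prints ['Study A']
```

Study D matches the population and setting but is missed by the query because its design was never entered, which is the practical reason for requiring or requesting metadata at project creation.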

Sharing data by uploading or linking. An LD investigator can share data on LDbase in two ways. If the data are otherwise not available online, then they can upload their data to be stored on and shared from LDbase. However, if the investigator already has the data stored somewhere else online, either because they had to or they did before LDbase was available, then best practice is that they do not upload the data again to LDbase. Instead, for these investigators, we provide an option to create your project on LDbase, add your metadata that supports FAIR data sharing, and then add a link to the external website where the data are stored, rather than uploading the data to the repository. As LDbase is a one-stop shop for our field to find data, doing this allows secondary users access to your data which otherwise they might struggle to find. Linking to an external resource is also helpful for files that are better off stored in other places, like preregistrations (which OSF is purposely set up to host) or file types LDbase doesn’t store (e.g., if you have a data file of imaging data stored somewhere else but all behavioral data will be on LDbase).

Metadata. This is the key that makes the repository work for everyone. Metadata is all of the informational tags that are used in LD research. It includes everything you can think of, including: assessments used, educational environments the research took place in, what year the study happened, who provided funding, was a control group used, is the socio-economic status shared, was it randomized, were twins studied, was attention-deficit/hyperactivity disorder a focus, and who are the PIs. You will be asked to enter in metadata related to your project, data, and other documentation during your upload process (you can see what will be expected of you here https://ldbase.org/data-sharing-resources/guides/what-information-do-I-need).

Broader findability. LDbase currently uses a schema.org metadata format. This means that LDbase metadata is searchable by other programs, importantly including the Google crawler, allowing for LDbase data to be “found” by Google (and Google data and Google Scholar). As part of our future work, we are continuing to build out other application programming interface (API) capabilities, allowing easier access to the metadata stored in LDbase.
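For readers unfamiliar with schema.org, the sketch below builds a small record of the schema.org `Dataset` type in JSON-LD form, the kind of structured description that web crawlers can read. The property names (`name`, `description`, `identifier`, `creator`, `keywords`, `license`) are standard schema.org properties, but the selection of fields, the helper function, and all values are assumptions made for illustration; this is not the markup LDbase actually emits.

```python
import json


def dataset_jsonld(name, description, doi, creators, keywords, license_url):
    """Build a schema.org Dataset description as a JSON-LD-ready dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "keywords": keywords,
        "license": license_url,
    }


record = dataset_jsonld(
    name="Example reading intervention dataset",       # made-up title
    description="Item-level reading scores from a hypothetical study.",
    doi="10.0000/example",                              # placeholder, not a real DOI
    creators=["A. Researcher"],                         # placeholder name
    keywords=["reading", "intervention", "longitudinal"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# Serialized JSON-LD is typically embedded in a web page for crawlers to read.
print(json.dumps(record, indent=2))
```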

Sustainability

LDbase is a free-to-use resource, and we were sensitive to sustainability from the very start of our work building LDbase. We made choices regarding our features that ensure that LDbase is sustainable after our NIH funding ends. The two most important decisions were, first, that we would give the ability to store and access data and related documentation entirely to users, and second, that we would not be involved in any step of looking into a dataset. There are implications to these decisions. First, we do not restrict who can become a registered user of LDbase. This contrasts with data repositories such as Databrary, which requires all new users to get user requirements signed by their university official. That process ensures that all users have certain credentials; however, it does not allow inclusive data access, and it is labor intensive. Second, we cannot speak to the “quality” of the data stored in LDbase. To do so would require us to go inside datasets uploaded to LDbase and confirm aspects such as full deidentification, cleanliness, documentation, and the like. This would require staff time and expertise. ICPSR does provide such services, for a fee. We have heard from one user that public access to a dataset on ICPSR has been held up for over 15 months because the data checking process is backed up. We provide many resources on how to check for deidentification, good data management practices, and documentation, but in the end the onus is on both the data depositor and any secondary data user to ensure data quality. This allows us to keep ongoing costs very low. To cover those remaining costs, we built LDbase in partnership with the Florida State University (FSU) Libraries, which recognize LDbase as a part of their archive. This means the FSU Libraries will maintain curation of LDbase after our funding period.

Examples of Secondary Data Uses

LDbase was made for two reasons. First, it provides a free place where LD researchers can store their data, to meet federal data sharing requirements or to contribute to open science. Second, it provides a place for LD investigators to go to find data for reuse. This reuse is useful for many different goals. By bringing new investigators to data, with their own backgrounds, their own theoretical leanings, and their own methodological approaches, new research questions can be conceived which were not considered by the original investigator team. These new ideas increase the creative use of data and can lead to breakthroughs and advance the field. These new investigators also sometimes become new collaborators, increasing the collaborative networks of all investigators involved. This has been shown to increase creativity in research (e.g., K. L. Hall et al., 2018). Data reuse can also increase the transparency of the research process. By openly sharing data that contributed to scientific knowledge, others can access the data and check the analytical pipeline for reproducibility. This adds to the credibility of our field’s findings, a crucial task for a field like ours which directly contributes to the instruction and interventions that children receive. Data sharing can also promote equity in the research process. Not all investigators have access to high-quality learning disabilities data, because they are students, are junior faculty, have backgrounds that make them less likely to be funded, or are at under-resourced institutions. We have found that shared data are a particular boon for students to use for their theses and dissertations. Finally, openly available “real” data are very useful for teaching statistics and methods classes, either for in-class examples or for students to use as part of their class assignments. For example, LDbase holds many kinds of data that are harder to come by, like item-level and longitudinal data, making it an ideal place to find data for measurement and developmental modeling classes. By opening data for others to use, you are providing a resource that can benefit all.

Through our years of working to support our community in data sharing, we have seen many examples of successful data reuse, in both papers and grants.

Papers

Our field has examples of large data collections that were collected with the purpose of sharing. Examples include the Program for International Student Assessment (PISA), Trends in International Mathematics and Science Study (TIMSS), and Early Childhood Longitudinal Studies (ECLS). There are many datasets available that are less well known but no less valuable. Many of these are stored in data repositories like LDbase. One example of an interesting reuse of data stored on LDbase is Hall et al.’s paper (2024). This paper uses part of the Project KIDS dataset (Hart et al., 2021), with full data description available in van Dijk, Norris, et al. (2022). Project KIDS combined nine extant intervention projects into one very large integrated dataset for novel research questions concerning individual differences in intervention responses, which was already an example of the power of data sharing. Hall et al. (2024) further extended the usefulness of the shared data to use innovative analytical techniques to examine the impacts of a reading intervention on math outcomes. They found that the reading intervention had a small impact on applied problem solving, and that activating word-level reading skills via the reading intervention impacted math fluency (Hall et al., 2024). Interestingly, the first author has used the same shared dataset again for a very different manuscript, using the data as part of a “how-to” paper in using longitudinal structural equation models for a school psychology audience (Hall & Clark, 2023). Indeed, shared data can be used to advance research questions important to the field, as well as to support pedagogy.

Grants

Sometimes a new idea or new method does not need new data collection to be tested. This is especially the case in an era of shrinking buying power from grant dollars, where collecting data with children, teachers, and families is expensive and takes time. By leveraging already committed resources, it is possible to propose effective and interesting ideas to granting agencies that include secondary data analysis. For example, some of our team is part of a newly funded NIH project that leverages existing datasets from efficacy trials of small group interventions for reading difficulties (Jessica Toste, PI). In this grant, the datasets will be combined using integrative data analysis (Curran et al., 2014; van Dijk, Schatschneider, et al., 2022), creating a large dataset of thousands of students who have received supplemental instruction in reading, including their treatment status, pre- and post-reading scores, and demographic information. The research questions for this grant are related to group differences in treatment response, questions that were not possible in any of the individual projects because of small sample sizes within a project. This is not an issue when we combine across datasets. This project was possible due to personal connections by the investigators allowing access to each dataset. The final combined dataset, and many of the individual datasets, will be made available on LDbase, which will make these once private datasets available to others. By continuing to share the data from our field, we will allow others to innovate and maximize the knowledge that can be gained from the data.

It is our dream, and one reason to support and participate in data sharing, that as the field of learning disabilities shares more data, inequities in access to high-quality data will diminish. In the end, the north star of our field is individuals with learning disabilities. Data sharing is one way we can advance our field toward supporting all individuals who are at risk for, or have, learning disabilities.

Footnotes

Acknowledgements

Building LDbase has truly been a team effort. Beyond the authors, the following people, listed in alphabetical order, have contributed to LDbase, giving expertise, time, and effort to its success: Brian Arsenault, Bryan Brown, Veronica Mellado De La Cruz, Ashley Edwards, Stephanie Estrera, Mason Hall, Jessica Logan, Jean Philips, Jeffrey Shero, Rachel Smart, Micah Vandegrift, Wilhelmina van Dijk, and Christine White.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This work is supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development (grant no. R01HD095193). Views expressed herein are those of the authors and have been neither reviewed nor approved by the granting agencies.

ORCID iD

Tara Reynolds

References

Albro (2020, January). IES annual principal investigators meeting. https://ies.ed.gov/pimeeting/

Curran, P. J., McGinley, J. S., Bauer, D. J., Hussong, A. M., Burns, Chassin, Sher, & Zucker (2014). A moderated nonlinear factor model for the development of commensurate measures in integrative data analysis. Multivariate Behavioral Research, 49(3), 214–231.

Hall, G. J., & Clark, K. N. (2023). Demystifying longitudinal data analysis using structural equation models in school psychology. Journal of School Psychology, 98, 181–205. https://doi.org/10.1016/j.jsp.2023.03.003

Hall, G. J., van Dijk, Chow, J. C., & Comella (2024). Decrypting the code: Investigating a reading intervention’s impact on math problem solving and calculation fluency. https://osf.io/preprints/psyarxiv/jvyzw

Hall, K. L., Vogel, A. L., Huang, G. C., Serrano, K. J., Rice, E. L., Tsakraklides, S. P., & Fiore, S. M. (2018). The science of team science: A review of the empirical evidence and research gaps on collaboration in science. American Psychologist, 73(4), 532.

Hart, S. A., Al Otaiba, Connor, & Schatschneider (2021). Project KIDS. LDbase. https://doi.org/10.33009/ldbase.1619716971.79ee

Hart, S. A., Schatschneider, Reynolds, T. R., Calvo, F. E., Brown, B. J., Arsenault, Hall, M. R. K., van Dijk, Edwards, A. A., Shero, J. A., Smart, & Phillips, J. S. (2020). LDbase. http://doi.org/10.33009/ldbase

Logan, J. A. R., & Hart, S. A. (Hosts). (2021, July 8). S3E3: All about that LDbase [Audio podcast episode]. In Within & between. http://www.withinandbetweenpod.com/

National Institutes of Health. (2020, October 29). Supplemental information to the NIH policy for data management and sharing: Selecting a repository for data resulting from NIH-supported research. NIH Grants & Funding. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-016.html

van Dijk, Norris, C. U., Al Otaiba, Schatschneider, & Hart, S. A. (2022). Exploring individual differences in response to reading intervention: Data from Project KIDS (kids and individual differences in schools). Journal of Open Psychological Data, 10(1), 2. http://doi.org/10.5334/jopd.58

van Dijk, Schatschneider, Al Otaiba, & Hart, S. A. (2022). Assessing measurement invariance across multiple groups: When is fit good enough? Educational and Psychological Measurement, 82(3), 482–505.

The White House. (2013, May 9). Executive order: Making open and machine readable the new default for government information. https://obamawhitehouse.archives.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-