  |
The RCSB Protein Data Bank (PDB) - http://www.rcsb.org/pdb/
Archive of experimentally determined 3-D structures of biological macromolecules, originally established at Brookhaven National Laboratory. |
  |
National Space Science Data Center - http://nssdc.gsfc.nasa.gov/
Provides access to a wide variety of astrophysics, space physics, solar physics, lunar and planetary data from NASA space flight missions, in addition to selected other data and some models and software. |
  |
TREC Data - http://trec.nist.gov/data.html
Text datasets used in information retrieval and learning in text domains. |
  |
The StatLib Datasets Archive - http://lib.stat.cmu.edu/datasets/
A repository of datasets used in statistics and machine learning. |
  |
Penn Treebank Project - http://www.cis.upenn.edu/~treebank/
A corpus of parsed sentences, used by many researchers for training data-driven parsing algorithms. |
  |
Web->KB dataset - http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
Web pages partitioned into classes, with hyperlink data. The dataset has been used for text categorization and learning to extract symbolic knowledge from the World Wide Web. |
  |
HS3D - Homo Sapiens Splice Sites Dataset - http://www.sci.unisannio.it/docenti/rampone/
A database of Homo sapiens exon, intron, and splice regions extracted from GenBank primate sequences, Rel. 123. It aims to provide standardized material for training computational approaches to gene identification and characterization and for assessing their prediction accuracy. |
  |
TechTC - Technion Repository of Text Categorization Datasets - http://techtc.cs.technion.ac.il
Provides a large number of diverse test collections for use in text categorization research. |
  |
WordSimilarity-353 Test Collection - http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
Contains 353 English word pairs along with human-assigned similarity judgements. |
  |
Dataset generator - http://www.datgen.com/
Datgen, formerly SCDS, is a computer program that generates data to systematically test programs that consume data. These synthetic datasets can be used to validate learning algorithms. |