Is it just the aggregation of data points? Or is it the representation of data points for different elements in a tabular format arranged with values of the different variables? How is it different from raw data?

What do you mean by "data point"? Do you expect it to be at least 2D? A time series or a set of exam scores can be a data set; at minimum those could just be series in one variable, possibly without row labels. Per the answer by @FranckDernoncourt. – smci 2 hours ago

I think that Wikipedia does a decent job at defining it:

Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.

The term data set may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event. An example of this type is the data sets collected by space agencies performing experiments with instruments aboard space probes.

In the open data discipline, dataset is the unit to measure the information released in a public open data repository. The European Open Data portal aggregates more than half a million datasets. In this field other definitions have been proposed but currently there is not an official one. Some other issues (real-time data sources, non-relational datasets, etc.) increase the difficulty of reaching a consensus about it.

As you can see, the term is somewhat vague.

And in a computer vision setting, a data set could just be a collection of natural images and their labels or annotations. – Sycorax 16 hours ago
    
What is meant by "database"? – ankit 9 hours ago
    
@ankit The traditional CS meaning en.wikipedia.org/wiki/Database – Franck Dernoncourt 8 hours ago
    
@Sycorax Yes, I guess we could consider one image (or some other signal) as one blob datum in the database. – Franck Dernoncourt 8 hours ago

In my experience, "dataset" (or "data set") is an informal term that refers to a collection of data. Generally a dataset contains more than one variable and concerns a single topic; it's likely to concern a single sample.

A mistake I often see writers of Cross Validated questions make is using "dataset" as a synonym for "variable" or "vector".

Agreed on dataset vs variable or vector. Don't get me started on "a data", as in "I have a data". Conversely, "I have a dataset" is a wonderful way to avoid irritating either those who insist that data are plural or those who regard that insistence as pedantic, if they think about it at all. – Nick Cox 20 hours ago
@NickCox In the grammar wars over "data", I'm in the least popular faction, which claims that "data" is a mass noun. – Kodiologist 19 hours ago
I suspect that's the majority view now, and I'm more confident still that it's gaining popularity. – Nick Cox 18 hours ago

I think you might need to define data point before you can define data set: why should one be primitive and need no definition, but not the other?

At least two definitions make sense to me:

  1. One or more observations (cases, records, rows) for one or more variables (fields, columns).

  2. Whatever is stored as data within a file readable by a program of choice.

Tabular layout is common but I don't think it's part of any definition; how the data are stored can be practically important, naturally.
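As a rough illustration of that point, here is a minimal Python sketch in which the same small dataset is held in two different layouts. The variable names and the values are made up for illustration; nothing here is tied to a particular statistics package.

```python
# Hypothetical observations; the values are made up for illustration.
# Layout 1: one record per observation (row-oriented).
records = [
    {"id": "a", "height": 1.70, "weight": 65.0},
    {"id": "b", "height": 1.82, "weight": 80.5},
]

# Layout 2: one list per variable (column-oriented).
columns = {
    "id": ["a", "b"],
    "height": [1.70, 1.82],
    "weight": [65.0, 80.5],
}

# Convert the column layout back into records and compare.
names = list(columns)
rebuilt = [dict(zip(names, values)) for values in zip(*columns.values())]

print(rebuilt == records)   # True
```

Both layouts carry the same observations and variables; only the storage arrangement differs, which is why it can matter in practice without being part of the definition.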

P.S. The word "format" is so overloaded that to me it's best avoided unless specified unambiguously. I've seen it used for

  1. General or specific text or binary file format

  2. Data structure, e.g. tabular or other

  3. Data storage or variable types, e.g. bit, integer, real, character

  4. Display format controlling presentation, e.g. details on number of decimal places; decimal, hexadecimal or binary display.
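To make senses 3 and 4 concrete, here is a small Python sketch in which the stored value stays the same while only its display changes. The names `value` and `ratio` are just illustrative.

```python
value = 255          # stored as an integer
ratio = 2 / 3        # stored as a floating-point number

# Sense 3: the storage or variable type.
print(type(value).__name__, type(ratio).__name__)   # int float

# Sense 4: the same stored integer in decimal, hexadecimal and binary.
print(format(value, "d"), format(value, "x"), format(value, "b"))
# 255 ff 11111111

# Sense 4 again: the same stored float with two choices of decimal places.
print(format(ratio, ".2f"), format(ratio, ".4f"))   # 0.67 0.6667
```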


There are already some good answers here, and I don't think I can go any deeper than Nick Cox or Franck Dernoncourt into the issue of whether "dataset" refers to the conceptual collection of related data or to the particular arrangement of those data, e.g. into a table/matrix or a computer-readable file. Franck's extract mentions edge cases like continuously collected data, or data spread across several tables, which are worth bearing in mind if you assumed there was going to be a simple definition. (Not all statistics software can handle it, but it is very easy to imagine a case where data is stored in a relational database with multiple tables. Is the entire database a single "dataset"?)

One thing I will add though is that datasets aren't generally sets, in the mathematical sense! Sensu stricto, a set either contains an object or it doesn't; it can't contain more than one copy of that object. If I roll a die eight times and score 1, 4, 3, 5, 5, 4, 6, 4 then the set of scores rolled is just {1, 3, 4, 5, 6}. Note that the elements could be in any order: I've just written them ascending in value, but the set {5, 4, 1, 6, 3} is mathematically equal to it, for instance. This isn't what we usually mean by a dataset though!
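A quick Python sketch of this, using the eight rolls from the example; the variable names are only illustrative.

```python
# The eight die rolls from the example, in the order they were rolled.
rolls = [1, 4, 3, 5, 5, 4, 6, 4]

# A mathematical set keeps each distinct value once and has no order.
distinct_scores = set(rolls)

print(sorted(distinct_scores))              # [1, 3, 4, 5, 6]
print(distinct_scores == {5, 4, 1, 6, 3})   # True
print(len(rolls), len(distinct_scores))     # 8 5
```

Three of the eight observations disappear once the rolls are put into a set.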

A multiset (or bag) allows entries to be repeated, e.g. {1, 4, 3, 5, 5, 4, 6, 4}, though note this still doesn't include a sense of order, so is equal to {1, 3, 4, 4, 4, 5, 5, 6}. So the "set" in "dataset" really means "multiset". Moreover, if you want order to be preserved, you might instead use a vector: (1, 4, 3, 5, 5, 4, 6, 4) is not the same as (1, 3, 4, 4, 4, 5, 5, 6). But this is only for recording one variable; for several, it may be more convenient to use a matrix to tabulate with order preserved. For more sophisticated situations such as measuring a property of a 3D grid of voxels over time, you might even move up to arranging the data in a tensor.
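The multiset-versus-vector distinction can be sketched in Python. This assumes that `collections.Counter` is an acceptable stand-in for a multiset and a tuple for a vector; both are my choices for illustration.

```python
from collections import Counter

rolled = (1, 4, 3, 5, 5, 4, 6, 4)        # values in the order rolled
ascending = (1, 3, 4, 4, 4, 5, 5, 6)     # the same values, sorted

# Counter acts as a multiset: it records multiplicity but not order.
same_multiset = Counter(rolled) == Counter(ascending)

# Tuples act as vectors: position matters.
same_vector = rolled == ascending

print(same_multiset)         # True
print(same_vector)           # False
print(Counter(rolled)[4])    # 3
```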

But note that conceptually a multiset may suffice in most simple situations, even if it's inconvenient for practical purposes. If I tossed a coin simultaneously with rolling the die, and wanted to record the two results together, then I could use a multiset like {(1, H), (3, T), (4, H), (4, H), (4, T), (5, H), (5, T), (6, T)} instead of a matrix. An ordinary set will not suffice, as it wouldn't count the multiplicity of the (4, H), for instance.
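The same comparison can be sketched for the die-and-coin pairs, again assuming `collections.Counter` as a stand-in for a multiset.

```python
from collections import Counter

# The eight (die, coin) pairs from the example above.
pairs = [(1, "H"), (3, "T"), (4, "H"), (4, "H"),
         (4, "T"), (5, "H"), (5, "T"), (6, "T")]

as_multiset = Counter(pairs)   # keeps multiplicity
as_set = set(pairs)            # drops repeated pairs

print(as_multiset[(4, "H")])        # 2
print(sum(as_multiset.values()))    # 8
print(len(as_set))                  # 7
```

The ordinary set holds only seven elements because the second (4, H) is lost, whereas the multiset still accounts for all eight trials.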

I could buy the idea that a dataset is a set of observations with just the wrinkle that it might need their identifiers to make them distinct. But you're right that the meaning here is some distance from that in set theory. I'd underline, as you hint here, that the order of observations is often crucial and will often, but not always, be given by a time or other ordering variable(s). – Nick Cox 2 hours ago
    
@NickCox (+1) Indeed what I haven't yet found the time, or moreover manner, to express is that observations often come with an identifier - sometimes temporal, sometimes location-based, sometimes both. When we encode the data into a vector, matrix or tensor, that often directly provides the structure we want and an explicit identifier (like a hard-coded index) may be rendered unnecessary, particularly if it is only order or relative position that matters. No doubt there is a correct terminology for all this. – Silverfish 7 mins ago
