What is exactly meant by a “data set”?

Question

Is it just the aggregation of data points? Or is it the representation of data points for different elements in a tabular format arranged with values of the different variables? How is it different from raw data?

What do you mean by "data point", do you expect it to be at least 2D? A time-series or a set of exam scores can be a data set; at minimum those could just be series in one variable, possibly without row labels. Per the answer by @FranckDernoncourt — smci, 2 hours ago

Franck Dernoncourt · Answer 1 · 2016-11-05 17:08:41Z

I think that Wikipedia does a decent job at defining it:

Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.

The term data set may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event. An example of this type is the data sets collected by space agencies performing experiments with instruments aboard space probes.

In the open data discipline, dataset is the unit to measure the information released in a public open data repository. The European Open Data portal aggregates more than half a million datasets. In this field other definitions have been proposed but currently there is not an official one. Some other issues (real-time data sources, non-relational datasets, etc.) increase the difficulty of reaching a consensus about it.

As you can see, the term is somewhat vague.

And in a computer vision setting, a data set could just be a collection of natural images and their labels or annotations. — Sycorax, 16 hours ago
@ankit The traditional CS meaning en.wikipedia.org/wiki/Database — Franck Dernoncourt, 8 hours ago
@Sycorax Yes, I guess we could consider one image (or some other signal) as one blob datum in the database. — Franck Dernoncourt, 8 hours ago

Kodiologist · Answer 2 · 2016-11-05 17:09:39Z


In my experience, "dataset" (or "data set") is an informal term that refers to a collection of data. Generally a dataset contains more than one variable and concerns a single topic; it's likely to concern a single sample.

A mistake I often see writers of Cross Validated questions make is using "dataset" as a synonym for "variable" or "vector".


Agreed on dataset vs variable or vector. Don't get me started on "a data", as in "I have a data". Conversely, "I have a dataset" is a wonderful way to avoid irritating either those who insist that data are plural or those who regard that insistence as pedantic, if they think about it at all. – Nick Cox 20 hours ago


@NickCox In the grammar wars over "data", I'm in the least popular faction, which claims that "data" is a mass noun. – Kodiologist 19 hours ago


I suspect that's a majority now and more strongly think it's gaining popularity. – Nick Cox 18 hours ago


Nick Cox · Answer 3 · 2016-11-05 17:10:26Z

I think you might need to define data point before you can define data set: why is one primitive and not needing definition, but not vice versa?

At least two definitions make sense to me:

One or more observations (cases, records, rows) for one or more variables (fields, columns).
Whatever is stored as data within a file readable by a program of choice.

Tabular layout is common but I don't think it's part of any definition; how the data are stored can be practically important, naturally.
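To illustrate that layout is separate from content, here is a minimal Python sketch. The variable names and values (height, weight) are made up for illustration. It stores the same three observations row-wise and column-wise, then checks that each layout can be recovered from the other.

```python
# Hypothetical data: three observations of two variables.
# Row-wise layout: one record (dict) per observation.
rows = [
    {"height": 1.62, "weight": 58.0},
    {"height": 1.75, "weight": 72.5},
    {"height": 1.80, "weight": 80.1},
]

# Column-wise layout: one list per variable.
columns = {
    "height": [1.62, 1.75, 1.80],
    "weight": [58.0, 72.5, 80.1],
}


def rows_to_columns(records):
    """Convert a list of records into a dict of per-variable lists."""
    names = list(records[0])
    return {name: [rec[name] for rec in records] for name in names}


def columns_to_rows(table):
    """Convert a dict of per-variable lists into a list of records."""
    names = list(table)
    n = len(table[names[0]])
    return [{name: table[name][i] for name in names} for i in range(n)]


# Both conversions round-trip, so the two layouts hold the same content.
same_content = rows_to_columns(rows) == columns and columns_to_rows(columns) == rows
print(same_content)  # True
```

Which layout is more convenient depends on the program reading it, which is the practical point about storage made above.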

P.S. The word "format" is so overloaded that to me it's best avoided unless specified unambiguously. I've seen it used for

General or specific text or binary file format
Data structure, e.g. tabular or other
Data storage or variable types, e.g. bit, integer, real, character
Display format controlling presentation, e.g. details on number of decimal places; decimal, hexadecimal or binary display.
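The last two senses are easy to conflate, so here is a small Python sketch of the difference. The value 255 is arbitrary. The stored value and its type stay fixed while only the display format changes.

```python
value = 255  # stored as an integer: this is the storage type

# Display formats: same stored value, different presentations.
as_decimal = format(value, "d")        # '255'
as_hex = format(value, "x")            # 'ff'
as_binary = format(value, "b")         # '11111111'
as_two_places = format(value, ".2f")   # '255.00'

print(type(value).__name__, as_decimal, as_hex, as_binary, as_two_places)

# Parsing the displayed strings recovers the same stored value.
print(int(as_hex, 16) == int(as_binary, 2) == value)  # True
```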

Silverfish · Answer 4 · 2016-11-06 10:49:48Z

There are already some good answers here, and I don't think I can go any deeper than Nick Cox or Franck Dernoncourt into the issue of whether "dataset" refers to the conceptual collection of related data, or to the particular arrangement of those data, e.g. into a table/matrix or a computer-readable file. Franck's extract mentions edge cases like continuously collected data, or data spread across several tables, which are worth bearing in mind if you assumed there was going to be a simple definition. (Not all statistics software can handle it, but it is very easy to imagine a case where data is stored in a relational database with multiple tables. Is the entire database a single "dataset"?)

One thing I will add, though, is that datasets aren't generally sets in the mathematical sense! Sensu stricto, a set either contains an object or it doesn't; it can't contain more than one copy of that object. If I roll a die eight times and score 1, 4, 3, 5, 5, 4, 6, 4, then the set of scores rolled is just {1, 3, 4, 5, 6}. Note that the elements could be in any order: I've written them in ascending order, but the set {5, 4, 1, 6, 3} is mathematically equal to it, for instance. This isn't what we usually mean by a dataset, though!

A multiset (or bag) allows entries to be repeated, e.g. {1, 4, 3, 5, 5, 4, 6, 4}, though note this still doesn't include a sense of order, so it is equal to {1, 3, 4, 4, 4, 5, 5, 6}. So the "set" in "dataset" really means "multiset". Moreover, if you want order to be preserved, you might instead use a vector: (1, 4, 3, 5, 5, 4, 6, 4) is not the same as (1, 3, 4, 4, 4, 5, 5, 6). But this is only for recording one variable; for several, it may be more convenient to use a matrix to tabulate with order preserved. For more sophisticated situations, such as measuring a property of a 3D grid of voxels over time, you might even move up to arranging the data in a tensor.
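For anyone who prefers to see this in code, here is a rough Python sketch of the die-roll example. Python has no built-in multiset type, so `collections.Counter` is used here as a stand-in (that choice is mine), with a tuple playing the role of the vector.

```python
from collections import Counter

rolls = (1, 4, 3, 5, 5, 4, 6, 4)  # vector: order and repeats both kept

as_set = set(rolls)           # repeats and order both discarded
as_multiset = Counter(rolls)  # repeats kept as counts, order discarded

# Set: order is irrelevant and duplicates have collapsed.
print(as_set == {5, 4, 1, 6, 3})                         # True

# Multiset: reordering the entries changes nothing.
print(as_multiset == Counter((1, 3, 4, 4, 4, 5, 5, 6)))  # True

# Vector: the sorted version is a different object.
print(rolls == (1, 3, 4, 4, 4, 5, 5, 6))                 # False

# Multiplicity of the score 4 survives only in the multiset and the vector.
print(as_multiset[4])                                    # 3
```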

But note that conceptually a multiset may suffice in most simple situations, even if it's inconvenient for practical purposes. If I tossed a coin simultaneously with rolling the die, and wanted to record the two results together, then I could use a multiset like {(1, H), (3, T), (4, H), (4, H), (4, T), (5, H), (5, T), (6, T)} instead of a matrix. An ordinary set will not suffice, as it wouldn't count the multiplicity of the (4, H), for instance.
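The die-and-coin case can be sketched the same way. Again this is only an illustration: `collections.Counter` stands in for a multiset, and the pairs are the ones listed above.

```python
from collections import Counter

# The eight (die, coin) outcomes from the text, one tuple per observation.
observations = [(1, "H"), (3, "T"), (4, "H"), (4, "H"),
                (4, "T"), (5, "H"), (5, "T"), (6, "T")]

as_multiset = Counter(observations)  # keeps how often each pair occurred
as_set = set(observations)           # keeps only which pairs occurred

print(sum(as_multiset.values()))  # 8: all observations accounted for
print(len(as_set))                # 7: the repeated (4, "H") has collapsed
print(as_multiset[(4, "H")])      # 2: multiplicity is preserved
```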

I could buy the idea that a dataset is a set of observations, with just the wrinkle that it might need their identifiers to make them distinct. But you're right that the meaning here is some distance from that in set theory. I would underline, as you hint here, that the order of observations is often crucial and will often, but not always, be given by a time or other ordering variable(s). — Nick Cox, 2 hours ago
@NickCox (+1) Indeed what I haven't yet found the time, or moreover manner, to express is that observations often come with an identifier - sometimes temporal, sometimes location-based, sometimes both. When we encode the data into a vector, matrix or tensor, that often directly provides the structure we want and an explicit identifier (like a hard-coded index) may be rendered unnecessary, particularly if it is only order or relative position that matters. No doubt there is a correct terminology for all this. — Silverfish, 7 mins ago

