Is it just the aggregation of data points? Or is it the representation of data points for different elements in a tabular format arranged with values of the different variables? How is it different from raw data?
I think that Wikipedia does a decent job at defining it:
As you can see, the term is somewhat vague.
In my experience, "dataset" (or "data set") is an informal term that refers to a collection of data. Generally a dataset contains more than one variable and concerns a single topic; it's likely to concern a single sample. A mistake I often see writers of Cross Validated questions make is using "dataset" as a synonym for "variable" or "vector".
I think you might need to define data point before you can define data set: why should one be primitive and need no definition, rather than the other? At least two definitions make sense to me:
Tabular layout is common but I don't think it's part of any definition; how the data are stored can be practically important, naturally. P.S. The word "format" is so overloaded that to me it's best avoided unless specified unambiguously. I've seen it used for
There are already some good answers here, and I don't think I can go any deeper than Nick Cox or Franck Dernoncourt into the issue of whether "dataset" refers to the conceptual collection of related data, or to the particular arrangement of those data, e.g. into a table/matrix or a computer-readable file. Franck's extract mentions edge cases like continuously collected data, or data spread across several tables, which are worth bearing in mind if you assumed there was going to be a simple definition. (Not all statistics software can handle it, but it is very easy to imagine a case where data are stored in a relational database with multiple tables. Is the entire database a single "dataset"?)
One thing I will add, though, is that datasets aren't generally sets in the mathematical sense! Sensu stricto, a set either contains an object or it doesn't; it can't contain more than one copy of that object. If I roll a die eight times and score 1, 4, 3, 5, 5, 4, 6, 4, then the set of scores rolled is just {1, 3, 4, 5, 6}. Note that the elements could be in any order: I've written them in ascending order, but the set {5, 4, 1, 6, 3} is mathematically equal to it, for instance. This isn't what we usually mean by a dataset, though!
A multiset (or bag) allows entries to be repeated, e.g. {1, 4, 3, 5, 5, 4, 6, 4}, though note this still doesn't include a sense of order, so it is equal to {1, 3, 4, 4, 4, 5, 5, 6}. So the "set" in "dataset" really means "multiset". Moreover, if you want order to be preserved, you might instead use a vector: (1, 4, 3, 5, 5, 4, 6, 4) is not the same as (1, 3, 4, 4, 4, 5, 5, 6). But this is only for recording one variable; for several, it may be more convenient to use a matrix, tabulating the data with order preserved. For more sophisticated situations, such as measuring a property of a 3D grid of voxels over time, you might even move up to arranging the data in a tensor.
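If it helps to see the distinction concretely, here is a minimal sketch in Python. The mapping is my own choice for illustration, not a standard one: I'm assuming the built-in `set` stands in for a mathematical set, `collections.Counter` for a multiset, and a `tuple` for a vector.

```python
from collections import Counter

# The eight die rolls from the example, in the order they were thrown.
rolls = [1, 4, 3, 5, 5, 4, 6, 4]

as_set = set(rolls)            # duplicates collapse, order ignored
as_multiset = Counter(rolls)   # multiplicities kept, order ignored
as_vector = tuple(rolls)       # multiplicities and order both kept

# A set is equal to any reordering of its distinct elements.
print(as_set == {5, 4, 1, 6, 3})

# A multiset is equal to any reordering that keeps the same counts.
print(as_multiset == Counter([1, 3, 4, 4, 4, 5, 5, 6]))

# A vector is not equal to a reordering of itself.
print(as_vector == (1, 3, 4, 4, 4, 5, 5, 6))

# Only the multiset and the vector remember that 4 was rolled three times.
print(as_multiset[4], as_vector.count(4), len(as_set))
```

The first two comparisons come out true and the third false, which is the set / multiset / vector distinction in miniature.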
But note that conceptually a multiset may suffice in most simple situations, even if it's inconvenient for practical purposes. If I tossed a coin simultaneously with rolling the die, and wanted to record the two results together, then I could use a multiset like {(1, H), (3, T), (4, H), (4, H), (4, T), (5, H), (5, T), (6, T)} instead of a matrix. An ordinary set will not suffice, as it wouldn't count the multiplicity of the (4, H), for instance.
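The die-and-coin case can be sketched the same way. Again this is only an illustration under my own assumptions: each observation is a hypothetical `(score, face)` tuple, with `collections.Counter` playing the multiset and the built-in `set` playing the ordinary set.

```python
from collections import Counter

# Each record pairs a die score with the coin face tossed alongside it.
records = [
    (1, "H"), (3, "T"), (4, "H"), (4, "H"),
    (4, "T"), (5, "H"), (5, "T"), (6, "T"),
]

bag = Counter(records)   # multiset of (score, face) pairs
plain = set(records)     # ordinary set of the same pairs

# The multiset keeps track of (4, H) having occurred twice.
print(bag[(4, "H")])

# The multiset still accounts for all eight observations.
print(sum(bag.values()))

# The ordinary set has silently merged the repeated (4, H).
print(len(plain))
```

The multiset reports two occurrences of (4, H) and eight observations in total, whereas the ordinary set is left with only seven distinct pairs.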