How a data frame differs from a two-dimensional matrix

data.frame: Data Frames

Description

The function data.frame() creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.


Usage

data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())



Arguments

...: these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.

row.names: NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.

check.rows: if TRUE then the rows are checked for consistency of length and names.

check.names: logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.

fix.empty.names: logical indicating if arguments which are "unnamed" (in the sense of not being formally called as someName = arg) get an automatically constructed name or rather name "". Needs to be set to FALSE even when check.names is false if "" names should be kept.

stringsAsFactors: logical: should character vectors be converted to factors? The 'factory-fresh' default is TRUE, but this can be changed by setting options(stringsAsFactors = FALSE). This describes R before version 4.0.0; from R 4.0.0 onward the default is FALSE.
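A minimal sketch of how check.names and stringsAsFactors behave. The column names and values are made up for illustration, and stringsAsFactors is passed explicitly so the result does not depend on the session default.

```r
# check.names = TRUE (the default) repairs invalid or duplicated names
# through make.names().
fixed <- data.frame("my col" = 1:2, "my col" = 3:4)
names(fixed)     # "my.col" "my.col.1"

# check.names = FALSE keeps the names exactly as given.
raw <- data.frame("my col" = 1:2, "my col" = 3:4, check.names = FALSE)
names(raw)       # "my col" "my col"

# stringsAsFactors decides whether character vectors become factors.
as_fac <- data.frame(s = c("a", "b"), stringsAsFactors = TRUE)
as_chr <- data.frame(s = c("a", "b"), stringsAsFactors = FALSE)
class(as_fac$s)  # "factor"
class(as_chr$s)  # "character"
```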


Value

A data frame, a matrix-like structure whose columns may be of differing types (numeric, logical, factor and character and so on).
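The contrast with a two-dimensional matrix can be sketched as follows: a matrix holds a single type, so mixing integers and strings coerces everything to character, while a data frame keeps a type per column. The values below are illustrative only.

```r
# A matrix has one type for all cells: the integers become strings.
m <- cbind(id = 1:3, label = c("a", "b", "c"))
typeof(m)          # "character"
dim(m)             # 3 2

# A data frame stores each column separately, so types are preserved.
df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
sapply(df, class)  # id: "integer", label: "character"
dim(df)            # 3 2
```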

How the names of the data frame are created is complex, and the rest of this paragraph is only the basic story. If the arguments are all named and simple objects (not lists, matrices or data frames) then the argument names give the column names. For an unnamed simple argument, a deparsed version of the argument is used as the name (with an enclosing I(...) removed). For a named matrix/list/data frame argument with more than one named column, the names of the columns are the name of the argument followed by a dot and the column name inside the argument: if the argument is unnamed, the argument's column names are used. For a named or unnamed matrix/list/data frame argument that contains a single column, the column name in the result is the column name in the argument. Finally, the names are adjusted to be unique and syntactically valid unless check.names = FALSE.
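The basic naming rules above can be checked with a few small calls. The objects x and m are hypothetical inputs chosen for illustration.

```r
x <- 1:2
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))

# Named simple arguments: the argument names become the column names.
names(data.frame(first = x, second = x + 1L))  # "first" "second"

# Unnamed simple argument: the deparsed expression is used.
names(data.frame(x))                           # "x"

# An enclosing I() is dropped from the deparsed name.
names(data.frame(I(x)))                        # "x"

# Named matrix with several named columns: "<argument>.<column>".
names(data.frame(m = m))                       # "m.a" "m.b"

# Unnamed matrix: its own column names are used.
names(data.frame(m))                           # "a" "b"
```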


Details

A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". If no variables are included, the row names determine the number of rows.
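A short sketch of the list nature of a data frame, and of the case with no variables. The values are illustrative.

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
class(df)    # "data.frame"
is.list(df)  # TRUE
length(df)   # 2: one list element per column
df[["y"]]    # 2.5 3.5 4.5

# With no variables, the row names alone determine the number of rows.
empty <- data.frame(row.names = c("a", "b", "c"))
dim(empty)   # 3 0
```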

The column names should be non-empty, and attempts to use empty names will have unsupported results. Duplicate column names are allowed, but you need to use check.names = FALSE for data.frame to generate such a data frame. However, not all operations on data frames will preserve duplicated column names: for example matrix-like subsetting will force column names in the result to be unique.
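As a sketch of the duplicate-name behaviour described above (the column name a is arbitrary):

```r
# Duplicate names survive construction only with check.names = FALSE.
dup <- data.frame(a = 1:2, a = 3:4, check.names = FALSE)
names(dup)         # "a" "a"

# Matrix-like subsetting forces the names in the result to be unique.
sub <- dup[, 1:2]
names(sub)         # "a" "a.1"
```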

data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns unless protected by I or argument stringsAsFactors is false. If a list or data frame or matrix is passed to data.frame it is as if each component or column had been passed as a separate argument (except for matrices protected by I).
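A minimal sketch of I() protection and of how a list argument is spread into separate columns. The vector chars and the column names are made up, and stringsAsFactors is set explicitly in each call.

```r
chars <- c("a", "b", "c")

# With stringsAsFactors = TRUE, only the unprotected vector is converted.
df <- data.frame(converted = chars, protected = I(chars),
                 stringsAsFactors = TRUE)
is.factor(df$converted)     # TRUE
is.character(df$protected)  # TRUE
class(df$protected)         # "AsIs"

# A list argument behaves as if each component were passed separately.
spread <- data.frame(list(u = 1:2, v = c("x", "y")), w = 3:4,
                     stringsAsFactors = FALSE)
names(spread)               # "u" "v" "w"
```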

Objects passed to data.frame should have the same number of rows, but atomic vectors (see is.vector), factors and character vectors protected by I will be recycled a whole number of times if necessary (including as elements of list arguments).
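Recycling can be sketched like this. Lengths 2 and 1 divide 6 evenly, so they are recycled; length 4 does not, so the call fails. The exact error text is not reproduced here because it depends on the locale.

```r
# Shorter vectors are recycled a whole number of times.
df <- data.frame(id = 1:6, group = c("a", "b"), flag = TRUE,
                 stringsAsFactors = FALSE)
nrow(df)   # 6
df$group   # "a" "b" "a" "b" "a" "b"

# 4 does not divide 6, so data.frame() signals an error.
res <- tryCatch(data.frame(id = 1:6, bad = 1:4),
                error = function(e) conditionMessage(e))
is.character(res)  # TRUE: an error message was captured
```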

If row names are not supplied in the call to data.frame, the row names are taken from the first component that has suitable names, for example a named vector or a matrix with rownames or a data frame. (If that component is subsequently recycled, the names are discarded with a warning.) If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be 'automatic', and not preserved by as.matrix).
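A small sketch of where row names come from, using a hypothetical named vector:

```r
named <- c(first = 10, second = 20)

# Row names are taken from the first component with suitable names.
df <- data.frame(value = named)
rownames(df)               # "first" "second"
rownames(as.matrix(df))    # "first" "second"

# row.names = NULL requests automatic row names instead.
auto <- data.frame(value = named, row.names = NULL)
rownames(auto)             # "1" "2"
rownames(as.matrix(auto))  # NULL: automatic row names are not preserved
```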

If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
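The single-row special case can be sketched by giving the same row.names value to a two-row and a one-row call. The column names id and v are illustrative.

```r
# Two rows: row.names = "id" selects the id column as row names,
# and that column is removed from the result.
two <- data.frame(id = c("a", "b"), v = 1:2, row.names = "id",
                  stringsAsFactors = FALSE)
rownames(two)  # "a" "b"
names(two)     # "v"

# One row: the same value is taken as the row name itself.
one <- data.frame(id = "a", v = 1, row.names = "id",
                  stringsAsFactors = FALSE)
rownames(one)  # "id"
names(one)     # "id" "v"
```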

Names are removed from vector inputs not protected by I.
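A brief sketch of the sentence above, with a hypothetical named vector. Only the unprotected case carries an expected result; the I() case is shown for comparison.

```r
named <- c(a = 1, b = 2)

# Unprotected vector: the column itself no longer carries names.
plain <- data.frame(x = named)
names(plain$x)  # NULL

# Protected by I(): inspect whether the names stayed on the column.
kept <- data.frame(x = I(named))
names(kept$x)
```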

default.stringsAsFactors is a utility that takes getOption("stringsAsFactors") and ensures the result is TRUE or FALSE (or throws an error if the value is not NULL).


References

Chambers, J. M. (1992) Data for models. Chapter 3 of Statistical Models in S, eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.



Examples

L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
data.frame(1, 1:10, sample(L3, 10, replace = TRUE))

is.data.frame(d)

## do not convert to factor, using I() :
(dd <- cbind(d, char = I(letters[1:10])))
rbind(class = sapply(dd, class), mode = sapply(dd, mode))

stopifnot(1:10 == row.names(d))

(d0  <- d[, FALSE])   # data frame with 0 columns and 10 rows
(d.0 <- d[FALSE, ])   # data frame with 0 rows (3 named cols)
(d00 <- d0[FALSE, ])  # data frame with 0 columns and 0 rows




A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

Every DataFrame contains a blueprint, known as a schema, that defines the name and data type of each column. Spark DataFrames can contain universal data types like StringType and IntegerType, as well as data types that are specific to Spark, such as StructType. Missing or incomplete values are stored as null values in the DataFrame.

A simple analogy is that a DataFrame is like a spreadsheet with named columns. The difference is that a spreadsheet sits on one computer in one specific location, while a distributed DataFrame, such as a Spark DataFrame, can span thousands of computers. In this way, DataFrames make it possible to do analytics on big data, using distributed computing clusters.

The reason for putting the data on more than one computer is intuitive: either the data is too large to fit on one machine, or the computation would simply take too long on one machine.


The concept of a DataFrame is common across many different languages and frameworks. DataFrames are the main data type used in pandas, the popular Python data analysis library, and DataFrames are also used in R, Scala, and other languages.
