NFSv4 Working Group                                        D. Hildebrand
Internet Draft                                               IBM Almaden
Intended status: Standards Track                            T. Myklebust
Expires: January 2012                                             NetApp
                                                              S. Falkner
                                                                  Oracle
                                                            July 7, 2011
Support for posix_fadvise
draft-hildebrand-nfsv4-fadvise-02.txt
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on January 7, 2012.
Hildebrand, et al. Expires January 7, 2012 [Page 1]
Internet-Draft Support for posix_fadvise July 2011
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the BSD License.
Abstract
This document proposes a new FADVISE operation for NFSv4.2 to support
the posix_fadvise function. FADVISE will communicate expected
application behavior to the server, allowing servers to optimize
future I/O requests for a file. The posix_fadvise function is
supported in Linux and many other operating systems and is starting
to be widely used by applications. In addition, the FADVISE operation
can communicate other application directions such as the use of
direct I/O.
Table of Contents
1. Introduction
1.1. Requirements Language
2. POSIX Requirements
3. Other Requirements
4. Operation TBD: FADVISE - Send application access pattern
hints to server
4.1. ARGUMENTS
4.2. RESULTS
4.3. DESCRIPTION
4.4. IMPLEMENTATION
5. Security Considerations
6. IANA Considerations
7. References
7.1. Normative References
7.2. Informative References
8. Acknowledgments
1. Introduction
NFS is now used in many data centers as the sole or primary method of
data access. Consequently, more types of applications are using NFS
than ever before, each with their own requirements and generated
workloads. This document puts forth a proposal for the NFSv4.2
protocol to support the posix_fadvise function [2], allowing
applications to communicate their expected behavior to the server.
The posix_fadvise function allows applications to provide hints to
the storage system regarding their expected access pattern, e.g.,
sequential or random, and data re-use behavior, e.g., a data range
will be read multiple times and should be cached. These hints allow
the file system to understand what optimizations it should implement
for a specific access to a file. For example, if an application
indicates it will never read the data more than once, then the file
system can avoid polluting the data cache by not caching the data.
Another instance where applications provide an indication of their
desired I/O behavior is when an application specifies the use of
direct I/O. This can be done in Linux and AIX via the open()
O_DIRECT parameter and in Solaris via the directio() function.
Applications specifying the use of direct I/O are telling the file
system that it must not cache file data.
While applications can use the posix_fadvise function and direct I/O
today, with NFS these affect behavior only on the client. While
this can help the NFS client optimize I/O and caching for a file, it
does not allow the NFS server and its exported file system to do
likewise. For example, with direct I/O, while the client no longer
caches data, the NFS server and its exported file system will
continue caching data. By caching data that will not be re-read, the
server is polluting its cache and possibly causing useful cached data
to be evicted.
One option is to modify the existing READ and WRITE operations with
FADVISE hints. In the case of READ, optimizations are related to
prefetching. In the case of WRITE, FADVISE hints inform the server
whether it should write through its read cache or whether it should
use an O_DIRECT-like mechanism in order to do an uncached write. In
both cases, we're talking about hints that constitute a client's best
estimate for how it will be using the data in the future. While that
estimate may indeed change, it is only useful to the server if it is
stable for a non-zero period of time, i.e., more than a single READ
or WRITE operation.
This document adds a new FADVISE operation to communicate the client
file access patterns as specified in posix_fadvise to the NFS server.
The NFS server upon receiving a FADVISE operation MAY choose to
change how it performs I/O and its caching policies, but is under no
obligation to do so.
The XDR description is provided in this document in a way that makes
it simple for the reader to extract into a ready-to-compile form.
The reader can feed this document into the following shell script to
produce the machine-readable XDR description of the FADVISE operation:
#!/bin/sh
grep "^ *///" | sed 's?^ */// ??' | sed 's?^.*///??'
I.e. if the above script is stored in a file called "extract.sh", and
this document is in a file called "spec.txt", then the reader can do:
sh extract.sh < spec.txt > md.x
The effect of the script is to remove leading white space from each
line of the specification, plus a sentinel sequence of "///".
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].
2. POSIX Requirements
This proposal is to create a new NFS operation to support the
posix_fadvise function, defined as follows [2]:
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
The posix_fadvise() function shall advise the implementation on the
expected behavior of the application with respect to the data in the
file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is
specified. The implementation may use this information to optimize
handling of the specified data. The posix_fadvise() function shall
have no effect on the semantics of other operations on the specified
data, although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice
parameter and may be one of the following values:
POSIX_FADV_NORMAL - Specifies that the application has no advice to
give on its behavior with respect to the specified data. It is the
default characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL - Specifies that the application expects to
access the specified data sequentially from lower offsets to higher
offsets.
POSIX_FADV_RANDOM - Specifies that the application expects to access
the specified data in a random order.
POSIX_FADV_WILLNEED - Specifies that the application expects to
access the specified data in the near future.
POSIX_FADV_DONTNEED - Specifies that the application expects that it
will not access the specified data in the near future.
POSIX_FADV_NOREUSE - Specifies that the application expects to access
the specified data once and then not reuse it thereafter.
Upon successful completion, posix_fadvise() shall return zero;
otherwise, an error number shall be returned to indicate the error.
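The following is a minimal sketch, not part of the POSIX text, of how
an application might use posix_fadvise() for a single sequential pass
over a file. The function name read_once_sequentially and the
4096-byte buffer are assumptions made for illustration only. Because
the advice is only a hint, the sketch reports a failure of
posix_fadvise() and continues.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request the posix_fadvise() declaration */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Read an open file once, front to back, and return the number of
 * bytes read, or -1 if read() fails.  The function name and the
 * buffer size are illustrative choices, not part of any standard.
 */
long read_once_sequentially(int fd)
{
    char buf[4096];
    long total = 0;
    ssize_t n;
    int err;

    assert(fd >= 0);

    /* Offset 0 with len 0 covers all data from the start of the file. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0) {
        /* The error number is the return value, not errno. */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    }

    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += (long)n;
    if (n < 0)
        return -1;

    /* The data will not be read again, so it need not stay cached. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    return total;
}
```

With NFS today, both calls in this sketch influence only the client;
the server never learns of them.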
3. Other Requirements
Many applications do not use or require POSIX semantics. These
applications may benefit from additional hints, and from additional
points at which hints can be set, that are not covered by
posix_fadvise. At this point, these hints and requirements are
unclear, but may include per-read and per-write hints as well as two
additional hints:
Opportunistic Prefetch - This hint indicates that the stateid holder
expects to access the data soon; prefetch if it can be done at a
marginal cost. The use case for this hint is unclear, since if the
client knows that it will want to read the data soon, then when would
it not want the server to prefetch the data at any cost?
Recently Used - The client has recently accessed the byte range in
its own cache. This informs the server that the data in the byte
range remains important to the client. When the server reaches
resource exhaustion, knowing which data is more important allows the
server to make better choices about which data to, for example, purge
from a cache or move to secondary storage. It also informs the
server which delegations are more important, since if delegations are
working correctly, once delegated to a client, a server might never
receive another I/O request for the file.
The use case for this is also unclear, as most clients already cache
data that they know is important and having this data cached twice
may be unnecessary. In fact, substantial performance improvements
have been demonstrated by making caches at different levels more
exclusive of each other [8], not the other way around. Other work
showed that even infinitely large secondary caches can be largely
ineffective [7], although this depends on the workload.
4. Operation TBD: FADVISE - Application access pattern hints to server
This section introduces a new operation, named FADVISE, which allows
NFS clients to communicate application file access pattern hints to
the NFS server. A new operation allows hints to be sent to the
server when applications use posix_fadvise or direct I/O, or at any
other point the client finds useful.
4.1. ARGUMENTS
enum fadvise_type {
    FADVISE_NORMAL     = 0,
    FADVISE_SEQUENTIAL = 1,
    FADVISE_RANDOM     = 2,
    FADVISE_WILLNEED   = 3,
    FADVISE_DONTNEED   = 4,
    FADVISE_NOREUSE    = 5
};

struct FADVISE4args {
    /* CURRENT_FH: file */
    stateid4    stateid;
    offset4     offset;
    length4     count;
    bitmap4     hints;
};
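As an illustration only, a client might translate a POSIX advice value
into the corresponding fadvise_type and then into a bit of the hints
bitmap as sketched below. This document does not state how
fadvise_type values are encoded in the bitmap4; the sketch assumes
that value N corresponds to bit N of the first bitmap word. The
helper names fadvise_from_posix and fadvise_hint_bit are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request the POSIX_FADV_* constants */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Values copied from the fadvise_type enumeration in the ARGUMENTS. */
enum fadvise_type {
    FADVISE_NORMAL     = 0,
    FADVISE_SEQUENTIAL = 1,
    FADVISE_RANDOM     = 2,
    FADVISE_WILLNEED   = 3,
    FADVISE_DONTNEED   = 4,
    FADVISE_NOREUSE    = 5
};

/* Map a POSIX advice constant to a fadvise_type, or -1 if unknown. */
int fadvise_from_posix(int advice)
{
    switch (advice) {
    case POSIX_FADV_NORMAL:     return FADVISE_NORMAL;
    case POSIX_FADV_SEQUENTIAL: return FADVISE_SEQUENTIAL;
    case POSIX_FADV_RANDOM:     return FADVISE_RANDOM;
    case POSIX_FADV_WILLNEED:   return FADVISE_WILLNEED;
    case POSIX_FADV_DONTNEED:   return FADVISE_DONTNEED;
    case POSIX_FADV_NOREUSE:    return FADVISE_NOREUSE;
    default:                    return -1;
    }
}

/*
 * ASSUMPTION: fadvise_type value N is carried as bit N of the first
 * 32-bit word of the hints bitmap.
 */
uint32_t fadvise_hint_bit(enum fadvise_type t)
{
    assert((unsigned)t <= FADVISE_NOREUSE);
    return UINT32_C(1) << (unsigned)t;
}
```

Under this assumed encoding, several hints can be combined in one
request by OR-ing their bits together.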
4.2. RESULTS
struct FADVISE4resok {
    bitmap4     hints_res;
};

union FADVISE4res switch (nfsstat4 status) {
case NFS4_OK:
    FADVISE4resok   fadvise_resok4;
default:
    void;
};
4.3. DESCRIPTION
The FADVISE operation sends an I/O access pattern hint to the server
for the owner of stateid, for the byte range specified by offset
and count. The byte range need not currently exist in the file, but
the hint will apply to the byte range when it does exist. The server
MAY ignore the advice.
The following are the possible hints:
o FADVISE_NORMAL - Specifies that the application has no advice to
give on its behavior with respect to the specified data. It is
the default characteristic if no advice is given for an open
file.
o FADVISE_SEQUENTIAL - Specifies that the application expects to
access the specified data sequentially from lower offsets to
higher offsets.
o FADVISE_RANDOM - Specifies that the application expects to access
the specified data in a random order.
o FADVISE_WILLNEED - Specifies that the application expects to
access the specified data in the near future.
o FADVISE_DONTNEED - Specifies that the application expects that it
will not access the specified data in the near future.
o FADVISE_NOREUSE - Specifies that the application expects to access
the specified data once and then not reuse it thereafter.
The server will return success if the operation is properly formed;
otherwise, the server will return an error. The server MUST NOT
return an error if it does not recognize or does not support the
requested advice.
The hints_res returned by the server is primarily for debugging
purposes and the client SHOULD NOT use this information to change or
modify its file access behavior. This is for several reasons. First,
the server is under no obligation to carry out any hints that it
describes in the hints_res result. Second, the FADVISE operation is
a point-in-time operation, and the server can only respond based upon
information at this point in time. As time progresses, the server
may need to change its handling of a given file due to several
reasons including, but not limited to, memory pressure, additional
FADVISE hints sent by other clients, and heuristically detected file
access patterns.
The server MAY return different advice than what the client
requested. If it does, then this might be due to one of several
conditions, including, but not limited to: another client advising of
a different I/O access pattern; a different I/O access pattern from
another client that the server has heuristically detected; or the
server not being able to support the requested I/O access pattern,
perhaps due to a temporary resource limitation.
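One possible server-side policy consistent with the rules above is
sketched here; it is an assumption for illustration, not required
behavior. The server reports in hints_res only those requested hints
that it currently intends to honor, and it silently drops hints it
does not recognize or support rather than returning an error. The
names server_fadvise, fadvise_reply, and FADVISE_KNOWN_MASK are
hypothetical, and the sketch assumes one bit per fadvise_type value
in the first bitmap word.

```c
#include <assert.h>
#include <stdint.h>

enum { NFS4_OK = 0 };

/* ASSUMPTION: bits 0..5 carry the six fadvise_type values. */
#define FADVISE_KNOWN_MASK UINT32_C(0x3F)

struct fadvise_reply {
    int      status;     /* NFS4_OK for every well-formed request     */
    uint32_t hints_res;  /* hints the server currently intends to use */
};

/*
 * requested: hint bits sent by the client (may contain unknown bits).
 * supported: hint bits this server is willing to act on right now.
 */
struct fadvise_reply server_fadvise(uint32_t requested, uint32_t supported)
{
    struct fadvise_reply r;

    /* The server's own policy mask must name only hints it knows. */
    assert((supported & ~FADVISE_KNOWN_MASK) == 0);

    /* Unknown or unsupported hints are dropped, never rejected. */
    r.status = NFS4_OK;
    r.hints_res = requested & supported;
    return r;
}
```

A server could equally report hints that differ from the request, for
example when another client's advice takes precedence; the sketch does
not model that case.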
4.4. IMPLEMENTATION
The NFS client may choose to issue a FADVISE operation to the
server in several different instances. The most obvious is in direct
response to an application's call to posix_fadvise. Another
useful point would be when an application indicates it is using
direct I/O. Direct I/O may be specified at file open, in which case
a FADVISE may be included in the same compound as the OPEN operation
with the FADVISE_NOREUSE flag set. Direct I/O may also be specified
separately, in which case a FADVISE operation can be sent to the
server separately.
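The direct I/O case at open time might be planned by a client roughly
as follows. This is a sketch under stated assumptions: the
compound_op labels and the plan_open helper are hypothetical, the
compound is reduced to the operations relevant here, and
FADVISE_NOREUSE is assumed to be carried as bit 5 of the first hints
bitmap word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { FADVISE_NOREUSE = 5 };   /* value from fadvise_type */

/* Hypothetical labels for the operations in one COMPOUND request. */
enum compound_op { OP_PUTFH, OP_OPEN, OP_FADVISE };

struct compound_plan {
    enum compound_op ops[3];
    size_t           nops;
    uint32_t         hints;  /* meaningful only if FADVISE is present */
};

/* Plan the compound for an open; direct_io reflects the application's
 * request for uncached I/O (for example, O_DIRECT on Linux). */
struct compound_plan plan_open(bool direct_io)
{
    struct compound_plan p = { { OP_PUTFH, OP_OPEN }, 2, 0 };

    if (direct_io) {
        assert(p.nops < sizeof p.ops / sizeof p.ops[0]);
        /* FADVISE follows OPEN so that it applies to the opened file. */
        p.ops[p.nops++] = OP_FADVISE;
        p.hints = UINT32_C(1) << FADVISE_NOREUSE;
    }
    return p;
}
```

When direct I/O is enabled after the file is open, the same hint would
instead travel in a later compound of its own.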
5. Security Considerations
None.
6. IANA Considerations
The fadvise_type enumeration should be extensible.
7. References
7.1. Normative References
[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
[2] The IEEE and The Open Group, "IEEE Std 1003.1, 2004 Edition,
The Open Group Technical Standard Base Specifications, Issue
6", 2004.
7.2. Informative References
[1] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS)
version 4 Protocol", RFC 3530, April 2003.
[2] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January
2010.
[3] Shepler, S., Eisler, M., and D. Noveck, "Network File System
(NFS) Version 4 Minor Version 1 External Data Representation
Standard (XDR) Description", RFC 5662, January 2010.
[4] Nowicki, B., "NFS: Network File System Protocol specification",
RFC 1094, March 1989.
[5] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
Protocol Specification", RFC 1813, June 1995.
[6] VanDeBogart, S., Frost, C., and E. Kohler, "Reducing Seek
Overhead with Application-Directed Prefetching", in Proceedings of
the USENIX Annual Technical Conference, June 2009.
[7] Muntz, D. and P. Honeyman, "Multi-level Caching in Distributed
File Systems", in Proceedings of the USENIX Annual Technical
Conference, 1992.
[8] Wong, T.M. and J. Wilkes, "My cache or yours? Making storage
more exclusive", in Proceedings of the USENIX Annual Technical
Conference, 2002.
8. Acknowledgments
This document was prepared using 2-Word-v2.0.template.dot.
Authors' Addresses
Dean Hildebrand
IBM Almaden
650 Harry Rd
San Jose, CA 95120
Phone: +1 408-927-2013
Email: dhildeb@us.ibm.com
Trond Myklebust
NetApp
3215 Bellflower Ct
Ann Arbor, MI 48103
USA
Phone: +1-734-662-6608
Email: Trond.Myklebust@netapp.com
Sam Falkner
Oracle
500 Eldorado Blvd.
Broomfield, CO 80021
Phone: +1 720-279-4303
Email: sam.falkner@oracle.com