Internet DRAFT - draft-hildebrand-nfsv4-fadvise

draft-hildebrand-nfsv4-fadvise



NFSv4 Working Group                                       D. Hildebrand
Internet Draft                                              IBM Almaden
Intended status: Standards Track                           T. Myklebust
Expires: January 2012                                            NetApp
                                                              S. Falkner
                                                                  Oracle
                                                            July 7, 2011


                         Support for posix_fadvise
                   draft-hildebrand-nfsv4-fadvise-02.txt


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 7, 2011.




Hildebrand, et al.     Expires January 7, 2012                 [Page 1]

Internet-Draft        Support for posix_fadvise               July 2011


Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Abstract

   This document proposes a new FADVISE operation for NFSv4.2 to support
   the posix_fadvise function.  FADVISE will communicate expected
   application behavior to the server, allowing servers to optimize
   future I/O requests for a file.  The posix_fadvise function is
   supported in Linux and many other operating systems and is starting
   be widely used by applications. In addition, the FADVISE operation
   can communicate other application directions such as the use of
   direct I/O.

Table of Contents


   1. Introduction...................................................3
      1.1. Requirements Language.....................................4
   2. POSIX Requirements.............................................4
   3. Other Requirements.............................................5
   4. Operation TBD: FADVISE - Send application access pattern
      hints to server ...............................................6


Hildebrand, et al.     Expires January 7, 2012                 [Page 2]

Internet-Draft        Support for posix_fadvise               July 2011


      4.1. ARGUMENTS.................................................6
      4.2. RESULTS...................................................7
      4.3. DESCRIPTION...............................................7
      4.4. IMPLEMENTATION............................................8
   5. Security Considerations........................................8
   6. IANA Considerations............................................8
   7. References.....................................................9
      7.1. Normative References......................................9
      7.2. Informative References....................................9
   8. Acknowledgments................................................9

1. Introduction

   NFS is now used in many data centers as the sole or primary method of
   data access.  Consequently, more types of applications are using NFS
   than ever before, each with their own requirements and generated
   workloads.  This document puts forth a proposal for the NFSv4.2
   protocol to support the posix_fadvise function [2], allowing
   applications to communicate their expected behavior to the server.

   The posix_fadvise operation allows applications to provide hints to
   the storage system regarding its expected access pattern, e.g.,
   sequential or random, and data re-use behavior, e.g., data range will
   be read multiple times and should be cached.  These hints allow the
   file system to understand what optimizations it should implement for
   a specific access to a file.  For example, if a application indicates
   it will never read the data more than once, then the file system can
   avoid polluting the data cache and not cache the data.

   Another instance where applications provide an indication of their
   desired I/O behavior is when an application specifies the use of
   direct I/O.  This can be done in Linux and AIX via the open()
   O_DIRECT parameter and in Solaris via the directio() function.
   Applications specifying the use of direct I/O are telling the file
   system that it must not cache file data.

   While applications can use the posix_fadvise function and direct I/O
   today, with NFS it will only affect behavior on the client.  While
   this can help the NFS client optimize I/O and caching for a file, it
   does not allow the NFS server and its exported file system to do
   likewise.  For example, with direct I/O, while the client no longer
   caches data, the NFS server and its exported file system will
   continue caching data.  By caching data that will not be re-read, the
   server is polluting its cache and possibly causing useful cached data
   to be evicted.




Hildebrand, et al.     Expires January 7, 2012                 [Page 3]

Internet-Draft        Support for posix_fadvise               July 2011


   One option is to modify the existing READ and WRITE operations with
   FADVISE hints.  In the case of READ, optimizations are related to
   prefetching.  In the case of WRITE, FADVISE hints inform the server
   whether it should write through its read cache or whether it should
   use an O_DIRECT-like mechanism in order to do an uncached write.  In
   both cases, we're talking about hints that constitute a client's best
   estimate for how it will be using the data in the future. While that
   estimate may indeed change, it is only useful to the server if it is
   stable for a non-zero period of time, i.e., more than a single READ
   or WRITE operation.

   This document adds a new FADVISE operation to communicate the client
   file access patterns as specified in posix_fadvise to the NFS server.
   The NFS server upon receiving a FADVISE operation MAY choose to
   change how it performs I/O and its caching policies, but is under no
   obligation to do so.

   The XDR description is provided in this document in a way that makes
   it simple for the reader to extract into a ready to compile form.
   The reader can feed this document into the following shell script to
   produce the machine readable XDR description of the metadata layout:

   #!/bin/sh
   grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'

   I.e. if the above script is stored in a file called "extract.sh", and
   this document is in a file called "spec.txt", then the reader can do:

    sh extract.sh < spec.txt > md.x

   The effect of the script is to remove leading white space from each
   line of the specification, plus a sentinel sequence of "///".

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [1].

2. POSIX Requirements

   This proposal is to create a new NFS operation to support the
   posix_fadvise function, defined as follows [2],

   int posix_fadvise(int fd, off_t offset, off_t len, int advice);




Hildebrand, et al.     Expires January 7, 2012                 [Page 4]

Internet-Draft        Support for posix_fadvise               July 2011


   The posix_fadvise() function shall advise the implementation on the
   expected behavior of the application with respect to the data in the
   file associated with the open file descriptor, fd, starting at offset
   and continuing for len bytes. The specified range need not currently
   exist in the file. If len is zero, all data following offset is
   specified. The implementation may use this information to optimize
   handling of the specified data. The posix_fadvise() function shall
   have no effect on the semantics of other operations on the specified
   data, although it may affect the performance of other operations.

   The advice to be applied to the data is specified by the advice
   parameter and may be one of the following values:

   POSIX_FADV_NORMAL - Specifies that the application has no advice to
   give on its behavior with respect to the specified data. It is the
   default characteristic if no advice is given for an open file.

   POSIX_FADV_SEQUENTIAL - Specifies that the application expects to
   access the specified data sequentially from lower offsets to higher
   offsets.

   POSIX_FADV_RANDOM - Specifies that the application expects to access
   the specified data in a random order.

   POSIX_FADV_WILLNEED - Specifies that the application expects to
   access the specified data in the near future.

   POSIX_FADV_DONTNEED - Specifies that the application expects that it
   will not access the specified data in the near future.

   POSIX_FADV_NOREUSE - Specifies that the application expects to access
   the specified data once and then not reuse it thereafter.

   Upon successful completion, posix_fadvise() shall return zero;
   otherwise, an error number shall be returned to indicate the error.

3. Other Requirements

   Many applications do not use or require POSIX semantics.  These
   applications may benefit from additional hints (and points when they
   are set) that are not covered by posix_fadvise.  At this point, these
   hints and requirements are unclear, but may include per-read and per-
   write hints as well as two additional hints:

   Opportunistic Prefetch - This hint indicates that the stateid holder
   expects to access the data soon; prefetch if it can be done at a
   marginal cost.  The use case for this hint is unclear, since if the


Hildebrand, et al.     Expires January 7, 2012                 [Page 5]

Internet-Draft        Support for posix_fadvise               July 2011


   client knows that it will want to read the data soon, then when would
   it not want the server to prefetch the data at any cost?

   Recently Used - The client has recently accessed the byte range in
   its own cache.  This informs the server that the data in the byte
   range remains important to the client.  When the server reaches
   resource exhaustion, knowing which data is more important allows the
   server to make better choices about which data to, for example purge
   from a cache, or move to secondary storage.  It also informs the
   server which delegations are more important, since if delegations are
   working correctly, once delegated to a client, a server might never
   receive another I/O request for the file.

   The use case for this is also unclear, as most clients already cache
   data that they know is important and having this data cached twice
   may be unnecessary. In fact, substantial performance improvements
   have been demonstrated by making caches more exclusive between each
   other [8], not the other way around.  Other work showed that even
   infinite sized secondary caches can be largely ineffective [7], but
   this of course is subject to the workload.

4. Operation TBD: FADVISE - Application access pattern hints to server

   The section introduces a new operation, named FADVISE, which allows
   NFS clients to communicate application file access pattern hints to
   the NFS server.  A new operation is will allow hints to be sent to
   the server when applications use posix_fadvise, direct I/O, or at any
   other point at which the client finds useful.

4.1. ARGUMENTS

         enum fadvise_type {
               FADVISE_NORMAL     = 0,
               FADVISE_SEQUENTIAL = 1,
               FADVISE_RANDOM     = 2,
               FADVISE_WILLNEED   = 3,
               FADVISE_DONTNEED   = 4,
               FADVISE_NOREUSE    = 5,
         };

         struct FADVISE4args {
               /* CURRENT_FH: file */
               stateid4        stateid;
               offset4         offset;
               length4         count;
               bitmap4         hints;
         };


Hildebrand, et al.     Expires January 7, 2012                 [Page 6]

Internet-Draft        Support for posix_fadvise               July 2011


4.2. RESULTS

         struct FADVISE4resok {
               bitmap4              hints_res;
         };

         union FADVISE4res switch (nfsstat4 _status) {
               case NFS4_OK:
                     FADVISE4resok  fadvise_resok4;
          default:
               void;
         };

4.3. DESCRIPTION

   The FADVISE operation sends an I/O access pattern hint to the server
   for the owner of stated for a given byte range specified by offset
   and count.  The byte range need not currently exist in the file, but
   the hint will apply to the byte range when it does exist.  The server
   MAY ignore the advice.

   The following are the possible hints:

   o  FADVISE_NORMAL - Specifies that the application has no advice to
      give on its behavior with respect to the specified data. It is
      the default characteristic if no advice is given for an open
      file.

   o  FADVISE_SEQUENTIAL - Specifies that the application expects to
      access the specified data sequentially from lower offsets to
      higher offsets.

   o  FADVISE_RANDOM - Specifies that the application expects to access
      the specified data in a random order.

   o  FADVISE_WILLNEED - Specifies that the application expects to
      access the specified data in the near future.

   o  FADVISE_DONTNEED - Specifies that the application expects that it
      will not access the specified data in the near future.

   o  FADVISE_NOREUSE - Specifies that the application expects to access
      the specified data once and then not reuse it thereafter.

   The server will return success if the operation is properly formed,
   otherwise the server will return an error.  The server MUST NOT



Hildebrand, et al.     Expires January 7, 2012                 [Page 7]

Internet-Draft        Support for posix_fadvise               July 2011


   return an error if it does not recognize or does not support the
   requested advice.

   The hints_res returned by the server is primarily for debugging
   purposes and the client SHOULD NOT use this information to change or
   modify its file access behavior. This is for several reasons.  First,
   the server is under no obligation to carry out any hints that it
   describes in the hints_res result.  Second, the FADVISE operation is
   a point in time operation, and the server can only respond based upon
   information at this point in time.  As time progresses, the server
   may need to change its handling of a given file due to several
   reasons including, but not limited to, memory pressure, additional
   FADVISE hints sent by other clients, and heuristically detected file
   access patterns.

   The server MAY return different advice than what the client
   requested.  If it does, then this might be due to one of several
   conditions, including, but not limited to another client advising of
   a different I/O access pattern; a different I/O access pattern from
   another client that that the server has heuristically detected; or
   the server is not able to support the requested I/O access pattern,
   perhaps due to a temporary resource limitation.

4.4. IMPLEMENTATION

   The NFS client may choose to issue and FADVISE operation to the
   server in several different instances.  The most obvious is in direct
   response to an applications execution of posix_fadvise.  Another
   useful point would be when an application indicates it is using
   direct I/O.  Direct I/O may be specified at file open, in which case
   a FADVISE may be included in the same compound as the OPEN operation
   with the FADVISE_NOREUSE flag set.  Direct I/O may also be specified
   separately, in which case a FADVISE operation can be sent to the
   server separately.

5. Security Considerations

   None.

6. IANA Considerations

   The fadvise_type should be able to be extended.







Hildebrand, et al.     Expires January 7, 2012                 [Page 8]

Internet-Draft        Support for posix_fadvise               July 2011


7. References

7.1. Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [2]   The IEEE and The Open Group, "IEEE Std 1003.1, 2004 Edition,
         The Open Group Technical Standard Base Specifications, Issue
         6", 2004

7.2. Informative References

   [1]   Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
         C., Eisler, M., and D. Noveck, "Network File System (NFS)
         version 4 Protocol", RFC 3530, April 2003.

   [2]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
         (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January
         2010.

   [3]   Shepler, S., Eisler, M., and D. Noveck, "Network File System
         (NFS) Version 4 Minor Version 1 External Data Representation
         Standard (XDR) Description", RFC 5662, January 2010.

   [4]   Nowicki, B., "NFS: Network File System Protocol specification",
         RFC 1094, March 1989.

   [5]   Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3
         Protocol Specification", RFC 1813, June 1995.

   [6]   S. VanDeBogart, C. Frost, E. Kohler, "Reducing Seek Overhead
         with Application-Directed Prefetching", in Proceedings of
         USENIX Annual Technical Conference, June 2009.

   [7]   D. Muntz, P. Honeyman, "Multi-level Caching in Distributed File
         Systems", in Proceedings of USENIX Annual Technical Conference,
         1992.

   [8]   T.M. Wong, J. Wilkes, "My cache or yours? Making storage more
         exclusive", in Proceedings of the USENIX Annual Technical
         Conference, 2002.

8. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.



Hildebrand, et al.     Expires January 7, 2012                 [Page 9]

Internet-Draft        Support for posix_fadvise               July 2011


Authors' Addresses

   Dean Hildebrand
   IBM Almaden
   650 Harry Rd
   San Jose, CA 95120

   Phone: +1 408-927-2013
   Email: dhildeb@us.ibm.com


   Trond Myklebust
   NetApp
   3215 Bellflower Ct
   Ann Arbor, MI  48103
   USA

   Phone: +1-734-662-6608
   Email: Trond.Myklebust@netapp.com


   Sam Falkner
   Oracle
   500 Eldorado Blvd.
   Broomfield, CO  80021

   Phone: +1 720-279-4303
   Email: sam.falkner@oracle.com





















Hildebrand, et al.     Expires January 7, 2012                [Page 10]