From: Brandon S. Allbery (kf8nh@kf8nh.wariat.org)
Date: 01/18/93


From: kf8nh@kf8nh.wariat.org (Brandon S. Allbery)
Subject: Re: FDFLUSH [and fsync too]
Date: Mon, 18 Jan 1993 18:15:27 EST

sct@dcs.ed.ac.uk (Stephen Tweedie) writes:
> Fine - my point was exactly this, that if you are transferring data
> between different machines then volume labels WILL cost you
> compatibility. If tar disks created on my linux box have volume
> labels then they won't work on the Suns at work unless I have
> additional user-mode programs working for me.

Fine, so you don't use the option to fdformat that puts a volume label on
the floppy. That was my point; if a floppy doesn't have a label, it acts
just like a floppy does now. If it does, it gets used and you get all the
added benefits like automatic multi-volume support. Linux should retain
full support for both, and should probably default to the compatible format.
This never caused me trouble with the 3B1, and I certainly wouldn't want to
design something that made it difficult for me to move stuff between the
Linux box here at home, the SCO UNIX thingie at the client site, and the Sun
IPX at the office.

> Given the ability to manage disk changes through system calls, most of
> what you want can be done at the user level by, for example, piping
> tar into an archive splitter. This sort of solution will only work
> for stream data, however; there is no way for an lseek on a pipe to be
> passed to the program at the other end of the pipe!

Exactly. Although an lseek on a multivolume-support /dev/fd0 could get
interesting:

  NOTICE: device fd(0,0) pid 1045 end of media, insert "BACKUP" volume 7

  NOTICE: device fd(0,0) pid 1045 lseek requires "BACKUP" volume 6, please insert
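
For what it's worth, working out which volume an lseek lands on is plain
arithmetic if every volume holds the same number of bytes. A rough sketch,
assuming 1.44MB floppies (2880 sectors of 512 bytes), no space lost to the
label itself, and volumes numbered from 1. The helper names are made up for
illustration; nothing like this exists in the floppy driver.

```c
#include <sys/types.h>

/* Assumed capacity of one volume: a 1.44MB floppy, 2880 sectors of 512 bytes. */
#define VOLUME_BYTES (2880L * 512L)

/* Hypothetical helper: which volume (numbered from 1) holds byte `offset`
 * of the logical multi-volume stream?  Returns -1 for a negative offset. */
long volume_for_offset(off_t offset)
{
    if (offset < 0)
        return -1;
    return (long)(offset / VOLUME_BYTES) + 1;
}

/* Hypothetical helper: byte position inside that volume. */
long offset_within_volume(off_t offset)
{
    return (long)(offset % VOLUME_BYTES);
}
```

A real labelled format would lose some bytes per disk to the label, so the
divisor would be smaller than the raw capacity used here.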

> In the mean time, is FDFLUSH worth inclusion in the kernel? I find it
> extraordinarily useful for a ten-line patch.

Thanks for the compliment, but I'm not Linus :-) It's up to him.

> Oh yes, and you were also surprised at linux's lack of fsync(). Yup,
> it's true. I have been looking at implementing this for a while now,
> and it would almost certainly have to be done as an extra vfs entry
> point. Given the existing file systems, fsync() could be added fairly

I'm surprised it wasn't designed in from the start. If there is *one*
system call from BSD that us System V types have been yearning for for ages
(and finally have with SVR4), it's fsync(). (The O_SYNC flag to open()
doesn't give enough control.)
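
To make the difference concrete: with O_SYNC every write() waits for the
disk, whereas with fsync() the program picks the points at which it pays
that cost. A minimal sketch, assuming a POSIX-style fsync(); the function
name and the idea of a "batch" of records are invented for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append several records, then force them to disk
 * once.  Returns 0 on success, -1 on any error.  With O_SYNC each of the
 * write() calls below would have blocked on the disk individually. */
int append_batch(const char *path, const char *const *records, int n)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
    }

    /* One synchronisation point for the whole batch. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A database would typically call something like this at a commit point and
leave ordinary writes buffered in between.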

> easily by walking through a file's data blocks looking for dirty
> blocks, but this would be inefficient for large files because you
> would have to load every indirect block (dirty or not) during the
> search for dirty data.

Yuck. fsync() is most in demand by database implementors... I would hate to
have to wait around while the kernel checked for dirty blocks in a 75MB
database volume.

> the kernel. For example, the buffer-head struct could be augmented
> with an inode reference number (the device number is already stored
> there); this would require passing the inode number to getblk()
> whenever new blocks are requested, and would also require a way to
> unmark the blocks on truncate() or unlink().
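
A user-space toy model of the idea in the quoted text, to show why it avoids
the indirect-block walk: each buffer records the inode it belongs to, so an
fsync only scans the cache and never touches the file's block map. The
structure, field, and function names are invented for illustration and are
not the kernel's real buffer head or getblk(); inode number 0 is assumed to
mean "unowned", and "writing" a block just clears its dirty flag.

```c
#include <stddef.h>

#define NR_BUFFERS 8

/* Toy stand-in for a buffer head, tagged with its owner. */
struct toy_buffer {
    int dev;            /* device number */
    unsigned long ino;  /* proposed addition: owning inode, 0 = unowned */
    int dirty;          /* needs writing back */
};

static struct toy_buffer cache[NR_BUFFERS];

/* Toy getblk(): claim a free slot and tag it with the owning inode. */
struct toy_buffer *toy_getblk(int dev, unsigned long ino)
{
    for (size_t i = 0; i < NR_BUFFERS; i++) {
        if (cache[i].ino == 0) {
            cache[i].dev = dev;
            cache[i].ino = ino;
            cache[i].dirty = 0;
            return &cache[i];
        }
    }
    return NULL; /* cache full; a real cache would evict something */
}

/* Toy fsync(): flush only the dirty buffers tagged with (dev, ino).
 * Returns how many buffers were "written". */
int toy_fsync(int dev, unsigned long ino)
{
    int written = 0;
    for (size_t i = 0; i < NR_BUFFERS; i++) {
        if (cache[i].dev == dev && cache[i].ino == ino && cache[i].dirty) {
            cache[i].dirty = 0;
            written++;
        }
    }
    return written;
}

/* Toy truncate path: drop the tags so stale blocks are never written.
 * Returns how many buffers were released. */
int toy_untag(int dev, unsigned long ino)
{
    int dropped = 0;
    for (size_t i = 0; i < NR_BUFFERS; i++) {
        if (cache[i].dev == dev && cache[i].ino == ino) {
            cache[i].ino = 0;
            cache[i].dirty = 0;
            dropped++;
        }
    }
    return dropped;
}
```

The scan here costs time proportional to the size of the cache rather than
the size of the file, which is the point of the tagging.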

But you have to do that anyway, don't you? I would hope the kernel isn't
going to keep blocks invalidated by such calls in the buffer pool and still
write them out to disk.... (unlink() had better not immediately unmark
anything in the buffer cache, though, since it's a Unixism to create a private
temporary file by unlinking it while keeping it open. If a program chooses
to unlink it after doing a few writes but before lseeking back to the start
to reread it for pass 2, it'll be in for a nasty surprise.) If you're
invalidating the blocks in the cache at that point regardless, the inode
pointer can be ignored.
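
The temporary-file idiom in question, as a minimal sketch assuming POSIX
calls (mkstemp() is used for brevity, and the function name is made up). If
unlink() threw away the file's cached blocks, the read in pass 2 would come
back with garbage or nothing.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: write to a private temp file, unlink it while it is
 * still open, then seek back and reread it.  Copies up to outlen bytes
 * into out and returns the number of bytes read back, or -1 on error. */
ssize_t private_tempfile_roundtrip(const char *data, char *out, size_t outlen)
{
    char path[] = "/tmp/scratchXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* The name goes away; the open descriptor keeps the file alive. */
    if (unlink(path) < 0) {
        close(fd);
        return -1;
    }

    /* Pass 2: rewind and reread what was written before the unlink. */
    if (lseek(fd, 0, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, out, outlen);

    close(fd); /* the storage is released here, on last close */
    return n;
}
```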

> By the way, what should the behaviour of fsync() be on character
> devices? I'm tempted just to return ENOSYS, but it might be suitable
> to perform a block until the device's output buffer is empty.

ENOSYS is fine. If you do it for, say, ttys then you will be pushed to do
it for sockets, etc. Is it worthwhile to special-case fsync() for every non-
"ordinary file" it might be used on? Especially if someone adds a new type
later? (For an example: consider how fsync() should act upon, say, a
hypothetical semaphore filesystem object a la Xenix 3.x? Is it even
meaningful? And do we want to have to modify fsync() when adding such
semaphore objects, or other potential enhancements, to the filesystem?)
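
From the caller's side, a plain error return for unsupported objects is easy
to cope with. A sketch, assuming the kernel reports "not supported on this
kind of descriptor" through errno. Which value comes back is an assumption
(ENOSYS as suggested above; some systems use EINVAL or ENOTSUP instead), so
the hypothetical wrapper below accepts several.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: returns 0 if the data was synced, 1 if this kind
 * of descriptor simply does not support fsync(), -1 on a real error. */
int sync_if_supported(int fd)
{
    if (fsync(fd) == 0)
        return 0;
    if (errno == ENOSYS || errno == EINVAL
#ifdef ENOTSUP
        || errno == ENOTSUP
#endif
        )
        return 1;
    return -1; /* e.g. EBADF or EIO: a genuine failure */
}
```

A program that writes to whatever it is handed (file, pipe, tty) can then
treat a return of 1 as "nothing to do" without knowing the object type.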

Not to mention another nasty: a process opens both master and slave ends of
a pty, writes down one end, then fsync()'s it. (Presumably due to a bug of
some kind, like getting confused about which file descriptor is which.) So
now fsync() has to special case ptys and check for potential deadlock! (I
wonder what 4.3BSD does if you try this. :-)

++Brandon