- From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
- Newsgroups: comp.arch,alt.sources
- Subject: Alignment IS important
- copying between [un]aligned source and/or destination addresses
- Message-ID: <PCG.90Sep25171913@odin.cs.aber.ac.uk>
- Date: 25 Sep 90 16:19:13 GMT
- Sender: pcg@aber-cs.UUCP
- Organization: Coleg Prifysgol Cymru
- Lines: 1135
- Nntp-Posting-Host: odin
-
-
- There has been some debate in this newsgroup about the importance of
- aligned memory access. I have finally neatly packaged my own technology
- for doing core-to-core memory copies, aligned and unaligned, and here I
- am posting the technology and some discussion of the results.
-
- This article is posted to comp.arch because it discusses architecture,
- and to alt.sources because it contains generally useful source code.
-
- Usual disclaimer: this work has no relationship whatever to that
- of the University College of Wales; it was performed exclusively
- by me, with the use of my own time, funds, machines, know-how,
- and has not been aided, abetted or supported in any way by the
- University College of Wales. I thank them for providing the
- opportunity to access News and therefore to post this article,
- about which they do not actually know anything.
-
- This article is about a library function, essentially equivalent to
- memcpy(3), that I have called CoreCopy(). It does not handle
- overlapping moves, although it would not be difficult to extend it to do so.
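-
- As a rough illustration of such an extension (this is not part of the
- posted sources; the name overlap_copy is made up here, and the sketch
- assumes a C99 compiler with <stdint.h>), an overlap-tolerant copy only has
- to choose the direction of the copy, in the manner of memmove(3):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the posted sources: copies 'bytes'
   bytes from 'from' to 'to' and tolerates overlapping areas by choosing
   the direction of the copy. Returns the start of the destination. */
char *overlap_copy(char *to, const char *from, size_t bytes)
{
    size_t i;

    assert(bytes == 0 || (to != NULL && from != NULL));

    /* Compare the addresses as integers, since relational comparison of
       pointers into different objects is not defined in C. */
    if ((uintptr_t) to <= (uintptr_t) from
        || (uintptr_t) to >= (uintptr_t) from + bytes) {
        /* Destination starts at or below the source, or the areas do
           not overlap at all: an ascending copy is safe. */
        for (i = 0; i < bytes; i++)
            to[i] = from[i];
    } else {
        /* Destination starts inside the source area: copy descending so
           that no source byte is overwritten before it has been read. */
        for (i = bytes; i-- > 0; )
            to[i] = from[i];
    }
    return to;
}
```

- Only the byte loops are shown; a tuned version would apply the same
- direction test and then copy clusters in the chosen direction.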
-
- It is very portable, but also highly tuned and parameterized on the machine
- characteristics. I use a set of my own (longish) headers for the parametric
- information; they have been summarized here in the file "CoreHdr.h". I hope
- that the parameters are self-explanatory. The way the parametrization is
- used in the CoreCopy source is, I hope, quite clear, even though virtually
- all of the source is preprocessor source, which I have tried to make as
- readable as possible. This is one obvious case where very careful hand
- optimization and parametrization pay off, as core-to-core copy bandwidth
- is often crucial, e.g. to the overall efficiency of the UNIX kernel.
-
- Here is a list of files contained in the attached shar and their
- contents:
-
-   Core.h          The user interface of CoreCopy()
-   CoreCopy.c      The source for CoreCopy()
-   CoreHdr.h       Environment parameters
-   CoreSun3.h      Tuning parameters for Sun 3 machines
-   CoreSv386.h     Tuning parameters for SysV/386 machines
-   CoreTest.c      A program to "benchmark" CoreCopy()
-   CoreRun.sh      A shell script to run CoreTest
-   CoreSv386.pr    Results of running CoreRun.sh under SysV/386
-   CoreSun3.pr     Results of running CoreRun.sh on a Sun 3
-
- I will provide here some comments on the "benchmark" results:
-
- The benchmarks involve three cases, each copying a total of 16MB in chunks
- of 8, 32, 128, 512 and 2048 bytes; each copy is done first with both source
- and destination aligned on a "double" boundary, then with both misaligned
- by 1 byte, then with destination aligned and source misaligned by 3 bytes,
- and then the reverse.
-
- User time in seconds.centiseconds is reported, as returned by the OS.
-
- The first case does not really copy anything; it is run just to give an idea
- of the function calling overhead, which dominates when copying small chunks.
- The second is copying using the system provided memcpy(3) function; the
- third case is running CoreCopy() itself. (You can, if you want, run
- additional cases, just for comparison, as they do not involve CoreCopy()
- itself; look at the "CoreTest.c" file.)
-
- Some parts of the "benchmark" may not be perfectly portable (one example is
- that I ensure that the source and destination buffers are aligned by putting
- a definition for a 'double' before their definition), but they should be
- portable to nearly every common architecture I can imagine.
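-
- A somewhat more portable way of getting the same effect, shown here only as
- a sketch (the names aligned_buf, src_buf, dst_buf and misaligned are made up
- for this illustration and do not appear in CoreTest.c), is to put the
- buffer in a union with a 'double', so that alignment no longer depends on
- how the compiler lays out adjacent variables:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_BYTES 4096              /* assumed size, like B in CoreTest.c */
#define SLOP      sizeof (double)   /* room for the misalignment offset */

/* A union is aligned at least as strictly as its strictest member, so
   'bytes' begins on a boundary suitable for a double. */
union aligned_buf {
    double force_alignment;
    char   bytes[BUF_BYTES + SLOP];
};

union aligned_buf src_buf, dst_buf;

/* Returns a pointer 'offset' bytes past the aligned start of the buffer;
   offsets 0, 1 and 3 give the alignment cases used in the benchmark. */
char *misaligned(union aligned_buf *buf, size_t offset)
{
    assert(offset < SLOP);
    return buf->bytes + offset;
}
```

- With this, the "t% 3 f% 0" case is a copy from misaligned(&src_buf, 0) to
- misaligned(&dst_buf, 3).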
-
- Environment of benchmarks:
-
- Sv386 is an i386DX 20MHz with (write-thru) 64KB cache, running System
- V/386 with the Register C Compiler. As a very rough measure of power,
- it does a bit more than 6000 2.x dhrystones.
-
- Sun3 is a Sun 3/280, 68020 25MHz with (write-thru?) cache, running
- SunOS 4.0.3 with the PCC-descended compiler. This also does a bit
- more than 6000 2.x dhrystones.
-
- Here is a subset of the results; on the left is the Sun3, on the right is
- the Sv386. I have chosen the block sizes 512, because it is large enough
- that procedure call overhead is not significant, and 32, because it is small
- enough that the overhead starts to matter.
-
-  .---------------------- Size of block copied in bytes
-  |
-  |     .---------------- Destination address modulus 4
-  |     |
-  |     |     .---------- Source address modulus 4
-  |     |     |
-  |     |     |       .-- Time in seconds.centiseconds to copy 16MB
-  |     |     |       |
-  |     |     |       |
-  V     V     V       V
-
-      Sun3 memcpy(3)               Sv386 memcpy(3)
-
- 512B  t% 0  f% 0    2.25u     512B  t% 0  f% 0    1.55u
- 512B  t% 1  f% 1    3.02u     512B  t% 1  f% 1    4.21u
- 512B  t% 0  f% 3    8.38u     512B  t% 0  f% 3    3.32u
- 512B  t% 3  f% 0    8.44u     512B  t% 3  f% 0    2.43u
-  32B  t% 0  f% 0    7.02u      32B  t% 0  f% 0    4.71u
-  32B  t% 1  f% 1    8.02u      32B  t% 1  f% 1    7.39u
-  32B  t% 0  f% 3   12.20u      32B  t% 0  f% 3    6.44u
-  32B  t% 3  f% 0   12.01u      32B  t% 3  f% 0    5.64u
-
-      Sun3 CoreCopy()              Sv386 CoreCopy()
-
- 512B  t% 0  f% 0    2.49u     512B  t% 0  f% 0    1.68u
- 512B  t% 1  f% 1    3.11u     512B  t% 1  f% 1    1.77u
- 512B  t% 0  f% 3    4.09u     512B  t% 0  f% 3    2.65u
- 512B  t% 3  f% 0    3.23u     512B  t% 3  f% 0    2.57u
-  32B  t% 0  f% 0    6.10u      32B  t% 0  f% 0    6.48u
-  32B  t% 1  f% 1    6.46u      32B  t% 1  f% 1    9.21u
-  32B  t% 0  f% 3    6.09u      32B  t% 0  f% 3    8.28u
-  32B  t% 3  f% 0    6.29u      32B  t% 3  f% 0    7.45u
-
-
- The results are often surprising, and must be analyzed with some detailed
- knowledge of the logic used by CoreCopy(), memcpy(3), and the performance
- profiles of the compiler and CPU architecture and implementation
- involved (please also refer to the full set of results in the shar
- archive below).
-
- In general CoreCopy() is as fast as, or just a little bit slower than, the
- built-in memcpy(3) function for aligned copies; it is usually much faster
- for unaligned copies. This holds true down to fairly small chunk sizes; for
- very small chunk sizes the higher overheads of CoreCopy() become more
- important.
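-
- To put numbers on this, the ratios can be worked out from the 512-byte rows
- of the tables above. The following is only a small arithmetic aid (the
- struct and function names are made up here); the figures themselves are
- the ones reported above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One line of the results: user seconds needed to copy 16MB. */
struct result {
    const char *alignment;      /* the "t% N f% N" label from the table */
    double memcpy_secs;
    double corecopy_secs;
};

/* The 512-byte rows, transcribed from the tables above. */
const struct result sun3_512[4] = {
    { "t% 0 f% 0", 2.25, 2.49 },
    { "t% 1 f% 1", 3.02, 3.11 },
    { "t% 0 f% 3", 8.38, 4.09 },
    { "t% 3 f% 0", 8.44, 3.23 },
};
const struct result sv386_512[4] = {
    { "t% 0 f% 0", 1.55, 1.68 },
    { "t% 1 f% 1", 4.21, 1.77 },
    { "t% 0 f% 3", 3.32, 2.65 },
    { "t% 3 f% 0", 2.43, 2.57 },
};

/* A ratio above 1.0 means CoreCopy() was faster than memcpy(3). */
double speedup(const struct result *r)
{
    assert(r->corecopy_secs > 0.0);
    return r->memcpy_secs / r->corecopy_secs;
}

/* Prints the ratio and the CoreCopy() bandwidth (16MB / seconds) per row. */
void print_speedups(const char *machine, const struct result *rows, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        printf("%-6s %s  %.2fx  %.1f MB/s\n", machine, rows[i].alignment,
               speedup(&rows[i]), 16.0 / rows[i].corecopy_secs);
}
```

- For example, the Sun3 "t% 0 f% 3" row gives 8.38/4.09, a bit over 2x in
- favour of CoreCopy(), while the aligned Sv386 row gives 1.55/1.68, i.e.
- CoreCopy() is about 8% slower there.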
-
- I have not included any statistics on this, but aligning the destination
- instead of the source does indeed provide a significant performance
- benefit. Another interesting note is that (4-way) loop unrolling does not
- buy much on the machines I have used; tight loops are probably just as
- good in this case, because of pipelining or something else. It does help
- to unroll the code that copies the misaligned head and tail of the core
- area, because on most machines the head and tail are only 1, 2 or 3 bytes
- long, so a 4-way unrolled loop never has to repeat.
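-
- The overall shape (copy an odd head until the destination is aligned, copy
- whole clusters, copy an odd tail) can be sketched as follows. This is
- loosely modelled on the structure of CoreCopy.c but is not the posted code:
- the names odd_copy, aligned_dest_copy and CLUSTER_BYTES are made up, a
- 4-byte cluster is assumed, and words are moved with fixed-size memcpy()
- calls so that a misaligned source is legal in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_BYTES 4u    /* assumed cluster size: one 32-bit word */

/* Copies n < CLUSTER_BYTES bytes with the loop fully unrolled into a
   switch, and advances both pointers past the copied bytes. */
static void odd_copy(unsigned char **to, const unsigned char **from, size_t n)
{
    assert(n < CLUSTER_BYTES);
    switch (n) {
    case 3: *(*to)++ = *(*from)++;  /* fall through */
    case 2: *(*to)++ = *(*from)++;  /* fall through */
    case 1: *(*to)++ = *(*from)++;  /* fall through */
    case 0: break;
    }
}

/* Copies 'bytes' bytes, aligning the destination (not the source) before
   moving whole clusters. Returns the start of the destination. */
unsigned char *aligned_dest_copy(unsigned char *to, const unsigned char *from,
                                 size_t bytes)
{
    unsigned char *start = to;
    size_t head = (CLUSTER_BYTES - (uintptr_t) to % CLUSTER_BYTES)
                  % CLUSTER_BYTES;
    size_t clusters;

    if (head > bytes)
        head = bytes;
    odd_copy(&to, &from, head);             /* head: 0 to 3 bytes */
    bytes -= head;

    for (clusters = bytes / CLUSTER_BYTES; clusters > 0; clusters--) {
        uint32_t word;
        memcpy(&word, from, sizeof word);   /* source may be misaligned */
        memcpy(to, &word, sizeof word);     /* destination is aligned here */
        to += CLUSTER_BYTES;
        from += CLUSTER_BYTES;
    }

    odd_copy(&to, &from, bytes % CLUSTER_BYTES);    /* tail: 0 to 3 bytes */
    return start;
}
```

- Since head and tail are at most 3 bytes each, the switch handles them
- without any loop test at all.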
-
- Substituting CoreCopy() for memcpy(3) on each of the tested machines would
- probably provide overall benefits, because CoreCopy() is only a little
- worse than memcpy(3) on aligned copies, but usually dramatically better on
- unaligned ones. In particular, if you use it in the "insdel.c" module of
- GNU Emacs, for which a suitable patch will be posted, you may experience
- huge speedups; currently "insdel.c" uses a C coded char-by-char loop to
- shift the buffer. Even just using the system provided bcopy(3) or
- memcpy(3) will help a lot.
-
- Inline assembler code is essential to performance on the 386, but not on
- the 68020; this is probably because the code to do a string copy on a 386
- looks fairly large. Using the built-in string copy instructions provides
- a 3x speedup, probably mostly because of the saving on instruction word
- fetches and the limited pipelining of the 386. It would be interesting to
- see how the 486 compares. It is interesting to note that memcpy(3) on the
- 386 also uses the string copy instructions; however, it ignores alignment,
- and copies word by word until there is less than a wordful of bytes left,
- and then byte by byte.
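-
- That word-then-byte strategy can be modelled as below. This is a sketch
- only: it assumes a compiler that accepts GCC-style extended asm (GCC or
- Clang), which is not the Register C Compiler used for the measurements, and
- the name string_copy is made up. The operand constraints pin the values to
- the registers that the string instructions use implicitly, instead of
- relying on where the compiler happened to put them.

```c
#include <assert.h>
#include <stddef.h>

/* Whole 4-byte words first (rep movsl), then the remaining 0 to 3 bytes
   (rep movsb), with no attempt to align either pointer. Returns the
   start of the destination. */
void *string_copy(void *to, const void *from, size_t bytes)
{
    void *start = to;
    size_t words = bytes / 4;
    size_t tail = bytes % 4;

    assert(bytes == 0 || (to != NULL && from != NULL));

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+D", "+S" and "+c" select the destination index, source index and
       count registers; the direction flag is clear on function entry
       under the usual calling conventions, so the copy ascends. */
    __asm__ volatile ("rep movsl"
                      : "+D" (to), "+S" (from), "+c" (words)
                      :
                      : "memory");
    __asm__ volatile ("rep movsb"
                      : "+D" (to), "+S" (from), "+c" (tail)
                      :
                      : "memory");
#else
    {
        /* Portable fallback so that the sketch still builds elsewhere. */
        unsigned char *t = to;
        const unsigned char *f = from;

        for (bytes = words * 4 + tail; bytes > 0; bytes--)
            *t++ = *f++;
    }
#endif
    return start;
}
```

- Aligning the destination first, as CoreCopy() does, would only add a short
- byte copy in front of the word copy.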
-
- I think that some inline machine language would also be vital for machines
- like the MIPS or SPARC, which trap to support unaligned accesses in
- general. I did actually run some cases on a MIPS, but the unaligned cases
- are simply too slow because of trapping (in the aligned ones CoreCopy() is
- as quick as the assembler coded memcpy(3)).
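-
- One way of avoiding misaligned word accesses without writing assembler,
- offered here only as a sketch and not measured on any of the machines
- above, is to load each source word through a fixed-size memcpy() and store
- it to an aligned destination. The names load_word_any and
- copy_words_to_aligned are made up for this illustration, and what code the
- compiler generates for the load is entirely up to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads a 32-bit word from an address of any alignment. In C terms this
   is a 4-byte object copy, not a word access, so it never asks for a
   misaligned word load; the compiler chooses the instructions. */
uint32_t load_word_any(const unsigned char *from)
{
    uint32_t word;

    memcpy(&word, from, sizeof word);
    return word;
}

/* Copies 'words' 32-bit words from a source of any alignment to an
   aligned destination, so that every store is a full aligned word. */
void copy_words_to_aligned(uint32_t *to, const unsigned char *from,
                           size_t words)
{
    assert((uintptr_t) to % sizeof *to == 0);
    while (words-- > 0) {
        *to++ = load_word_any(from);
        from += sizeof *to;
    }
}
```

- The head and tail bytes around the aligned words would still be copied
- byte by byte, as in CoreCopy().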
-
- You are welcome to provide machine dependent headers for other architectures
- and compilers, and to experiment with the various parameters, thresholds,
- etc. that you will find in the source. I would be interested in knowing the
- times for the VAX-11/780, on which the first incarnation of this function
- was developed (as soon as I had read how the CPU-SBI-Memory interface worked
- for byte stores :->).
-
- The source and the full result files are in the following shar archive.
-
- ---------------------------cut here---------------------------------------
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then unpack
- # it by saving it into a file and typing "sh file". To overwrite existing
- # files, type "sh file -c". You can also feed this as standard input via
- # unshar, or by typing "sh <file", e.g.. If this archive is complete, you
- # will see the following message at the end:
- # "End of shell archive."
- # Contents: Core.h CoreCopy.c CoreHdr.h CoreRun.sh CoreSun3.h
- # CoreSun3.pr CoreSv386.h CoreSv386.pr CoreTest.c
- # Wrapped by pcg@thor on Tue Sep 25 16:32:40 1990
- PATH=/bin:/usr/bin:/usr/ucb ; export PATH
- echo '
- Copyright 1982,1990 Piercarlo Grandi. All rights reserved.
-
- This shar archive contains free software; you can redistribute
- it and/or modify it under the terms of the GNU General Public
- License as published by the Free Software Foundation; either
- version 1, or (at your option) any later version.
-
- This shar archive is distributed in the hope that it will be
- useful, but WITHOUT ANY WARRANTY; without even the implied
- warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
- PURPOSE. See the GNU General Public License for more details.
-
- You may have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- '
- sleep 4
- if test -f 'Core.h' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'Core.h'\"
- else
- echo shar: Extracting \"'Core.h'\" \(487 characters\)
- sed "s/^X//" >'Core.h' <<'END_OF_FILE'
- X#ifndef Core_H
- X#define Core_H
- X#if __STDC__
- X# pragma once
- X#endif
- X
- X#if 0
- X#ifndef Extend_H
- X# include "Extend.h"
- X#endif
- X#endif
- X
- X/*
- X This is a set of library routines to allocate virtual memory and
- X manipulate it. It strives to be reliable and consistent,
- X efficient and portable. Unfortunately this latter quality is
- X more difficult to obtain than the others for such a low level
- X library.
- X*/
- X
- extern pointer CoreCopy of((pointer,pointer,addressy));
- X
- X#endif /* Core_H */
- END_OF_FILE
- if test 487 -ne `wc -c <'Core.h'`; then
- echo shar: \"'Core.h'\" unpacked with wrong size!
- fi
- # end of 'Core.h'
- fi
- if test -f 'CoreCopy.c' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreCopy.c'\"
- else
- echo shar: Extracting \"'CoreCopy.c'\" \(8519 characters\)
- sed "s/^X//" >'CoreCopy.c' <<'END_OF_FILE'
- X#if 1
- X# include "CoreHdr.h"
- X#else
- X#ifndef Extend_h
- X# include "Extend.h"
- X#endif
- X
- X#include <import>
- X#ifndef Here_h
- X# include "Here.h"
- X#endif
- X#ifndef With_h
- X# include "With.h"
- X#endif
- X#ifndef Type_h
- X# include "Type.h"
- X#endif
- X#ifndef Convert_h
- X# include "Convert.h"
- X#endif
- X#ifndef Bits_h
- X# include "Bits.h"
- X#endif
- X#ifndef Assert_h
- X# include "Assert.h"
- X#endif
- X
- X#include <export>
- X#endif /* 1 */
- X
- X#ifndef Core_h
- X# include "Core.h"
- X#endif
- X
- X#if ((CcFEATURE & CcKR78) != CcKR78)
- X# include "ERROR: language supported too old"
- X#endif
- X
- X/*
- X This function is handed pointers to two memory areas, and copies as many
- X units as it is told from the second to the first. The two areas are
- X expected to begin at any byte boundary, and the size is given in bytes
- X too.
- X
- X If the memory subsystem of the machine handles more efficiently
- X naturally aligned requests in clusters (multiples of a unit), we try to
- X take advantage of that. Since we cannot take advantage of moving clusters
- X for both source and destination, we optimize the writing of clusters, of
- X course...
- X*/
- X
- X/*
- X We need three copying operations. Only the first is always needed,
- X the remaining two are needed only if copying by clusters pays.
- X
- X BYTECOPY(to,from,bytes) copies bytes by byte, the number of bytes is
- X guaranteed to be >= 0.
- X
- X ODDCOPY(to,from,bytes) also copies byte by byte, but the number of bytes
- X is guaranteed to be >= 0 && < ClusterBYTES.
- X
- X CLUSTERCOPY(to,from,clusters) copies cluster by cluster, and the number
- X of clusters is guaranteed to be > 0.
- X
- X For all these macros, the value of bytes is not touched, but to and from
- X are updated to point to the end of the copied area.
- X*/
- X
- X#if (CpuIS == CpuIAPX && CpuMODEL == 0x0386)
- X# include "CoreSv386.h"
- X#endif /* CpuIAPX && 0x0386 */
- X
- X#if (CpuIS == CpuMC68000 && CpuMODEL == 0x0020)
- X# include "CoreSun3.h"
- X#endif /* CpuMC68000 && 0x0020 */
- X
- X#if (CpuIS == CpuMIPS /* && CpuMODEL == 0x3000 */)
- X# include "CoreMips.h"
- X#endif /* CpuMIPS */
- X
- X#ifndef ClusterBITS
- X
- X# ifdef CoreFASTALIGN
- X# define ClusterBITS (CoreFASTALIGN*CpuUNIT)
- X# else
- X# if ((CoreFEATURE & (CoreDCACHE|CoreWRITETHRU)) == (CoreDCACHE))
- X# define ClusterBITS (CoreCACHELINE*CpuUNIT)
- X# else
- X# ifdef CoreCORELINE
- X# define ClusterBITS (CoreCORELINE*CpuUNIT)
- X# else
- X# ifdef CoreINTERLEAVE
- X# define ClusterBITS (CoreINTERLEAVE*CpuUNIT)
- X# else
- X# define ClusterBITS CpuUNIT
- X# endif
- X# endif
- X# endif
- X# endif
- X
- X# if (ClusterBITS >= LongBITS && (ClusterBITS % LongBITS) == 0)
- X# undef ClusterBITS
- X# define ClusterBITS LongBITS
- X# endif
- X
- X# if ((ClusterBITS % ByteBITS) == 0)
- X# define ClusterBYTES (ClusterBITS/ByteBITS)
- X# else
- X# include "ERROR: Cluster size is not an even # of bytes"
- X# endif
- X
- X#endif /* ndef ClusterBITS */
- X
- X#if (ClusterBYTES > 1)
- X
- X# ifndef ClusterLNBYTES
- X# if (ClusterBYTES == 2)
- X# define ClusterLNBYTES 1
- X# endif
- X# if (ClusterBYTES == 4)
- X# define ClusterLNBYTES 2
- X# endif
- X# if (ClusterBYTES == 8)
- X# define ClusterLNBYTES 3
- X# endif
- X# endif
- X
- X# ifndef ClusterBEST
- X# if (CoreFEATURE & CoreWRITETHRU)
- X# define ClusterBEST (ClusterBYTES*4)
- X# else
- X# define ClusterBEST (ClusterBYTES*8)
- X# endif
- X# endif
- X
- X# ifndef ClusterALIGNTO
- X# define ClusterALIGNTO 1
- X# endif
- X
- X# ifndef ClusterDOALIGN
- X# define ClusterDOALIGN (ClusterBEST*4)
- X# endif
- X
- X# ifndef ClusterTYPE
- X# if (ClusterBITS == ShortBITS && !defined ClusterTYPE)
- X# define ClusterTYPE short
- X# endif
- X# if (ClusterBITS == IntBITS && !defined ClusterTYPE)
- X# define ClusterTYPE int
- X# endif
- X# if (ClusterBITS == LongBITS && !defined ClusterTYPE)
- X# define ClusterTYPE long
- X# endif
- X# if (!defined ClusterTYPE)
- X# include "ERROR: cannot define a sensible ClusterTYPE"
- X# endif
- X# endif
- X
- X# if (!defined ClusterREM && defined ClusterLNBYTES)
- X# define ClusterREM(n) ((n) & (ClusterBYTES-1))
- X# define ClusterDIV(n) ((n) >> ClusterLNBYTES)
- X# else
- X# define ClusterREM(n) ((n) % ClusterBYTES)
- X# define ClusterDIV(n) ((n) / ClusterBYTES)
- X# endif
- X
- X# if (!defined Core4CLUSTERCOPY \
- X && (CodeREGISTERS >= 6 || CodePREGISTERS >= 5))
- X# define Core4CLUSTERCOPY(to,from,clusters) \
- X begindef \
- X fast ClusterTYPE *CoreTo = (ClusterTYPE *) (to); \
- X fast ClusterTYPE *CoreFrom = (ClusterTYPE *) (from); \
- X fast addressy CoreClusters = (clusters); \
- X while (CoreClusters) switch (CoreClusters) \
- X { \
- X default: *CoreTo++ = *CoreFrom++; --CoreClusters; \
- X case 3: *CoreTo++ = *CoreFrom++; --CoreClusters; \
- X case 2: *CoreTo++ = *CoreFrom++; --CoreClusters; \
- X case 1: *CoreTo++ = *CoreFrom++; --CoreClusters; \
- X case 0: break; /* keep this "useless" break in ... */ \
- X } \
- X /* do *CoreTo++ = *CoreFrom++; while (--CoreClusters); */ \
- X (to) = (pointer) CoreTo, (from) = (pointer) CoreFrom; \
- X enddef
- X# endif
- X
- X# ifndef Core4CLUSTERCOPY
- X# define Core4CLUSTERCOPY(to,from,clusters) \
- X begindef \
- X fast addressy CoreClusters = (clusters); \
- X while (CoreClusters) switch (CoreClusters) \
- X { \
- X default: \
- X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
- X (to) += ClusterBYTES, (from) += ClusterBYTES; \
- X --CoreClusters; \
- X case 3: \
- X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
- X (to) += ClusterBYTES, (from) += ClusterBYTES; \
- X --CoreClusters; \
- X case 2: \
- X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
- X (to) += ClusterBYTES, (from) += ClusterBYTES; \
- X --CoreClusters; \
- X case 1: \
- X *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
- X (to) += ClusterBYTES, (from) += ClusterBYTES; \
- X --CoreClusters; \
- X case 0: break; /* keep this "useless" break in ... */ \
- X } \
- X enddef
- X# endif
- X
- X# ifndef CoreCLUSTERCOPY
- X# define CoreCLUSTERCOPY(to,from,clusters) \
- X begindef \
- X fast addressy CoreClusters = (clusters); \
- X do { *(ClusterTYPE *) (to) = *(ClusterTYPE *) (from); \
- X (to) += ClusterBYTES, (from) += ClusterBYTES; \
- X } while (--CoreClusters); \
- X enddef
- X# endif
- X
- X#endif /* ClusterBYTES > 1 */
- X
- X#ifndef Core4BYTECOPY
- X# define Core4BYTECOPY(to,from,bytes) \
- X begindef \
- X fast addressy CoreBytes = (bytes); \
- X while (CoreBytes) switch (CoreBytes) \
- X { \
- X default: *(to)++ = *(from)++; --CoreBytes; \
- X case 3: *(to)++ = *(from)++; --CoreBytes; \
- X case 2: *(to)++ = *(from)++; --CoreBytes; \
- X case 1: *(to)++ = *(from)++; --CoreBytes; \
- X case 0: break; /* keep this "useless" break in ... */ \
- X } \
- X enddef
- X#endif /* ndef Core4BYTECOPY */
- X
- X/*
- X You may want to define this, but at least on my machine (iAPX 386)
- X unrolling loops does not pay.
- X*/
- X
- X#ifndef CoreBYTECOPY
- X# if ((CpuFEATURE&CpuPIPELINE) && !(CpuIS == CpuIAPX && CpuMODEL == 0x0386))
- X# define CoreBYTECOPY Core4BYTECOPY
- X# endif
- X#endif
- X
- X#ifndef CoreBYTECOPY
- X# define CoreBYTECOPY(to,from,bytes) \
- X begindef \
- X fast addressy CoreBytes = (bytes); \
- X while (CoreBytes) *(to)++ = *(from)++, --CoreBytes; \
- X enddef
- X#endif /* ndef CoreBYTECOPY */
- X
- X#ifndef CoreODDCOPY
- X# if (ClusterBYTES <= 4)
- X# define CoreODDCOPY Core4BYTECOPY
- X# else
- X# define CoreODDCOPY CoreBYTECOPY
- X# endif
- X#endif /* ndef CoreODDCOPY */
- X
- global pointer CoreCopy(to,from,bytes)
- X fast pointer to;
- X fast pointer from;
- X addressy bytes;
- X{
- X# ifndef CoreCLUSTERCOPY
- X CoreBYTECOPY(to,from,bytes);
- X# else
- X {
- X copySmallBlock:
- X
- X if (bytes < ClusterBEST)
- X {
- X CoreBYTECOPY(to,from,bytes);
- X return to;
- X }
- X
- X# if (ClusterDOALIGN != 0)
- X {
- X /*
- X Note that here we usually want align cluster transfers
- X on 'to', as we care more about aligning writes than
- X reads, that are often easier to pipeline.
- X */
- X
- X copyHead:
- X
- X if (bytes >= ClusterDOALIGN)
- X {
- X addressy odd;
- X
- X# if (ClusterALIGNTO)
- X# define ClusterALIGN to
- X# else
- X# define ClusterALIGN from
- X# endif
- X
- X if ((odd = ClusterREM((addressy) ClusterALIGN)) != 0)
- X {
- X CoreODDCOPY(to,from,odd = ClusterBYTES - odd);
- X bytes -= odd;
- X }
- X
- X# undef ClusterALIGN
- X }
- X }
- X# endif /* ClusterDOALIGN != 0 */
- X
- X copyClusters:
- X
- X assert (ClusterREM((addressy) to) == 0,"CoreCopy");
- X CoreCLUSTERCOPY(to,from,ClusterDIV(bytes));
- X assert (ClusterREM((addressy) to) == 0,"CoreCopy");
- X
- X copyTail:
- X
- X CoreODDCOPY(to,from,ClusterREM(bytes));
- X }
- X#endif /* ndef CoreCLUSTERCOPY */
- X
- X return to;
- X}
- END_OF_FILE
- if test 8519 -ne `wc -c <'CoreCopy.c'`; then
- echo shar: \"'CoreCopy.c'\" unpacked with wrong size!
- fi
- # end of 'CoreCopy.c'
- fi
- if test -f 'CoreHdr.h' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreHdr.h'\"
- else
- echo shar: Extracting \"'CoreHdr.h'\" \(2367 characters\)
- sed "s/^X//" >'CoreHdr.h' <<'END_OF_FILE'
- X#define CpuIAPX 0x0005
- X#define CpuMC68000 0x0006
- X#define CpuMIPS 0x0007
- X
- X#ifdef i386
- X#define CpuIS CpuIAPX /* Type of instruction set */
- X#define CpuMODEL 0x0386 /* In HEX ! */
- X#endif
- X#ifdef sun3
- X#define CpuIS CpuMC68000 /* Type of instruction set */
- X#define CpuMODEL 0x0020 /* In HEX ! */
- X#endif
- X#ifdef mips
- X#define CpuIS CpuMIPS /* Type of instruction set */
- X#define CpuMODEL 0x3000 /* In HEX ! */
- X#endif
- X
- X#define CpuUNIT 8 /* Bits in addressable unit */
- X
- X#define CpuFEATURE 0x0008 /* Peculiarities of CPU */
- X#define CpuPIPELINE 0x0002 /* Multi stage command obey */
- X#define CpuDALIGN 0x000a /* Must align data */
- X
- X#define CoreFEATURE 0x0006 /* Peculiarities of memory sys */
- X#define CoreDCACHE 0x0002 /* Has a DATA cache */
- X#define CoreWRITETHRU 0x0004 /* Updates directly to memory */
- X
- X#define CoreCACHELINE 16 /* D cache line size in units */
- X#define CoreCORELINE 4 /* Units to/from mem at a time */
- X#define CoreINTERLEAVE 1 /* Interleaving in units */
- X#define CoreFASTALIGN 4 /* Align at this for fast move */
- X
- X#define CcPORTABLE 0x0005 /* Johnson's classic */
- X#define CcREGISTER 0x0006 /* Successor to PORTABLE */
- X
- X#ifdef i386
- X#define CcIS CcREGISTER /* Type (author) of compiler */
- X#define CodeREGISTERS 5 /* Spare universal registers */
- X#define CodeDREGISTERS 0 /* Spare data only registers */
- X#define CodePREGISTERS 0 /* Spare pointer only registers */
- X#endif
- X#ifdef sun3
- X#define CcIS CcPORTABLE /* Type (author) of compiler */
- X#define CodeREGISTERS 0 /* Spare universal registers */
- X#define CodeDREGISTERS 3 /* Spare data only registers */
- X#define CodePREGISTERS 3 /* Spare pointer only registers */
- X#endif
- X#ifdef mips
- X#define CcIS CcPORTABLE /* Type (author) of compiler */
- X#define CodeREGISTERS 8 /* Spare universal registers */
- X#define CodeDREGISTERS 0 /* Spare data only registers */
- X#define CodePREGISTERS 0 /* Spare pointer only registers */
- X#endif
- X
- X#define CcFEATURE 0x027f /* Compiler dependent C */
- X#define CcKR78 0x007f /*=All that is in K&R 1st ed. */
- X#define CcASM 0x0200 /*!asm(" ... "); */
- X
- X#define ByteBITS 8
- X#define ShortBITS 16
- X#define IntBITS 32
- X#define LongBITS 32
- X
- X
- X#define of(ARGS) (/* ARGS */)
- X#define begindef do {
- X#define enddef } while (0)
- X
- X#define global /* extern */
- X#define fast register
- X#define assert(c,m) /* no op */
- X
- typedef unsigned addressy;
- typedef char *pointer;
- END_OF_FILE
- if test 2367 -ne `wc -c <'CoreHdr.h'`; then
- echo shar: \"'CoreHdr.h'\" unpacked with wrong size!
- fi
- # end of 'CoreHdr.h'
- fi
- if test -f 'CoreRun.sh' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreRun.sh'\"
- else
- echo shar: Extracting \"'CoreRun.sh'\" \(126 characters\)
- sed "s/^X//" >'CoreRun.sh' <<'END_OF_FILE'
- for C in 0 1 2
- do
- X for B in 2048 512 128 32 8
- X do
- X $1 $C $B 0 0
- X $1 $C $B 1 1
- X $1 $C $B 0 3
- X $1 $C $B 3 0
- X done
- done
- END_OF_FILE
- if test 126 -ne `wc -c <'CoreRun.sh'`; then
- echo shar: \"'CoreRun.sh'\" unpacked with wrong size!
- fi
- chmod +x 'CoreRun.sh'
- # end of 'CoreRun.sh'
- fi
- if test -f 'CoreSun3.h' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreSun3.h'\"
- else
- echo shar: Extracting \"'CoreSun3.h'\" \(1135 characters\)
- sed "s/^X//" >'CoreSun3.h' <<'END_OF_FILE'
- X#define ClusterBITS 32 /* Bits in a cluster */
- X#define ClusterBYTES 4 /* Bytes in a cluster */
- X#define ClusterLNBYTES 2 /* Log2 of ClusterBYTES */
- X#define ClusterTYPE int /* The type of a cluster */
- X
- X#define ClusterALIGNTO 1 /* Align destination */
- X
- X#if (CcIS == CcPORTABLE)
- X
- X# define ClusterBEST 16 /* Copy clusters when longer */
- X# define ClusterDOALIGN 64 /* Align clusters when longer */
- X# define CoreODDCOPY CoreBYTECOPY
- X
- X /* Asm inlines do not improve speed */
- X# if (0 && (CcFEATURE&CcASM))
- X /*
- X Having had a look at the generated code, we know that to is
- X a5, from is a4, and the count is "always" ready in d0.
- X */
- X
- X# define CoreBYTECOPY(to,from,bytes) \
- X begindef \
- X fast unsigned CoreBytes; \
- X if (CoreBytes = (bytes)) { \
- X asm ("1: movb a4@+,a5@+"); \
- X asm (" dbra d0,1b"); } \
- X enddef
- X
- X# define CoreCLUSTERCOPY(to,from,clusters) \
- X begindef \
- X fast unsigned CoreClusters; \
- X if (CoreClusters = (clusters)) { \
- X asm ("1: movl a4@+,a5@+"); \
- X asm (" dbra d0,1b"); } \
- X enddef
- X
- X# endif /* 0 */
- X
- X#endif /* CcIS == CcPORTABLE */
- END_OF_FILE
- if test 1135 -ne `wc -c <'CoreSun3.h'`; then
- echo shar: \"'CoreSun3.h'\" unpacked with wrong size!
- fi
- # end of 'CoreSun3.h'
- fi
- if test -f 'CoreSun3.pr' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreSun3.pr'\"
- else
- echo shar: Extracting \"'CoreSun3.pr'\" \(2040 characters\)
- sed "s/^X//" >'CoreSun3.pr' <<'END_OF_FILE'
- X 16MB C=0 2048B t% 0 f% 0 0.01u
- X 16MB C=0 2048B t% 1 f% 1 0.02u
- X 16MB C=0 2048B t% 0 f% 3 0.02u
- X 16MB C=0 2048B t% 3 f% 0 0.02u
- X 16MB C=0 512B t% 0 f% 0 0.07u
- X 16MB C=0 512B t% 1 f% 1 0.04u
- X 16MB C=0 512B t% 0 f% 3 0.08u
- X 16MB C=0 512B t% 3 f% 0 0.04u
- X 16MB C=0 128B t% 0 f% 0 0.27u
- X 16MB C=0 128B t% 1 f% 1 0.25u
- X 16MB C=0 128B t% 0 f% 3 0.25u
- X 16MB C=0 128B t% 3 f% 0 0.23u
- X 16MB C=0 32B t% 0 f% 0 1.44u
- X 16MB C=0 32B t% 1 f% 1 1.52u
- X 16MB C=0 32B t% 0 f% 3 1.43u
- X 16MB C=0 32B t% 3 f% 0 1.45u
- X 16MB C=0 8B t% 0 f% 0 7.04u
- X 16MB C=0 8B t% 1 f% 1 7.23u
- X 16MB C=0 8B t% 0 f% 3 7.25u
- X 16MB C=0 8B t% 3 f% 0 7.25u
- X 16MB C=1 2048B t% 0 f% 0 2.10u
- X 16MB C=1 2048B t% 1 f% 1 2.45u
- X 16MB C=1 2048B t% 0 f% 3 8.38u
- X 16MB C=1 2048B t% 3 f% 0 8.37u
- X 16MB C=1 512B t% 0 f% 0 2.25u
- X 16MB C=1 512B t% 1 f% 1 3.02u
- X 16MB C=1 512B t% 0 f% 3 8.38u
- X 16MB C=1 512B t% 3 f% 0 8.44u
- X 16MB C=1 128B t% 0 f% 0 3.22u
- X 16MB C=1 128B t% 1 f% 1 4.32u
- X 16MB C=1 128B t% 0 f% 3 10.43u
- X 16MB C=1 128B t% 3 f% 0 10.00u
- X 16MB C=1 32B t% 0 f% 0 7.02u
- X 16MB C=1 32B t% 1 f% 1 8.02u
- X 16MB C=1 32B t% 0 f% 3 12.20u
- X 16MB C=1 32B t% 3 f% 0 12.01u
- X 16MB C=1 8B t% 0 f% 0 22.05u
- X 16MB C=1 8B t% 1 f% 1 25.23u
- X 16MB C=1 8B t% 0 f% 3 24.37u
- X 16MB C=1 8B t% 3 f% 0 24.35u
- X 16MB C=2 2048B t% 0 f% 0 3.01u
- X 16MB C=2 2048B t% 1 f% 1 2.44u
- X 16MB C=2 2048B t% 0 f% 3 3.06u
- X 16MB C=2 2048B t% 3 f% 0 3.21u
- X 16MB C=2 512B t% 0 f% 0 2.49u
- X 16MB C=2 512B t% 1 f% 1 3.11u
- X 16MB C=2 512B t% 0 f% 3 4.09u
- X 16MB C=2 512B t% 3 f% 0 3.23u
- X 16MB C=2 128B t% 0 f% 0 3.48u
- X 16MB C=2 128B t% 1 f% 1 4.10u
- X 16MB C=2 128B t% 0 f% 3 4.53u
- X 16MB C=2 128B t% 3 f% 0 4.19u
- X 16MB C=2 32B t% 0 f% 0 6.10u
- X 16MB C=2 32B t% 1 f% 1 6.46u
- X 16MB C=2 32B t% 0 f% 3 6.09u
- X 16MB C=2 32B t% 3 f% 0 6.29u
- X 16MB C=2 8B t% 0 f% 0 29.09u
- X 16MB C=2 8B t% 1 f% 1 29.53u
- X 16MB C=2 8B t% 0 f% 3 28.50u
- X 16MB C=2 8B t% 3 f% 0 28.56u
- END_OF_FILE
- if test 2040 -ne `wc -c <'CoreSun3.pr'`; then
- echo shar: \"'CoreSun3.pr'\" unpacked with wrong size!
- fi
- # end of 'CoreSun3.pr'
- fi
- if test -f 'CoreSv386.h' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreSv386.h'\"
- else
- echo shar: Extracting \"'CoreSv386.h'\" \(1879 characters\)
- sed "s/^X//" >'CoreSv386.h' <<'END_OF_FILE'
- X#define ClusterBITS 32 /* Bits in a cluster */
- X#define ClusterBYTES 4 /* Bytes in a cluster */
- X#define ClusterLNBYTES 2 /* Log2 of ClusterBYTES */
- X#define ClusterTYPE int /* The type of a cluster */
- X
- X#define ClusterALIGNTO 1 /* This should be 1; 25% diff. */
- X
- X#if (CcIS == CcREGISTER)
- X
- X# define ClusterBEST 16 /* Copy clusters when longer */
- X# define ClusterDOALIGN 64 /* Align clusters when longer */
- X
- X /* This is 0, but should be 1. Cannot get asm procs to work! */
- X# if (0 && (CcFEATURE&CcASM))
- X
- X asm void CoreByteCopy(to,from,bytes)
- X {
- X % ureg to,from; reg bytes;
- X
- X movl to,%edi
- X movl from,%esi
- X movl bytes,%ecx
- X rep
- X movsb /* (%esi),(%edi) */
- X }
- X
- X asm void CoreClusterCopy(to,from,clusters)
- X {
- X % ureg to,from; reg clusters;
- X
- X movl to,%edi
- X movl from,%esi
- X movl clusters,%ecx
- X rep
- X movsl /* (%esi),(%edi) */
- X }
- X
- X# define CoreBYTECOPY CoreByteCopy
- X# define CoreODDCOPY CoreByteCopy
- X# define CoreCLUSTERCOPY CoreClusterCopy
- X
- X# endif /* 0 */
- X
- X# /* This is 1, but should be 0, because we should use inline asm procs */
- X# if (1 && (CcFEATURE&CcASM))
- X /*
- X Having had a look at the generated code, we know that to is
- X %esi, from is in %edi, and bytes is in %ebx.
- X */
- X
- X# define CoreBYTECOPY(to,from,bytes) \
- X begindef \
- X fast addressy CoreBytes = (bytes); \
- X asm (" movl %ebx,%ecx"); \
- X asm (" rep"); \
- X asm (" movsb / (%esi),(%edi)"); \
- X enddef
- X
- X# define CoreCLUSTERCOPY(to,from,clusters) \
- X begindef \
- X fast addressy CoreClusters = (clusters); \
- X asm (" movl %ebx,%ecx"); \
- X asm (" rep"); \
- X asm (" movsl / (%esi),(%edi)"); \
- X enddef
- X
- X# define CoreODDCOPY CoreBYTECOPY
- X
- X# endif /* 1 */
- X
- X# ifndef CoreCLUSTERCOPY
- X# define CoreCLUSTERCOPY Core4CLUSTERCOPY
- X# endif
- X
- X#endif /* CcIS == CcREGISTER */
- END_OF_FILE
- if test 1879 -ne `wc -c <'CoreSv386.h'`; then
- echo shar: \"'CoreSv386.h'\" unpacked with wrong size!
- fi
- # end of 'CoreSv386.h'
- fi
- if test -f 'CoreSv386.pr' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreSv386.pr'\"
- else
- echo shar: Extracting \"'CoreSv386.pr'\" \(2040 characters\)
- sed "s/^X//" >'CoreSv386.pr' <<'END_OF_FILE'
- X 16MB C=0 2048B t% 0 f% 0 0.05u
- X 16MB C=0 2048B t% 1 f% 1 0.05u
- X 16MB C=0 2048B t% 0 f% 3 0.05u
- X 16MB C=0 2048B t% 3 f% 0 0.05u
- X 16MB C=0 512B t% 0 f% 0 0.20u
- X 16MB C=0 512B t% 1 f% 1 0.20u
- X 16MB C=0 512B t% 0 f% 3 0.20u
- X 16MB C=0 512B t% 3 f% 0 0.20u
- X 16MB C=0 128B t% 0 f% 0 0.81u
- X 16MB C=0 128B t% 1 f% 1 0.81u
- X 16MB C=0 128B t% 0 f% 3 0.81u
- X 16MB C=0 128B t% 3 f% 0 0.82u
- X 16MB C=0 32B t% 0 f% 0 3.23u
- X 16MB C=0 32B t% 1 f% 1 3.23u
- X 16MB C=0 32B t% 0 f% 3 3.22u
- X 16MB C=0 32B t% 3 f% 0 3.22u
- X 16MB C=0 8B t% 0 f% 0 12.88u
- X 16MB C=0 8B t% 1 f% 1 12.89u
- X 16MB C=0 8B t% 0 f% 3 12.88u
- X 16MB C=0 8B t% 3 f% 0 12.88u
- X 16MB C=1 2048B t% 0 f% 0 1.39u
- X 16MB C=1 2048B t% 1 f% 1 4.08u
- X 16MB C=1 2048B t% 0 f% 3 3.16u
- X 16MB C=1 2048B t% 3 f% 0 2.28u
- X 16MB C=1 512B t% 0 f% 0 1.55u
- X 16MB C=1 512B t% 1 f% 1 4.21u
- X 16MB C=1 512B t% 0 f% 3 3.32u
- X 16MB C=1 512B t% 3 f% 0 2.43u
- X 16MB C=1 128B t% 0 f% 0 2.18u
- X 16MB C=1 128B t% 1 f% 1 4.84u
- X 16MB C=1 128B t% 0 f% 3 3.94u
- X 16MB C=1 128B t% 3 f% 0 3.07u
- X 16MB C=1 32B t% 0 f% 0 4.71u
- X 16MB C=1 32B t% 1 f% 1 7.39u
- X 16MB C=1 32B t% 0 f% 3 6.44u
- X 16MB C=1 32B t% 3 f% 0 5.64u
- X 16MB C=1 8B t% 0 f% 0 14.79u
- X 16MB C=1 8B t% 1 f% 1 17.57u
- X 16MB C=1 8B t% 0 f% 3 16.10u
- X 16MB C=1 8B t% 3 f% 0 16.10u
- X 16MB C=2 2048B t% 0 f% 0 1.42u
- X 16MB C=2 2048B t% 1 f% 1 1.44u
- X 16MB C=2 2048B t% 0 f% 3 2.33u
- X 16MB C=2 2048B t% 3 f% 0 2.31u
- X 16MB C=2 512B t% 0 f% 0 1.68u
- X 16MB C=2 512B t% 1 f% 1 1.77u
- X 16MB C=2 512B t% 0 f% 3 2.65u
- X 16MB C=2 512B t% 3 f% 0 2.57u
- X 16MB C=2 128B t% 0 f% 0 2.72u
- X 16MB C=2 128B t% 1 f% 1 3.03u
- X 16MB C=2 128B t% 0 f% 3 3.89u
- X 16MB C=2 128B t% 3 f% 0 3.61u
- X 16MB C=2 32B t% 0 f% 0 6.48u
- X 16MB C=2 32B t% 1 f% 1 9.21u
- X 16MB C=2 32B t% 0 f% 3 8.28u
- X 16MB C=2 32B t% 3 f% 0 7.45u
- X 16MB C=2 8B t% 0 f% 0 22.22u
- X 16MB C=2 8B t% 1 f% 1 22.24u
- X 16MB C=2 8B t% 0 f% 3 22.23u
- X 16MB C=2 8B t% 3 f% 0 22.23u
- END_OF_FILE
- if test 2040 -ne `wc -c <'CoreSv386.pr'`; then
- echo shar: \"'CoreSv386.pr'\" unpacked with wrong size!
- fi
- # end of 'CoreSv386.pr'
- fi
- if test -f 'CoreTest.c' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'CoreTest.c'\"
- else
- echo shar: Extracting \"'CoreTest.c'\" \(3039 characters\)
- sed "s/^X//" >'CoreTest.c' <<'END_OF_FILE'
- X#include <sys/types.h>
- X#include <sys/times.h>
- X#include <sys/param.h>
- X
- X#ifndef HZ
- X# define HZ 60
- X#endif
- X
- X#include <stdio.h>
- X
- X#ifndef B
- X# define B 4096 /* Maximum & default # of bytes */
- X#endif
- X#ifndef M
- X# define M (16<<20) /* Default megabytes copied */
- X#endif
- X
- Xtypedef char *(*method)();
- X
- Xstatic time_t measure(p,t,f,b)
- X register method p;
- X register char *t,*f;
- X register unsigned b;
- X{
- X register unsigned i;
- X struct tms tms;
- X time_t utime;
- X
- X (void) times(&tms);
- X utime = tms.tms_utime;
- X
- X for (i = 0; i < M; i += b)
- X (void) (*p)(t,f,b);
- X
- X (void) times(&tms);
- X return tms.tms_utime - utime;
- X}
- X
- Xextern char *null();
- Xextern char *memcpy();
- Xextern char *CoreCopy();
- Xextern char *copy1();
- Xextern char *copy2();
- Xextern char *copy3();
- X
- Xstatic method methods[] = {null,memcpy,CoreCopy,copy1,copy2,copy3};
- Xstatic unsigned nmethods = sizeof methods/sizeof (method);
- X
- X#define SLOP sizeof (long unsigned)
- X
- Xlong unsigned alignit1;
- Xchar bfrom[B+SLOP];
- X
- Xlong unsigned alignit2;
- Xchar bto[B+SLOP];
- X
- Xextern int main(argc,argv)
- X int argc;
- X char **argv;
- X{
- X register unsigned i,b;
- X register char *f,*t;
- X unsigned of,ot;
- X unsigned m;
- X time_t utime;
- X
- X
- X m = (argc <= 1) ? 0 : atoi(argv[1]);
- X b = (argc <= 2) ? B : atoi(argv[2]);
- X ot = (argc <= 3) ? 1 : atoi(argv[3]);
- X of = (argc <= 4) ? 1 : atoi(argv[4]);
- X
- X if (m >= nmethods) m = 1;
- X if (b > B) b = B;
- X if (ot > SLOP) ot %= SLOP;
- X if (of > SLOP) of %= SLOP;
- X
- X f = bfrom + of; t = bto + ot;
- X
- X printf("%3uMB C=%u %4uB t%% %u f%% %u ",
- X M>>20,m,b,(unsigned) t%SLOP,(unsigned) f%SLOP);
- X fflush(stdout);
- X
- X utime = measure(methods[m],t,f,b);
- X
- X printf("%3u.%02uu\n",utime/HZ,utime%HZ);
- X fflush(stdout);
- X
- X return 0;
- X}
- X
- Xextern char *null(to,from,bytes)
- X register char *to,*from;
- X register unsigned bytes;
- X{
- X return to+bytes;
- X}
- X
- Xextern char *copy1(to,from,bytes)
- X register char *to,*from;
- X register unsigned bytes;
- X{
- X if (bytes)
- X {
- X do *to++ = *from++;
- X while (--bytes);
- X }
- X
- X return to;
- X}
- X
- Xextern char *copy2(to,from,bytes)
- X register char *to,*from;
- X register unsigned bytes;
- X{
- X while (bytes >= sizeof (long))
- X {
- X *(long *) to = *(long *) from;
- X to += sizeof (long), from += sizeof (long);
- X bytes -= sizeof (long);
- X }
- X
- X if (bytes)
- X {
- X do *to++ = *from++;
- X while (--bytes);
- X }
- X
- X return to;
- X}
- X
- Xextern char *copy3(to,from,bytes)
- X register char *to,*from;
- X register unsigned bytes;
- X{
- X while (bytes >= 2*sizeof (long))
- X {
- X *(long *) to = *(long *) from;
- X *((long *) to + 1) = *((long *) from +1);
- X to +=2*sizeof (long), from += 2*sizeof (long);
- X bytes -= 2*sizeof (long);
- X }
- X
- X while (bytes >= sizeof (long))
- X {
- X *(long *) to = *(long *) from;
- X to += sizeof (long), from += sizeof (long);
- X bytes -= sizeof (long);
- X }
- X
- X if (bytes)
- X {
- X do *to++ = *from++;
- X while (--bytes);
- X }
- X
- X return to;
- X}
- END_OF_FILE
- if test 3039 -ne `wc -c <'CoreTest.c'`; then
- echo shar: \"'CoreTest.c'\" unpacked with wrong size!
- fi
- # end of 'CoreTest.c'
- fi
- echo shar: End of shell archive.
- exit 0
- --
- Piercarlo "Peter" Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
- Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
- Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
-