Eric Schrock's Blog

Month: June 2005

There’s actually a decent piece over at eWeek discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we’re trying to accomplish, I thought I’d follow up with a little technical background on virtualization and why we’re investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental 😉

Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens’ Zones comparison chart.

Benefits of Virtualization

First off, virtualization is here to stay. Our customers need virtualization – it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as VMware is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they’re trying to accomplish, and companies try to pitch a single solution to every virtualization problem without recognizing that more appropriate solutions may exist.

The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they are systems in themselves – and often they don’t play nicely with other applications on the box. So “one app per machine” has become the common paradigm. The second most common need is security, whether for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc.), but these are the two primary ones.

So what does virtualization buy you? It’s all about reducing costs, but there are really two types of cost associated with running a system:

  1. Hardware costs – This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).
  2. Software management costs – This includes the costs of deploying new machines, upgrading and patching software, and observing software behavior.

As we’ll see, different virtualization strategies provide different qualities of the above savings.

Hardware virtualization

This is one of the most well-established forms of virtualization; the most common examples today are Sun Domains and IBM Logical Partitions. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantages of requiring no software layer, imposing no performance penalty, and providing hardware fault isolation between the virtual machines. The downside is that it requires specialized hardware that is extremely expensive, and it provides zero benefit when it comes to reducing software management costs.

Software machine virtualization

This approach is probably the one most commonly associated with the term “virtualization”. In this scheme, a software layer allows multiple OS instances to run on the same hardware. The most commercialized versions are VMware and Virtual PC, but other projects exist (such as qemu and PearPC). Typically, they require a “host” operating system as well as multiple “guests” (although VMware ESX Server runs a custom kernel as the host). While Xen uses a paravirtualization technique that requires changes to the guest OS, it is still fundamentally a machine virtualization technique. User-mode Linux takes a radically different approach, but accomplishes the same basic task.

In the end, this approach has strengths and weaknesses similar to those of hardware virtualization. You don’t have to buy expensive special-purpose hardware, but you give up hardware fault isolation and often sacrifice performance (Xen’s approach lessens this impact, but it’s still visible). Most importantly, you still don’t save any of the costs associated with software management – administering software on 10 virtual machines is just as expensive as administering 10 separate machines. And you have no visibility into what’s happening within the virtual machine – you may be able to tell that Xen is consuming 50% of your CPU, but you can’t tell why unless you log into the virtual system itself.

Software application virtualization

On the grand scale of virtualization, this ranks as the “least virtualized”. With this approach, the operating system uses various tricks and techniques to present an alternate view of the machine. This can range from simple chroot(1), to BSD Jails, to Solaris Zones. Each of these provides a more complete OS view with varying degrees of isolation. While Zones is the most complete and the most secure, they all use the same fundamental idea of a single operating system presenting an “alternate reality” that appears to be a complete system at the application level. The upcoming Linux Application Environment on OpenSolaris will take this approach by leveraging Zones and emulating Linux at the system call layer.

The most significant downside to this approach is the fact that there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the “guest” environments have limited access to hardware facilities. On the other hand, this approach results in huge savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest using standard operating system tools, and you can manage them as you would any other processes using standard resource management tools. You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).

OpenSolaris virtualization in the future

So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is because “our customers want us to.” And hopefully, from what’s been said above, it’s obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones are probably your best bet. If you require the ability to survive hardware faults between virtual machines, then Domains are the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from Red Hat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you’re just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port it to Solaris first.

With respect to Linux virtualization in particular, we are always going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris. But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won’t run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.

I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we’re making the best choices for our customers.

You may note that I failed to mention cross-architecture virtualization. This is most common at the system level (as with PearPC), but application-level solutions do exist (including Apple’s upcoming Rosetta). This type of virtualization simply doesn’t factor into our plans yet, and it still falls under the umbrella of one of the broad virtualization types above.

I also apologize to the authors of any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.

A while ago, for my own amusement, I went through the Solaris source base and searched for the source files with the most lines. For some unknown reason this popped into my head yesterday, so I decided to try it again. Here are the top 10 longest files in OpenSolaris:

Length Source File
29944 usr/src/uts/common/io/scsi/targets/sd.c
25920 [closed]
25429 usr/src/uts/common/inet/tcp/tcp.c
22789 [closed]
16954 [closed]
16339 [closed]
15667 usr/src/uts/common/fs/nfs4_vnops.c
14550 usr/src/uts/sfmmu/vm/hat_sfmmu.c
13931 usr/src/uts/common/dtrace/dtrace.c
13027 usr/src/uts/sun4u/starfire/io/idn_proto.c

You can see that some of the largest files are still closed source. Note that the length of a file doesn’t necessarily indicate anything about the quality of the code – this is more just idle curiosity. Knowing the quality of online journalism these days, I’m sure this will get turned into “Solaris source reveals completely unmaintainable code” …
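
For the curious, generating a list like this doesn’t take much – something along these lines will do it (a rough sketch; the exact flags may differ from what I actually ran):

$ find usr/src -name '*.c' | xargs wc -l | grep -v ' total$' | sort -rn | head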

After looking at this, I decided a much more interesting question was “which source files are the most commented?” To answer this question, I ran every source file through a script I found that counts the number of commented lines in each file. I filtered out those files that were less than 500 lines long, and ran the results through another script to calculate the percentage of lines that were commented. Lines which have a comment alongside source are considered commented lines, so some of the ratios were quite high. I filtered out those files which were mostly tables (like uwidth.c), as these comments didn’t really count. I also ignored header files, because they tend to be far more commented than the implementation itself. In the end I had the following list:

Percentage File
62.9% usr/src/cmd/cmd-inet/usr.lib/mipagent/snmp_stub.c
58.7% usr/src/cmd/sgs/libld/amd64/amd64unwind.c
58.4% usr/src/lib/libtecla/common/expand.c
56.7% usr/src/cmd/lvm/metassist/common/volume_nvpair.c
56.6% usr/src/lib/libtecla/common/cplfile.c
55.6% usr/src/lib/libc/port/gen/mon.c
55.4% usr/src/lib/libadm/common/devreserv.c
55.1% usr/src/lib/libtecla/common/getline.c
54.5% [closed]
54.3% usr/src/uts/common/io/ib/ibtl/ibtl_mem.c

Now, when I write code I tend to hover in the 20-30% comments range (my best of those in the gate is gfs.c, which with Dave’s help is 44% comments). Some of the above are rather over-commented (especially snmp_stub.c, which likes to repeat comments above and within functions).
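
If you’re curious how “commented lines” get counted, the logic boils down to something like the following (a minimal sketch written from memory – not the actual script I used, which handled a few more edge cases):

/*
 * Count the percentage of lines in a C source file that contain at
 * least part of a comment.  Only classic C comments are handled; lines
 * mixing code and comments count as commented, just as described above.
 */
#include <stdio.h>

int
main(int argc, char **argv)
{
        FILE *fp;
        int c, prev = '\0';
        int in_comment = 0;             /* currently inside a comment */
        int line_commented = 0;         /* current line touched a comment */
        long total = 0, commented = 0;

        if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
                (void) fprintf(stderr, "usage: commentpct <file.c>\n");
                return (1);
        }

        while ((c = getc(fp)) != EOF) {
                if (!in_comment && prev == '/' && c == '*')
                        in_comment = 1;
                if (in_comment)
                        line_commented = 1;
                if (in_comment && prev == '*' && c == '/')
                        in_comment = 0;
                if (c == '\n') {
                        total++;
                        if (line_commented)
                                commented++;
                        line_commented = in_comment;
                }
                prev = c;
        }
        (void) fclose(fp);

        if (total >= 500)       /* ignore short files, as described above */
                (void) printf("%.1f%% %s\n",
                    100.0 * commented / total, argv[1]);
        return (0);
}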

I found this little experiment interesting, but please don’t base any conclusions on these results. They are for entertainment purposes only.

Since Bryan solved my last puzzle a little too quickly, this post will serve as a followup puzzle that may or may not be easier. All I know is that Bryan is ineligible this time around 😉

Once again, the rules are simple. The solution must be a single-line dcmd that produces precise output without any additional steps or post-processing. For this puzzle, you’re actually allowed two commands: one for your dcmd, and another for ‘::run’. We’ll be using the following test program:

#include <stdlib.h>
#include <unistd.h>
#include <time.h>

int
main(int argc, char **argv)
{
        int i;

        srand(time(NULL));
        for (i = 0; i < 100; i++)
                write(rand() % 10, NULL, 0);
        return (0);
}

The puzzle itself demonstrates how conditional breakpoints can be implemented on top of existing functionality:

Stop the test program on entry to the write() system call only when the file descriptor number is 7

I thought this one would be harder than the last, but now I’m not so sure, especially once you absorb some of the finer points from the last post.

On a lighter note, I thought I’d post an “MDB puzzle” for the truly masochistic out there. I was going to post two, but the second one was just way too hard, and I was having a hard time finding a good test case in userland. You can check out how we hope to make this better over at the MDB community. Unfortunately I don’t have anything cool to give away, other than my blessing as a truly elite MDB hacker. Of course, if you get this one right I might just have to post the second one I had in mind…

The rules are simple. You can only use a single line command in ‘mdb -k’. You cannot use shell escapes (!). Your answer must be precise, without requiring post-processing through some other utility. Leaders of the MDB community and their relatives are ineligible, though other Sun employees are welcome to try. And now, the puzzle:

Print out the current working directory of every process with an effective user id of 0.

Should be simple, right? Well, make sure you go home and study your MDB pipelines, because you’ll need some clever tricks to get this one just right…

On opening day, I chose to post an entry on adding a system call to OpenSolaris. Considering the feedback, I thought I’d continue with brief “How-To add to OpenSolaris” documents for a while. There’s a lot to choose from here, so I’ll just pick them off as quickly as I can. Today’s topic is adding a new kernel module to OpenSolaris.

For the sake of discussion, we will be adding a new module that does nothing apart from printing a message on load and unload. It will be architecture-neutral, and it will be distributed as part of a separate package (to give you a taste of our packaging system). We’ll continue my narcissistic tradition and name this the “schrock” module.

1. Adding source

To begin, you must put your source somewhere in the tree. It must go somewhere under usr/src/uts/common, but exactly where depends on the type of module. Just about the only hard rule is that filesystems go in the “fs” directory; beyond that there are no real rules. The bulk of the modules live in the “io” directory, since the majority of modules are drivers of some kind. For now, we’ll put ‘schrock.c’ in the “io” directory:

#include <sys/modctl.h>
#include <sys/cmn_err.h>

static struct modldrv modldrv = {
        &mod_miscops,
        "schrock module %I%",
        NULL
};

static struct modlinkage modlinkage = {
        MODREV_1, (void *)&modldrv, NULL
};

int
_init(void)
{
        cmn_err(CE_WARN, "OpenSolaris has arrived");
        return (mod_install(&modlinkage));
}

int
_fini(void)
{
        cmn_err(CE_WARN, "OpenSolaris has left the building");
        return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
        return (mod_info(&modlinkage, modinfop));
}

The code is pretty simple, and is basically the minimum needed to add a module to the system. You’ll notice we use ‘mod_miscops’ in our modldrv. If we were adding a device driver or filesystem, we would be using a different set of linkage structures.
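
For comparison, here is roughly what those other linkages look like (a hypothetical sketch – ‘mydrv’ and ‘myfs’ don’t exist, and the actual dev_ops and vfsdef contents are elided):

#include <sys/modctl.h>
#include <sys/devops.h>
#include <sys/vfs.h>

extern struct dev_ops mydrv_dev_ops;    /* attach(9E), detach(9E), etc. */
extern struct vfsdef_v myfs_vfsdef;     /* filesystem name and init routine */

/* A device driver links in through modldrv and mod_driverops ... */
static struct modldrv drv_linkage = {
        &mod_driverops,
        "mydrv sample driver",
        &mydrv_dev_ops
};

/* ... while a filesystem uses modlfs and mod_fsops. */
static struct modlfs fs_linkage = {
        &mod_fsops,
        "myfs sample filesystem",
        &myfs_vfsdef
};

Either structure is then hung off a modlinkage exactly as the misc module does above.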

2. Creating Makefiles

We must add two Makefiles to get this building:

usr/src/uts/intel/schrock/Makefile
usr/src/uts/sparc/schrock/Makefile

With contents similar to the following:

UTSBASE = ../..
MODULE          = schrock
OBJECTS         = $(SCHROCK_OBJS:%=$(OBJS_DIR)/%)
LINTS           = $(SCHROCK_OBJS:%.o=$(LINTS_DIR)/%.ln)
ROOTMODULE      = $(ROOT_MISC_DIR)/$(MODULE)
include $(UTSBASE)/intel/Makefile.intel
ALL_TARGET      = $(BINARY)
LINT_TARGET     = $(MODULE).lint
INSTALL_TARGET  = $(BINARY) $(ROOTMODULE)
CFLAGS          += $(CCVERBOSE)
.KEEP_STATE:
def:            $(DEF_DEPS)
all:            $(ALL_DEPS)
clean:          $(CLEAN_DEPS)
clobber:        $(CLOBBER_DEPS)
lint:           $(LINT_DEPS)
modlintlib:     $(MODLINTLIB_DEPS)
clean.lint:     $(CLEAN_LINT_DEPS)
install:        $(INSTALL_DEPS)
include $(UTSBASE)/intel/Makefile.targ

3. Modifying existing Makefiles

There are two remaining Makefile chores before we can continue. First, we have
to add the set of files to usr/src/uts/common/Makefile.files:

KMDB_OBJS += kdrv.o
SCHROCK_OBJS += schrock.o
BGE_OBJS += bge_main.o bge_chip.o bge_kstats.o bge_log.o bge_ndd.o \
                bge_atomic.o bge_mii.o bge_send.o bge_recv.o

If you had created a subdirectory for your module instead of placing it in
“io”, you would also have to add a set of rules to usr/src/uts/common/Makefile.rules.
If you need to do this, make sure you get both the object targets and the
lint targets, or you’ll get build failures if you try to run lint.
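
For example, if schrock.c lived in a hypothetical usr/src/uts/common/schrock directory, the Makefile.rules entries would look something like this (a sketch modeled on the existing rules in that file – note the matching lint rule):

$(OBJS_DIR)/%.o:                $(UTSBASE)/common/schrock/%.c
        $(COMPILE.c) -o $@ $<
        $(CTFCONVERT_O)

$(LINTS_DIR)/%.ln:              $(UTSBASE)/common/schrock/%.c
        @($(LHEAD) $(LINT.c) $< $(LLIB) $(LTAIL))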

You’ll also need to modify the usr/src/uts/intel/Makefile.intel
file, as well as the corresponding SPARC version:

MISC_KMODS      += usba usba10
MISC_KMODS      += zmod
MISC_KMODS      += schrock
#
#       Software Cryptographic Providers (/kernel/crypto):
#

4. Creating a package

As mentioned previously, we want this module to live in its own package. We
start by creating usr/src/pkgdefs/SUNWschrock and adding it to the list
of COMMON_SUBDIRS in usr/src/pkgdefs/Makefile:

        SUNWsasnm \
        SUNWsbp2 \
        SUNWschrock \
        SUNWscpr  \
        SUNWscpu  \

Next, we have to add a package skeleton. Since we’re only adding a miscellaneous module and not a full-blown driver, we only need a simple one. First, there’s the Makefile:

include ../Makefile.com
.KEEP_STATE:
all: $(FILES)
install: all pkg
include ../Makefile.targ

A ‘pkginfo.tmpl’ file:

PKG=SUNWschrock
NAME="Sample kernel module"
ARCH="ISA"
VERSION="ONVERS,REV=0.0.0"
SUNW_PRODNAME="SunOS"
SUNW_PRODVERS="RELEASE/VERSION"
SUNW_PKGVERS="1.0"
SUNW_PKGTYPE="root"
MAXINST="1000"
CATEGORY="system"
VENDOR="Sun Microsystems, Inc."
DESC="Sample kernel module"
CLASSES="none"
HOTLINE="Please contact your local service provider"
EMAIL=""
BASEDIR=/
SUNW_PKG_ALLZONES="true"
SUNW_PKG_HOLLOW="true"

And ‘prototype_com’, ‘prototype_i386’, and ‘prototype_sparc’ (elided) files:

# prototype_i386
!include prototype_com
d none kernel/misc/amd64 755 root sys
f none kernel/misc/amd64/schrock 755 root sys
# prototype_com
i pkginfo
d none kernel 755 root sys
d none kernel/misc 755 root sys
f none kernel/misc/schrock 755 root sys

5. Putting it all together

If we pkgadd our package, or BFU to the resulting archives, we can see our
module in action:

halcyon# modload /kernel/misc/schrock
Jun 19 12:43:35 halcyon schrock: WARNING: OpenSolaris has arrived
halcyon# modunload -i 197
Jun 19 12:43:50 halcyon schrock: WARNING: OpenSolaris has left the building
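
In case you’re wondering where the module id passed to modunload came from: modinfo(1M) lists every loaded module along with its id, so something like the following will find it (the id will of course vary):

halcyon# modinfo | grep schrock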

This process is common to all kernel modules (though packaging is simpler for
those combined in SUNWckr, for example). Things get a little more complicated
and a little more specific when you begin to talk about drivers or filesystems
in particular. I’ll try to create some simple howtos for those as
well.

Just a heads up that we’ve formed a new OpenSolaris Observability community. There’s not much there right now, but I encourage you to head over and check out what OpenSolaris has to offer. Or come to the discussion forum and gripe about what features we’re still missing. Topics covered include process, system, hardware, and post-mortem observability. We’ll be adding much more content as soon as we can.
