Friday, December 28, 2007

Musings on the MoPaQ Format

So, Q survived finals and term papers, though not unscathed. Now for something completely different, and pseudo-random.

A couple weeks ago, BahamutZero and I were talking about the MPQ file format. I can't remember exactly what we were talking about, but he mentioned that it didn't seem all that good. It occurred to me that while I'm probably the most knowledgeable person on the format outside Blizzard (excluding, of course, the author of the format, who no longer works for Blizzard), I'd never really formed an opinion on how good or bad it was in general. That gave me the idea to write a critique of the format - at least, of the more interesting things in the format - on my blog.

Hashing

One of the more distinctive features of MPQs is the way in which files are located within the archives. The standard method of finding files in databases and file systems is the B-tree or B+-tree. For databases, there is one tree for each index in each table; for file systems, each directory has its own tree. This allows for efficient searching (O(log N) disk accesses per lookup) and the ability to keep only part of the index in memory at any time. However, as archive files are generally more concerned with optimizing size than speed (ACE being the extreme example), they may use something simpler, such as a sorted or unsorted array of filenames (and if unsorted, it's possible to just store the filenames along with the file data itself).

The MPQ format, however, opted for something more industrial-strength: hashing. Each MPQ archive has a single large hash table (an O(1) structure), containing entries for all (readable) files in the archive. Files are identified by two 32-bit hashes of the filename, a language code, and a platform code (what exactly the platform code is for remains to be seen, as it's still unused); the index into the table is derived from a third hash of the filename. Note that the filename itself is stored nowhere in this scheme.
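
To make that concrete, here's a minimal sketch of how such a lookup could work. The structure layout, hash function, and sentinel value here are illustrative stand-ins, not Blizzard's actual ones.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch of an MPQ-style hash table lookup. The hash function
// below is a placeholder so the sketch compiles; the real format uses its
// own hash functions, layout, and sentinel values.
static uint32_t Hash(const std::string& name, uint32_t seed) {
    return static_cast<uint32_t>(std::hash<std::string>{}(name)) ^ (seed * 0x9E3779B9u);
}

const uint32_t NO_FILE = 0xFFFFFFFFu;  // stand-in for "slot never used"

struct HashEntry {
    uint32_t nameHashA;   // first 32-bit hash of the filename
    uint32_t nameHashB;   // second 32-bit hash of the filename
    uint16_t language;    // language code
    uint16_t platform;    // platform code (unused in practice)
    uint32_t blockIndex;  // index into the block table, or NO_FILE
};

// Returns the block index of the file, or NO_FILE if it isn't in the table.
// Note that the filename itself never appears anywhere in the table.
uint32_t FindFile(const std::vector<HashEntry>& table, const std::string& name,
                  uint16_t language, uint16_t platform) {
    uint32_t a = Hash(name, 1), b = Hash(name, 2);
    std::size_t start = Hash(name, 3) % table.size();  // a third hash picks the starting slot
    for (std::size_t i = 0; i < table.size(); ++i) {
        const HashEntry& e = table[(start + i) % table.size()];
        if (e.blockIndex == NO_FILE)
            return NO_FILE;  // a never-used slot ends the probe sequence
        if (e.nameHashA == a && e.nameHashB == b &&
            e.language == language && e.platform == platform)
            return e.blockIndex;
    }
    return NO_FILE;
}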

You really can't do any better than this for the original purpose MPQs were designed for: to store read-only game data in a compact form that is both fast to access and fast to load. The use of fixed-size hashes (rather than the actual filenames) makes the size of archive data in memory optimal, and the hash table structure makes file lookups optimal. The format of the hash table (specifically, how unused entries are encoded) means that the size on disc could be further reduced by compressing the hash table (it isn't compressed on disc); presumably this wasn't done in order to keep opening archives fast. Apart from that, the hash table has no real weaknesses for the originally intended purpose.

However, the hash table proves to be a pain for modders because of the lack of actual filenames in the archives. This apparently became a pain for Blizzard as well, as indicated by the fact that in Starcraft: Brood War (the third game/expansion to use MPQs), Blizzard added a separate mechanism, not integrated with the hash table, for storing the names of files in the archive.

These two things taken together, however, suggest one way in which the MPQ format could have been improved (besides a minor alteration that would have allowed the hash table to be resized; presumably Blizzard didn't see a need for this ability when creating the MPQ format). Rather than including the hash table in the archive format itself, Blizzard could have made the hash table a global, separate entity. That is, store the filenames in some simple structure in the archives, and have the installer/patcher code create a global hash table (containing files of all archives) which would be used by the archive accessing code; this could even be specialized, such as only indexing files of the correct language. This would reduce the in-memory overhead (although increase the total size on disc) as well as potentially decrease the search time for finding a file in multiple archives. It has the added benefit of removing that complexity from the format itself, and moving it to support code. If memory serves, Neverwinter Nights took this approach.
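
A rough sketch of the kind of installer-built global index I have in mind (everything here, down to the names, is purely illustrative):

#include <cstdint>
#include <string>
#include <unordered_map>

// Purely illustrative: a global, out-of-archive index built once by the
// installer/patcher from simple name lists stored in each archive,
// optionally including only files for the current language/platform.
struct FileLocation {
    uint32_t archiveId;   // which archive the file lives in
    uint32_t blockIndex;  // index into that archive's block table
};

using GlobalIndex = std::unordered_map<std::string, FileLocation>;
// Lookup at runtime is then a single probe across all archives:
//   auto it = index.find("some\\path\\file");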

Language/Platform Codes

Another interesting design decision in MPQs was the choice of including support for different languages/platforms in the archive format itself. As briefly mentioned previously, files in MPQs are identified not by filenames but by three hashes derived from the filename, plus a language code and a platform code. When a game runs, it sets the language and platform it's currently running under. The MPQ access code then looks through the archives for a file matching that description. If no such file exists, it performs a second search, using the same hashes but the default language and platform codes (in practice, files identified by these default codes are generally American English files).
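
In terms of the earlier lookup sketch, the fallback might look something like this (NEUTRAL_LANGUAGE and NEUTRAL_PLATFORM are illustrative names, not the real default values):

// Sketch of the two-pass localized lookup described above, reusing the
// FindFile() sketch from the Hashing section. The default-code constants
// are illustrative stand-ins.
const uint16_t NEUTRAL_LANGUAGE = 0;
const uint16_t NEUTRAL_PLATFORM = 0;

uint32_t FindLocalizedFile(const std::vector<HashEntry>& table,
                           const std::string& name,
                           uint16_t language, uint16_t platform) {
    uint32_t block = FindFile(table, name, language, platform);
    if (block == NO_FILE)  // no localized version: fall back to the default
        block = FindFile(table, name, NEUTRAL_LANGUAGE, NEUTRAL_PLATFORM);
    return block;
}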

While it's nice that Blizzard put some thought into internationalization, I tend to think this was something that could have been better implemented outside the archive format itself. It would have been just as easy to use a prefix or suffix on the filename for this purpose (e.g. "platform\language\filename" or "filename.platform.language"). Both of these could have been implemented completely in code.

Alternatively, had they used a global index, they could have simply included only files of the proper language/platform (or the neutral language/platform, if no specific version exists) in the index when it is built by the installer/patcher, without the need to modify the filenames at all.

Separate Hash and Block Tables

One detail about the hash table that wasn't previously mentioned is the fact that it stores only some information about each file (specifically, only what has already been mentioned). Additional information, such as the file size and offset, as well as file modes and flags, is stored in the block/file table ('block' is more technically accurate, but 'file' better describes its intended purpose). Each hash table entry then stores an index into the file table.
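
A sketch of how the split might look in practice (field names are mine, and approximate):

// Approximate sketch of the hash table / block table split described above.
struct BlockEntry {
    uint32_t offset;            // where the file's data starts in the archive
    uint32_t compressedSize;    // size of the data as stored
    uint32_t uncompressedSize;  // size of the data once decompressed
    uint32_t flags;             // compressed? encrypted? single block? etc.
};
// Each hash table entry (see the earlier sketch) holds only the name hashes,
// the language/platform codes, and a blockIndex pointing into this table.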

This is something that is common in file systems - you typically have a file table, containing the basic info about each file's data, and an index (often a B/B+-tree) of filenames that indexes into the file table. The obvious benefit of this is that the file table can be modified less frequently, and doesn't need to be completely rewritten every time there's a change in the directory structure. A second benefit is that it's possible to create hard links - multiple 'files' in multiple directories (or in the same directory with different names) that refer to the same physical file.

File Data Blocking

Another feature of the MPQ format that resembles a file system more than an archive format is the way it structures file data. It's generally beneficial, from a compression ratio perspective, to compress a data set as a whole (though compression libraries typically feature a mechanism to indicate when it's necessary to read more data or write out compressed data, so that you don't have to load the entire file into memory at once). However, file systems that support compression don't do that. As file systems must read and write entire sectors of data, file data is broken into chunks of some number of sectors, compressed as a single block, then written out, padded to the nearest sector; if the compression doesn't save at least one sector, the compressed data is discarded and the uncompressed data is used instead (there's no point in having to decompress something when compression doesn't even reduce the size).

The MPQ format also does this, however, despite the fact that it does not pad out to disc sectors. The reason is that there's a second benefit to blocking data in this way. When a block of data is compressed as a stream (a single entity), it can generally only be read likewise - as a stream, from beginning to end. That is, you cannot seek in compressed data. As seeking is an essential function in general file systems (and similarly in game archive formats), this constraint is unacceptable. Blocking is thus used to limit the size of the units that must be accessed atomically. To read from an arbitrary point, you find the block containing the desired position and decompress only that block.

One last point unique to the MPQ format is that when files are blocked, each block may use a different compression algorithm, indicated by a byte prefixed to the block that identifies the algorithm used.
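
Here's a self-contained sketch of reading from an arbitrary offset in a blocked file. The "decompression" is a trivial stand-in (it only understands a "stored uncompressed" type byte) so the example stays runnable; a real reader would dispatch on that byte to Deflate, BZip2, and so on.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each stored block begins with one byte naming its compression algorithm.
// This stand-in only understands "stored"; a real reader would dispatch to
// the appropriate decompressor based on that byte.
static std::vector<uint8_t> DecompressBlock(const std::vector<uint8_t>& block) {
    // block[0] = compression type, remainder = payload
    return std::vector<uint8_t>(block.begin() + 1, block.end());
}

// Read `count` bytes starting at `offset` from a file stored as blocks of
// `blockSize` uncompressed bytes each: find the block holding the offset,
// decompress it, copy out what's needed, and move on to the next block.
std::vector<uint8_t> ReadAt(const std::vector<std::vector<uint8_t>>& blocks,
                            std::size_t blockSize,
                            std::size_t offset, std::size_t count) {
    std::vector<uint8_t> out;
    while (count > 0) {
        std::vector<uint8_t> plain = DecompressBlock(blocks[offset / blockSize]);
        std::size_t within = offset % blockSize;
        std::size_t n = std::min(count, plain.size() - within);
        out.insert(out.end(), plain.begin() + within, plain.begin() + within + n);
        offset += n;
        count -= n;
    }
    return out;
}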

Single Block Files

Nevertheless, as mentioned previously, it's better from a compression (and possibly performance) standpoint to compress a file as a single unit. In most cases this would work for arbitrary data (the type that might appear in an MPQ), as many file types will be read once and then kept in memory. To take advantage of this, the MPQ format has a second mode of file storage (I can't recall off the top of my head when this was added; possibly in World of Warcraft): single block compression. This is exactly what it sounds like: the entire file is treated as a single block of data - compressed in a single go, and prefixed with a byte indicating the compression algorithm, just as each block normally is.

In theory (and at the level of the file format), this isn't any different from normal, non-blocked archive formats. The problem is really with the implementation (the code, to be precise). The implementor wanted an easy way to take advantage of the improved compression of not blocking data when seeking is not necessary, and so decided to use exactly the same block decompression code as for blocked files - blocks are loaded into memory whole, decrypted, and decompressed all at once. The problem here is that this requires two buffers, and requires the entire compressed block to be read before anything can be decrypted or decompressed.

ADPCM Compression

As mentioned, the MPQ format supports multiple compression algorithms. Currently, the PKWare Implode, Zip Deflate, BZip2, and Huffman encoding algorithms are supported. There's also one more that's more unexpected: the ADPCM algorithm. I briefly discussed ADPCM compression on the E Terra blog. It's a lossy format that compresses 16-bit audio samples down to 4 bits, giving a 4:1 compression ratio at a non-negligible loss of quality.

This is used for a compression mode typically referred to as "WAV compression". This mode was used for compressing sound effects and music prior to Warcraft III. The MPQ editor scans the WAV to be imported, and distinguishes blocks that contain only PCM data from blocks that contain anything else. Blocks containing only PCM data are compressed with ADPCM; the others are compressed with one of the lossless compression algorithms. This way, the game sees only a PCM-encoded WAV, though it has actually been converted transparently to ADPCM and back.

This is a pretty strange feature for an archive file format - at least, as a basic compression type - especially considering that there are better formats for this purpose, with higher compression ratios and higher quality (MP3 comes readily to mind), that wouldn't require any alteration to the basic archive format.

I suspect the real reason for this design decision was historical. WAV compression first appeared in Starcraft (the same time support for different compression algorithms for each block appeared, by the way). Diablo, the only game to use MPQs before that, used uncompressed WAVs, played through several Storm.dll APIs. I suspect ADPCM was used to allow the existing Storm.dll streaming code to continue to work using the new compression algorithm. Though again, it's not a particularly good solution, I think.

Extended Attributes

The last feature worth noting is a system for attaching to files in MPQ archives new metadata attributes that weren't included in the original format, which I've dubbed extended attributes. Extended attributes are stored as a file (conveniently named "(attributes)") in the archive, consisting of parallel arrays of the extended attributes, each array containing one entry for each entry in the block table. The extended attributes in use at present (as of World of Warcraft) are CRC32, FILETIME, and MD5; an archive may contain any combination of these. Oddly, when Storm loads these extended attributes, it merges them with the rest of the fields in the block table, creating a single large table in memory.
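
From memory, the layout is roughly a set of parallel arrays keyed by block table position, something like this (field sizes and ordering may not be exact):

#include <array>
#include <cstdint>
#include <vector>

// Rough, from-memory sketch of the "(attributes)" file: a small header
// saying which attribute arrays are present, followed by parallel arrays
// with one entry per block table entry.
struct ExtendedAttributes {
    uint32_t version;                          // format version
    uint32_t presentFlags;                     // which of the arrays below exist
    std::vector<uint32_t> crc32;               // one CRC32 per block table entry
    std::vector<uint64_t> fileTime;            // one FILETIME per block table entry
    std::vector<std::array<uint8_t, 16>> md5;  // one MD5 per block table entry
};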

This design decision has both advantages and disadvantages. Storing these in parallel arrays makes it, in theory, efficient to store, load, and access them when present, and to ignore them when absent; however, the implementation peculiarity mentioned previously makes them slower to load, with the benefit of slightly faster access (as it doesn't have to check whether they're all present each time you access them).

There are a couple disadvantages, however. First, as extended attributes are stored in arrays parallel to the block table, it isn't possible to exclude entries for files that don't need a given attribute, requiring more space if only some files need certain attributes. I suppose this is only an issue for the tables when loaded in memory: as the attributes file is compressed in the archive, unused entries can be (and have been observed to be) zeroed, allowing them to take almost no disk space after compression.

The other disadvantage is that this system is not readily extensible. Because new attributes must be known by the MPQ library (for size calculations when processing the attributes file), adding new attributes is cumbersome, as the MPQ library must also be updated to support them. This is made worse by the fact that the code that merges the attributes into the block table must be updated correspondingly. It also doesn't allow for custom metadata at all.

Sunday, December 09, 2007

Random Fact of the Day

Q is officially starting to panic.

Stuff to do this week (last week of school):
- Finish the class version of E Terra
- Start/finish video of something for school
- Start/finish term paper
- Study for and take 4 finals

Friday, November 30, 2007

Exercise for the Reader

Okay, here's one you can apply to your real, daily lives (well, some of you, anyway). When does throwing a stronger punch/kick (something along those lines) require using less strength/energy than throwing a weaker one?

No, there's no esoteric Zen metaphysics at work, here. Nothing but classical European physics.

Edit: Also, English is a bit ambiguous for the question. I'm asking when is it necessary to punch weaker in order to punch stronger?

Thursday, November 29, 2007

Solution



The basic steps used in the link to find the distance between P3 and P are:
1. Calculate the distance between P1 and P
2. Calculate the exact location of P
3. Calculate the distance between P and P3

Now, a little common sense suggests there are two too many steps here. But we can't compute the distance between P and P3 directly, because at this point the location of P is unknown.

Yet there's a very simple solution: we rotate P2 90 degrees around P1 to create P2'. By definition, the distance between P1 and P' (the projection of P3 onto the rotated line) is the same as the distance between P3 and P. This has exactly the same form as the version in the link, but it finds the distance between P1 and P' in the first step, instead of the distance between P1 and P. Thus we've eliminated steps 2 and 3 entirely, using exactly the same math as before:
u' = ( (x3 - x1)(x2' - x1) + (y3 - y1)(y2' - y1) ) / ( ||p2' - p1||^2 )
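
In code, the whole thing collapses to a few lines; rotating (x2 - x1, y2 - y1) by 90 degrees is just negating and swapping the components:

#include <cmath>

// Distance from P3 to the line through P1 and P2, using the rotation trick:
// project (P3 - P1) onto the rotated direction (P2' - P1); the resulting u'
// already measures the distance we want, so P itself is never computed.
double DistanceToLine(double x1, double y1, double x2, double y2,
                      double x3, double y3) {
    double rx = -(y2 - y1);  // x2' - x1
    double ry = x2 - x1;     // y2' - y1
    double len2 = rx * rx + ry * ry;  // ||p2' - p1||^2 (same as ||p2 - p1||^2)
    double u = ((x3 - x1) * rx + (y3 - y1) * ry) / len2;
    return std::fabs(u) * std::sqrt(len2);  // |P1 P'| = |P P3|
}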

A visual proof:


While understanding the way the distance is calculated using the dot product requires knowledge of calculus/linear algebra, deriving this improved equation from the original in the link is just trivial high school geometry. That's why it's neat.

Sucky Test Teacher Strikes Again

So, just got back the graded midterms in networking class (theoretical stuff), taught by the same teacher. Not surprisingly, it was disappointing. As before, this guy could teach a class on how not to write (and grade) tests. Among smaller gripes were two main things:

1. One question (in particular) was taken straight out of the book, and was fairly simple. What could go wrong? Well, giving the same answer to the problem as in the book got you a wrong answer on the test (and no, it wasn't an essay question where theoretically he could expect more detail).

2. Throughout the semester (there's one week left), whenever math was used, it was exclusively high-school-level math (basic algebra and such; there was one area under the curve problem, but as the "curve" was a line, it could be calculated with basic algebra). Neither the book, the lectures, nor any homework has gone beyond that. So, what does he do? He puts a calculus question on the test, and makes it worth 25% of the test (note that calculus is not a prerequisite for this class; I wonder if he could get in trouble with the dean for this).

Ultimately, Q got (like always, although it's always hard to believe in classes with this teacher) an A on the test. I don't know what the exact ranges for grades are, but I talked to one person who got an A- for 58%, and rumor has it that one person got a C for 18%.

Which brings me back to what I said in the previous post about this guy: CURVING THE SCORE DOES NOT MAKE UP FOR A HORRIBLE TEST.

Wednesday, November 28, 2007

Exercise for the Reader

Oh no, not another one of Q's easy but slightly tricky quiz questions. Today it's regarding finding the shortest distance between a point and a line. This explains the theory behind it, and how to calculate it.

But if you are only interested in the distance between P and P3 itself, and you don't care what P is, this math is moderately wasteful, as you have to first find P, then calculate the distance between P and P3. Surely there must be a way to calculate the distance without ever calculating P. And in fact, there is.

Find it, without cheating (looking up the answer online or through friends). It's actually pretty easy and simple. I just thought it was kind of neat.

Monday, November 05, 2007

& Insanity

*evil, maniacal laughter*

No, that's not me starting to panic about E Terra and deadlines (although I am starting to panic about E Terra and deadlines, as indicated by my physical stress symptoms that are starting to appear). Let's just say that I'm very pleased with my class schedule and related things, and gloating without being able to say exactly why.

Happy, happy months ahead (and at least one very stressful month)!

Friday, November 02, 2007

World Without Windows

Okay, so that title is a bit misleading. Anyway, this post hopes to provide some meaningful answers to the question: what would the world be like if the overwhelmingly dominant operating system were secure in ways that Windows is not? For the purposes of this discussion, I'm defining "secure" by several criteria:
1. All users run as limited users - they can't do administrative tasks or screw with the OS without explicitly logging on as admin or running a program as administrator (e.g. Windows run as or Unix sudo)
2. The system is fully isolating with respect to users - one user may not access another user's data without explicit permission
3. There are no privilege escalation exploits in the OS - tricks that limited users could use to gain administrator privilege without having to enter the administrator password
4. There are no remote exploits in the OS itself - in the kernel, standard drivers, basic services, etc.

So, we have this idealized, nonexistent operating system; let's call it Qunix. How exactly, then, would the world look if Qunix had 95% market share? Would this be, as the average Slashdotter seems to believe, a secure and malware-free utopia, where nobody knows what viruses, worms, spyware, or security breaches are, because they don't exist?

The answer, actually, is somewhat depressing: the world would look pretty similar to how it looks right now. Malware and security breaches would still be prominent, the security industry (anti-malware products) would still be big business, and the black hat industry would have similar job security. Granted, the nature of malware would be different, but that would not make it any less prolific or dangerous.

Ultimately, those four criteria I specified have one intended goal: to put everything the user does in a sandbox, where it can't harm the OS or other users (this was how Windows NT was originally envisioned, but time has proved that hope misplaced). Let's assume, for the moment, that these measures achieve that goal (we'll come back to why they don't, later). With this assumption, it becomes impossible for a piece of malware (or a hacker exploiting a buffer overflow, or some such) to invade the kernel, either to destroy the system or to merely hide its existence from the user and malware scanners (a rootkit, in other words).

Unfortunately, while there's no denying that this would make the lives of evil-doers harder, this is anything but the doom of malware/security breaches. Even without the ability to harm the OS itself, a piece of malware could still damage that user's data, and data is often more valuable than the computer it resides on.

Furthermore, the ability to invade the kernel is no requirement for a virulent piece of malware. While hiding is more difficult, creating a virus/worm/etc. that runs entirely in user mode is completely viable. Macro viruses, worms that spread through chat programs, and old-fashioned viruses that spread from a disk/e-mail to the computer and back would still be viable and common (although, amusingly, Windows is more resistant to this last type of virus than Linux). There would still inevitably be security holes in third-party applications allowing an attacker to get a foothold in the computer and execute code under the user's privileges, and the user could still get (their data) owned, without the attacker ever invading the kernel.

Thus, the necessity of anti-malware products would remain. Now, it would be reasonable to assume that anti-malware products would run with administrative privileges. However, this advantage of privilege would only make life more difficult for malware authors. While it would make it impossible to completely hide from a scanner running at higher privilege, there are many ways of obfuscating, evolving, and encrypting a piece of malware such that it is not readily recognizable by a malware scanner.

Clearly this could be overcome by the malware scanner being updated to respond to a new threat... but that's exactly how the world works right now: anti-malware programs must be kept up to date, or they will not be able to protect against everything that has been analyzed (not to mention the time between when a piece of malware is released into the wild and protection is added to anti-malware products). Consequently, malware analysis labs would still be working frantically, and companies would still have support contracts with anti-malware companies to keep their computers perpetually updated with the latest malware protection.

Now, let's make one final invalid assumption, for the sake of argument: that through a combination of various methods, such as security cookies, data execution prevention, and other manner of code hardening, it's impossible for an attacker to penetrate an application running on the computer (e.g. code injection into a web server, an office application executing code in a document, etc.). That leaves one final mode of attack, one which has been used for decades with incredible success, and one which all of the aforementioned measures combined can't stop: PEBKAC; that is, user naivety.

Even if you could stop all remote and automated methods of invading a system, it will always be trivial to trick the user into running something that is actually malware. This fact nullifies every one of the defense measures proposed previously. Even if a user cannot be attacked other ways, an executed program could wipe all their data. Even if a user only runs as an administrator to install new programs/drivers and perform administrative tasks, an executed "installer" could wipe the data of all other users, and an installed "driver" could install a rootkit for future or immediate use. Similarly, even an air-gapped computer (one which has no network connection at all) still remains susceptible to infection (remember, viruses were rampant on air-gapped computers long before networks or the internet entered the average home/business).

To give you an idea how easily malware can spread relying only on tricking users into manually running it, you only need to take a brief look at the Storm worm. While this worm has been revised and updated extensively over its life, it began as a humble executable that was e-mailed to people; when run, it infected the computer. This worm is now considered to compose the largest botnet in history.

Thursday, November 01, 2007

It's That Time Again

Time to register for spring classes.

So, what's on the plate this semester? After talking to the adviser, it looks like I have exactly 8 classes (3 units each) needed to graduate (I've already finished my biology major, including GE courses, so all 8 are in computer science - third year and fourth year courses). Some of these courses are mandatory, either because they're directly required by the major, or they're required as prerequisites for courses I absolutely want to take. The ones I need specifically, along with their description from the school catalog:

Programming Languages and Translation
Introduce both basic concepts of programming languages and principles of translation. The topics include the history of programming languages and various programming paradigms, language design issues and criteria, developing practical translator for modern programming languages.

Software Engineering
Basic concepts, principles, methods, techniques and practices of software engineering. All aspects of software engineering fields will be covered briefly. Software engineering tools are recommended to use.

Artificial Intelligence
Use of computers to simulate human intelligence. Topics include production systems, pattern recognition, problem solving, searching game trees, knowledge representation, and logical reasoning. Programming in AI environments.

Principles of Computer Graphics
Examination and analysis of computer graphics; software structures, display processor organization, graphical input/output devices, display files. Algorithmic techniques for clipping, windowing, character generation and viewpoint transformation.

Advanced Game Programming
Intermediate and advanced game programming techniques including 3D game development, realtime rendering, physic simulation, etc.

Game Development Project [thesisish thingy]
Individual or team develops realistic games based on the theories and techniques, present and demonstrate their work regularly.

I definitely need to take advanced game programming and computer graphics this semester. The other two slots are open. I'm thinking of taking AI, since that would be useful for E Terra. Unfortunately, that and compilers are at the same time (so are mutually exclusive), and they would both be used for E Terra AI :P I'm hoping to be able to use E Terra for the game project, but I won't be able to take that until the fall.

Besides those, I have a couple decisions to make. I have to take one programming language other than C++ - either Visual Basic, Java, or C#. VB is definitely out of the running, but I'm not sure whether Java or C# would be better. I'm learning some C# this semester because we use it in the game programming class (with XNA), but other than that I'm not sure which is better.

Finally, if my petition to drop one class (not mentioned), which the teacher says is unnecessary, is accepted, I'll need one more upper-division class for the units. Not too sure about which to take for that one. Here are the most appealing prospects, although none of them are something I'd be inclined to take if I didn't have to:

UNIX and Open Source Systems
Introduces the UNIX operating systems, various open source applications and systems, open source programming languages, and open source software development techniques.

Data Security and Encryption Techniques
System security and encryption. Current issues in security, encryption and privacy of computer based systems.

Advanced Operating Systems
The course covers internal structures of a modern operating system. The specific topics include processing, process communication, file systems, networking, and the I/O system. There are several programming assignments which include system calls, and other low level interfaces.

Web Programming and Data Management
Various techniques for developing Web-based database applications using software engineering methodology. Introduce concept and architecture of Web servers, Web database design techniques, client/server side programming, and Web application tools and techniques.

Or, I suppose there's always...
Independent Study
Special topic in Computer Science selected in consultation with and completed under the supervision of instructor.

Internship in Computer Science
Practical experience and service learning relevant to computer science in industry or organizations. Written and oral reports are required.

Hmmmm. Are you thinking what I'm thinking, Pinky?

Thursday, October 11, 2007

Annoyance

How dost thou suck, Thursday? Let me count the ways. Well, my Visual C# Express just stopped working due to me not registering it; more specifically, due to a bug in Visual C# that makes it not accept any registration numbers, regardless of whether they are valid. Did I mention that MS offers NO technical support for the express editions (and XNA only works with Visual C# Express)? So no working on E Terra (or anything else related to video game development class, for that matter), for the time being.

On a related note, I still have no programmers who have volunteered to work with me on it. Although arguably that's my fault for waiting so long to get started. As well, the forum of the video game design club is broken and not accepting registrations, so I can't post help wanted ads on their site. In either case, doing all the coding for an unsimple RTS myself will be... interesting.

Finally, Live Spaces is broken today and not allowing any new blog posts, so I can't even blog about E Terra (I'd already written up a new entry to post).

Saturday, October 06, 2007

E Terra

Okay, my new blog E Terra is now online. No, I'm not giving up on Q & Stuff (which is still my main blog); the first post explains everything.

Saturday, September 29, 2007

& Homework Solution

This problem is actually very straightforward, and does not require much cunning to solve. By definition, raising a number to a power is the same as multiplying the number by itself repeatedly (e.g. a^3 = a*a*a). However, polynomial expansion, a 7th grade math principle, yields a trick that can be used to compute the partial result of each multiplication (az % x) without ever computing the entire result of the exponentiation (a^b). A rudimentary proof of the trick:

az % x

Decompose a and z into portions which are multiples of x and portions which aren't: a = xm + n, z = xo + p
= (xm + n)(xo + p) % x

Polynomial expansion
= (x^2*mo + xmp + xno + np) % x

(y + z) % x = (y % x + z % x) % x. The extra % x at the end is needed because the sum of the remainders may exceed x.
= ((x^2*mo + xmp + xno) % x + np % x) % x

Extract common multiple
= (x(xmo + mp + no) % x + np % x) % x

xy % x = 0 by definition
= (np % x) % x

Two modulus operations in series are redundant
= np % x

As the result of each multiplication/modulus will always be less than x (which must fit in a word), there is no need for big numbers or for many multiplications for each step (for comparison, the most straightforward method of multiplying big numbers requires n^2 primitive multiplications, where n is the number of digits in the big numbers). Thus this algorithm is O(power) in the worst case.

This process can be further optimized by noting that a^b % x = (a^(b/c) % x)^c % x when b is a multiple of c. Thus divide and conquer can be used with recursion to eliminate duplicate work in the case where the exponent is a multiple of c. With this optimization, the algorithm becomes Ω(log power) in the best case; the example I gave in the original post was a best case scenario where the exponent was a power of two, and required only 62 multiplications and 62 divisions to solve.
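
For what it's worth, the usual square-and-multiply formulation generalizes this to any exponent (not just ones that factor nicely) while still keeping every operand within a machine word or two; here's a minimal sketch for a 32-bit modulus:

#include <cstdint>

// Square-and-multiply modular exponentiation: the same "reduce mod x after
// every multiplication" trick as above, combined with repeated squaring, so
// it takes O(log pow) multiplications for any exponent. With a 32-bit
// modulus, every intermediate product fits in a 64-bit word, so no big
// numbers are needed.
uint32_t PowMod(uint64_t base, uint64_t pow, uint32_t mod) {
    uint64_t result = 1;
    uint64_t b = base % mod;  // only the remainder of the base matters
    while (pow > 0) {
        if (pow & 1)
            result = (result * b) % mod;  // fold this bit's factor into the result
        b = (b * b) % mod;                // square for the next bit of the exponent
        pow >>= 1;
    }
    return static_cast<uint32_t>(result);
}

Plugging the numbers from the "random fact" in the earlier post into PowMod makes for a quick sanity check.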

Another problem in the same homework is proving much more difficult, however: write an algorithm using divide and conquer to determine, given an unsorted array of values, whether there is a single value that appears more times than half the number of elements in the array; the algorithm must be generalizable such that you can only compare elements for equality - you can't tell whether one element is greater or less than another (which rules out sorting and creating any kind of data structure to count them). Those constraints, taken together, are really good at resisting solutions. I've figured out a fact about the problem that I think will be important in solving it (a condition that, if false, guarantees that such a value doesn't exist; however, if the condition is true, it's still possible a majority value doesn't exist), but I haven't quite figured out how to solve it, yet.

Thursday, September 27, 2007

& Homework Fun

Got this question on a homework assignment in algorithms class, and thought it was pretty cool (or rather, the solution is pretty cool): develop an algorithm to calculate x^pow % mod, which does not use big numbers (numbers that take more than one processor word to store, and more than one processor math operation for each calculation).

Random fact: 1928374655647382910^9223372036854775808 % 4123456789 = 3021338089

In other news, posts about generally cool things will now be tagged 'squishy', rather than the previously used 'shiny'.

Wednesday, September 19, 2007

Logitech: At It Again

So I just spent an enjoyable period attempting to contact Logitech customer support through their Email Us link in their support site. Shortly after sending the first one, I felt inclined to send this one:
I take it you people don't like getting support requests, and do as much as you can to discourage them. There's a rather gaping bug with your support site, where it will say invalid e-mail address/password on attempt to login, tell you that the e-mail address is not in the database when you attempt to get your password sent to you, then tell you that the e-mail address is already in the database when you try to create a new account using that e-mail address. This occurred with both my primary address and backup address. I wasn't able to manage to register until I tried this one, which isn't even mine.

As you can see, this has nothing to do with the mouse I entered (that was a different support request from a few minutes ago). I just thought you might want to know that your web site makes people want to hurt you physically, and I needed to enter product info to contact you at all.
I thought I wrote about Logitech's technical incompetence a year or so back, as well, but I guess not. Maybe that was on my todo list...

Friday, September 14, 2007

Public Service Announcement

Microsoft Office 2007 Ultimate for $60 through MS special to students in eligible countries (US, CA, UK, FR, IT, SP). That's 91% off retail (not that anyone would pay that).

Buy here
Verify offer is legit here

We now return you to your regularly scheduled doldrums.

Tuesday, September 11, 2007

I Like It! I'll Take It!

So, I've been reading my new book, Introduction to Typology: The Unity and Diversity of Language. I've just encountered something I like a lot, and plan to use it in Caia. Let me attempt to give a very brief amount of background and explanation.

Not all languages have tense. And some languages - both those with and without tense - have something called aspect. Aspect indicates some facet of time about a verb in ways that are more complicated than tense. For example, some languages conjugate the same verb differently (without a helper verb, which is what English uses) depending on whether the subject is starting to do something (e.g. "began to eat"), finishing something (e.g. "finished eating"), etc.

One very common class of aspect (although usually this aspect is intrinsic to the verb, and does not have multiple conjugations for the same verb) is the distinction between state, achievement, activity, and accomplishment. The state aspect indicates that the subject is maintaining some preexisting state. The achievement aspect indicates a change in state. The activity aspect indicates some act that is self-contained and does not include a change in state. Finally, the accomplishment aspect indicates that the subject has caused an action with regard to some other object.

Some examples in English, using the verb 'break' (note that English cannot perfectly distinguish these four aspects for individual verbs, so try not to get confused by the ambiguity). The state aspect: "the window is broken" (this is a little ambiguous; specifically what that says is that the window was broken before the present time, and continues to be broken). The achievement aspect: "the window broke" (also "the window was broken", but that could also indicate the state aspect). The activity aspect: "I ran to the window". The accomplishment aspect: "the ball broke the window".

As this very elegantly accounts for the active and passive voices, as well as both transitive and intransitive verbs, I'm gonna definitely use those as some of the mandatory aspects of Caia (some aspects - the most useful and common ones - and possibly tense are included in the basic conjugation of the verb, while other aspects are formed, as in English, either through phrases or extra words such as adverbs). Though I may combine the achievement and activity aspects, as I'm hard-pressed to find a verb for which activity and achievement are not mutually exclusive (note how 'break', by its very definition, could not display the activity aspect, so I had to use a different verb). And for anyone who's curious: the jury is still out on whether Caia will use tense.

Monday, September 03, 2007

Sprachbund - UPDATED

So, I'm attempting to evade work (again), and was struck by another random thought. I mentioned previously that Japanese and Korean adjectives are really intransitive verbs, which use relative clause structure to act as adjectives. I also mentioned that this seemed to be an integral part of Korean (e.g. the suffixes for adjectives and verbs are the same), while there's some evidence to the contrary in Japanese - adjectives have different suffixes, all but one of which have the form -k_, leaving the possibility that Japanese adjectives were true uninflected adjectives at one point, and had some suffix beginning in k affixed to them.

Well, there's a linguistic principle known as sprachbund. That's German for roughly "language union" ('Sprache' being the German word for 'language' or 'speech'). This is the principle that, when you have two or more languages in contact for a long period of time (and, especially, when there is a high degree of multilingualism in the population), the syntax (word order) of unrelated languages will come into sync. Other things also happen, such as word borrowing and language mutation, but those are beside the point.

What if, at some point in the past, Japanese and Korean (for the purpose of this hypothesis, we'll assume the languages are unrelated; the relation of the two - or lack thereof - is still in dispute, so this is not a radical assumption) populations mixed extensively? It's possible that Japanese acquired the Korean syntax of adjectives, transforming Japanese adjectives into intransitive verbs.

If this were the case, it's possible Japanese acquired other aspects of Korean, as well. For example, I saw in one book from the early 1900s that you could use the '[subject] [verb]' form in Japanese as well as the modern forms '[topic] wa [verb]' and '[subject] ga [verb]' ('wa' and 'ga' mark the topic and subject, respectively; at least in these forms, they do) at one point in the past, whereas in current books you only see the 'wa'/'ga' forms. It's interesting to wonder if Japanese might have picked up the topic and/or subject particles from Korean; especially so considering that 'ga' originally meant something completely different, and has been crudely repurposed into the subject particle, in a downright bizarre manner. Though you should take that idea with a grain of salt, as I haven't researched that possibility at all, and don't even have the original book I read about that in anymore (returned it to the library).

In other news, I just lost contact with my computer at work (more than a thousand miles away, on a holiday weekend). I guess that means work's done for this weekend.

UPDATE: Okay, got a reply from a Japanese-speaking classmate (naturally, as I do most of my blogging on a whim, I didn't wait for a reply to the e-mail I sent before posting :P ). He says that the "[subject] [verb]" form is still used in informal speech. So that probably rules out my second item of speculation. I'm still interested in whether there is any truth to my Korean adjectives hypothesis, though.

Sunday, August 26, 2007

Writing Systems and...

As with most of my posts, this post is written on a whim. Specifically, I just happened to be talking with someone on an IM about this topic earlier today, and I had the whim to write a post about it.

There are four major types of writing systems: alphabets, abjads, syllabaries, and ideographic/logographic systems (those two are not the same thing, but for our purposes they're in the same category). You should be very familiar with alphabets. Alphabets are writing systems in which, more or less, each character corresponds to a single spoken sound, though the nature of language change makes it impossible for there to be a 1:1 relationship between sounds and letters indefinitely (some alphabets originally did have a 1:1 relationship). Some common examples are the Roman and Greek alphabets, Roman being perhaps the most widely used writing system in the world.

Next are the abjads. I believe I mentioned these a few posts ago. These are incomplete alphabets - that is, only some sounds (typically consonants) are written, and the rest are omitted. Sometimes the additional sounds may be represented by diacritic marks. Arabic, Hebrew, and Tengwar are examples of this. Arabic may, depending on the writer and the application, either omit short vowels (long vowels have characters just like consonants), or represent short vowels with diacritic marks (these marks are what give Arabic its distinctive glittery appearance, in cases where short vowels are written). Some writing systems may have an implied vowel after each consonant, and diacritics are only written when the vowel differs from the implied vowel; some may even have characters for all sounds, but omit some at the discretion of the writer.

Syllabaries, also mentioned previously, consist of one character for each syllable possible in the language (though they are also subject to loss of 1:1 correspondence due to language change). The syllabaries most familiar to me are the Japanese hiragana and katakana, although there are others.

Finally, we have a class that I don't know of a single name for, which includes systems of ideographs and logographs. Logographs are when a single character represents an entire word - that is, there is a (near) 1:1 relationship between words and characters; the most well known logographic system, and the other contender for most used writing system in the world, is the Chinese writing system. Ideographs are similar, but in this case each character represents some idea or abstract concept. Contrary to common misconception, Chinese is not an ideographic system; however, the Japanese use of Chinese characters bears some resemblance to an ideographic system, where words frequently use multiple kanji (which is part of what makes the Japanese writing system harder than the Chinese system, as I mentioned in the distant past). For example, the Japanese word for goat (山羊) is written mountain (山) + sheep (羊).

Finally, there are complications/impurities in all of these. For example, some writing systems, such as Devanagari (an abjad with implied vowels), have one character for each consonant/syllable, but also have some characters which represent combinations of consonants/syllables; in other words, some characters may be combined to form entirely new compound characters in nontrivial ways.

Next post, resolve willing, I'll get to where I'm going with this (and why I've written this post so briefly and hastily - it's only background for the main topic).

Saturday, August 25, 2007

Thanks (not)

Got my monthly notice that my monthly tuition payment was due, a couple days ago.

Statement Date [this is the date printed on it, not the date it was actually received]: 08/17/07
Payment Due Date: 08/14/07

Talk about helpful. Makes you wonder why they bothered to send it at all; could have saved some on postage.

Tuesday, August 21, 2007

Awesome!

Blizzard negotiating with researchers for virtual epidemic study

Around this time last year, a strange phenomenon struck the virtual inhabitants of World of Warcraft. A disease designed to be limited to areas accessed by high-level characters managed to make it back to the cities of that virtual world, where it devastated their populations. At the time, Ars' Jeremy Reimer noted, "it would be even more interesting if epidemiologists in the real world found that this event was worthy of studying as a kind of controlled experiment in disease propagation." The epidemiologists have noticed, and there may be more of these events on the way for WoW players
...
On balance, the analysis in Epidemiology felt that virtual worlds might provide a useful supplement to traditional models of disease spread, and suggested working with game programmers to test a variety of disease conditions. "Multiplayer online role-playing games may even be useful as a testing ground for hypotheses about infectious disease dissemination," the author said, "Game programmers could allow characters to be inflicted by various infectious diseases, some of which may not be visible to the player, and track the dissemination patterns of the disease in specific subpopulations." It looks like something of the sort is in the works. A report from the Agence France-Presse indicates that Nina Fefferman, a researcher from Tufts University, is currently negotiating with Blizzard about running epidemiological tests in WoW.
Maybe I should go apply at Blizzard, now :P

Thursday, August 16, 2007

Everything Coming Up Roses

So, I've arrived back at home, after spending the summer at my job from last summer (the place Skywing works, and invited me to). Fortunately, the trip home was very uneventful (no jet engines falling off, getting flown into major landmarks, or anything). So, one of the first things I have to do is catch up on the stuff I've neglected over the summer, for the reason that it would have been difficult to do them away from home.

First, I checked what manga came out over the summer. Boy, did that surprise me (and in a good way). It appears that two series I really liked (Gunslinger Girl and Yotsuba&!), which I thought had been discontinued (the last releases were over two years ago), have been resumed. Also, House season 3 is coming out very shortly, and a few other things that were expected also came out, as well. I guess my birthday came exactly 1 month early, this year :P Altogether, I'm ordering the following (and I'd recommend all of these series, with the possible exception of GTO: The Early Years):
- Crest of the Stars novel (the manga based on the novel) part 3
- Death Note volume 12
- Gunslinger Girl volume 4
- GTO: The Early Years volume 4
- School Rumble volume 6
- Yotsuba&! volume 4

Also, as I've got more money (not that I didn't have money before; I was just too cheap to spend it), I'm gonna get some soundtracks I'd been putting off for various reasons. Lo and behold, it appears that there's a new Wild Arms 1 soundtrack from last year. The original "OST" of Wild Arms 1 was released with the game, in 1997 (that's actually more recent than I was thinking). However, there were a number of problems with it. If memory serves, it used different arrangements and synthesis than the original music from the game; as well, it only included about half the tracks. This new one has all tracks and is at least close enough to the game version that I can't tell it apart from memory (I had no problem telling the "OST" from the game). So, I'm buying:
- Final Fantasy XII OST. Had downloaded this and really liked it, but put off buying it due to laziness.
- Wild Arms Complete Tracks
- Xenogears OST. I never bought this because I ripped all the music straight from the console; but I suppose I should buy it as a token gesture at some point.
- Xenosaga I OST. Part of the same series as Xenogears, and by the same composer. Downloaded this to try it; while I'm not as fond of it as some others, I suppose it's worth the money, at least.

Lastly, I should list the various anime I've been watching/manga I've been reading over the summer. Describing each one would be too involved for my laziness, so I'll just let Anime News Network do the talking.
- Bleach
- Bokurano
- Busou Renkin
- Code Geass. Only got this one from Squid; haven't watched it, yet.
- D.Gray-Man. Don't ask me what's up with the format of the name.
- Death Note
- Lucky Star
- Negima
- Pumpkin Scissors. Strangely endearing.
- Skip Beat. *shrug*
- So Long, Mr. Despair

And now I need to go send in my two mice for warranty replacement. One has a tendency to fall asleep at inappropriate times, and the other tends to wander.

Sunday, August 12, 2007

& Debates - Quantum Physics

So, I had a random thought that started a debate thread on Star Alliance (remember that from way, way back?). And boy is it a whopper. Registration on the forum is required to participate, so I'll copy some of the bigger posts in the debate here.

The opening post:
So, I had a random thought; that's, of course, rarely a good thing. Now, let's see if this can turn into a full debate. *ahem* Have you ever considered that some of the most puzzling aspects of quantum physics could be logically explained by the universe being a computer simulation? Let's go over a couple examples.

- As best we can tell, mass, distance, and time all appear to be quantized; that is, they're integer values. Any computer constructed by a physics system remotely like ours is only capable of representing quantized values.
- One of the hardest to grasp concepts in quantum physics is that variables associated with things, particularly subatomic particles, don't appear to have values assigned until that variable is actually used, and values can even be lost once they are assigned. When those variables do not have values assigned, they are represented simply by probability distributions, with the actual value chosen randomly when it is needed. There's a saying in computer science (one of those things that you should be careful not to take too absolutely) - never store anything you can recalculate later; quantum physics appears to take this one step further, not bothering to store anything that you don't need at the moment. The point, of course, is to massively reduce the amount of memory required for the simulation by only storing essential values.

Input?
My most recent post:
I myself have considered that the Planck length and Planck time could be a limit to the universe's "resolution".
Of course. As far as we know at the moment, the Planck constants are the resolution of the universe.
If the universe was a huge simulation, would everyone else be part of the simulation, or separate entities within the simulation, or all we all part of the simulation without any free will.
Unknown. There does appear to be full simulation of individual units (unknown exactly what those are; protons/neutrons/electrons/photons, quarks, etc.) in some cases, but it's possible there are optimizations to process clusters of those as a group. Perhaps the fact that particles also behave like waves is a trick to allow the behavior of particles to be calculated in bulk at once, using simpler computations. There are other examples, as well. Some things cells do seem odd, in that it doesn't entirely seem like all the atoms in a cell are working independently; in some cases the cell or part of a cell seems like it's acting as a single unit.

Of course there are counter-examples, as well. If the simulation was abstracting as much as possible, it's strange that Brownian motion would exist, as it indicates individual atoms are being processed in a case where it would be logical to abstract them into a group.
Are we all bots on a counterstrike server, is just one of us a bot, or are there no bots?
There would be at least three basic possibilities, which could be combined:
- We are being controlled by players, either directly (e.g. an FPS or RPG), or indirectly (the player is able to manipulate things like basic predispositions, though actual actions are a result of physical processes using those basic predispositions; think Populous). Possible, but seems less likely to me than the other two.
- We are entirely naturalistic constructions. That is, there is nothing different about us than anything else in the simulated universe; we are nothing more than a result of the laws of physics being simulated. This is the atheist/naturalist world view.
- We are programmed constructs - AIs. We have programs which operate independently but within the constraints of the laws of physics.

But holy crap. Now THAT is an interesting idea. Obviously you could call the programmer(s) God (in the monotheistic, omnipotent sense). But it's also possible that different programmers and/or players (if such a thing exists) could form an entire pantheon. In that case, it's entirely possible that every god that has ever been worshiped throughout history actually does exist (or existed at one time; it's possible gods "die off" as players/programmers lose interest in playing them).

Thursday, August 09, 2007

Of Codes and Languages - Trans-Roman Alpha

For those who haven't played it, World of Warcraft contains 10 different playable races organized into 2 factions. Each race has its own distinct language, with 1 on each faction known by all races of that faction. Most text spoken by players or other characters in game is in a particular language. If you know that language, it appears in English (or whatever language) exactly as it was typed/said. If you don't know that language, it gets translated into the language of the character that said it, making it unintelligible to you. Besides lending a touch of realism to the game, this also serves as a language barrier to prevent communication between opposing factions.

For example, the following items are spoken by some of the enemies in one instance, in Common - the native language of humans, and, as the name suggests, the common language known by all races on the Alliance:

Common: Andovis ras waldir
English: Release the hounds!

Common: Ras garde hamerung nud nud valesh noth. Hir bur dana bor.
English: The light condemns all who harbor evil. Now you will die.

If you extrapolate carefully from a handful of data points, you can get an idea of how this works just from these two examples. Blizzard has created a small (several dozen words) vocabulary for each language - approximately 6 words of each length. When translating text, each word is processed individually: the word is hashed, and the hash is used to choose a translated word of the same length, in a lossy, many-to-one relationship. An elegant, simple but effective algorithm.
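
A rough sketch of the mechanism (the vocabulary, hash, and word handling here are invented; Blizzard's actual tables and hash are obviously different):

#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Invented stand-in for the per-word translation described above: hash each
// word and use the hash to pick a fake word of the same length from a small
// per-length vocabulary - a lossy, many-to-one mapping. Assumes the
// vocabulary has entries for every word length that appears in the input,
// and ignores punctuation and case.
std::string Translate(const std::string& text,
                      const std::map<std::size_t, std::vector<std::string>>& vocab) {
    std::istringstream in(text);
    std::string word, out;
    while (in >> word) {
        const std::vector<std::string>& choices = vocab.at(word.size());
        std::size_t pick = std::hash<std::string>{}(word) % choices.size();
        if (!out.empty()) out += ' ';
        out += choices[pick];
    }
    return out;
}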

This got me thinking - could you create a coding system such that you could reversibly encode data in something that looks like a foreign language? The point, of course, being to use the fact that it looks like a language as a decoy, while the information is actually in something like an encrypted form.

My first attempt at this was an algorithm I called Trans-Roman Alpha ("trans-Roman" because it used the Roman alphabet). This was an extremely simple algorithm: reversibly convert a word into a numeric form (basically treating it as a base-26 number), then decode it back in the opposite direction, using a different mapping of "digit" to letter. A few other complications were also added in, such as word fusion and splitting to hide the original word lengths. Some familiar phrases, in Trans-Roman Alpha:

"pqr bxq pgy psc dywddw jjf"
"psc dynts bl dytg djp mckgy bzz cy gcxy sdwcn jfydkltd yd htc yn r vy lzt fcypt jc"

As you can see from those phrases, the algorithm is too dumb to form pronounceable syllables, which means the best it can do is either work with a syllabary like Hiragana, where each character represents one syllable, or use an abjad writing system: one where only consonants are written. In the latter case, the resulting phrases would be far longer than the source text, making it somewhat impractical.

While using a syllabary would not have the length problem (and would in fact be about the best you could do with this algorithm), both it and the abjad solution share a more significant problem: both generate a fairly even distribution of characters in a nearly random order. That is entirely unlike real languages, where character frequencies are far from uniform and character order is far from random (encryption systems, on the other hand, produce both). Such a system would be unlikely to hold up to even the most basic tests used to identify languages (or at least make a best guess at them), and so wouldn't be particularly likely to fool anyone knowledgeable about the topic.
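
To make "the most basic tests" concrete: simply comparing letter frequencies against a known language already does the job. A quick sketch (the frequency table is approximate, and a real language identifier would also look at digram/trigram statistics and compare against many languages, not just English):

#include <ctype.h>
#include <stdio.h>

/* Approximate English letter frequencies, in percent (a..z). */
static const double english[26] = {
    8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 4.0, 2.4,
    6.7, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074
};

/* Chi-squared statistic of a text's letter distribution against English.
 * Lower means "more English-like"; random-looking consonant soup scores
 * much higher than real English text. */
static double chi_squared(const char *text)
{
    long counts[26] = { 0 };
    long total = 0;

    for (const char *p = text; *p; p++) {
        if (isalpha((unsigned char)*p)) {
            counts[tolower((unsigned char)*p) - 'a']++;
            total++;
        }
    }
    if (total == 0)
        return 0.0;

    double chi = 0.0;
    for (int i = 0; i < 26; i++) {
        double expected = english[i] / 100.0 * total;
        double diff = counts[i] - expected;
        chi += diff * diff / expected;
    }
    return chi;
}

int main(void)
{
    printf("english text: %.1f\n", chi_squared("the quick brown fox jumps over the lazy dog"));
    printf("trans-roman:  %.1f\n", chi_squared("pqr bxq pgy psc dywddw jjf"));
    return 0;
}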

Sunday, July 22, 2007

Middle East Loses What Little Was Left of Its Sanity

A few weeks ago, 14 squirrels equipped with espionage systems of foreign intelligence services were captured by [Iranian] intelligence forces along the country's borders. These trained squirrels, each of which weighed just over 700 grams, were released on the borders of the country for intelligence and espionage purposes. According to the announcement made by Iranian intelligence officials, alert police officials caught these squirrels before they could carry out any task.
- Iranian newspaper

Word spread among the populace that UK troops had introduced strange man-eating, bear-like beasts into the area to sow panic.

But several of the creatures, caught and killed by local farmers, have been identified by experts as honey badgers.
...
UK military spokesman Major Mike Shearer said: "We can categorically state that we have not released man-eating badgers into the area."
- Iraqi rumor

"We can categorically state that we have not released man-eating badgers into the area."
That is just the most godly quote ever.


Aww, aren't they cute? You'd never know they're man-eating badgers of death!

Saturday, July 14, 2007

Trique

Among various other things, the last few days I've been looking a bit more at the Trique language. I even started writing a mini-dictionary, currently several dozen words, based on what I can make out from this Bible; it's more to help me remember the meanings of words I've figured out (and the meanings I've assigned them, in case a different passage contradicts a meaning and I have to reconsider it) than a benefit to anyone else. Between what I've learned on my own and an article my grandpa sent me on Trique morphology (the way words change form in different circumstances), which covers some of the most complicated grammar - I've actually tried to keep the information I get from my grandpa to a minimum, as I want to figure out as much of the language as I can on my own, just by reading this Trique Bible - I thought I'd write a blog post on some of the interesting features of Trique: those that are notably different from other languages, especially common Indo-European ones like English. At some point I might try to expand this post (and have it reviewed for correctness by my grandpa, a linguist fluent in Trique) into a Wikipedia entry on Trique.

First, my grandpa believes that Trique was initially an isolating language, though through various fusions it has become somewhat inflected (words change form in different circumstances). An isolating language is one in which each word carries exactly one piece of information. For example, while English indicates the plural of a noun with the suffix -(e)s, an isolating language would either have a separate word that indicates that the noun is plural, or only express plurals through numbers (Japanese is like this, although it isn't an isolating language; in Japanese, the number of nouns is usually not mentioned at all, and when present is usually simply specified by a number word). Likewise, while English specifies tense with various suffixes like -ed, isolating languages, if they specify time at all (some languages don't have tense at all), will add an additional word that indicates that the verb is a specific tense; Chinese is like this, though I don't have a specific example.

Trique still resembles an isolating language, though it has acquired some degree of inflection. For example, plurals are indicated by adding separate words; I'm not sure if these are particles, adjectives, or something else. The third person pronoun in Trique (its base form, anyway) is "si3" (numbers indicate tones). To indicate the third-person dual (two people/things), you would say "ngüej5 si3". To indicate the third-person plural (I'm not sure exactly when you start using the plural; I've only seen 'one', 'two', and 'many', though it's possible I just haven't seen 'three' or 'four', etc.), you would say "nej3 si3". See how that works?

Unfortunately, Trique has become more complicated over time, in a fairly systematic way. There is evidence to suggest that prefixes and suffixes have developed through fusion - what were at one time two distinct words have become fused together, with the modifier becoming a prefix/suffix (both of which are seen in Trique). While some affixes are more obvious than others, in general this has severely degraded the "purity" of the language. For example, the verb 'ask' is, in its simplest indicative form, "achin21". The same verb in the simplest anticipatory form (don't ask me what the anticipatory form is; I don't know yet - that's just what my grandpa called it) is "ga5chinj5". Even more interesting, you could say "I ask" (indicative mood) using the fused form "achin23" (there is no longer a separate word for 'I' in the fused form). As you can see, in this case, whatever the original suffix was, its sounds are completely gone, leaving only a change in tone. This should give you an idea of what I meant by saying this has made Trique morphology rather complicated.

Even more interesting, however, is a system of anticipatory inflection (I'm not sure what the technical term for it is; perhaps some strange kind of assimilation?). That is, where a word changes based on the word that follows it, even though the two words do not fuse like they did in the previous paragraph. For example, recall the dual marker mentioned earlier. Most of the time it takes the form shown previously; however, if you combine it with "re'5" to form the second person dual pronoun ('you two'), you get "ngüej5e3 re'5" - the form of the number has changed, even though it hasn't fused with any prefixes or suffixes. As far as I can tell, this is merely interaction between two particular words in a particular order; it doesn't appear to have any inherent meaning.

Modern Trique appears to have two cases: a standard case and a possessed case. The former is used for words that stand alone (or are the possessor of something else), with the various roles this case can play (subject, object, etc.) indicated by word order. The latter is used for nouns which are possessed by something else (the noun immediately following them, in all the cases I've seen). While the latter seems to be unusual in its own regard (Wikipedia lists only Tlingit as having a possessed case), what's even more unusual is the fact that some words (namely family relation nouns) in Trique have both inclusive and exclusive possessed forms (I wonder if you'd call those separate cases or the same). The difference is made obvious by comparison of two examples: "nej3 dinï1 [brothers of] si3 [him/her]" ('his brothers') and "nej3 dinïj5 Judá" ('Judah and his brothers').

My first "major" discovery, when I was just starting to look at this Trique Bible (and still didn't know any of the language, nor had I talked to my grandpa about it, yet), was learning how quotes - sequences of direct text one person says - are handled. A quote from Mark 6:37 illustrates this structure:

English: But Jesus answered, "You give them something to eat."
Trique: Sani4 gataj34 Jesús gunï3 nej3 si3, Ga'ui'5 nej3e3 re'5 si3 xa4 nej3 si3. Daj4 gataj34 so'2 gunï3 nej3 si3.

With a quote this short (and the Trique info I've already given you), you should be able to make out the structure, though probably not the entire meaning. You can see that the quote is both preceded and followed by the phrase "gataj34 x gunï3 y"; this means "x said to y" (you could see that if you had more examples available to you). That is, Trique begins a quote by first saying who's saying what [to whom], and then concludes it by reiterating that fact (though one of the two may be omitted in short quotes; I actually had to do some looking to find an example that was both short and followed the typical conventions); you can't see it in a quote this short, but usually the former uses pronouns, while the latter uses the full names of the speaker and listener. This was also how I learned the two basic third-person pronouns (the first and second person pronouns were discovered somewhere else) and how plurals are constructed.

However, "x said to y" is not a literal translation. I knew from the beginning that "gataj34" meant 'said', but at first I thought that "gunï3" was simply a marker that indicated who was the target, experiencer, or some other role. This turned out to be incorrect. In fact, "gunï3" is a verb - to hear. What this is literally saying is "x said, y heard". This illustrates one of the interesting features of Trique: using short clauses with (preferably) intransitive verbs in series to form more complex thoughts and sentences.

This serial clause construction is also often used in places where languages like English would use relative clauses (I believe Trique does have relative clauses; they just aren't used as often), as is illustrated in the first line of the Lord's Prayer, which I posted previously. Another example (John 1:42) is also given, showing a somewhat different structure:

English: Our Father in heaven
Trique: Drej3 [father of] nej3 yunj2 [us] huuin2 [are] re'5 [you] nne2 [live in] re'5 [you] xata'4a [heaven].

English: You are Simon, son of John. But you will be called Cephas (Cephas means Peter).
Trique: Hue2 re'5 [you] huuin3 [are] Simón. Ni4 da'ni1 [son of] Juan huuin2 [are] re'5 [you] nej3. Sani4 hue2 re'5 dugu'na23 Cefas. Ni4 Cefas ruhuaj3 gata3 [means] huuej3'e [I'm guessing this is 'rock'].

Monday, July 02, 2007

Types of Multiprocessors

Following the Standard Operating Procedure for deciding what to write about for my blog posts (namely, whatever I have a whim to write at a particular moment; explains a lot, doesn't it?), today I'm gonna talk about the different types of multiprocessors. Multiprocessing refers to the parallel execution of one or more programs (or multiple instances of a program) across multiple processing units; more specifically, I'm going to talk about architectures where you explicitly code for parallelism, as opposed to implicit parallelism, like the multiple arithmetic units present in most modern CPUs. There are lots of different types of multiprocessors (speaking from an architecture standpoint), but traditionally (before EDGE came along) they fall into one of the categories I'm about to describe. These categories form a neat gradient from single processors to processor clusters.

The class of processors most resembling single processors is the Single Instruction Multiple Data (SIMD) processors - processors which operate on multiple data items using the same sequence of instructions. Vector processors, the simplest SIMD processors, allow you to store multiple data items in large registers (vector registers), then perform a mathematical operation on all data items in a single instruction. I gave an example of this a while ago, when I made an MMX function to convert a string to uppercase in blocks of eight characters per instruction (MMX registers are 64 bits). The eight characters were loaded from (contiguous) memory in one instruction, several mathematical operations were performed, each in one instruction, and ultimately the block of upper-case characters was written back to (contiguous) memory in one more instruction.
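
That old MMX routine isn't reproduced here, but a rough SSE2 equivalent of the same idea looks like this, processing 16 characters per iteration instead of 8 (a re-creation for illustration, not the original code):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>
#include <string.h>

/* Uppercase 16 ASCII characters per iteration; the leftover tail of the
 * string is handled with plain scalar code. */
static void to_upper_sse2(char *s, size_t len)
{
    const __m128i lo   = _mm_set1_epi8('a' - 1);
    const __m128i hi   = _mm_set1_epi8('z' + 1);
    const __m128i diff = _mm_set1_epi8(0x20);   /* 'a' - 'A' */

    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(s + i));
        /* mask = 0xFF for bytes in 'a'..'z', 0x00 elsewhere */
        __m128i mask  = _mm_and_si128(_mm_cmpgt_epi8(block, lo),
                                      _mm_cmplt_epi8(block, hi));
        /* subtract 0x20 only where the mask is set */
        block = _mm_sub_epi8(block, _mm_and_si128(mask, diff));
        _mm_storeu_si128((__m128i *)(s + i), block);
    }
    for (; i < len; i++)                        /* scalar tail */
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 0x20;
}

int main(void)
{
    char buf[] = "the quick brown fox jumps over the lazy dog";
    to_upper_sse2(buf, strlen(buf));
    printf("%s\n", buf);
    return 0;
}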

The problem with vector processors is that they can only read vectors from memory at regular intervals - either contiguous memory, like MMX/SSE, or every X bytes, as I'm told Cray computers can do. If you don't like it, you can manually load each value into a general register, then push it into a vector. Obviously this will be slow, as it requires a couple instructions per data item, as opposed to loading a single vector in one instruction. Though you may not have much choice, if all you've got is a Core 2 with SSE3.

Stream processors, taking this SIMD idea one step further, answer this problem. While a vector processor is like a single CPU operating on an array of data items, a stream processor is like an array of processors, each with its own word-sized registers, all working through the same sequence of instructions. Although every pipe (for lack of a better word) executes the same instructions, each pipe has its own set of registers. If, for example, a particular register holds a pointer and each pipe has a different value in that register, each pipe will load a different value from memory and operate on it; what's more, the reads and writes to memory need not be contiguous. This is how Graphics Processing Unit shaders work.
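
To make the "each pipe has its own pointer register" idea concrete, here is a purely conceptual sketch in plain C; on an actual stream processor (a GPU shader unit, say) the loop body would execute on every pipe simultaneously rather than sequentially:

#include <stdio.h>

/* Conceptual only: each iteration of this loop stands in for one "pipe".
 * All pipes execute the same instructions, but because each has its own
 * pointer value (ptrs[pipe]), each loads, scales, and stores a different,
 * not-necessarily-contiguous memory location. */
static void scale_scattered(float *ptrs[], int num_pipes, float factor)
{
    for (int pipe = 0; pipe < num_pipes; pipe++)
        *ptrs[pipe] *= factor;      /* each "pipe" dereferences its own pointer */
}

int main(void)
{
    float data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    /* Scattered, non-contiguous targets - something a plain vector load can't do. */
    float *ptrs[3] = { &data[0], &data[5], &data[2] };
    scale_scattered(ptrs, 3, 10.0f);
    printf("%g %g %g\n", data[0], data[5], data[2]);   /* 10 60 30 */
    return 0;
}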

Of course, stream processors still have the limitation that there's only a single program, even if it's being executed in parallel. To that end, we next have Multiple Instruction Multiple Data (MIMD) processors, or, more specifically, Symmetric MultiProcessing (SMP). SMP systems consist of two or more relatively independent processors or cores sharing a common bus and memory, each executing their own program. 'Symmetric' refers to the fact that all CPUs have equal access to the shared hardware; (with some exceptions) a request to something like memory will be processed exactly the same, regardless of which processor made the request. This is what current x86 CPUs are, though I've heard that AMD's next x86 architecture will be NUMA.
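
As a trivial illustration of the shared-memory model (not tied to any particular system), here are two POSIX threads incrementing one counter; on an SMP machine the OS is free to schedule them on different processors, and both see exactly the same memory - which is also why they have to synchronize:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* shared memory, so we must synchronize */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 2000000 */
    return 0;
}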

The problem with this is that there's an obvious bottleneck: the shared bus (or, for more innovative systems like the Opteron, which lack a shared bus, the memory itself). As the number of processors (and with them the number of memory and other hardware accesses) increases, this becomes a big problem for performance.

Consequently, systems with more than just a few processors are often Non-Uniform Memory Architectures (NUMA). On a NUMA system, there are two or more banks of memory, located at different places in the system. There may be one bank per X CPUs (perhaps those X CPUs and their memory bank reside on an expansion card), or one bank per CPU. In either case, each processor has access to all memory in the system, just like with SMP; however, the time to access a particular location in memory varies, based on which bank that address resides in. If each CPU keeps the data for its program relatively self-contained in its local memory bank, so that requests to other memory banks are rare, this allows significantly better performance in highly parallel systems.

The last bottleneck remaining in NUMA is the fact that any processor can request reads or writes from any memory bank. While this is good from a programming perspective, as the size of the system continues to grow, there comes a point where you can't afford to make remote memory requests - it's more efficient to perform explicit communication between processors (either synchronously or asynchronously) and keep a copy of all the data one processor needs in its own memory bank. This setup is called a cluster, and is now very common in distributed systems. Each processor (which is an entire computer, in the case of distributed computing) runs independently, but uses explicit communication to optimize costly transfers of data between systems. This is perhaps the hardest setup to program when dealing with a program that can't be neatly decomposed by domain (having each computer process a portion of a common data set), but it's becoming increasingly necessary.
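
Explicit communication in clusters usually means a message-passing library, MPI being the usual example. Here's a minimal sketch, assuming an MPI implementation is installed (run with something like mpirun -np 4): each process sums only its own local slice of the data, and the partial results are combined with an explicit reduction rather than by touching anyone else's memory:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Pretend each process owns a chunk of a larger data set. */
    double local_sum = 0.0;
    for (int i = rank * 1000; i < (rank + 1) * 1000; i++)
        local_sum += (double)i;

    /* Explicit communication: combine the partial sums on rank 0. */
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total across %d processes: %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}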

However, while clusters are often composed of whole computers, that isn't a requirement. For a rather peculiar example of a cluster-like system on a single chip, we need only look at the Cell - the CPU of the Playstation 3. The Cell is composed of one main core, which has access to main memory, and 8 mini-cores, each with (and limited to) its own local memory bank; blocks of memory are copied between the local banks and main memory by explicit asynchronous Direct Memory Access (DMA), analogous to asynchronous network I/O in network clusters. The combination of the 8 mini-core memory banks actually residing on the chip itself (cache, in other words) and memory copying being done asynchronously allows the Cell to process at blazing speed, with all memory accesses taking only as long as a cache hit (by definition). But between the architecture itself and the fact that game data tends to be interdependent (which makes breaking games down into that many threads very difficult), the Cell is incredibly (and notoriously) hard to make full use of. This makes the Cell a peculiar choice for a game system; it would probably be better suited to scientific applications, with simpler computations in large quantities.

Saturday, June 30, 2007

Japanese Adjectives Revisited

So, in between work, playing WoW (actually very little, these last few weeks), playing MechCommander 2, and watching anime (Bokurano, Lucky Star, D.Gray-Man, and Death Note), I've been researching Japanese adjectives and verbs a bit. You could call this an errata post on the topic.

Hmm, where to start? Well, first of all, I was a little bit off about the verb of a sentence and the verb of a relative clause being identical. I was right in that there isn't a difference between the two forms in modern spoken Japanese. However, according to An Historical Grammar of Japanese (don't ask me why it's 'an'; I don't know), the forms of verbs and adjectives in those two roles used to differ for some verbs (there are several conjugation patterns, formed by progressive breakdown of the original grammar) in the written form of Japanese.

Japanese is a bit different than English in that there are several forms of it, spoken and written Japanese being the ones most relevant to us. Compared to spoken Japanese, written Japanese (I'm not exactly certain what writings this includes, but I'd imagine things like text books and scholarly works) is rather archaic. By my understanding (keeping in mind that I'm hardly an expert on Japanese), written Japanese is to spoken Japanese what Old English or Middle English (somewhere between the two, to be precise) is to modern English. It's somewhat similar to spoken Japanese, but there's a lot of difference in things like grammatical suffixes (like using the -nu suffix in the written language to indicate a negative, but using the adjective 'nai' in the spoken language) - enough that it would take some effort for someone who only knew spoken Japanese to figure out what was being said in something using written Japanese. I suppose you could argue that modern English retains some archaic spellings in the written form as well (like writing 'knight', even though it's said more like 'nite'), but this generally doesn't affect grammar so much, apart from some common contractions like 'wanna' and 'dunno'.

Now, getting back to Japanese adjectives. Historically, there are five distinct base forms of Japanese verbs, two of which are of concern to us, here: the predicative, the form used for the main verb of the sentence; and the attributive/substantival, used for the verb/adjective of relative clauses and when using a verb/adjective as a noun. One thing that differs between spoken and written Japanese is that spoken Japanese has replaced the predicative form of verbs and adjectives with the attributive form.

Going back to our examples from last time (and noting that we're now into the realm of written Japanese, my knowledge of which rates just above nonexistent), "the dog is bad" would be written "inu wa warushi" (predicative form), while "bad dog"/"dog that is bad" would be "waruki inu" (attributive form; note also that 'waruki' has become 'warui' in the spoken language). Both predicative and attributive forms are the same for 'kamitsuku' in modern written Japanese, so I'll use a different verb for the example. "the dog eats" would be "inu wa taburu", and "eating dog"/"dog that is eating" would be "tabu inu" (note that the verb has changed to 'taberu' in spoken Japanese).

I've done a tiny bit of looking into the matter in Korean, and it appears there is a difference between predicative and attributive forms there as well. This is an interesting piece of information, as it indicates that in Korean (and in older Japanese) this predicative/attributive difference took the place of the relative pronouns in English and other Indo-European languages ('that', 'which', etc.). Of course, spoken Japanese has lost this interesting trait, replacing it with analytical methods, where the position of the verb/adjective alone indicates whether it's the main verb of a sentence or the verb of a relative clause.

Moving on to a different but related topic. Unfortunately, as attractive to me as the idea of a language where verbs and adjectives are one and the same is, there is some evidence in Japanese to the contrary, and this characteristic seems to me to be something Japanese (and by correlation Korean) evolved, rather than possessed in its original form (although it's not impossible that Japanese merely lost such an initial form and later reconstructed the functionality through different means; unlikely, but not impossible).

Taking a step backwards in time, we find that there are five bases to Japanese verbs: the predicative; the attributive; the conjunctive/adverbial (used as an adverb or to express things such as "he bled and died"); the imperfect (used for various compound forms); and the perfect (an action that occurred in the past which determines the present state; an example from the book is "Nara no wagie wo wasureteomoe ya" - "have I forgotten my home, Nara?"). Using the verb 'shinu' - 'die' (this is actually an irregular verb, but there is evidence that it is actually the original, true verb conjugation), the forms are 'shinu', 'shinuru', 'shini', 'shina', and 'shinure' (note that two of them appear to be compound forms, formed by adding a form of 'uru' - another verb meaning 'exist').

Now, compare those forms to the conjugation of the adjective; using the adjective with the root 'taka' ('high'): 'takashi' (predicative), 'takaki' (attributive/substantival), 'takaku' (conjunctive/adverbial), and 'takakere' (perfect). We can see three things, here. First, there is no imperfect form of the adjective; second, all of the adjective forms appear to be compound forms, with various suffixes tacked on to the root; lastly, three of the four conjugations have a 'k' in their suffix.

While this does not answer the question conclusively, to me it appears that initially adjectives were true adjectives (either that or nouns), consisting merely of the root, and over time were made to resemble verbs by the addition of various suffixes. This is in agreement with the fact that, in the oldest known Japanese writings (from about 800 AD), in addition to the forms just shown, we occasionally see the use of the root alone as an adjective or a noun (this fact brought to you by the same book).

Tuesday, June 26, 2007

Slashdot: The GPL vs. DRM

I suppose if I'm posting my Slashdot posts on my blog, I might as well post the other one as well (going by the theory that it's marginally better than no blog posts at all :P). The post I was replying to (though you may have to read further up to get a feel for the context):
If you don't want to respect their license, that's fine, but then you shouldn't expect them to respect the GPL either.
There's an inherent difference here. Microsoft's licenses try to restrict you from doing things you would otherwise have the right to do. The GPL gives you rights to do things that you would not otherwise have. If you don't want to respect the GPL, that's fine, but you'd essentially be a software pirate if you distribute GPL software in violation of its terms. On the flip side, if you violate some of Microsoft's license terms, you might not have done anything illegal at all (running Vista in a VM, for instance). So I really do see a huge difference between the two licensing models, and therefore a difference between the nature of respect for them.
My response:
That is a beautiful piece of logic you have there. If you violate the terms of MS' license, you're okay, because they were artificial and arbitrary restrictions, anyway. If you violate the GPL's equally artificial and arbitrary limitations, you're a pirate and a lawbreaker, because you've violated the terms of the license. See how absurd it is?

Now, I'm a programmer. I've recently been working on releasing a couple of my programs as open source, so I've had to take a good look at the various licenses, and see which one is closest to my ideals. Just about anything but the BSD license (and arguably even that, though that would almost be splitting hairs) is indistinguishable from DRM, save for one exception: most open-source licenses attempt to achieve maximal collective benefit (rights), while DRM seeks nothing more than to maximize the benefit (profit) of the creators. That is, DRM and source licenses both prevent you from doing things with the code/media that you would otherwise be able to do; if you think differently, you surely have given up the term "DRM" in favor of "consumer enablement" (which it actually looks like you have, from your post).

The CDDL, the license closest to my ideals, is based on a single restriction: that if you modify the open code, you have to keep the CDDL for your changes, keeping the work open; so long as this rule is followed, you can use the code in any way, in any project. This is an arbitrary restriction on the ability of other people to use my code. However, I justify this restriction with the reasoning that I want as many people as possible to be able to make use of my code (and thus any advances to it). I'm sacrificing the ability of individuals to use my code in an unrestricted manner for the calculated benefit of the whole programming community.

While the GPL does this as well, it does something else that I consider uselessly arbitrary (that is, it limits the freedom of users without contributing significantly to the common good) and, for that reason, particularly obnoxious. Anyone who's read the GPL knows what I'm referring to: the requirement that any project which so much as uses GPL code must itself be GPL in its entirety. This is a political rather than practical requirement: the GPL serves to promote free software, and will restrict the freedom of users to attempt to increase the amount of free code available in total. I'd imagine the reasoning is that if all software were free and open, the world would be a better place; but I can't really agree with the sentiment or the means used to achieve it. The LGPL is better, but not as close as CDDL to my ideals (if you want more info on the topic, I wrote a several-page justification of my choice of license on my blog).

My Stance on the RIAA/MPAA

Well, I started out writing a reply to a post on Slashdot. But by the time I finished (some hour later), I'd written something as long as one of my typical lengthy blog posts. So I figured I might as well post it on my blog :P

First, the original post I was replying to (from this thread):
So its pretty clear that going after individual copyright violaters is looked down upon on slashdot. I also remember back when napster was big everyone on here was upset that they were getting sued because they werent actually breaking the law, it was just the individual users.

So is it just one group who thinks that indivuals shouldnt be sued and a different group who thinks that the companies should be immune? How should the RIAA protect its intellectual property rights? Is it just a fundamental belief on here that copyright holders should have no recourse against violaters?
And my reply:
I'm not sure they can defend themselves, frankly. I guess I differ from most of the most vocal Slashdot people in that I believe P2P file sharing (specifically a subset of that - theft) is hurting the music and movie industries, and I believe that stealing content via P2P is morally wrong. That said, I oppose their suits on a number of grounds (possibly more than I can recall off the top of my head); to name a few:

- Even in the worst case scenario - that the downloader and everyone they uploaded to are truly stealing the content (morally wrong) - the suits they are bringing are orders of magnitude greater than the theft committed. $3k-$5k out-of-court settlements are nearly universal; yet most people will not upload much more than the amount they downloaded, and even in extreme cases not more than several times what they downloaded. So, if they downloaded a music CD worth $15, they have (assuming all parties are stealing) stolen (or willingly facilitated stealing) 2x-5x that (5x is a bit arbitrary), or $30-$75. Even at the high end of theft ($75) and low end of damages ($3k), that is *40 times* the value of what was stolen; the other way around ($30 and $5k), it's almost 200 times the value. Can you honestly tell me that you support that? No, really, I want you to type in your answer and click "submit".

- The RIAA has been using downright illegal tactics to bring these cases (see Beckerman's site for details about this). And I'm not throwing the term "illegal" around lightly, like many Slashdotters. I mean they are literally defying direct orders made *by the courts* in order to bring law suits that they would not have been able to bring using purely legal methods.

- The RIAA has been using legal but immoral tactics to bring these cases. Most commonly, the RIAA picks people who they believe do not have the resources to defend against the suit, making the outcome the same, regardless of whether they are guilty or innocent (and short-circuiting the justice system entirely): they have to settle, because they can't afford a lawyer to defend themselves (and if they do hire a lawyer, the RIAA lets them accumulate a sizable bill, then drops the case, so the people will usually have to pay more than the settlement would have been). This is done to build fear by maintaining a perfect prosecution record, to discourage others from sharing files or hiring a lawyer to defend themselves. I know of not a single case the RIAA has won - that is, the case has gone to verdict and the verdict was in their favor; if you know of a single case, click "reply" and link to it, right now. And because they drop cases once it becomes obvious they cannot win them, they have just the same never suffered a loss (though that tide appears to be slowly turning). They care so much about their perfect record that they will continue suits against people who they either knew from the beginning or learned during the trial are innocent, because dropping the case would show that they don't have absolute power, and diminish the fear they seek to create. Do you support each and every one of these methods? Again, that is not a rhetorical question; I am expecting a real answer.

- The RIAA needs to preserve their undefeated status because they cannot possibly catch all of (or even enough to save their business) the true P2P thieves out there, let alone distinguish the thieves from the ones whose uses are not immoral (such as seeing if they want to buy an album/movie legally). This is a painful exercise in futility. Their only (false) hope is to try to generate enough fear to make everyone they don't have the resources to sue stop sharing. This is *not* working, and cannot in theory work. While it's true that they may be able to reduce (not stop) theft via P2P in the US by going after web sites and individual P2P users, where they have the power of law suits, such suits cannot be brought in a majority of the world because they do not have jurisdiction (let alone the resources to sue that many people), meaning that even fear is outside their reach.

- DRM is also a painful exercise in futility, and cannot stop people from making or sharing copies of media for any purpose, including true theft. Furthermore, it provides no inhibition for hackers who seek to make the content available (again, for any reason), but greatly inconveniences the casual user who isn't technically inclined, and wouldn't share the content anyway. Nevertheless, such a technically illiterate user will have no trouble downloading media ripped by an aforementioned professional hacker, meaning all that inconvenience is for no benefit whatsoever.

- Even in the most draconian of outcomes - that the RIAA gets the government itself on their side and using government resources to hunt down copyright infringers, that only goes as far as the US, and we've already been over that.

Thus, in conclusion, the RIAA/MPAA are on a sinking ship, and even if they were on the moral side (that is, they didn't have any of the moral or legal problems I've outlined), I don't believe they could stop it. People throw around this "time to find a new business model" all the time on Slashdot, usually because they believe there's something wrong with the old one. I don't know if I'd agree that the old one is wrong (or that it should be abolished), but I think what I think and what the RIAA/MPAA think is irrelevant: the end is coming - you adapt (and perhaps get "right-sized") or you die.

Lastly, I should add that I'm not categorically opposed to what the RIAA/MPAA try to accomplish. I think it's great news when they and/or the government bust large-scale counterfeiting rings, and I'm certainly not one of those "Who cares, music should be free" hippies (that's an actual quote from Slashdot, by the way). And for what it's worth, I'm a big fan of watermarking content. Ideally, they should be going after the people who make the content (widely) available - counterfeiters and the ones who initially rip the content (as they make it available to many, many people, potentially creating damages huge in proportion to their own numbers; as well, it's much less morally defensible to rip copyrighted content, as you know you will personally fuel a huge amount of theft, without - in the case of sharers - so much as some benefit to yourself) - not the ones who download the content, which, as stated earlier, may not be immoral, depending on the context.