Deryk's stack: 2011

Tuesday, December 13, 2011

maximum-subarray problem

If we are given an array A, we now want to find the nonempty, contiguous subarray of A whose values have the largest sum. We call this contiguous subarray
the maximum subarray.

Clean way to illustrate basic algorithm design:
1: brute force algorithm: O(n^3)
2: algorithm that reuses data: O(n^2)
3: divide-and-conquer algorithm: O(n *lgn)

divide-and-conquer algorithm:
need consider two factors:
1: termination condition
2: combination of left and right .

1: termination condition:
if(size==1) return;

2: combination
    compare(left, right, combined);
    left and right can be obtained by recursion.
    combined is computed by array[i..mid]+array[mid+1...j];

   array[i..mid]: the max subarray beginning from mid. For example, suppose given array[0..9]. mid=4; array[i..4] means the max of array[4], array[4,3], array[4,3,2], array[4,3,2,1], array[4,3,2,1,0].

------------------------------------------------------------
Non-recursion solution:
Knowing a maximum subarray of A[1..j], extend the answer to find a maximum subarray ending at index by using the following observation: a maximum subarray of A[1..j] is either a maximum subarray of A[1..j] or a subarray A[1..j+1]. Determine a maximum subarray of the form A[1..j+1] in constant time based on knowing a maximum subarray ending at index j.

Wednesday, December 7, 2011

Introduction to Algorthms 3rd --P2.3.7

Describe aO(n *lg n)-time algorithm that, given a set S of n integers and another integer x, determines whether or not there exist two elements in S whose sum is exactly x.

This is the usual solution:
---------------------------------------------------------

Input: An array A and a value x.
Output: A boolean value indicating if there is two elements in A whose sum is x.
A ← SORT(A)
n ← length[A]
for i ← to n do
if A[i] 0 and B INARY-SEARCH(A, A[i] − x, 1, n) then
return true
end if
end for
return false
---------------------------------------------------------
My idea is to use merge sort.

x is sum of two numbers. then x/2 must be between those two numbers. in other words, if there are two numbers A and B whose sum is x, then A must be on the left of x and B must be on the right side of x.

The above idea cannot work. Because there is no way to determine how many elements should be chosen from left or right side of x/2.

Tuesday, December 6, 2011

Implementation of "strtok" in "string.h"

Some advice:

Never use this function. This function modifies its first
argument. The identity of the delimiting character is
lost. This function cannot be used on constant strings.

Usually, the following is how "strtok" is used.

char test[]="I Love You";
char *token;

token=strtok(test," ");
while(token!=NULL)
{
cout<<token<<endl;
token=strtok(NULL," ");
}

But the confusing part is why the 1st parameter of strtok can be NULL.
From http://www.cplusplus.com/reference/clibrary/cstring/strtok/, it states NULL means continuing to scan the left part of the original string.
If that is the case, how do you store the original string. For example, in while loop, how does the strtok know which string it is handling.

Delimiter-Characters at the beginning and end of str are skipped.

Useful implementation:http://clc-wiki.net/wiki/C_standard_library:string.h:strtok

Implementation of strtok:

http://www.openbsd.org/cgi-bin/cvsweb/src/lib/libc/string/strtok.c?rev=1.6&content-type=text/x-cvsweb-markup

Explanation: http://bytes.com/topic/c/answers/810975-looking-implementation-strtok

The version above (with the truncated URL) is a rehash of the
implementation I wrote in 1988. Mine did not have a strtok_r()
(which is not even in C99, much less C89), and in any case it
would be better to put strtok_r() in a separate file.

The C99 "restrict" qualifiers are missing, and should be added
if you have a C99-based system (though most people do not).

These days it probably would be better, in some sense, to go ahead
and call strspn() and strcspn(), rather than writing them in-line
as I did. (When I wrote this, subroutine calls were relatively
expensive, and compilers never did any inline expansion, so we had
to do it by hand wherever it was profitable.)

The assignment:

s[-1] = 0;

writes a '\0' at s[-1], of course. If you mean "what does s[-1]
even *mean*", it means: "add -1 to the pointer in s, and then
follow the resulting pointer". Adding an integer to a pointer
moves "forward" by that number of objects, so adding -1 moves
"forward" negative-one "char"s, or in other words, moves backward
by one "char". Earlier we had:

c = *s++;

so s now points one "char" beyond the location that holds the
character stored in "c". Pictorially, suppose we call strtok()
with:

char buf[] = ",hello,/,world";
char *result;

result = strtok(buf, ",/");

When we enter strtok(), we have "s" pointing to &buf[0], which
contains a ',' character. The static variable "last" is initially
NULL, or is left over from a previous call to strtok(), but either
way we are not going to use it, and are soon going to overwrite
it.

The strtok() call then calls strtok_r(), passing s (&buf[0]), delim
(pointing to the ',' in the string ",/"), and &last. The strtok_r()
function begins with (I will update this code for C99 while writing
this; note this changes the name of the third argument to "lastp",
as I would have called the variable if I had written this):

char *strtok_r(char *restrict s, const char *restrict delim, char **lastp) {
const char *restrict spanp;
char c, sc;
char *restrict tok;

if (s == NULL && (s = *lastp) == NULL)
return NULL;

Since s is not NULL, we do not even look at *lastp, much less
return NULL. (However, if we call strtok() again later, passing
NULL for s, we *will* look at *lastp, i.e., strtok()'s "last",
at that point. This is how we will save a "sideways" result
for later strtok() calls. This is not a *good* way to do it;
see the BSD strsep() function for a better method.)

Next, we enter the ugly loop that simply implements strspn():

cont:
c = *s++;
for (spanp = delim; (sc = *spanp++) != '\0';) {
if (c == sc)
goto cont;
}

In this case, "delim" points to ",/" and "c" initially picks up
the ',' from &buf[0], leaving s pointing to &buf[1]. The inner
loop sets spanp to delim, then loops while seting sc (the "span
character") from *spanp++ until sc is '\0'. Because the
loop increments spanp each time, it will look at all the non-'\0'
characters in delim, i.e., ',' and '/' respectively, until
something happens. In this case, the "something" happens
right away: c==',' and sc==',', so c==sc, so we execute the
"goto cont" line and thus start the outer loop over again.

This time, c gets the 'h' from buf[1], and s winds up pointing
to the 'e' in buf[2]. Since c=='h', and the inner loop looks
for ',' and '/', we get all the way through the inner loop
without doing a "goto cont". This leaves c=='h' and exits the
outer loop.

(The outer loop logically should use something other than a
"goto", but more realistically, we could just replace the whole
thing with:

s += strspn(s, delim);
c = *s++;

The strspn() function counts the number of characters that are
in the "span set", up to the first one that is not in it. That
is, since s points to ",hello,/,world" and the span-set is ",/",
strspn() would return 1, and s += 1 would then advance over the
initial comma. If s pointed to ",/,hello", strspn() would
return 3, and s += 3 would advance over all three initial
characters.)

In any case, now that we have skipped leading "non-token"
characters, we are ready to look for tokens:

if (c == '\0') {
*lastp = NULL;
return NULL;
}

If there are no more tokens, we set *lastp to NULL (this is, I
think, allowed but not required by the C standard) and return NULL.
However, c=='h', so we do not do this, but instead plow on to the
trickiest, densest part of the code:

tok = s - 1;
for (;;) {
c = *s++;
spanp = delim;
do {
if ((sc = *spanp++) == c) {
[to be shown and explained in a moment]
}
} while (sc != '\0');
}
/* NOTREACHED */

Remember that s points one character *past* the beginning of
the token -- in this case, to the 'e' in "hello" -- because we
did "c = *s++". So we set tok to s-1, so that tok points to
the 'h' in "hello". We then enter the outer loop, which -- as
the comment notes -- really just implements strcspn().

For each trip through the outer loop, we look at the next character
of the proposed token, via "c = *s++" as before. In this case,
it first sets c to the 'e' in buf[2], leaving s pointing to buf[3]
(the first 'l').

The inner loop then runs through each character in "delim", putting
it in sc each time, checking to see whether c==sc. Since the two
delimiter characters are ',' and '/', and 'e' is neither a comma nor
a slash, this will not find them. But here is one of the tricky
bits: since "delim" is a C string, it ends with a '\0' character.
I want *all* the loops to stop if c=='\0'. Rather than putting
in a special test for c=='\0', I simply allow the inner loop to
test c against the '\0' that ends the delimiter string.

In other words, since delim points to {',', '/', '\0'}, I can let
sc take on all *three* of these "char"s, and compare c==sc each
time. I stop the inner loop only *after* both c!=sc *and* sc=='\0'.
So this allows me to test c=='\0' without writing a separate "if"
statement.

In any case, none of 'e', 'l', 'l', and 'o' are token-delimiters,
so eventually "c = *s++" sets c to ',', leaving s pointing to the
'/' in ",hello,/,world". This time the inner loop *does* find
a token-delimiter, and we get to the code inside the "if":

if (c == '\0')
s = NULL;
else
s[-1] = '\0';
*lastp = s;
return tok;

Here, we check to see if c=='\0' first. A '\0' ends the string --
meaning s points one past the end -- and the end of the string is
necessarily the end of the token. But it also means there cannot
possibly be any additional tokens. So we want to sent *lastp to
NULL, so that a future call to strtok() will return NULL. Setting
s to NULL and *lastp to s will accomplish that. (In this case,
since c==sc, we will also have sc=='\0'. That is, we found both
the end of the delimiter string *and* the end of the token. This
is OK; we do not care which "token-ending character" we found.)

In our case, though, c!='\0', because c==','. So we set s[-1] to
'\0'. Here s points to the '/' in buf[7], so this replaces the
',' in buf[6] with '\0'. (Remember, buf is initially set to
",hello,/,world", so buf[0] is ',', buf[1] is 'h', buf[2] is 'e',
buf[3] is 'l', buf[4] is 'l', buf[5] is 'o', buf[6] is ',', buf[7]
is '/', buf[8] is ',', buf[9] is 'w', and so on.) We leave s==&buf[7],
and set *lastp = s -- so now the variable "last" in strtok()
points to the '/' in buf[7].

Finally, we return "tok", which is &buf[1]. The contents of buf[]
have been modified slightly, and the static variable "last" in
strtok() points just past the first token, so that a later call to
strtok() can find more tokens.

Again, the two nasty loops can be replaced by strspn() -- which
will "span" (or skip over) initial delimiters -- and strcspn(),
which will span the "c"omplement of the delimiter-set, i.e., skip
over initial "non-delimiters". Since everything that is not
a delimiter is a token character, this will give us just what
we want -- and we can rewrite the entire thing as:

char *strtok(char *restrict s, const char *restrict delim) {
static char *last;
char *restrict tok;

/* pick up from previous stopping point, if told to do so */
if (s == NULL && (s = last) == NULL)
return NULL;

/* skip initial delimiters to find start of token */
tok = s + strspn(s, delim);

/* skip over non-delimiters to find end of token */
s = tok + strcspn(tok, delim);

/*
* Save state for next call, and return token. If there
* is no token, both *tok and *s will be '\0'. If there
* is one token that is ended with '\0', *s will be '\0'.
* (If there are trailing delimiters, we will skip them
* later.)
*/
last = *s == '\0' ? NULL : s + 1;
return *tok == '\0' ? NULL : tok;
}

This code is now short enough that we can just use a separate copy
for strtok_r(), if we like; or we can use the BSD-style "strtok is
a wrapper around strtok_r", which just needs the obvious changes
with "lastp".

It is also short enough to demonstrate that strtok() is not a
particularly useful function: you can call strspn() and strcspn()
yourself, and avoid the problems that eventually crop up with the
saved state in the static variable called "last" here.

Monday, December 5, 2011

Notes of Introduction to Algorithms (3rd Edition) --1

"In Place": Algorithm operates on the same inputs without using extra space, or use a constant number of extra space, which implies the inputs are the outputs after the algorithm is completed.

1: Insertion Sort (in place algorithm)
Inputs: an array of n elements: int inputs[n];
Outputs: an array of n elements inputs[n] following ascending order.

Idea: Suppose the first m (<n) elements have been sorted, the (m+1)th element is being sorted. Then insert (m+1)th element into an appropriate place in the first m elements. Therefore, after the insertion, the (m+1) elements are sorted as well.

int Insertion(int inputs[], int size)
/*if array is passed as parameter, inputs should not have size with it: inputs[]; inputs[size] is not correct.*/
{
     for(int i=1;i<size;++i)
     {
for(int j=0;j<i;++j)
          {
   if(inputs[i] < inputs[j])
               {
                  exchange(inputs[i],inputs[j]);
                  break;
               }
           }
      }
}

The internal loop can also be scanning from right to left:
        for(int j=i-1;j>-1;--j)
          {
   if(inputs[i] > inputs[j])
               {
                  exchange(inputs[i],inputs[j]);
                  break;
               }

           }

Complexity: O(n^2)

Sunday, April 3, 2011

string cannot be allocated by "malloc"

string cannot be allocated by "malloc".the only way to allocate memory for string is to "new" one.

string test=new string;

"segment fault" in sort function

When I used "sort" (STL/Algorithm) to sort my own "struct", I need define my own compare function.

bool custom_sort(Var const& lhs, Var const& rhs)
{
   if (lhs.weight != rhs.weight)
   return lhs.weight < rhs.weight;
   return true;
}

then call it:

sort(sort_var.begin(), sort_var.end(), &custom_sort);

Then I sometimes got an error message:

Segment fault.

The problem is caused by ignoring the following fact:

In C++, your "compare" predicate must be a strict weak ordering. In particular, "compare(X,X)" must return "false" for any X. While in my case, I always return a "true" when the variables are equal to each other. Therefore, this "compare" predicate does not impose a strict weak ordering, and the result of passing it to "sort" is undefined.

So solution:
The right action to handle two equal variables is to "return false" not "return true".

Saturday, March 12, 2011

GPU bank conflicts

The first thing you’ll need to know about is the thread hierarchy. CPUs are designed to run just a few threads, very quickly. GPUs, on the other hand, are designed to process thousands of threads simultaneously, with great efficiency. So, in order to take full advantage of your graphics card, you’ll need to break your problem down into hundreds, or thousands of threads.

Half-Warp - A half-warp is a group of 16 consecutive threads. Half-warp threads are generally executed together. Half-warps are aligned. For instance, Threads 0->15 will be in the same half-warp, 16->31, etc.

Warp – A warp of threads is a group of 32 consecutive threads. On future computing devices from nVidia, it might be possible that all threads in the same Warp are generally executed together in parallel. Therefor, it is a good idea to make your programs as if all threads within the same warp will execute together in parallel. Threads 0->31 will be in the same warp, 32->63, etc.

Block – A block is a collection of threads. For technical reasons, blocks should have at least 192 threads to obtain maximum efficiency and full latency hiding. Typically, blocks might contain 256 threads, 512 threads, or even 768 threads. Here’s the important thing you need to know; threads within the same block can synchronize with each other, and quickly communicate with each other.

Grid – A grid is a collection of blocks. Blocks can not synchronize with each other, and therefore threads within one block can not synchronize with threads in another block.

Step 3: Understand the memory hierarchy

Global Memory – Global memory can be thought of as the physical memory on your graphics card. If you have an integrated nVidia chipset like the ION, the global memory can be thought of as the amount of memory alloted to the graphics device. All threads can read and write to Global memory. You can even read and write to Global memory from a thread on the CPU.

Shared Memory – A GPU consists of many processors, or multiprocessors. Each multiprocessor has a small amount of Shared memory, on the order of about 16KB of memory. Shared memory is generally used as a very quick working space for threads within a block. Shared memory is allocated on a block by block basis. For example, you may have three blocks running consecutively on the same multiprocessor. This means that the maximum amount of shared memory the blocks can reserve is 16KB / 3. Threads within the same block can quickly and easily communicate with each other by writing and reading to the shared memory. It’s worth mentioning that the shared memory is at least 100 times faster than global memory, so it’s very advantageous if you can use it correctly.

Texture Memory – A GPU also has texture units and memory which can be taken advantage of in some circumstances. Unlike global memory, texture memory is cached, and is generally read only. If you expect threads to access memory addresses which have some coherence, you might want to consider using texture memory to speed up those memory accesses.

If every thread in a half-warp requests data from the same address in constant memory, your GPU will generate only a single read request and subsequently broadcast the data to every thread.

To achieve high memory bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. In the case of the shared memory space, the banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per two clock cycles.

So, any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

Functionname <<m,n>> (parameters);

M: number of blocks. <65535

N: number of threads in each block. <=maxThreadsPerBlock (512?)

Example: http://forums.nvidia.com/index.php?showtopic=99186

on page 89 in Cuda Programming Guide 2.2 there are two examples of bank conflicts:

__shared__struct type shared[32];
struct type data= shared[BaseIndex+tid];

it says that if the data accesed based on tid and the type is declared as this, it will result in NO bank conflict .

struct type{ float x,y,z; }

and if the type is declared as this it WILL have a bank conflict.

struct type{ float x,y; }

in 1st case compiler should be reading data as:
x.x = ((float*) shared)[tid * 3 + 0];
x.y = ((float*) shared)[tid * 3 + 1];
x.z = ((float*) shared)[tid * 3 + 2];

Imagine threads from the 1st half-wrap where tid ranges from 0 to 15, that means that we address following banks [((0..15)*3 + 0) % 16] = [0,3,6,9,12,15,2,5,8,11,14,1,4,7,10,13]. As you can see threads address different banks when reading 'x', same goes for 'y' and 'z'

32 * 3= 96 words. 1 bank has 1 word (32-bit). 16 banks. 96/16=6 lines.

In a half-warp, there are 16 threads. Shared[0..15] (in blue) will be access in the first half-warp simultaneously. And Shared[16..31] (in brown) will be accessed by second half-warp simultaneously. So, in above table, first 3 lines will be accessed simultaneously by 1st half-warp and since each x or y or z is in different bank so there is no conflict. Similarly, the last 3 lines will be accessed simultaneously by 2nd half-warp and since each x or y or z is in different bank so there is no conflict.

Most important fact: Threads only access one 32-bit word simultaneously.

now 2nd case

x.x = ((float*) shared)[tid * 2 + 0];

x.y = ((float*) shared)[tid * 2 + 1];

Reading 'x' addresses following banks [((0..15)*2 + 0) % 16] = [0,2,4,6,8,10,12,14,0,2,4,6,8,10,12,14]. As you can see here, every other bank is referenced 2 times, giving you a 2-way bank conflict. Same happens when reading 'y'.

In a half-warp, there are 16 threads. Shared[0..15] (in blue) will be access in the first half-warp simultaneously. And Shared[16..31] (in brown) will be accessed by second half-warp simultaneously. So, in above table, first 2 lines (in blue) will be accessed simultaneously by 1st half-warp and since each x or y is in the same bank so there is bank conflict.

When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word" - First thing there are 16 memory banks each 4bytes wide. So essentially, if you have any thread in a half warp reading memory from the same 4bytes in a shared memory bank, you're going to have bank conflicts and serialization etc.

Conclusion: for the above example, it should be taken into consideration:

1: how many banks? Because in some gpus, it is 16 while others are 32. In above, it is 16.

2: whether it is accessed per-warp instead of per-half-warp. In above, it is per-half-warp.

Useful link: http://www.cs.caltech.edu/courses/cs101gpu/lectures/Lecture8.pdf

The CUDA C compiler treats variables in shared memory differently than typical variables. It creates a copy of the variable for each block that you launch on the GPU. Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared memory

buffers reside physically on the GPU as opposed to residing in off-chip DRAM.

Because of this, the latency to access shared memory tends to be far lower

than typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.

__syncthreads() guarantees that every thread in the block has completed instructions prior to __syncthreads() before the hardware will execute the next instruction on any thread.

Detect bank conflicts:

You can use bank checker macro (part of cutil) if you compile and run in emulation mode. It'll tell you where in your code you're getting bank conflicts.

Deryk's stack