KAIST
EE 209: Programming Structures for EE

Assignment 2: A Regular Expression Module


Purpose

The purpose of this assignment is to help you understand regular expression and how we implement it in C. C. It also will give you more opportunity to use the GNU/Unix programming tools, especially bash, emacs, gcc209, gdb.


Rules

Study the implementation of the regular expression by Rob Pike and Brighan Kernighan. Implementing ?, +, \x, is "on your own". That is, you must not consult with the course instructor, lab TAs, Moodle, etc., except perhaps to clarify requirements. That part is worth 20% of this assignment.


Background

Regular expressions describe patterns of text. They consist of literal chracters such as 'a', 'I', '9', and "wild cards" which have special meanings such as matching "multiple occurence of some character", "any one character", etc. For example, KA*IST matches any text that starts with K, followed by zero or more occurrence of 'A', and ends with IST. KIST and KAAAIST match it while KA%IST doesn't. * is a wild card that matches zero or more occurrences of the previous character and K,A,I,S,T are literals.

The list of wild cards for this assigment is as follows:

c  matches any literal character 'c' unless 'c' is a wild card character
^  matches the beginning of the input string
$  matches the end of the input string
.  matches any one character
?  matches zero or one occurrence of the previous character
*  matches zero or more occurrenece of the previous character
+  matches one or more occurrence of the previous character
\x matches the character, 'x' if 'x' is a wild card ('^', '$', '.', '?', '*', '+', '\') or 'x' can one of the characters below:
   \d matches any decimal digit (0,1,2,3,4,5,6,7,8,9)
   \D matches any character that is not a decimal digit
   \s matches any whitespace character. Whitespace characters are: space, form-feed ('\f'), newline ('\n'),
      carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v'). 
   \S matches any character that is not a whiltespace character.

Regular Expression Examples

Regular Expression Match Not match
hello hello, Chello, hello EE209 hallo
^hello$ hello hello EE209, Chello
KA..T KAIST, KA33T KIST
(\s+cool EE2\d\d) ( cool EE219), ( cool EE209) (cool EE222)

Your Task

Your task is to write a C program named mygrep that performs the regular expression matching as below.


Functionality

Your program should take the regular expression as the first parameter and match it against the text either from standout input stream or a list of files.
  mygrep 'regexp' somefile or
  mygrep 'regexp' file1 ...
mygrep should read one line at a time and prints out the line if the regular expression matches it. The output should look like
filename:line1
filename:line2
...
Total # of matches: 7
If standard input is used, use "stdin" as the file name. For example,
$ ./mygrep 'argv' mygrep.c
mygrep.c:main(int argc, char *argv[])
mygrep.c:    nmatch = grep(argv[1], stdin, NULL);
mygrep.c:      f = fopen(argv[i], "r");
mygrep.c:	fprintf(stderr, "can't open %s:", argv[i]);
mygrep.c:      nmatch += grep(argv[1], f, argv[i]);
Total # of matching lines: 5
$ ./mygrep 'argv' < mygrep.c
stdin:main(int argc, char *argv[])
stdin:    nmatch = grep(argv[1], stdin, NULL);
stdin:      f = fopen(argv[i], "r");
stdin:	fprintf(stderr, "can't open %s:", argv[i]);
stdin:      nmatch += grep(argv[1], f, argv[i]);
Total # of matching lines: 5

Logistics

Develop on lab machines, using emacs to create source code and gcc209 to compile.

A skeleton C file is available here: mygrep.c It implements the basic file I/O, so all you need is to fill out the rest of the functionalities. Feel free to use this file as your starting point (of course, you don't have to use it), and change any part in the code if you'd like.

You can assume each input line is less than or equal to 1024 bytes. That is, it is OK if your program behaves incorrectly if some input line is larger than 1024 bytes, but your program should not crash. (grep() in the skeleton file already guards against this condition)

Your program should produce an error if the regular expression in the wrong format. For example, '\abc' is wrong regular expression since '\a' is not allowed. In that case, the execution should stop with an error message, "wrong regular expression format:\abc".

We provide reference program (mygrep) and some test cases at here. Download all files and enable the execution bit by

chmod u+x samplemygrep
chmod u+x test_regex.sh

Assuming your binary is 'mygrep', you can run the test cases with your program by

./test_regex.sh ./mygrep

Feel free to modify the test scripts and files. Compare the difference from samplemygrep and your mygrep.

Create a readme text file that contains:

Submit your work electronically on Moodle via following commands:

mkdir 20091234_assign2
mv readme mygrep.c 20091234_assign2
tar zcf 20091234_assign2.tar.gz 20091234_assign2

Grading

We will grade your work on quality from the user's point of view and from the programmer's point of view. From the user's point of view, your module has quality if it behaves as it should. The correct behavior of the mygrep is defined by the previous sections of this assignment specification and by the given samplemygrep program. From the programmer's point of view, your program has quality if it is well styled and thereby simple to maintain. See the specifications of the previous assignment for guidelines concerning style. Specifically, function modularity will a prominent part of your grade. To encourage good coding practices, we will deduct points if gcc209 generates warning messages.