The purpose of this assignment is to help you understand regular
expression and how we implement it in C. C. It also will give you
more opportunity to use the GNU/Unix programming tools,
especially bash
, emacs
, gcc209
, gdb
.
Study the implementation of the regular expression by Rob Pike and Brighan Kernighan. Implementing ?, +, \x, is "on your own". That is, you must not consult with the course instructor, lab TAs, Moodle, etc., except perhaps to clarify requirements. That part is worth 20% of this assignment.
Regular expressions describe patterns of text. They consist of literal
chracters such as 'a', 'I', '9', and "wild cards" which have special
meanings such as matching "multiple occurence of some character", "any
one character", etc. For example, KA*IST
matches any
text that starts with K
, followed by zero or more
occurrence of 'A', and ends with IST
. KIST
and KAAAIST
match it while KA%IST
doesn't. *
is a wild card that matches zero or more
occurrences of the previous character and K,A,I,S,T are literals.
The list of wild cards for this assigment is as follows:
c matches any literal character 'c' unless 'c' is a wild card character ^ matches the beginning of the input string $ matches the end of the input string . matches any one character ? matches zero or one occurrence of the previous character * matches zero or more occurrenece of the previous character + matches one or more occurrence of the previous character \x matches the character, 'x' if 'x' is a wild card ('^', '$', '.', '?', '*', '+', '\') or 'x' can one of the characters below: \d matches any decimal digit (0,1,2,3,4,5,6,7,8,9) \D matches any character that is not a decimal digit \s matches any whitespace character. Whitespace characters are: space, form-feed ('\f'), newline ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v'). \S matches any character that is not a whiltespace character.
Regular Expression | Match | Not match |
hello
|
hello , Chello , hello EE209
|
hallo
|
^hello$
|
hello
|
hello EE209 , Chello
|
KA..T
|
KAIST , KA33T |
KIST
|
(\s+cool EE2\d\d) |
( cool EE219) , ( cool EE209) |
(cool EE222) |
Your task is to write a C program named mygrep
that performs the regular expression matching as below.
mygrep 'regexp' somefile or mygrep 'regexp' file1 ...
mygrep
should read one line at a time and prints out the line if the regular expression matches it. The output should look like
filename:line1 filename:line2 ... Total # of matches: 7If standard input is used, use "stdin" as the file name. For example,
$ ./mygrep 'argv' mygrep.c mygrep.c:main(int argc, char *argv[]) mygrep.c: nmatch = grep(argv[1], stdin, NULL); mygrep.c: f = fopen(argv[i], "r"); mygrep.c: fprintf(stderr, "can't open %s:", argv[i]); mygrep.c: nmatch += grep(argv[1], f, argv[i]); Total # of matching lines: 5
$ ./mygrep 'argv' < mygrep.c stdin:main(int argc, char *argv[]) stdin: nmatch = grep(argv[1], stdin, NULL); stdin: f = fopen(argv[i], "r"); stdin: fprintf(stderr, "can't open %s:", argv[i]); stdin: nmatch += grep(argv[1], f, argv[i]); Total # of matching lines: 5
Develop on lab machines, using emacs
to create source
code and gcc209
to compile.
A skeleton C file is available here: mygrep.c It implements the basic file I/O, so all you need is to fill out the rest of the functionalities. Feel free to use this file as your starting point (of course, you don't have to use it), and change any part in the code if you'd like.
You can assume each input line is less than or equal to 1024 bytes. That is, it is OK if your program behaves incorrectly if some input line is larger than 1024 bytes, but your program should not crash. (grep() in the skeleton file already guards against this condition)
Your program should produce an error if the regular expression in the wrong format. For example, '\abc' is wrong regular expression since '\a' is not allowed. In that case, the execution should stop with an error message, "wrong regular expression format:\abc".
We provide reference program (mygrep) and some test cases at here. Download all files and enable the execution bit by
chmod u+x samplemygrep
chmod u+x test_regex.sh
./test_regex.sh ./mygrep
Create a readme
text file that contains:
Submit your work electronically on Moodle via following commands:
mkdir 20091234_assign2 mv readme mygrep.c 20091234_assign2 tar zcf 20091234_assign2.tar.gz 20091234_assign2
We will grade your work on quality from the user's point of view
and from the programmer's point of view. From the user's point of
view, your module has quality if it behaves as it should. The correct
behavior of the mygrep
is defined by the previous
sections of this assignment specification and by the
given samplemygrep
program.
From the programmer's point of view, your program has quality if it is
well styled and thereby simple to maintain. See the specifications of
the previous assignment for guidelines concerning style. Specifically,
function modularity will a prominent part of your grade. To encourage
good coding practices, we will deduct points if gcc209
generates warning messages.