Implementing a relatively simple text-compression scheme

Assignment Help JAVA Programming

Reference no: EM13712367

You will be implementing a relatively simple text-compression scheme. your code should work as follows:

1. It should run from the command line, using the command:

java Compress <FILE_NAME>

where the <FILE_NAME> argument is the name of a text-file found in the same directory as the executable program. (This name is passed into your Java program as an argument, and will then be accessible from position 0 in the array args that is input to the main() method.

2. It should read through the text-file, word-by-word (using white-space to delineate words, as usual, so "words" may contain punctuation, etc.). For each word, it should count occurrences of any substring of 3 or more characters. For instance, if it reads in the word "googoo" it should count 2 occurrences of the substring "goo", 1 occurrence of "oog", and 1 occurrence each of "goog", "oogo", "ogoo", "googo", and "oogoo". Finally, it should count one occurrence of 6-letter substring, "googoo".

This counting should occur over the entire file, and it should be handle efficiently using a hash-map structure to map substrings to the number of times they occur. You can use a built-in Java structure here.

3. Once the counting is complete, the hash-map will contain a collection of substrings, each with a corresponding count value (1 or greater). You should then order these substrings:

- Create a new class that can store a substring and its count value.
- Make that class properly Comparable to other things of its type. In the compareTo ordering, each object should be ordered by its impact factor, which is the product of the substring's length and the number of times it occurs. For instance, if the substring "the" occurs 100 times, it has an impact factor of (3 × 100 = 300).

Two of your class objects should be ordered so that things with higher impact factor come before things with a lower impact factor. Thus, if "the" has impact factor 300, and "goo" has impact factor 100, then "the" comes before "goo" in comparison ordering.

- Once you have completed the class, place each of your substring and count value pairs into a priority queue, so that they will be in order, with highest impact factor at the front of the queue. Again, you can use a built-in Java structure here.

4. Now, you can generate a compressed encoding of the highest-impact substrings. For each of the first 52 such substrings (or al l of them, if your input doesn't contain 52 distinct substrings of length 3 or greater), you will encode each such string with a 2-letter coding, consisting of the plus sign '+', followed by a single character (in range a-z or A-Z).

For instance, if "the" is our highest-impact substring, then its encoding would be "+a".

Note 01: when generating codes, you should ignore all substrings that either contain, or are contained by, one you have already encoded. For instance, if you have already encoded "ther", then you should ignore any future substrings like "there", which contains it, or "the", which is contained by it. You should still attempt to generate 52 distinct codings, if that is possible after discarding overlapping substrings.

Note 02: for this to work, it assumes that the text-file does not actually include the special marker character, '+'; if this is not so, another special character would have to be used.

5. Your code should now read through the file again, writing out a compressed version to a second file. To do compression, read in each word again, and replace each substring in the word with its encoding, if such an encoding exists, and write the result.

For instance, if we read in the word "teethes", and we have an encoding of "+a" for "the", then we should re-write the word as "tee+as".

Note 01: if a word contains multiple possible substrings, then you should replace them all. When there are multiple choices, the replacement should happen by order of length of substrings, so that we get maximum compression of each word. For instance, if we have an encoding of "+a" for "the", "+b" for "theo", and "+c" for "log", then the word "theology" would first be encoded as "+blogy", since "theo" is the longest substring; it would then be encoded finally as "+b+cy", and written to the output file. If a word contains multiple possible substrings of the same length, remove them in the order given by Java String sorting (this means that every word has a uniquely defined coding).

Note 02: when writing to the output file, make sure that line-breaks and other white-space are preserved from the original file.

Note 03: the output file should have a name related, but not identical, to that of the original input. In particular, if the input is named "FILE.txt", then the compressed output file should be named "FILE_comp.txt" (where "FILE" can be any file-name).

6. The code above will somewhat compress a text-file, but we will not be able to de-compress it, unless we know the encoding used. Modify your code so that when it creates the output file, it first writes information about the compression into the first part of the file (in a format of your own choosing), and then follows it with the compressed text. Then, write a second program, which can also be called from the command line:

java Decompress <FILE_NAME>

When called, this should open the named file, if possible, and (assuming it is in proper format), de-compress it, producing a new version of the file that is identical to the original input to the compression program.

Reference no: EM13712367

Questions Cloud

Given the demand and cost estimates : Given the demand and cost estimates, what price should you change if you want to maximize your weekly profit?What output should you produce? What is your weekly profit?

Starting with the differential form of the cauchy equation : Starting with the differential form of the Cauchy equation, derive a differential equation for the kinetic energy, k = 1/(2*u^2) , of a moving fluid.

Steam steadily flows through turbine to produce mechanical : Steam steadily flows through a turbine to produce mechanical work. At the inlet, the pressure, temperature and velocity of the steam are 10 MPa, 450 °C and 110 m/s, respectively. The water exits this turbine at pressure of 10 kPa with the quality of ..

The wage-skills relationship in the labor market : 1. In the US, the wage-skills relationship in the labor market before government intervention is wus=100+0.4s, where s is the number of "efficiency units of skill" and w is the weekly wage. In Canada, the wage-skills relationship is WC=250 + 0...

Implementing a relatively simple text-compression scheme : You will be implementing a relatively simple text-compression scheme - It should run from the command line.

Explain the effect of fillers in altering polymer properties : Explain the effect of fillers in altering polymer properties? What are some of the main differences between thermoplastics and thermo sets?

Phil has two periods of work remaining prior to retirement : Phil has two periods of work remaining prior to retirement. Assume that Phil maximizes the present value of his expected lifetime earnings and his discount rate is 10 percent. He is currently employed in a firm that pays him the value of his ma..

Explain glass transition temperature : Explain glass transition temperature? Present some of the key properties and typical uses of Al2O3? What are main ingredients used in making Pig Iron and how is the oxygen removed.

The market demand for train service : (b) Suppose that the firm is a monopolist. Assuming the firm produces a positive level of output, calculate the output and price it sets. Explain why the profit-maximizing price is greater than the monopolist's marginal cost.

User Account

All Pages