Reference no: EM131442586
Use Crawler Java Assignment
Review, fix and run the crawler.
Add code for the additional requirements.
Make sure your crawler does the following.
Test your crawler only on the data in:
https://lyle.smu.edu/~fmoore
Make sure that your crawler is not allowed to get out of this directory! Yes, there is a robots.txt file that must be obeyed. Note that it is in a non-standard location.
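One way to honor the robots.txt requirement is to parse its Disallow rules and check each candidate URL path before fetching. The sketch below is a minimal, hypothetical helper (the class name and the simplification of ignoring User-agent grouping are assumptions, not part of the assignment); the actual file's location must be discovered on the test site.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsParser {
    // Collect Disallow paths from robots.txt text (minimal sketch:
    // ignores User-agent grouping and applies all rules to our crawler).
    public static List<String> disallowedPaths(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) rules.add(path);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with some disallowed prefix.
    public static boolean isAllowed(String path, List<String> rules) {
        for (String rule : rules) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }
}
```

To also keep the crawler inside the test directory, combine this check with a prefix test against the base URL before enqueuing any link.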
The required input to your program is N, the limit on the number of pages to retrieve and a list of stop words (of your choosing) to exclude.
Perform case insensitive matching.
You can assume that there are no errors in the input. Your code should, however, be robust to errors in the Web pages you are crawling. If an error is encountered on a page, it is acceptable simply to skip that page.
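The skip-on-error behavior can be implemented by wrapping each fetch in a catch-all and returning a sentinel, so the crawl loop just moves on. This is one possible sketch; the helper name and null-on-failure convention are assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SafeFetch {
    // Fetch a page's contents, or return null on any error so the
    // crawler can simply skip that page and continue.
    public static String fetchOrNull(String url) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
            return sb.toString();
        } catch (Exception e) {
            return null; // malformed URL, 404, timeout, etc. -> skip page
        }
    }
}
```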
1. Identify the key properties of a web crawler. Describe in detail how each of these properties is implemented in your code.
2. Use your crawler to list the URLs of all pages in the test data, report all out-going links of the test data, and display the contents of each page's <TITLE> tag. [10 points]
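Extracting the <TITLE> contents can be done with a case-insensitive regular expression before the markup is stripped. A minimal sketch (the class name is hypothetical; a real HTML parser would be more robust):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Return the text inside the <TITLE> tag, or null if absent.
    // Case-insensitive and tolerant of attributes and line breaks.
    public static String title(String html) {
        Matcher m = Pattern.compile("<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```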
3. Implement duplicate detection, and report if any URLs refer to already seen content.
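Duplicate detection is commonly done by hashing each page's content and remembering the digests: two URLs whose content hashes collide refer to the same page. A sketch under that assumption (class name hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class DupDetector {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this exact content was already seen under
    // some earlier URL (SHA-256 digest as the content fingerprint).
    public boolean isDuplicate(String pageContent) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(pageContent.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            // Set.add returns false when the digest was already present.
            return !seen.add(hex.toString());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 always exists on the JVM
        }
    }
}
```

Hashing avoids storing full page bodies; to report the colliding URLs, store a map from digest to the first URL seen instead of a bare set.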
4. Use your crawler to list all broken links within the test data.
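A broken link can be detected by issuing a lightweight HEAD request and treating any 4xx/5xx status (or a failed connection) as broken. The class below is one hedged sketch of that approach, not the assignment's required API:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class LinkChecker {
    // HTTP status codes of 400 and above indicate a broken link.
    public static boolean isBrokenStatus(int code) {
        return code >= 400;
    }

    // Issue a HEAD request; treat connection failures as broken too.
    public static boolean isBroken(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            return isBrokenStatus(conn.getResponseCode());
        } catch (Exception e) {
            return true;
        }
    }
}
```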
5. How many graphic files are included in the test data?
6. Have your crawler save the words from each page of type (.txt, .htm, .html). Make sure that you do not save HTML markup. Explain your definition of "word". In this process, give each page a unique document ID.
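Stripping markup and tokenizing can be sketched as below. Here a "word" is defined as a maximal run of letters, lower-cased — that is one possible definition; your write-up must state whichever definition your crawler actually uses.

```java
import java.util.ArrayList;
import java.util.List;

public class Tokenizer {
    // Remove HTML tags, then split into "words": maximal runs of
    // letters, case-folded to lowercase (one possible definition).
    public static List<String> words(String html) {
        String text = html.replaceAll("<[^>]*>", " ");
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }
}
```

Assigning document IDs can then be as simple as an incrementing counter stored alongside each saved word list.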
Implement Stemming
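For illustration only, a crude suffix-stripping stemmer is sketched below. This is NOT the full Porter algorithm; for the assignment you would implement or plug in a proper Porter stemmer, of which this is only a toy approximation.

```java
public class SimpleStemmer {
    // Toy suffix stripper (NOT Porter's algorithm): handles a few
    // common English suffixes with minimum-length guards.
    public static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4)
            return word.substring(0, word.length() - 3) + "y";
        if (word.endsWith("ing") && word.length() > 5)
            return word.substring(0, word.length() - 3);
        if (word.endsWith("ed") && word.length() > 4)
            return word.substring(0, word.length() - 2);
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 3)
            return word.substring(0, word.length() - 1);
        return word;
    }
}
```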
7. Report the 20 most common words, each with its document frequency. State whether you counted raw words or stemmed words.
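Document frequency counts the number of documents a term appears in at least once (not total occurrences), so each document is reduced to a set of terms before counting. A sketch, with a hypothetical class name:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DocFrequency {
    // docs: one Set of distinct terms per document.
    // Returns the k terms with the highest document frequency.
    public static List<Map.Entry<String, Integer>> topK(
            List<Set<String>> docs, int k) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) df.merge(term, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending df
        return entries.subList(0, Math.min(k, entries.size()));
    }
}
```

With k = 20, this yields the required report once each crawled page's word (or stemmed-word) set is collected.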
Attachment: crawler_project.zip