Saturday, June 27, 2015

Data Scraping - About Hand Scraped Flooring

Hand scraped hardwood flooring is one of the best floors that you can install in your house.

Advantages of Hand Scraped Hardwood Flooring

The product comes with a number of advantages which include:

Antique and modern technology: The floor brings out the best elements of both antique and modern technology: the scraping gives it an aged, handcrafted look, while the modern elements show in the quality of the product.

Unique patterns: Who doesn't want to be unique? These floors allow you to create your own design. If the boards are scraped by machine, you simply set the machine to produce the pattern you want; if they are scraped by a craftsman, you ask the craftsman to create your desired pattern.

Character: The different depths in the floor give it character and color that you can't find in other types of floors. As the sun changes its angle during the day, the nooks and valleys in the boards catch the light differently, giving the floor an endlessly rich appearance.

Durability: Experts have shown that hand-scraped hardwood retains its look for a long time. If your kid or pet dents the floor, the dent simply blends into the rest of the character, making it hard to tell that it is there.

Making the floors shine again

Although scraped floors are designed to look worn and aged, they are made from modern wood that needs to be cared for in order to retain its original look.

To make the floors shine again, first remove all the dust and dirt that might be causing the wood to look dull.

Next, mix 1 gallon of warm water with ½ teaspoon of dishwashing detergent and use the solution to clean the surface of the floor; the aim is to remove any stains. Once that is done, dampen a cloth with club soda, wipe the floor, and then use another cloth to buff the wood until it shines.

Conclusion

This is what you need to know about hand scraped hardwood flooring. When cleaning the floors, avoid oil-based soaps, as they dull the surface and undo your efforts.

If the above method doesn't restore the shine, mix one part white vinegar with one part cooking oil and use the mixture to clean the floor.

Source: http://ezinearticles.com/?About-Hand-Scraped-Flooring&id=8990255

Monday, June 22, 2015

Rvest: easy web scraping with R

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")

rvest in action

To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)

lego_movie <- html("http://www.imdb.com/title/tt1490017/")

To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") – it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

#> [1] 7.9

We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

The titles and authors of recent message board postings are stored in the third table on the page. We can use html_nodes() and [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()

#>                                              X 1            NA
#> 1 this movie is very very deep and philosophical   mrdoctor524
#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm
#> 3                         Discouraging Building?       Laestig
#> 4                              LEGO - the plural      neil-476
#> 5                                 Academy Awards   browncoatjw
#> 6                    what was the funniest part? actionjacksin

Other important functions

    If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td").

    Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().

    Detect and repair text encoding problems with guess_encoding() and repair_encoding().

    Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)
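
For a quick feel for a couple of these helpers, here is a minimal sketch reusing the lego_movie page parsed above; the selectors are purely illustrative and not taken from the original post. It pulls the href attribute of every link with html_attr(), then grabs the same table cells with an xpath selector instead of css:

# Illustrative only: pull the href of every link on the page
lego_movie %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head()

# The same idea with an xpath selector instead of css
lego_movie %>%
  html_nodes(xpath = "//table//td") %>%
  html_text() %>%
  head()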

To see these functions in action, check out package demos with demo(package = "rvest").

Source: http://www.r-bloggers.com/rvest-easy-web-scraping-with-r/

Saturday, June 13, 2015

Web Scraping: Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, whereas data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem: we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.

Source: http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813

Wednesday, June 3, 2015

Twitter Scraper Python Library

I wanted to save the tweets from Transparency Camp. This prompted me to turn Anna's basic Twitter scraper into a library. Here's how you use it.

Import it. (It only works on ScraperWiki, unfortunately.)

from scraperwiki import swimport

search = swimport('twitter_search').search

Then search for terms.

search(['picnic #tcamp12', 'from:TCampDC', '@TCampDC', '#tcamp12', '#viphack'])

A separate search will be run on each of these phrases. That’s it.

A more complete search

Searching for #tcamp12 and #viphack didn't get me all of the tweets because I waited about a week to do this. To get a more complete list, I looked at the tweets returned from that first search and then searched for tweets referencing the users who had posted them.

from scraperwiki.sqlite import save_var, get_var, select
from time import sleep

# Search by user to get some more
users = [row['from_user'] + ' tcamp12' for row in \
    select('distinct from_user from swdata where from_user > "%s"' \
    % get_var('previous_from_user', ''))]

for user in users:
    search([user], num_pages = 2)
    save_var('previous_from_user', user)
    sleep(2)

By default, the search function retrieves 15 pages of results, which is the maximum. In order to save some time, I limited this second phase of searching to two pages, or 200 results; I doubted that there would be more than 200 relevant results mentioning a particular user.

The full script also counts how many tweets were made by each user.
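
That per-user tally isn't reproduced here, but a minimal sketch of it might look like the following, assuming the tweets saved by search() end up in the usual swdata table with a from_user column, as in the query above:

from collections import Counter
from scraperwiki.sqlite import select

# Tally tweets per user from the scraped swdata table
# (table and column names assumed from the query above)
rows = select('from_user from swdata')
counts = Counter(row['from_user'] for row in rows)

for user, n in counts.most_common():
    print('%s: %d' % (user, n))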

Library

Remember, this is a library, so you can easily reuse it in your own scripts, like Max Richman did.

Source: https://scraperwiki.wordpress.com/2012/07/04/twitter-scraper-python-library/