I just happened upon a link between two seed sites that wasn't picked up in the crawl. The seed sites are:

http://www.now.org/

and

http://www.emmagoldman.com/

When I created a sub-network to look at the isolates ("IsolatesSeedsProSSM"), I noticed that http://www.emmagoldman.com/ was listed as having degree category = 0. This surprised me, as I know the relationship between the two organizations, so I checked and found that there is a link on the website, but it isn't being picked up by the crawl for some reason.

If you follow the directory (http://www.emmagoldman.com/links/NationalFeministOrganizations.htm), you'll find an active link to the former now.org site.

If it is happening here, should I be concerned about other missed links?

Any ideas?

Thanks,

Michael

  • anon

I will look into this to see if the site was successfully crawled. If it wasn't, there are several possible reasons:

    - the URL is not correct (but you would have figured that out)
    - the site was down when the crawler visited it
    - the site is using robots.txt to prevent crawlers from accessing it
    - the crawl parameters meant that the crawler didn't reach the page where the missed links are
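    Two of these possibilities can be checked by hand. The sketch below (assuming Python; the robots.txt content and page fragment are hypothetical stand-ins for what a real fetch of the two sites would return) shows how a crawler decides whether robots.txt permits a page and how it extracts outbound links from the HTML:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

# Hypothetical robots.txt; a real check would fetch
# http://www.emmagoldman.com/robots.txt instead.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Would a generic crawler be allowed to fetch the links page?
allowed = rp.can_fetch(
    "*", "http://www.emmagoldman.com/links/NationalFeministOrganizations.htm"
)

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, as a simple crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical fragment standing in for the real links page.
html = '<p><a href="http://www.now.org/">NOW</a></p>'
extractor = LinkExtractor()
extractor.feed(html)

print(allowed)                                    # True: page not disallowed
print("http://www.now.org/" in extractor.links)   # True: link found in the HTML
```

    If the page is allowed and the link appears in the fetched HTML, the remaining explanations (site downtime or crawl-depth parameters) become the more likely culprits.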

    So, that is a brief answer to your question: "If it is happening here, should I be concerned about other missed links?"

    But in general I would say there is going to be measurement error in any network study, and if you have enough nodes then missed links between a number of seeds should not qualitatively affect your analytical results.

    Rob

    Jan 16, 2013
  • anon

    Is it possible to have VOSON recrawl specific sub-directories from websites that were previously crawled?

    For instance, from the example above, if I wanted to have VOSON re-crawl just the subdirectory (http://www.emmagoldman.com/links/NationalFeministOrganizations.htm) from the original site, would I simply add this specific sub-directory as a seed site for crawling, "preserve" its structure, and then "page-group" it back into the original after the crawl?

    If this is possible, is there a best order of operations to consider here?

    Thanks,

    M

    P.S. Just as an aside, out of 600 org websites, I have identified about 80 in which VOSON missed the outbound links between seed organizations. Would you recommend re-running the crawl in total (I'm only interested in the network among seed orgs) - or just crawling the sub-directories that contain the missing links for each of these 80 websites and then page-grouping as I describe above?

    Mar 13, 2013
  • anon

    Hi,

    We've been having this conversation via email, so I'll take this opportunity to summarise the points I have already made to you, for anyone else who may be interested.

    There are several reasons why the VOSON crawler may not pick up outbound links from any given seed site:

    - the site was not operating when the crawler visited

    - the site is preventing crawler access via the robots.txt protocol (the VOSON crawler obeys robots.txt)

    - a broken URL (I assume that would be checked by the user)

    - the site uses, e.g., JavaScript or Flash graphics that make it impossible for the crawler to enter or move around the site
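    The last case can sometimes be spotted without re-crawling. A rough sketch (assuming Python; the page content is a hypothetical example, and this is only a heuristic, not how VOSON itself classifies pages): a page that contains script tags but no plain anchor links probably builds its navigation in JavaScript, which a simple crawler cannot follow.

```python
from html.parser import HTMLParser

class CrawlabilityCheck(HTMLParser):
    """Heuristic: count <a href> anchors and <script> tags.
    Scripts but no anchors suggests JavaScript-built navigation."""
    def __init__(self):
        super().__init__()
        self.anchors = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(n == "href" for n, _ in attrs):
            self.anchors += 1
        elif tag == "script":
            self.scripts += 1

# Hypothetical page whose menu is rendered entirely in JavaScript.
html = "<html><body><script>buildMenu()</script></body></html>"
check = CrawlabilityCheck()
check.feed(html)
print(check.anchors == 0 and check.scripts > 0)  # True: likely crawler-opaque
```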

    Now, I unfortunately don't have time to check the specifics of the URLs you mention above, but I recommend (again) the following:

    - if your analysis is such that measurement error for a few sites in the network is going to make a big impact on your results, then you are possibly using VOSON for a purpose other than the one it was intended for. It is intended for relatively large-scale quantitative analysis, where data-quality issues for a small number of nodes should not impact the results (or if they do, one can only question the robustness of the statistical model).

    - if you are doing more qualitative research and it is going to be a huge problem for some websites to have outdegree = 0 when you know they do have outbound links, then I recommend that you: (a) look at the websites of those sites with outdegree = 0; (b) manually compile a list of outbound links; (c) add these links to, e.g., the Pajek export data file, so that you can then analyse your data in third-party software with confidence that the links you know are there are included.

    - you might like to shop around and try another web crawler (or build your own?), e.g. Issuecrawler or SocSciBot. They might provide you with better crawl data.
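    Step (c) above can be sketched as follows (assuming Python; the vertex IDs, labels, and the single manually found edge are hypothetical). A Pajek .net file lists vertices under *Vertices and directed links under *Arcs, so appending hand-checked links is just a matter of writing extra arc lines:

```python
# Hypothetical vertex table: ID -> seed-site label.
vertices = {1: "http://www.emmagoldman.com/", 2: "http://www.now.org/"}

# Manually verified link: emmagoldman.com -> now.org.
manual_edges = [(1, 2)]

lines = ["*Vertices {}".format(len(vertices))]
for vid in sorted(vertices):
    lines.append('{} "{}"'.format(vid, vertices[vid]))
lines.append("*Arcs")  # directed links go under *Arcs in Pajek
for src, dst in manual_edges:
    lines.append("{} {}".format(src, dst))

net = "\n".join(lines)
print(net)
```

    The resulting text can be saved as a .net file (or merged into the VOSON export) and opened in Pajek or any tool that reads the format.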

    Regards,

    Rob

    Mar 14, 2013