Voson Team.

I am attempting to merge the Voson "SeedsAN" dataset with my master dataset/spreadsheet on my work computer in order to analyze the network across the many other variables I have coded. I ran into a challenge with the PageGroups produced in the Voson "SeedAn" database. I'm pleased with the PageGroup function and don't want to change it; I just have a few questions about the process and how to work with it, which I list below.

i. If it is the case that Voson collapses Seeds that share the same root URL into PageGroups, how can I see which seeds have been collapsed if only the lead URL is shown? Is there a way to generate a list of which seeds have been collapsed into which PageGroups so that I can code my master dataset on my home computer to match the URL changes? Or do I have to use the DataBrowser and sort through these 10 items at a time?

ii. Somewhat related to the first question, but not exactly - I submitted 632 URLs for the initial crawl. The "SeedAN" database has 620 rows, which means some were either excluded or collapsed into PageGroups. How do I identify the 12 excluded/collapsed URLs?

iii. Also, I noticed that there are 626 PageGroup IDs. How is that possible with only 620 rows in the "SeedAN" dataset?

iv. When creating new PageGroups, if I want to further collapse additional URLs together in the "SeedAN" dataset, but some of these have already been automatically collapsed by the Voson PageGroup function (when creating the "SeedAN" dataset), do I list the *original* ungrouped URLs or do I use the lead PageGroup URLs - or does it matter?

Thank you.

MDB

  • anon

    Thanks for these good questions, and here are some answers:

    > i. If it is the case that Voson collapses Seeds that share the same root URL into PageGroups, how can I see which seeds have been collapsed if only the lead URL is shown? Is there a way to generate a list of which seeds have been collapsed into which PageGroups so that I can code my master dataset on my home computer to match the URL changes? Or do I have to use the DataBrowser and sort through these 10 items at a time?

    Currently, the only way to see which pages are in which pagegroups (other than via the DataBrowser) is to download the voson database as a comma-separated values (CSV) file and then look at it in Excel. You can do this by loading the voson database and then selecting Data->Download->CSV (text or zip, depending on the size of your database).

    > ii. Somewhat related to the first question, but not exactly - I submitted 632 URLs for the initial crawl. The "SeedAN" database has 620 rows, which means some were either excluded or collapsed into PageGroups. How do I identify the 12 excluded/collapsed URLs?

    You are correct that this must have happened because some of the seed URLs belong to the same pagegroup, i.e. share the same hostname. There are a couple of ways of finding these cases. If you download the voson database as CSV, open it in Excel (or another spreadsheet) and sort on both id number and URL, you should be able to see the seed URLs that are in the same pagegroup next to one another. You can also find these URLs via the DataBrowser, but it may take you a bit longer.

    As agreed in our email exchange, I've been looking at your data. I opened the main voson-analysis database and found via the DataBrowser that the pagegroup with id=23 was missing. I then opened the voson database, looked at the page with id=23 and saw that the id_pagegroup for this page is 251, not 23. id_pagegroup is the id of the page that acts as the "head" or representative of the pagegroup, and it is the one that is shown in the voson-analysis database.

    Anyway, to cut a long story short: the pagegroup with id 23 is missing from the voson-analysis database because you have two seed URLs in the same pagegroup: URL id=23 and URL id=251. I'm sure the other "missing" seed URLs from the Seeds voson-analysis database have been excluded for the same reason, but let me know if this is not the case. BTW, if you want to make sure two URLs sharing the same hostname are not collapsed into one pagegroup, you can do this via the node pre-processing feature (preserving).
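
    The CSV check described above can also be scripted rather than done by eye in a spreadsheet. The following is only a minimal sketch: the column names "id" and "url" are assumptions (only id_pagegroup is named in this thread), and the sample rows are made up, so check the header row of your own download and adjust the names. Note that the voson database lists all pages, not just seeds, so you may need to filter to your seed URLs first.

```python
import csv
import io
from collections import defaultdict

def collapsed_pages(csv_text):
    """Map each pagegroup head id to the pages collapsed into it.

    Assumes columns named 'id', 'url' and 'id_pagegroup' (hypothetical
    header names; rename to match the real export).
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["id_pagegroup"]].append((row["id"], row["url"]))
    # Keep only pagegroups that hold more than one page.
    return {head: pages for head, pages in groups.items() if len(pages) > 1}

# Made-up rows for illustration only, not real VOSON output.
sample = """id,url,id_pagegroup
23,http://example.org/a/,251
251,http://www.example.org/,251
300,http://example.net/,300
"""

for head, pages in collapsed_pages(sample).items():
    print("pagegroup head", head, "contains", pages)
```

    With a real file you would pass its contents (e.g. read with open(...).read()) instead of the sample string; the resulting mapping can then be used to recode URLs in a master spreadsheet.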

    > iii. Also, I noticed that there are 626 PageGroup IDs. How is that possible with only 620 rows in the "SeedAN" dataset?

    See above.

    > iv. When creating new PageGroups, if I want to further collapse additional URLs together in the "SeedAN" dataset, but some of these have already been automatically collapsed by the Voson PageGroup function (when creating the "SeedAN" dataset), do I list the *original* ungrouped URLs or do I use the lead PageGroup URLs - or does it matter?

    Good question. You should use the original URLs as they appear in the voson database for any pagegrouping. If you set up your own pagegrouping, it will override the default (which is to put all pages with the same hostname in the same pagegroup). You definitely need to work from what is in the voson database, not the voson-analysis database. In the next major release of VOSON (hopefully mid 2013) we are going to make this process simpler.

    Dec 13, 2012
  • anon

    This question has to do with how to best PRESERVE a URL with a sub-directory that Voson both collapsed (PageGroup) into the root directory *and* added a "www" to it.

    I understand that Voson adds a "www" to URLs when that version of the site is referenced in a crawl. And I'm fine with that. I just need to understand the implications of this for working with Preserving/PageGrouping.

    So, for example, I submitted the following for initial crawl:

    "http://rac.org/advocacy/rjv/"

    Voson both added the "www" and reduced it to the root directory, converting the original URL to:

    "http://www.rac.org/"

    I want to Preserve the full directory because it designates a specific organization within the larger website. If I specifically preserve the original ("http://rac.org/advocacy/rjv/") will Voson skip over any links to a "www" version of this website, should it exist? For instance, if "http://www.rac.org/advocacy/rjv/" exists and is linked, will Voson pick it up even when I force Voson to Preserve the former version with no www? (I want both if possible.)

    Thanks,

    M

    Dec 21, 2012
  • anon

    You are correct that VOSON will automatically add "www." to a pagegroup, but it will only do this if there exists at least one page in the database from the "www." subdomain.

    Thus if your voson database had the following two pages:

    "http://rac.org/advocacy/rjv/"

    "http://www.rac.org/advocacy/rjv/"

    Then the default pagegroup would be "www.rac.org". So we treat the "www." version of the hostname as the canonical version, but only if a page with "www." exists in the database.

    So, if only the first page was in the database, then the pagegroup would be "rac.org".
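
    To make the rule concrete, here is a rough sketch of the default behaviour as described above: group by hostname, and use the "www." form only when at least one page in the database has it. This is an illustration of the rule, not VOSON's actual code, and the function name default_pagegroups is made up for the example.

```python
from urllib.parse import urlsplit

def default_pagegroups(urls):
    """Return {url: pagegroup label} under the rule described above.

    Hypothetical helper: the 'www.' hostname is treated as canonical
    only if some page in `urls` actually uses it.
    """
    hosts = [urlsplit(u).hostname for u in urls]
    seen = set(hosts)
    result = {}
    for url, host in zip(urls, hosts):
        # Strip a leading 'www.' to get the bare hostname.
        bare = host[4:] if host.startswith("www.") else host
        www = "www." + bare
        # Prefer the 'www.' form only when a page with it exists.
        result[url] = www if www in seen else bare
    return result

both = ["http://rac.org/advocacy/rjv/", "http://www.rac.org/advocacy/rjv/"]
only_bare = ["http://rac.org/advocacy/rjv/"]

print(default_pagegroups(both))
print(default_pagegroups(only_bare))
```

    In the first call both pages map to "www.rac.org"; in the second the single page maps to "rac.org".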

    Note that a limitation here is that we are relying on the fact that people who create links on their sites are not creating broken links. That is, if on a seed site (not rac.org) there was a link to "http://www.rac.org/advocacy/rjv/", then this page goes into the voson database (if it's not already there). But if "http://www.rac.org/advocacy/rjv/" doesn't in fact exist (i.e. it is a broken link), then VOSON won't know this.

    The second part of your question is a bit complicated to answer because what you are asking about is the interaction between pagegrouping and preserving.

    If I were you, I would preserve "www.rac.org/advocacy/rjv", not "rac.org/advocacy/rjv", but then check to ensure that all the pages you expect to be in that pagegroup are there. So just say there were the following two pages in the voson database:

    "http://www.rac.org/advocacy/rjv/page1.html"

    "http://rac.org/advocacy/rjv/page2.html"

    I would expect both of these to be in the pagegroup "www.rac.org/advocacy/rjv" if that is what you have preserved. Let me know if that isn't the case.

    In creating the voson-analysis database, VOSON then simply ascribes any links to/from the pages in the pagegroup to the pagegroup itself. So links to/from "http://rac.org/advocacy/rjv/page2.html" would be shown as links to/from "www.rac.org/advocacy/rjv" (but note, they would be links to pagegroups, not pages).
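
    The behaviour I would expect from preserving, and the way page links roll up to pagegroup links, can be sketched as follows. This is only a model of the expectation described above, not VOSON's implementation: the helper names are hypothetical, "seed.example" is a made-up seed site, matching is done by ignoring a leading "www." and comparing path prefixes, and the fallback to the plain hostname skips the "www." canonicalisation for brevity.

```python
from urllib.parse import urlsplit

def strip_www(host):
    return host[4:] if host.startswith("www.") else host

def pagegroup_for(url, preserved):
    """Return the preserved label whose host/path prefix covers `url`,
    else the page's own hostname (simplified default)."""
    parts = urlsplit(url)
    host = strip_www(parts.hostname)
    path = parts.path or "/"
    for label in preserved:
        p_host, _, p_path = label.partition("/")
        p_path = "/" + p_path.strip("/")
        prefix = p_path if p_path.endswith("/") else p_path + "/"
        if strip_www(p_host) == host and (path == p_path or path.startswith(prefix)):
            return label
    return parts.hostname

def collapse_links(links, preserved):
    """Turn page-to-page links into a set of pagegroup-to-pagegroup links."""
    return {(pagegroup_for(s, preserved), pagegroup_for(t, preserved))
            for s, t in links}

preserved = ["www.rac.org/advocacy/rjv"]
links = [
    ("http://seed.example/index.html", "http://rac.org/advocacy/rjv/page2.html"),
    ("http://seed.example/about.html", "http://www.rac.org/advocacy/rjv/page1.html"),
]
print(collapse_links(links, preserved))
```

    Both page-level links end up as a single link from "seed.example" to "www.rac.org/advocacy/rjv", which is the kind of check worth running against your own database after preserving.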

    Hope this helps.

    Rob

    Dec 22, 2012